The document describes a tutorial on generator tricks for systems programmers given by David Beazley. It introduces generators and how they can be used to process large data files like web server logs more efficiently by generating values one at a time instead of building large lists. Specifically, it presents a problem of summing the last column of a web log to find the total bytes transferred, and suggests using a generator to process the log line by line due to potentially large file sizes.
Generator Tricks for Systems Programmers, v2.0
1. Generator Tricks
For Systems Programmers
(Version 2.0)
David Beazley
http://www.dabeaz.com
Presented at PyCon UK 2008
Copyright (C) 2008, David Beazley, http://www.dabeaz.com
2. Introduction 2.0
• At PyCon'2008 (Chicago), I gave this tutorial
on generators that you are about to see
• About 80 people attended
• Afterwards, there were more than 15000
downloads of my presentation slides
• Clearly there was some interest
• This is a revised version...
3. An Introduction
• Generators are cool!
• But what are they?
• And what are they good for?
• That's what this tutorial is about
4. About Me
• I'm a long-time Pythonista
• First started using Python with version 1.3
• Author : Python Essential Reference
• Responsible for a number of open source
Python-related packages (Swig, PLY, etc.)
5. My Story
My addiction to generators started innocently
enough. I was just a happy Python
programmer working away in my secret lair
when I got "the call." A call to sort through
1.5 Terabytes of C++ source code (~800
weekly snapshots of a million line application).
That's when I discovered the os.walk()
function. This was interesting I thought...
6. Back Story
• I started using generators on a more day-to-
day basis and found them to be wicked cool
• One of Python's most powerful features!
• Yet, they still seem rather exotic
• In my experience, most Python programmers
view them as kind of a weird fringe feature
• This is unfortunate.
7. Python Books Suck!
• Let's take a look at ... my book
(a motivating example)
• Whoa! Counting.... that's so useful.
• Almost as useful as sequences of Fibonacci
numbers, random numbers, and squares
8. Our Mission
• Some more practical uses of generators
• Focus is "systems programming"
• Which loosely includes files, file systems,
parsing, networking, threads, etc.
• My goal : To provide some more compelling
examples of using generators
• I'd like to plant some seeds and inspire tool
builders
9. Support Files
• Files used in this tutorial are available here:
http://www.dabeaz.com/generators-uk/
• Go there to follow along with the examples
10. Disclaimer
• This isn't meant to be an exhaustive tutorial
on generators and related theory
• Will be looking at a series of examples
• I don't know if the code I've written is the
"best" way to solve any of these problems.
• Let's have a discussion
11. Performance Disclosure
• There are some later performance numbers
• Python 2.5.1 on OS X 10.4.11
• All tests were conducted on the following:
• Mac Pro 2x2.66 Ghz Dual-Core Xeon
• 3 Gbytes RAM
• WDC WD2500JS-41SGB0 Disk (250G)
• Timings are 3-run average of 'time' command
12. Part I
Introduction to Iterators and Generators
13. Iteration
• As you know, Python has a "for" statement
• You use it to loop over a collection of items
>>> for x in [1,4,5,10]:
... print x,
...
1 4 5 10
>>>
• And, as you have probably noticed, you can
iterate over many different kinds of objects
(not just lists)
14. Iterating over a Dict
• If you loop over a dictionary you get keys
>>> prices = { 'GOOG' : 490.10,
... 'AAPL' : 145.23,
... 'YHOO' : 21.71 }
...
>>> for key in prices:
... print key
...
YHOO
GOOG
AAPL
>>>
15. Iterating over a String
• If you loop over a string, you get characters
>>> s = "Yow!"
>>> for c in s:
... print c
...
Y
o
w
!
>>>
16. Iterating over a File
• If you loop over a file you get lines
>>> for line in open("real.txt"):
... print line,
...
Real Programmers write in FORTRAN
Maybe they do now,
in this decadent era of
Lite beer, hand calculators, and "user-friendly" software
but back in the Good Old Days,
when the term "software" sounded funny
and Real Computers were made out of drums and vacuum tubes,
Real Programmers wrote in machine code.
Not FORTRAN. Not RATFOR. Not, even, assembly language.
Machine Code.
Raw, unadorned, inscrutable hexadecimal numbers.
Directly.
17. Consuming Iterables
• Many functions consume an "iterable" object
• Reductions:
sum(s), min(s), max(s)
• Constructors
list(s), tuple(s), set(s), dict(s)
• in operator
item in s
• Many others in the library
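• For instance (an added illustration, not in the original deck), all of these happily consume a list or a generator:
>>> nums = [3, 1, 4, 1, 5]
>>> sum(nums), min(nums), max(nums)
(14, 1, 5)
>>> 4 in nums
True
>>> sum(x*x for x in nums)
52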
18. Iteration Protocol
• The reason why you can iterate over different
objects is that there is a specific protocol
>>> items = [1, 4, 5]
>>> it = iter(items)
>>> it.next()
1
>>> it.next()
4
>>> it.next()
5
>>> it.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
>>>
19. Iteration Protocol
• An inside look at the for statement
for x in obj:
# statements
• Underneath the covers
_iter = iter(obj)          # Get iterator object
while 1:
    try:
        x = _iter.next()   # Get next item
    except StopIteration:  # No more items
        break
    # statements
    ...
• Any object that supports iter() and next() is
said to be "iterable."
20. Supporting Iteration
• User-defined objects can support iteration
• Example: Counting down...
>>> for x in countdown(10):
... print x,
...
10 9 8 7 6 5 4 3 2 1
>>>
• To do this, you just have to make the object
implement __iter__() and next()
21. Supporting Iteration
• Sample implementation
class countdown(object):
    def __init__(self,start):
        self.count = start
    def __iter__(self):
        return self
    def next(self):
        if self.count <= 0:
            raise StopIteration
        r = self.count
        self.count -= 1
        return r
22. Iteration Example
• Example use:
>>> c = countdown(5)
>>> for i in c:
... print i,
...
5 4 3 2 1
>>>
23. Iteration Commentary
• There are many subtle details involving the
design of iterators for various objects
• However, we're not going to cover that
• This isn't a tutorial on "iterators"
• We're talking about generators...
24. Generators
• A generator is a function that produces a
sequence of results instead of a single value
def countdown(n):
    while n > 0:
        yield n
        n -= 1
>>> for i in countdown(5):
... print i,
...
5 4 3 2 1
>>>
• Instead of returning a value, you generate a
series of values (using the yield statement)
25. Generators
• Behavior is quite different than a normal function
• Calling a generator function creates a generator
object. However, it does not start running the
function.
def countdown(n):
    print "Counting down from", n
    while n > 0:
        yield n
        n -= 1

>>> x = countdown(10)     # notice that no output was produced
>>> x
<generator object at 0x58490>
>>>
26. Generator Functions
• The function only executes on next()
>>> x = countdown(10)
>>> x
<generator object at 0x58490>
>>> x.next()              # the function starts executing here
Counting down from 10
10
>>>
• yield produces a value, but suspends the function
• Function resumes on next call to next()
>>> x.next()
9
>>> x.next()
8
>>>
27. Generator Functions
• When the generator returns, iteration stops
>>> x.next()
1
>>> x.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
StopIteration
>>>
28. Generator Functions
• A generator function is mainly a more
convenient way of writing an iterator
• You don't have to worry about the iterator
protocol (.next, .__iter__, etc.)
• It just works
29. Generators vs. Iterators
• A generator function is slightly different
than an object that supports iteration
• A generator is a one-time operation. You
can iterate over the generated data once,
but if you want to do it again, you have to
call the generator function again.
• This is different than a list (which you can
iterate over as many times as you want)
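• A quick demonstration of the one-time behavior (an added illustration):
>>> sq = (x*x for x in [1, 2, 3])
>>> list(sq)
[1, 4, 9]
>>> list(sq)
[]
>>>
The second list() gets nothing: the generator is already exhausted.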
30. Generator Expressions
• A generated version of a list comprehension
>>> a = [1,2,3,4]
>>> b = (2*x for x in a)
>>> b
<generator object at 0x58760>
>>> for i in b: print i,
...
2 4 6 8
>>>
• This loops over a sequence of items and applies
an operation to each item
• However, results are produced one at a time
using a generator
31. Generator Expressions
• Important differences from a list comp.
• Does not construct a list.
• Only useful purpose is iteration
• Once consumed, can't be reused
• Example:
>>> a = [1,2,3,4]
>>> b = [2*x for x in a]
>>> b
[2, 4, 6, 8]
>>> c = (2*x for x in a)
>>> c
<generator object at 0x58760>
>>>
32. Generator Expressions
• General syntax
(expression for i in s if condition)
• What it means
for i in s:
    if condition:
        yield expression
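• As a concrete instance (an added example), this expression:
sq_even = (x*x for x in range(10) if x % 2 == 0)
is shorthand for a generator function along these lines:
def sq_even_gen():
    for x in range(10):
        if x % 2 == 0:
            yield x*x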
33. A Note on Syntax
• The parens on a generator expression can be
dropped if used as a single function argument
• Example:
sum(x*x for x in s)
Generator expression
34. Interlude
• We now have two basic building blocks
• Generator functions:
def countdown(n):
    while n > 0:
        yield n
        n -= 1
• Generator expressions
squares = (x*x for x in s)
• In both cases, we get an object that
generates values (which are typically
consumed in a for loop)
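• The two building blocks compose naturally. For example (an added illustration), a generator function feeding a generator expression feeding a reduction:
def countdown(n):
    while n > 0:
        yield n
        n -= 1

total = sum(x*x for x in countdown(5))   # 25+16+9+4+1 = 55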
35. Part 2
Processing Data Files
(Show me your Web Server Logs)
36. Programming Problem
Find out how many bytes of data were
transferred by summing up the last column
of data in this Apache web server log
81.107.39.38 - ... "GET /ply/ HTTP/1.1" 200 7587
81.107.39.38 - ... "GET /favicon.ico HTTP/1.1" 404 133
81.107.39.38 - ... "GET /ply/bookplug.gif HTTP/1.1" 200 23903
81.107.39.38 - ... "GET /ply/ply.html HTTP/1.1" 200 97238
81.107.39.38 - ... "GET /ply/example.html HTTP/1.1" 200 2359
66.249.72.134 - ... "GET /index.html HTTP/1.1" 200 4447
Oh yeah, and the log file might be huge (Gbytes)
37. The Log File
• Each line of the log looks like this:
81.107.39.38 - ... "GET /ply/ply.html HTTP/1.1" 200 97238
• The number of bytes is the last column
bytestr = line.rsplit(None,1)[1]
• It's either a number or a missing value (-)
81.107.39.38 - ... "GET /ply/ HTTP/1.1" 304 -
• Converting the value
if bytestr != '-':
    bytes = int(bytestr)
38. A Non-Generator Soln
• Just do a simple for-loop
wwwlog = open("access-log")
total = 0
for line in wwwlog:
    bytestr = line.rsplit(None,1)[1]
    if bytestr != '-':
        total += int(bytestr)
print "Total", total
• We read line-by-line and just update a sum
• However, that's so 90s...
39. A Generator Solution
• Let's use some generator expressions
wwwlog = open("access-log")
bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
bytes = (int(x) for x in bytecolumn if x != '-')
print "Total", sum(bytes)
• Whoa! That's different!
• Less code
• A completely different programming style
40. Generators as a Pipeline
• To understand the solution, think of it as a data
processing pipeline
access-log -> wwwlog -> bytecolumn -> bytes -> sum() -> total
• Each step is defined by iteration/generation
wwwlog = open("access-log")
bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
bytes = (int(x) for x in bytecolumn if x != '-')
print "Total", sum(bytes)
41. Being Declarative
• At each step of the pipeline, we declare an
operation that will be applied to the entire
input stream
access-log -> wwwlog -> bytecolumn -> bytes -> sum() -> total
bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
This operation gets applied to every line of the log file.
42. Being Declarative
• Instead of focusing on the problem at a
line-by-line level, you just break it down
into big operations that operate on the
whole file
• This is very much a "declarative" style
• The key : Think big...
43. Iteration is the Glue
• The glue that holds the pipeline together is the
iteration that occurs in each step
wwwlog = open("access-log")
bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
bytes = (int(x) for x in bytecolumn if x != '-')
print "Total", sum(bytes)
• The calculation is being driven by the last step
• The sum() function is consuming values being
pulled through the pipeline (via .next() calls)
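• To see the "pull" in action, here is a miniature version of the pipeline driven by hand (an added illustration; Python 2 iterator API):
>>> bytecolumn = iter(['7587', '-', '23903'])
>>> bytes = (int(x) for x in bytecolumn if x != '-')
>>> bytes.next()     # each next() pulls one value through the pipeline
7587
>>> bytes.next()     # the '-' entry is silently filtered along the way
23903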
44. Performance
• Surely, this generator approach has all
sorts of fancy-dancy magic that is slow.
• Let's check it out on a 1.3GB log file...
% ls -l big-access-log
-rw-r--r-- beazley 1303238000 Feb 29 08:06 big-access-log
45. Performance Contest
wwwlog = open("big-access-log")
total = 0
for line in wwwlog:
    bytestr = line.rsplit(None,1)[1]
    if bytestr != '-':
        total += int(bytestr)
print "Total", total

Time: 27.20

wwwlog = open("big-access-log")
bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
bytes = (int(x) for x in bytecolumn if x != '-')
print "Total", sum(bytes)

Time: 25.96
46. Commentary
• Not only was it not slow, it was 5% faster
• And it was less code
• And it was relatively easy to read
• And frankly, I like it a whole lot better...
"Back in the old days, we used AWK for this and
we liked it. Oh, yeah, and get off my lawn!"
47. Performance Contest
wwwlog = open("big-access-log")
bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
bytes = (int(x) for x in bytecolumn if x != '-')
print "Total", sum(bytes)

Time: 25.96

% awk '{ total += $NF } END { print total }' big-access-log

Time: 37.33
Note: extracting the last column might not be awk's strong point.
48. Food for Thought
• At no point in our generator solution did
we ever create large temporary lists
• Thus, not only is that solution faster, it can
be applied to enormous data files
• It's competitive with traditional tools
49. More Thoughts
• The generator solution was based on the
concept of pipelining data between
different components
• What if you had more advanced kinds of
components to work with?
• Perhaps you could perform different kinds
of processing by just plugging various
pipeline components together
50. This Sounds Familiar
• The Unix philosophy
• Have a collection of useful system utils
• Can hook these up to files or each other
• Perform complex tasks by piping data
51. Part 3
Fun with Files and Directories
52. Programming Problem
You have hundreds of web server logs scattered
across various directories. In addition, some of
the logs are compressed. Modify the last program
so that you can easily read all of these logs
foo/
access-log-012007.gz
access-log-022007.gz
access-log-032007.gz
...
access-log-012008
bar/
access-log-092007.bz2
...
access-log-022008
53. os.walk()
• A very useful function for searching the
file system
import os
for path, dirlist, filelist in os.walk(topdir):
    # path : Current directory
    # dirlist : List of subdirectories
    # filelist : List of files
    ...
• This utilizes generators to recursively walk
through the file system
54. find
• Generate all filenames in a directory tree
that match a given filename pattern
import os
import fnmatch
def gen_find(filepat,top):
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist,filepat):
            yield os.path.join(path,name)
• Examples
pyfiles = gen_find("*.py","/")
logs = gen_find("access-log*","/usr/www/")
55. Performance Contest
pyfiles = gen_find("*.py","/")
for name in pyfiles:
    print name

Wall Clock Time: 559s

% find / -name '*.py'

Wall Clock Time: 468s

Performed on a 750GB file system containing about 140000 .py files.
56. A File Opener
• Open a sequence of filenames
import gzip, bz2
def gen_open(filenames):
    for name in filenames:
        if name.endswith(".gz"):
            yield gzip.open(name)
        elif name.endswith(".bz2"):
            yield bz2.BZ2File(name)
        else:
            yield open(name)
• This is interesting.... it takes a sequence of
filenames as input and yields a sequence of open
file objects
57. cat
• Concatenate items from one or more
source into a single sequence of items
def gen_cat(sources):
    for s in sources:
        for item in s:
            yield item
• Example:
lognames = gen_find("access-log*", "/usr/www")
logfiles = gen_open(lognames)
loglines = gen_cat(logfiles)
58. grep
• Generate a sequence of lines that contain
a given regular expression
import re
def gen_grep(pat, lines):
    patc = re.compile(pat)
    for line in lines:
        if patc.search(line):
            yield line
• Example:
lognames = gen_find("access-log*", "/usr/www")
logfiles = gen_open(lognames)
loglines = gen_cat(logfiles)
patlines = gen_grep(pat, loglines)
59. Example
• Find out how many bytes transferred for a
specific pattern in a whole directory of logs
pat = r"somepattern"
logdir = "/some/dir/"
filenames = gen_find("access-log*",logdir)
logfiles = gen_open(filenames)
loglines = gen_cat(logfiles)
patlines = gen_grep(pat,loglines)
bytecolumn = (line.rsplit(None,1)[1] for line in patlines)
bytes = (int(x) for x in bytecolumn if x != '-')
print "Total", sum(bytes)
60. Important Concept
• Generators decouple iteration from the
code that uses the results of the iteration
• In the last example, we're performing a
calculation on a sequence of lines
• It doesn't matter where or how those
lines are generated
• Thus, we can plug any number of
components together up front as long as
they eventually produce a line sequence
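• For example (an added sketch), the same two tail stages work unchanged when fed from a plain in-memory list instead of a directory of files:
sample = [
    '81.107.39.38 - ... "GET /ply/ HTTP/1.1" 200 7587',
    '81.107.39.38 - ... "GET /ply/ HTTP/1.1" 304 -',
]
bytecolumn = (line.rsplit(None,1)[1] for line in sample)
bytes = (int(x) for x in bytecolumn if x != '-')
print "Total", sum(bytes)        # -> Total 7587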
61. Part 4
Parsing and Processing Data
62. Programming Problem
Web server logs consist of different columns of
data. Parse each line into a useful data structure
that allows us to easily inspect the different fields.
81.107.39.38 - - [24/Feb/2008:00:08:59 -0600] "GET ..." 200 7587
host referrer user [datetime] "request" status bytes
63. Parsing with Regex
• Let's route the lines through a regex parser
logpats = r'(\S+) (\S+) (\S+) \[(.*?)\] ' \
          r'"(\S+) (\S+) (\S+)" (\S+) (\S+)'
logpat = re.compile(logpats)
groups = (logpat.match(line) for line in loglines)
tuples = (g.groups() for g in groups if g)
• This generates a sequence of tuples
('71.201.176.194', '-', '-', '26/Feb/2008:10:30:08 -0600',
'GET', '/ply/ply.html', 'HTTP/1.1', '200', '97238')
64. Tuple Commentary
• I generally don't like data processing on tuples
('71.201.176.194', '-', '-', '26/Feb/2008:10:30:08 -0600',
'GET', '/ply/ply.html', 'HTTP/1.1', '200', '97238')
• First, they are immutable--so you can't modify
• Second, to extract specific fields, you have to
remember the column number--which is
annoying if there are a lot of columns
• Third, existing code breaks if you change the
number of fields
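• The column-number problem in one line (an added illustration):
t = ('71.201.176.194', '-', '-', '26/Feb/2008:10:30:08 -0600',
     'GET', '/ply/ply.html', 'HTTP/1.1', '200', '97238')
bytes_sent = int(t[8])    # you have to remember that column 8 is "bytes"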
65. Tuples to Dictionaries
• Let's turn tuples into dictionaries
colnames = ('host','referrer','user','datetime',
'method','request','proto','status','bytes')
log = (dict(zip(colnames,t)) for t in tuples)
• This generates a sequence of named fields
{ 'status' : '200',
'proto' : 'HTTP/1.1',
'referrer': '-',
'request' : '/ply/ply.html',
'bytes' : '97238',
'datetime': '24/Feb/2008:00:08:59 -0600',
'host' : '140.180.132.213',
'user' : '-',
'method' : 'GET'}
66. Field Conversion
• You might want to map specific dictionary fields
through a conversion function (e.g., int(), float())
def field_map(dictseq,name,func):
    for d in dictseq:
        d[name] = func(d[name])
        yield d
• Example: Convert a few field values
log = field_map(log,"status",int)
log = field_map(log,"bytes",
                lambda s: int(s) if s != '-' else 0)
67. Field Conversion
• Creates dictionaries of converted values
{ 'status' : 200,
  'proto' : 'HTTP/1.1',
  'referrer': '-',
  'request' : '/ply/ply.html',
  'datetime': '24/Feb/2008:00:08:59 -0600',
  'bytes' : 97238,
  'host' : '140.180.132.213',
  'user' : '-',
  'method' : 'GET'}
Note the conversion of the 'status' and 'bytes' values.
• Again, this is just one big processing pipeline
68. The Code So Far
lognames = gen_find("access-log*","www")
logfiles = gen_open(lognames)
loglines = gen_cat(logfiles)
groups = (logpat.match(line) for line in loglines)
tuples = (g.groups() for g in groups if g)
colnames = ('host','referrer','user','datetime','method',
'request','proto','status','bytes')
log = (dict(zip(colnames,t)) for t in tuples)
log = field_map(log,"bytes",
lambda s: int(s) if s != '-' else 0)
log = field_map(log,"status",int)
69. Getting Organized
• As a processing pipeline grows, certain parts of it
may be useful components on their own
  - generate lines from a set of files in a directory
  - parse a sequence of lines from Apache server logs into a sequence of dictionaries
• A series of pipeline stages can be easily
encapsulated by a normal Python function
70. Packaging
• Example : multiple pipeline stages inside a function
def lines_from_dir(filepat, dirname):
    names = gen_find(filepat,dirname)
    files = gen_open(names)
    lines = gen_cat(files)
    return lines
• This is now a general purpose component that can
be used as a single element in other pipelines
71. Packaging
• Example : Parse an Apache log into dicts
def apache_log(lines):
    groups = (logpat.match(line) for line in lines)
    tuples = (g.groups() for g in groups if g)
    colnames = ('host','referrer','user','datetime','method',
                'request','proto','status','bytes')
    log = (dict(zip(colnames,t)) for t in tuples)
    log = field_map(log,"bytes",
                    lambda s: int(s) if s != '-' else 0)
    log = field_map(log,"status",int)
    return log
72. Example Use
• It's easy
lines = lines_from_dir("access-log*","www")
log = apache_log(lines)
for r in log:
    print r
• Different components have been subdivided
according to the data that they process
73. Food for Thought
• When creating pipeline components, it's
critical to focus on the inputs and outputs
• You will get the most flexibility when you
use a standard set of datatypes
• Is it simpler to have a bunch of components
that all operate on dictionaries or to have
components that require inputs/outputs to
be different kinds of user-defined instances?
74. A Query Language
• Now that we have our log, let's do some queries
• Find the set of all documents that 404
stat404 = set(r['request'] for r in log
if r['status'] == 404)
• Print all requests that transfer over a megabyte
large = (r for r in log if r['bytes'] > 1000000)
for r in large:
    print r['request'], r['bytes']
75. A Query Language
• Find the largest data transfer
print "%d %s" % max((r['bytes'],r['request'])
for r in log)
• Collect all unique host IP addresses
hosts = set(r['host'] for r in log)
• Find the number of downloads of a file
sum(1 for r in log
if r['request'] == '/ply/ply-2.3.tar.gz')
76. A Query Language
• Find out who has been hitting robots.txt
addrs = set(r['host'] for r in log
if 'robots.txt' in r['request'])
import socket
for addr in addrs:
    try:
        print socket.gethostbyaddr(addr)[0]
    except socket.herror:
        print addr
77. Performance Study
• Sadly, the last example doesn't run so fast on a
huge input file (53 minutes on the 1.3GB log)
• But, the beauty of generators is that you can plug
filters in at almost any stage
lines = lines_from_dir("big-access-log",".")
lines = (line for line in lines if 'robots.txt' in line)
log = apache_log(lines)
addrs = set(r['host'] for r in log)
...
• That version takes 93 seconds
78. Some Thoughts
• I like the idea of using generator expressions as a
pipeline query language
• You can write simple filters, extract data, etc.
• If you pass dictionaries/objects through the
pipeline, it becomes quite powerful
• Feels similar to writing SQL queries
79. Part 5
Processing Infinite Data
80. Question
• Have you ever used 'tail -f' in Unix?
% tail -f logfile
...
... lines of output ...
...
• This prints the lines written to the end of a file
• The "standard" way to watch a log file
• I used this all of the time when working on
scientific simulations ten years ago...
81. Infinite Sequences
• Tailing a log file results in an "infinite" stream
• It constantly watches the file and yields lines as
soon as new data is written
• But you don't know how much data will actually
be written (in advance)
• And log files can often be enormous
82. Tailing a File
• A Python version of 'tail -f'
import time
def follow(thefile):
    thefile.seek(0,2)          # Go to the end of the file
    while True:
        line = thefile.readline()
        if not line:
            time.sleep(0.1)    # Sleep briefly
            continue
        yield line
• Idea : Seek to the end of the file and repeatedly
try to read new lines. If new data is written to
the file, we'll pick it up.
83. Example
• Using our follow function
logfile = open("access-log")
loglines = follow(logfile)
for line in loglines:
    print line,
• This produces the same output as 'tail -f'
84. Example
• Turn the real-time log file into records
logfile = open("access-log")
loglines = follow(logfile)
log = apache_log(loglines)
• Print out all 404 requests as they happen
r404 = (r for r in log if r['status'] == 404)
for r in r404:
    print r['host'],r['datetime'],r['request']
85. Commentary
• We just plugged this new input scheme onto
the front of our processing pipeline
• Everything else still works, with one caveat:
functions that consume an entire iterable won't
terminate (min, max, sum, set, etc.)
• Nevertheless, we can easily write processing
steps that operate on an infinite data stream
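• One workaround (an added sketch, not from the original deck): write the consumer itself as a generator, so it yields partial results instead of waiting for the stream to end.
def running_total(records):
    # Works on an infinite record stream: emits a running sum
    # of the 'bytes' field instead of one final total
    total = 0
    for r in records:
        total += r['bytes']
        yield total

# Hypothetical wiring, reusing pieces defined earlier:
# log = apache_log(follow(open("access-log")))
# for t in running_total(log):
#     print "Bytes so far:", t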
86. Part 6
Feeding the Pipeline
87. Feeding Generators
• In order to feed a generator processing
pipeline, you need to have an input source
• So far, we have looked at two file-based inputs
• Reading a file
lines = open(filename)
• Tailing a file
lines = follow(open(filename))
88. A Thought
• There is no rule that says you have to
generate pipeline data from a file.
• Or that the input data has to be a string
• Or that it has to be turned into a dictionary
• Remember: All Python objects are "first-class"
• Which means that all objects are fair-game for
use in a generator pipeline
89. Generating Connections
• Generate a sequence of TCP connections
import socket
def receive_connections(addr):
    s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET,socket.SO_REUSEADDR,1)
    s.bind(addr)
    s.listen(5)
    while True:
        client = s.accept()
        yield client
• Example:
for c,a in receive_connections(("",9000)):
    c.send("Hello World\n")
    c.close()
90. Generating Messages
• Receive a sequence of UDP messages
import socket
def receive_messages(addr,maxsize):
    s = socket.socket(socket.AF_INET,socket.SOCK_DGRAM)
    s.bind(addr)
    while True:
        msg = s.recvfrom(maxsize)
        yield msg
• Example:
for msg, addr in receive_messages(("",10000),1024):
    print msg, "from", addr
91. Part 7
Extending the Pipeline
92. Multiple Processes
• Can you extend a processing pipeline across
processes and machines?
process 1 -> (pipe or socket) -> process 2
93. Pickler/Unpickler
• Turn a generated sequence into pickled objects
import pickle
def gen_pickle(source, protocol=pickle.HIGHEST_PROTOCOL):
    for item in source:
        yield pickle.dumps(item, protocol)
def gen_unpickle(infile):
    while True:
        try:
            item = pickle.load(infile)
            yield item
        except EOFError:
            return
• Now, attach these to a pipe or socket
94. Sender/Receiver
• Example: Sender
import socket
def sendto(source,addr):
    s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
    s.connect(addr)
    for pitem in gen_pickle(source):
        s.sendall(pitem)
    s.close()
• Example: Receiver
def receivefrom(addr):
    s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET,socket.SO_REUSEADDR,1)
    s.bind(addr)
    s.listen(5)
    c,a = s.accept()
    for item in gen_unpickle(c.makefile()):
        yield item
    c.close()
95. Example Use
• Example: Read log lines and parse into records
# netprod.py
lines = follow(open("access-log"))
log = apache_log(lines)
sendto(log,("",15000))
• Example: Pick up the log on another machine
# netcons.py
for r in receivefrom(("",15000)):
print r
96. Generators and Threads
• Processing pipelines sometimes come up in the
context of thread programming
• Producer/consumer problems
Thread 1 (Producer) -> Thread 2 (Consumer)
• Question: Can generator pipelines be
integrated with thread programming?
97. Multiple Threads
• For example, can a generator pipeline span
multiple threads?
Thread 1 -> ? -> Thread 2
• Yes, if you connect them with a Queue object
98. Generators and Queues
• Feed a generated sequence into a queue
# genqueue.py
def sendto_queue(source, thequeue):
    for item in source:
        thequeue.put(item)
    thequeue.put(StopIteration)
• Generate items received on a queue
def genfrom_queue(thequeue):
    while True:
        item = thequeue.get()
        if item is StopIteration: break
        yield item
• Note: Using StopIteration as a sentinel
99. Thread Example
• Here is a consumer function
# A consumer. Prints out 404 records.
def print_r404(log_q):
    log = genfrom_queue(log_q)
    r404 = (r for r in log if r['status'] == 404)
    for r in r404:
        print r['host'],r['datetime'],r['request']
• This function will be launched in its own thread
• Using a Queue object as the input source
100. Thread Example
• Launching the consumer
import threading, Queue
log_q = Queue.Queue()
r404_thr = threading.Thread(target=print_r404,
args=(log_q,))
r404_thr.start()
• Code that feeds the consumer
lines = follow(open("access-log"))
log = apache_log(lines)
sendto_queue(log,log_q)
101. Part 8
Advanced Data Routing
102. The Story So Far
• You can use generators to set up pipelines
• You can extend the pipeline over the network
• You can extend it between threads
• However, it's still just a pipeline (there is one
input and one output).
• Can you do more than that?
103. Multiple Sources
• Can a processing pipeline be fed by multiple
sources---for example, multiple generators?
source1   source2   source3
for item in sources:
    # Process item
104. Concatenation
• Concatenate one source after another (reprise)
def gen_cat(sources):
    for s in sources:
        for item in s:
            yield item
• This generates one big sequence
• Consumes each generator one at a time
• But only works if generators terminate
• So, you wouldn't use this for real-time streams
105. Parallel Iteration
• Zipping multiple generators together
import itertools
z = itertools.izip(s1,s2,s3)
• This one is only marginally useful
• Requires generators to go lock-step
• Terminates when any input ends
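• A small illustration of the lock-step behavior (an added example):
import itertools
squares = (x*x for x in xrange(10))
letters = iter("abc")
for sq, ch in itertools.izip(squares, letters):
    print sq, ch     # stops after 3 pairs, when "abc" is exhausted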
106. Multiplexing
• Feed a pipeline from multiple generators in
real-time--producing values as they arrive
• Example use
log1 = follow(open("foo/access-log"))
log2 = follow(open("bar/access-log"))
lines = multiplex([log1,log2])
• There is no way to poll a generator
• And only one for-loop executes at a time
107. Multiplexing
• You can multiplex if you use threads and you
use the tools we've developed so far
• Idea: each source feeds a shared queue
source1   source2   source3
     \        |        /
           queue
for item in queue:
    # Process item
108. Multiplexing
# genmultiplex.py
import threading, Queue
from genqueue import *
from gencat import *
def multiplex(sources):
    in_q = Queue.Queue()
    consumers = []
    for src in sources:
        thr = threading.Thread(target=sendto_queue,
                               args=(src,in_q))
        thr.start()
        consumers.append(genfrom_queue(in_q))
    return gen_cat(consumers)
• Note: This is the trickiest example so far...
109. Multiplexing
• Each input source is wrapped by a thread which
runs the generator and dumps the items into a
shared queue
source1        source2        source3
   |              |              |
sendto_queue  sendto_queue  sendto_queue
    \             |             /
                in_q
110. Multiplexing
• For each source, we create a consumer of
queue data
consumers = [genfrom_queue(in_q), genfrom_queue(in_q), genfrom_queue(in_q)]
• Now, just concatenate the consumers together
gen_cat(consumers)
• Each time a producer terminates, we move to
the next consumer (until there are no more)
111. Broadcasting
• Can you broadcast to multiple consumers?
           generator
          /    |    \
consumer1  consumer2  consumer3
112. Broadcasting
• Consume a generator and send to consumers
def broadcast(source, consumers):
    for item in source:
        for c in consumers:
            c.send(item)
• It works, but now the control-flow is unusual
• The broadcast loop is what runs the program
• Consumers run by having items sent to them
113. Consumers
• To create a consumer, define an object with a
send() method on it
class Consumer(object):
    def send(self,item):
        print self, "got", item
• Example:
c1 = Consumer()
c2 = Consumer()
c3 = Consumer()
lines = follow(open("access-log"))
broadcast(lines,[c1,c2,c3])
114. Network Consumer
• Example:
import socket,pickle
class NetConsumer(object):
    def __init__(self,addr):
        self.s = socket.socket(socket.AF_INET,
                               socket.SOCK_STREAM)
        self.s.connect(addr)
    def send(self,item):
        pitem = pickle.dumps(item)
        self.s.sendall(pitem)
    def close(self):
        self.s.close()
• This will route items across the network
115. Network Consumer
• Example Usage:
class Stat404(NetConsumer):
    def send(self,item):
        if item['status'] == 404:
            NetConsumer.send(self,item)
lines = follow(open("access-log"))
log = apache_log(lines)
stat404 = Stat404(("somehost",15000))
broadcast(log, [stat404])
• The 404 entries will go elsewhere...
116. Commentary
• Once you start broadcasting, consumers can't
follow the same programming model as before
• Only one for-loop can run the pipeline.
• However, you can feed an existing pipeline if
you're willing to run it in a different thread or in
a different process
117. Consumer Thread
• Example: Routing items to a separate thread
import Queue, threading
from genqueue import genfrom_queue
class ConsumerThread(threading.Thread):
    def __init__(self,target):
        threading.Thread.__init__(self)
        self.setDaemon(True)
        self.in_q = Queue.Queue()
        self.target = target
    def send(self,item):
        self.in_q.put(item)
    def run(self):
        self.target(genfrom_queue(self.in_q))
118. Consumer Thread
• Sample usage (building on earlier code)
def find_404(log):
    for r in (r for r in log if r['status'] == 404):
        print r['status'],r['datetime'],r['request']
def bytes_transferred(log):
    total = 0
    for r in log:
        total += r['bytes']
    print "Total bytes", total
c1 = ConsumerThread(find_404)
c1.start()
c2 = ConsumerThread(bytes_transferred)
c2.start()
lines = follow(open("access-log"))  # Follow a log
log = apache_log(lines)             # Turn into records
broadcast(log,[c1,c2])              # Broadcast to consumers
119. Part 9
Various Programming Tricks (And Debugging)
120. Putting it all Together
• This data processing pipeline idea is powerful
• But, it's also potentially mind-boggling
• Especially when you have dozens of pipeline
stages, broadcasting, multiplexing, etc.
• Let's look at a few useful tricks
121. Creating Generators
• Any single-argument function is easy to turn
into a generator function
def generate(func):
    def gen_func(s):
        for item in s:
            yield func(item)
    return gen_func
• Example:
import math
gen_sqrt = generate(math.sqrt)
for x in gen_sqrt(xrange(100)):
    print x
122. Debug Tracing
• A debugging function that will print items going
through a generator
def trace(source):
    for item in source:
        print item
        yield item
• This can easily be placed around any generator
lines = follow(open("access-log"))
log = trace(apache_log(lines))
r404 = trace(r for r in log if r['status'] == 404)
• Note: Might consider logging module for this
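• Following that note, a variant that routes the trace through the logging module instead of print (an added sketch):
import logging
def trace(source, name="pipeline"):
    log = logging.getLogger(name)
    for item in source:
        log.debug("item: %r", item)   # observe without consuming
        yield item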
123. Recording the Last Item
• Store the last item generated in the generator
class storelast(object):
    def __init__(self,source):
        self.source = source
    def next(self):
        item = self.source.next()
        self.last = item
        return item
    def __iter__(self):
        return self
• This can be easily wrapped around a generator
lines = storelast(follow(open("access-log")))
log = apache_log(lines)
for r in log:
    print r
    print lines.last
124. Shutting Down
• Generators can be shut down using .close()
import time
def follow(thefile):
    thefile.seek(0,2)          # Go to the end of the file
    while True:
        line = thefile.readline()
        if not line:
            time.sleep(0.1)    # Sleep briefly
            continue
        yield line
• Example:
lines = follow(open("access-log"))
for i,line in enumerate(lines):
    print line,
    if i == 10: lines.close()
125. Shutting Down
• In the generator, GeneratorExit is raised
import time
def follow(thefile):
    thefile.seek(0,2)          # Go to the end of the file
    try:
        while True:
            line = thefile.readline()
            if not line:
                time.sleep(0.1)    # Sleep briefly
                continue
            yield line
    except GeneratorExit:
        print "Follow: Shutting down"
• This allows for resource cleanup (if needed)
126. Ignoring Shutdown
• Question: Can you ignore GeneratorExit?
import time
def follow(thefile):
    thefile.seek(0,2)          # Go to the end of the file
    while True:
        try:
            line = thefile.readline()
            if not line:
                time.sleep(0.1)    # Sleep briefly
                continue
            yield line
        except GeneratorExit:
            print "Forget about it"
• Answer: No. You'll get a RuntimeError
127. Shutdown and Threads
• Question: can one thread shut down a generator running in a different thread?
lines = follow(open("foo/test.log"))

def sleep_and_close(s):
    time.sleep(s)
    lines.close()

threading.Thread(target=sleep_and_close, args=(30,)).start()

for line in lines:
    print line,
128. Shutdown and Threads
• A separate thread cannot call .close() on a generator that is currently executing
• Output:
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/threading.py", line 460, in __bootstrap
    self.run()
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/threading.py", line 440, in run
    self.__target(*self.__args, **self.__kwargs)
  File "genfollow.py", line 31, in sleep_and_close
    lines.close()
ValueError: generator already executing
129. Shutdown and Signals
• Can you shut down a generator with a signal?
import signal

def sigusr1(signo, frame):
    print "Closing it down"
    lines.close()

signal.signal(signal.SIGUSR1, sigusr1)

lines = follow(open("access-log"))
for line in lines:
    print line,
• From the command line
% kill -USR1 pid
130. Shutdown and Signals
• This also fails:
Traceback (most recent call last):
  File "genfollow.py", line 35, in <module>
    for line in lines:
  File "genfollow.py", line 8, in follow
    time.sleep(0.1)
  File "genfollow.py", line 30, in sigusr1
    lines.close()
ValueError: generator already executing
• Sigh.
131. Shutdown
• The only way to shut down a generator from the outside is to instrument it with a flag or some other kind of check
def follow(thefile, shutdown=None):
    thefile.seek(0, 2)
    while True:
        if shutdown and shutdown.isSet():
            break
        line = thefile.readline()
        if not line:
            time.sleep(0.1)
            continue
        yield line
132. Shutdown
• Example:
import threading, signal

shutdown = threading.Event()

def sigusr1(signo, frame):
    print "Closing it down"
    shutdown.set()

signal.signal(signal.SIGUSR1, sigusr1)

lines = follow(open("access-log"), shutdown)
for line in lines:
    print line,
133. Part 10
Parsing and Printing
134. Incremental Parsing
• Generators are a useful way to incrementally
parse almost any kind of data
# genrecord.py
import struct

def gen_records(record_format, thefile):
    record_size = struct.calcsize(record_format)
    while True:
        raw_record = thefile.read(record_size)
        if not raw_record:
            break
        yield struct.unpack(record_format, raw_record)
• This function sweeps through a file and
generates a sequence of unpacked records
135. Incremental Parsing
• Example:
from genrecord import *

f = open("stockdata.bin", "rb")
for name, shares, price in gen_records("<8sif", f):
    # Process data
    ...
• Tip: look at xml.etree.ElementTree.iterparse for a neat way to incrementally process large XML documents using generators
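• A rough sketch of that combination (the file and tag names here are made up; iterparse yields (event, element) pairs as the document is read):
from xml.etree.ElementTree import iterparse

def gen_elements(filename, tagname):
    # Yield matching elements one at a time instead of building
    # the whole document tree in memory
    for event, elem in iterparse(filename):
        if elem.tag == tagname:
            yield elem
            elem.clear()    # discard children already processed

for elem in gen_elements("stockdata.xml", "stock"):
    print elem.findtext("name")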
136. yield as print
• Generator functions can use yield like a print
statement
• Example:
def print_count(n):
    yield "Hello World\n"
    yield "\n"
    yield "Look at me count to %d\n" % n
    for i in xrange(n):
        yield " %d\n" % i
    yield "I'm done!\n"
• This is useful if you're producing I/O output, but
you want flexibility in how it gets handled
137. yield as print
• Examples of processing the output stream (a generator can only be consumed once, so each example calls print_count() afresh):
# Turn the output into one big string
out_str = "".join(print_count(10))

# Write it to a file
f = open("out.txt", "w")
for chunk in print_count(10):
    f.write(chunk)

# Send it across a network socket
for chunk in print_count(10):
    s.sendall(chunk)
138. yield as print
• This technique of producing output leaves the
exact output method unspecified
• So, the code is not hardwired to use files,
sockets, or any other specific kind of output
• There is an interesting code-reuse element
• One use of this: WSGI applications
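• A minimal sketch of the WSGI connection (hypothetical app; the WSGI spec lets an application return any iterable of strings, so a generator works directly):
def hello_app(environ, start_response):
    # The body is produced lazily, one chunk at a time; the
    # server decides how the chunks actually get transmitted
    start_response('200 OK', [('Content-Type', 'text/plain')])
    yield "Hello World\n"
    yield "This body came from a generator\n"

from wsgiref.simple_server import make_server
make_server('', 8000, hello_app).serve_forever()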
139. Part 11
Co-routines
140. The Final Frontier
• In Python 2.5, generators picked up the ability
to receive values using .send()
def recv_count():
    try:
        while True:
            n = (yield)    # Yield expression
            print "T-minus", n
    except GeneratorExit:
        print "Kaboom!"
• Think of this function as receiving values rather
than generating them
141. Example Use
• Using a receiver
>>> r = recv_count()
>>> r.next()               # Note: must call .next() here first
>>> for i in range(5,0,-1):
... r.send(i)
...
T-minus 5
T-minus 4
T-minus 3
T-minus 2
T-minus 1
>>> r.close()
Kaboom!
>>>
142. Co-routines
• This form of a generator is a "co-routine"
• Also sometimes called a "reverse-generator"
• Python books (mine included) do a pretty poor
job of explaining how co-routines are supposed
to be used
• I like to think of them as "receivers" or "consumers": they receive values sent to them
143. Setting up a Coroutine
• To get a co-routine to run properly, you have to
ping it with a .next() operation first
def recv_count():
    try:
        while True:
            n = (yield)    # Yield expression
            print "T-minus", n
    except GeneratorExit:
        print "Kaboom!"
• Example:
r = recv_count()
r.next()
• This advances it to the first yield, where it will receive its first value
144. @consumer decorator
• The .next() bit can be handled via decoration
def consumer(func):
    def start(*args, **kwargs):
        c = func(*args, **kwargs)
        c.next()
        return c
    return start
• Example:
@consumer
def recv_count():
    try:
        while True:
            n = (yield)    # Yield expression
            print "T-minus", n
    except GeneratorExit:
        print "Kaboom!"
145. @consumer decorator
• Using the decorated version
>>> r = recv_count()
>>> for i in range(5,0,-1):
... r.send(i)
...
T-minus 5
T-minus 4
T-minus 3
T-minus 2
T-minus 1
>>> r.close()
Kaboom!
>>>
• Don't need the extra .next() step here
146. Coroutine Pipelines
• Co-routines also set up a processing pipeline
• Instead of being defined by iteration, it's defined by pushing values into the pipeline using .send()
source --.send()--> coroutine --.send()--> coroutine --.send()--> ...
• We already saw some of this with broadcasting
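• A stage can also filter what it receives and push the survivors onward (a sketch, assuming the @consumer decorator shown earlier):
@consumer
def grep_status(code, target):
    # Receive records and forward only those with a matching status
    while True:
        r = (yield)
        if r['status'] == code:
            target.send(r)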
147. Broadcasting (Reprise)
• Consume a generator and send items to a set
of consumers
def broadcast(source, consumers):
    for item in source:
        for c in consumers:
            c.send(item)
• Notice the send() operation there
• The consumers could be co-routines
148. Example
@consumer
def find_404():
    while True:
        r = (yield)
        if r['status'] == 404:
            print r['status'], r['datetime'], r['request']

@consumer
def bytes_transferred():
    total = 0
    while True:
        r = (yield)
        total += r['bytes']
        print "Total bytes", total

lines = follow(open("access-log"))
log = apache_log(lines)
broadcast(log, [find_404(), bytes_transferred()])
149. Discussion
• In the last example, there were multiple consumers
• However, there were no threads
• Further exploration along these lines takes you into co-operative multitasking and concurrent programming without threads
• But that's an entirely different tutorial!