Python in Action (Part 2)

Official tutorial slides from USENIX LISA, Nov. 16, 2007.

Transcript

  • 1. Python in Action Presented at USENIX LISA Conference November 16, 2007 David M. Beazley http://www.dabeaz.com (Part II - Systems Programming) Copyright (C) 2007, http://www.dabeaz.com 2- 1
  • 2. Section Overview • In this section, we're going to get dirty • Systems Programming • Files, I/O, file-system • Text parsing, data decoding • Processes and IPC • Networking • Threads and concurrency Copyright (C) 2007, http://www.dabeaz.com 2- 2
  • 3. Commentary • I personally think Python is a fantastic tool for systems programming. • Modules provide access to most of the major system libraries I used to access via C • No enforcement of "morality" • Decent performance • It just "works" and it's fun Copyright (C) 2007, http://www.dabeaz.com 2- 3
  • 4. Approach • I've thought long and hard about how I would present this part of the class. • A reference manual approach would probably be long and very boring. • So instead, we're going to focus on building something more in tune with the times Copyright (C) 2007, http://www.dabeaz.com 2- 4
  • 5. "To Catch a Slacker" • Write a collection of Python programs that can quietly monitor Firefox browser caches to find out who has been spending their day reading Slashdot instead of working on their TPS reports. • Oh yeah, and be a real sneaky bugger about it. Copyright (C) 2007, http://www.dabeaz.com 2- 5
  • 6. Why this Problem? • Involves a real-world system and data • Firefox already installed on your machine (?) • Cross platform (Linux, Mac, Windows) • Example of tool building • Related to a variety of practical problems • A good tour of "Python in Action" Copyright (C) 2007, http://www.dabeaz.com 2- 6
  • 7. Disclaimers • I am not involved in browser forensics (or spyware for that matter). • I am in no way affiliated with Firefox/Mozilla nor have I ever seen Firefox source code • I have never worked with the cache data prior to preparing this tutorial • I have never used any third-party tools for looking at this data. Copyright (C) 2007, http://www.dabeaz.com 2- 7
  • 8. More Disclaimers • All of the code in this tutorial works with a standard Python installation • No third party modules. • All code is cross-platform • Code samples are available online at http://www.dabeaz.com/action/ • Please look at that code and follow along Copyright (C) 2007, http://www.dabeaz.com 2- 8
  • 9. Assumptions • This is not a tutorial on systems concepts • You should be generally familiar with background material (files, filesystems, file formats, processes, threads, networking, protocols, etc.) • Hopefully you can "extrapolate" from the material presented here to construct more advanced Python applications. Copyright (C) 2007, http://www.dabeaz.com 2- 9
  • 10. The Big Picture • We want to write a tool that allows someone to locate, inspect, and perform queries across a distributed collection of Firefox caches. • For example, the cache directories on all machines on the LAN of a quasi-evil corporation. Copyright (C) 2007, http://www.dabeaz.com 2- 10
  • 11. The Firefox Cache • The Firefox browser keeps a disk cache of recently visited sites:
    % ls Cache/
    -rw------- 1 beazley  111169 Sep 25 17:15 01CC0844d01
    -rw------- 1 beazley  104991 Sep 25 17:15 01CC3844d01
    -rw------- 1 beazley   47233 Sep 24 16:41 021F221Ad01
    ...
    -rw------- 1 beazley   26749 Sep 21 11:19 FF8AEDF0d01
    -rw------- 1 beazley   58172 Sep 25 18:16 FFE628C6d01
    -rw------- 1 beazley 1939456 Sep 25 19:14 _CACHE_001_
    -rw------- 1 beazley 2588672 Sep 25 19:14 _CACHE_002_
    -rw------- 1 beazley 4567040 Sep 25 18:44 _CACHE_003_
    -rw------- 1 beazley   33044 Sep 23 21:58 _CACHE_MAP_
  • A bunch of cryptically named files. Copyright (C) 2007, http://www.dabeaz.com 2- 11
  • 12. Problem : Finding Files • Find the Firefox cache: Write a program findcache.py that takes a directory name as input and recursively scans that directory and all subdirectories looking for Firefox/Mozilla cache directories. • Example:
    % python findcache.py /Users/beazley
    /Users/beazley/Library/.../qs1ab616.default/Cache
    /Users/beazley/Library/.../wxuoyiuf.slt/Cache
    %
  • Use case: Searching for things on the filesystem. Copyright (C) 2007, http://www.dabeaz.com 2- 12
  • 13. findcache.py
    # findcache.py
    # Recursively scan a directory looking for
    # Firefox/Mozilla cache directories
    import sys
    import os

    if len(sys.argv) != 2:
        print >>sys.stderr, "Usage: python findcache.py dirname"
        raise SystemExit(1)

    caches = (path for path, dirs, files in os.walk(sys.argv[1])
                   if '_CACHE_MAP_' in files)

    for name in caches:
        print name
  Copyright (C) 2007, http://www.dabeaz.com 2- 13
  • 14. The sys module • The sys module has basic information related to the execution environment:
    sys.argv     # A list of the command line options, e.g.
                 # sys.argv = ['findcache.py', '/Users/beazley']
    sys.stdin    # Standard I/O files
    sys.stdout
    sys.stderr
  Copyright (C) 2007, http://www.dabeaz.com 2- 14
  • 15. Program Termination • In findcache.py, raise SystemExit(1) raises the SystemExit exception, which forces Python to exit. The value is the return code. Copyright (C) 2007, http://www.dabeaz.com 2- 15
  • 16. os Module • The os module contains useful OS related functions (files, processes, etc.), including the os.walk() used in findcache.py. Copyright (C) 2007, http://www.dabeaz.com 2- 16
  • 17. os.walk() • os.walk(topdir) recursively walks a directory tree and generates a sequence of tuples (path,dirs,files):
    path  = The current directory name
    dirs  = List of all subdirectory names in path
    files = List of all regular files (data) in path
  Copyright (C) 2007, http://www.dabeaz.com 2- 17
  • 18. A Sequence of Caches • The generator expression
    caches = (path for path,dirs,files in os.walk(sys.argv[1])
                   if '_CACHE_MAP_' in files)
  generates a sequence of directory names where '_CACHE_MAP_' is contained in the file list; the if clause is the file name check that selects each directory name that is generated as a result. Copyright (C) 2007, http://www.dabeaz.com 2- 18
  • 19. Printing the Result • The final loop
    for name in caches:
        print name
  prints the sequence of cache directories that are generated by the previous statement. Copyright (C) 2007, http://www.dabeaz.com 2- 19
  • 20. Commentary • Our solution is strongly based on a "declarative" programming style (again) • We simply write out a sequence of operations that produce what we want • Not focused on the underlying mechanics of how to traverse all of the directories. Copyright (C) 2007, http://www.dabeaz.com 2- 20
  • 21. Big Idea : Iteration • Python allows iteration to be captured as a kind of object. caches = (path for path,dirs,files in os.walk(sys.argv[1]) if '_CACHE_MAP_' in files) • This de-couples iteration from the code that uses the iteration for name in caches: print name • Another usage example: for name in caches: print len(os.listdir(name)), name Copyright (C) 2007, http://www.dabeaz.com 2- 21
  • 22. Big Idea : Iteration • Compare to this: for path,dirs,files in os.walk(sys.argv[1]): if '_CACHE_MAP_' in files: print len(os.listdir(path)),path • This code is simple, but the loop and the code that executes in the loop body are coupled together • Not as flexible, but this is somewhat subtle to wrap your brain around at first. Copyright (C) 2007, http://www.dabeaz.com 2- 22
  • 23. Mini-Reference : sys, os • sys module
    sys.argv         # List of command line options
    sys.stdin        # Standard input
    sys.stdout       # Standard output
    sys.stderr       # Standard error
    sys.executable   # Full path of Python executable
    sys.exc_info()   # Information on current exception
  • os module
    os.walk(dir)     # Recursively walk dir producing a
                     # sequence of tuples (path,dlist,flist)
    os.listdir(dir)  # Return a list of all files in dir
  • SystemExit exception
    raise SystemExit(n)   # Exit with integer code n
  Copyright (C) 2007, http://www.dabeaz.com 2- 23
  • 24. Problem: Searching for Text • Extract all URL requests from the cache: Write a program requests.py that scans the contents of the _CACHE_00n_ files and prints a list of URLs for documents stored in the cache. • Example:
    % python requests.py /Users/.../qs1ab616.default/Cache
    http://www.yahoo.com/
    http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/js/ad_eo_1.1.j
    http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif
    http://us.i1.yimg.com/us.yimg.com/i/ww/thm/1/search_1.1.png
    ...
    %
  • Use case: Searching the contents of files for text patterns. Copyright (C) 2007, http://www.dabeaz.com 2- 24
  • 25. The Firefox Cache • The cache directory holds two types of data • Metadata (URLs, headers, etc.). • Raw data (HTML, JPEG, PNG, etc.) • This data is stored in two places • Cryptic files in the Cache directory • Blocks inside the _CACHE_00n_ files • Metadata almost always in _CACHE_00n_ Copyright (C) 2007, http://www.dabeaz.com 2- 25
  • 26. Possible Solution : Regex • The _CACHE_00n_ files are encoded in a binary format, but URLs are embedded inside as null-terminated text:
    \x00\x01\x00\x08\x92\x00\x02\x18\x00\x00\x00\x13F\xff\x9f
    \xceF\xff\x9f\xce\x00\x00\x00\x00\x00\x00H)\x00\x00\x00\x1a
    \x00\x00\x023HTTP:http://slashdot.org/\x00request-method\x00
    GET\x00request-User-Agent\x00Mozilla/5.0 (Macintosh; U; Intel
    Mac OS X; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7\x00
    request-Accept-Encoding\x00gzip,deflate\x00response-head\x00
    HTTP/1.1 200 OK\r\nDate: Sun, 30 Sep 2007 13:07:29 GMT\r\n
    Server: Apache/1.3.37 (Unix) mod_perl/1.29\r\nSLASH_LOG_DATA:
    shtml\r\nX-Powered-By: Slash 2.005000176\r\nX-Fry: How can I
    live my life if I can't tell good from evil?\r\nCache-Control:
  • Maybe the requests could just be ripped using a regular expression. Copyright (C) 2007, http://www.dabeaz.com 2- 26
  • 27. A Regex Solution
    # requests.py
    import re
    import os
    import sys

    cachedir   = sys.argv[1]
    cachefiles = ['_CACHE_001_', '_CACHE_002_', '_CACHE_003_']

    # A regex for embedded URL strings
    request_pat = re.compile(r'([a-z]+://.*?)\x00')

    # Loop over all files and search for URLs
    for name in cachefiles:
        data = open(os.path.join(cachedir, name), "rb").read()
        index = 0
        while True:
            m = request_pat.search(data, index)
            if not m:
                break
            print m.group(1)
            index = m.end()
  Copyright (C) 2007, http://www.dabeaz.com 2- 27
  • 28. The re module • The re module contains all functionality related to regular expression pattern matching, searching, replacing, etc. Its features are strongly influenced by Perl, but regexes are not directly integrated into the Python language. Copyright (C) 2007, http://www.dabeaz.com 2- 28
  • 29. Using re • Patterns are first specified as strings and compiled into a regex object:
    pat = re.compile(pattern [,flags])
  The pattern syntax is "standard": pat*, pat+, pat?, (pat), ., pat1|pat2, [chars], [^chars], pat{n}, pat{n,m}. Copyright (C) 2007, http://www.dabeaz.com 2- 29
  • 30. Using re • All subsequent operations are methods of the compiled regex pattern:
    m = pat.match(data [,start])    # Check for match
    m = pat.search(data [,start])   # Search for match
    newdata = pat.sub(repl, data)   # Pattern replace
  Copyright (C) 2007, http://www.dabeaz.com 2- 30
  • 31. Searching for Matches • pat.search(text [,start]) searches the string text for the first occurrence of the regex pattern starting at position start. It returns a "MatchObject" if a match is found. In the requests.py loop, we're finding matches one at a time. Copyright (C) 2007, http://www.dabeaz.com 2- 31
  • 32. Match Objects • Regex matches are represented by a MatchObject:
    m.group([n])   # Text matched by group n
    m.start([n])   # Starting index of group n
    m.end([n])     # End index of group n
  In requests.py, m.group(1) is the matching text for just the URL, and m.end() is the end of the match. Copyright (C) 2007, http://www.dabeaz.com 2- 32
  • 33. Groups • In patterns, parentheses () define groups which are numbered left to right:
    group 0   # The entire pattern
    group 1   # Text in first () group
    group 2   # Text in next () group
    ...
  Copyright (C) 2007, http://www.dabeaz.com 2- 33
  • 34. Mini-Reference : re • re pattern compilation
    pat = re.compile(r'patternstring')
  • Pattern syntax
    literal     # Match literal text
    pat*        # Match 0 or more repetitions of pat
    pat+        # Match 1 or more repetitions of pat
    pat?        # Match 0 or 1 repetitions of pat
    pat1|pat2   # Match pat1 or pat2
    (pat)       # Match pat (group)
    [chars]     # Match characters in chars
    [^chars]    # Match characters not in chars
    .           # Match any character except \n
    \d          # Match any digit
    \w          # Match alphanumeric character
    \s          # Match whitespace
  Copyright (C) 2007, http://www.dabeaz.com 2- 34
  • 35. Mini-Reference : re • Common pattern operations
    pat.search(text)     # Search text for a match
    pat.match(text)      # Search start of text for match
    pat.sub(repl, text)  # Replace pattern with repl
  • Match objects
    m.group([n])   # Text matched by group n
    m.start([n])   # Starting position of group n
    m.end([n])     # Ending position of group n
  • How to loop over all matches of a pattern
    for m in pat.finditer(text):
        # m is a MatchObject that you process
        ...
  Copyright (C) 2007, http://www.dabeaz.com 2- 35
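  For instance, the while-loop in requests.py can be written more compactly with finditer(); a minimal sketch using the same URL pattern (the cache file path below is just an example):
    import re

    request_pat = re.compile(r'([a-z]+://.*?)\x00')
    data = open("Cache/_CACHE_001_", "rb").read()
    # finditer() yields one MatchObject per embedded URL
    for m in request_pat.finditer(data):
        print m.group(1)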
  • 36. Mini-Reference : re • An example of pattern replacement
    # This replaces American dates of the form 'mm/dd/yyyy'
    # with European dates of the form 'dd/mm/yyyy'.

    # This function takes a MatchObject as input and returns
    # replacement text as output.  (Groups are strings, so we
    # format with %s.)
    def euro_date(m):
        month = m.group(1)
        day   = m.group(2)
        year  = m.group(3)
        return "%s/%s/%s" % (day, month, year)

    # Date re pattern and replacement operation
    datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
    newdata = datepat.sub(euro_date, text)
  Copyright (C) 2007, http://www.dabeaz.com 2- 36
  • 37. Mini-Reference : re • There are many more features of the re module • Strongly influenced by Perl (feature set) • Regexs are a library in Python, not integrated into the language. • A book on regular expressions may be essential for advanced functions. Copyright (C) 2007, http://www.dabeaz.com 2- 37
  • 38. File Handling • What is going on in this statement from the requests.py loop?
    data = open(os.path.join(cachedir, name), "rb").read()
  Copyright (C) 2007, http://www.dabeaz.com 2- 38
  • 39. os.path module • os.path has portable file related functions:
    os.path.join(name1,name2,...)   # Join path names
    os.path.getsize(filename)       # Get the file size
    os.path.getmtime(filename)      # Get modification date
  There are many more functions, but this is the preferred module for basic filename handling. Copyright (C) 2007, http://www.dabeaz.com 2- 39
  • 40. os.path.join() • Creates a fully-expanded pathname:
    dirname  = '/foo/bar'
    filename = 'name'
    os.path.join(dirname, filename)   # '/foo/bar/name'
  Aware of platform differences ('/' vs. '\'). Copyright (C) 2007, http://www.dabeaz.com 2- 40
  • 41. Mini-Reference : os.path
    os.path.join(s1,s2,...)   # Join pathname parts together
    os.path.getsize(path)     # Get file size of path
    os.path.getmtime(path)    # Get modify time of path
    os.path.getatime(path)    # Get access time of path
    os.path.getctime(path)    # Get creation time of path
    os.path.exists(path)      # Check if path exists
    os.path.isfile(path)      # Check if regular file
    os.path.isdir(path)       # Check if directory
    os.path.islink(path)      # Check if symbolic link
    os.path.basename(path)    # Return file part of path
    os.path.dirname(path)     # Return dir part of path
    os.path.abspath(path)     # Get absolute path
  Copyright (C) 2007, http://www.dabeaz.com 2- 41
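  To see a few of these in action, here is a small sketch (not from the slides) that combines os.walk() with os.path to report the total size of each Firefox cache directory found under a starting directory:
    import sys, os

    for path, dirs, files in os.walk(sys.argv[1]):
        if '_CACHE_MAP_' in files:
            # Sum the sizes of all regular files in the cache directory
            total = sum(os.path.getsize(os.path.join(path, name))
                        for name in files)
            print total, path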
  • 42. Binary I/O • For all binary files, use modes "rb", "wb", etc.:
    data = open(os.path.join(cachedir, name), "rb").read()
  This disables new-line translation (critical on Windows). Copyright (C) 2007, http://www.dabeaz.com 2- 42
  • 43. Common I/O Shortcuts
    # Read an entire file into a string
    data = open(filename).read()

    # Write a string out to a file
    open(filename,"w").write(text)

    # Loop over all lines in a file
    for line in open(filename):
        ...
  Copyright (C) 2007, http://www.dabeaz.com 2- 43
  • 44. Commentary on Solution • This regex approach is mostly a hack for this particular application. • Reads entire cache files into memory as strings (may be quite large) • Only finds URLs, no other metadata • Some risk of false positives since URLs could also be embedded in data. Copyright (C) 2007, http://www.dabeaz.com 2- 44
  • 45. Commentary • We have started to build a collection of very simple command line tools • Very much in the "Unix tradition." • Python makes it easy to create such tools • More complex applications could be assembled by simply gluing scripts together Copyright (C) 2007, http://www.dabeaz.com 2- 45
  • 46. Working with Processes • It is common to write programs that run other programs, collect their output, etc. • Pipes • Interprocess Communication • Python has a variety of modules for supporting this. Copyright (C) 2007, http://www.dabeaz.com 2- 46
  • 47. subprocess Module • A module for creating and interacting with subprocesses • Consolidates a number of low-level OS functions such as system(), execv(), spawnv(), pipe(), popen2(), etc. into a single module • Cross platform (Unix/Windows) Copyright (C) 2007, http://www.dabeaz.com 2- 47
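  Before using it for real, a minimal sketch of the core subprocess API (the command here is Unix-specific and chosen only for illustration):
    import subprocess

    p = subprocess.Popen(["ls", "-l"], stdout=subprocess.PIPE)
    out, err = p.communicate()   # Wait for exit, collect stdout
    print p.returncode           # Exit status of the subprocess
    print out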
  • 48. Example : Slackers • Find slacker cache entries. Using the programs findcache.py and requests.py as subprocesses, write a program that inspects cache directories and prints out all entries that contain the word 'slashdot' in the URL. Copyright (C) 2007, http://www.dabeaz.com 2- 48
  • 49. slackers.py
    # slackers.py
    import sys
    import subprocess

    # Run findcache.py as a subprocess
    finder = subprocess.Popen(
        [sys.executable, "findcache.py", sys.argv[1]],
        stdout=subprocess.PIPE)
    dirlist = [line.strip() for line in finder.stdout]

    # Run requests.py as a subprocess
    for cachedir in dirlist:
        searcher = subprocess.Popen(
            [sys.executable, "requests.py", cachedir],
            stdout=subprocess.PIPE)
        for line in searcher.stdout:
            if 'slashdot' in line:
                print line,
  Copyright (C) 2007, http://www.dabeaz.com 2- 49
  • 50. Launching a subprocess • The first Popen launches a python script as a subprocess, connecting its stdout stream to a pipe. The list comprehension
    dirlist = [line.strip() for line in finder.stdout]
  is the collection of output, with newline stripping. Copyright (C) 2007, http://www.dabeaz.com 2- 50
  • 51. Python Executable • sys.executable is the full pathname of the python interpreter, which is what slackers.py uses to launch the scripts. Copyright (C) 2007, http://www.dabeaz.com 2- 51
  • 52. Subprocess Arguments • The list [sys.executable, "requests.py", cachedir] is the list of arguments to the subprocess. It corresponds to what would appear on a shell command line. Copyright (C) 2007, http://www.dabeaz.com 2- 52
  • 53. slackers.py • More of the same idea. For each directory we found in the last step, we run requests.py to produce requests, then filter its output. Copyright (C) 2007, http://www.dabeaz.com 2- 53
  • 54. Commentary • subprocess is a large module with many options. • However, it takes care of a lot of annoying platform-specific details for you. • Currently the "recommended" way of dealing with subprocesses. Copyright (C) 2007, http://www.dabeaz.com 2- 54
  • 55. Low Level Subprocesses • Running a simple system command
    os.system("shell command")
  • Connecting to a subprocess with pipes
    pout, pin = popen2.popen2("shell command")
  • Exec/spawn
    os.execv(), os.execl(), os.execle(), ...
    os.spawnv(), os.spawnvl(), os.spawnle(), ...
  • Unix fork()
    os.fork(), os.wait(), os.waitpid(), os._exit(), ...
  Copyright (C) 2007, http://www.dabeaz.com 2- 55
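  For comparison, a minimal Unix-only sketch of the fork()/wait() style that subprocess wraps up for you:
    import os

    pid = os.fork()              # Clone the current process (Unix only)
    if pid == 0:
        print "in the child"
        os._exit(0)              # Child exits immediately
    else:
        pid, status = os.waitpid(pid, 0)   # Parent waits for the child
        print "child %d exited with status %d" % (pid, status)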
  • 56. Interactive Processes • Python does not have built-in support for controlling interactive subprocesses (e.g., "Expect") • Must install third party modules for this • Example: pexpect • http://pexpect.sourceforge.net Copyright (C) 2007, http://www.dabeaz.com 2- 56
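  As a rough sketch of what pexpect usage looks like; the host name and the exact prompt strings below are made up for illustration:
    import pexpect   # third-party module

    child = pexpect.spawn("ftp ftp.example.com")   # hypothetical server
    child.expect("Name .*: ")                      # Wait for login prompt
    child.sendline("anonymous")
    child.expect("Password:")
    child.sendline("guest@example.com")
    child.expect("ftp> ")
    child.sendline("quit")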
  • 57. Commentary • Writing small Unix-like utilities is fairly straightforward in Python • Support for standard kinds of operations (files, regular expressions, pipes, subprocesses, etc.) • However, our solution is also kind of clunky • Only returns some information • Not particularly memory efficient (reads large files into memory) Copyright (C) 2007, http://www.dabeaz.com 2- 57
  • 58. Interlude • Python is well-suited to building libraries and frameworks. • In the next part, we're going to take a totally different approach than simply writing simple utilities. • Will build libraries for manipulating cache data and use those libraries to build tools. Copyright (C) 2007, http://www.dabeaz.com 2- 58
  • 59. Problem : Parsing Data • Extract the cache data (for real): Write a module ffcache.py that contains a set of functions for reading Firefox cache data into useful data structures that can be used by other programs. Capture all available information including URLs, timestamps, sizes, locations, content types, etc. • Use case: Blood and guts. Writing programs that can process foreign file formats. Processing binary encoded data. Creating code for later reuse. Copyright (C) 2007, http://www.dabeaz.com 2- 59
  • 60. The Firefox Cache • There are four critical files _CACHE_MAP_ # Cache index _CACHE_001_ # Cache data _CACHE_002_ # Cache data _CACHE_003_ # Cache data • All files are binary-encoded • _CACHE_MAP_ is used by Firefox to locate data, but it is not updated until Firefox exits. • We will ignore _CACHE_MAP_ since we want to observe caches of live Firefox sessions. Copyright (C) 2007, http://www.dabeaz.com 2- 60
  • 61. Firefox _CACHE_ Files • _CACHE_00n_ file organization: a free/used block bitmap (4096 bytes) followed by up to 32768 blocks. • The block size varies according to the file:
    _CACHE_001_   256 byte blocks
    _CACHE_002_   1024 byte blocks
    _CACHE_003_   4096 byte blocks
  Copyright (C) 2007, http://www.dabeaz.com 2- 61
  • 62. Cache Entries • Each cache entry: • A maximum of 4 cache blocks • Can either be data or metadata • If >16K, written to a file instead • Notice how all the "cryptic" files are >16K
    -rw------- beazley 111169 Sep 25 17:15 01CC0844d01
    -rw------- beazley 104991 Sep 25 17:15 01CC3844d01
    -rw------- beazley  47233 Sep 24 16:41 021F221Ad01
    ...
    -rw------- beazley  26749 Sep 21 11:19 FF8AEDF0d01
    -rw------- beazley  58172 Sep 25 18:16 FFE628C6d01
  Copyright (C) 2007, http://www.dabeaz.com 2- 62
  • 63. Cache Metadata • Metadata is encoded as a binary structure:
    Header           36 bytes
    Request String   Variable length (in header)
    Request Info     Variable length (in header)
  • Header encoding (binary, big-endian)
    0-3    magic (???)   unsigned int (0x00010008)
    4-7    location      unsigned int
    8-11   fetchcount    unsigned int
    12-15  fetchtime     unsigned int (system time)
    16-19  modifytime    unsigned int (system time)
    20-23  expiretime    unsigned int (system time)
    24-27  datasize      unsigned int (byte count)
    28-31  requestsize   unsigned int (byte count)
    32-35  infosize      unsigned int (byte count)
  Copyright (C) 2007, http://www.dabeaz.com 2- 63
  • 64. Solution Outline • Part 1: Parsing Metadata Headers • Part 2: Getting request information (URL) • Part 3: Extracting additional content info • Part 4: Scanning of individual cache files • Part 5: Scanning an entire directory • Part 6: Scanning a list of directories Copyright (C) 2007, http://www.dabeaz.com 2- 64
  • 65. Part I - Reading Headers • Write a function that can parse the binary metadata header and return the data in a useful format Copyright (C) 2007, http://www.dabeaz.com 2- 65
  • 66. Reading Headers
    import struct

    # This function parses a cache metadata header into a dict
    # of named fields (listed in _headernames below)
    _headernames = ['magic','location','fetchcount',
                    'fetchtime','modifytime','expiretime',
                    'datasize','requestsize','infosize']

    def parse_meta_header(headerdata):
        head = struct.unpack(">9I", headerdata)
        meta = dict(zip(_headernames, head))
        return meta
  Copyright (C) 2007, http://www.dabeaz.com 2- 66
  • 67. Reading Headers • How this is supposed to work:
    >>> f = open("Cache/_CACHE_001_","rb")
    >>> f.seek(4096)              # Skip the bit map
    >>> headerdata = f.read(36)   # Read 36 byte header
    >>> meta = parse_meta_header(headerdata)
    >>> meta
    {'fetchtime': 1190829792, 'requestsize': 27, 'magic': 65544,
     'fetchcount': 3, 'expiretime': 0, 'location': 2449473536L,
     'modifytime': 1190829792, 'datasize': 29448, 'infosize': 531}
    >>>
  • Basically, we're parsing the header into a useful Python data structure (a dictionary) Copyright (C) 2007, http://www.dabeaz.com 2- 67
  • 68. struct module • struct parses binary encoded data into Python objects. You would use this module to pack/unpack raw binary data from Python strings. In parse_meta_header(),
    head = struct.unpack(">9I", headerdata)
  unpacks 9 unsigned 32-bit big-endian integers. Copyright (C) 2007, http://www.dabeaz.com 2- 68
  • 69. struct module • The result of struct.unpack() is always a tuple of converted values, e.g.
    head = (65544, 0, 1, 1191682051, 1191682051, 0, 8645, 190, 218)
  Copyright (C) 2007, http://www.dabeaz.com 2- 69
  • 70. Dictionary Creation • zip(s1,s2) makes a list of tuples:
    zip(_headernames, head)
    [('magic', head[0]),
     ('location', head[1]),
     ('fetchcount', head[2]),
     ...
    ]
  dict() then makes a dictionary from those (key, value) pairs. Copyright (C) 2007, http://www.dabeaz.com 2- 70
  • 71. Commentary • Dictionaries as data structures meta = { 'fetchtime' : 1190829792, 'requestsize' : 27, 'magic' : 65544, 'fetchcount' : 3, 'expiretime' : 0, 'location' : 2449473536L, 'modifytime' : 1190829792, 'datasize' : 29448, 'infosize' : 531 } • Useful if data has many parts data = f.read(meta[8]) # Huh?!? vs. data = f.read(meta['infosize']) # Better Copyright (C) 2007, http://www.dabeaz.com 2- 71
  • 72. Mini-reference : struct • struct module
    items = struct.unpack(fmt, data)
    data  = struct.pack(fmt, item1, ..., itemn)
  • Sample format codes
    'c'  char (1 byte string)
    'b'  signed char (8-bit integer)
    'B'  unsigned char (8-bit integer)
    'h'  signed short (16-bit integer)
    'H'  unsigned short (16-bit integer)
    'i'  int (32-bit integer)
    'I'  unsigned int (32-bit integer)
    'f'  32-bit single precision float
    'd'  64-bit double precision float
    's'  char s[] (string)
    '>'  Big endian modifier
    '<'  Little endian modifier
    '!'  Network order modifier
    'n'  Repetition count modifier
  Copyright (C) 2007, http://www.dabeaz.com 2- 72
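  A small round-trip sketch of pack()/unpack(); the values are chosen arbitrarily:
    import struct

    # Pack an unsigned short, an unsigned int, and a 2-byte string
    # into 8 bytes, big-endian
    data = struct.pack(">HI2s", 2007, 31337, "OK")
    print repr(data)
    # Unpack reverses it; the result is always a tuple
    year, port, tag = struct.unpack(">HI2s", data)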
  • 73. Part 2 : Parsing Requests • Write a function that will read the URL request string and request information • Request String : A Null-terminated string • Request Info : A sequence of Null-terminated key-value pairs (like a dictionary) Copyright (C) 2007, http://www.dabeaz.com 2- 73
  • 74. Parsing Requests
    import re
    part_pat = re.compile(r'[\n\r -~]*$')

    def parse_request_data(meta, requestdata):
        parts = requestdata.split('\x00')
        for part in parts:
            if not part_pat.match(part):
                return False
        request = parts[0]
        if len(request) != (meta['requestsize'] - 1):
            return False
        info = dict(zip(parts[1::2], parts[2::2]))
        meta['request'] = request.split(':', 1)[1]
        meta['info'] = info
        return True
  Copyright (C) 2007, http://www.dabeaz.com 2- 74
  • 75. Usage : Requests • Usage of the function:
    >>> f = open("Cache/_CACHE_001_","rb")
    >>> f.seek(4096)              # Skip the bit map
    >>> headerdata = f.read(36)   # Read 36 byte header
    >>> meta = parse_meta_header(headerdata)
    >>> requestdata = f.read(meta['requestsize']+meta['infosize'])
    >>> parse_request_data(meta,requestdata)
    True
    >>> meta['request']
    'http://www.yahoo.com/'
    >>> meta['info']
    {'request-method': 'GET', 'request-User-Agent': 'Mozilla/5.0
    (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.7) Gecko/
    20070914 Firefox/2.0.0.7', 'charset': 'UTF-8', 'response-
    head': 'HTTP/1.1 200 OK\r\nDate: Wed, 26 Sep 2007 18:03:17 ...' }
    >>>
  Copyright (C) 2007, http://www.dabeaz.com 2- 75
  • 76. String Stripping • The request data is a sequence of null-terminated strings. Splitting breaks the data up into parts:
    requestdata = 'part\x00part\x00part\x00part\x00...'
    parts = requestdata.split('\x00')
    parts = ['part','part','part','part',...]
  Copyright (C) 2007, http://www.dabeaz.com 2- 76
  • 77. String Validation • Individual parts are printable characters except for newline characters ('\n\r'). We use the re module to match each string against r'[\n\r -~]*$'. This would help catch cases where we might be reading bad data (false headers, raw data, etc.). Copyright (C) 2007, http://www.dabeaz.com 2- 77
  • 78. URL Request String • The request string is the first part (parts[0]). The check that follows makes sure it's the right size (a further sanity check on the data integrity). Copyright (C) 2007, http://www.dabeaz.com 2- 78
  • 79. Request Info • Each request has a set of associated data represented as key/value pairs:
    parts = ['request','key','val','key','val','key','val']
    parts[1::2]   # ['key','key','key']
    parts[2::2]   # ['val','val','val']
    zip(parts[1::2], parts[2::2])
    [('key','val'),
     ('key','val'),
     ('key','val')]
  dict() makes a dictionary from the (key,val) tuples. Copyright (C) 2007, http://www.dabeaz.com 2- 79
  • 80. Fixing the Request • Cleaning up the request string:
    request = "HTTP:http://www.google.com"
    request.split(':',1)      # ['HTTP','http://www.google.com']
    request.split(':',1)[1]   # 'http://www.google.com'
  The same cleanup appears in this variant, which reads the request data directly from a file:
    # Given a dictionary of header information and a file,
    # this function extracts the request data from a cache
    # metadata entry and saves it in the dictionary. Returns
    # True or False depending on success.
    def read_request_data(header, f):
        request  = f.read(header['requestsize']).strip('\x00')
        infodata = f.read(header['infosize']).strip('\x00')
        # Validate request and infodata here (nothing now)
        # Turn the infodata into a dictionary
        parts = infodata.split('\x00')
        info = dict(zip(parts[::2], parts[1::2]))
        header['request'] = request.split(':',1)[1]
        header['info'] = info
        return True
  Copyright (C) 2007, http://www.dabeaz.com 2- 80
  • 81. Commentary • Emphasize that Python has very powerful list manipulation primitives • Indexing • Slicing • List comprehensions • Etc. • Knowing how to use these leads to rapid development and compact code Copyright (C) 2007, http://www.dabeaz.com 2- 81
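  A quick demonstration of those primitives (plain Python, nothing cache-specific):
    parts = ['request', 'key1', 'val1', 'key2', 'val2']
    print parts[0]       # Indexing: 'request'
    print parts[1::2]    # Extended slice: ['key1', 'key2']
    print parts[2::2]    # Extended slice: ['val1', 'val2']
    info = dict(zip(parts[1::2], parts[2::2]))
    print [k for k in info if k.startswith('key')]   # List comprehension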
  • 82. Part 3: Content Info • All documents on the internet have optional content-type, encoding, and character set information. • Let's add this information since it will make it easier for us to determine the type of files that are stored in the cache (i.e., images, movies, HTML, etc.) Copyright (C) 2007, http://www.dabeaz.com 2- 82
  • 83. HTTP Responses • The cache metadata includes an HTTP response header:
    >>> print meta['info']['response-head']
    HTTP/1.1 200 OK
    Date: Sat, 29 Sep 2007 20:51:37 GMT
    Cache-Control: private
    Vary: User-Agent
    Content-Type: text/html; charset=utf-8
    Content-Encoding: gzip
    >>>
  Note the content type, character set, and encoding. Copyright (C) 2007, http://www.dabeaz.com 2- 83
  • 84. Solution
    # Given a metadata dictionary, this function adds additional
    # fields related to the content type, charset, and encoding
    import email

    def add_content_info(meta):
        info = meta['info']
        if 'response-head' not in info:
            return
        else:
            rhead = info.get('response-head').split("\n",1)[1]
            m = email.message_from_string(rhead)
            content  = m.get_content_type()
            encoding = m.get('content-encoding',None)
            charset  = m.get_content_charset()
            meta['content-type'] = content
            meta['content-encoding'] = encoding
            meta['charset'] = charset
  Copyright (C) 2007, http://www.dabeaz.com 2- 84
  • 85. Internet Data Handling • Python has a vast assortment of internet data handling modules. email handles parsing of email messages, MIME headers, etc. Copyright (C) 2007, http://www.dabeaz.com 2- 85
  • 86. Internet Data Handling • In this code, we parse the HTTP response headers using the email module and extract content-type, encoding, and charset information. Copyright (C) 2007, http://www.dabeaz.com 2- 86
  • 87. Commentary • Python is heavily used in Internet applications • There are modules for parsing common types of data (email, HTML, XML, etc.) • There are modules for processing bits and pieces of internet data (URLs, MIME types, RFC822 headers, etc.) Copyright (C) 2007, http://www.dabeaz.com 2- 87
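  For example, the standard-library urlparse module (Python 2) splits a cache request URL into its pieces; a minimal sketch:
    import urlparse

    url = 'http://images.slashdot.org/topics/topicstorage.gif'
    parts = urlparse.urlparse(url)
    print parts.scheme    # 'http'
    print parts.netloc    # 'images.slashdot.org'
    print parts.path      # '/topics/topicstorage.gif'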
  • 88. Part 4: File Scanning • Write a function that scans a single cache file and produces a sequence of records containing all of the cache metadata. • This is just one more of our building blocks • The goal is to hide some of the nasty bits Copyright (C) 2007, http://www.dabeaz.com 2- 88
  • 89. File Scanning
    # Scan a single file in the firefox cache
    def scan_cachefile(f, blocksize):
        maxsize = 4*blocksize    # Maximum size of an entry
        f.seek(4096)             # Skip the bit-map
        while True:
            headerdata = f.read(36)
            if not headerdata:
                break
            meta = parse_meta_header(headerdata)
            if (meta['magic'] == 0x00010008 and
                meta['requestsize'] + meta['infosize'] < maxsize):
                requestdata = f.read(meta['requestsize'] +
                                     meta['infosize'])
                if parse_request_data(meta, requestdata):
                    add_content_info(meta)
                    yield meta
            # Move the file pointer to the start of the next block
            fp = f.tell()
            if (fp % blocksize):
                f.seek(blocksize - (fp % blocksize), 1)
  Copyright (C) 2007, http://www.dabeaz.com 2- 89
  • 90. Usage : File Scanning • Usage of the scan function
    >>> f = open("Cache/_CACHE_001_","rb")
    >>> for meta in scan_cachefile(f, 256):
    ...     print meta['request']
    ...
    http://www.yahoo.com/
    http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/
    http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif
    ...
  • We can just open up a cache file and write a for-loop to iterate over all of the entries. Copyright (C) 2007, http://www.dabeaz.com 2- 90
  • 91. Python File I/O • File objects are modeled after ANSI C. Files are just bytes; a file pointer keeps track of the current position:
    f.read(n)       # Read bytes
    f.tell()        # Current fp
    f.seek(n,off)   # Move fp
  Copyright (C) 2007, http://www.dabeaz.com 2- 91
  • 92. Using Earlier Code • Here we are using our header parsing functions written in previous parts. Note: we are progressively adding more data to a dictionary as it passes through parse_meta_header(), parse_request_data(), and add_content_info(). Copyright (C) 2007, http://www.dabeaz.com 2- 92
  • 93. Data Validation • The magic-number and size test is a sanity check to make sure the header data looks like a valid header. Copyright (C) 2007, http://www.dabeaz.com 2- 93
  • 94. Generating Results • We are using yield to produce data for a single cache entry. If someone uses a for-loop, they will get all of the entries. Note: this allows us to process the cache without reading all of the data into memory. Copyright (C) 2007, http://www.dabeaz.com 2- 94
  • 95. Commentary • Have created a function that can scan a single _CACHE_00n_ file and produce a sequence of dictionaries with metadata. • It's still somewhat low-level • Just need to package it a little better Copyright (C) 2007, http://www.dabeaz.com 2- 95
  • 96. Part 5 : Scan a Directory • Write a function that takes the name of a Firefox cache directory, scans all of the cache files for metadata, and produces a single sequence of records. • Make it real easy to extract data Copyright (C) 2007, http://www.dabeaz.com 2- 96
  • 97. Solution : Directory Scan
    # Given the name of a Firefox cache directory, the function
    # scans all of the _CACHE_00n_ files for metadata. A sequence
    # of dictionaries containing metadata is returned.
    import os

    def scan_cache(cachedir):
        files = [('_CACHE_001_', 256),
                 ('_CACHE_002_', 1024),
                 ('_CACHE_003_', 4096)]
        for cname, blocksize in files:
            cfile = open(os.path.join(cachedir, cname), "rb")
            for meta in scan_cachefile(cfile, blocksize):
                meta['cachedir'] = cachedir
                meta['cachefile'] = cname
                yield meta
            cfile.close()
  Copyright (C) 2007, http://www.dabeaz.com 2- 97
  • 98. Solution : Directory Scan • General idea: we loop over the three _CACHE_00n_ files and produce a sequence of the cache records. Copyright (C) 2007, http://www.dabeaz.com 2- 98
  • 99. Solution : Directory Scan • We use the low-level file scanning function here to generate a sequence of records. Copyright (C) 2007, http://www.dabeaz.com 2- 99
  • 100. More Generation • By using yield here, we are chaining together the results obtained from all three cache files into one big long sequence of results. The underlying mechanics and implementation details are hidden (the user doesn't care). Copyright (C) 2007, http://www.dabeaz.com 2-100
  • 101. Additional Data • Adding path and file information to the data (may be useful later):
    meta['cachedir'] = cachedir
    meta['cachefile'] = cname
  Copyright (C) 2007, http://www.dabeaz.com 2-101
  • 102. Usage : Cache Scan • Usage of the scan function
    >>> for meta in scan_cache("Cache/"):
    ...     print meta['request']
    ...
    http://www.yahoo.com/
    http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/
    http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif
    ...
  • Given the name of a cache directory, we can just loop over all of the metadata. Trivial! • With work, could perform various kinds of queries and processing of the data Copyright (C) 2007, http://www.dabeaz.com 2-102
  • 103. Another Example • Find all requests related to Slashdot
    >>> for meta in scan_cache("Cache/"):
    ...     if 'slashdot' in meta['request']:
    ...         print meta['request']
    ...
    http://www.slashdot.org/
    http://images.slashdot.org/topics/topiccommunications.gif
    http://images.slashdot.org/topics/topicstorage.gif
    http://images.slashdot.org/comments.css?T_2_5_0_176
    ...
  • Well, that was pretty easy. Copyright (C) 2007, http://www.dabeaz.com 2-103
  • 104. Another Example • Find all large JPEG images in the cache
    >>> jpegs = (meta for meta in scan_cache("Cache/")
    ...               if meta['content-type'] == 'image/jpeg'
    ...               and meta['datasize'] > 100000)
    >>> for j in jpegs:
    ...     print j['request']
    ...
    http://images.salon.com/ent/video_dog/comedy/2007/09/27/cereal/story.jpg
    http://images.salon.com/ent/video_dog/ifc/2007/09/28/apocalypse/story.jpg
    http://www.lakesideinns.com/images/fallroadphoto2006.jpg
    ...
    >>>
  • That was also pretty easy Copyright (C) 2007, http://www.dabeaz.com 2-104
  • 105. Part 6 : Scan Everything • Write a function that takes a list of cache directories and produces a sequence of all cache metadata found in all of them. • A single utility function that lets us query everything. Copyright (C) 2007, http://www.dabeaz.com 2-105
  • 106. Scanning Everything
    # Scan an entire list of cache directories producing
    # a sequence of records
    def scan(cachedirs):
        if isinstance(cachedirs, str):
            cachedirs = [cachedirs]
        for cdir in cachedirs:
            for meta in scan_cache(cdir):
                yield meta
  Copyright (C) 2007, http://www.dabeaz.com 2-106
  • 107. Type Checking • The isinstance() test is an example of type checking. If the argument is a string, we convert it to a list with one item. This allows the following usage:
    scan("CacheDir")
    scan(["CacheDir1","CacheDir2",...])
  Copyright (C) 2007, http://www.dabeaz.com 2-107
  • 108. Putting it all together
    # slack.py
    # Find all of those slackers who should be working
    import sys, os, ffcache

    if len(sys.argv) != 2:
        print >>sys.stderr, "Usage: python slack.py dirname"
        raise SystemExit(1)

    caches = (path for path, dirs, files in os.walk(sys.argv[1])
                   if '_CACHE_MAP_' in files)

    for meta in ffcache.scan(caches):
        if 'slashdot' in meta['request']:
            print meta['request']
            print meta['cachedir']
            print
  Copyright (C) 2007, http://www.dabeaz.com 2-108
  • 109. Intermission • Have written a simple library ffcache.py • The library takes a moderately complex data processing problem and breaks it up into pieces. • About 100 lines of code. • Now, let's build an application... Copyright (C) 2007, http://www.dabeaz.com 2-109
  • 110. Problem : CacheSpy • Big Brother (make an evil sound here) Write a program that first locates all of the Firefox cache directories under a given directory. Then have that program run forever as a network server, waiting for connections. On each connection, send back all of the current cache metadata. • Big Picture We're going to write a daemon that will find and quietly report on browser cache contents. Copyright (C) 2007, http://www.dabeaz.com 2-110
  • 111. cachespy.py
    import sys, os, pickle, SocketServer, ffcache

    SPY_PORT = 31337

    caches = [path for path, dname, files in os.walk(sys.argv[1])
                   if '_CACHE_MAP_' in files]

    def dump_cache(f):
        for meta in ffcache.scan(caches):
            pickle.dump(meta, f)

    class SpyHandler(SocketServer.BaseRequestHandler):
        def handle(self):
            f = self.request.makefile()
            dump_cache(f)
            f.close()

    SocketServer.TCPServer.allow_reuse_address = True
    serv = SocketServer.TCPServer(("", SPY_PORT), SpyHandler)
    print "CacheSpy running on port %d" % SPY_PORT
    serv.serve_forever()
  Copyright (C) 2007, http://www.dabeaz.com 2-111
  • 112. SocketServer Module • SocketServer is a module for easily creating low-level internet applications using sockets. Copyright (C) 2007, http://www.dabeaz.com 2-112
  • 113. SocketServer Handlers • You define a simple class that implements handle(). This implements the server logic. Copyright (C) 2007, http://www.dabeaz.com 2-113
• 114. SocketServer Servers

Next, you just create a Server object, hook the handler up to it, and run the server.

SocketServer.TCPServer.allow_reuse_address = True
serv = SocketServer.TCPServer(("",SPY_PORT),SpyHandler)
print "CacheSpy running on port %d" % SPY_PORT
serv.serve_forever()
Copyright (C) 2007, http://www.dabeaz.com 2-114
• 115. Data Serialization

Here, we are turning a socket into a file and dumping cache data on it. self.request is the socket corresponding to the client that connected.

def dump_cache(f):
    for meta in ffcache.scan(caches):
        pickle.dump(meta,f)

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        dump_cache(f)
        f.close()
Copyright (C) 2007, http://www.dabeaz.com 2-115
• 116. pickle Module

The pickle module takes any Python object and serializes it into a byte string. There are really only two operations:

pickle.dump(obj,f)       # Dump object
obj = pickle.load(f)     # Load object

def dump_cache(f):
    for meta in ffcache.scan(caches):
        pickle.dump(meta,f)
Copyright (C) 2007, http://www.dabeaz.com 2-116
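A minimal sketch of the pickle round-trip, independent of the cache code (the file name and dictionary are made up for illustration):

import pickle

meta = {'request': 'http://www.foo.com/', 'datasize': 12345}

f = open("meta.p", "wb")
pickle.dump(meta, f)            # serialize the dictionary
f.close()

f = open("meta.p", "rb")
restored = pickle.load(f)       # read an equivalent object back
f.close()

print restored == meta          # True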
• 117. Running our Server
• Example:
% python cachespy.py /Users
CacheSpy running on port 31337
• Server is just sitting there waiting
• You can try connecting with telnet:
% telnet localhost 31337
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
(dp0
S'info'
p1
... bunch of cryptic data ...
Copyright (C) 2007, http://www.dabeaz.com 2-117
  • 118. Problem : CacheMon • The Evil Overlord (make a more evil sound) Write a program cachemon.py that contains a function for retrieving the cache contents from a remote machine. • Big Picture Writing network clients. Programs that make outgoing connections to internet services. Copyright (C) 2007, http://www.dabeaz.com 2-118
• 119. cachemon.py
# cachemon.py
import pickle, socket

def scan_remote_cache(host):
    s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
    s.connect(host)
    f = s.makefile()
    try:
        while True:
            meta = pickle.load(f)
            meta['host'] = host      # Add host to metadata
            yield meta
    except EOFError:
        pass
    f.close()
    s.close()
Copyright (C) 2007, http://www.dabeaz.com 2-119
• 120. Solution : Socket Module

The socket module provides direct access to the low-level socket API:

s = socket(addr,type)
s.connect(host)
s.bind(addr)
s.listen(n)
s.accept()
s.recv(n)
s.send(data)
...

# cachemon.py
import pickle, socket

def scan_remote_cache(host):
    s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
    s.connect(host)
    ...
Copyright (C) 2007, http://www.dabeaz.com 2-120
• 121. Unpickling a Sequence

Here we use pickle to repeatedly load objects off of the socket, and yield to generate a sequence of received objects.

    try:
        while True:
            meta = pickle.load(f)
            meta['host'] = host      # Add host to metadata
            yield meta
    except EOFError:
        pass
Copyright (C) 2007, http://www.dabeaz.com 2-121
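The same load-until-EOFError loop can be exercised without a socket. A small standalone sketch, using StringIO in place of the socket file (the three records are made up):

import pickle
from StringIO import StringIO

# build a fake "socket file" holding three pickled records
buf = StringIO()
for obj in [{'n': 1}, {'n': 2}, {'n': 3}]:
    pickle.dump(obj, buf)
buf.seek(0)

def load_all(f):
    # yield pickled objects until the stream runs dry
    try:
        while True:
            yield pickle.load(f)
    except EOFError:
        pass

for obj in load_all(buf):
    print obj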
• 122. Example Usage
• Example: Find all JPEG images > 100K on a remote machine
>>> rcache = scan_remote_cache(("localhost",31337))
>>> jpegs = (meta for meta in rcache
...            if meta['content-type'] == 'image/jpeg'
...            and meta['datasize'] > 100000)
>>> for j in jpegs:
...     print j['request']
...
http://images.salon.com/ent/video_dog/comedy/2007/09/27/cereal/story.jpg
http://images.salon.com/ent/video_dog/ifc/2007/09/28/apocalypse/story.jpg
http://www.lakesideinns.com/images/fallroadphoto2006.jpg
...
• This looks almost identical to the old code
Copyright (C) 2007, http://www.dabeaz.com 2-122
• 123. Code Similarity
• A Remote Scan
rcache = scan_remote_cache(("localhost",31337))
jpegs = (meta for meta in rcache
              if meta['content-type'] == 'image/jpeg'
              and meta['datasize'] > 100000)
for j in jpegs:
    print j['request']
• A Local Scan
cache = ffcache.scan(cachedirs)
jpegs = (meta for meta in cache
              if meta['content-type'] == 'image/jpeg'
              and meta['datasize'] > 100000)
for j in jpegs:
    print j['request']
Copyright (C) 2007, http://www.dabeaz.com 2-123
• 124. Big Picture

cachespy.py:
    for meta in ffcache.scan(dirs):
        pickle.dump(meta,f)

        |  (socket)
        v

cachemon.py:
    while True:
        meta = pickle.load(f)
        yield meta

    for meta in scan_remote_cache(host):
        # ...
Copyright (C) 2007, http://www.dabeaz.com 2-124
  • 125. Problem : Clusters • Scan a whole cluster of machines Write a function that can easily scan the caches of an entire collection of remote hosts. • Big Picture Collecting data from a group of machines on the network. Copyright (C) 2007, http://www.dabeaz.com 2-125
• 126. cachemon.py
# cachemon.py
...
def scan_cluster(hostlist):
    for host in hostlist:
        try:
            for meta in scan_remote_cache(host):
                yield meta
        except (EnvironmentError,socket.error):
            pass

A bit of exception handling to deal with dead machines and other problems (would probably need to be expanded).
Copyright (C) 2007, http://www.dabeaz.com 2-126
• 127. Example Usage
• Example: Find all JPEG images > 100K on a set of remote machines
>>> hosts = [('host1',31337),('host2',31337),...]
>>> rcaches = scan_cluster(hosts)
>>> jpegs = (meta for meta in rcaches
...            if meta['content-type'] == 'image/jpeg'
...            and meta['datasize'] > 100000)
>>> for j in jpegs:
...     print j['request']
...
...
• Think about the abstraction of "iteration" here. The query code is exactly the same.
Copyright (C) 2007, http://www.dabeaz.com 2-127
  • 128. Problem : Concurrency • Collect data from a large set of machines In the last section, the scan_cluster() function retrieves data from one machine at a time. However, a world-wide quasi-evil organization is likely to have at least several dozen machines. • Your task Modify the scanner so that it can manage concurrent client connections, reading data from multiple sources at once. Copyright (C) 2007, http://www.dabeaz.com 2-128
• 129. Concurrency • Python provides full support for threads • They are real threads (pthreads, system threads, etc.) • However, a lock within the Python interpreter (the Global Interpreter Lock) prevents concurrency across more than one CPU. Copyright (C) 2007, http://www.dabeaz.com 2-129
• 130. Programming with Threads • The threading module provides a Thread object. • A variety of synchronization primitives are provided (Locks, Semaphores, Condition Variables, Events, etc.) • Can program very traditional kinds of threaded programs (multiple threads, lots of locking, race conditions, horrible debugging, etc.). Copyright (C) 2007, http://www.dabeaz.com 2-130
  • 131. Threads with Queues • One technique for thread programming is to have independent threads that share data via thread-safe message queues. • Variations of "producer-consumer" problems. • Will use this in our solution. Keep in mind, it's not the only way to program threads. Copyright (C) 2007, http://www.dabeaz.com 2-131
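A minimal producer-consumer sketch along these lines, before the tutorial's real version (the function name and item values here are made up for illustration):

import threading, Queue

def producer(q):
    for i in range(5):
        q.put(i)          # hand work to the consumer
    q.put(None)           # sentinel: no more data

q = Queue.Queue()
threading.Thread(target=producer, args=(q,)).start()

# consumer: drain the queue until the sentinel appears
while True:
    item = q.get()
    if item is None:
        break
    print "got", item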
• 132. A Cache Scanning Thread
# cachemon.py
...
import threading

class ScanThread(threading.Thread):
    def __init__(self,host,msg_q):
        threading.Thread.__init__(self)
        self.host = host
        self.msg_q = msg_q
    def run(self):
        for meta in scan_remote_cache(self.host):
            self.msg_q.put(meta)
Copyright (C) 2007, http://www.dabeaz.com 2-132
• 133. threading Module

The threading module contains most functionality related to threads.

import threading

class ScanThread(threading.Thread):
    ...
Copyright (C) 2007, http://www.dabeaz.com 2-133
• 134. Thread Base Class

Threads are defined by inheriting from the Thread base class.

class ScanThread(threading.Thread):
    ...
Copyright (C) 2007, http://www.dabeaz.com 2-134
• 135. Thread Initialization

__init__() performs initialization and setup.

    def __init__(self,host,msg_q):
        threading.Thread.__init__(self)
        self.host = host
        self.msg_q = msg_q
Copyright (C) 2007, http://www.dabeaz.com 2-135
• 136. Thread Execution

The run() method contains the code that executes in the thread. Here, the thread performs a scan of a single host.

    def run(self):
        for meta in scan_remote_cache(self.host):
            self.msg_q.put(meta)
Copyright (C) 2007, http://www.dabeaz.com 2-136
• 137. Launching a Thread
• You create a thread object and start it:
t1 = ScanThread(("host1",31337),msg_q)
t1.start()
t2 = ScanThread(("host2",31337),msg_q)
t2.start()
• .start() starts the thread and calls .run()
Copyright (C) 2007, http://www.dabeaz.com 2-137
• 138. Thread Safe Queues
• The Queue module provides a thread-safe queue:
import Queue
msg_q = Queue.Queue()
• Queue insertion:
msg_q.put(obj)
• Queue removal:
obj = msg_q.get()
• A queue can be shared by as many threads as you want without worrying about locking.
Copyright (C) 2007, http://www.dabeaz.com 2-138
• 139. Use of a Queue Object

msg_q is a Queue object where incoming objects are placed. run() gets data from the remote machine and puts it into the queue.

class ScanThread(threading.Thread):
    def __init__(self,host,msg_q):
        threading.Thread.__init__(self)
        self.host = host
        self.msg_q = msg_q
    def run(self):
        for meta in scan_remote_cache(self.host):
            self.msg_q.put(meta)
Copyright (C) 2007, http://www.dabeaz.com 2-139
• 140. Primitive Use of a Queue
• You first create a queue, then launch the threads that insert data into it:
msg_q = Queue.Queue()
t1 = ScanThread(("host1",31337),msg_q)
t1.start()
t2 = ScanThread(("host2",31337),msg_q)
t2.start()
while True:
    meta = msg_q.get()    # Get metadata
Copyright (C) 2007, http://www.dabeaz.com 2-140
• 141. Monitor Architecture
[Diagram: each remote Host feeds a socket to its own Thread inside the Monitor; every thread .put()s metadata into a shared msg_q, and a Consumer .get()s items off the queue.]
Copyright (C) 2007, http://www.dabeaz.com 2-141
• 142. Concurrent Monitor
import threading, Queue

def concurrent_scan(hostlist, msg_q):
    thr_list = []
    for host in hostlist:
        thr = ScanThread(host,msg_q)
        thr.start()
        thr_list.append(thr)
    for thr in thr_list:
        thr.join()
    msg_q.put(None)        # Sentinel

def scan_cluster(hostlist):
    msg_q = Queue.Queue()
    threading.Thread(target=concurrent_scan,
                     args=(hostlist,msg_q)).start()
    while True:
        meta = msg_q.get()
        if meta:
            yield meta
        else:
            break
Copyright (C) 2007, http://www.dabeaz.com 2-142
• 143. Launching Threads

The function below runs in its own thread and launches the ScanThreads. It then waits for those threads to terminate by joining with them. After all threads have terminated, a sentinel is dropped in the queue.

def concurrent_scan(hostlist, msg_q):
    thr_list = []
    for host in hostlist:
        thr = ScanThread(host,msg_q)
        thr.start()
        thr_list.append(thr)
    for thr in thr_list:
        thr.join()
    msg_q.put(None)        # Sentinel
Copyright (C) 2007, http://www.dabeaz.com 2-143
• 144. Collecting Results

The function below creates a Queue and launches a thread that starts all of the scanning threads. It then produces a sequence of cache data until the sentinel (None) is pulled off of the queue.

def scan_cluster(hostlist):
    msg_q = Queue.Queue()
    threading.Thread(target=concurrent_scan,
                     args=(hostlist,msg_q)).start()
    while True:
        meta = msg_q.get()
        if meta:
            yield meta
        else:
            break
Copyright (C) 2007, http://www.dabeaz.com 2-144
• 145. More on Threads • There are many more issues in thread programming than we can cover here. • All of the usual issues concerning locking, synchronization, event handling, and race conditions apply to Python. • Because of the global interpreter lock, threads are generally not a way to achieve higher performance. Copyright (C) 2007, http://www.dabeaz.com 2-145
• 146. Thread Synchronization
• The threading module has various primitives:
Lock()          # Mutex Lock
RLock()         # Reentrant Mutex Lock
Semaphore(n)    # Semaphore
• Example use:
x = value          # Some kind of shared object
x_lock = Lock()    # A lock associated with x
...
x_lock.acquire()
# Modify or do something with x (critical section)
...
x_lock.release()
Copyright (C) 2007, http://www.dabeaz.com 2-146
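A short sketch of a lock in action, protecting a shared counter incremented by several threads (the counts are arbitrary):

import threading

counter = 0
counter_lock = threading.Lock()

def increment(n):
    global counter
    for i in xrange(n):
        counter_lock.acquire()
        counter += 1              # critical section
        counter_lock.release()

threads = [threading.Thread(target=increment, args=(100000,))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print counter                     # reliably 400000 with the lock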
• 147. Story so Far • Wrote a module ffcache.py that parses the contents of caches (~100 lines) • Wrote cachespy.py, which allows cache data to be retrieved by a remote client (~25 lines) • Wrote a concurrent monitor for collecting that data (~50 lines) Copyright (C) 2007, http://www.dabeaz.com 2-147
  • 148. A subtle observation • In none of our programs have we read the entire contents of any Firefox cache into memory. • In cachespy.py, the contents are read iteratively and piped through a socket (not stored in memory). • In cachemon.py, contents are received and routed through message queues. Processed iteratively (no temporary lists of results). Copyright (C) 2007, http://www.dabeaz.com 2-148
  • 149. Another Observation • For every connection, cachespy sends the entire contents of the Firefox cache metadata back to the monitor. • Given that caches are ~50 MB by default, this could result in large network traffic. • Question: Given that we're normally performing queries on the data, could we do any of this work on the remote machines? Copyright (C) 2007, http://www.dabeaz.com 2-149
  • 150. Remote Filtering • Distribute the work Modify the cachespy program so that some of the query work can be performed remotely on each of the machines. Only send back a subset of the data to the monitor program. • Big Picture Distributed computation. Massive security nightmare. Copyright (C) 2007, http://www.dabeaz.com 2-150
• 151. The idea
• Modify scan_cluster() and all related functions to accept an optional filter specification. Pass this on to the remote machine and use it to process the data remotely before returning results.
filter = """
if meta['content-type'] == 'image/jpeg'
   and meta['datasize'] > 100000
"""
rcaches = scan_cluster(hostlist,filter)
Copyright (C) 2007, http://www.dabeaz.com 2-151
• 152. Changes to the Monitor

Add a filter parameter, and send the filter to the remote host right after connecting.

# cachemon.py
def scan_remote_cache(host,filter=""):
    s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
    s.connect(host)
    f = s.makefile()
    pickle.dump(filter,f)
    f.flush()
    try:
        while True:
            meta = pickle.load(f)
            meta['host'] = host
            yield meta
    except EOFError:
        pass
Copyright (C) 2007, http://www.dabeaz.com 2-152
• 153. Changes to the Monitor

The filter is added to the thread data.

# cachemon.py
...
class ScanThread(threading.Thread):
    def __init__(self,host,msg_q,filter=""):
        threading.Thread.__init__(self)
        self.host = host
        self.msg_q = msg_q
        self.filter = filter
    def run(self):
        try:
            for meta in scan_remote_cache(self.host,self.filter):
                self.msg_q.put(meta)
        except (EnvironmentError,socket.error):
            pass
Copyright (C) 2007, http://www.dabeaz.com 2-153
• 154. Changes to the Monitor

The filter is passed along to thread creation.

def concurrent_scan(hostlist, msg_q, filter):
    thr_list = []
    for host in hostlist:
        thr = ScanThread(host,msg_q,filter)
        thr.start()
        thr_list.append(thr)
    for thr in thr_list:
        thr.join()
    msg_q.put(None)        # Sentinel
Copyright (C) 2007, http://www.dabeaz.com 2-154
• 155. Changes to the Monitor

A filter argument is added here as well.

# cachemon.py
...
def scan_cluster(hostlist,filter=""):
    msg_q = Queue.Queue()
    threading.Thread(target=concurrent_scan,
                     args=(hostlist,msg_q,filter)).start()
    while True:
        meta = msg_q.get()
        if not meta: break
        yield meta
Copyright (C) 2007, http://www.dabeaz.com 2-155
• 156. Commentary • Have modified the cache monitor program to accept a filter string and to pass that string to remote clients upon connecting. • Next: how the filter is used in the spy server. Copyright (C) 2007, http://www.dabeaz.com 2-156
• 157. Changes to CacheSpy
# cachespy.py
...
def dump_cache(f,filter):
    values = """(meta for meta in ffcache.scan(caches)
                 %s)""" % filter
    try:
        for meta in eval(values):
            pickle.dump(meta,f)
    except:
        pickle.dump({'error' : traceback.format_exc()},f)
Copyright (C) 2007, http://www.dabeaz.com 2-157
• 158. Changes to CacheSpy

The filter is added and used to create an expression string. For example:

filter = "if meta['datasize'] > 100000"

produces:

values = """(meta for meta in ffcache.scan(caches)
             if meta['datasize'] > 100000)"""
Copyright (C) 2007, http://www.dabeaz.com 2-158
• 159. Eval()

eval(s) evaluates s as a Python expression. There is also a bit of error handling: the traceback module creates stack traces for exceptions.

def dump_cache(f,filter):
    values = """(meta for meta in ffcache.scan(caches)
                 %s)""" % filter
    try:
        for meta in eval(values):
            pickle.dump(meta,f)
    except:
        pickle.dump({'error' : traceback.format_exc()},f)
Copyright (C) 2007, http://www.dabeaz.com 2-159
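A sketch of the same eval() trick run against in-memory sample data (the records list here is hypothetical); it also illustrates why letting clients send code to be eval()'d is the "massive security nightmare" mentioned earlier:

records = [{'request': 'a', 'datasize': 5000},
           {'request': 'b', 'datasize': 200000}]

filter = "if meta['datasize'] > 100000"
values = "(meta for meta in records %s)" % filter

for meta in eval(values):
    print meta                 # only the large record

# eval() will happily run any expression a client sends,
# so a real deployment would need to restrict or sandbox it.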
• 160. Changes to the Server

The handler now gets the filter from the monitor before dumping the cache.

# cachespy.py
...
class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        filter = pickle.load(f)     # Get filter from the monitor
        dump_cache(f,filter)
        f.close()
Copyright (C) 2007, http://www.dabeaz.com 2-160
• 161. Putting it all Together
• A remote query to find slackers:
# Find all of those slashdot slackers
import cachemon
hosts = [('host1',31337),('host2',31337),
         ('host3',31337),...]
filter = "if 'slashdot' in meta['request']"
rcaches = cachemon.scan_cluster(hosts,filter)
for meta in rcaches:
    print meta['request']
    print meta['host'],meta['cachedir']
    print
Copyright (C) 2007, http://www.dabeaz.com 2-161
  • 162. Putting it all Together • Queries run remotely on all the hosts • Only data of interest is sent back • No temporary lists or large data structures • Concurrent execution on monitor • Concurrency is hidden from user Copyright (C) 2007, http://www.dabeaz.com 2-162
• 163. The Power of Iteration
• Loop over all entries in a cache file:
for meta in scan_cache_file(f,256):
    ...
• Loop over all entries in a cache directory:
for meta in scan_cache(dirname):
    ...
• Loop over all cache entries on a remote host:
for meta in scan_remote_cache(host):
    ...
• Loop over all cache entries on many hosts:
for meta in scan_cluster(hostlist):
    ...
Copyright (C) 2007, http://www.dabeaz.com 2-163
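Because every layer yields the same kind of dictionary, queries compose as chained generator expressions. A small sketch, with sample records standing in for a real cache scan:

# hypothetical records in place of scan_cache()/scan_cluster()
cache = iter([
    {'content-type': 'image/jpeg', 'datasize': 150000},
    {'content-type': 'text/html',  'datasize': 900},
])

jpegs = (meta for meta in cache
              if meta['content-type'] == 'image/jpeg')
big   = (meta for meta in jpegs
              if meta['datasize'] > 100000)

for meta in big:
    print meta['datasize']     # 150000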
  • 164. Wrapping Up • A lot of material has been presented • Again, the goal was to do something interesting with Python, not to be just a reference manual. • This is only a small taste of what's possible • And it's only a small taste of why people like programming in Python Copyright (C) 2007, http://www.dabeaz.com 2-164
  • 165. Other Python Examples • Python makes many annoying tasks relatively easy. • Will end by showing very simple examples of other modules. Copyright (C) 2007, http://www.dabeaz.com 2-165
• 166. Fetching a Web Page
• The urllib and urllib2 modules:
import urllib
w = urllib.urlopen("http://www.foo.com")
for line in w:
    # ...

page = urllib.urlopen("http://www.foo.com").read()
• Additional options support uploading of form values, cookies, passwords, proxies, etc.
Copyright (C) 2007, http://www.dabeaz.com 2-166
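A sketch of the form-upload case (the URL and form fields are made up); in urllib, passing a second argument to urlopen() issues a POST:

import urllib

params = urllib.urlencode({'name': 'Dave', 'id': 42})
w = urllib.urlopen("http://www.foo.com/form", params)   # POST
print w.read()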
• 167. A Web Server with CGI
• Serve files and allow CGI scripts:
from BaseHTTPServer import HTTPServer
from CGIHTTPServer import CGIHTTPRequestHandler
import os
os.chdir("/home/docs/html")
serv = HTTPServer(("",8080),CGIHTTPRequestHandler)
serv.serve_forever()
• Can easily throw up a server with just a few lines of Python code.
Copyright (C) 2007, http://www.dabeaz.com 2-167
• 168. A Custom HTTP Server
• The BaseHTTPServer module:
from BaseHTTPServer import BaseHTTPRequestHandler,HTTPServer

class MyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        ...
    def do_POST(self):
        ...
    def do_HEAD(self):
        ...
    def do_PUT(self):
        ...

serv = HTTPServer(("",8080),MyHandler)
serv.serve_forever()
• Could use this to put a web server in an application
Copyright (C) 2007, http://www.dabeaz.com 2-168
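As a sketch of what one of those elided methods might look like, here is a minimal do_GET() that just echoes the requested path back (the response text is made up for illustration):

from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer

class MyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # send a plain-text response describing the request
        self.send_response(200)
        self.send_header("Content-type", "text/plain")
        self.end_headers()
        self.wfile.write("You asked for %s\n" % self.path)

serv = HTTPServer(("", 8080), MyHandler)
serv.serve_forever()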
• 169. XML-RPC Server/Client
• How to create a stand-alone server:
from SimpleXMLRPCServer import SimpleXMLRPCServer
def add(x,y):
    return x+y
s = SimpleXMLRPCServer(("",8080))
s.register_function(add)
s.serve_forever()
• How to test it (xmlrpclib):
>>> import xmlrpclib
>>> s = xmlrpclib.ServerProxy("http://localhost:8080")
>>> s.add(3,5)
8
>>> s.add("Hello","World")
"HelloWorld"
>>>
Copyright (C) 2007, http://www.dabeaz.com 2-169
  • 170. Where to go from here? • Network/Internet programming. Python has a large user base developing network applications, web frameworks, and internet data handling tools. • C/C++ extension building. Python is easily extended with C/C++ code. Can use Python as a high-level control application for existing systems software. Copyright (C) 2007, http://www.dabeaz.com 2-170
  • 171. Where to go from here? • GUI programming. There are several major GUI packages for Python (Tkinter, wxPython, PyQT, etc.). • Jython and IronPython. Implementations of the Python interpreter for Java and .NET. Copyright (C) 2007, http://www.dabeaz.com 2-171
  • 172. Where to go from here? • Everything Pythonic: http://www.python.org • Get involved. PyCon'2008 (Chicago) • Have an on-site course (shameless plug) http://www.dabeaz.com/python.html Copyright (C) 2007, http://www.dabeaz.com 2-172
  • 173. Thanks for Listening! • Hope you got something out of the class • Please give me feedback! Copyright (C) 2007, http://www.dabeaz.com 2-173