Python in Action (Part 2)

Official tutorial slides from USENIX LISA, Nov. 16, 2007.

  1. Python in Action
     (Part II - Systems Programming)
     Presented at USENIX LISA Conference, November 16, 2007
     David M. Beazley
     http://www.dabeaz.com
     Copyright (C) 2007, http://www.dabeaz.com
  2. Section Overview
     • In this section, we're going to get dirty
     • Systems Programming
       • Files, I/O, file-system
       • Text parsing, data decoding
       • Processes and IPC
       • Networking
       • Threads and concurrency
  3. Commentary
     • I personally think Python is a fantastic tool for systems programming.
     • Modules provide access to most of the major system libraries I used to access via C
     • No enforcement of "morality"
     • Decent performance
     • It just "works" and it's fun
  4. Approach
     • I've thought long and hard about how I would present this part of the class.
     • A reference manual approach would probably be long and very boring.
     • So instead, we're going to focus on building something more in tune with the times
  5. "To Catch a Slacker"
     • Write a collection of Python programs that can quietly monitor Firefox browser caches to find out who has been spending their day reading Slashdot instead of working on their TPS reports.
     • Oh yeah, and be a real sneaky bugger about it.
  6. Why this Problem?
     • Involves a real-world system and data
     • Firefox already installed on your machine (?)
     • Cross platform (Linux, Mac, Windows)
     • Example of tool building
     • Related to a variety of practical problems
     • A good tour of "Python in Action"
  7. Disclaimers
     • I am not involved in browser forensics (or spyware for that matter).
     • I am in no way affiliated with Firefox/Mozilla nor have I ever seen Firefox source code
     • I have never worked with the cache data prior to preparing this tutorial
     • I have never used any third-party tools for looking at this data.
  8. More Disclaimers
     • All of the code in this tutorial works with a standard Python installation
     • No third party modules.
     • All code is cross-platform
     • Code samples are available online at http://www.dabeaz.com/action/
     • Please look at that code and follow along
  9. Assumptions
     • This is not a tutorial on systems concepts
     • You should be generally familiar with background material (files, filesystems, file formats, processes, threads, networking, protocols, etc.)
     • Hopefully you can "extrapolate" from the material presented here to construct more advanced Python applications.
  10. The Big Picture
      • We want to write a tool that allows someone to locate, inspect, and perform queries across a distributed collection of Firefox caches.
      • For example, the cache directories on all machines on the LAN of a quasi-evil corporation.
  11. The Firefox Cache
      • The Firefox browser keeps a disk cache of recently visited sites
        % ls Cache/
        -rw------- 1 beazley  111169 Sep 25 17:15 01CC0844d01
        -rw------- 1 beazley  104991 Sep 25 17:15 01CC3844d01
        -rw------- 1 beazley   47233 Sep 24 16:41 021F221Ad01
        ...
        -rw------- 1 beazley   26749 Sep 21 11:19 FF8AEDF0d01
        -rw------- 1 beazley   58172 Sep 25 18:16 FFE628C6d01
        -rw------- 1 beazley 1939456 Sep 25 19:14 _CACHE_001_
        -rw------- 1 beazley 2588672 Sep 25 19:14 _CACHE_002_
        -rw------- 1 beazley 4567040 Sep 25 18:44 _CACHE_003_
        -rw------- 1 beazley   33044 Sep 23 21:58 _CACHE_MAP_
      • A bunch of cryptically named files.
  12. Problem : Finding Files
      • Find the Firefox cache
        Write a program findcache.py that takes a directory name as input and recursively scans that directory and all subdirectories looking for Firefox/Mozilla cache directories.
      • Example:
        % python findcache.py /Users/beazley
        /Users/beazley/Library/.../qs1ab616.default/Cache
        /Users/beazley/Library/.../wxuoyiuf.slt/Cache
        %
      • Use case: Searching for things on the filesystem.
  13. findcache.py
      # findcache.py
      # Recursively scan a directory looking for
      # Firefox/Mozilla cache directories
      import sys
      import os

      if len(sys.argv) != 2:
          print >>sys.stderr, "Usage: python findcache.py dirname"
          raise SystemExit(1)

      caches = (path for path, dirs, files in os.walk(sys.argv[1])
                     if '_CACHE_MAP_' in files)

      for name in caches:
          print name
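The scan on this slide can be packaged as a reusable function. Here is a minimal sketch, written in modern Python 3 (the slides use Python 2); the `find_caches` helper name and the throwaway directory tree are made up for illustration:

```python
import os
import tempfile

def find_caches(topdir):
    # Yield every directory under topdir whose file list
    # contains a _CACHE_MAP_ file (same test as findcache.py)
    return (path for path, dirs, files in os.walk(topdir)
                 if '_CACHE_MAP_' in files)

# Build a throwaway tree with one fake cache directory in it
top = tempfile.mkdtemp()
cachedir = os.path.join(top, 'profile', 'Cache')
os.makedirs(cachedir)
open(os.path.join(cachedir, '_CACHE_MAP_'), 'w').close()

found = list(find_caches(top))
```

Because `find_caches` returns a generator, the caller decides when (and whether) the directory walk actually happens.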
  14. The sys module
      • The sys module has basic information related to the execution environment.
        sys.argv                             # A list of the command line options
        sys.stdin, sys.stdout, sys.stderr    # Standard I/O files
      • For the example command above:
        sys.argv = ['findcache.py', '/Users/beazley']
  15. Program Termination
      • SystemExit exception
        Forces Python to exit. Value is the return code.
        if len(sys.argv) != 2:
            print >>sys.stderr, "Usage: python findcache.py dirname"
            raise SystemExit(1)
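SystemExit behaves like any other exception, which makes an argument check like the one above easy to exercise without actually terminating the interpreter. A minimal sketch (Python 3 syntax; the `main` wrapper is a made-up name, not part of the slides):

```python
def main(argv):
    # Same check as findcache.py: wrong argument count -> exit code 1
    if len(argv) != 2:
        raise SystemExit(1)
    return 0

# A caller (or a test) can catch SystemExit and inspect the exit code
try:
    status = main(['findcache.py'])      # dirname argument missing
except SystemExit as e:
    status = e.code
```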
  16. os Module
      • os module
        Contains useful OS related functions (files, processes, etc.)
        import os
  17. os.walk()
      • os.walk(topdir)
        Recursively walks a directory tree and generates a sequence of tuples (path,dirs,files)
          path  = The current directory name
          dirs  = List of all subdirectory names in path
          files = List of all regular files (data) in path
      • Used in findcache.py as:
        caches = (path for path, dirs, files in os.walk(sys.argv[1])
                       if '_CACHE_MAP_' in files)
  18. A Sequence of Caches
      caches = (path for path, dirs, files in os.walk(sys.argv[1])
                     if '_CACHE_MAP_' in files)
      • This statement generates a sequence of directory names where '_CACHE_MAP_' is contained in the file list.
        path                       # The directory name that is generated as a result
        '_CACHE_MAP_' in files     # File name check
  19. Printing the Result
      for name in caches:
          print name
      • This prints the sequence of cache directories that are generated by the previous statement.
  20. Commentary
      • Our solution is strongly based on a "declarative" programming style (again)
      • We simply write out a sequence of operations that produce what we want
      • Not focused on the underlying mechanics of how to traverse all of the directories.
  21. Big Idea : Iteration
      • Python allows iteration to be captured as a kind of object.
        caches = (path for path, dirs, files in os.walk(sys.argv[1])
                       if '_CACHE_MAP_' in files)
      • This de-couples iteration from the code that uses the iteration
        for name in caches:
            print name
      • Another usage example:
        for name in caches:
            print len(os.listdir(name)), name
  22. Big Idea : Iteration
      • Compare to this:
        for path, dirs, files in os.walk(sys.argv[1]):
            if '_CACHE_MAP_' in files:
                print len(os.listdir(path)), path
      • This code is simple, but the loop and the code that executes in the loop body are coupled together
      • Not as flexible, but this is somewhat subtle to wrap your brain around at first.
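The decoupling idea can be seen without any filesystem at all: a generator expression captures the traversal-and-filter logic once, and different consumers drive it later. A small self-contained sketch (plain lists stand in for os.walk; the data is made up):

```python
# Capture the selection logic once as an iteration object...
numbers = [3, 14, 15, 9, 26, 5]
evens = (n for n in numbers if n % 2 == 0)

# ...and consume it later with code that knows nothing about the filter
collected = list(evens)

# The same expression can feed a completely different consumer
total = sum(n for n in numbers if n % 2 == 0)
```

Note that a generator is exhausted after one pass, which is why the second consumer builds a fresh one.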
  23. Mini-Reference : sys, os
      • sys module
        sys.argv         # List of command line options
        sys.stdin        # Standard input
        sys.stdout       # Standard output
        sys.stderr       # Standard error
        sys.executable   # Full path of Python executable
        sys.exc_info()   # Information on current exception
      • os module
        os.walk(dir)     # Recursively walk dir producing a
                         # sequence of tuples (path,dlist,flist)
        os.listdir(dir)  # Return a list of all files in dir
      • SystemExit exception
        raise SystemExit(n)   # Exit with integer code n
  24. Problem: Searching for Text
      • Extract all URL requests from the cache
        Write a program requests.py that scans the contents of the _CACHE_00n_ files and prints a list of URLs for documents stored in the cache.
      • Example:
        % python requests.py /Users/.../qs1ab616.default/Cache
        http://www.yahoo.com/
        http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/js/ad_eo_1.1.j
        http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif
        http://us.i1.yimg.com/us.yimg.com/i/ww/thm/1/search_1.1.png
        ...
        %
      • Use case: Searching the contents of files for text patterns.
  25. The Firefox Cache
      • The cache directory holds two types of data
        • Metadata (URLs, headers, etc.)
        • Raw data (HTML, JPEG, PNG, etc.)
      • This data is stored in two places
        • Cryptic files in the Cache directory
        • Blocks inside the _CACHE_00n_ files
      • Metadata almost always in _CACHE_00n_
  26. Possible Solution : Regex
      • The _CACHE_00n_ files are encoded in a binary format, but URLs are embedded inside as null-terminated text:
        \x00\x01\x00\x08\x92\x00\x02\x18\x00\x00\x00\x13F\xff\x9f
        \xceF\xff\x9f\xce\x00\x00\x00\x00\x00\x00H)\x00\x00\x00\x1a
        \x00\x00\x023HTTP:http://slashdot.org/\x00request-method\x00
        GET\x00request-User-Agent\x00Mozilla/5.0 (Macintosh; U; Intel
        Mac OS X; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7\x00
        request-Accept-Encoding\x00gzip,deflate\x00response-head\x00
        HTTP/1.1 200 OK\r\nDate: Sun, 30 Sep 2007 13:07:29 GMT\r\n
        Server: Apache/1.3.37 (Unix) mod_perl/1.29\r\nSLASH_LOG_DATA:
        shtml\r\nX-Powered-By: Slash 2.005000176\r\nX-Fry: How can I
        live my life if I can't tell good from evil?\r\nCache-Control:
      • Maybe the requests could just be ripped using a regular expression.
  27. A Regex Solution
      # requests.py
      import re
      import os
      import sys

      cachedir   = sys.argv[1]
      cachefiles = [ '_CACHE_001_', '_CACHE_002_', '_CACHE_003_' ]

      # A regex for embedded URL strings
      request_pat = re.compile(r'([a-z]+://.*?)\x00')

      # Loop over all files and search for URLs
      for name in cachefiles:
          data = open(os.path.join(cachedir,name),"rb").read()
          index = 0
          while True:
              m = request_pat.search(data,index)
              if not m: break
              print m.group(1)
              index = m.end()
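The scanning loop above can be exercised without a real cache directory. This sketch runs the same search loop against a synthetic blob (made up for illustration); it is written for Python 3, where data read in "rb" mode is bytes and so the pattern must be a bytes regex:

```python
import re

# Synthetic stand-in for a _CACHE_00n_ fragment: URLs appear
# as null-terminated text embedded in binary data
data = (b'\x00\x01\x08HTTP:http://slashdot.org/\x00request-method\x00'
        b'GET\x00HTTP:http://www.yahoo.com/\x00junk\xff\xfe')

request_pat = re.compile(rb'([a-z]+://.*?)\x00')

# Find matches one at a time, resuming each search
# at the end of the previous match
urls = []
index = 0
while True:
    m = request_pat.search(data, index)
    if not m:
        break
    urls.append(m.group(1))
    index = m.end()
```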
  28. The re module
      • re module
        Contains all functionality related to regular expression pattern matching, searching, replacing, etc.
        import re
      • Features are strongly influenced by Perl, but regexes are not directly integrated into the Python language.
  29. Using re
      • Patterns are first specified as strings and compiled into a regex object.
        pat = re.compile(pattern [,flags])
      • In requests.py:
        request_pat = re.compile(r'([a-z]+://.*?)\x00')
      • The pattern syntax is "standard"
        pat*        pat1|pat2
        pat+        [chars]
        pat?        [^chars]
        (pat)       pat{n}
        .           pat{n,m}
  30. Using re
      • All subsequent operations are methods of the compiled regex pattern
        m = pat.match(data [,start])    # Check for match
        m = pat.search(data [,start])   # Search for match
        newdata = pat.sub(repl, data)   # Pattern replace
  31. Searching for Matches
      • pat.search(text [,start])
        Searches the string text for the first occurrence of the regex pattern starting at position start. Returns a "MatchObject" if a match is found.
      • In the code below, we're finding matches one at a time:
        index = 0
        while True:
            m = request_pat.search(data,index)
            if not m: break
            print m.group(1)
            index = m.end()
  32. Match Objects
      • Regex matches are represented by a MatchObject
        m.group([n])   # Text matched by group n
        m.start([n])   # Starting index of group n
        m.end([n])     # End index of group n
      • In the loop:
        print m.group(1)    # The matching text for just the URL
        index = m.end()     # The end of the match
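The MatchObject accessors are easy to check on a small string. A minimal sketch (the sample text and pattern are made up; Python 3 syntax):

```python
import re

pat = re.compile(r'([a-z]+)://(\S+)')
text = 'fetched http://example.com/x today'

m = pat.search(text)
scheme = m.group(1)            # text matched by the first () group
rest = m.group(2)              # text matched by the second () group
span = (m.start(), m.end())    # where the whole match sits in text
```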
  33. Groups
      • In patterns, parentheses () define groups which are numbered left to right.
        group 0    # The entire pattern
        group 1    # Text in first () group
        group 2    # Text in next () group
        ...
      • In requests.py, m.group(1) extracts just the URL text captured by the parentheses:
        request_pat = re.compile(r'([a-z]+://.*?)\x00')
        ...
        print m.group(1)
34. Mini-Reference : re

• re pattern compilation

    pat = re.compile(r'patternstring')

• Pattern syntax

    literal    # Match literal text
    pat*       # Match 0 or more repetitions of pat
    pat+       # Match 1 or more repetitions of pat
    pat?       # Match 0 or 1 repetitions of pat
    pat1|pat2  # Match pat1 or pat2
    (pat)      # Match pat (group)
    [chars]    # Match characters in chars
    [^chars]   # Match characters not in chars
    .          # Match any character except \n
    \d         # Match any digit
    \w         # Match alphanumeric character
    \s         # Match whitespace
35. Mini-Reference : re

• Common pattern operations

    pat.search(text)     # Search text for a match
    pat.match(text)      # Search start of text for match
    pat.sub(repl,text)   # Replace pattern with repl

• Match objects

    m.group([n])   # Text matched by group n
    m.start([n])   # Starting position of group n
    m.end([n])     # Ending position of group n

• How to loop over all matches of a pattern

    for m in pat.finditer(text):
        # m is a MatchObject that you process
        ...
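For instance, finditer() can replace the manual search()/index loop from requests.py. A minimal sketch (in Python 3 syntax, unlike the deck's Python 2, and using a made-up sample string rather than real cache data):

```python
import re

# Same URL regex as in requests.py; assumes URLs are
# null-terminated, as in the Firefox cache files.
request_pat = re.compile(r'([a-z]+://.*?)\x00')

sample = 'http://www.yahoo.com/\x00junk\x00ftp://example.org/file\x00'

# finditer() yields one MatchObject per non-overlapping match,
# so no explicit index bookkeeping is needed.
urls = [m.group(1) for m in request_pat.finditer(sample)]
print(urls)
```

This produces the same sequence of URLs as the explicit while-loop, in far fewer lines.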
36. Mini-Reference : re

• An example of pattern replacement

    # This replaces American dates of the form 'mm/dd/yyyy'
    # with European dates of the form 'dd/mm/yyyy'.

    # This function takes a MatchObject as input and returns
    # replacement text as output.
    def euro_date(m):
        month = m.group(1)
        day   = m.group(2)
        year  = m.group(3)
        return "%s/%s/%s" % (day,month,year)

    # Date re pattern and replacement operation
    datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
    newdata = datepat.sub(euro_date,text)
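A runnable check of this substitution (Python 3 print syntax; the sample text is invented, and %s formatting is used since m.group() returns strings):

```python
import re

def euro_date(m):
    # Reorder mm/dd/yyyy groups into dd/mm/yyyy
    month, day, year = m.group(1), m.group(2), m.group(3)
    return "%s/%s/%s" % (day, month, year)

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
text = "Released on 09/25/2007, updated 10/01/2007."

# sub() calls euro_date() once per match and splices
# the returned text into the result
newdata = datepat.sub(euro_date, text)
print(newdata)
```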
37. Mini-Reference : re

• There are many more features of the re module
• Strongly influenced by Perl (feature set)
• Regexes are a library in Python, not integrated into the language
• A book on regular expressions may be essential for advanced use
38. File Handling

What is going on in this statement in requests.py?

    data = open(os.path.join(cachedir,name),"rb").read()
39. os.path module

os.path has portable file-related functions:

    os.path.join(name1,name2,...)  # Join path names
    os.path.getsize(filename)      # Get the file size
    os.path.getmtime(filename)     # Get modification date

There are many more functions, but this is the preferred module for basic filename handling.
40. os.path.join()

Creates a fully-expanded pathname:

    dirname = '/foo/bar'
    filename = 'name'
    os.path.join(dirname,filename)   # '/foo/bar/name'

Aware of platform differences ('/' vs. '\')
41. Mini-Reference : os.path

    os.path.join(s1,s2,...)   # Join pathname parts together
    os.path.getsize(path)     # Get file size of path
    os.path.getmtime(path)    # Get modify time of path
    os.path.getatime(path)    # Get access time of path
    os.path.getctime(path)    # Get creation time of path
    os.path.exists(path)      # Check if path exists
    os.path.isfile(path)      # Check if regular file
    os.path.isdir(path)       # Check if directory
    os.path.islink(path)      # Check if symbolic link
    os.path.basename(path)    # Return file part of path
    os.path.dirname(path)     # Return dir part of path
    os.path.abspath(path)     # Get absolute path
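A quick illustration of a few of these (Python 3; the path is hypothetical and assumes Unix-style '/' separators — on Windows, join() uses '\' instead):

```python
import os.path

# Hypothetical path; basename()/dirname() are pure string
# operations and never touch the filesystem.
p = os.path.join('/foo/bar', 'cache', '_CACHE_001_')

print(p)                     # the joined pathname
print(os.path.basename(p))   # file part
print(os.path.dirname(p))    # directory part
```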
42. Binary I/O

For all binary files, use modes "rb", "wb", etc.:

    data = open(os.path.join(cachedir,name),"rb").read()

This disables newline translation (critical on Windows).
43. Common I/O Shortcuts

    # Read an entire file into a string
    data = open(filename).read()

    # Write a string out to a file
    open(filename,"w").write(text)

    # Loop over all lines in a file
    for line in open(filename):
        ...
44. Commentary on Solution

• This regex approach is mostly a hack for this particular application
• Reads entire cache files into memory as strings (may be quite large)
• Only finds URLs, no other metadata
• Some risk of false positives since URLs could also be embedded in data

45. Commentary

• We have started to build a collection of very simple command line tools
• Very much in the "Unix tradition"
• Python makes it easy to create such tools
• More complex applications could be assembled by simply gluing scripts together

46. Working with Processes

• It is common to write programs that run other programs, collect their output, etc.
    • Pipes
    • Interprocess communication
• Python has a variety of modules for supporting this

47. subprocess Module

• A module for creating and interacting with subprocesses
• Consolidates a number of low-level OS functions such as system(), execv(), spawnv(), pipe(), popen2(), etc. into a single module
• Cross platform (Unix/Windows)

48. Example : Slackers

• Find slacker cache entries. Using the programs findcache.py and requests.py as subprocesses, write a program that inspects cache directories and prints out all entries that contain the word 'slashdot' in the URL.
49. slackers.py

    # slackers.py
    import sys
    import subprocess

    # Run findcache.py as a subprocess
    finder = subprocess.Popen(
        [sys.executable,"findcache.py",sys.argv[1]],
        stdout=subprocess.PIPE)
    dirlist = [line.strip() for line in finder.stdout]

    # Run requests.py as a subprocess
    for cachedir in dirlist:
        searcher = subprocess.Popen(
            [sys.executable,"requests.py",cachedir],
            stdout=subprocess.PIPE)
        for line in searcher.stdout:
            if 'slashdot' in line:
                print line,
50. Launching a subprocess

This is launching a Python script as a subprocess, connecting its stdout stream to a pipe:

    finder = subprocess.Popen(
        [sys.executable,"findcache.py",sys.argv[1]],
        stdout=subprocess.PIPE)

Collecting the output, with newline stripping:

    dirlist = [line.strip() for line in finder.stdout]
51. Python Executable

sys.executable is the full pathname of the python interpreter:

    finder = subprocess.Popen(
        [sys.executable,"findcache.py",sys.argv[1]],
        stdout=subprocess.PIPE)
52. Subprocess Arguments

The list of arguments to the subprocess corresponds to what would appear on a shell command line:

    searcher = subprocess.Popen(
        [sys.executable,"requests.py",cachedir],
        stdout=subprocess.PIPE)
53. slackers.py

More of the same idea. For each directory we found in the last step, we run requests.py to produce requests:

    for cachedir in dirlist:
        searcher = subprocess.Popen(
            [sys.executable,"requests.py",cachedir],
            stdout=subprocess.PIPE)
        for line in searcher.stdout:
            if 'slashdot' in line:
                print line,
54. Commentary

• subprocess is a large module with many options.
• However, it takes care of a lot of annoying platform-specific details for you.
• It is currently the "recommended" way of dealing with subprocesses.
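A minimal, runnable sketch of the Popen pattern used in slackers.py (Python 3 syntax, so pipe output arrives as bytes and must be decoded; the child script here is an invented one-liner rather than findcache.py):

```python
import sys
import subprocess

# Run a child Python process and read its stdout through a pipe,
# just as slackers.py does with findcache.py and requests.py.
child = subprocess.Popen(
    [sys.executable, "-c", "print('hello from the child')"],
    stdout=subprocess.PIPE)

# Iterating over child.stdout yields one line at a time (bytes
# in Python 3, hence the decode)
lines = [line.decode().strip() for line in child.stdout]
child.wait()
print(lines)
```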
55. Low Level Subprocesses

• Running a simple system command

    os.system("shell command")

• Connecting to a subprocess with pipes

    pout, pin = popen2.popen2("shell command")

• Exec/spawn

    os.execv(), os.execl(), os.execle(), ...
    os.spawnv(), os.spawnl(), os.spawnle(), ...

• Unix fork()

    os.fork(), os.wait(), os.waitpid(), os._exit(), ...
56. Interactive Processes

• Python does not have built-in support for controlling interactive subprocesses (e.g., "Expect")
• Must install third party modules for this
• Example: pexpect
    http://pexpect.sourceforge.net

57. Commentary

• Writing small Unix-like utilities is fairly straightforward in Python
• Support for standard kinds of operations (files, regular expressions, pipes, subprocesses, etc.)
• However, our solution is also kind of clunky
    • Only returns some information
    • Not particularly memory efficient (reads large files into memory)

58. Interlude

• Python is well-suited to building libraries and frameworks.
• In the next part, we're going to take a totally different approach than simply writing simple utilities.
• We will build libraries for manipulating cache data and use those libraries to build tools.
59. Problem : Parsing Data

• Extract the cache data (for real). Write a module ffcache.py that contains a set of functions for reading Firefox cache data into useful data structures that can be used by other programs. Capture all available information including URLs, timestamps, sizes, locations, content types, etc.
• Use case: Blood and guts. Writing programs that can process foreign file formats. Processing binary encoded data. Creating code for later reuse.
60. The Firefox Cache

• There are four critical files:

    _CACHE_MAP_   # Cache index
    _CACHE_001_   # Cache data
    _CACHE_002_   # Cache data
    _CACHE_003_   # Cache data

• All files are binary-encoded
• _CACHE_MAP_ is used by Firefox to locate data, but it is not updated until Firefox exits
• We will ignore _CACHE_MAP_ since we want to observe caches of live Firefox sessions
61. Firefox _CACHE_ Files

• _CACHE_00n_ file organization:

    Free/used block bitmap   4096 bytes
    Blocks                   Up to 32768 blocks

• The block size varies according to the file:

    _CACHE_001_   256 byte blocks
    _CACHE_002_   1024 byte blocks
    _CACHE_003_   4096 byte blocks
62. Cache Entries

• Each cache entry:
    • A maximum of 4 cache blocks
    • Can either be data or metadata
    • If >16K, written to a file instead
• Notice how all the "cryptic" files are >16K:

    -rw------- beazley 111169 Sep 25 17:15 01CC0844d01
    -rw------- beazley 104991 Sep 25 17:15 01CC3844d01
    -rw------- beazley  47233 Sep 24 16:41 021F221Ad01
    ...
    -rw------- beazley  26749 Sep 21 11:19 FF8AEDF0d01
    -rw------- beazley  58172 Sep 25 18:16 FFE628C6d01
63. Cache Metadata

• Metadata is encoded as a binary structure:

    Header           36 bytes
    Request String   Variable length (in header)
    Request Info     Variable length (in header)

• Header encoding (binary, big-endian):

    0-3    magic         unsigned int (0x00010008)
    4-7    location      unsigned int
    8-11   fetchcount    unsigned int
    12-15  fetchtime     unsigned int (system time)
    16-19  modifytime    unsigned int (system time)
    20-23  expiretime    unsigned int (system time)
    24-27  datasize      unsigned int (byte count)
    28-31  requestsize   unsigned int (byte count)
    32-35  infosize      unsigned int (byte count)
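The header layout above can be sanity-checked with a struct round trip (Python 3; the field values below are invented sample numbers, not real cache data):

```python
import struct

# Pack nine big-endian unsigned ints in the order the
# metadata header defines them (magic first).
fields = (0x00010008, 2449473536, 3, 1190829792,
          1190829792, 0, 29448, 27, 531)
headerdata = struct.pack(">9I", *fields)
assert len(headerdata) == 36      # 9 fields x 4 bytes each

# Unpacking with the same format recovers the same values
magic, location, fetchcount = struct.unpack(">9I", headerdata)[:3]
print(hex(magic))
```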
64. Solution Outline

• Part 1: Parsing metadata headers
• Part 2: Getting request information (URL)
• Part 3: Extracting additional content info
• Part 4: Scanning of individual cache files
• Part 5: Scanning an entire directory
• Part 6: Scanning a list of directories

65. Part I - Reading Headers

• Write a function that can parse the binary metadata header and return the data in a useful format
66. Reading Headers

    import struct

    # This function parses a cache metadata header into a dict
    # of named fields (listed in _headernames below)
    _headernames = ['magic','location','fetchcount',
                    'fetchtime','modifytime','expiretime',
                    'datasize','requestsize','infosize']

    def parse_meta_header(headerdata):
        head = struct.unpack(">9I",headerdata)
        meta = dict(zip(_headernames,head))
        return meta
67. Reading Headers

• How this is supposed to work:

    >>> f = open("Cache/_CACHE_001_","rb")
    >>> f.seek(4096)              # Skip the bit map
    >>> headerdata = f.read(36)   # Read 36 byte header
    >>> meta = parse_meta_header(headerdata)
    >>> meta
    {'fetchtime': 1190829792, 'requestsize': 27, 'magic': 65544,
     'fetchcount': 3, 'expiretime': 0, 'location': 2449473536L,
     'modifytime': 1190829792, 'datasize': 29448, 'infosize': 531}
    >>>

• Basically, we're parsing the header into a useful Python data structure (a dictionary)
68. struct module

The struct module parses binary-encoded data into Python objects. You would use this module to pack/unpack raw binary data from Python strings:

    def parse_meta_header(headerdata):
        head = struct.unpack(">9I",headerdata)
        meta = dict(zip(_headernames,head))
        return meta

Here, ">9I" unpacks 9 unsigned 32-bit big-endian integers.
69. struct module

The result of struct.unpack() is always a tuple of converted values:

    head = struct.unpack(">9I",headerdata)

    head = (65544, 0, 1, 1191682051, 1191682051, 0, 8645, 190, 218)
70. Dictionary Creation

zip(s1,s2) makes a list of tuples:

    zip(_headernames,head)
    [('magic',head[0]),
     ('location',head[1]),
     ('fetchcount',head[2]),
     ...
    ]

dict() then makes a dictionary from those tuples:

    meta = dict(zip(_headernames,head))
71. Commentary

• Dictionaries as data structures:

    meta = {
        'fetchtime'   : 1190829792,
        'requestsize' : 27,
        'magic'       : 65544,
        'fetchcount'  : 3,
        'expiretime'  : 0,
        'location'    : 2449473536L,
        'modifytime'  : 1190829792,
        'datasize'    : 29448,
        'infosize'    : 531
    }

• Useful if data has many parts:

    data = f.read(meta[8])            # Huh?!?
    vs.
    data = f.read(meta['infosize'])   # Better
72. Mini-reference : struct

• struct module

    items = struct.unpack(fmt,data)
    data  = struct.pack(fmt,item1,...,itemn)

• Sample format codes

    'c'  char (1 byte string)
    'b'  signed char (8-bit integer)
    'B'  unsigned char (8-bit integer)
    'h'  signed short (16-bit integer)
    'H'  unsigned short (16-bit integer)
    'i'  int (32-bit integer)
    'I'  unsigned int (32-bit integer)
    'f'  32-bit single precision float
    'd'  64-bit double precision float
    's'  char s[] (String)
    '>'  Big endian modifier
    '<'  Little endian modifier
    '!'  Network order modifier
    'n'  Repetition count modifier
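A tiny demonstration of the endianness modifiers (Python 3, where pack() returns bytes):

```python
import struct

# The same 16-bit value packs into a different byte order
# depending on the modifier.
big = struct.pack(">H", 0x0102)     # most significant byte first
little = struct.pack("<H", 0x0102)  # least significant byte first
print(big, little)
```

The cache code uses ">" throughout because the Firefox metadata header is stored big-endian.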
73. Part 2 : Parsing Requests

• Write a function that will read the URL request string and request information
• Request String : a null-terminated string
• Request Info : a sequence of null-terminated key-value pairs (like a dictionary)
74. Parsing Requests

    import re

    part_pat = re.compile(r'[\n\r -~]*$')

    def parse_request_data(meta,requestdata):
        parts = requestdata.split('\x00')
        for part in parts:
            if not part_pat.match(part):
                return False
        request = parts[0]
        if len(request) != (meta['requestsize'] - 1):
            return False
        info = dict(zip(parts[1::2],parts[2::2]))
        meta['request'] = request.split(':',1)[1]
        meta['info'] = info
        return True
75. Usage : Requests

• Usage of the function:

    >>> f = open("Cache/_CACHE_001_","rb")
    >>> f.seek(4096)              # Skip the bit map
    >>> headerdata = f.read(36)   # Read 36 byte header
    >>> meta = parse_meta_header(headerdata)
    >>> requestdata = f.read(meta['requestsize']+meta['infosize'])
    >>> parse_request_data(meta,requestdata)
    True
    >>> meta['request']
    'http://www.yahoo.com/'
    >>> meta['info']
    {'request-method': 'GET', 'request-User-Agent': 'Mozilla/5.0
    (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.7) Gecko/20070914
    Firefox/2.0.0.7', 'charset': 'UTF-8', 'response-head': 'HTTP/1.1
    200 OK\r\nDate: Wed, 26 Sep 2007 18:03:17 ...' }
    >>>
76. String Stripping

The request data is a sequence of null-terminated strings. This splits the data up into parts:

    parts = requestdata.split('\x00')

    requestdata = 'part\x00part\x00part\x00part\x00...'
    parts = ['part','part','part','part',...]
77. String Validation

Individual parts are printable characters except for newline characters ('\n\r'). We use the re module to match each string:

    part_pat = re.compile(r'[\n\r -~]*$')
    ...
    for part in parts:
        if not part_pat.match(part):
            return False

This would help catch cases where we might be reading bad data (false headers, raw data, etc.).
78. URL Request String

The request string is the first part. The check that follows makes sure it's the right size (a further sanity check on the data integrity):

    request = parts[0]
    if len(request) != (meta['requestsize'] - 1):
        return False
79. Request Info

Each request has a set of associated data represented as key/value pairs:

    parts = ['request','key','val','key','val','key','val']
    parts[1::2]                    # ['key','key','key']
    parts[2::2]                    # ['val','val','val']
    zip(parts[1::2],parts[2::2])   # [('key','val'),
                                   #  ('key','val'),
                                   #  ('key','val')]

dict() makes a dictionary from the (key,val) tuples:

    info = dict(zip(parts[1::2],parts[2::2]))
80. Fixing the Request

Cleaning up the request string:

    request = "HTTP:http://www.google.com"
    request.split(':',1)      # ['HTTP','http://www.google.com']
    request.split(':',1)[1]   # 'http://www.google.com'

    # Given a dictionary of header information and a file,
    # this function extracts the request data from a cache
    # metadata entry and saves it in the dictionary. Returns
    # True or False depending on success.
    def read_request_data(header,f):
        request = f.read(header['requestsize']).strip('\x00')
        infodata = f.read(header['infosize']).strip('\x00')
        # Validate request and infodata here (nothing now)
        # Turn the infodata into a dictionary
        parts = infodata.split('\x00')
        info = dict(zip(parts[::2],parts[1::2]))
        header['request'] = request.split(':',1)[1]
        header['info'] = info
        return True
81. Commentary

• Emphasize that Python has very powerful list manipulation primitives:
    • Indexing
    • Slicing
    • List comprehensions
    • Etc.
• Knowing how to use these leads to rapid development and compact code
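As a small illustration of the slicing-and-zip idiom used in parse_request_data() (Python 3; the parts list is a made-up stand-in for real request data):

```python
# Rebuild a dict from an alternating key/value sequence,
# skipping the leading request string at index 0.
parts = ['request', 'request-method', 'GET', 'charset', 'UTF-8']

keys = parts[1::2]     # every 2nd item starting at index 1
vals = parts[2::2]     # every 2nd item starting at index 2
info = dict(zip(keys, vals))
print(info)
```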
82. Part 3: Content Info

• All documents on the internet have optional content-type, encoding, and character set information.
• Let's add this information since it will make it easier for us to determine the type of files that are stored in the cache (i.e., images, movies, HTML, etc.)
83. HTTP Responses

• The cache metadata includes an HTTP response header:

    >>> print meta['info']['response-head']
    HTTP/1.1 200 OK
    Date: Sat, 29 Sep 2007 20:51:37 GMT
    Cache-Control: private
    Vary: User-Agent
    Content-Type: text/html; charset=utf-8
    Content-Encoding: gzip
    >>>

• Note the content type, character set, and encoding.
84. Solution

    # Given a metadata dictionary, this function adds additional
    # fields related to the content type, charset, and encoding
    import email

    def add_content_info(meta):
        info = meta['info']
        if 'response-head' not in info:
            return
        else:
            rhead = info.get('response-head').split("\n",1)[1]
            m = email.message_from_string(rhead)
            content  = m.get_content_type()
            encoding = m.get('content-encoding',None)
            charset  = m.get_content_charset()
            meta['content-type'] = content
            meta['content-encoding'] = encoding
            meta['charset'] = charset
85. Internet Data Handling

Python has a vast assortment of internet data handling modules. The email module handles parsing of email messages, MIME headers, etc.:

    import email
    ...
    m = email.message_from_string(rhead)
86. Internet Data Handling

In this code, we parse the HTTP response headers using the email module and extract content-type, encoding, and charset information:

    rhead = info.get('response-head').split("\n",1)[1]
    m = email.message_from_string(rhead)
    content  = m.get_content_type()
    encoding = m.get('content-encoding',None)
    charset  = m.get_content_charset()
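A self-contained sketch of that parsing step (Python 3; the response header text is invented but shaped like the cached 'response-head' data, with the HTTP status line already removed):

```python
import email

# Hypothetical HTTP response headers; the email module parses
# any RFC822-style "Name: value" header block.
rhead = ("Date: Sat, 29 Sep 2007 20:51:37 GMT\r\n"
         "Content-Type: text/html; charset=utf-8\r\n"
         "Content-Encoding: gzip\r\n")

m = email.message_from_string(rhead)
print(m.get_content_type())            # MIME type
print(m.get_content_charset())         # charset parameter
print(m.get('content-encoding', None)) # plain header lookup
```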
87. Commentary

• Python is heavily used in Internet applications
• There are modules for parsing common types of data (email, HTML, XML, etc.)
• There are modules for processing bits and pieces of internet data (URLs, MIME types, RFC822 headers, etc.)

88. Part 4: File Scanning

• Write a function that scans a single cache file and produces a sequence of records containing all of the cache metadata.
• This is just one more of our building blocks
• The goal is to hide some of the nasty bits
89. File Scanning

    # Scan a single file in the firefox cache
    def scan_cachefile(f,blocksize):
        maxsize = 4*blocksize   # Maximum size of an entry
        f.seek(4096)            # Skip the bit-map
        while True:
            headerdata = f.read(36)
            if not headerdata: break
            meta = parse_meta_header(headerdata)
            if (meta['magic'] == 0x00010008 and
                meta['requestsize'] + meta['infosize'] < maxsize):
                requestdata = f.read(meta['requestsize']+
                                     meta['infosize'])
                if parse_request_data(meta,requestdata):
                    add_content_info(meta)
                    yield meta
            # Move the file pointer to the start of the next block
            fp = f.tell()
            if (fp % blocksize):
                f.seek(blocksize - (fp % blocksize),1)
90. Usage : File Scanning

• Usage of the scan function:

    >>> f = open("Cache/_CACHE_001_","rb")
    >>> for meta in scan_cachefile(f,256):
    ...     print meta['request']
    ...
    http://www.yahoo.com/
    http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/
    http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif
    ...

• We can just open up a cache file and write a for-loop to iterate over all of the entries.
91. Python File I/O

File objects are modeled after ANSI C. Files are just bytes; a file pointer keeps track of the current position:

    f.read()      # Read bytes
    f.tell()      # Current fp
    f.seek(n,off) # Move fp

In scan_cachefile(), the file pointer is moved to the start of the next block after each entry:

    fp = f.tell()
    if (fp % blocksize):
        f.seek(blocksize - (fp % blocksize),1)
92. Using Earlier Code

Here we are using our header parsing functions written in previous parts:

    meta = parse_meta_header(headerdata)
    ...
    if parse_request_data(meta,requestdata):
        add_content_info(meta)

Note: we are progressively adding more data to a dictionary.
93. Data Validation

This is a sanity check to make sure the header data looks like a valid header:

    if (meta['magic'] == 0x00010008 and
        meta['requestsize'] + meta['infosize'] < maxsize):
94. Generating Results

We are using yield to produce data for a single cache entry. If someone uses a for-loop, they will get all of the entries:

    yield meta

Note: this allows us to process the cache without reading all of the data into memory.
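The yield-based scanning idea can be sketched in miniature (Python 3; the 4-byte record format and sample data are invented, not the real cache layout):

```python
import io
import struct

# A toy version of the scanner: read fixed-size records from a
# file object one at a time, yielding each as it is decoded,
# without ever loading the whole file into memory.
def scan_records(f, recsize):
    while True:
        chunk = f.read(recsize)
        if len(chunk) < recsize:
            break
        yield struct.unpack(">I", chunk)[0]

# An in-memory "file" of three 4-byte big-endian records
data = io.BytesIO(struct.pack(">3I", 10, 20, 30))
values = list(scan_records(data, 4))
print(values)
```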
95. Commentary

• We have created a function that can scan a single _CACHE_00n_ file and produce a sequence of dictionaries with metadata.
• It's still somewhat low-level
• We just need to package it a little better

96. Part 5 : Scan a Directory

• Write a function that takes the name of a Firefox cache directory, scans all of the cache files for metadata, and produces a single sequence of records.
• Make it real easy to extract data
97. Solution : Directory Scan

    # Given the name of a Firefox cache directory, this function
    # scans all of the _CACHE_00n_ files for metadata. A sequence
    # of dictionaries containing metadata is returned.
    import os

    def scan_cache(cachedir):
        files = [('_CACHE_001_',256),
                 ('_CACHE_002_',1024),
                 ('_CACHE_003_',4096)]
        for cname,blocksize in files:
            cfile = open(os.path.join(cachedir,cname),"rb")
            for meta in scan_cachefile(cfile,blocksize):
                meta['cachedir'] = cachedir
                meta['cachefile'] = cname
                yield meta
            cfile.close()
98. Solution : Directory Scan

General idea: we loop over the three _CACHE_00n_ files and produce a sequence of the cache records.

import os
def scan_cache(cachedir):
    files = [('_CACHE_001_',256),
             ('_CACHE_002_',1024),
             ('_CACHE_003_',4096)]
    for cname,blocksize in files:
        cfile = open(os.path.join(cachedir,cname),"rb")
        for meta in scan_cachefile(cfile,blocksize):
            meta['cachedir'] = cachedir
            meta['cachefile'] = cname
            yield meta
        cfile.close()
99. Solution : Directory Scan

We use the low-level file scanning function here to generate a sequence of records.

import os
def scan_cache(cachedir):
    files = [('_CACHE_001_',256),
             ('_CACHE_002_',1024),
             ('_CACHE_003_',4096)]
    for cname,blocksize in files:
        cfile = open(os.path.join(cachedir,cname),"rb")
        for meta in scan_cachefile(cfile,blocksize):
            meta['cachedir'] = cachedir
            meta['cachefile'] = cname
            yield meta
        cfile.close()
100. More Generation

By using yield here, we are chaining together the results obtained from all three cache files into one big long sequence of results. The underlying mechanics and implementation details are hidden (the user doesn't care).

import os
def scan_cache(cachedir):
    files = [('_CACHE_001_',256),
             ('_CACHE_002_',1024),
             ('_CACHE_003_',4096)]
    for cname,blocksize in files:
        cfile = open(os.path.join(cachedir,cname),"rb")
        for meta in scan_cachefile(cfile,blocksize):
            meta['cachedir'] = cachedir
            meta['cachefile'] = cname
            yield meta
        cfile.close()
101. Additional Data

Adding path and file information to the data (may be useful later).

import os
def scan_cache(cachedir):
    files = [('_CACHE_001_',256),
             ('_CACHE_002_',1024),
             ('_CACHE_003_',4096)]
    for cname,blocksize in files:
        cfile = open(os.path.join(cachedir,cname),"rb")
        for meta in scan_cachefile(cfile,blocksize):
            meta['cachedir'] = cachedir
            meta['cachefile'] = cname
            yield meta
        cfile.close()
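The chaining trick on these slides (an outer loop whose inner for/yield splices several sub-sequences into one stream) can be demonstrated without the cache files. A minimal sketch with invented stand-ins for scan_cachefile(), written for Python 3; itertools.chain.from_iterable is shown as an equivalent formulation:

```python
from itertools import chain

def scan_part(name, count):
    # Stand-in for scan_cachefile(): emit a few records
    # tagged with the "file" they came from (illustrative only)
    for i in range(count):
        yield {"file": name, "n": i}

def scan_all():
    # The slides' pattern: the inner for/yield chains each
    # sub-sequence into one big long sequence of results
    for name, count in [("_CACHE_001_", 2), ("_CACHE_002_", 1)]:
        for rec in scan_part(name, count):
            yield rec

records = list(scan_all())

# itertools.chain.from_iterable expresses the same chaining
same = list(chain.from_iterable(
    scan_part(n, c) for n, c in [("_CACHE_001_", 2),
                                 ("_CACHE_002_", 1)]))
```

Either spelling hides the per-file mechanics from the caller, who just sees one iterable of records.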
102. Usage : Cache Scan

• Usage of the scan function:

>>> for meta in scan_cache("Cache/"):
...     print meta['request']
...
http://www.yahoo.com/
http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/
http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif
...

• Given the name of a cache directory, we can just loop over all of the metadata. Trivial!
• With a little work, we could perform various kinds of queries and processing of the data
103. Another Example

• Find all requests related to Slashdot:

>>> for meta in scan_cache("Cache/"):
...     if 'slashdot' in meta['request']:
...         print meta['request']
...
http://www.slashdot.org/
http://images.slashdot.org/topics/topiccommunications.gif
http://images.slashdot.org/topics/topicstorage.gif
http://images.slashdot.org/comments.css?T_2_5_0_176
...

• Well, that was pretty easy.
104. Another Example

• Find all large JPEG images in the cache:

>>> jpegs = (meta for meta in scan_cache("Cache/")
...          if meta['content-type'] == 'image/jpeg'
...          and meta['datasize'] > 100000)
>>> for j in jpegs:
...     print j['request']
...
http://images.salon.com/ent/video_dog/comedy/2007/09/27/cereal/story.jpg
http://images.salon.com/ent/video_dog/ifc/2007/09/28/apocalypse/story.jpg
http://www.lakesideinns.com/images/fallroadphoto2006.jpg
...
>>>

• That was also pretty easy
105. Part 6 : Scan Everything

• Write a function that takes a list of cache directories and produces a sequence of all cache metadata found in all of them.
• A single utility function that lets us query everything.
106. Scanning Everything

# Scan an entire list of cache directories, producing
# a sequence of records

def scan(cachedirs):
    if isinstance(cachedirs,str):
        cachedirs = [cachedirs]
    for cdir in cachedirs:
        for meta in scan_cache(cdir):
            yield meta
107. Type Checking

This bit of code is an example of type checking. If the argument is a string, we convert it to a list with one item. This allows the following usage:

scan("CacheDir")
scan(["CacheDir1","CacheDir2",...])

def scan(cachedirs):
    if isinstance(cachedirs,str):
        cachedirs = [cachedirs]
    for cdir in cachedirs:
        for meta in scan_cache(cdir):
            yield meta
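The normalize-to-a-list idiom on this slide is easy to test on its own. A minimal sketch (Python 3; the yielded values stand in for scan_cache() records):

```python
def scan(cachedirs):
    # Type check: a bare string becomes a one-item list, so
    # callers may pass either a single directory name or a list
    if isinstance(cachedirs, str):
        cachedirs = [cachedirs]
    for cdir in cachedirs:
        yield cdir          # stand-in for yielding scan_cache(cdir) records

single = list(scan("CacheDir"))
many   = list(scan(["CacheDir1", "CacheDir2"]))
```

Note that on Python 2, as used in this tutorial, such a check would usually test against basestring so that unicode directory names are accepted as well.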
108. Putting it all together

# slack.py
# Find all of those slackers who should be working

import sys, os, ffcache

if len(sys.argv) != 2:
    print >>sys.stderr,"Usage: python slack.py dirname"
    raise SystemExit(1)

caches = (path for path,dirs,files in os.walk(sys.argv[1])
               if '_CACHE_MAP_' in files)

for meta in ffcache.scan(caches):
    if 'slashdot' in meta['request']:
        print meta['request']
        print meta['cachedir']
        print
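The generator expression over os.walk() is what finds the cache directories: it keeps every path whose file list contains the _CACHE_MAP_ marker. A self-contained sketch (Python 3) that builds a tiny throwaway directory tree and applies the same pattern:

```python
import os
import tempfile

# Build a small hypothetical tree: one directory holds the
# _CACHE_MAP_ marker file, a sibling directory does not
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "profile", "Cache"))
os.makedirs(os.path.join(root, "profile", "other"))
open(os.path.join(root, "profile", "Cache", "_CACHE_MAP_"), "w").close()

# The slide's pattern: os.walk() yields (path, dirs, files)
# triples; keep only directories containing the marker file
caches = [path for path, dirs, files in os.walk(root)
               if '_CACHE_MAP_' in files]
```

Only the directory containing _CACHE_MAP_ survives the filter, no matter how deep it sits in the tree.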
109. Intermission

• We have written a simple library, ffcache.py
• The library takes a moderately complex data processing problem and breaks it up into pieces.
• About 100 lines of code.
• Now, let's build an application...
110. Problem : CacheSpy

• Big Brother (make an evil sound here): Write a program that first locates all of the Firefox cache directories under a given directory. Then have that program run forever as a network server, waiting for connections. On each connection, send back all of the current cache metadata.
• Big Picture: We're going to write a daemon that will find and quietly report on browser cache contents.
111. cachespy.py

import sys, os, pickle, SocketServer, ffcache
SPY_PORT = 31337

caches = [path for path,dname,files in os.walk(sys.argv[1])
               if '_CACHE_MAP_' in files]

def dump_cache(f):
    for meta in ffcache.scan(caches):
        pickle.dump(meta,f)

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        dump_cache(f)
        f.close()

SocketServer.TCPServer.allow_reuse_address = True
serv = SocketServer.TCPServer(("",SPY_PORT),SpyHandler)
print "CacheSpy running on port %d" % SPY_PORT
serv.serve_forever()
112. SocketServer Module

SocketServer: a module for easily creating low-level internet applications using sockets.

import sys, os, pickle, SocketServer, ffcache
SPY_PORT = 31337

caches = [path for path,dname,files in os.walk(sys.argv[1])
               if '_CACHE_MAP_' in files]

def dump_cache(f):
    for meta in ffcache.scan(caches):
        pickle.dump(meta,f)

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        dump_cache(f)
        f.close()

SocketServer.TCPServer.allow_reuse_address = True
serv = SocketServer.TCPServer(("",SPY_PORT),SpyHandler)
print "CacheSpy running on port %d" % SPY_PORT
serv.serve_forever()
113. SocketServer Handlers

You define a simple class that implements handle(). This implements the server logic.

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        dump_cache(f)
        f.close()
114. SocketServer Servers

Next, you just create a Server object, hook the handler up to it, and run the server.

SocketServer.TCPServer.allow_reuse_address = True
serv = SocketServer.TCPServer(("",SPY_PORT),SpyHandler)
print "CacheSpy running on port %d" % SPY_PORT
serv.serve_forever()
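The whole handler/server arrangement can be exercised end to end on the loopback interface. A minimal sketch (not from the slides) written for Python 3, where the SocketServer module is renamed socketserver; an in-memory RECORDS list stands in for ffcache.scan(), and a hypothetical client unpickles records until the server closes the connection:

```python
import pickle
import socket
import socketserver
import threading

# Stand-in for the cache metadata that ffcache.scan() would produce
RECORDS = [{"request": "http://www.example.com/", "datasize": 10},
           {"request": "http://www.slashdot.org/", "datasize": 20}]

class SpyHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # Pickle needs a binary file object in Python 3
        f = self.request.makefile("wb")
        for meta in RECORDS:
            pickle.dump(meta, f)
        f.close()

# Port 0 asks the OS for any free port (avoids clashes in a demo)
serv = socketserver.TCPServer(("127.0.0.1", 0), SpyHandler)
port = serv.server_address[1]
threading.Thread(target=serv.serve_forever, daemon=True).start()

# Client side: connect and unpickle records until EOF
s = socket.create_connection(("127.0.0.1", port))
f = s.makefile("rb")
received = []
while True:
    try:
        received.append(pickle.load(f))
    except EOFError:
        break
f.close()
s.close()
serv.shutdown()
serv.server_close()
```

Each pickle.dump() call appends one record to the stream, so the client can simply keep calling pickle.load() until EOFError signals that the server is done. (Pickle should only be exchanged between trusted endpoints; unpickling untrusted data can execute arbitrary code.)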