Accessing File-Specific Attributes on Steroids

Dinu C. Gherman
gherman@python.net

EuroPython Conference
2008-07-07, Vilnius
A presentation about a tool named "fileinfo" given at EuroPython 2008.

  1. Accessing File-Specific Attributes on Steroids
     Dinu C. Gherman, gherman@python.net
     EuroPython Conference, 2008-07-07, Vilnius
  2. Motivation
     • Get quick overview of file attributes for multiple files
     • Compare attribute values between files
     • Identify groups of files
     • Reuse overview results
     • Avoid “opening” files with applications
  3. Background
  4. ~1971(?) wc
     $ cd mercurial/hgweb
     $ wc -lwc *.py
         16      66     502 __init__.py
        118     438    3988 common.py
        993    2876   36064 hgweb_mod.py
        305     910   12420 hgwebdir_mod.py
        228     683    7258 protocol.py
        101     320    3577 request.py
        298     863   10698 server.py
        127     414    3907 webcommands.py
         65     190    2090 wsgicgi.py
       2251    6760   80504 total
  5. ~2002 pycount
     $ cd mercurial/hgweb
     $ pycount2.py *.py
     lines  code   doc comment blank file
        16     5     0       7     4 __init__.py
       118    77    22       8    11 common.py
       993   809     3      31   150 hgweb_mod.py
       305   249     0      17    39 hgwebdir_mod.py
       228   174     0      19    35 protocol.py
       101    76     2       7    16 request.py
       298   244     5      10    39 server.py
       127    93     0      10    24 webcommands.py
        65    43     0      11    11 wsgicgi.py
      2251  1770    32     120   329 total
  6. ~2005 ttfinfo
     $ cd fonts/truetype/
     $ ttfinfo.py -a maxp.numGlyphs -a kern.nPairs -a head.unitsPerEm A*.ttf
      249     0  1000 AmericanTypewriter.ttf
     1320  3072  2048 Arial.ttf
      245  1536  2048 ArialBlack.ttf
     1320  3072  2048 ArialBold.ttf
      956  3072  2048 ArialBoldItalic.ttf
      956  3072  2048 ArialItalic.ttf
      244   384  2048 ArialNarrow.ttf
      245   384  2048 ArialNarrowBold.ttf
      244   384  2048 ArialNarrowBoldItalic.ttf
      244   384  2048 ArialNarrowItalic.ttf
      243  1536  2048 ArialRoundedMTBold.ttf
  7. 2007… pyinfo
     $ cd mercurial/hgweb
     $ pyinfo.py -a nclass:ndef:ncalls:ndiffkw *.py
     nclass  ndef ncalls ndiffkw file
          0     2      2       3 __init__.py
          1     9     31      18 common.py
          1    60    492      24 hgweb_mod.py
          1    15    133      23 hgwebdir_mod.py
          0    11    121      21 protocol.py
          1    12     30      16 request.py
          6    24    104      18 server.py
          0    14     50      15 webcommands.py
          0     3     15      13 wsgicgi.py
         10   150    978         total
  8. 2007… pdfinfo
     $ cd brandeins/200805_bildung
     $ pdfinfo.py -a npages:nimgs:author *.pdf
     npages nimgs author  file
          1     1 Kathrin 802053_008b10508m.pdf
          1     0 Kathrin 802055_010b10508w.pdf
          2     0 Kathrin 802056_012b10508m.pdf
          2     1 Kathrin 802057_018b10508d.pdf
          1     1 Kathrin 802060_020b10508m.pdf
          9     8 n/a     802064_022b10508d.pdf
          8     8 Kathrin 802067_036b10508w.pdf
          2     0 Kathrin 803048_136b10508w.pdf
         26    19         total
  9. Fileinfo
  10. Big Picture
      • Describe input files & attributes
      • Locate input files
      • Investigate file attributes
      • Process file attributes
      • Present tabular output
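
The five steps above can be read as a small pipeline. The following is a minimal sketch of that reading, not fileinfo's actual code: the function names (locate, investigate, process, present, run) and the two attributes "size" and "nlines" are assumptions made for illustration.

```python
import os


def count_lines(path):
    """Count newline-terminated chunks in a file, read as bytes."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


# Hypothetical attribute getters; real plug-ins would supply many more.
GETTERS = {"size": os.path.getsize, "nlines": count_lines}


def locate(paths):
    """Keep only the paths that point at existing regular files."""
    return [p for p in paths if os.path.isfile(p)]


def investigate(path, attrs):
    """Compute each requested attribute for one file."""
    record = {name: GETTERS[name](path) for name in attrs}
    record["path"] = path
    return record


def process(records, sort_key):
    """Order the records by one attribute."""
    return sorted(records, key=lambda rec: rec[sort_key])


def present(records, attrs):
    """Render the records as a plain-text table, one file per line."""
    lines = [" ".join("%8s" % a for a in attrs) + "  path"]
    for rec in records:
        cells = " ".join("%8s" % rec[a] for a in attrs)
        lines.append(cells + "  " + os.path.basename(rec["path"]))
    return "\n".join(lines)


def run(paths, attrs, sort_key):
    """Locate, investigate, process and present in one call."""
    records = [investigate(p, attrs) for p in locate(paths)]
    return present(process(records, sort_key), attrs)
```

Calling run(paths, ["size", "nlines"], "size") returns one header line followed by one line per existing file, smallest file first.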
  11. Input Files Examples
      • fileinfo [opts] /mypath/*.pdf
      • fileinfo [opts] $(find /mypath -name "*.py")
      • fileinfo [opts] $(mdfind -onlyin /mypath -name "*.py")
  12. Attributes Examples
      • --attrs nclasses:ndefs
      • --sort size:ndefs
      • --filter "rec.ndefs > 1000"
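
One way such options could be interpreted is sketched below. This is an assumption about the semantics, not fileinfo's implementation: the Record class and the helpers parse_attrs, apply_filter and apply_sort are hypothetical names. The sketch treats a colon-separated list as an ordered list of attribute names and evaluates the filter as a Python expression in which "rec" is the current record.

```python
class Record(object):
    """Attribute-style access to one file's values, so 'rec.ndefs' works."""

    def __init__(self, **values):
        self.__dict__.update(values)


def parse_attrs(spec):
    """Split a colon-separated list such as 'nclasses:ndefs'."""
    return [name for name in spec.split(":") if name]


def apply_filter(records, expr):
    """Keep the records for which the expression is true.

    eval() keeps the sketch short; it is only acceptable for trusted input.
    """
    code = compile(expr, "<filter>", "eval")
    return [rec for rec in records
            if eval(code, {"__builtins__": {}}, {"rec": rec})]


def apply_sort(records, spec):
    """Sort by several attributes, the first one being most significant."""
    keys = parse_attrs(spec)
    return sorted(records,
                  key=lambda rec: tuple(getattr(rec, k) for k in keys))
```

With records for three files, apply_filter(records, "rec.ndefs > 1000") keeps only the files with more than 1000 definitions, and apply_sort(records, "size:ndefs") orders by size first and by ndefs among files of equal size.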
  13. Output Formats
      • Text, HTML, CSV, ReST (simple)
      • Cocoa, WxPython
      • Django
  14. Selected Plug-ins
      General:      counter, wc, lc, md5, depht
      OS:           uid, username, mtime, size, level
      XML:          nattrs, ndattrs, ntags, ndtags
      TTF:          kern.nPairs, maxp.numGlyphs, maxp.version, head.unitsPerEm
      PDF:          title, author, producer, creation date, npages, nimgs
      MP3:          album, artist
      Python:       ndefs, nclasses, ncalls, nstrs, ndocstrs, nkws, ndkws,
                    nimpstmts, nops, mlw, mil, …
      Quicktime:    duration, box, datasize, ntracks
      OS X bundles: bundlename, bundleversion
      Spotlight:    kMDItem*
  15. Examples
  16. $ cd /Data/brandeins/200712_design
      $ fileinfo --format rest-simple -a npages:nimgs -f "rec.nimgs > 2" *.pdf
      ====== ===== =====================
      npages nimgs path
      ====== ===== =====================
      11     3     540237_058b11207s.pdf
      8      18    540238_070b11207r.pdf
      7      11    540240_082b11207a.pdf
      9      9     540242_096b11207f.pdf
      3      5     540243_106b11207r.pdf
      11     15    540244_110b11207s.pdf
      7      8     540245_122b11207s.pdf
      2      3     540246_136b11207s.pdf
      2      3     540248_148b11207d.pdf
      6      6     540252_138b11207b.pdf
      8      6     540260_026b11207h.pdf
      6      5     540261_038b11207o.pdf
      8      10    540262_048b11207m.pdf
      6      6     540263_156b11207d.pdf
      7      6     540265_170b11207h.pdf
      101    114   total
      ====== ===== =====================
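
Output in this shape is a reStructuredText simple table. As a rough sketch of how such a table can be produced (the function name rest_simple_table is hypothetical, and left-aligning every cell is an assumption about the layout), each column is padded to its widest cell and framed by rules of "=" characters:

```python
def rest_simple_table(headers, rows):
    """Render headers and rows as a reStructuredText simple table."""
    table = [[str(cell) for cell in row] for row in rows]
    # Each column is as wide as its widest cell, header included.
    widths = [max([len(head)] + [len(row[i]) for row in table])
              for i, head in enumerate(headers)]
    rule = " ".join("=" * width for width in widths)

    def fmt(cells):
        padded = [cell.ljust(width) for cell, width in zip(cells, widths)]
        return " ".join(padded).rstrip()

    lines = [rule, fmt(headers), rule]
    lines.extend(fmt(row) for row in table)
    lines.append(rule)
    return "\n".join(lines)
```

A totals line can be passed as an ordinary last row, for example [101, 114, "total"].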
  17. Implementation
  18. PDF-Plugin (1)
      class PDFInvestigator(BaseInvestigator):
          "A class for determining attributes of PDF files."

          attrMap = {
              "title": "getTitle",
              "author": "getAuthor",
              "producer": "getProducer",
              "creationdate": "getCreationDate",
              "npages": "getNumPdfPages",
              "nimgs": "getNumImages",
          }

          totals = ("npages", "nimgs")

          def activate(self):
              "Try activating self, setting 'active' variable."

              # calculate self.active...
              return self.active
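
The attrMap above pairs attribute names with method names, and totals lists the attributes that get summed. How BaseInvestigator uses them is not shown on the slides, so the following is only a guess at the mechanism: SketchBaseInvestigator, getAttrValue, TextInvestigator and summarize are all hypothetical names, and the sketch works on in-memory text rather than real files.

```python
class SketchBaseInvestigator(object):
    """Hypothetical stand-in for BaseInvestigator, not the real class."""

    attrMap = {}   # attribute name -> name of the method computing it
    totals = ()    # attributes that make sense to sum over all files

    def __init__(self, path, content):
        self.path = path
        self.content = content

    def getAttrValue(self, name):
        "Look up the method registered for an attribute and call it."
        return getattr(self, self.attrMap[name])()


class TextInvestigator(SketchBaseInvestigator):
    "A toy plug-in for plain text, mirroring the PDF plug-in's layout."

    attrMap = {
        "name": "getName",
        "nlines": "getNumLines",
        "nwords": "getNumWords",
    }
    totals = ("nlines", "nwords")

    def getName(self):
        return self.path

    def getNumLines(self):
        return len(self.content.splitlines())

    def getNumWords(self):
        return len(self.content.split())


def summarize(investigators, attrs):
    "Return one row per file plus a totals row (blank where not summable)."
    rows = [[inv.getAttrValue(a) for a in attrs] for inv in investigators]
    total = []
    for i, attr in enumerate(attrs):
        if all(attr in inv.totals for inv in investigators):
            total.append(sum(row[i] for row in rows))
        else:
            total.append("")
    return rows, total
```

Under this reading, adding an attribute to a plug-in means writing one method and registering it in attrMap, which would fit the summary's claim that plug-ins are easy to write.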
  19. PDF-Plugin (2)
      def getNumPdfPages(self):
          "Return the number of pages in a PDF document."

          try:
              # uses PyPdf
              res = self.input.getNumPages()
          except Exception:
              # report any failure as a missing value
              res = "n/a"
          return res
  20. PDF-Plugin (3)
      import re

      def getNumImages(self):
          "Return the number of images in a PDF document."

          # match every "N G obj ... endobj" block, plus an optional trailing comment
          expr = r"\d+ +\d+ +obj.*?endobj\s+(?:%.*?[\r\n])?"
          objPat = re.compile(expr, re.M | re.S)
          items = re.findall(objPat, self.content)
          # keep only objects carrying both /Type /XObject and /Subtype /Image
          for p in [re.compile(r"/%s\s*/%s" % (k, v), re.M | re.S)
                    for (k, v) in [("Type", "XObject"), ("Subtype", "Image")]]:
              items = [i for i in items if re.search(p, i) is not None]
          return len(items)
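
To see what this text-based scan does, the same two-stage filtering can be run on a small hand-written fragment. The sketch below is a standalone illustration: count_image_objects and SAMPLE are hypothetical, and the sample is PDF-like text invented for the example rather than a valid PDF file. The backslashes in the patterns are restored on the assumption that they were lost when the slide text was extracted.

```python
import re

# One indirect object: "N G obj ... endobj", plus an optional trailing comment.
OBJ_PATTERN = re.compile(r"\d+ +\d+ +obj.*?endobj\s+(?:%.*?[\r\n])?",
                         re.M | re.S)


def count_image_objects(content):
    """Count objects that carry /Type /XObject and /Subtype /Image."""
    objects = OBJ_PATTERN.findall(content)
    for key, value in (("Type", "XObject"), ("Subtype", "Image")):
        marker = re.compile(r"/%s\s*/%s" % (key, value), re.S)
        objects = [obj for obj in objects if marker.search(obj)]
    return len(objects)


# Invented sample: a catalog, an image, a form XObject, and a second image
# written without spaces between the keys and their values.
SAMPLE = (
    "1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    "2 0 obj\n<< /Type /XObject /Subtype /Image /Width 10 /Height 10 >>\nendobj\n"
    "3 0 obj\n<< /Type /XObject /Subtype /Form >>\nendobj\n"
    "4 0 obj\n<< /Type/XObject /Subtype/Image >>\nendobj\n"
)

print(count_image_objects(SAMPLE))  # prints 2
```

A scan like this only sees objects stored as plain text in the file, so images kept inside compressed object streams would not be counted.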
  21. An Aside: Spotlight
  22. Spotlight
      • Desktop file search
      • Mac OS X 10.4 and 10.5
      • Deeply integrated in Mac OS X
      • Index-based, with attributes
      • Results based on relevance and recency
      • Plug-ins/API for custom file formats
      • GUI & command-line
  23. Spotlight Menu
  24. Spotlight Window
  25. Spotlight
      $ mdfind europython | egrep "\.pdf$"
      /Users/dinu/Desktop/EuroPython2008Timetable.pdf
      /Users/dinu/Developer/Python/fileinfo/presentation/fileinfo-slides.pdf
      /Data/Perso/CV/cv-dg.pdf
      /Users/dinu/Library/Mail Downloads/cv-dg.pdf
      /Users/dinu/Developer/Python/epc2008/badge_data.pdf
      /Data/Docs/dev/The Python Papers/ThePythonPapersVolume2Issue4.pdf
      /Data/Docs/dev/The Python Papers/ThePythonPapersVolume3Issue1.pdf
      /Data/Docs/dev/The Python Papers/ThePythonPapersVolume2Issue3.pdf
      /Data/Docs/dev/The Python Papers/The Python Papers Volume 2, Issue 2.pdf
      /Data/Docs/dev/The Python Papers/The Python Papers Volume 2, Issue 1.pdf
      /Users/dinu/Developer/Python/epc2008/badge_data-hpda.pdf
      /Users/dinu/Developer/Python/epc2008/badge_data-hpda-sliced.pdf
      /Data/Perso/Travel/Vilnius2008/EuroPython 2008 Invoice.pdf
      /Users/dinu/Developer/Python/hipsterpda/output/badges.pdf
      ...
  26. Spotlight – Pro
      • Great index/search technology
      • Very fast, useful and easy to use
      • ~125 search attributes in Mac OS X 10.5 (e.g. Aperture, Composer, …)
      • Extensible (Python plug-in available)
  27. Spotlight – Con
      • Results on the command line are not in table form
      • Results in the GUI are always a list of file names plus the attributes that the Finder (!) knows
      • Weak on providing an overview
      • Mac OS X only
  28. Future
  29. Issues
      • Testing, debugging & refactoring, …
      • Better folder handling (e.g. OS X bundles)
      • Attribute namespaces (pdf.npages)?
      • Attribute parameters (nattr#h2)?
      • Attribute null values ("n/a")?
      • Better dependency handling
  30. More Features?
      • Output format plug-ins?
      • Pylint plug-in for fileinfo?
      • Fileinfo Python plug-in pyinfo.py?
      • Plug-ins for functions like total()
      • Access intra-file dataset attributes?
      • Multi-line attribute values?
      • "Abbreviations" for attribute lists?
      • Derived attributes (ncomments/loc)?
  31. Summary
      • Useful as a general-purpose attribute "browser"
      • Access to Spotlight metadata (Mac OS X)
      • Easy to write plug-ins
      • Fileinfo is not like Spotlight (no index/search)
      • More like iTunes (on the command line ;-)
  32. Links
      • http://www.dinu-gherman.net/tmp/fileinfo-0.3.2.tar.gz
      • http://developer.apple.com/macosx/spotlight.html
      • http://www.apple.com/downloads/macosx/spotlight/
      • http://toxicsoftware.com/python_metadata_importer_106_released/
  33. Questions?
