Accessing File-Specific Attributes on Steroids - EuroPython 2008


A presentation about a tool named "fileinfo" given at EuroPython 2008.



Transcript

  • 1. Accessing File-Specific Attributes on Steroids Dinu C. Gherman gherman@python.net EuroPython Conference 2008-07-07, Vilnius
  • 2. Motivation • Get quick overview of file attributes for multiple files • Compare attribute values between files • Identify groups of files • Reuse overview results • Avoid “opening” files with applications
  • 3. Background
  • 4. ~1971(?) wc

      $ cd mercurial/hgweb
      $ wc -lwc *.py
        16    66   502 __init__.py
       118   438  3988 common.py
       993  2876 36064 hgweb_mod.py
       305   910 12420 hgwebdir_mod.py
       228   683  7258 protocol.py
       101   320  3577 request.py
       298   863 10698 server.py
       127   414  3907 webcommands.py
        65   190  2090 wsgicgi.py
      2251  6760 80504 total
  • 5. ~2002 pycount

      $ cd mercurial/hgweb
      $ pycount2.py *.py
      lines code doc comment blank file
         16    5   0       7     4 __init__.py
        118   77  22       8    11 common.py
        993  809   3      31   150 hgweb_mod.py
        305  249   0      17    39 hgwebdir_mod.py
        228  174   0      19    35 protocol.py
        101   76   2       7    16 request.py
        298  244   5      10    39 server.py
        127   93   0      10    24 webcommands.py
         65   43   0      11    11 wsgicgi.py
       2251 1770  32     120   329 total
  • 6. ~2005 ttfinfo

      $ cd fonts/truetype/
      $ ttfinfo.py -a maxp.numGlyphs -a kern.nPairs -a head.unitsPerEm A*.ttf
       249    0 1000 AmericanTypewriter.ttf
      1320 3072 2048 Arial.ttf
       245 1536 2048 ArialBlack.ttf
      1320 3072 2048 ArialBold.ttf
       956 3072 2048 ArialBoldItalic.ttf
       956 3072 2048 ArialItalic.ttf
       244  384 2048 ArialNarrow.ttf
       245  384 2048 ArialNarrowBold.ttf
       244  384 2048 ArialNarrowBoldItalic.ttf
       244  384 2048 ArialNarrowItalic.ttf
       243 1536 2048 ArialRoundedMTBold.ttf
  • 7. 2007… pyinfo

      $ cd mercurial/hgweb
      $ pyinfo.py -a nclass:ndef:ncalls:ndiffkw *.py
      nclass ndef ncalls ndiffkw file
           0    2      2       3 __init__.py
           1    9     31      18 common.py
           1   60    492      24 hgweb_mod.py
           1   15    133      23 hgwebdir_mod.py
           0   11    121      21 protocol.py
           1   12     30      16 request.py
           6   24    104      18 server.py
           0   14     50      15 webcommands.py
           0    3     15      13 wsgicgi.py
          10  150    978         total
  • 8. 2007… pdfinfo

      $ cd brandeins/200805_bildung
      $ pdfinfo.py -a npages:nimgs:author *.pdf
      npages nimgs author  file
           1     1 Kathrin 802053_008b10508m.pdf
           1     0 Kathrin 802055_010b10508w.pdf
           2     0 Kathrin 802056_012b10508m.pdf
           2     1 Kathrin 802057_018b10508d.pdf
           1     1 Kathrin 802060_020b10508m.pdf
           9     8 n/a     802064_022b10508d.pdf
           8     8 Kathrin 802067_036b10508w.pdf
           2     0 Kathrin 803048_136b10508w.pdf
          26    19         total
  • 9. Fileinfo
  • 10. Big Picture • Describe input files & attributes • Locate input files • Investigate file attributes • Process file attributes • Present tabular output
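The five stages on this slide can be sketched as a small pipeline. This is only an illustration under stated assumptions, not fileinfo's actual code: the function names (`locate`, `investigate`, `process`, `present`), the `ATTRIBUTES` registry and the two sample attributes are all hypothetical.

```python
import glob
import os


def count_lines(path):
    """Count lines by iterating over the file in binary mode."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


# Hypothetical attribute registry: attribute name -> function(path) -> value.
ATTRIBUTES = {
    "size": os.path.getsize,
    "lc": count_lines,
}


def locate(patterns):
    """Locate input files: expand glob patterns into a sorted list of files."""
    paths = []
    for pattern in patterns:
        paths.extend(glob.glob(pattern))
    return sorted(p for p in set(paths) if os.path.isfile(p))


def investigate(paths, attr_names):
    """Investigate attributes: build one record (dict) per file."""
    records = []
    for path in paths:
        record = {"path": path}
        for name in attr_names:
            record[name] = ATTRIBUTES[name](path)
        records.append(record)
    return records


def process(records, attr_names):
    """Process attributes: here, simply append a row of column totals."""
    total = {"path": "total"}
    for name in attr_names:
        total[name] = sum(r[name] for r in records)
    return records + [total]


def present(records, attr_names):
    """Present tabular output: right-aligned values, path in the last column."""
    widths = {n: max([len(n)] + [len(str(r[n])) for r in records])
              for n in attr_names}
    header = [n.rjust(widths[n]) for n in attr_names] + ["path"]
    lines = [" ".join(header)]
    for r in records:
        cells = [str(r[n]).rjust(widths[n]) for n in attr_names]
        lines.append(" ".join(cells + [r["path"]]))
    return "\n".join(lines)
```

A call such as `print(present(process(investigate(locate(["*.py"]), ["lc", "size"]), ["lc", "size"]), ["lc", "size"]))` would print a table similar in shape to the `wc` output shown earlier.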
  • 11. Input Files Examples • fileinfo [opts] /mypath/*.pdf • fileinfo [opts] $(find /mypath -name "*.py") • fileinfo [opts] $(mdfind -onlyin /mypath -name "*.py")
  • 12. Attributes Examples • --attrs nclasses:ndefs • --sort size:ndefs • --filter "rec.ndefs > 1000"
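The `--filter "rec.ndefs > 1000"` syntax suggests that the filter is a Python expression evaluated once per record, with the record bound to the name `rec`, and that `--sort` takes a colon-separated list of attribute names. A minimal sketch under exactly those assumptions follows; `apply_filter` and `apply_sort` are hypothetical names, and evaluating expressions with `eval` is only reasonable here because the expression comes from the user's own command line.

```python
from types import SimpleNamespace


def apply_filter(records, expression):
    """Keep the records for which the expression is true, with 'rec' bound."""
    code = compile(expression, "<filter>", "eval")
    return [rec for rec in records
            if eval(code, {"__builtins__": {}}, {"rec": rec})]


def apply_sort(records, spec):
    """Sort records by a colon-separated list of attribute names."""
    keys = spec.split(":")
    return sorted(records, key=lambda rec: tuple(getattr(rec, k) for k in keys))


# Made-up sample records, for illustration only.
records = [
    SimpleNamespace(path="a.py", size=300, ndefs=12),
    SimpleNamespace(path="b.py", size=100, ndefs=1500),
    SimpleNamespace(path="c.py", size=100, ndefs=2000),
]

big = apply_filter(records, "rec.ndefs > 1000")
ordered = apply_sort(records, "size:ndefs")
```

Here `big` holds the records for `b.py` and `c.py`, and `ordered` lists `b.py`, `c.py`, `a.py`.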
  • 13. Output Formats • Text, HTML, CSV, ReST (simple) • Cocoa, WxPython • Django
  • 14. Selected Plug-ins

      General:      counter, wc, lc, md5, depht
      OS:           uid, username, mtime, size, level
      XML:          nattrs, ndattrs, ntags, ndtags
      TTF:          kern.nPairs, maxp.numGlyphs, maxp.version, head.unitsPerEm
      PDF:          title, author, producer, creation date, npages, nimgs
      MP3:          album, artist
      Python:       ndefs, nclasses, ncalls, nstrs, ndocstrs, nkws, ndkws, nimpstmts, nops, mlw, mil, …
      Quicktime:    duration, box, datasize, ntracks
      OS X bundles: bundlename, bundleversion
      Spotlight:    kMDItem*
  • 15. Examples
  • 16. Example: PDF files with more than two images, as a simple ReST table

      $ cd /Data/brandeins/200712_design
      $ fileinfo --format rest-simple -a npages:nimgs -f "rec.nimgs > 2" *.pdf
      ====== ===== =====================
      npages nimgs path
      ====== ===== =====================
      11     3     540237_058b11207s.pdf
      8      18    540238_070b11207r.pdf
      7      11    540240_082b11207a.pdf
      9      9     540242_096b11207f.pdf
      3      5     540243_106b11207r.pdf
      11     15    540244_110b11207s.pdf
      7      8     540245_122b11207s.pdf
      2      3     540246_136b11207s.pdf
      2      3     540248_148b11207d.pdf
      6      6     540252_138b11207b.pdf
      8      6     540260_026b11207h.pdf
      6      5     540261_038b11207o.pdf
      8      10    540262_048b11207m.pdf
      6      6     540263_156b11207d.pdf
      7      6     540265_170b11207h.pdf
      101    114   total
      ====== ===== =====================
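A simple reStructuredText table like the one produced by `--format rest-simple` can be generated with a few lines of Python. The sketch below is an assumption about how such a formatter might look, not fileinfo's implementation; `rest_simple_table` is a hypothetical name, and the cells are left-aligned by choice.

```python
def rest_simple_table(headers, rows):
    """Render headers and rows as a reStructuredText simple table."""
    cells = [[str(value) for value in row] for row in rows]
    # Each column is as wide as its widest cell (header included).
    widths = [max([len(h)] + [len(r[i]) for r in cells])
              for i, h in enumerate(headers)]
    rule = " ".join("=" * w for w in widths)

    def fmt(values):
        line = " ".join(v.ljust(w) for v, w in zip(values, widths))
        return line.rstrip()

    body = [fmt(r) for r in cells]
    return "\n".join([rule, fmt(headers), rule] + body + [rule])


# Two rows taken from the slide above.
table = rest_simple_table(
    ["npages", "nimgs", "path"],
    [(11, 3, "540237_058b11207s.pdf"), (8, 18, "540238_070b11207r.pdf")],
)
```

Left-aligning every cell keeps the first column non-empty, which matters because a blank first-column cell marks a continuation line in simple tables.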
  • 17. Implementation
  • 18. PDF-Plugin (1)

      class PDFInvestigator(BaseInvestigator):
          "A class for determining attributes of PDF files."

          attrMap = {
              "title": "getTitle",
              "author": "getAuthor",
              "producer": "getProducer",
              "creationdate": "getCreationDate",
              "npages": "getNumPdfPages",
              "nimgs": "getNumImages",
          }

          totals = ("npages", "nimgs")

          def activate(self):
              "Try activating self, setting 'active' variable."
              # calculate self.active...
              return self.active
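The slides do not show `BaseInvestigator`, so the following is a guess at how the `attrMap` indirection could work: attribute names map to method names, which a base class resolves with `getattr`. Everything here is hypothetical (`getAttrValue`, `WordCountInvestigator`, the constructor signature); it only illustrates the plug-in pattern.

```python
class BaseInvestigator:
    """Hypothetical base class; the real one is not shown in the slides."""

    attrMap = {}   # attribute name -> name of the method computing it
    totals = ()    # attributes for which a column total makes sense

    def __init__(self, path):
        self.path = path
        self.active = False

    def activate(self):
        "Try activating self, setting 'active' variable."
        self.active = True
        return self.active

    def getAttrValue(self, name):
        "Look up the method registered for an attribute and call it."
        method = getattr(self, self.attrMap[name])
        return method()


class WordCountInvestigator(BaseInvestigator):
    "A toy plug-in counting lines and words in text files."

    attrMap = {"lc": "getLineCount", "wc": "getWordCount"}
    totals = ("lc", "wc")

    def activate(self):
        "Activate only if the file can be read; content is loaded here."
        try:
            with open(self.path, encoding="utf-8", errors="replace") as f:
                self.content = f.read()
            self.active = True
        except OSError:
            self.active = False
        return self.active

    def getLineCount(self):
        return len(self.content.splitlines())

    def getWordCount(self):
        return len(self.content.split())
```

With this shape, writing a new plug-in amounts to subclassing, filling in `attrMap`, and implementing one method per attribute; `getAttrValue` must only be called after a successful `activate()`.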
  • 19. PDF-Plugin (2)

      def getNumPdfPages(self):
          "Return the number of pages in a PDF document."
          try:
              # uses PyPdf
              res = self.input.getNumPages()
          except Exception:
              res = "n/a"
          return res
  • 20. PDF-Plugin (3)

      # Requires "import re" at module level.
      def getNumImages(self):
          "Return the number of images in a PDF document."
          # Match every "N G obj ... endobj" block in the raw file content.
          expr = r"\d+ +\d+ +obj.*?endobj\s+(?:%.*?[\r\n])?"
          objPat = re.compile(expr, re.M | re.S)
          items = re.findall(objPat, self.content)
          # Keep only objects tagged /Type /XObject and /Subtype /Image.
          for p in [re.compile(r"/%s\s*/%s" % (k, v), re.M | re.S)
                    for (k, v) in [("Type", "XObject"), ("Subtype", "Image")]]:
              items = [i for i in items if re.search(p, i) is not None]
          return len(items)
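The regex-based image count can be tried on a small synthetic sample. The patterns below are written with explicit backslash escapes (`\d`, `\s`, `[\r\n]`), which is an assumption about the original slide; `count_image_objects` and `SAMPLE` are hypothetical names, and the sample is hand-written PDF-like text, not a valid PDF file.

```python
import re

# One "N G obj ... endobj" block, optionally followed by a comment line.
OBJ_PATTERN = re.compile(r"\d+ +\d+ +obj.*?endobj\s+(?:%.*?[\r\n])?",
                         re.M | re.S)


def count_image_objects(content):
    """Count objects tagged both /Type /XObject and /Subtype /Image."""
    items = OBJ_PATTERN.findall(content)
    for key, value in [("Type", "XObject"), ("Subtype", "Image")]:
        pattern = re.compile(r"/%s\s*/%s" % (key, value), re.M | re.S)
        items = [item for item in items if pattern.search(item)]
    return len(items)


# Hand-written sample: a catalog, one image XObject, one form XObject.
SAMPLE = (
    "1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    "2 0 obj\n<< /Type /XObject /Subtype /Image /Width 10 /Height 10 >>\n"
    "stream\n...\nendstream\nendobj\n"
    "3 0 obj\n<< /Type /XObject /Subtype /Form >>\nendobj\n"
)

n_images = count_image_objects(SAMPLE)
```

For this sample `n_images` is 1. The approach is a heuristic over the raw file text: it will miss image objects stored inside compressed object streams and images whose dictionary omits the `/Type` key.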
  • 21. An Aside: Spotlight
  • 22. Spotlight • Desktop file search • Mac OS X 10.4 and 10.5 • Deeply integrated in Mac OS X • Index-based, with attributes • Results based on relevance and recency • Plug-ins/API for custom file formats • GUI & command-line
  • 23. Spotlight Menu
  • 24. Spotlight Window
  • 25. Spotlight

      $ mdfind europython | egrep ".pdf$"
      /Users/dinu/Desktop/EuroPython2008Timetable.pdf
      /Users/dinu/Developer/Python/fileinfo/presentation/fileinfo-slides.pdf
      /Data/Perso/CV/cv-dg.pdf
      /Users/dinu/Library/Mail Downloads/cv-dg.pdf
      /Users/dinu/Developer/Python/epc2008/badge_data.pdf
      /Data/Docs/dev/The Python Papers/ThePythonPapersVolume2Issue4.pdf
      /Data/Docs/dev/The Python Papers/ThePythonPapersVolume3Issue1.pdf
      /Data/Docs/dev/The Python Papers/ThePythonPapersVolume2Issue3.pdf
      /Data/Docs/dev/The Python Papers/The Python Papers Volume 2, Issue 2.pdf
      /Data/Docs/dev/The Python Papers/The Python Papers Volume 2, Issue 1.pdf
      /Users/dinu/Developer/Python/epc2008/badge_data-hpda.pdf
      /Users/dinu/Developer/Python/epc2008/badge_data-hpda-sliced.pdf
      /Data/Perso/Travel/Vilnius2008/EuroPython 2008 Invoice.pdf
      /Users/dinu/Developer/Python/hipsterpda/output/badges.pdf
      ...
  • 26. Spotlight – Pro • Great index/search technology • Very fast, useful and easy to use • ~125 search attributes in Mac OS X 10.5 (e.g. Aperture, Composer, …) • Extensible (Python plug-in available)
  • 27. Spotlight – Con • Results on the command line are not in table form • Results in the GUI are always a list of file names plus the attributes that the Finder (!) knows • Weak at providing an overview • Mac OS X only
  • 28. Future
  • 29. Issues • Testing, debugging & refactoring, … • Better folder handling (e.g. OS X bundles) • Attribute namespaces (pdf.npages)? • Attribute parameters (nattr#h2)? • Attribute null values (“n/a”)? • Better dependency handling
  • 30. More Features? • Output format plug-ins? • Pylint plug-in for fileinfo? • Fileinfo Python plug-in pyinfo.py? • Plug-ins for functions like total() • Access intra-file dataset attributes? • Multi-line attribute values? • “Abbreviations” for attribute lists? • Derived attributes (ncomments/loc)?
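The last item, derived attributes such as `ncomments/loc`, is only posed as a question on the slide. One way it could work is to compute the new column from existing record values after investigation. The sketch below is purely illustrative: `add_derived` and the sample records are hypothetical, and the "n/a" fallback mirrors the null value used elsewhere in the talk.

```python
def add_derived(records, name, func):
    """Return copies of the records with one extra, computed attribute."""
    result = []
    for record in records:
        new = dict(record)
        try:
            new[name] = func(record)
        except (ZeroDivisionError, TypeError, KeyError):
            # Missing or non-numeric inputs yield the null value "n/a".
            new[name] = "n/a"
        result.append(new)
    return result


# Made-up sample records, for illustration only.
records = [
    {"path": "a.py", "ncomments": 8, "loc": 80},
    {"path": "empty.py", "ncomments": 0, "loc": 0},
    {"path": "b.pdf", "ncomments": "n/a", "loc": "n/a"},
]

with_ratio = add_derived(records, "comment_ratio",
                         lambda r: r["ncomments"] / r["loc"])
```

Here the first record gets a ratio of 0.1, while the empty file and the PDF both fall back to "n/a".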
  • 31. Summary • Useful as a general-purpose attribute “browser” • Access to Spotlight metadata (Mac OS X) • Easy to write plug-ins • Fileinfo is not like Spotlight (no index/search) • More like iTunes (on the command line ;-)
  • 32. Links • http://www.dinu-gherman.net/tmp/fileinfo-0.3.2.tar.gz • http://developer.apple.com/macosx/spotlight.html • http://www.apple.com/downloads/macosx/spotlight/ • http://toxicsoftware.com/python_metadata_importer_106_released/
  • 33. Questions?