Slideshow transcript
Slide 1: Indexing BackPAN brian d foy brian@stonehenge.com April 22, 2008
Slide 2: • BackPAN is the historical archive of Comprehensive Perl Archive Network (CPAN) • http://backpan.cpan.org • 200k number of files, 10 Gb
Slide 3: • CPAN only has the distributions the authors leave on there • 55k distributions, 4 Gb • CPAN tools use an index
Slide 4: • Perl doesn't have a package manager • Install a file by putting it in @INC • No file to distro reverse mapping • Avoids overwriting by PAUSE indexing and permissions checking • No version management, multiple versions, author management
Slide 5: • Recreate module installation history • Start with the files in @INC • Work backward to distro • End with a list of distros to install • Create a MyCPAN with those distros
Slide 6: How PAUSE indexes
Slide 7: • PAUSE accepts uploads from anyone • Need PAUSE ID, but that's easy to get • PAUSE does not want to run any code • Wants author, namespace, version • It happens in mldistwatch
Slide 8: Extract $VERSION next unless /([$*])(([w:']*)bVERSION)b.*=/; my $current_parsed_line = $_; my $eval = qq{ package ExtUtils::MakeMaker::_version; local $1$2; $$2=undef; do { $_ }; $$2 }; $result = eval( $eval ); mldistwatch
Slide 9: • An author has namespace privileges • Can't upload without permission • Further uploads have higher versions • Index failures don't prevent uploads
Slide 10: Extract package if( $pline =~ m{ (.*) bpackages+ ([w:']+) s* ( $ | [};] ) }x) { $pkg = $2; }
Slide 11: Create an index • 02.package.details.txt.gz DBI 1.604 T/TI/TIMB/DBI-1.604.tar.gz • Has only latest distro • This isnʼt a magic file • But CPAN tools use it
Slide 12: How I Index
Slide 13: • Donʼt care about permissions • PAUSE has already filtered • Run in virtual machines • no network connections • mount BackPAN readonly • if it blows up, so what
Slide 14: • Don't trust anything • Not META.yml, Makefile.PL, Build.PL • Run the build file, look in blib • Extract blib file list, file meta data, namespaces, and versions • Extract anything else I can • Dependencies
Slide 15: • Mostly automated • Use one set-up, index • See what fails • Try another to get more • Try different methods
Slide 16: • Right now, I just want the data • Distribute data in many forms • People can use it how they like • Keep up with CPAN
Slide 17: Mechanics
Slide 18: Unpack dist • Archive::Extract • Automatically dispatches my $extractor = eval { Archive::Extract->new( archive => $dist ) }; my $rc = $extractor->extract( to => $unpack_dir ); my $type = $extractor->type; $ tgz, etc.
Slide 19: Fork • Each dist gets itʼs own process • Compartmentalize • Parallelize • alarm-ize
Slide 20: Extract versions • Module::Extract::VERSION • Same as PAUSE, but in a module Module::Extract::VERSION ->parse_version_safely( FILE ); • Can be changed later
Slide 21: Guess build system • Distribution::Guess::BuildSystem • Try different techniques • Disable auto_install • Create blib
Slide 22: use Distribution::Guess::BuildSystem; my $guesser = Distribution::Guess::BuildSystem->new( $dist_dir ); if( $guesser->uses_makemaker ) { ... } elsif( $guesser->uses_module_build ) { ... } elsif( ... ) { ... }
Slide 23: Record meta data • Filenames • File size • MD5 digests • Source control keywords • PPI cache?
Slide 24: Extract packages • All packages use Module::Extract::Namespaces; # in list context, extract all namespaces my @namespaces = Module::Extract::Namespaces ->from_file( $filename ); • Assume first package is main one
Slide 25: Use PPI my $package_statements = $Document->find( sub { $_[1]->isa('PPI::Statement::Package')} ); my @namespaces = map { /package s+ (w+(::w+)*) s* ; /x; $1 } @$package_statements; }
Slide 26: Record as YAML • Changing too much for a database • Easier to look at • Can import later • Can hand edit to correct
Slide 27: Notice errors • Some distros donʼt build • Perl versions • Perl compilation options (threads) • OS dependencies • missing libraries
Slide 28: Modify system • Find out what doesnʼt work • Fix the indexer for it • Make a special case
Slide 29: Conclusion • Index all of BackPAN • Modularize bits of PAUSE • Redistribute the data • Create custom CPAN versions




Add a comment on Slide 1
If you have a SlideShare account, login to comment; else you can comment as a guest- Favorites & Groups
Showing 1-50 of 3 (more)