
Indexing BackPAN

From brian_d_foy, 4 months ago

Indexing BackPAN is the first step to custom views of CPAN. See my ...

Slideshow transcript

Slide 1: Indexing BackPAN brian d foy brian@stonehenge.com April 22, 2008

Slide 2: • BackPAN is the historical archive of the Comprehensive Perl Archive Network (CPAN) • http://backpan.cpan.org • 200k files, 10 GB

Slide 3: • CPAN only has the distributions the authors leave on there • 55k distributions, 4 GB • CPAN tools use an index

Slide 4: • Perl doesn't have a package manager • Install a file by putting it in @INC • No file to distro reverse mapping • Avoids overwriting by PAUSE indexing and permissions checking • No version management, multiple versions, author management

Slide 5: • Recreate module installation history • Start with the files in @INC • Work backward to distro • End with a list of distros to install • Create a MyCPAN with those distros

Slide 6: How PAUSE indexes

Slide 7: • PAUSE accepts uploads from anyone • Need PAUSE ID, but that's easy to get • PAUSE does not want to run any code • Wants author, namespace, version • It happens in mldistwatch

Slide 8: Extract $VERSION

    next unless /([\$*])(([\w\:\']*)\bVERSION)\b.*\=/;
    my $current_parsed_line = $_;
    my $eval = qq{
        package ExtUtils::MakeMaker::_version;
        local $1$2;
        \$$2=undef; do { $_ }; \$$2
    };
    $result = eval( $eval );

mldistwatch
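To make the version-extraction trick on Slide 8 concrete, here is a minimal standalone sketch of the same idea: find the line that assigns to a VERSION variable, evaluate only that line in a scratch package, and read the variable back. The helper name `version_from_source` and the package `My::Scratch::_version` are made up for this illustration and are not part of PAUSE or ExtUtils::MakeMaker. Note that it really does run the matched line, so it should only see trusted or sandboxed input.

```perl
use strict;
use warnings;

# Hypothetical helper (not PAUSE's code): return the first $VERSION
# found in a string of Perl source, one line at a time.
sub version_from_source {
    my ($source) = @_;
    for my $line ( split /\n/, $source ) {
        # Same pattern as the slide: a $ or * sigil, an optional
        # package prefix, the word VERSION, then an assignment.
        next unless $line =~ /([\$*])(([\w\:\']*)\bVERSION)\b.*\=/;
        my ( $sigil, $name ) = ( $1, $2 );

        # Build a tiny program that runs only the assignment line in a
        # scratch package, then hands back the variable it set.
        my $code = qq{
            package My::Scratch::_version;
            no strict;
            local $sigil$name;
            \$$name = undef;
            do { $line };
            \$$name
        };
        my $result = eval $code;
        return $result unless $@;
    }
    return;
}

my $source = <<'PERL';
package Some::Module;
use strict;
our $VERSION = '1.604';
1;
PERL

print version_from_source($source), "\n";
```

Evaluating a single line keeps the rest of the module from running, but the line itself can still contain arbitrary code.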

Slide 9: • An author has namespace privileges • Can't upload without permission • Further uploads have higher versions • Index failures don't prevent uploads

Slide 10: Extract package

    if( $pline =~ m{
            (.*)
            \bpackage\s+
            ([\w\:\']+)
            \s*
            ( $ | [\}\;] )
        }x) {
        $pkg = $2;
    }
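The package pattern from Slide 10 can be tried out on a small piece of source. The sketch below wraps it in a hypothetical helper, `packages_from_source`, which is not a PAUSE function; the crude comment stripping is an assumption added for this example.

```perl
use strict;
use warnings;

# Hypothetical helper (not PAUSE's code): list the package names
# declared in a string of Perl source, scanning line by line.
sub packages_from_source {
    my ($source) = @_;
    my @packages;
    for my $pline ( split /\n/, $source ) {
        $pline =~ s/\#.*//;    # crude comment strip, an assumption of this sketch
        if ( $pline =~ m{
                (.*)
                \bpackage\s+
                ([\w\:\']+)
                \s*
                ( $ | [\}\;] )
            }x ) {
            push @packages, $2;
        }
    }
    return @packages;
}

my @found = packages_from_source(<<'PERL');
package Foo::Bar;
sub new { }
package Foo::Bar::Helper;
1;
PERL

print "$_\n" for @found;
```

A line-based regex like this can be fooled by the word "package" inside strings or POD, which is one reason the later slides turn to PPI.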

Slide 11: Create an index • 02packages.details.txt.gz • Has only latest distro • This isn't a magic file • But CPAN tools use it

    DBI 1.604 T/TI/TIMB/DBI-1.604.tar.gz
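Since the index is an ordinary text file, a record like the one on Slide 11 can be split into its three whitespace-separated columns. This is a minimal sketch; `parse_index_line` and the hash keys are names chosen for the example, and it assumes the header lines at the top of the real file have already been skipped.

```perl
use strict;
use warnings;

# Hypothetical helper: split one index record into package name,
# version, and the author-relative path of the distribution.
sub parse_index_line {
    my ($line) = @_;
    my ( $package, $version, $path ) = split ' ', $line;
    return { package => $package, version => $version, path => $path };
}

my $entry = parse_index_line('DBI 1.604 T/TI/TIMB/DBI-1.604.tar.gz');
print "$entry->{package} $entry->{version} is in $entry->{path}\n";
```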

Slide 12: How I Index

Slide 13: • Don't care about permissions • PAUSE has already filtered • Run in virtual machines • no network connections • mount BackPAN readonly • if it blows up, so what

Slide 14: • Don't trust anything • Not META.yml, Makefile.PL, Build.PL • Run the build file, look in blib • Extract blib file list, file meta data, namespaces, and versions • Extract anything else I can • Dependencies

Slide 15: • Mostly automated • Use one set-up, index • See what fails • Try another to get more • Try different methods

Slide 16: • Right now, I just want the data • Distribute data in many forms • People can use it how they like • Keep up with CPAN

Slide 17: Mechanics

Slide 18: Unpack dist • Archive::Extract • Automatically dispatches

    my $extractor = eval { Archive::Extract->new( archive => $dist ) };
    my $rc = $extractor->extract( to => $unpack_dir );
    my $type = $extractor->type; # tgz, etc.

Slide 19: Fork • Each dist gets its own process • Compartmentalize • Parallelize • alarm-ize
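The fork-per-distribution idea with a timeout might look roughly like the sketch below. It is not the indexer's actual code: `index_in_child`, the callback, and the distribution names are all made up for illustration, and for simplicity the parent waits for each child in turn rather than running several at once.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: run $work->($dist) in a child process and give
# up after $timeout seconds. Returns the wait status (0 means success).
sub index_in_child {
    my ( $dist, $work, $timeout ) = @_;
    my $pid = fork;
    die "fork failed: $!" unless defined $pid;

    if ( $pid == 0 ) {     # child
        alarm $timeout;    # the default SIGALRM action kills the child
        my $ok = eval { $work->($dist); 1 };
        # _exit skips END blocks and buffered output inherited from the parent
        POSIX::_exit( $ok ? 0 : 1 );
    }

    waitpid $pid, 0;       # parent: wait for this dist to finish
    return $?;
}

# Stand-in for the real indexing work; it fails for one made-up dist.
my $work = sub { die "boom\n" if $_[0] =~ /^Bad/ };

for my $dist (qw(Good-1.0.tar.gz Bad-0.1.tar.gz)) {
    my $status = index_in_child( $dist, $work, 10 );
    print "$dist: ", ( $status == 0 ? 'ok' : 'failed' ), "\n";
}
```

Because each distribution runs in its own process, a crash or a hang only costs that one child, and the parent moves on to the next distribution.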

Slide 20: Extract versions • Module::Extract::VERSION • Same as PAUSE, but in a module • Can be changed later

    Module::Extract::VERSION->parse_version_safely( FILE );

Slide 21: Guess build system • Distribution::Guess::BuildSystem • Try different techniques • Disable auto_install • Create blib

Slide 22:

    use Distribution::Guess::BuildSystem;

    my $guesser = Distribution::Guess::BuildSystem->new( $dist_dir );

    if( $guesser->uses_makemaker ) {
        ...
    }
    elsif( $guesser->uses_module_build ) {
        ...
    }
    elsif( ... ) {
        ...
    }

Slide 23: Record meta data • Filenames • File size • MD5 digests • Source control keywords • PPI cache?
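As an illustration of the first three items on Slide 23, the sketch below records a file's name, size, and MD5 digest using the core Digest::MD5 module. The helper name `file_metadata` and the hash keys are assumptions for this example, not the indexer's real record layout.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: collect name, size in bytes, and MD5 hex digest
# for one file.
sub file_metadata {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;    # digest the raw bytes, not decoded text
    my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return { name => $path, size => -s $path, md5 => $md5 };
}

# Example: describe this script itself.
my $meta = file_metadata($0);
print "$meta->{name}: $meta->{size} bytes, md5 $meta->{md5}\n";
```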

Slide 24: Extract packages • All packages • Assume first package is main one

    use Module::Extract::Namespaces;

    # in list context, extract all namespaces
    my @namespaces = Module::Extract::Namespaces->from_file( $filename );

Slide 25: Use PPI

    my $package_statements = $Document->find(
        sub { $_[1]->isa('PPI::Statement::Package') }
    );

    my @namespaces = map {
        /package \s+ (\w+(::\w+)*) \s* ; /x; $1
    } @$package_statements;

Slide 26: Record as YAML • Changing too much for a database • Easier to look at • Can import later • Can hand edit to correct
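A record written as YAML stays readable and hand-editable, as Slide 26 notes. The sketch below writes and re-reads one record with CPAN::Meta::YAML, chosen here only because it ships with recent Perls; the slides do not say which YAML module the indexer uses, and the field names are invented for the example.

```perl
use strict;
use warnings;
use CPAN::Meta::YAML;

# Hypothetical record layout: these keys are illustrative only.
my $record = {
    dist       => 'DBI-1.604.tar.gz',
    namespaces => ['DBI'],
    versions   => { DBI => '1.604' },
};

# Serialize the record to a YAML string.
my $yaml = CPAN::Meta::YAML->new($record)->write_string;
print $yaml;

# Read it back, for example after a hand edit; [0] is the first document.
my $back = CPAN::Meta::YAML->read_string($yaml)->[0];
print "round trip version: $back->{versions}{DBI}\n";
```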

Slide 27: Notice errors • Some distros don't build • Perl versions • Perl compilation options (threads) • OS dependencies • missing libraries

Slide 28: Modify system • Find out what doesn't work • Fix the indexer for it • Make a special case

Slide 29: Conclusion • Index all of BackPAN • Modularize bits of PAUSE • Redistribute the data • Create custom CPAN versions