Your SlideShare is downloading. ×
BioPerl: Developming Open Source Software
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

BioPerl: Developming Open Source Software

4,248
views

Published on

My Keynote talk at BOSC 2005 about the development of BioPerl, an open source Perl toolkit for Bioinformatics.

My Keynote talk at BOSC 2005 about the development of BioPerl, an open source Perl toolkit for Bioinformatics.

Published in: Technology

3 Comments
4 Likes
Statistics
Notes
  • Excellent. You've shown your credibility on presentation with this slideshow. This one deserves thumbs up. I'm John, owner of www.freeringtones.ws/ . Perhaps to see more quality slides from you.

    Best wishes.
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • outstanding display..convinced me to have a hardlook at my business model..great

    Janie
    http://financejedi.com
    http://healthjedi.com
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • fetterr
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
No Downloads
Views
Total Views
4,248
On Slideshare
0
From Embeds
0
Number of Embeds
4
Actions
Shares
0
Downloads
152
Comments
3
Likes
4
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. Bioperl: Developing Open Source Software Jason Stajich Duke University BOSC 2005
  • 2. Why Talk About Bioperl? • Successful open-source project • Bioinformatics is a difficult field to straddle • Toolkit still has (many) flaws • An evolving project BOSC 2005
  • 3. How Did We Get Here? • A Brief History • Development Stats • People • Bioperl in Action BOSC 2005
  • 4. Bioperl Development Goals • Provide a useable Perl toolkit for life science data • Solve problems that developers need • Avoid the 1-off “everyone write a BLAST|GenBank|FASTA parser” • Continuity of solutions BOSC 2005
  • 5. A Brief History 2002 1997-1998 2000 Hackathons AZ to ZA Poster at ISMB 2004 Bio::SeqIO, Bio::DB OMG Bio-Objects “Core” Founded bio.perl.org website Bioperl Bootcamp at UMontreal EnsEMBL launched Bioperl 1.0 release Bio::PreSeq bioperl-ext Bioperl paper Bio::UnivAln BOSC2004 Gbrowse paper Bio::Struct Bioperl 0.07 BLAST modules HOWTOs started BOSC2000 bioperl-run Bioperl 0.01-0.04 BOSC2002 1999 2003 1996 2001 2005 Singapore Hackathon Bioperl 1.5 released Ewan Birney becomes VSNS-BCD bptutorial written project leader after Steve Bioperl 1.2.3 released Course Bioperl-db 100+ Paper references Chervitz BioPipe paper published Steven Brenner bioperl-db to have official Bioperl 1.4 released Steve Chervitz * BioCORBA release Bio::SeqIO, Bio::Index Chris Dagdigian BOSC2001 BOSC2005 bioperl-microarray Richard Resnick Releases 0.05-0.06 started Lew Gramer BioP BOSC2003 BioJava, BioPython launched Bioperl-99 mtg BOSC 2005
  • 6. Defining Points • Vision of module(s) set by code owner • She/He who writes it, “wins” argument • Project leader/core developers • can remove code before a release to insure a working release • Enforce code and module continuity • Preserve API as much as possible BOSC 2005
  • 7. Project Mechanics • Code is on a shared server (O|B|F) • Publicly accessible • Write-access to repository granted by main developers • Discussion of code, ideas on-list BOSC 2005
  • 8. Bioperl Mailing List Traffic 1400 ’full_dates.dat’ using 1:2 bioperl-guts 266 subscribers (20-June-2005) ’full_guts_dates.dat’ using 1:2 bioperl-l 1500 subscribers (20-June-2005) 1200 1000 800 600 400 200 0 01/01/1996 01/01/1997 01/01/1998 01/01/1999 01/01/2000 01/01/2001 01/01/2002 01/01/2003 01/01/2004 01/01/2005 01/01/2006 BOSC 2005
  • 9. Code Growth 800 350000 ’plot.dat’ using 1:2 ’plot.dat’ using 1:3 700 300000 600 250000 500 200000 400 150000 300 100000 200 50000 100 0 0 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 Modules per release Lines per release BOSC 2005
  • 10. CVS Commits 300 ’ChangeLog.sum.dat’ using 1:2 250 200 150 100 50 0 01/01/1999 01/01/2000 01/01/2001 01/01/2002 01/01/2003 01/01/2004 01/01/2005 01/01/2006 BOSC 2005
  • 11. Bioperl in Action # Convert sequence formats use Bio::SeqIO; my $in = Bio::SeqIO->new(-format => ‘genbank’, -file=> ‘file.gbk’); my $out = Bio::SeqIO->new(-format =>’fasta’,-file =>’>file.fas’); while( my $seq = $in->next_seq ) { $out->write_seq($seq); } ---------------------------------- use Bio::SearchIO; my $in = Bio::SearchIO->new(-format => ‘blast’, -file=> ‘result.blastp’); while( my $r = $in->next_result ) { while( my $h = $r->next_hit ) { while( my $hsp = $h->next_hsp ) { print $hsp->query->start, “..”,$hsp->query->end,” ”; } } BOSC 2005 }
  • 12. Bioperl in Action # Convert sequence formats use Bio::SeqIO; my $in = Bio::SeqIO->new(-format => ‘embl’, -file=> ‘file.embl’); my $out = Bio::SeqIO->new(-format =>’gcg’,-file =>’>file.gcg’); while( my $seq = $in->next_seq ) { $out->write_seq($seq); } ---------------------------------- use Bio::SearchIO; my $in = Bio::SearchIO->new(-format => ‘fasta’, -file => ‘result.fastp’); while( my $r = $in->next_result ) { while( my $h = $r->next_hit ) { while( my $hsp = $h->next_hsp ) { print $hsp->query->start, “..”,$hsp->query->end,” ”; } } BOSC 2005 }
  • 13. Measuring sucess • 2002 Paper has 100+ Citations • Bits and pieces used in many informatics tools • Bio::SearchIO (Rfam, In-Paranoid) • Generic Genome Browser (Stein et al, 2002) • Comparative Genomics Library (Yandell, Mungall, et al) • [EnsEMBL] BOSC 2005
  • 14. Bioperl Taught At Tutorials and Courses CSHL Bioinformatics Software Courses Various mini courses Bootcamp 2004 MIT, Duke, EBI, Pasteur (Montreal) BOSC 2005
  • 15. Knowledge of Bioperl is A Marketable Skill http://bioag.byu.edu/botany/homepage/botweb/jobs_Ph.D.htm Academic Facilities Coordinator II (Facilities) http://www.ce.com/education/Bioinformatics-Certificate- Bioinformatics Position at the UCR Genomics Institute Program-10082109.htm UC, Riverside Track 1: Suggested electives for computer scientists and IT professionals * Advanced Sequence Analysis in Bioinformatics (2 units) The Center for Plant Cell Biology (CEPCEB) in the Genomics Institute of the University of California, Riverside, * Gene Expression and Pathways (2 units) invites applicants for an Academic Facilities Coordinator II position, an academic-track 11-month appointment. * Protein Structure Analysis in Bioinformatics (2 units) Salary for the position is commensurate with education and experience. * Design and Implementation of Bioinformatics Infrastructures (3 units) * DNA Microarrays - Principles, Applications and Data Analysis (2 Units) The successful applicant will be expected to organize a small bioinformatics team to provide support to The Center * Parallel and Distributed Computing for Bioinformatics (2 units) for Plant Cell Biology. This team will implement currently available bioinformatics tools including relational database Track 2: Suggested electives for molecular biologists and other scientists support and will develop user-specific data-mining tools. The appointee will be expected to develop research * Introduction to Programming for Bioinformatics II*** (3 units) collaborations with the faculty and teach or organize short courses that will inform that will inform the local * Introduction to Programming for Bioinformatics III (3 units) community about the available bioinformatics resources. Applicants must have a Ph.D. in the Biological Sciences BioPerl for Bioinformatics (2 units) * (Plant Biology is preferred). Applicants with experience in leading a bioinformatics group will be given preference. * Design and Implementation of Bioinformatics Infrastructures (3 units) Additionally, the applicant must be proficient in one or more programming languages (PERL, PYTHON, JAVA, C++) * Gene Expression and Pathways (2 units) and have a good understanding of database design and implementation. In addition, applicants should have * Protein Structure Analysis in Bioinformatics (2 units) experience using one of the open source bioinformatics frameworks such as BIOJAVA or * DNA Microarrays - Principles, Applications and Data Analysis (2 units) BIOPERL. The applicant should have experience with software collaboration tools such as CVS. The applicant will be expected to oversee the purchase, installation, and management of the necessary computer hardware and http://www.foothill.fhda.edu/bio/programs/bioinfo/curric.shtml software required to provide bioinformatics support to several users. A good understanding of the UNIX operating system and systems administration is also an important qualification. CAREER CERTIFICATE REQUIREMENTS (49 units)* http://careers.psgs.com/CareerOppLocation2.asp?location=BETHESDA Biotechnology Core Courses (14 units) CF-046: Bioinformatics Specialist BTEC 51A Cell Biology for Biotechnology (3 units) BTEC 52A Molecular Biology for Biotechnology (3 units) The NIAID Office of Technology and Information Systems (OTIS)/Bioinformatics and Scientific IT Program (BSIP) is seek-ing BTEC 65 DNA Electrophoretic Systems (1 unit) a bioinformatics specialist. The position includes managing our bioinformatics services and applications, training NIH BTEC 68 Polymerase Chain Reaction (1 unit) scientists, creating and modifying simple bioinformatics software, and collaborating with NIH scientists on specific projects. BTEC 71 DNA Sequencing & Bioinformatics (1 unit) [snip] BTEC 76 Introduction to Microarray Data Analysis (2 unit) BTEC 64 Protein Electrophoretic Systems (1 unit) The qualified candidate must hold a Master's Degree (or equivalent) and three years of experience or a Ph. D in life science BTEC 66 HPLC (2 units) or computer science. The candidate must have strong interpersonal, written and oral communication skills and be a lateral thinker. Must be able to communicate current bioinformatics technology in a clear and precise manner and to discuss Computer Science Core Courses (30 units) projects with scientists and advise what relevant tools may be used or implemented, have the ability to locate relevant data/ CIS 52A Introduction to Data Management Systems (5 units) information and put this into the context of projects they work on. Candidate must have expert knowledge of UNIX (Mac OS CIS 52B2 Introduction to Oracle SQL (5 units) X Darwin a plus), with the ability to install and configure command-line-based applications and services. These include: UNIX CIS 68A Introduction to UNIX (5 units) windowing systems (X11, FreeX86), UNIX-based open source scientific applications, Perl and BioPerl scripts, CIS 68E Introduction to PERL (5 units) HTML, XML. Must have experience with one or more of the following: · Web services (WSDL, SOAP) · Genomics and CIS 68H Introduction to BioPerl (5 units) proteomics · Protein structure prediction/visualization. · Sequence analysis, alignment and database searches. · Regular BOSC 2005 COIN81 Bioinformatics Tools & Databases (5 units) expressions and relational database queries. · Data mining, text mining · Biostatistics
  • 16. Why has the Project Worked? • A critical mass of individuals who wanted to solve a common problem • Infrastructure that works • Sense of community • Sharing information, tutorials BOSC 2005
  • 17. Some Key Developers Not Pictured Richard Adams Scott Cain Sean Davis Stefan Kirov Nathan Haigh Marc Logghe Chad Heikki Aaron Steve Jason Shawn Allen Ewan Chris M Hilmar Lincoln Brian Chris D BOSC 2005
  • 18. Infrastructure • Easily accessible CVS, web service • Reliable • Machines that we own • Additional services can be added • Moving to faster machines this summer BOSC 2005
  • 19. Collaborative Nature of the Project • Folks working on similar problems • Pooled development resources • Friendly & interactive way to do science • Shared user support load... BOSC 2005
  • 20. Open Source is Good! • Fix things faster • (If you really care about the fix) • Everyone can see solution, audit code • Contribution can be modular • “Give a little, get a lot” • More developers for less resources BOSC 2005
  • 21. Open Source Doesn’t Solve Everything • Best laid plans • Decision by consensus ... • ... Free-for-all committing • Things don’t “just happen” • Need strong leadership, vision, and commitment • Documentation tends to lag • Interesting, needed code trumps boring BOSC 2005
  • 22. Great Power == Great Responsibility • Making releases • Point releases maintain API • Attempt to maintain backwards compatibility between stable releases • Many people depend on the code • Additional modules should maintain continuity BOSC 2005
  • 23. We Know how to have fun! BOSC 2005
  • 24. Bioperl: Present Day • Parsers for • Sequence & alignment parsing • Gene Prediction output • Phylogenetic trees • Weight Matricies (TFBS) • Ontologies • Structure data • Various CompBio, MolEvol apps BOSC 2005
  • 25. Bioperl: Present Day • Tools for • Sequence manipulation (I/O, manip) • Graphical sequence rendering • Sequence & alignment stats, popgen tests, molevo tests • Access to Local & Remote Sequence Databases BOSC 2005
  • 26. Present Day Stats • 780+ modules • 336,000 lines of code • ~9,000 tests in unit test ‘t’ dir • 57 utility scripts, 60 example scripts • 11 HOWTO docs • 44 Q&A in FAQ BOSC 2005
  • 27. Bioperl Magic • Using interfaces to hide specifics • Bio::DB::GenBank & Bio::Index::GenBank • Objects can masquerade like other objects • Bio::SeqFeature::Generic, Bio::Location, DB::GFF::RelSegment, DB::GFF::Feature BOSC 2005
  • 28. Bioperl Magic use Bio::DB::GenBank; use Bio::DB::Fasta; my $db = Bio::DB::GenBank->new; my $idx = Bio::DB::Fasta->new($fafile_dir); my $seq1 = $db->get_Seq_by_acc($acc); my $seq2 = $idx->get_Seq_by_acc($acc); use Bio::DB::FileCache; my $cachedb = Bio::DB::FileCache->new($db); my $acc = $cachedb->get_Seq_by_acc($acc); BOSC 2005
  • 29. ...The Problems • Interfaces system is cluttered • Learning curve is steep • Nonsensical to some people • OO Code is too slow • Developer community needs growth BOSC 2005
  • 30. Why is Bioperl Slow (in some places)? 1. Object-Oriented Perl is a Hack! 1. $class->SUPER::new(@_); 2. $class->SUPER::_rearrange(@_); 3. Object overhead can be high. BOSC 2005
  • 31. Research with Bioperl • Genome annotation pipeline • Genome Browsing • Comparative genomics • Phylogenomics BOSC 2005
  • 32. S.cerevisiae Existing Genome C.glabrata Annotation K.lactis New K.waltii Annotation A.gossypii D.hansenii C.guilliermondii C.lusitaniae Hemiascomycota Y.lipolytica P.anserina C.globsum N.crassa M.grisea Euascomycota F.graminearum A.fumigatus A.nidulans H.capsulatum C.immitis Archiascomycota S.pombe C.cinereus P.chrysosporium Basidiomycota C.neoformans U.maydis Zygomycota R.oryzae H.sapiens Protein ML tree based on 100 random orthologs A.thaliana 10
  • 33. Genome Annotation Pipeline Rfam tRNAscan ZFF to GFF3 GFF to AA SNAP FASTA predicted Proteins proteins Twinscan all-vs-all GFF2 to GFF3 Bio::DB::GFF Tools::Glimmer Glimmer SearchIO protein to Genscan Genome genome Tools::Genscan coordinates homologs HMMER MCL SearchIO SearchIO and BLASTZ orthologs BLASTN Tools::GFF GLEAN GFF2 to GFF3 exonerate (combiner) Proteins Gene protein2genome families exonerate ESTs est2genome BOSC 2005
  • 34. Genome Browsing BOSC 2005
  • 35. Comparative Genomics • Find orthologs, align, build trees • Assess gene structure among orthologs • Find gene families • Assess gene structure • Copy number among genomes BOSC 2005
  • 36. Comparative Genomics: Gene Structure Species phylogeny Bio::Tree/TreeIO Bio::DB Bio::AignIO Bio::SimpleAlign Bio::SeqIO MUSCLE Alignments Score shared orthologs Bio::Coordinate with Introns intorn positions on Tree BOSC 2005
  • 37. Phylogenomics Species phylogeny Bio::TreeIO gene MrBayes families Bio::DB PhyML Classify tree Bio::SeqIO Bio::AignIO MUSCLE differences ProtML PUZZLE Bin genes by orthologs topology BOSC 2005
  • 38. Bioperl - The Future BOSC 2005
  • 39. Where Should Bioperl be Headed? • Concise set of functionality that just works? • Supporting leading edge of research? • Beginner level tools? • Advanced tools? BOSC 2005
  • 40. Challenges • Logistics • Building and maintaining large projects takes a lot of time • Project leaders with vision (and time) • Face-to-face time invaluable for coordination • Need more dedicated developers BOSC 2005
  • 41. Starting Over? • Years of organic development don’t make a wholly consistent toolkit • Keeping POD up to date vs API maintenance • Autodoc-ing? • Perl6 rewrite? • Do we start a Bioperl 2, from scratch? BOSC 2005
  • 42. Supporting the Toolkit • Consider applying for funding to support a developer? • Finding companies to provide more dedicated user support? • Gatherings (hackathons) • DocFests to increase documentation • Better incentives for bug fixes BOSC 2005
  • 43. My Priority Queue • Simplified API, hidden interfaces • Overview docs point functionality to modules • Speed • Extensibility BOSC 2005
  • 44. A Plan • Continue Bioperl 1.x • Try and fix interface overload • Work out speed issues (maybe) • Encourage dabbling into a 2.0 codebase • Outline some key principals in new codebase. • Consider prototypes & Perl6 BOSC 2005
  • 45. We’ve Grown Up • Over the course of the project we have • Gotten married • Had kids • Changed continents • Changed jobs BOSC 2005
  • 46. BOSC 2005
  • 47. Thanks • Developers - past and present • Ewan Birney • Lincoln Stein • Chris Dagdigian • Brian Osborne BOSC 2005

×