Bioinformatics and BioPerl

Published on

Feb 2006 Raleigh Perl Mongers talk, explaining data formats, OO Perl modules, and limits of current technology.

Published in: Business, Technology

  1. 1. Bioinformatics and BioPerl Jason Stajich Duke University
  2. 2. Why Compute Biology? • Biology is an information science too! • Quantitative information has always been a large part of biology • Recent advances have made techniques like sequencing high throughput • Genomes (long strings of a 4-base alphabet) • Proteomes (short strings of a 20-amino-acid alphabet)
  3. 3. Why Perl for Biology? • Most things are represented in text (letters for sequence, numbers for locations) • Want to write quick script to munge data, get stats, etc - not debug mem leaks. • Hashes Rock! • Wealth of available libraries for visualization, Web interface, and existing Biology tools (BioPerl!)
  4. 4. BioPerl • An open source project providing Perl modules for life sciences computation and data mining • Focused mostly on genomic applications • Started in 1996 • 20-30 major developers, ~1500 subscribed to the mailing list • 810 modules in core distro • another 200+ in related packages
  5. 5. • Core - parsers, objects, interfaces • Run - wrappers for applications • Ext - C-extensions for alignment and linking to C libs • DB - RDBMS for sequence and feature data • Pedigree, Microarray, GUI - specialized packages
  6. 6. Today’s Special • About BioPerl • Bioinformatics tasks and BioPerl to the rescue • BioPerl ins and outs • How you can dig in to help or use the toolkit
  7. 7. Brief Timeline, 1996-2005 [timeline figure] • Releases: Bioperl 0.01-0.04, Releases 0.05-0.06, Bioperl 0.07, Bioperl 1.0 release, Bioperl 1.2.3 released, Bioperl 1.4 released, Bioperl 1.5 released • Modules and packages: Bio::PreSeq, Bio::UnivAln, Bio::Struct, Bio::SeqIO, Bio::Index, Bio::DB, BLAST modules, bioperl-ext, bioperl-run, Bioperl-db, bioperl-microarray started, BioCORBA • People: Steven Brenner, Steve Chervitz, Chris Dagdigian, Richard Resnick, Lew Gramer; Ewan Birney becomes project leader after Steve Chervitz • Meetings and courses: VSNS-BCD Course, Bioperl-99 mtg, BOSC2000, BOSC2001, BOSC2002, BOSC2003, BOSC2004, BOSC2005, Hackathons AZ to ZA, Singapore Hackathon, Poster at ISMB, Bioperl Bootcamp at UMontreal • Papers and documentation: bptutorial written, HOWTOs started, Bioperl paper, Gbrowse paper, BioPipe paper published, 300+ Paper references • Other: OMG Bio-Objects, “Core” Founded, bio.perl.org website, EnsEMBL launched, BioJava, BioPython launched, bioperl-db to have official release, BioP
  8. 8. Code ethos • Vision of module(s) set by code owner • She/He who writes it, “wins” argument • Project leader/core developers • can remove code before a release to ensure a working release • Enforce code and module continuity • Preserve API as much as possible
  9. 9. Gratuitous statistics
  10. 10. Bioperl Mailing List Traffic [plot, 01/01/1996 to 01/01/2006] • bioperl-l: 1500 subscribers (20-June-2005) • bioperl-guts: 266 subscribers (20-June-2005)
  11. 11. Code Growth [two plots] • Modules per release • Lines per release
  12. 12. CVS Commits [plot, 01/01/1999 to 01/01/2006]
  13. 13. Citations in PubMed http://www.bioperl.org/wiki/BioPerl_publications
  14. 14. Measuring success • 2002 Paper has 300+ Citations • Bits and pieces used in many informatics tools • Academic • Bio::SearchIO (Rfam, In-Paranoid) • Generic Genome Browser (Stein et al, 2002) • Comparative Genomics Library (Yandell, PLoS CompBio 2006) • Commercial • SciTegic Discovery Pipeline • VectorNTI
  15. 15. Bioperl Taught At • Tutorials and Courses • CSHL Bioinformatics Software Courses • Various mini courses: MIT, Duke, EBI, Pasteur • Bootcamp 2004 (Montreal)
  16. 16. Knowledge of Bioperl is A Marketable Skill

      http://bioag.byu.edu/botany/homepage/botweb/jobs_Ph.D.htm
      Academic Facilities Coordinator II (Facilities)
      Bioinformatics Position at the UCR Genomics Institute, UC, Riverside
      The Center for Plant Cell Biology (CEPCEB) in the Genomics Institute of the University of California, Riverside, invites applicants for an Academic Facilities Coordinator II position, an academic-track 11-month appointment. Salary for the position is commensurate with education and experience.
      The successful applicant will be expected to organize a small bioinformatics team to provide support to The Center for Plant Cell Biology. This team will implement currently available bioinformatics tools including relational database support and will develop user-specific data-mining tools. The appointee will be expected to develop research collaborations with the faculty and teach or organize short courses that will inform the local community about the available bioinformatics resources. Applicants must have a Ph.D. in the Biological Sciences (Plant Biology is preferred). Applicants with experience in leading a bioinformatics group will be given preference. Additionally, the applicant must be proficient in one or more programming languages (PERL, PYTHON, JAVA, C++) and have a good understanding of database design and implementation. In addition, applicants should have experience using one of the open source bioinformatics frameworks such as BIOJAVA or BIOPERL. The applicant should have experience with software collaboration tools such as CVS. The applicant will be expected to oversee the purchase, installation, and management of the necessary computer hardware and software required to provide bioinformatics support to several users. A good understanding of the UNIX operating system and systems administration is also an important qualification.

      http://www.ce.com/education/Bioinformatics-Certificate-Program-10082109.htm
      Track 1: Suggested electives for computer scientists and IT professionals
      * Advanced Sequence Analysis in Bioinformatics (2 units)
      * Gene Expression and Pathways (2 units)
      * Protein Structure Analysis in Bioinformatics (2 units)
      * Design and Implementation of Bioinformatics Infrastructures (3 units)
      * DNA Microarrays - Principles, Applications and Data Analysis (2 units)
      * Parallel and Distributed Computing for Bioinformatics (2 units)
      Track 2: Suggested electives for molecular biologists and other scientists
      * Introduction to Programming for Bioinformatics II*** (3 units)
      * Introduction to Programming for Bioinformatics III (3 units)
      * BioPerl for Bioinformatics (2 units)
      * Design and Implementation of Bioinformatics Infrastructures (3 units)
      * Gene Expression and Pathways (2 units)
      * Protein Structure Analysis in Bioinformatics (2 units)
      * DNA Microarrays - Principles, Applications and Data Analysis (2 units)

      http://www.foothill.fhda.edu/bio/programs/bioinfo/curric.shtml
      CAREER CERTIFICATE REQUIREMENTS (49 units)*
      Biotechnology Core Courses (14 units)
      BTEC 51A Cell Biology for Biotechnology (3 units)
      BTEC 52A Molecular Biology for Biotechnology (3 units)
      BTEC 65 DNA Electrophoretic Systems (1 unit)
      BTEC 68 Polymerase Chain Reaction (1 unit)
      BTEC 71 DNA Sequencing & Bioinformatics (1 unit)
      BTEC 76 Introduction to Microarray Data Analysis (2 unit)
      BTEC 64 Protein Electrophoretic Systems (1 unit)
      BTEC 66 HPLC (2 units)
      Computer Science Core Courses (30 units)
      CIS 52A Introduction to Data Management Systems (5 units)
      CIS 52B2 Introduction to Oracle SQL (5 units)
      CIS 68A Introduction to UNIX (5 units)
      CIS 68E Introduction to PERL (5 units)
      CIS 68H Introduction to BioPerl (5 units)
      COIN81 Bioinformatics Tools & Databases (5 units)

      http://careers.psgs.com/CareerOppLocation2.asp?location=BETHESDA
      CF-046: Bioinformatics Specialist
      The NIAID Office of Technology and Information Systems (OTIS)/Bioinformatics and Scientific IT Program (BSIP) is seeking a bioinformatics specialist. The position includes managing our bioinformatics services and applications, training NIH scientists, creating and modifying simple bioinformatics software, and collaborating with NIH scientists on specific projects. [snip]
      The qualified candidate must hold a Master's Degree (or equivalent) and three years of experience or a Ph. D in life science or computer science. The candidate must have strong interpersonal, written and oral communication skills and be a lateral thinker. Must be able to communicate current bioinformatics technology in a clear and precise manner and to discuss projects with scientists and advise what relevant tools may be used or implemented, have the ability to locate relevant data/information and put this into the context of projects they work on. Candidate must have expert knowledge of UNIX (Mac OS X Darwin a plus), with the ability to install and configure command-line-based applications and services. These include: UNIX windowing systems (X11, FreeX86), UNIX-based open source scientific applications, Perl and BioPerl scripts, HTML, XML. Must have experience with one or more of the following:
      · Web services (WSDL, SOAP)
      · Genomics and proteomics
      · Protein structure prediction/visualization.
      · Sequence analysis, alignment and database searches.
      · Regular expressions and relational database queries.
      · Data mining, text mining
      · Biostatistics
  17. 17. OSS => Fun
  18. 18. Tasks • Sequence files and annotations • Gene prediction • Sequence alignments • Pairwise comparisons • Multiple alignment • Visualization • Evolution - Trees and Rates
  19. 19. Sequences and Features [diagram: a genomic sequence annotated with TSS, exon, and PolyA site features] • Genomic Sequence with 3 exons • 1 Transcript Start Site (TSS) • 1 Poly-A Site
  20. 20. Sequence File Formats • Simple formats - without features • FASTA (Pearson), Raw, GCG • Rich Formats - with features and annotations • GenBank, EMBL • Swissprot, GenPept • XML - BSML, CHAOS, GAME, TIGRXML, CHADO
  21. 21. Fasta format
      >ID Description(Free text)
      AGTGATGATAGTGAGTAGGA
      >gi|number|emb|ACCESSION
      AGATAGTAGGGGATAGAG
      >gi|number|sp|BOSS_7LES
      MTMFWQQNVDHQSDEQDKQAKGAAPTKRLN
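The FASTA layout above (a `>` header line holding an ID and free-text description, followed by one or more sequence lines) can be read with a few lines of core Perl. The sketch below is a minimal illustration only; `parse_fasta` and the record field names are hypothetical and are not part of BioPerl, which handles this format through Bio::SeqIO.

```perl
use strict;
use warnings;

# Parse FASTA text into a list of records: { id, desc, seq }.
# parse_fasta is a hypothetical helper, not a BioPerl function.
sub parse_fasta {
    my ($text) = @_;
    my @records;
    for my $line (split /\n/, $text) {
        $line =~ s/\s+$//;                 # drop trailing whitespace
        next if $line eq '';
        if ($line =~ /^>(\S*)\s*(.*)$/) {
            # Header: the first word is the ID, the rest is free text.
            push @records, { id => $1, desc => $2, seq => '' };
        }
        elsif (@records) {
            # A sequence may span several lines; concatenate them.
            $records[-1]{seq} .= $line;
        }
    }
    return @records;
}

my $fasta = <<'END';
>ID Description(Free text)
AGTGATGATAGTGAGTAGGA
>gi|number|emb|ACCESSION
AGATAGTAGGGGATAGAG
>gi|number|sp|BOSS_7LES
MTMFWQQNVDHQSDEQDKQAKGAAPTKRLN
END

my @records = parse_fasta($fasta);
for my $rec (@records) {
    printf "%s\t%d\n", $rec->{id}, length $rec->{seq};
}
```

Each record is a plain hash, so the IDs can be used directly as keys for lookups or counts.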
  22. 22. Rich Formats • Combine • Sequence data • Bibliographic references • Taxonomic information • Features • Annotations
  23. 23. GenBank Format
  24. 24.
      LOCUS       DDU63596     310 bp    DNA     INV     14-MAY-1999
      DEFINITION  Dictyostelium discoideum Tdd-4 transposable element flanking
                  sequence, clone p427/428 right end.
      ACCESSION   U63596
      NID         g2393749
      KEYWORDS    .
      SOURCE      Dictyostelium discoideum.
        ORGANISM  Dictyostelium discoideum
                  Eukaryota; Dictyosteliida; Dictyostelium.
      REFERENCE   1  (bases 1 to 310)
        AUTHORS   Wells,D.J.
        TITLE     Tdd-4, a DNA transposon of Dictyostelium that encodes proteins
                  similar to LTR retroelement integrases
        JOURNAL   Nucleic Acids Res. 27 (11), 2408-2415 (1999)
      FEATURES             Location/Qualifiers
           source          1..310
                           /organism="Dictyostelium discoideum"
                           /strain="AX4"
                           /db_xref="taxon:44689"
                           /clone="p427/428"
           misc_feature    5.12
                           /note="Fuzzy location"
           misc_feature    join(J00194:(100..202),1..245,256..258)
                           /note="Location partly in another entry"
      BASE COUNT      118 a     46 c     67 g     79 t
      ORIGIN
              1 gtgacagttg gctgtcagac atacaatgat tgtttagaag aggagaagat tgatccggag
             61 taccgtgata gtattttaaa aactatgaaa gcgggaatac ttaatggtaa
  25. 25.
      FH   Key             Location/Qualifiers
      FH
      FT   source          1..310
      FT                   /db_xref="taxon:44689"
      FT                   /mol_type="genomic DNA"
      FT                   /organism="Dictyostelium discoideum"
      FT                   /strain="AX4"
      FT                   /clone="p427/428"
      XX
      SQ   Sequence 310 BP; 118 A; 46 C; 67 G; 79 T; 0 other;
           gtgacagttg gctgtcagac atacaatgat tgtttagaag aggagaagat tgatccggag        60
           taccgtgata gtattttaaa aactatgaaa gcgggaatac ttaatggtaa actagttaga       120
           ttatgtgacg tgccaagggg tgtagatgta gaaattgaaa caactggtct aaccgattca       180
           gaaggagaaa gtgaatcaaa agaagaagag tgatgatgaa tagccaccat tactgcatac       240
           tgtagccctt acccttgtcg caccattagc cattaataaa aataaaaaat tatataaaaa       300
           ttacacccat                                                              310
      //
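The SQ line of the EMBL record (like BASE COUNT in the GenBank record) summarizes base composition. As a small illustration of the "quick script to get stats" idea, the sketch below tallies bases with a hash. It uses only the first 60-base sequence line of the record, so its counts are for that line and are not meant to reproduce the totals on the SQ line; the variable names are illustrative.

```perl
use strict;
use warnings;

# First sequence line of the record, with the spacing as printed.
my $line = 'gtgacagttg gctgtcagac atacaatgat tgtttagaag aggagaagat tgatccggag';

# Keep letters only: this strips the blanks between 10-base blocks
# (and would also strip any position numbers).
my $seq = lc $line;
$seq =~ s/[^a-z]//g;

# Tally each base in a hash keyed by the base letter.
my %count;
$count{$_}++ for split //, $seq;

for my $base (sort keys %count) {
    printf "%s: %d\n", $base, $count{$base};
}
printf "total: %d\n", length $seq;
```

Running the same tally over all sequence lines of a record gives a quick consistency check against the composition stated in its header.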
  26. 26. Sequences, Features, and Annotations • Sequence - DNA, RNA, AA • Feature container • Feature - Information with a Sequence Location • Annotation - Information without explicit Sequence location
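The distinction above can be sketched as a plain Perl data structure: a sequence string, a list of features that each carry a location, and annotations that describe the record as a whole. This is an illustration under assumed names only; the toy sequence, the hash keys, and `feature_seq` are hypothetical and do not reflect how BioPerl lays out its objects.

```perl
use strict;
use warnings;

# A sequence record as a plain hash: the sequence string, features that
# carry a location, and annotations that do not. All field names are
# illustrative, not BioPerl's object layout.
my %record = (
    seq      => 'AAACCCGGGTTTAAACCC',
    features => [
        { type => 'exon', start => 4,  end => 9  },   # 1-based, inclusive
        { type => 'exon', start => 13, end => 15 },
    ],
    annotation => { organism => 'Dictyostelium discoideum' },
);

# Return the part of the sequence covered by one feature.
sub feature_seq {
    my ($rec, $feat) = @_;
    my $len = $feat->{end} - $feat->{start} + 1;
    return substr($rec->{seq}, $feat->{start} - 1, $len);
}

for my $feat (@{ $record{features} }) {
    printf "%s %d..%d %s\n",
        $feat->{type}, $feat->{start}, $feat->{end},
        feature_seq(\%record, $feat);
}
```

Coordinates are 1-based and inclusive here, matching the `1..310` style of location seen in the GenBank and EMBL feature tables, so the conversion to Perl's 0-based `substr` happens in one place.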
  27. 27. Parsing Sequences • Bio::SeqIO • multiple drivers: genbank, embl, fasta,... • Sequence objects • Bio::PrimarySeq • Bio::Seq • Bio::Seq::RichSeq
  28. 28. Look at the Sequence object • Common (Bio::PrimarySeq) methods • seq() - get the sequence as a string • length() - get the sequence length • subseq($s,$e) - get a subsequence • translate(...) - translate to protein [DNA] • revcom() - reverse complement [DNA] • display_id() - identifier string
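To make concrete what these methods compute, here is a plain-Perl sketch of the same operations on a bare string. It is not BioPerl's implementation: the subs below are stand-ins written for illustration, they work on strings rather than sequence objects, and `translate` assumes the standard genetic code read in frame from the first base.

```perl
use strict;
use warnings;

my $dna = 'ATGAATGATGAA';

# subseq($s, $e): 1-based, inclusive coordinates.
sub subseq {
    my ($seq, $s, $e) = @_;
    return substr($seq, $s - 1, $e - $s + 1);
}

# revcom(): reverse the string, then complement each base.
sub revcom {
    my ($seq) = @_;
    my $rc = scalar reverse $seq;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# Standard genetic code, built with codons enumerated in T, C, A, G order.
my @bases = qw(T C A G);
my $aas   = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %codon;
my $i = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $codon{"$b1$b2$b3"} = substr($aas, $i++, 1);
        }
    }
}

# translate(): read codons from position 1; unknown codons become 'X'.
sub translate {
    my ($seq) = @_;
    my $protein = '';
    for (my $p = 0; $p + 3 <= length($seq); $p += 3) {
        $protein .= $codon{ uc substr($seq, $p, 3) } // 'X';
    }
    return $protein;
}

printf "length:    %d\n", length $dna;
printf "subseq:    %s\n", subseq($dna, 1, 3);
printf "revcom:    %s\n", revcom($dna);
printf "translate: %s\n", translate($dna);
```

One difference worth keeping in mind: in BioPerl, `revcom()` and `translate()` return new sequence objects, so the string is obtained by calling `seq()` on the result.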
  29. 29. Using a Sequence my $str = "ATGAATGATGAA"; my $seq = Bio::PrimarySeq->new(-seq => $str,
