BioPerl at 15: New Features, New Directions

Christopher J. Fields, University of Illinois, cjfields@uiuc.edu
*Mark A. Jensen, Fortinbras Research and SRA International, mark_jensen@sra.com
Jason E. Stajich, University of California at Riverside, jason.stajich@ucr.edu

The BioPerl Project, an open-source Perl toolkit for bioinformatics, was initiated in 1995 and became instrumental in the automated organization and analysis of original Human Genome Project data. Since then, BioPerl has become a complete object-oriented Perl environment for bioinformatics development, with modules to perform a wide range of bioinformatics functions, including multi-format parsing and translation, object-relational model databasing, EMBL and NCBI web service access, and external program execution.

The BioPerl developer community is actively responding to the far-reaching changes in the field that have taken place over the last several years. Major goals are: (1) to provide new functionality useful to researchers at the cutting edge of bioinformatics, (2) to reorganize BioPerl into smaller application-oriented packages, (3) to deprecate older modules whose utility has declined substantially, and (4) to continue to expand and improve documentation, so that BioPerl remains useful and relevant in the years ahead.

Google Summer of Code

BioPerl has provided mentorship for GSoC projects for the past three years. These projects have resulted in material additions to the codebase, and have focused on expanding BioPerl's capabilities in format parsing and large file processing.

The BioPerl wiki (http://bioperl.org)

The wiki is now the central location for all BioPerl documentation: installation, module POD, HOWTO articles, code snippets, and personnel descriptions. It has played an important role as the new face of BioPerl and as a landing place for the developer discussions that are taking BioPerl forward.

BioPerl on GitHub (http://github.com/bioperl)

BioPerl recently migrated all active repositories to GitHub from OBF-hosted Subversion. With the move to git comes decentralization and more fluid, independent development. We expect this to improve the BioPerl response time both to bugs and to new developments in the field, as well as to increase new developer recruitment and community participation.

[Poster column headings: Community participation and development; New features; New directions]

Next-gen sequencing support

Bringing BioPerl up to speed for next-gen sequence data handling has led to efforts along three lines: file format standardization, common command-line tool wrapping, and BioPerl object system I/O integration tailored to next-gen data.

Formats

BioPerl and other Bio* projects recently published a collaborative effort to standardize FASTQ formats, including variants for Illumina and Solexa platforms. These formats are now in use across BioPerl and the Bio* projects. Support for important binary formats (BAM, BigWIG) is provided by wrappers for command-line tools, and by the integration of fast XS-based Perl modules such as Lincoln Stein's Bio-SamTools and Bio-BigFile CPAN packages.

Wrappers

Enhancements to the Bio::Tools::Run::WrapperBase system have made it easier to add BioPerl wrapper modules for external programs, and to integrate these into other modules that implement pipelines using BioPerl sequence and alignment objects as I/O.

Tracking NCBI developments

In the past year, NCBI has released a fully updated BLAST toolkit, blast+ (†), and has been encouraging a move from their EUtilities RESTful interface to a newer SOAP interface (‡). BioPerl has responded with Bio::Tools::Run::StandAloneBlastPlus and Bio::DB::SoapEUtilities.
These were designed not only to update the API interface, but also to add I/O layers that accept and parse messages into familiar BioPerl objects, and to build in straightforward methods for creating pipelines of blast+ program analyses or EUtilities fetches.

† ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST
‡ http://eutils.ncbi.nlm.nih.gov/entrez/eutils/soap/v2.0/DOC/esoap_help.html

[Figure labels, external tools: bedtools, bowtie, bwa, minimo, newbler, samtools]

BioPerl object support: Bio::Assembly

The Bio::Assembly system has been extensively updated to include reading and/or writing assemblies in MAQ, BAM, SAM, BWA, and other formats. Assembly object support is integrated into run wrappers for bwa, bedtools, maq, and samtools. Future work will incorporate new sequence objects that are optimized for large files (through the work of GSoC student Jun Yin).

    use Bio::Tools::Run::Maq;
    my $maq = Bio::Tools::Run::Maq->new();
    # Run the maq wrapper on the read and reference files; returns an assembly object
    my $assy_obj = $maq->run('read1.fastq', 'refseq.fas', 'read2.fastq');

Timeline

BioPerl has grown in its user and developer base since those early days. New developers and collaborations have contributed not only key modules, but also important design methodologies and refactoring over the years that have helped BioPerl to maintain its usefulness and relevance. Discontinuities followed by increases in lines of code over time reflect a high level of community flexibility and dedication in pursuit of DTWT.

General wrapper facility

A set of modules (Bio::Tools::WrapperMaker) is under development that will increase the responsiveness of BioPerl development by providing an XML-based way for users themselves to specify the interface for their favorite command-line programs, at the same time creating a common, consistent API for executing those programs and accessing output.

Intermediate layers for large file handling and generic parsing

BioPerl parsers generally take raw data to Perl objects with no intermediate layer.
This induces prohibitive overhead when parsing large files, and can also limit user flexibility: parsing may be desired, but not the BioPerl objects. The first problem is being tackled by attaching backend handlers onto container class constructors that are able to persist records of large files efficiently, creating BioPerl objects only as needed or desired. The second problem has led to experiments in generic parsing: data file records are parsed into a simple stream of hashes, which can then be directed where the user desires, either into the creation of BioPerl objects as usual, or elsewhere.

Biome and BioPerl 6

BioPerl has been object-oriented from the beginning, but suffers from the weaknesses of Perl 5 objects: very high overhead, loose encapsulation, limited object introspection, and the lack of built-in interfaces and roles, among other things. These issues are being addressed in two ways: in Perl 5 through the Moose classes and dependencies, and in the creation of Perl 6. BioPerl is exploring both paths to true objects with the experimental Biome (BioPerl with Metaobject Extensions) and BioPerl 6 projects.

[Figure: Biome role as interface]

Shattering the Monolith

BioPerl continues to be distributed as just a handful of packages. The core package in particular has grown to 341 files, comprising 874 classes with 23,146 tests. Maintenance and installation issues are barriers to developers and users alike. We are in the process of splitting the core into reasonable, application-related chunks. This, together with the git migration, should significantly improve BioPerl management.

The BioPerl Core Development Team is Sendu Bala, Rob Buels, Christopher Fields, Mark Jensen, Hilmar Lapp, Heikki Lehväslaiho, Aaron Mackey, Dave Messina, Brian Osborne, Jason Stajich, and Lincoln Stein. Key support is provided by Chris Dagdigian and Mauricio Herrera Cuadra. Florent Angly and Dan Kortschak are lead developers of projects discussed here.
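
The generic parsing idea described under "Intermediate layers for large file handling and generic parsing" can be illustrated with a minimal sketch in core Perl. This is not BioPerl code: the function name fasta_hash_stream and the hash keys id, desc, and seq are assumptions chosen for illustration. It reads FASTA records lazily and hands back one plain hash per record, so the caller decides whether to build objects or do something else with the data.

```perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl function): returns an iterator that
# yields one plain hash reference per FASTA record read from $fh, and
# undef once the input is exhausted. Only one record is held in memory.
sub fasta_hash_stream {
    my ($fh) = @_;
    my $pending;    # header of the next record, read while finishing the previous one
    return sub {
        my $header = $pending;
        undef $pending;
        my @seq_lines;
        while ( defined( my $line = <$fh> ) ) {
            $line =~ s/\r?\n\z//;      # strip the line ending
            next if $line eq '';
            if ( $line =~ /^>(.*)/ ) {
                if ( defined $header ) {    # start of the following record
                    $pending = $1;
                    last;
                }
                $header = $1;
            }
            elsif ( defined $header ) {
                push @seq_lines, $line;
            }
        }
        return unless defined $header;
        my ( $id, $desc ) = split /\s+/, $header, 2;
        return {
            id   => $id   // '',
            desc => $desc // '',
            seq  => join( '', @seq_lines ),
        };
    };
}

# Usage: stream records from an in-memory file and print id and length.
my $fasta = ">seq1 first record\nACGT\nTTGA\n>seq2\nGGCC\n";
open my $fh, '<', \$fasta or die "cannot open in-memory file: $!";
my $next_record = fasta_hash_stream($fh);
while ( my $rec = $next_record->() ) {
    printf "%s\t%d\n", $rec->{id}, length $rec->{seq};
}
close $fh;
```

Because each record is an ordinary hash, the same stream could feed an object constructor, a database loader, or a simple filter without changing the parser.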

[Table: Google Summer of Code projects]

Year  Sponsoring Institution  Student       Project                          Example Module
2008  NESCent                 Mira Han      PhyloXML parsing                 Bio::TreeIO::phyloxml
2009  NESCent                 Chase Miller  NeXML parsing                    Bio::Nexml
2010  OBF                     Jun Yin       Alignment subsystem refactoring  in progress

Source: http://www.ohloh.net/p/bioperl

[Figure: maq assembly pipeline. Stages: convert plain text sequence; map reads to reference seq; assemble map into consensus; extract info from consensus. Commands: fasta2bfa, fastq2bfq, map, mapmerge, assemble, mapview, cns2fq]

[Figure: Biome role diagram. Labels: class consumes role; Class; Role; must instantiate reqd abstract method; consuming class possesses role members; instance possesses concrete role methods]
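
The role-as-interface idea from the Biome section can be sketched with Moose, assuming Moose is installed from CPAN. The package names My::Role::HasLength, My::Seq, and My::Broken and the method seq_length are hypothetical and are not taken from Biome itself. The role declares one required (abstract) method and supplies one concrete method; a consuming class must provide the required method, and its instances gain the concrete one.

```perl
use strict;
use warnings;

# Hypothetical role, for illustration only (not Biome's actual API).
package My::Role::HasLength {
    use Moose::Role;

    # Abstract part of the interface: consumers must implement 'seq'.
    requires 'seq';

    # Concrete method that every consuming class receives.
    sub seq_length {
        my $self = shift;
        return length $self->seq;
    }
}

# A class that consumes the role. The 'seq' attribute is declared before
# 'with' so that its accessor satisfies the role's requirement.
package My::Seq {
    use Moose;
    has seq => ( is => 'ro', isa => 'Str', required => 1 );
    with 'My::Role::HasLength';
}

package main;

my $seq = My::Seq->new( seq => 'ACGTAC' );
print "length: ", $seq->seq_length, "\n";
print "does role: ", ( $seq->does('My::Role::HasLength') ? 'yes' : 'no' ), "\n";

# A class that omits the required method cannot consume the role:
# composition dies, and the eval block catches the error.
my $composed = eval {
    package My::Broken;
    use Moose;
    with 'My::Role::HasLength';
    1;
};
print "broken class composed: ", ( $composed ? 'yes' : 'no' ), "\n";
```

The failure for My::Broken happens when the role is applied, not when a method is first called, which is the main practical difference from a duck-typed interface convention in plain Perl 5.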
