BioLib Development Report (BOSC 2009)
C and C++ libraries for BioPerl, BioJAVA, BioPython, BioRuby...
Pjotr Prins (pjotr.prins at wur.nl)


Wageningen University, Dept. of Nematology; Groningen Bioinformatics Center




                                                              BioLib Development Report (BOSC 2009) – p.
The stated problem

Many high-level languages are used in biology (Perl, R, Java...)
Duplication of effort across all Bio* projects (BioPerl, BioConductor, BioJAVA...),
in particular for data IO/parsing/interpretation (Alan’s keynote)




What if?

What if you need some functionality (e.g. linear regression) in Perl? You can:
   Roll your own in Perl (performance?)
   Bind against an existing C library using Perl XS (ugh)
   Bind using SWIG (better, but a one-off, like Math::GSL)
   Bind using SWIG with BioLib (all languages)
   In fact, it may already be there (GSL or Rlib)
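
As a point of reference for the first option, a hand-rolled ordinary least squares fit takes only a few lines. The sketch below is in Python 3 rather than Perl, and the name `linear_fit` is purely illustrative: it is not part of BioLib, GSL or Rlib. A bound C library would typically add speed and better numerical robustness.

```python
def linear_fit(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two or more paired observations")
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x, and the x/y cross-deviation sum.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


if __name__ == "__main__":
    intercept, slope = linear_fit([0, 1, 2, 3], [1, 3, 5, 7])
    print("intercept=%g slope=%g" % (intercept, slope))
```

Every Bio* project that needs this ends up writing and testing its own copy, which is the duplication the DRY-DRO slide argues against.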

DRY-DRO

Do not repeat yourself (DRY)
Do not repeat ourselves (DRO)
Bio*: BioPerl, BioPython, BioRuby, BioJAVA, BioConductor, BioHaskell, BioCPP, ...
Limited pool of programmers in bioinformatics
Usually 2 or 3 competing implementations
Use existing implementations


Why bother?

Open Source Software is about eyes




Eyes!

Eyes like these!




Eyes (3)

Eyes like these!...




Eyes (5)

Well, realistically...




BioLib project

Objectives:
   Utilize existing C/C++ libraries
   Create mappings to all Bio* languages
   Focus on correctness and performance
   A central place (plumbing)
   An OBF-affiliated project



Power Trio

Plumbing power trio:
   Git - modular version control
   CMake - makefile generator
   SWIG - simplified wrapper and interface generator




Power trio (1)

Git
  Version control on steroids
  What source control should be
  Easy branching of development
  Submodules




Power trio (2)

CMake
  Generator for make files
  Very modular approach
  Resolves complex dependencies
  Looks like a simple programming language
  Easy on the eyes and mind



Power trio (3)

SWIG
  Code generator for mappings done right:
    Rules for generating code
    Macros (DRY)
    Pattern matching
    Flexible
    Supports many languages




Achievements (year one)

  Affyio: Affymetrix arrays (357 methods; 10K lines)
  Staden: Sequencer trace files (95; 16K)
  GSL: GNU Scientific Library (2702; 200K)
  Rlib: R routines (> 176; 43K)
  R/qtl: Quantitative genetics (> 100; 10K)*
  Libsequence: Sequence analysis (> 1000; 21K)*
  Bio++: Sequence analysis (> 1000; 52K)*

Code base: 350K lines; USD 10 million of R&D

Source tree

|-- clibs
|   |-- affyio-1.8
|   |-- biolib_R
|   |-- biolib_microarray
|   |-- libsequence-1.6.6
|-- mappings
|   `-- swig
|       |-- perl
|       |   |-- affyio
|       |   |-- staden_io_lib
|       |   `-- test
|       |-- python
|       |-- ruby

104 directories, 668 files




Adding a C lib

Unpack the C/C++ library in ./src/clibs/modulename
Add a CMake file (compiles into a .so shared library)
Create the Perl mapping in ./src/mappings/swig/perl/module
Add a SWIG .i file
Add a CMake file (compiles into a .pm and a .so shared library)

CMake goodies

# Defining a C library build in BioLib:
SET (M_NAME staden_io_lib)
SET (M_VERSION 1.11.6)
FIND_PACKAGE(ZLIB REQUIRED)
BUILD_CLIB()

ADD_LIBRARY(${LIBNAME} SHARED
  array.c
  compress.c
  compression.c
  ctfCompress.c
  # (...) remaining source files elided on the slide
)

INSTALL_CLIB()




CMake for Perl

# Defining a C library mapping for Perl
SET (USE_ZLIB TRUE)
SET (USE_INCLUDEPATH io_lib)

FIND_PACKAGE(MapPerl)

POST_BUILD_PERL_BINDINGS()
TEST_PERL_BINDINGS()
INSTALL_PERL_BINDINGS()




SWIG Map

%include <Read.h>

#define TT_ANY 0
#define TT_ZTR 7

typedef struct
{
    int         format;
    char       *trace_name;
    int         NPoints;
    int         NBases;
    (...)
} Read;

Read *read_reading(char *fn, int format);
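
To make concrete what a generated mapping has to expose, the sketch below mirrors the visible fields of `Read` in Python 3 using the standard `ctypes` module. This is not what SWIG emits and it is not BioLib code: ctypes is swapped in only so the example runs without the C library. The fields elided on the slide are left out, and the trace name `example` is made up.

```python
import ctypes

# Constants copied from the interface file above.
TT_ANY = 0
TT_ZTR = 7


class Read(ctypes.Structure):
    """Partial mirror of the C struct: only the fields shown above.

    The real struct has more members (elided on the slide), so this
    layout must not be used to read memory owned by the C library.
    """
    _fields_ = [
        ("format", ctypes.c_int),
        ("trace_name", ctypes.c_char_p),
        ("NPoints", ctypes.c_int),
        ("NBases", ctypes.c_int),
    ]


if __name__ == "__main__":
    # Built by hand here; a real binding would obtain it from read_reading().
    r = Read(format=TT_ZTR, trace_name=b"example", NPoints=0, NBases=766)
    print(r.format, r.trace_name, r.NBases)
```

With ctypes this mirror has to be written and kept in sync by hand for every language; SWIG derives the equivalent accessors from the single interface file.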



Perl

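use biolib::staden_io_lib;

# $fn holds the path to a trace file
$result = staden_io_lib::read_reading($fn,
                                      $staden_io_lib::TT_ANY);
# Struct members are reachable through generated accessors or as hash keys
print("format=",staden_io_libc::Read_format_get($result),"\n");
print("NBases=",$result->{NBases},"\n");
print("base=",staden_io_libc::Read_base_get($result),"\n");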

Outputs:

format=7
NBases=766
base=NCTTGGGAAAGCATAAACCATGTATTATCGAATTCGAGCT
     CGGTCCCAACTTAATTGTACA...




Python

import biolib.staden_io_lib as io_lib

# procsrffn holds the path to a trace file
result = io_lib.read_reading(procsrffn,
                             io_lib.TT_ANY)
print(result.format)
print(result.NBases)
print(result.base)

Outputs:

7
766
NCTTGGGAAAGCATAAACCATGTATTATCGAATTCGAGCT
CGGTCCCAACTTAATTGTACA...
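
Because the binding exposes struct members as plain attributes, ordinary Python code can consume the result. Here is a minimal Python 3 sketch, assuming only the three attributes used above. `summarize_read` is a hypothetical helper, not a BioLib function, and a `SimpleNamespace` with made-up values stands in for the object returned by `read_reading` so the example runs without BioLib installed.

```python
from types import SimpleNamespace


def summarize_read(read, width=40):
    """Return a short text summary of a trace read.

    `read` is any object with `format`, `NBases` and `base` attributes.
    The base string is wrapped at `width` characters per line.
    """
    bases = read.base
    lines = [bases[i:i + width] for i in range(0, len(bases), width)]
    header = "format=%d NBases=%d" % (read.format, read.NBases)
    return "\n".join([header] + lines)


if __name__ == "__main__":
    # Stand-in object; the values are invented for illustration.
    fake = SimpleNamespace(format=7, NBases=8, base="NCTTGGGA")
    print(summarize_read(fake, width=4))
```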




For the Perl coder

Adding functionality in language of choice
Easier deployment: 'install biolib-perl'
Shared correctness testing
Generated API documentation




For the authors

Independent source trees
Increased exposure (Ruby, Perl...)
Added unit/integration testing environment
Deployment, multi-platform support (Linux, OSX, Windows)
No autoconf pain (./configure and friends)
Implicit access to other libraries (GSL, Rlib)
Online generated API documentation

Future work

Automated API documentation (with doctests)
More libraries (EMBOSS, NCBI, ...)
New code (HPC)
More languages (Java, R, OCaml, ...)
Bio* integration (CPAN, Ruby gems, Python eggs)
Debian/Fedora/OSX/Windows packages
More platforms (Windows without Cygwin)
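
The doctest idea in the first item can be illustrated with Python's standard `doctest` module: examples embedded in the documentation are executed as tests, so generated API docs cannot silently drift from the code. The function `gc_content` is a made-up example for this sketch, not a BioLib API.

```python
import doctest


def gc_content(seq):
    """Return the fraction of G and C bases in a sequence.

    >>> gc_content("GGCC")
    1.0
    >>> gc_content("ATGC")
    0.5
    >>> gc_content("")
    0.0
    """
    if not seq:
        return 0.0
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / float(len(seq))


if __name__ == "__main__":
    # Runs every example found in the docstrings and reports mismatches.
    failures, tests = doctest.testmod()
    print("%d tests, %d failures" % (tests, failures))
```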

Credits

Ben Bolstad (Affyio), James Bonfield (Staden), Karl Broman (R/qtl)

Jonathan Leto (GSL SWIG)

Xin Shuai (Google SoC libsequence)

Adam Smith (Google SoC Bio++)

Oswaldo Trelles, José Manuel Mateos-Duran and Andrés Rodríguez (UMA)

Chris Fields (BioPerl), Mark Jensen (BioPerl), Hilmar Lapp (NESCent, OBF)

Jaap Bakker (WU), Geert Smant (WU), Ritsert Jansen (GBIC)




BoF

BioLib: Birds of a Feather Session (BoF) at 16:50 hours



