Varre_Biomanycores_BOSC2009
Usage Rights

CC Attribution License

Varre_Biomanycores_BOSC2009 Presentation Transcript

  • Biomanycores, a repository of interoperable open-source code for many-cores bioinformatics
    Jean-Stéphane Varré, Stéphane Janot, Mathieu Giraud
    contact@biomanycores.org
    Sequoia Bioinformatics, LIFL – UMR CNRS 8022 – Université Lille 1, France
    INRIA Lille Nord-Europe, France
    June 2009
    J.-S. Varré, S. Janot, M. Giraud (LIFL) Biomanycores June 2009 1 / 20
  • Outline
    High-performance computing
    Graphics Processing Units and bioinformatics
    biomanycores.org
        aim of the project
        what has been done?
        future developments
  • High Performance Bioinformatics – Manycores
    1970 – 2002: Moore's law = increasing frequencies
        problems: power consumption, heat dissipation
    from now on: Moore's law continues with multiple cores
        from multicores: dual-cores, quad-cores, octo-cores...
        to manycores: graphics processing units (GPUs)
            Nvidia GTX 285 ⇒ 30 × 8 cores, 1.2 GHz, 40 (×8) GFlops
        convergence CPU-GPU: Intel Larrabee
  • High Performance Bioinformatics – Manycores
    GPGPU = General-Purpose computation on GPU
        until 2007: tweaking graphics primitives
        2007: Nvidia CUDA
        2009: OpenCL (Khronos Group)
            Dec 08: 1.0 specification
            May 09: beta release of an Nvidia compiler
            AMD/ATI compiler coming soon
            ⇒ portable manycores applications?
    With GPGPU...
        10× / 100× peak speed-up, low costs ($50–$500)
        even with loss due to parallelism, 10× speed-up is possible
        (relatively) easy with CUDA / OpenCL, requires some learning
  • GPU + Bioinformatics: Methods
    "Graphical" GPGPU (2005/06), speed-up:
        RAxML: up to 2× (Charalambous et al. 2005)
        ClustalW: up to 7× (Liu et al. 2006)
    CUDA (since 2007), speed-up:
        mummerGPU: up to 10× (Schatz et al. 2007)
        Smith-Waterman: up to 15× (Manavski and Valle 2008)
        Neighbor-Joining: up to 26× (Liu et al. 2009)
        RNAfold: up to 17× (Risk and Lavenier 2009)
    ∼ 10 papers between 2007 and 2009
  • GPU + Bioinformatics: Specific Bioinformatics HPC Events
    HiComb (IEEE Workshop on High Performance Computational Biology): since 2002, in conjunction with IPDPS [May 09, Roma]
    PBC (Parallel Bio-Computing Workshop): since 2005, every two years, in conjunction with PPAM [Sept 09, Wroclaw]
    HiBi (Workshop on High Performance Computational Systems Biology) [Oct 09, Trento]
  • Sequoia Bioinformatics
    LIFL, INRIA, Université Lille 1, France
    H. Touzet's group, 14 people (including 5 PhD students)
    Large-scale sequence analysis
        Sequence comparisons, seed-based heuristics
        RNA, transcription factors, NRPS
    High-Performance Bioinformatics
        SIMD flexible read mapper (L. Noé, M. Gîrdea)
        GPU PWM scan / P-value (22× – 77× on a GTX 280)
        GPU ADP (6.1× – 22.8× on a GTX 280, with U. Bielefeld)
        GPU & bit-parallelism pattern matching (ongoing)
    Supported by NVIDIA (Professor Partnership, 2009)
  • GPU + Position-Weight Matrices (PWM)
    Parallel Position Weight Matrices Algorithms. M. Giraud and J.-S. Varré. ISPDC'09
    PWMs are used for modeling transcription factor binding sites, transcription start sites, protein domains, ...
        score threshold or P-value computation: requires enumerating words
        occurrences: requires quickly scanning a very long sequence
    [Figures: sequence logo (WebLogo 3.0, in bits); two plots of speedup versus matrix length comparing CPU (one thread), GeForce 8800, GTX 280 and GTX 280 (+ atomic)]
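The "occurrences" task on this slide can be illustrated with a plain CPU reference scan. This is only a minimal sketch, not the cudaPWM implementation: the function name `scan_pwm`, the representation of the matrix as one dictionary per column, and the toy scores are all assumptions made for illustration. It also assumes the sequence contains only the letters present in the matrix.

```python
def scan_pwm(sequence, pwm, threshold):
    """Return (position, score) for every window scoring >= threshold.

    pwm is a list of columns; each column maps a nucleotide to its score.
    (Hypothetical representation, chosen for readability.)
    """
    width = len(pwm)
    hits = []
    for pos in range(len(sequence) - width + 1):
        # Score of the window starting at pos: sum of one entry per column.
        score = sum(col[sequence[pos + k]] for k, col in enumerate(pwm))
        if score >= threshold:
            hits.append((pos, score))
    return hits


# Toy 3-column matrix that favours the word "TGT".
toy_pwm = [
    {"A": -1, "C": -1, "G": -1, "T": 2},
    {"A": -1, "C": -1, "G": 2, "T": -1},
    {"A": -1, "C": -1, "G": -1, "T": 2},
]
hits = scan_pwm("ACTGTTGTA", toy_pwm, 6)
```

Each window is scored independently of the others, which is presumably why the scan lends itself to one GPU thread per sequence position.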
  • HPC Bioinformatics for human beings?
    Research in High-Performance Computing: nice ideas, nice papers, but not always exploited
    A few HPC bioinformatics framework projects...
    ⇒ far from the everyday usage of bioinformaticians and biologists
  • www.biomanycores.org
    1. Share OpenCL code (currently CUDA) = public repository, open-source
    2. Make it easy = Bio* integration
    3. Benchmark algorithms, implementations, hardware
  • Already included projects
    SWcuda: Smith-Waterman protein alignment
        CRIBI Genomics, University of Padova, Italy
        S. A. Manavski, G. Valle, CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment, BMC Bioinformatics 2008, 9(S2):S10
    pknotsRG: pseudoknots of an RNA sequence
        Universität Bielefeld, Germany
        J. Reeder, P. Steffen, R. Giegerich, pknotsRG: RNA pseudoknot folding including near-optimal structures and sliding windows, Nucl. Acids. Res., 2007
    cudaPWM: scan a PWM against a DNA sequence
        Sequoia, LIFL, INRIA, Université Lille 1
        M. Giraud, J.-S. Varré, Parallel Position Weight Matrices Algorithms, ISPDC'09
    Interfaces to BioJava 1.6, BioPerl 1.52, and Biopython 1.50b
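For readers unfamiliar with the algorithm behind SWcuda, the following is a minimal CPU sketch of the Smith-Waterman local alignment score. It is not the SWcuda code: it uses a simplified linear gap penalty and a match/mismatch score instead of a substitution matrix and affine gaps, and the function name and default parameters are assumptions made for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Comparing one query against a bank repeats this computation once per bank sequence, and those comparisons are independent of each other.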
  • Biopython + CRIBI SW
    from Bio import SeqIO
    from Biomanycores import PadovaSW

    bank = SeqIO.parse(open("uniprot-start.fa"), "fasta")
    for query in SeqIO.parse(open("prot64.fa"), "fasta"):
        handle = PadovaSW.run(query, bank)
        result = PadovaSW.SWParser().parse()
        print result
  • Biopython + CRIBI SW: tests on a GeForce 8800
    biopython$ time python sw-demo.py cuda
    ** cd ../bin/ ; ./swcuda config.gpu ../tmp/swcuda.fa ../tmp/swcuda.bank ** 1.846s
    12098 results...
    [(84.0, 0, 0, 'sp|P30350|ADH1_ANAPL'), (81.0, 0, 0, 'sp|P23991|ADH1_CHICK'), (81.0,
    real 2.81 user 1.79 sys 0.27

    biopython$ time python sw-demo.py cpu
    ** cd ../bin/ ; ./swcuda config.cpu ../tmp/swcuda.fa ../tmp/swcuda.bank ** 16.604s
    12098 results...
    [(84.0, 0, 0, 'sp|P30350|ADH1_ANAPL'), (81.0, 0, 0, 'sp|P23991|ADH1_CHICK'), (81.0,
    real 17.57 user 16.42 sys 0.14

    10× – 15× paper speedup (BMC Bioinformatics 2008, 9S2)
    8.7× application speedup
    6.2× final speedup (including Biopython/Biomanycores)
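The speedups on this slide are ratios of CPU time to GPU time. A short sketch of the arithmetic, using only the timings printed above (the helper name `speedup` is an assumption for illustration):

```python
def speedup(baseline_seconds, accelerated_seconds):
    """Ratio of the slower (baseline) time to the faster (accelerated) time."""
    return baseline_seconds / accelerated_seconds


# Wall-clock ("real") times of the whole Python script, CPU vs CUDA run.
final = speedup(17.57, 2.81)

# Timings printed next to the swcuda command lines, CPU vs CUDA run.
internal = speedup(16.604, 1.846)
```

The wall-clock ratio is about 6.25, in line with the 6.2× final speedup reported on the slide. The ratio of the two swcuda timings is about 9.0, slightly above the reported 8.7× application speedup; the slide does not say how the 8.7× figure was measured, so it may come from a different run or a different timing boundary.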
  • BioPerl + CRIBI SW
    BioPerl tutorial
        use Bio::Tools::pSW;
        $factory = new Bio::Tools::pSW('-matrix' => 'blosum62.bla', '-gap' => 12, '-ext' => 2);
        $factory->align_and_show($seq1, $seq2, STDOUT);
        $aln = $factory->pairwise_alignment($seq1, $seq2);
    With biomanycores
        use Bio::SeqIO;
        use Biomanycores::PadovaSW;
        $factory = PadovaSW->new();
        $factory->swcuda($inputseq, $bank);
        @r = $factory->parse_result();
  • BioJava + PWM
    import org.biojavax.bio.seq.RichSequence;
    import org.biojava.bio.dp.SimpleWeightMatrix;
    ...
    import org.biomanycores.bio.pwm.*;
    ...
    {
        LillePWMScan scanner = new LillePWMScan(launcher);
        // read the sequence
        RichSequenceIterator it = null;
        BufferedReader in1 = new BufferedReader(new FileReader(args[1]));
        it = RichSequence.IOTools.readFastaDNA(in1, null);
        RichSequence query = it.nextRichSequence();
        // read a weight matrix
        SimpleWeightMatrix pwm = PFMParser.PARSER.get(args[2], alph, "ACGT");
        // scan the sequence
        List<PWMHit> al = scanner.scan(query, pwm, threshold);
    }
  • Challenges
    Different APIs, different philosophies
        BioJava: no external program execution?
        Object representation (alignments)
        Object existence (PWM)
    Minimal modifications to the source code of applications
        CribiSW: command-line arguments
    Real-world pipelines?
        Bio* are not HPC frameworks
        Succession of several programs
    Usage: requires CUDA / OpenCL SDK
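The "command-line arguments" point can be made concrete with a sketch of how a Bio* wrapper might drive an unmodified external binary. This is a hypothetical illustration, not the Biomanycores code: the function names, the FASTA format assumed for both input files, and the temporary file names (taken from the demo output shown earlier) are assumptions.

```python
import os
import subprocess


def format_fasta(records):
    """Render (name, sequence) pairs as FASTA text."""
    return "".join(">%s\n%s\n" % (name, seq) for name, seq in records)


def build_command(bin_dir, config, query_path, bank_path):
    """Argument list mirroring the call printed by the demo:
    ./swcuda <config> <query file> <bank file>"""
    return [os.path.join(bin_dir, "swcuda"), config, query_path, bank_path]


def run_swcuda(bin_dir, config, query, bank, tmp_dir):
    """Hypothetical wrapper: write the inputs, run the binary, return stdout."""
    query_path = os.path.join(tmp_dir, "swcuda.fa")
    bank_path = os.path.join(tmp_dir, "swcuda.bank")
    with open(query_path, "w") as out:
        out.write(format_fasta([query]))
    with open(bank_path, "w") as out:
        out.write(format_fasta(bank))  # assumes the bank is plain FASTA
    return subprocess.check_output(
        build_command(bin_dir, config, query_path, bank_path))
```

Keeping the application untouched and passing everything through files and arguments means the wrapper pays for file I/O and process start-up on every call, which is one plausible reason the end-to-end speedup is lower than the application speedup.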
  • Licenses
    Projects must have an open-source license
    Bio* interfaces: same license as the parent API
        BioJava: LGPL 2.1
        BioPerl: Perl Artistic License
        Biopython: Biopython license
  • www.biomanycores.org
    1. Share OpenCL code (currently CUDA) = public repository, open-source
        ⇒ bring new projects
    2. Make it easy = Bio* integration
        ⇒ integrate new projects
        ⇒ improve current interfaces
    3. Benchmark algorithms, implementations, hardware
        ⇒ think!