E biothon workshop 2014 04 15 v1


  1. 1. e-Biothon V. Breton (breton@clermont.in2p3.fr) LPC Clermont-Ferrand, IdGC CNRS-IN2P3 http://france-grilles.fr Credit: N. Bard, A. Franc, JF Gibrat Extreme Performance Computational Science workshop Tokyo, April 15th 2014
  2. 2. Table of contents • What are the computing challenges of life sciences? • France Grilles: a multidisciplinary distributed e-infrastructure for science • E-Biothon: an HPC platform for research in life sciences
  3. 3. Generalities on sequencing • Genome = DNA sequence (4 nucleotides: A, C, G, T) – Smallest non-viral genome: Carsonella ruddii (0.16 Mbp) – Largest genome: Polychaos dubium (670 Gbp)
  4. 4. Evolution of sequencing techniques • Sanger technology: 500 bp sequences • 454 technology: 10⁵ reads of 450 to 600 bp sequences • Illumina technology: 10⁶ reads of 100 bp sequences • Current projects (Tara): 10⁷ reads of 100 to 400 bp sequences • Explosion of data set size: data analysis? Algorithms? Heuristics? • Tara: http://oceans.taraexpeditions.org/
  5. 5. Data production is distributed • 2558 high-throughput "next generation" sequencing facilities in the world, located in 920 centers (only 10 with more than 15 machines) • Source: omicspmaps.com
  6. 6. Data production grows faster than Moore's law
  7. 7. Sequencing scenarios • Studying a new genome requires assembly – The process of taking a large number of short DNA sequences and putting them back together to create a representation of the original – Algorithms based on read overlapping benefit from large RAM (1 TB) -> HPC • Working with a reference genome requires comparative analysis – Alignment algorithms (BLAST) find regions of local similarity between sequences – Phylogeny algorithms (PhyML) infer evolutionary relationships between genomes – Comparative analyses are easily parallelized at data level -> HTC
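The claim that comparative analyses parallelize easily at the data level can be illustrated with a minimal sketch. Each pair of sequences is compared independently, so the pairs can simply be handed out to workers. The `identity` function below is a toy stand-in for a real aligner such as BLAST, and the function names and the thread pool are assumptions made for illustration, not part of the talk; on a grid, each pair (or batch of pairs) would typically be a separate job.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def identity(a, b):
    """Toy similarity: fraction of matching positions over the shorter sequence.

    Hypothetical stand-in for a real alignment tool such as BLAST.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def compare_all(seqs, workers=4):
    """Compare every unordered pair of named sequences.

    The comparisons share no state, so they can run in any order
    and on any worker: this is data-level parallelism.
    """
    pairs = list(combinations(sorted(seqs), 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(lambda p: identity(seqs[p[0]], seqs[p[1]]), pairs))
    return dict(zip(pairs, scores))


if __name__ == "__main__":
    demo = {"a": "ACGT", "b": "ACGA", "c": "TTTT"}
    for pair, score in compare_all(demo).items():
        print(pair, score)
```

Threads are used here only to keep the sketch short; because no comparison depends on another, the same decomposition carries over to separate processes or to independent grid jobs.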
  8. 8. Summary • Life sciences have specific computational challenges – Data production grows faster than Moore's law – Permanent need to compare new data with existing data • Life sciences needs can be addressed on multidisciplinary IT infrastructures (e-infrastructures) – HPC resources are best suited to genome assembly – Grid/cloud HTC resources are well suited to comparative analysis • Life sciences are among the main users of the French national grid/cloud production infrastructure
  9. 9. France Grilles • Is a Scientific Interest Group… – Created in 2010 by 8 partners: CEA, CNRS, CPU, INRA, INRIA, INSERM, MESR, RENATER… – To steer and coordinate the national strategy in the fields of grids and clouds • Vision: – Build and operate a national distributed computing infrastructure open to all sciences and to developing countries
  10. 10. France Grilles model • France Grilles does not own the resources – Resources are owned by user communities • France Grilles provides a framework – To share resources, expertise and know-how – To promote innovation and initiatives – To foster collaboration at national and international levels – To reach out to the long tail of users
  11. 11. France Grilles resources • France-Grilles backbone: LCG-France • France-Grilles spine: CC-IN2P3
  12. 12. France Grilles, a partner of EGI • EGI from 2010 to 2013: from 14 regional to 34 operations centres in 53 countries, from 188,000 jobs/day with 80,000 cores on 250 Resource Centres to 1,200,000 jobs/day with 430,000 cores on 337 Resource Centres • Technologies: grids, clouds, desktops • Source: talk by S. Newhouse, Madrid, Sept. 2013
  13. 13. Provide a common framework to all user communities
  14. 14. Provide an open environment for fruitful disciplinary and multidisciplinary research • Over 1500 scientific publications, June 2010 – April 2014 [Figure: publication counts on a logarithmic axis, 1 to 1000]
  15. 15. Virtual Imaging Platform (VIP): http://www.creatis.insa-lyon.fr/vip • Web portal: application as a service, file transfer to/from grid • Users: 479 registered users in Nov 2013 (175 in France); most used robot certificate in EGI (http://go.egi.eu/wiki.robot.users) • Scientific applications – Neuro-image analysis: brain tissue segmentation with Freesurfer – Cancer therapy simulation: prostate radiotherapy plan simulated with GATE (L. Grevillot and D. Sarrut) – Image simulation: echocardiography simulated with FIELD-II (O. Bernard et al.) – Modeling and optimization of distributed computing systems: acceleration yielded by non-clairvoyant task replication (R. Ferreira da Silva et al.) • Infrastructure: supported by EGI Infrastructure, DIRAC France-Grilles – Uses biomed VO (most used EGI VO for life sciences in 2013) – VIP accounts for ~25% of biomed's activity – VIP consumes ~50 CPU years every month
  16. 16. Collaborations with dedicated life sciences infrastructures • Institut Français de Bioinformatique (computing and storage resources at IDRIS) • France Génomique (computing and storage resources at TGCC) • France Life Imaging (infrastructure for biomedical imaging) • E-Biothon
  17. 17. E-Biothon: history • Telethon: annual fundraising by French media for the French Muscular Dystrophy Association (AFM) • From Telethon to Decrypthon – Computing infrastructure (IBM) – Research projects (CNRS) – Human resources (AFM) • From Decrypthon to E-Biothon
  18. 18. e-Biothon: an HPC platform for research in life sciences • User support • Blue Gene/P machines • Technical support • Blue Gene/P operation • Web access portal
  19. 19. E-Biothon: infrastructure • 2 IBM Blue Gene/P racks with 200 TB of storage – 2 x 1024 4-core nodes – Up to 28 TFlops peak performance • SysFera-DS web access to computing resources • 2 modes: – Standard (MPI) – HTC (1024 independent tasks in parallel)
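The 28 TFlops peak figure can be cross-checked with a short back-of-the-envelope calculation. The rack, node and core counts come from the slide; the 850 MHz clock and the 4 floating-point operations per cycle per core are the commonly published Blue Gene/P figures and are assumptions here, since the slides do not state them.

```python
# Figures taken from the slide.
RACKS = 2
NODES_PER_RACK = 1024
CORES_PER_NODE = 4

# Assumptions not stated on the slide: published Blue Gene/P core characteristics.
CLOCK_HZ = 850e6          # 850 MHz PowerPC 450 core clock
FLOPS_PER_CYCLE = 4       # dual floating-point unit with fused multiply-add


def peak_tflops(racks=RACKS, nodes=NODES_PER_RACK, cores=CORES_PER_NODE,
                clock_hz=CLOCK_HZ, flops_per_cycle=FLOPS_PER_CYCLE):
    """Theoretical peak in TFlops: every core issuing its maximum every cycle."""
    total_cores = racks * nodes * cores
    return total_cores * clock_hz * flops_per_cycle / 1e12


if __name__ == "__main__":
    print(f"{RACKS * NODES_PER_RACK * CORES_PER_NODE} cores, "
          f"{peak_tflops():.2f} TFlops peak")
```

Under these assumptions the two racks hold 8192 cores and the theoretical peak comes out just under 28 TFlops, consistent with the "up to 28 TFlops" on the slide.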
  20. 20. The E-Biothon vision is to offer a service to the user communities in life sciences • 2013-2014: first 3 projects – Jean-François Gibrat et al. (MIGALE platform, INRA Jouy-en-Josas) – Olivier Gascuel, Stéphane Guindon and Vincent Lefort (CNRS Montpellier) – Yec’han Laizet, Philippe Chaumeil, Jean-Marc Frigerio, Stéphanie Mariette, Sophie Gerber, Alain Franc (INRA BioGeCo, Bordeaux) • After 2014: open call for projects (IFB)
  21. 21. Studying synteny over a wide range of microbial genomes • Definition: similar blocks of genes in the same relative positions in the genome • Interest: the study of synteny can show how the genome is cut and pasted in the course of evolution • The MIGALE team at INRA designed an analysis pipeline to compute synteny between 2 genomes and store it in a database • E-Biothon impact: change in scale, with the capacity to compute synteny between 2000 complete bacterial genomes (7 million comparisons)
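To give a feel for this change in scale, here is a minimal sketch of how a large set of independent comparisons could be spread over parallel slots. The 7 million comparisons and the 1024-task HTC mode come from the slides; treating each comparison as one independent task and the helper name `split_evenly` are assumptions made for illustration, not a description of the MIGALE pipeline.

```python
def split_evenly(total_tasks, slots):
    """Return the number of tasks assigned to each slot, as evenly as possible.

    The first `total_tasks % slots` slots receive one extra task.
    """
    if total_tasks < 0 or slots <= 0:
        raise ValueError("total_tasks must be >= 0 and slots must be > 0")
    base, extra = divmod(total_tasks, slots)
    return [base + 1] * extra + [base] * (slots - extra)


if __name__ == "__main__":
    # 7 million comparisons over 1024 slots (assumed one comparison per task).
    sizes = split_evenly(7_000_000, 1024)
    print(len(sizes), "slots,", max(sizes), "tasks at most per slot")
```

With this mapping every slot handles a few thousand comparisons, and no slot has more than one task above any other, which keeps the slots finishing at about the same time if the comparisons have similar cost.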
  22. 22. PhyML • Phylogenetics is the study of evolutionary relationships among groups of organisms • PhyML is software that estimates maximum likelihood phylogenies from alignments of nucleotide or amino acid sequences • PhyML's original publication in 2007 is the most cited in environment and ecology (> 6000 citations) • E-Biothon impact: change in scale in the resources made available to PhyML users
  23. 23. Characterizing biodiversity
  24. 24. According to botanical theory, biodiversity is organized into species, genera, families and orders: is this confirmed by the distances between sequences?
  25. 25. Study of biodiversity in French Guiana • 16,000 different tree species in the Amazonian forest (≈ 300 in Europe) • More biodiversity in 10,000 m² of forest in French Guiana than in Europe • Decrypthon added value – Change in scale (from the local mesocenter in Bordeaux) – Millions of reads – Exact distance computation without heuristics (alignment scores) – Terabytes of data produced every week
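An exact alignment score, as opposed to a heuristic one, is typically computed by dynamic programming. The sketch below uses Smith-Waterman local alignment with a linear gap penalty as one example of such an exact method; the slides do not say which algorithm or scoring scheme the project used, so the choice of algorithm and the `match`, `mismatch` and `gap` values are assumptions.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score between a and b (Smith-Waterman, linear gaps).

    Scoring parameters are illustrative defaults, not values from the talk.
    Only two rows of the dynamic-programming table are kept in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            score = max(0, diag, up, left)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


if __name__ == "__main__":
    print(local_alignment_score("GGACGTCC", "TTACGTAA"))
```

The cost is proportional to the product of the two sequence lengths for every pair of reads, which is why doing this exactly for millions of reads calls for a change in scale in computing resources.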
  26. 26. Conclusion • Both HPC and HTC resources are increasingly needed to address life sciences data and computing challenges: – As sequencing technologies keep evolving, data production grows faster than Moore's law and is increasingly distributed – Biological data need to be constantly compared to each other (phylogenetics, comparative genomic analysis) • France is developing complementary HPC and HTC infrastructures for life sciences – Institut Français de Bioinformatique, France Génomique – E-Biothon: an HPC platform for research in life sciences – France Grilles: a multidisciplinary grid/cloud production infrastructure
  27. 27. 2558 next-generation sequencers in the world
  28. 28. Are life sciences specific with respect to computing? • What is specific to life sciences? – As sequencing technologies keep evolving, data production grows faster than Moore's law – Biological data need to be constantly compared to each other (phylogenetics, comparative genomic analysis) • What is not specific? – Data production is distributed – Multiscale modeling
