Hedlund_biogrid_BOSC2009


  1. 1. Biogrid – Bioinformatics for the grid Joel Hedlund <yohell@ifm.liu.se> Biogrid User and Developer Linköping University, Sweden Birds-of-a-feather session tonight: see me after this talk!
  2. 2. Outline • What is it? • What is it good for? • Does it really work? • Gory details. • Why did we do this? • Profit!
  3. 3. What is it? NDGF BIO Community Grid Bioinformatics for the Grid
  4. 4. What is it? • Unified interface ...to popular bioinformatic applications ...on shared, distributed computational resources ...using versioned and cached databases
  5. 5. What is it good for? • Burst computing – High demand for short periods of time • high during development / production • low during analysis / writing papers – Share resources to enable more efficient use • Database accessibility • Availability • Unified interface
  6. 6. What is NDGF?
  7. 7. What is NDGF? • Nordic Data Grid Facility • A WLCG Tier1 facility – Worldwide LHC Computing Grid – Stores and processes data from LHC at CERN • peak rate ≈ 1.6 Gb/s when the accelerator is running (and that’s after most of the data have been filtered away)
  8. 8. “Does it really work, this distributed thingie?”
  9. 9. “Does it really work, this distributed thingie?” Why yes, very well thank you!
  10. 10. NDGF • 96% availability (highest of all Tier1 facilities) • Third largest Tier1 facility in the world • Lowest ratio of failed ATLAS jobs • Production goals met, and beyond – Goal: 8% of all ATLAS resources (10.5% provided) – Goal: 9% of all ALICE resources (12% provided) * Data graciously stolen from Leif Nixon’s NorduNet 2008 talk. Thank you Leif :-)
  11. 11. DISTRIBUTION IS A STRENGTH
  12. 12. It enforces unification It ensures availability
  13. 13. Does it really work? It’s good enough for LHC. It’s good enough for Bioinformatics.
  14. 14. Gory details
  15. 15. Biogrid provides Optimised applications: – BLAST – ClustalW – HMMER – Muscle – Mafft Planned: molecular dynamics, phylogeny...
  16. 16. Biogrid provides Versioned, indexed and cached databases – UniProtKB (subreleases) – Uniref (subreleases) Planned: genomes (EnsEMBL), nucleotides (EMBL)...
  17. 17. Cached database access Database files are transferred to the cluster at most once per project.
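The "at most once per project" behaviour can be illustrated with a minimal sketch. This is not Biogrid's actual caching code: the function `cached_db`, the per-project cache directory layout, and the `fetch` callback are all hypothetical names introduced here for illustration.

```python
from pathlib import Path
from typing import Callable


def cached_db(cache_dir: Path, name: str, fetch: Callable[[Path], None]) -> Path:
    """Return the local path of a database file, transferring it only if absent.

    cache_dir -- hypothetical per-project cache directory on the cluster
    name      -- database file name, e.g. "uniprot_sprot.fasta.gz"
    fetch     -- hypothetical callback that writes the file to the given path
    """
    target = cache_dir / name
    if not target.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        # Download to a temporary name first, so an interrupted transfer
        # is never mistaken for a complete cached file.
        partial = target.with_name(target.name + ".part")
        fetch(partial)
        partial.replace(target)
    return target
```

Every job in the project calls `cached_db` with the same `cache_dir`; only the first call triggers a transfer, and later calls return the cached path directly.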
  18. 18. Unified Interface
  19. 19. Unified Interface
  20. 20. Unified Interface DATA RESULTS
  21. 21. Unified Interface • XRSL job description: standard in ARC grid middleware • Well-defined runtime environments – $HMMERDIR: node-local (fast) scratch dir containing db files – prepare_db: download and unpack db files on the fly from front node to $HMMERDIR
  22. 22. XRSL Job Description
      (jobName=refinehmm-family023)
      (runTimeEnvironment=APPS/BIO/HMMER2.3.2)
      (cpuTime=3000)
      (executable=refinehmm.jobscript.sh)
      (inputFiles=
        (sp.gz srm://srm.ndgf.org/biogrid/db/uniprot/UniProt14.8/uniprot_sprot.fasta.gz)
        (tr.gz srm://srm.ndgf.org/biogrid/db/uniprot/UniProt14.8/uniprot_trembl.fasta.gz)
        (family023.hmm "")
      )
      (outputfiles=
        (family023.refined.hmm "")
      )
  23. 23. XRSL Job Description
      (jobName=refinehmm-$HMM_NAME)
      (runTimeEnvironment=APPS/BIO/HMMER2.3.2)
      (cpuTime=3000)
      (executable=refinehmm.jobscript.sh)
      (inputFiles=
        (sp.gz srm://srm.ndgf.org/biogrid/db/uniprot/UniProt14.8/uniprot_sprot.fasta.gz)
        (tr.gz srm://srm.ndgf.org/biogrid/db/uniprot/UniProt14.8/uniprot_trembl.fasta.gz)
        ($HMM_NAME.hmm "")
      )
      (outputfiles=
        ($HMM_NAME.refined.hmm "")
      )
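The step from the fixed job description (family023) to the parameterised one ($HMM_NAME) suggests generating one job description per HMM. A minimal sketch of that expansion, using Python's standard `string.Template`, might look as follows. The slides do not say how the substitution was actually performed, so the function name `render_job` and the choice of templating tool are assumptions; the attribute text is copied from the slide as-is.

```python
from string import Template

# Job description text as shown on the slide, with $HMM_NAME as placeholder.
XRSL_TEMPLATE = Template('''\
(jobName=refinehmm-$HMM_NAME)
(runTimeEnvironment=APPS/BIO/HMMER2.3.2)
(cpuTime=3000)
(executable=refinehmm.jobscript.sh)
(inputFiles=
  (sp.gz srm://srm.ndgf.org/biogrid/db/uniprot/UniProt14.8/uniprot_sprot.fasta.gz)
  (tr.gz srm://srm.ndgf.org/biogrid/db/uniprot/UniProt14.8/uniprot_trembl.fasta.gz)
  ($HMM_NAME.hmm "")
)
(outputfiles=
  ($HMM_NAME.refined.hmm "")
)
''')


def render_job(hmm_name: str) -> str:
    """Fill in the placeholder for one HMM (hypothetical helper)."""
    return XRSL_TEMPLATE.substitute(HMM_NAME=hmm_name)


if __name__ == "__main__":
    # Write one job description per family; the family names are examples.
    for name in ["family023", "family024"]:
        with open(f"{name}.xrsl", "w") as handle:
            handle.write(render_job(name))
```

`substitute` raises `KeyError` if a placeholder is left unfilled, which catches typos in the template before any job is submitted.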
  24. 24. Unified Interface
      • Run on any resource I can access:
        $ ngsub myjob.xrsl
      • ...or run on my buddy’s cluster:
        $ ngsub -c kiniini.csc.fi myjob.xrsl
      • Check jobs:
        $ ngstat refinehmm-family023
        (or use Grid Monitor web interface at www.nordugrid.org)
      • Fetch results:
        $ ngget refinehmm-family*
      (Diagram labels: DATA, GRID, RESULTS)
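When many jobs are involved, the same commands can be assembled from a script. The sketch below only builds argument lists for the three commands shown on the slide and does not run them; the helper names are hypothetical, and only the `-c` option that appears on the slide is used.

```python
from typing import List, Optional


def submit_cmd(xrsl_file: str, cluster: Optional[str] = None) -> List[str]:
    """Argument list for ngsub, optionally pinned to one cluster with -c."""
    cmd = ["ngsub"]
    if cluster is not None:
        cmd += ["-c", cluster]
    cmd.append(xrsl_file)
    return cmd


def status_cmd(job_name: str) -> List[str]:
    """Argument list for checking a job with ngstat."""
    return ["ngstat", job_name]


def fetch_cmd(job_pattern: str) -> List[str]:
    """Argument list for fetching results with ngget."""
    return ["ngget", job_pattern]
```

On a machine with an ARC client installed, each list could be passed to `subprocess.run(cmd, check=True)`.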
  25. 25. What do I need? 1. A resource with ARC and Biogrid REs 2. An ARC client 3. A Grid Certificate (available from a number of global certificate authorities) 4. Time allowance on the resource (5. Biogrid VO membership: not really necessary, but it will get you 1 & 4)
  26. 26. What do I need? ...or you can just grab the RE scripts off the biogrid website, and your db of choice from the biogrid dCache.
  27. 27. Why did we do this? Bioinformatic applications... – CPU intensive – Small input and output files – “Large” databases can be cached ...are very well suited for distributed computing.
  28. 28. Profit!
  29. 29. Subclassification of the MDR superfamily • 15000 members from all kingdoms of life • 500 families 25% sequence identity • 40 human members • Different substrate specificities • Different subunit & cofactor count • 2 HMMs available for superfamily detection • None for any of the individual families
  30. 30. Subclassification of the MDR superfamily • We made HMMs for all MDR (sub)families with 20+ members. • 86 families • 34 subfamilies detected within 14 of these • 11579 / 15000 sequences classified • ≈5000 × hmmsearch vs UniProtKB Manuscript in preparation
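As a quick check of the coverage figure on this slide, the classified fraction can be computed directly from the two counts given. The function name is a hypothetical helper; the numbers are those on the slide.

```python
def classified_fraction(classified: int, total: int) -> float:
    """Fraction of sequences assigned to a family or subfamily."""
    return classified / total


if __name__ == "__main__":
    fraction = classified_fraction(11579, 15000)
    print(f"{fraction:.1%} of sequences classified")  # prints "77.2% of sequences classified"
```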
  31. 31. refinehmm • Algorithm for automated HMM refinement • Produces stable and reliable HMMs • Developed using Biogrid REs and resources Will also be open source software once the paper is out.
  32. 32. Acknowledgements
      • Olli Tourunen (Biogrid developer)
      • Bengt Persson (Biogrid PI)
      • NDGF: Michael Grønager, Josva Kleist
      • Biogrid co-applicants: Ann-Charlotte Berglund Sonnhammer, Erik Sonnhammer, Inge Jonassen
      • Supercomputing centers
        – NSC: Jens Larsson, Leif Nixon
        – HPC2N: Åke Sandgren
        – Others: C3SE, CSC, Uppmax, Lunarc, PDC, Aalborg University, Oslo University
      Joel Hedlund <yohell@ifm.liu.se> Biogrid User and Developer Linköping University, Sweden Birds-of-a-feather session tonight: see me after the talk!