Computational infrastructure for NGS data analysis

Computational infrastructure for NGS data analysis -
Pablo Escobar -
Massive sequencing data analysis workshop -
Granada 2011

Published in: Technology

Computational infrastructure for NGS data analysis

  1. Computational infrastructure for NGS data analysis. Pablo Escobar, pescobar@cipf.es
  2. Index
       • Introduction to supercomputing (HPC, high performance computing)
       • NGS computational infrastructures
       • Sequencing centers
           • Medical Genome Project (MGP), Sevilla
           • Centro Nacional de Análisis Genómico (CNAG), Barcelona
           • Beijing Genomics Institute (BGI), China
       • Alternatives:
           • Cloud computing
           • Grid computing
  3. Computational infrastructure for NGS
       • In NGS we have to process very large amounts of data, which is not trivial in computing terms.
       • Big NGS projects require supercomputing infrastructures.
  4. Computational infrastructure for NGS
       These infrastructures are expensive and not trivial to use. We require:
       • An air-conditioned data center (server room). This is expensive.
       • A computing cluster:
           • Many computing nodes (servers)
           • High performance storage (hard disks)
           • Fast networks (10Gb Ethernet, InfiniBand...)
       • People skilled in computing (sysadmins and developers).
           • CNAG currently has 30 staff, more than 50% of them in informatics.
  5. Big infrastructure: cluster
       • Distributed memory cluster
           • Starting at 20 computing nodes, 160 to 240 cores
           • amd64 (x86_64) is the most used CPU architecture
           • At least 48GB RAM per node
       • Fast networks
           • 10Gbit Ethernet
           • InfiniBand
       • Batch queue system (SGE, Condor, PBS, Slurm)
       • Optional MPI and GPU environment, depending on project requirements
  6. Big infrastructure: storage
       • Distributed filesystem for high performance storage (starting at 100TB)
           • Lustre
           • GPFS
           • Ibrix
           • Parallel NFS
           • GlusterFS
       • Storage is the most expensive component.
       • NFS is not a good option for supercomputing.
  7. Big infrastructure: storage
  8. Big infrastructure
  9. Big infrastructure
       • Starting at 200,000 €
           • 200,000 € is just the hardware
           • Plus data center (computer room)
           • Plus informatics staff salaries
       • Not every partner knows about supercomputing.
           • SGI
           • Bull
           • IBM
           • HP
  10. Middle-size infrastructure
       • "Small" distributed filesystem (around 50TB).
       • "Small" cluster (around 10 nodes, 80 to 120 cores).
       • At least a gigabit Ethernet network.
       • Price range: 50,000 to 100,000 € (just hardware)
           • Plus data center and informatics staff salaries
  11. Small infrastructure
       • At least 2 machines recommended
           • 8 or 12 cores per machine.
           • 48GB RAM minimum per machine.
           • BIG local disk: at least 4TB per machine.
           • As many local disks as we can afford.
       • Price range: starting at 8,000 to 10,000 € (two machines)
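The core counts quoted for the three infrastructure sizes follow from the node counts if each node has 8 to 12 cores. That per-node figure is stated only for the small infrastructure; for the big and middle-size clusters it is an inference from the slides' totals (160 to 240 cores on 20 nodes, 80 to 120 cores on 10 nodes). A minimal sketch of that arithmetic, with `Tier` and `core_range` as illustrative names:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """One infrastructure size from the slides (names are illustrative)."""
    name: str
    nodes: int
    min_cores_per_node: int
    max_cores_per_node: int

    def core_range(self) -> tuple[int, int]:
        """Return (lowest, highest) total core count for this tier."""
        return (self.nodes * self.min_cores_per_node,
                self.nodes * self.max_cores_per_node)


# Node counts come from the slides; 8 to 12 cores per node is stated for the
# small tier and ASSUMED for the other two (it reproduces their core totals).
TIERS = [
    Tier("big", nodes=20, min_cores_per_node=8, max_cores_per_node=12),
    Tier("middle", nodes=10, min_cores_per_node=8, max_cores_per_node=12),
    Tier("small", nodes=2, min_cores_per_node=8, max_cores_per_node=12),
]

for tier in TIERS:
    low, high = tier.core_range()
    print(f"{tier.name}: {tier.nodes} nodes, {low} to {high} cores")
```

The node counts are the slides' starting points ("starting at 20", "around 10", "at least 2"), so real clusters of each size may be larger.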
  12. Sequencing centers in Spain: Medical Genome Project
       • Sequencing instruments
           • 7 GS-FLX (Roche)
           • 4 SOLiD™ 4 (Applied Biosystems)
       • Informatics infrastructure
           • 300-core cluster
           • 0.5 petabytes of hard disk storage
  13. Medical Genome Project: storage racks and IBRIX filesystem front-ends
  14. MGP raw data generation
       • A SOLiD sequencer run
           • Takes 7 days
           • Generates around 4TB
       • The four SOLiD sequencers alone, working full time, can generate around 12TB each week.
       • That 12TB is just raw data. Running the bioinformatics analysis generates more data.
       • Raw data size grows really fast
           • New sequencer models
           • New reagents
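To put the raw data rate in perspective, here is a rough sketch of how long the MGP's 0.5 petabytes of disk would last at around 12TB of raw data per week. Both figures are from the slides. The assumptions are mine: decimal units (1 PB = 1000 TB), a constant rate, an empty filesystem at the start, and no analysis output, which the slides say adds more data on top. The function name `weeks_until_full` is illustrative.

```python
def weeks_until_full(capacity_tb: float, weekly_raw_tb: float,
                     used_tb: float = 0.0) -> float:
    """Weeks until storage fills, assuming a constant weekly data rate."""
    if weekly_raw_tb <= 0:
        raise ValueError("weekly_raw_tb must be positive")
    free_tb = max(capacity_tb - used_tb, 0.0)
    return free_tb / weekly_raw_tb


MGP_CAPACITY_TB = 500.0   # 0.5 PB from the slides, ASSUMING 1 PB = 1000 TB
MGP_WEEKLY_RAW_TB = 12.0  # raw data from the four SOLiD sequencers

weeks = weeks_until_full(MGP_CAPACITY_TB, MGP_WEEKLY_RAW_TB)
print(f"Raw data alone fills the storage in about {weeks:.1f} weeks")
```

Under these assumptions the raw data from the SOLiD instruments alone would fill the disks in under a year (about 41.7 weeks), before counting the GS-FLX instruments or any analysis results.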
  15. MGP raw data generation
  16. Sequencing centers in Spain: CNAG
       • Sequencing instruments
           • 8 Illumina Genome Analyzer IIx
           • 6 Illumina HiSeq 2000
           • 4 Illumina cBots
       • Informatics infrastructure
           • 850-core cluster
           • 1.2 petabytes of hard disk storage
           • 10 x 10 Gb/s link with MareNostrum (Barcelona supercomputer, 10,240 cores)
  17. CNAG
  18. Largest sequencing center in the world: Beijing Genomics Institute (BGI)
  19. Largest sequencing center in the world: Beijing Genomics Institute (BGI). Hardware resources. Source: http://www.genomics.cn/en/platform.php?id=249
  20. Sequencing center resources
  21. The most used operating system is GNU/Linux. Source: http://www.top500.org/stats/list/36/osfam
  22. Alternatives: cloud computing
       • Pros
           • Flexibility.
           • You pay for what you use.
           • No need to maintain a data center.
       • Cons
           • Transferring big datasets over the internet is slow.
           • You pay for consumed bandwidth. That is a problem with big datasets.
           • Lower performance, especially in disk read/write.
           • Privacy/security concerns.
           • More expensive for big and long-term projects.
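The transfer-time drawback is easy to estimate. The sketch below computes how long it would take to upload the roughly 4TB of raw data from one SOLiD run (the figure given in the MGP slides) over a few link speeds. The link speeds are hypothetical examples, not figures from the talk, and the calculation assumes decimal units and a link used at full nominal capacity unless `efficiency` is lowered.

```python
def transfer_hours(size_tb: float, link_mbit_s: float,
                   efficiency: float = 1.0) -> float:
    """Hours needed to move size_tb terabytes over a link of link_mbit_s.

    efficiency is the fraction of nominal bandwidth actually achieved
    (1.0 is an idealised, best-case assumption).
    """
    if link_mbit_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be > 0 and 0 < efficiency <= 1")
    bits = size_tb * 1e12 * 8                    # decimal TB to bits
    seconds = bits / (link_mbit_s * 1e6 * efficiency)
    return seconds / 3600


RUN_SIZE_TB = 4.0  # raw data from one SOLiD run, per the MGP slides

# HYPOTHETICAL link speeds in Mbit/s, chosen only for illustration.
for link in (100, 1_000, 10_000):
    hours = transfer_hours(RUN_SIZE_TB, link)
    print(f"{link:>6} Mbit/s: {hours:.1f} hours")
```

At an idealised 100 Mbit/s a single run takes almost 89 hours (more than three and a half days) to upload, which is a large fraction of the 7 days the run itself takes. Real transfers are usually slower than the nominal rate.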
  23. Alternatives: grid computing
  24. Alternatives: grid computing
       • Pros
           • Cheaper.
           • More resources available.
       • Cons
           • The environment is too heterogeneous.
           • Slow connectivity (especially in Spain).
           • Finding good resources in the grid takes a lot of time.
  25. M4NY TH4NKS :)
