Your SlideShare is downloading. ×
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysis
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Computational infrastructure for NGS data analysis

3,387

Published on

Computational infrastructure for NGS data analysis - …

Computational infrastructure for NGS data analysis -
Pablo Escobar -
Massive sequencing data analysis workshop -
Granada 2011

Published in: Technology
0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
3,387
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
86
Comments
0
Likes
2
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. Computational infrastructure for NGS data analysis Pablo Escobar pescobar@cipf.es
  • 2. Index Introduction to supercomputing (HPC, high performance computing) NGS computational infrastructures Sequencing Centers  Medical Genome Project MGP - Sevilla  Centro Nacional Análisis genómico CNAG - Barcelona  Beijing Genomics Institute BGI - China Alternatives:  Cloud computing  Grid computing
  • 3. Computational infrastructure for NGS In NGS we have to process really big amounts of data, which is not trivial in computing terms. Big NGS projects require supercomputing infrastructures
  • 4. Computational infrastructure for NGSThese infrastructures are expensive and not trivial to use, we require:  Acondicionated data center (servers room). This is expensive  Computing cluster:  Many computing nodes (servers)  High performance storage (hard disks)  Fast networks (10Gb ethernet, infiniband...)  Skilled people in computing ( sysadmins and developers).  In CNAG currently 30 staff - >50% informatics
  • 5. Big infrastructure cluster Distributed memory cluster  Starting at 20 computing nodes 160 to 240 cores   amd64 (x86_64) is the most used cpu architecture  At least 48GB ram per node Fast networks  10Gbit  Infiniband Batch queue system (sge, condor, pbs, slurm) Optional MPI and GPUs environment depending on project requirements
  • 6. Big infrastructure storage Distributed filesystem for high performance storage (starting at 100TB)  Lustre  GPFS  Ibrix  parallel nfs  glusterFS Storage is the most expensive NFS is not a good option for supercomputing
  • 7. Big infrastructure storage
  • 8. Big infrastructure
  • 9. Big infrastructure Starting at 200.000€  200.000€ is just the hardware  Plus data center (computers room)  Plus informatics salary Not every partner knows about supercomputing.  SGI  Bull  IBM  HP
  • 10. Middle-size infrastructure ”Small” distributed filesystem ( around 50TB). ”Small” cluster (around 10 nodes, 80 to 120 cores). At least gigabit ethernet network. Price range: 50.000 – 100.000 € (just hardware)  plus data center and informatics salary
  • 11. Small infrastructure Recommended at least 2 machines  8 or 12 cores each machine.  48Gb ram minimum each machine.  BIG local disk. At least 4TB each machine  As much local disks as we can afford Price range: starting at 8.000€ - 10.000€ (two machines)
  • 12. Sequencing centers in Spain Medical Genome Project Sequencing Instruments  7 GS-FLX (Roche)  4 SolidTM 4 (Applied Biosystems) Informatics infrastructure  300 core cluster  0,5 petabyte hard disks
  • 13. Medical genome projectStorage racks IBRIX filesystem front-ends
  • 14. MGP raw data generation a solid sequencer run  7 days running  Generates around 4TB Only the four solid sequencers working full time can generate around 12TB each week. 12TB just of raw data. After running bioinformatics analysis more data is generated Raw data size grows really fast  New sequencer models  New reagents
  • 15. MGP raw data generation
  • 16. Sequencing centers in Spain CNAG Sequencing Instruments  8 Illumina Genome Analyzer Iix  6 Illumina HiSeq2000  4 Illumina cBots Informatics infrastructure  850 core cluster  1.2 petabyte hard disks  10 x 10 Gb/s link with marenostrum (Barcelona Super Computer 10,240 cores)
  • 17. CNAG
  • 18. Largest sequencing center in the world Beijing Genomics Institute (BGI)
  • 19. Largest sequencing center in the world Beijing Genomics Institute (BGI) Hardware resources Source: http://www.genomics.cn/en/platform.php?id=249
  • 20. Sequencing center resources
  • 21. Most used operating system is GNU/LINUX Source: http://www.top500.org/stats/list/36/osfam
  • 22. Alternatives – cloud computing Pros  Flexibility.  You pay what you use.  Don´t need to maintain a data center. Cons  Transfer big datasets over internet is slow.  You pay for consumed bandwidth. That is a problem with big datasets.  Lower performance, specially in disk read/write.  Privacy/security concerns.  More expensive for big and long term projects.
  • 23. Alternatives – grid computing
  • 24. Alternatives – grid computing Pros  Cheaper.  More resources available. Cons  Too heterogeneous environment.  Slow connectivity (specially in Spain).  Requires a lot of time to find good resources in the grid.
  • 25. M4NY TH4NKS :)

×