Computational infrastructure for NGS data analysis Pablo Escobar email@example.com
Index Introduction to supercomputing (HPC, high performance computing) NGS computational infrastructures Sequencing Centers Medical Genome Project MGP - Sevilla Centro Nacional Análisis genómico CNAG - Barcelona Beijing Genomics Institute BGI - China Alternatives: Cloud computing Grid computing
Computational infrastructure for NGS In NGS we have to process really big amounts of data, which is not trivial in computing terms. Big NGS projects require supercomputing infrastructures
Computational infrastructure for NGSThese infrastructures are expensive and not trivial to use, we require: Acondicionated data center (servers room). This is expensive Computing cluster: Many computing nodes (servers) High performance storage (hard disks) Fast networks (10Gb ethernet, infiniband...) Skilled people in computing ( sysadmins and developers). In CNAG currently 30 staff - >50% informatics
Big infrastructure cluster Distributed memory cluster Starting at 20 computing nodes 160 to 240 cores amd64 (x86_64) is the most used cpu architecture At least 48GB ram per node Fast networks 10Gbit Infiniband Batch queue system (sge, condor, pbs, slurm) Optional MPI and GPUs environment depending on project requirements
Big infrastructure storage Distributed filesystem for high performance storage (starting at 100TB) Lustre GPFS Ibrix parallel nfs glusterFS Storage is the most expensive NFS is not a good option for supercomputing
Big infrastructure Starting at 200.000€ 200.000€ is just the hardware Plus data center (computers room) Plus informatics salary Not every partner knows about supercomputing. SGI Bull IBM HP
Middle-size infrastructure ”Small” distributed filesystem ( around 50TB). ”Small” cluster (around 10 nodes, 80 to 120 cores). At least gigabit ethernet network. Price range: 50.000 – 100.000 € (just hardware) plus data center and informatics salary
Small infrastructure Recommended at least 2 machines 8 or 12 cores each machine. 48Gb ram minimum each machine. BIG local disk. At least 4TB each machine As much local disks as we can afford Price range: starting at 8.000€ - 10.000€ (two machines)
Medical genome projectStorage racks IBRIX filesystem front-ends
MGP raw data generation a solid sequencer run 7 days running Generates around 4TB Only the four solid sequencers working full time can generate around 12TB each week. 12TB just of raw data. After running bioinformatics analysis more data is generated Raw data size grows really fast New sequencer models New reagents
Most used operating system is GNU/LINUX Source: http://www.top500.org/stats/list/36/osfam
Alternatives – cloud computing Pros Flexibility. You pay what you use. Don´t need to maintain a data center. Cons Transfer big datasets over internet is slow. You pay for consumed bandwidth. That is a problem with big datasets. Lower performance, specially in disk read/write. Privacy/security concerns. More expensive for big and long term projects.
Alternatives – grid computing Pros Cheaper. More resources available. Cons Too heterogeneous environment. Slow connectivity (specially in Spain). Requires a lot of time to find good resources in the grid.