2. Index
Introduction to supercomputing (HPC, high-performance
computing)
NGS computational infrastructures
Sequencing Centers
Medical Genome Project (MGP) - Sevilla
Centro Nacional de Análisis Genómico (CNAG) - Barcelona
Beijing Genomics Institute (BGI) - China
Alternatives:
Cloud computing
Grid computing
3. Computational infrastructure for NGS
NGS requires processing very large amounts
of data, which is not trivial in computing terms.
Large NGS projects require supercomputing
infrastructure.
4. Computational infrastructure for NGS
This infrastructure is expensive and not trivial to use. It requires:
A climate-controlled data center (server room). This is
expensive.
Computing cluster:
Many computing nodes (servers)
High-performance storage (hard disks)
Fast networks (10 Gb Ethernet, InfiniBand...)
Skilled computing staff (sysadmins and developers).
At CNAG there are currently 30 staff, more than 50% of them in informatics.
5. Big infrastructure cluster
Distributed memory cluster
Starting at 20 computing nodes
160 to 240 cores
amd64 (x86_64) is the most
widely used CPU architecture
At least 48 GB RAM per node
Fast networks
10 Gbit Ethernet
InfiniBand
Batch queue system (SGE, Condor, PBS, Slurm)
Optional MPI and GPU environments, depending on project
requirements
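The core counts above follow directly from the node count if each node carries 8 to 12 cores. A minimal sketch of that arithmetic, assuming a homogeneous cluster (the function name `cluster_totals` is hypothetical, chosen for illustration):

```python
def cluster_totals(nodes, cores_per_node, ram_gb_per_node=48):
    """Return (total cores, total RAM in GB) for a homogeneous cluster.

    Assumption: every node has the same core count and the same RAM.
    """
    return nodes * cores_per_node, nodes * ram_gb_per_node


# 20 nodes is the starting point given on the slide.
for cores_per_node in (8, 12):
    cores, ram_gb = cluster_totals(20, cores_per_node)
    print(f"{cores_per_node} cores/node: {cores} cores, {ram_gb} GB RAM in total")
```

With 20 nodes this gives 160 and 240 cores, matching the range on the slide. Real clusters often mix node types, so treat this as a rough sizing aid only.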
6. Big infrastructure storage
Distributed filesystem for high-performance
storage (starting at 100 TB)
Lustre
GPFS
Ibrix
Parallel NFS (pNFS)
GlusterFS
Storage is the most expensive component
Plain NFS is not a good option for supercomputing
9. Big infrastructure
Starting at €200,000
€200,000 covers just the hardware
Plus the data center (server room)
Plus IT staff salaries
Not every partner knows about supercomputing.
SGI
Bull
IBM
HP
10. Middle-size infrastructure
"Small" distributed filesystem (around 50 TB).
"Small" cluster (around 10 nodes, 80 to 120 cores).
At least a Gigabit Ethernet network.
Price range: €50,000 to €100,000 (hardware only),
plus data center and IT staff salaries.
11. Small infrastructure
At least 2 machines recommended.
8 or 12 cores per machine.
48 GB RAM minimum per machine.
Big local disk: at least 4 TB per machine.
As many local disks as we can afford.
Price range: starting at €8,000 to €10,000 (two
machines).
14. MGP raw data generation
A SOLiD sequencer run
Takes 7 days
Generates around 4 TB
The four SOLiD sequencers alone, working
full time, can generate around 12 TB
each week.
That is 12 TB of raw data alone. Running the
bioinformatics analysis generates even
more data.
Raw data size grows really fast
New sequencer models
New reagents
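To put the weekly figure in perspective, here is a rough sketch of how quickly raw data fills a storage system. It uses the 12 TB per week quoted above and the 100 TB starting size given for a big distributed filesystem. The `analysis_factor` parameter is a hypothetical multiplier for data produced by downstream analysis; the slides give no value for it, so the 2.0 used below is only an illustration.

```python
def weeks_until_full(capacity_tb, raw_tb_per_week, analysis_factor=1.0):
    """Estimate how many weeks it takes to fill capacity_tb of storage.

    analysis_factor is a hypothetical multiplier: 1.0 counts raw data only,
    2.0 assumes analysis output doubles the weekly volume.
    """
    return capacity_tb / (raw_tb_per_week * analysis_factor)


# Raw data only: 100 TB filled at 12 TB per week.
print(round(weeks_until_full(100, 12), 1))  # 8.3

# Illustrative assumption: analysis doubles the weekly volume.
print(round(weeks_until_full(100, 12, analysis_factor=2.0), 1))  # 4.2
```

Under these assumptions a 100 TB filesystem lasts roughly two months on raw data alone, which is why storage planning has to account for growth.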
21. The most widely used operating system is GNU/Linux
Source: http://www.top500.org/stats/list/36/osfam
22. Alternatives – cloud computing
Pros
Flexibility.
You pay only for what you use.
No need to maintain a data center.
Cons
Transferring big datasets over the internet is slow.
You pay for consumed bandwidth, which is a problem
with big datasets.
Lower performance, especially in disk read/write.
Privacy/security concerns.
More expensive for big, long-term projects.
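The transfer-time concern can be made concrete with a back-of-the-envelope estimate. This sketch computes how long one week of raw data (the 12 TB quoted earlier) would take to upload. The link speeds are assumed values for illustration, and `efficiency` is a hypothetical parameter for protocol overhead and shared links.

```python
def transfer_days(size_tb, link_mbit_per_s, efficiency=1.0):
    """Days needed to move size_tb terabytes over a link of the given speed.

    Uses decimal units (1 TB = 1e12 bytes, 1 Mbit = 1e6 bits).
    efficiency is a hypothetical fraction of the nominal bandwidth
    that is actually achieved (1.0 = ideal link).
    """
    size_bits = size_tb * 1e12 * 8
    seconds = size_bits / (link_mbit_per_s * 1e6 * efficiency)
    return seconds / 86400


# Assumed link speeds, in Mbit/s, for one week of raw data (12 TB).
for mbit in (100, 1000, 10000):
    print(f"{mbit} Mbit/s: {transfer_days(12, mbit):.1f} days")
```

Even on an ideal 100 Mbit/s link, 12 TB takes about 11 days, longer than the week it took to produce. Real throughput is usually lower than the nominal link speed.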
24. Alternatives – grid computing
Pros
Cheaper.
More resources available.
Cons
The environment is too heterogeneous.
Slow connectivity (especially in Spain).
Finding good resources in the grid takes a lot of
time.