2. Graphics processing units (GPUs)
● Specialized processing chips, initially for graphics
● Thousands of cores for parallel computations
● Software development for GPUs streamlined by Nvidia’s CUDA
● Became heavily used with the rise of neural networks
● Other accelerator architectures, such as FPGAs and TPUs/NPUs, are also available
3. GPUs in bioinformatics
● Alignment (BLAST, short-read aligners)
● Simulations of biomolecules (e.g. proteins)
● Any task that can use deep learning:
○ Variant calling with DeepVariant
○ Oxford Nanopore basecalling
○ Cell type inference in scRNA-seq
○ Many more every day...
4. When is GPU useful?
● The amount of biological data grows (at least) exponentially
● Sequences, cells, tracker data, etc.
● However:
○ The computation must be of a suitable type (highly parallel)
○ RAM requirements and transfer I/O must not outweigh the benefits
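As a rough illustration of the last point, the trade-off can be written as a back-of-envelope check: offloading pays off only when GPU compute time plus data transfer time is lower than the CPU time, and the working set has to fit in GPU memory. The function names, parameters, and numbers below are hypothetical and chosen for illustration only. Real pipelines can overlap transfers with compute, which this sketch ignores.

```python
def gpu_offload_speedup(cpu_seconds, gpu_compute_seconds,
                        bytes_moved, link_bytes_per_second):
    """Estimated end-to-end speedup of GPU offload over CPU-only execution.

    All inputs are caller-supplied assumptions. Transfer time is modelled
    as bytes_moved / link_bytes_per_second with no overlap with compute.
    A result below 1.0 means the GPU version is slower overall.
    """
    if link_bytes_per_second <= 0:
        raise ValueError("link_bytes_per_second must be positive")
    transfer_seconds = bytes_moved / link_bytes_per_second
    return cpu_seconds / (gpu_compute_seconds + transfer_seconds)


def fits_in_gpu_memory(working_set_bytes, gpu_memory_bytes):
    """True if the working set fits on the device without spilling."""
    return working_set_bytes <= gpu_memory_bytes


if __name__ == "__main__":
    # Compute-heavy job: 100 s on CPU, 2 s on GPU, 10 GB moved at 10 GB/s.
    print(gpu_offload_speedup(100.0, 2.0, 10e9, 10e9))
    # Transfer-bound job: 4 s on CPU, 0.5 s on GPU, 40 GB moved at 10 GB/s.
    print(gpu_offload_speedup(4.0, 0.5, 40e9, 10e9))
```

In the second case the transfer alone takes as long as the whole CPU run, so the estimated speedup drops below 1 even though the GPU kernel itself is eight times faster.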
5. DeepVariant
● Accurate variant calling is a very hard problem
● DeepVariant = variant caller using neural networks
● Converts pileup (alignment) data into image “channels”
● Takes advantage of Google's CNN image expertise
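To make the "pileup as image channels" idea concrete, here is a minimal sketch that turns a few aligned reads into three 2D channels (rows are reads, columns are reference positions). This is not DeepVariant's actual encoding: the channel names, the intensity values, and the simplified indel-free read format `(start, bases, qualities)` are all assumptions made for illustration.

```python
# Arbitrary illustrative pixel values per base (an assumption, not a standard).
BASE_INTENSITY = {"A": 250, "C": 30, "G": 180, "T": 100}


def pileup_to_channels(reference, reads):
    """Encode aligned reads as image-like channels.

    reference: reference sequence for the window, e.g. "ACGT".
    reads: list of (start, bases, quals) tuples; reads are assumed to be
           aligned without indels (a simplification).
    Returns a dict mapping channel name -> 2D list [read][position],
    where 0 means "no read coverage at this position".
    """
    width = len(reference)
    channels = {
        name: [[0] * width for _ in reads]
        for name in ("base", "quality", "mismatch")
    }
    for row, (start, bases, quals) in enumerate(reads):
        for offset, (base, qual) in enumerate(zip(bases, quals)):
            col = start + offset
            if not 0 <= col < width:
                continue  # read extends beyond the window
            channels["base"][row][col] = BASE_INTENSITY.get(base, 0)
            # Cap Phred quality at 40 and rescale to the 0..255 pixel range.
            channels["quality"][row][col] = min(qual, 40) * 255 // 40
            # Bright pixel where the read disagrees with the reference.
            channels["mismatch"][row][col] = 255 if base != reference[col] else 64
    return channels


if __name__ == "__main__":
    example = pileup_to_channels(
        "ACGT",
        [(0, "ACG", [40, 40, 20]), (1, "CTT", [10, 30, 40])],
    )
    for name, image in example.items():
        print(name, image)
```

Stacking such channels gives a multi-channel image per candidate site, which is the kind of input a CNN classifier can consume.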
6. DeepVariant
● Variant calling is much faster on a GPU, even though the team made an effort to optimize for CPU
● Model training is dramatically faster (and nearly impossible on CPU)
● Huge demand for this task in medical genomics!
● Illumina offers DRAGEN, an FPGA-based solution that streamlines human variant calling
8. Nanopore Basecalling
● Lots of errors (up to 30%!) initially
● Basecalling is hard for Nanopore-derived reads because the signal is continuous (homopolymers are a particular issue)
● However, NN-based methods improve every year
● Algorithms developed for speech recognition, such as connectionist temporal classification (CTC), are hugely useful
● Results in median accuracy over 98%; some reads are perfect over 1000s of bp
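The core CTC decoding rule can be sketched in a few lines: the network emits one label per signal frame (a base or a special blank), and the sequence is recovered by merging consecutive repeated labels and then dropping blanks. The example below is a greedy "best path" collapse over hypothetical per-frame labels. It is a sketch of the rule only, not the decoder of any specific basecaller, and production tools may use more elaborate decoding.

```python
BLANK = "-"  # symbol chosen here to represent the CTC blank label


def ctc_greedy_collapse(frame_labels, blank=BLANK):
    """Collapse per-frame labels into a sequence using the CTC rule.

    1. Merge runs of identical consecutive labels into one.
    2. Remove blank labels.
    A repeated base is only emitted twice if a blank separates the runs.
    """
    decoded = []
    previous = None
    for label in frame_labels:
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return "".join(decoded)


if __name__ == "__main__":
    print(ctc_greedy_collapse("AA-A-CC--G"))  # AACG
    print(ctc_greedy_collapse("AAAA"))        # A
    print(ctc_greedy_collapse("A-A-A-A"))     # AAAA
```

The last two calls show why homopolymers are difficult: a run of identical bases is only recovered at its true length if the model places a blank between each repeat.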
9. Basecalling on CPU vs GPU
● Nanopore basecalling is impractical on CPU (a big run takes days on 64 cores)
● FPGA or GPU-based calling is 1-2 orders of magnitude faster
10. Interactive GPU usage
● Model “zoos” are becoming increasingly standard in genomics and bioinformatics
● Scientists are starting to use high-level NN-based tools on a daily basis
● Effective use of such modelling requires constant parameter adjustment
11. scVI-tools for scRNA-seq
● Huge cell atlases are being generated and annotated at Sanger and worldwide
● Millions of cells will soon become billions
● Efficient label transfer and bias removal become incredibly important
● The difference becomes dramatic as dataset size increases