Mastering Bio Grid
1. Dare to compute
Mastering bioGrid
Sven Warris
Instituut voor Life Science & Technology
Leiden, November 3, 2009
2. Overview
Bioinformatics
What is a grid?
Use of grid technology in research @ Hanze
• miRNA prediction
• Interproscan
Mastering bioGrid project
GPGPU: emerging technology
Current use of grid/GPGPU @ Hanze
• Dingo @ UNSW
• Mixed Cultures
Future & PhD project
3. Bioinformatics
Biology and computer science
Basic: using software and computing power to perform database searches
Advanced:
•Development of software
•Creating large databases
Why computers?
•Complex analysis
•Large scale: gigabytes per run
•Visualization
4. Grid infrastructure
Many computers in a network
Software for access and control
Central manager to distribute work
Easy access to idle computers
No need to start several computers by hand
Fail safe
5. Grid infrastructure
Condor software: easy to use
Bioinformatics network:
• Linux
• NFS
• central user authentication
Use of almost all standard software
• BLAST, meme, ClustalW
• Own applications / scripts
• Controlled from web server
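As an illustration of how a batch of BLAST runs might be handed to Condor, the sketch below generates the text of a submit description file. The wrapper script name `run_blast.sh`, the `jobs` directory and the `chunk_N.fasta` naming are assumptions made for this example and do not come from the slides; only the submit keywords (`universe`, `executable`, `arguments`, `output`, `error`, `log`, `queue`) and the `$(Process)` macro are standard Condor.

```python
def condor_submit_description(executable, n_jobs, workdir="jobs"):
    """Return the text of a Condor submit description file for n_jobs tasks.

    The file layout (workdir/chunk_N.fasta) is a hypothetical convention
    chosen for this sketch.
    """
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    lines = [
        "universe   = vanilla",
        f"executable = {executable}",
        # Condor replaces $(Process) with 0, 1, ..., n_jobs - 1,
        # so every queued job gets its own input chunk and output files.
        f"arguments  = {workdir}/chunk_$(Process).fasta",
        f"output     = {workdir}/chunk_$(Process).out",
        f"error      = {workdir}/chunk_$(Process).err",
        f"log        = {workdir}/blast.log",
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # One job per node of a ~70-node pool; the script name is hypothetical.
    print(condor_submit_description("run_blast.sh", 70))
```

The resulting text could be saved to a file and passed to `condor_submit`; the central manager then matches each queued job to an idle machine.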
6. Grid infrastructure
Current setup:
~70 nodes
• many dual cores
• at least 1 GB RAM per core, usually 2 GB
Calculix: 8 GB -> 16 GB, 2x dual core
GPU: 4 GB, quad core, GTX 295 & 8800 GTX
Already used for many years of CPU-time!
7. Grid technology & research
Start in 2005
Installation of Condor grid software
Needed for high throughput computing
Basic configuration: resource hogging!
• full use of claimed computer
• no user access possible anymore
Used for BLAST/SW-type applications
Only possible during weekends/holidays
No students allowed (unstable configuration)
However: new possibilities!
8. Grid technology & research
Example: miRNAs
small, interfering RNAs 50-300b
distinct hairpin structure
Bonnet et al.: minimum free energy (MFE) significantly lower than that of random sequences (population)
For prediction: 1 + 1000 MFE values
Slow software and too many MFEs
Precalculate on grid: many different 'populations'
• ~1,400,000 populations (= ~1,400,000,000 MFEs)
Estimate population for candidate: single MFE needed
Use grid for prediction as well!
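The comparison of one candidate MFE against a population of MFEs from random sequences can be sketched as follows. This is a minimal illustration, not the method used in the project: the function name, the +1 correction in the empirical p-value and the z-score are assumptions, and the MFE values themselves would come from separate folding software.

```python
from statistics import mean, stdev


def mfe_significance(candidate_mfe, population_mfes):
    """Compare a candidate MFE with MFEs of random sequences.

    Returns (p_value, z_score). A lower (more negative) MFE means a more
    stable structure, so we count random MFEs that are at least as low.
    """
    n = len(population_mfes)
    if n < 2:
        raise ValueError("need at least two population values")
    at_least_as_low = sum(1 for m in population_mfes if m <= candidate_mfe)
    # Empirical p-value with a +1 correction so it is never exactly zero.
    p_value = (at_least_as_low + 1) / (n + 1)
    spread = stdev(population_mfes)
    z_score = 0.0 if spread == 0 else (candidate_mfe - mean(population_mfes)) / spread
    return p_value, z_score


if __name__ == "__main__":
    # Size of the precalculation: 1000 random MFEs per population.
    populations = 1_400_000
    print(populations * 1000)  # 1400000000 MFE values in total

    # Made-up numbers, for illustration only.
    print(mfe_significance(-40.0, [-20.0, -22.0, -18.0, -21.0]))
```

With precalculated populations, only the single candidate MFE has to be computed at prediction time; the population values are looked up instead of recomputed.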
9. Grid technology & research
Example: interproscan
Used for protein prediction/classification
Large databases
Slow software
Students developed Java application Condor-Interproscan
Introduction to GPGPU: clawhmmer
10. From research to education
Grid successful in research
However: unstable, resource hogging configuration
No educational materials present
Only two lecturers familiar with technology
Needed: time and resources to integrate research and education
Mastering bioGrid Project
11. Mastering bioGrid
Cooperation between Hanzehogeschool & Hogeschool Arnhem Nijmegen
Development of education materials
Reconfiguration of grid setup
Testing of configurations
Setup of grid infrastructure @ HAN
Research into the possibilities of GPGPU
Education of lecturers
Important input from students
Budget: €160,000, of which €100,000 subsidized (SURF, Utrecht)
12. Mastering bioGrid, results
Small grid @ HAN
Documentation for using Condor software @ Hanze
Education materials
Further development of Interproscan-Condor
Lecturers, researchers and students already consider it a standard tool
Students start using it in second year
More research possible during student projects
Very flexible grid infrastructure
Students: don't use OpenGL, use CUDA
13. GPGPU
General-purpose computing on graphics processing units
Highly parallel, data driven
Graphics processing: many independent, simple but mathematically intensive calculations
Bioinformatics: HMM, Smith-Waterman, SNP-detection!
OpenGL: very complex, not very suitable for GPGPU
CUDA:
•NVIDIA GPU
•C-based, 'simple' API
Compatible with OpenCL
Cheap hardware!
Redesign of algorithms needed
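To show what such a redesign can look like, here is a minimal Smith-Waterman scoring sketch that fills the matrix one anti-diagonal at a time. It is plain Python rather than CUDA, and the scoring values (match 2, mismatch -1, linear gap -1) are assumptions chosen for the example. The point is the loop order: all cells on one anti-diagonal depend only on earlier anti-diagonals, so on a GPU each of them could be computed by its own thread.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score, computed anti-diagonal by anti-diagonal."""
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    # d = i + j identifies an anti-diagonal; row 0 and column 0 stay zero.
    for d in range(2, rows + cols - 1):
        first_i = max(1, d - cols + 1)
        last_i = min(rows - 1, d - 1)
        # The iterations of this inner loop are independent of each other:
        # each cell reads only cells from anti-diagonals d - 1 and d - 2.
        for i in range(first_i, last_i + 1):
            j = d - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(
                0,
                h[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                h[i - 1][j] + gap,     # gap in b
                h[i][j - 1] + gap,     # gap in a
            )
            best = max(best, h[i][j])
    return best


if __name__ == "__main__":
    print(smith_waterman_score("ACGT", "ACT"))
```

A row-by-row loop gives the same scores but forces each cell to wait for its left neighbour, which is why the traversal order, not the recurrence, is what changes for the GPU.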
14. GPGPU
[Chart with a time axis from 2003 to 2008; content not recoverable]
16. Current research use
Dingo SNP detection (together with UNSW, Sydney)
•~500,000 '454' reads
•Reference genome: Dog
•Find reads with a single SNP: SOAP software
•Find unique reads not in Dog SNP database
•Results: 8242 new SNPs
•Now: tests in lab for pure Dingo selection
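The step of keeping only SNPs that are not already in the Dog SNP database amounts to a set difference. The sketch below assumes each SNP is represented as a `(chromosome, position, reference_base, alternative_base)` tuple; that representation, the function name and the example values are hypothetical and not taken from the project.

```python
def novel_snps(candidate_snps, known_snps):
    """Return candidate SNPs that do not occur in the known SNP collection.

    Each SNP is a (chromosome, position, ref_base, alt_base) tuple.
    Duplicate candidates (the same SNP seen in several reads) are reported once.
    """
    known = set(known_snps)
    return sorted(set(candidate_snps) - known)


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    candidates = [
        ("chr1", 1500, "A", "G"),
        ("chr1", 1500, "A", "G"),   # same SNP found in a second read
        ("chr2", 730, "C", "T"),
    ]
    dog_db = [("chr2", 730, "C", "T")]
    print(novel_snps(candidates, dog_db))
```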
Mixed cultures
•Sample water/soil/etc
•Sequence all DNA
•Find organisms/species/etc
Ongoing miRNA-research
17. Current educational use
Several projects now use grid
More and faster BLAST results: no more days waiting
High throughput / high performance computing
•Theme 12 specialisation
•Focus on GPGPU, CUDA
•Very enthusiastic students
•Student from HAN for minor
•Huge PR appeal (graphics cards are sexy)
18. Future & PhD
RAAK Pro Project
•Started in September 2009
•1M Euro, 4 years, 0.6M Euro subsidized
•WUR, UMCG, UNSW, HAN, PRI, KeyGene
More use of current grid infrastructure by others
Life Science Grid (SARA)
Larger grid @ HAN
More grids @ Hanze
•School of ICT, Media
Connect those grids
Development of GPGPU-based software
Use these and third-party software for research
19. Future?
Next and next generation sequencing
demand high throughput / high performance!
More and more other chip-based lab technologies become available!
Research asks, we develop and teach