1. http://www.mrsymbiomath.eu
This work has been partially supported by the Mr.Symbiomath IAPP (Project Code: 324554); the ‘Plataforma de Recursos Biomoleculares y Bioinformaticos (ISCIII-PT13.0001.0012)’; and the ‘Proyecto de Excelencia Junta de Andalucia (P10-TIC-6108)’.
Alex Upton1, Priscill Orue1, Oswaldo Trelles1,2
1 Computer Architecture Department, University of Malaga (UMA), Spain
2 RISC Software GmbH. Hagenberg, Austria
Abstract
It is widely agreed that complex diseases are typically caused by the joint effects of multiple genetic variations, rather than by a single genetic variation [1]. Multi-SNP interactions, also known as epistatic interactions, have the potential to provide information about the causes of complex diseases, and build on genome-wide association studies (GWAS) that look at associations between single SNPs and phenotypes. However, epistatic analysis methods are computationally expensive and, being command line based, have limited accessibility for biologists wanting to analyse GWAS datasets.
Here we present APPistatic, a prototype desktop version of a pipeline for epistatic analysis of GWAS datasets. This application combines ease of use, via a GUI, with accelerated implementations of the BOOST [2] and FaST-LMM [3] epistatic analysis methods.
Pipeline
Conclusions
• Implementing the analysis methods behind a GUI improves accessibility, making epistatic analysis tools a viable option for end users, such as biologists, who are not comfortable
with command line based tools. This allows further analysis of GWAS datasets, potentially building on existing analysis and resulting in additional genetic information being discovered.
• A notable improvement in execution time is also obtained, compared to default execution of the epistatic analysis tools. Future HPC deployment makes analysis of typical GWAS datasets feasible: a relatively
small GWAS dataset, with 100,000 SNPs that pass quality control, has approximately 5 × 10^9 pairwise interactions, which would take approximately two years to calculate on a desktop computer. Using HPC, this
can be executed in a number of days, aiding in the analysis of genetic variants of disease.
• In addition, a cloud-based version of the pipeline could also be developed using Web services, which could be accessed via a client such as jORCA [6]. Cloud Computing allows researchers to rent
computational and storage resources on an ad-hoc basis for large scale data processing, allowing access to High Performance Computing. Furthermore, this implementation could join up with
existing cloud-based pipelines to create an all-in-one process. Additionally, we are exploring the option of exporting results directly to visualisation software for visual inspection of the results.
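The pairwise interaction count quoted in the conclusions can be checked with a short calculation. The sketch below is illustrative only: the function name `pair_count` is not part of APPistatic, and it assumes that every unordered pair of SNPs is tested exactly once.

```python
from math import comb


def pair_count(n_snps: int) -> int:
    """Number of unordered SNP pairs, n * (n - 1) / 2."""
    return comb(n_snps, 2)


# A relatively small GWAS dataset: 100,000 SNPs after quality control.
small_gwas_pairs = pair_count(100_000)
print(small_gwas_pairs)  # 4999950000, i.e. about 5 x 10^9

# The demo dataset used for the timing results has 10,000 SNPs.
demo_pairs = pair_count(10_000)
print(demo_pairs)  # 49995000

# Ten times more SNPs means roughly one hundred times more pairs.
growth = small_gwas_pairs / demo_pairs
print(round(growth, 1))
```

Because the number of pairs grows quadratically with the number of SNPs, a tenfold increase in SNPs gives roughly a hundredfold increase in work, which is why exhaustive pairwise analysis quickly outgrows a desktop computer.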
Accessibility
Analysis of GWAS Data
The application provides an easy-to-use all-in-one analysis of GWAS data by
incorporating a number of analysis steps which are shown in Figure 1 below.
Steps Involved
(1) End user loads GWAS files of interest. These can be either in VCF or PLINK format.
For end users with raw .CEL files, one recommended tool for obtaining VCF files is
the Cloud-based GWAS Analysis Pipeline for Clinical Researchers [4].
(2) Prior to epistatic analysis, it is of interest to carry out single SNP association analysis.
This is performed using the widely used tool PLINK [5].
(3) The next step is to carry out an epistatic analysis using an optimised implementation
of BOOST that takes advantage of the multi-core environment of modern computers.
(4) The next step prepares the data for the FaST-LMM [3] analysis tools by converting the
user files to a compatible format.
(5) The next step is to carry out a single SNP association analysis with FaST-LMM, which
corrects for population structure.
(6) The final step is to carry out an epistatic analysis using FaST-LMM. As with BOOST,
implementation has been optimised to take advantage of multiple cores.
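One way in which a pairwise epistatic analysis, as in steps (3) and (6), could be divided across cores is sketched below. This is an assumption made for illustration, not a description of how APPistatic, BOOST or FaST-LMM actually split their work: the function names `split_pair_rows` and `pairs_in_rows` are hypothetical. SNP i is paired with every SNP j > i, so early rows hold more pairs than late rows, and the row ranges are chosen so that each task receives a similar number of pairs.

```python
def pairs_in_rows(n_snps, start, stop):
    """Count pairs (i, j) with start <= i < stop and i < j < n_snps."""
    return sum(n_snps - 1 - i for i in range(start, stop))


def split_pair_rows(n_snps, n_tasks):
    """Split rows 0 .. n_snps - 2 into contiguous (start, stop) ranges,
    one per task, each covering a similar number of SNP pairs."""
    total = n_snps * (n_snps - 1) // 2
    ranges = []
    start = 0
    assigned = 0  # pairs handed out so far
    for task in range(1, n_tasks + 1):
        stop = start
        # Add rows until this task's cumulative share of pairs is reached.
        while stop < n_snps - 1 and assigned * n_tasks < total * task:
            assigned += n_snps - 1 - stop
            stop += 1
        ranges.append((start, stop))
        start = stop
    return ranges


# Example: the 10,000-SNP demo dataset split into 4 tasks.
demo_ranges = split_pair_rows(10_000, 4)
for first_row, end_row in demo_ranges:
    print(first_row, end_row, pairs_in_rows(10_000, first_row, end_row))
```

Each range could then be handed to a separate worker process, or to a separate HPC job. Splitting by equal numbers of rows instead would leave the first task with far more pairs than the last.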
Acceleration
Desktop PC Implementation
The execution of APPistatic on a typical desktop PC results in a speedup of between 4
and 8 times for epistatic analysis, depending on the number of cores. The screenshot
above shows the default acceleration, using 4 tasks and 256MB RAM per task.
HPC Implementation
Greater speedup, making the analysis of typical GWAS datasets feasible, is obtained by
using High Performance Computing (HPC). Initial HPC deployment using 100 cores
shows a promising speedup of over 114 times for FaST-LMM epistatic analysis
(15123 s reduced to 132 s). Table 2 below shows the execution times
for BOOST and FaST-LMM epistatic analysis for a demo dataset on both a typical
desktop PC running Windows and the initial HPC deployment. It should be noted that the
demo dataset contains 10,000 SNPs. The faster execution time of BOOST is due to the
use of a linear regression model, compared to the linear mixed model used by
FaST-LMM.
Computational Environment               BOOST Epistatic       FaST-LMM Epistatic
                                        Execution Time (s)    Execution Time (s)
Standard Implementation (a)             25.4                  15123
APPistatic Deployed on Desktop PC (b)   4.8                   1903
Deployment on HPC (c)                   1.2                   132
(a) Default execution of applications from command line on Desktop PC (detailed below)
(b) Desktop PC with Intel Core 2 Quad 2.66 GHz CPU and 4 GB RAM running Windows 7
(c) Split into 100 tasks with 4 cores and 8 GB RAM assigned to each task
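The speedup factors implied by the execution times in the table can be reproduced as follows. This is a minimal sketch: the dictionary layout and the names `times` and `speedup` are chosen for illustration, and speedup is taken to be the standard execution time divided by the accelerated execution time.

```python
# Execution times in seconds, copied from the table above.
times = {
    "standard": {"BOOST": 25.4, "FaST-LMM": 15123.0},
    "desktop": {"BOOST": 4.8, "FaST-LMM": 1903.0},
    "hpc": {"BOOST": 1.2, "FaST-LMM": 132.0},
}


def speedup(baseline_s: float, accelerated_s: float) -> float:
    """Speedup factor: baseline time divided by accelerated time."""
    return baseline_s / accelerated_s


speedups = {
    env: {
        tool: speedup(times["standard"][tool], seconds)
        for tool, seconds in tool_times.items()
    }
    for env, tool_times in times.items()
    if env != "standard"
}

for env, per_tool in speedups.items():
    for tool, factor in per_tool.items():
        print(f"{env:8s} {tool:9s} {factor:6.1f}x")
```

On the desktop PC this gives about 5.3 times for BOOST and 7.9 times for FaST-LMM, within the 4 to 8 times range reported. The HPC deployment gives about 21.2 times for BOOST and 114.6 times for FaST-LMM.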
References
[1] Anunciação, Orlando, Susana Vinga, and Arlindo L. Oliveira. "Using Information Interaction to Discover Epistatic Effects in Complex Diseases." PLoS ONE 8, no. 10 (2013): e76300.
[2] Wan, Xiang, Can Yang, Qiang Yang, Hong Xue, Xiaodan Fan, Nelson LS Tang, and Weichuan Yu. "BOOST: A fast approach to detecting gene-gene interactions in genome-wide case-control studies." The American Journal of Human Genetics 87, no. 3 (2010): 325-340.
[3] Lippert, Christoph, Jennifer Listgarten, Ying Liu, Carl M. Kadie, Robert I. Davidson, and David Heckerman. "FaST linear mixed models for genome-wide association studies." Nature Methods 8, no. 10 (2011): 833-835.
[4] P. Heinzlreiter, J. Perkins, O. Torreño Tirado, J. Karlsson, A. Mitterecker, M. Blanca and O. Trelles. "A Cloud-based GWAS Analysis Pipeline for Clinical Researchers." 4th International Conference on Cloud Computing and Services Science, CLOSER 2014.
[5] Purcell, Shaun, Benjamin Neale, Kathe Todd-Brown, Lori Thomas, Manuel AR Ferreira, David Bender, Julian Maller et al. "PLINK: a tool set for whole-genome association and population-based linkage analyses." The American Journal of Human Genetics 81, no. 3 (2007): 559-575.
[6] Martín-Requena, Victoria, Javier Ríos, Maximiliano García, Sergio Ramírez, and Oswaldo Trelles. "jORCA: easily integrating bioinformatics Web Services." Bioinformatics 26, no. 4 (2010): 553-559.
Figure 1: Overview of Pipeline
Graphical User Interface
Providing GUI access to epistatic analysis
methods, along with single SNP association
methods, improves their accessibility: multiple
tools are accessed in the same manner,
allowing the target non-expert computer users,
e.g. biologists, to easily analyse their GWAS
datasets without having to learn different
commands for each tool. The GUI is shown in
Figure 2 on the left. Note the easily configurable
options for acceleration. The prototype version
of APPistatic can be downloaded from:
http://chirimoyo.ac.uma.es/appistatic
Figure 2: APPistatic GUI
Figure 2: Implementation Results