School of Biosciences
Detection of Expression Quantitative Trait Loci and its role in explaining
the variation in the gene expression phenotypes
A research project submitted by
Abhilash Krishna Kannan
A thesis presented to the School of Biosciences of the University of
Birmingham in partial fulfilment of the requirements for the degree of
Master of Science in Analytical Genomics
School of Biosciences
University of Birmingham
Birmingham, UK.
September 2011
Supervisor:
Prof Zewei Luo
To my Parents,
Kannan Krishnaswamy and Chitra Kannan
Abstract
Variation in gene expression and its genetic basis can be attributed to specific regions of
the genome and the variants they contain. A detailed analysis of mRNA expression levels in
lymphoblastoid cell lines from 210 individuals, together with their genotypes from the
Phase II HapMap project, revealed a strong correlation between variation in genotype and
variation in gene expression levels. REML analysis shows that gene expression is a heritable
phenotype, although for some genes only a small proportion of the variation in expression
is explained by the SNPs. Of the ~2.2 million SNPs (MAF > 5%) tested for association with
simple linear regression models, significant cis-associations were observed for 579 genes
and significant trans-associations for 32 genes. The association results thus show many more
genes with cis effects than with trans effects, because genes with trans effects may be
regulated by a large number of SNPs, many of which fail to pass the stringent significance
threshold. This also suggests that variation in the gene expression phenotype may be
primarily due to regulatory variants. The heritability analysis shows that 44% of the
heritability of gene expression was explained by the peak SNP, and that the variation
explained by cis-eQTLs is significantly higher than that explained by trans-eQTLs. However,
a few genes with low heritability estimates still had eQTLs, which calls into question the
power of heritability to identify cis- and trans-acting eQTLs. Alongside heritability
analysis and GWAS, several additional approaches and methodologies were considered in order
to analyse accurately the variation in gene expression across all 210 unrelated individuals.
Keywords
Single Nucleotide Polymorphisms, Genome Wide Association Studies, Expression QTL,
Linear Regression, Restricted Maximum Likelihood, Principal Components Analysis,
Heritability, Cis-eQTL, Trans-eQTL, Gene Expression, HapMap, Lymphoblastoid,
Microarray, Population Stratification, BLAT, Multiple Testing, PLINK, GCTA.
Acknowledgement
The first and foremost person I would like to thank is my mother, Mrs. Chitra Kannan, for
her constant support throughout my life and for always encouraging me in difficult and
testing times. I strongly believe that whatever I have achieved to date is only because of
her. Concrete foundations in the biological sciences propelled me towards a four-year
Bachelor of Technology degree in Biotechnology at the prestigious Padmashree Dr. D.Y. Patil
Institute for Biotechnology and Bioinformatics. I developed a huge interest in computational
genomics during my Bachelor's degree. After completing the four-year course, I searched for
universities in the UK known for their excellence in research. During this time I came
across an advertisement for the Analytical Genomics course at the University of Birmingham,
UK. The course content was simply outstanding, and I had always dreamt of pursuing further
education in a subject of my liking at such a prestigious university. I blinked in disbelief
when I read through the list of professors and lecturers. That is where I came across the
profile of Prof Zewei Luo, and I would never have guessed that my quest for postgraduate
study in the UK would eventually lead me to the Analytical Genomics group headed by Prof
Luo. Prof Luo was a great supervisor and I would like to thank him for his enthusiasm,
support and ideas, and also for believing that I could do this even without prior knowledge
or a strong computational background before the start of the course. With Prof Luo, I was
able to share my fascination for human genetic variation, which I am sure has kept us both
awake until the early morning hours. Human variation was the reason I studied biology in
the first place.
I was lucky enough to be taught about eQTLs by Dr. Lindsey Leach, from the Department of
Plant Sciences, University of Oxford. Dr. Leach was one of the most inspiring people I have
met, and through her I discovered what fascinated me the most in science. I am sure that if
it weren't for her, I would not be writing this now. I am also very grateful to Dr. Arpita
Gupte, who taught me Biochemistry during my undergraduate degree, and to Dr Ganapathy
Subramaniam, who was my project supervisor during that degree. Both of them were great
teachers and wonderful people who always inspired me to go into research. Their positive
view on work and life, as well as their exciting ideas, were very motivating throughout
these years. I would also like to thank Minghui Wang and Ning Jiang for their guidance and
reassuring comments, especially during project meetings.
Abbreviations
A Adenine
ASW African ancestry in the Southwest USA
BLAT BLAST-like alignment tool
C Cytosine
CEU Residents with ancestry from Northern and Western Europe
CHB Han Chinese in Beijing
CEPH Centre d'Etude du Polymorphisme Humain
CHD Chinese in metropolitan Denver, CO, from USA
DNA Deoxyribonucleic Acid
eQTL Expression Quantitative Trait Loci
EV EigenVector
FDR False Discovery Rate
G Guanine
GC Genomic Control
GWAS Genome Wide Association Studies
GCTA Genome-Wide Complex Trait Analysis
GIH Gujarati Indians in Houston, TX, from USA
JPT Japanese in Tokyo, Japan
kb Kilo Base
LR Linear Regression
LWK Luhya in Webuye from Kenya
LD Linkage Disequilibrium
MAF Minor Allele Frequency
mRNA Messenger Ribonucleic Acid
MKK Maasai in Kinyawa, from Kenya
MEX Mexican ancestry in Los Angeles, CA, USA
MDS Multidimensional Scaling
Mb Mega Base
PCA Principal Components Analysis
QTL Quantitative Trait Loci
REML Restricted Maximum Likelihood Analysis
RNA Ribonucleic Acid
rRNA Ribosomal Ribonucleic Acid
SNP Single Nucleotide Polymorphism
T Thymine
tRNA Transfer Ribonucleic Acid
TSI Tuscans in Italy
TSS Transcription Start Site
UCSC University of California, Santa Cruz
YRI Yoruba from Nigeria
Table of Contents
Declaration
Abstract
Keywords
Acknowledgements
Abbreviations
Table of Contents
1. Introduction
1.1 Expression Quantitative Trait Loci
1.2 Challenges with Microarray Technology
1.3 Impact of Genetical Genomics
1.4 Genome Wide Association Studies (GWAS) in eQTL Analysis
1.5 Focus on Gene Expression
1.6 Natural Variation in Gene Expression
1.7 Associating the variation in gene expression with marker genotypes
1.8 Focus on HapMap individuals
1.8.1 Phase I
1.8.2 Phase II
1.8.3 Phase III
1.9 Effect of Population Structure in GWAS
1.10 Example of Population Stratification
1.11 Controlling the stratification
1.12 Two Major classes of eQTL
1.13 Measure of Heritability
1.14 Main Aim of the study
2. Methods
2.1 Selecting the genes to study for the analysis
2.2 Investigation and Correction of population structure
2.3 Performing Association studies between the gene expression traits and SNP marker genotypes
2.4 Correcting the Association signals by Multiple testing
2.5 Heritability Analysis
2.6 Statistical approach to REML
3. Results
3.1 Addressing the problem of population stratification
3.2 Cis-associations of SNPs with the gene expression phenotypes and the distribution of cis-acting eQTL
3.3 Reasons for choosing phase II HapMap individuals
3.4 Positions of SNPs relative to the Transcription Start Sites
3.5 Usefulness of Multipopulation over Single population analysis
3.6 Trans-associations of SNPs with the gene expression phenotypes
3.7 Heritability of gene expression traits
3.8 eQTL heritability
4. Discussion
5. Conclusion
6. References
7. Appendix
7.1 S1: BLAT search
7.2 S2: Table Showing 10 eigenvectors obtained from the PCA analysis
7.3 S3: Association Studies
7.4 S4: Table of genes showing significant cis-associations and their Heritability estimates
7.5 S5: Table of genes showing significant trans-associations and their Heritability estimates
7.6 S6: Estimating the heritability of Gene expression traits and eQTLs
7.7 References
Introduction
Expression Quantitative Trait Loci
An expression quantitative trait locus (eQTL) is a site in the genome at which variation in
the nucleotide sequence between two genotypes leads to a significant difference in gene
expression levels between the individuals carrying those genotypes [1]. eQTL studies have
gained importance in recent years, especially after the rapid advances in microarray
expression profiling. The expression levels of genes bear on several key questions, most
notably susceptibility to disease, adaptive evolution, the characteristic features of cells
and the regulatory mechanisms within cells. Initially, microarray technology was used
primarily to compare the expression levels of genes under various physiological and
environmental conditions. In the past few years several research groups have started to
relate variation in DNA sequence to individual differences in gene expression (figure 1).
Variation in gene expression is treated as a quantitative trait and tested for association
with SNP marker genotypes. This integration of genetic association with differences in gene
expression led to the development of genetical genomics, which aims to study the genetic
basis of gene expression by linking conventional genetic analysis with gene expression
analysis.
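As a rough illustration of what "testing expression for association with a SNP genotype" means in practice, the sketch below fits an ordinary least-squares line of expression on genotype dosage. It assumes the additive 0/1/2 genotype coding described in the Methods; the function name and the toy data are hypothetical and are not taken from the study.

```python
def single_snp_regression(genotypes, expression):
    """Least-squares fit of expression on genotype dosage (0, 1 or 2).

    Returns (intercept, slope, r_squared). The slope is the estimated change
    in expression per copy of the counted allele.
    """
    n = len(genotypes)
    if n != len(expression) or n < 3:
        raise ValueError("need two equal-length sequences with at least 3 values")
    mean_x = sum(genotypes) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in genotypes)
    syy = sum((y - mean_y) ** 2 for y in expression)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(genotypes, expression))
    if sxx == 0:
        raise ValueError("SNP is monomorphic in this sample")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Share of expression variance explained by the SNP.
    r_squared = 0.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return intercept, slope, r_squared


# Hypothetical toy data: expression rises by 0.5 per allele copy.
toy_genotypes = [0, 0, 1, 1, 2, 2]
toy_expression = [7.0, 7.0, 7.5, 7.5, 8.0, 8.0]
intercept, slope, r_squared = single_snp_regression(toy_genotypes, toy_expression)
```

In a genome-wide scan the same fit would be repeated for every SNP and probe pair, which is why a stringent significance threshold is needed.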
Challenges with the Microarray technology
Accurate measurement of mRNA quality and quantity is essential for identifying eQTLs. This
is done with DNA microarrays, which can measure the expression levels of all genes in
different tissues. This hybridization-based technology has a few disadvantages:
1. Coding SNPs within the probes can alter the hybridization efficiency, giving false
mRNA expression values that appear to be explained by the SNP.
2. Choosing relevant probes for the hybridization experiments requires prior knowledge of
the RNA sequences.
3. Significant background noise.
Despite these limitations, gene expression studies using microarray technology have
detected QTLs for many genes in various tissues with satisfactory power.
Impacts of Genetical Genomics
Genetical genomics is considered a very good approach to explaining the basis of complex
traits and susceptibility to disease. A complex trait is usually controlled by a few genes
of major effect and many genes of minor effect, and is modified by the environment.
Prominent examples include height, weight and intelligence. The loci involved are known as
Quantitative Trait Loci (QTL). Genetical genomics essentially attempts to address the
relationship between genotype, gene expression and phenotype. It treats the gene expression
level as a quantitative trait and links variation in the trait to genomic locations. The
statistical power of such a combined genetic linkage analysis and expression profiling is
expected to be higher than that of either approach alone. By identifying the polymorphisms
responsible for variation in gene expression, it gives a clearer understanding of the
organisation of gene networks.
Genome Wide Association Studies (GWAS) in eQTL analysis:
Genes vary from one individual to another. GWAS examines the genomes of different
individuals of a species and associates this variation with different traits, such as
disease or, in the case of eQTL analysis, gene expression. It helps to identify whether a
particular gene is associated with a disease. It involves testing thousands of individuals
for mutations and polymorphisms (SNPs), and is mainly used in epidemiological studies to
find disease pathways, disease susceptibility and so on. A genetic variant is taken to be
associated with a particular trait (e.g. gene expression) if it occurs more frequently in
people who have that trait. The methods used to identify trait-associated mutations can be
either hypothesis-driven or non-hypothesis-driven. Hypothesis-driven studies start from the
hypothesis that a particular gene may be associated with a specific trait and attempt to
find this association. Non-hypothesis-driven methods scan the entire genome and determine
which variants show a significant association. Most GWAS are non-hypothesis-driven [35, 36].
Focus on the Gene expression
Gene expression is defined as the process by which information from a gene is used in the
synthesis of a functional gene product [2]. The main products of gene expression are
proteins or functional RNA molecules (e.g. rRNA, tRNA, microRNA). The expression levels of
genes can be regarded as a complex trait, since they are controlled mainly by genetic [3-7],
epigenetic [8, 9] and environmental [10, 11] factors. For these reasons they can be
considered a continuously varying phenotype.
Natural Variation in gene expression
Regulation of gene expression is one of the key events in the developmental programme of
an organism. Changes in the expression pattern are known to bring about considerable
changes in the normal functioning of the cell. Natural variation in gene expression has
been studied in many species, such as yeast, fruit flies, mice and humans [10, 12-18]. In
one experiment, approximately half (2,698 out of 6,215) of all the genes in the genome were
differentially expressed in a cross between two different strains of yeast [13]. In a study
of human populations, significant natural variation in gene expression was observed among
16 individuals of European and African descent: about 83% and 17% of the genes were
differentially expressed among individuals and among populations, respectively [19]. In
another study [11], more than 65% of the genes were differentially expressed among three
populations of healthy Moroccan Amazigh (Berbers). These results clearly suggest that there
is a significant amount of natural variation in gene expression within species.
Associating variation in Gene expression with Marker Genotypes:
Only about 0.1% of the 6 billion nucleotides making up the human genome vary between any
two randomly chosen individuals [22]. This variation is mainly due to single nucleotide
polymorphisms, copy number variants, insertions, deletions, retroposon insertions or a
combination of these. eQTL analysis mainly focuses on the association between single
nucleotide polymorphism (SNP) marker genotypes and variation in gene expression. A SNP is a
common type of DNA sequence variation in which a single nucleotide (A, T, C or G) in the
genome is changed [20]. For example, let the DNA sequences from two different individuals be
AGTTACAGT and AGTTGCAGT. These differ at a single nucleotide; in other words, there are two
alleles, A and G. SNPs account for approximately 75% of the total observed variation in
humans [21], and there are more than ten million SNPs in the human genome. A DNA position
is considered polymorphic when the allele frequency lies between 1% and 99% in the
population. The main aim of the International HapMap Project was to group these variants in
order to identify the genetic similarities and differences between humans [22].
Based on their position in the genome, SNPs can be classified as coding or non-coding. Only
about 1.5% of the genome encodes proteins, and there is a great deal of redundancy in the
genetic code, in which a particular amino acid is encoded by multiple codons. Coding SNPs
can therefore be classified as synonymous or non-synonymous: the former do not alter the
amino acid sequence, while the latter bring about a single amino acid change.
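The two-sequence example and the 1% to 99% frequency criterion above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the sequences are assumed to be already aligned, and the 1% bound is treated as inclusive, which the text does not specify.

```python
from collections import Counter


def snp_positions(seq_a, seq_b):
    """Return (index, allele_a, allele_b) for each site where two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


def minor_allele_frequency(alleles):
    """Frequency of the less common allele among the observed allele copies."""
    counts = Counter(alleles)
    if len(counts) == 1:
        return 0.0  # monomorphic site
    if len(counts) != 2:
        raise ValueError("expected a biallelic site")
    return min(counts.values()) / sum(counts.values())


def is_polymorphic(alleles, lower=0.01):
    """True when the minor allele frequency is at least `lower` (1% by default)."""
    return minor_allele_frequency(alleles) >= lower


# The example from the text: the sequences differ at 0-based index 4 (A versus G).
differences = snp_positions("AGTTACAGT", "AGTTGCAGT")
```

For a biallelic site, requiring the minor allele frequency to be at least 1% is equivalent to requiring either allele's frequency to lie between 1% and 99%.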
Focus on the Hapmap Individuals:
The HapMap project is an international project and a very useful public resource that
brings together researchers and their groups from non-profit organizations [22]. The main
aim of the project is to identify and group SNP variants in order to identify genetic
similarities and differences between humans from four geographically distinct populations:
samples from Nigeria (Yoruba, YRI), Japan (JPT), China (Han Chinese in Beijing, CHB) and
the U.S. (residents with ancestry from Northern and Western Europe, abbreviated CEU,
collected by the Centre d'Etude du Polymorphisme Humain (CEPH)) [22-24]. The project
generates a detailed haplotype map of the human genome and explores the common patterns of
genetic variation. DNA from lymphoblastoid cell lines derived from blood samples of
individuals from the four populations is genotyped [22-24]. The project consists of three
phases:
Phase 1
In this phase at least one common SNP was genotyped every five kb across the euchromatic
portion of the genome. Phase 1 comprised over a million accurate and complete SNP genotypes
from 269 individuals in four groups (30 mother-father-adult child trios of CEU, 45
unrelated CHB, 44 unrelated JPT and 30 trios of YRI individuals) [23]. This resource was
released in 2005 [23].
Phase 2
Phase 2 had more than 2.1 million SNPs from each of 270 individuals (every individual from
Phase 1 was included, along with a few extra JPT individuals). The SNP density in this
phase was close to one SNP per 1000 bp. Patterns of genetic variation were captured better
with the help of tag SNPs, and the use of fixed marker sets increased the power of
association through imputation. Detailed information about this resource was released in
2007.
Phase 3
Phase 3 had seven additional populations (Luhya in Webuye from Kenya (LWK); Maasai in
Kinyawa, from Kenya (MKK); Gujarati Indians in Houston, TX, from USA (GIH); Chinese in
metropolitan Denver, CO, from USA (CHD); Toscani in Italia (Tuscans in Italy, TSI); African
ancestry in the Southwest USA (ASW); and Mexican ancestry in Los Angeles, CA, USA (MEX))
apart from the four initial populations. Over 4 million SNPs were genotyped in 541
individuals from the four initial populations (CHB, CEU, YRI and JPT), and close to 1.5
million SNPs in 760 individuals belonging to the seven additional populations.
Table 1 gives a brief summary of the SNPs and the individuals studied in each of the three
phases [22-24].
Effect of Population Structure in GWAS
In tests of association between gene expression traits and SNP marker genotypes among
unrelated individuals, population stratification can substantially inflate the test
statistics and thereby give rise to type 1 errors. “Population stratification occurs due to
the systematic difference in the allele frequency between subpopulations in a study due to
ancestry difference between study subjects” [25]. If unnoticed, the presence of population
structure can result in false association signals. Hence it needs to be corrected for
before carrying out GWAS.
Example of Population Stratification
The use of chopsticks is more common in the Chinese population. There was once a study by
an ethnogeneticist who wanted to find a genetic reason for the high use of chopsticks in
the Chinese population (a “chopsticks gene”). He included both American and Chinese
participants in his study, looking for any gene that differed in frequency between these
two population groups, and such a gene was indeed found to be associated with the use of
chopsticks. However, the association was spurious [26]. The use of chopsticks among the
Chinese population is due mainly to traditional practice. This practice is correlated with
gene frequencies, which led to a false correlation between genes and the use of chopsticks
(i.e. the genes associated with chopstick use are simply those that differ in frequency
between the Chinese and American populations).
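The mechanism behind such a spurious association can be reproduced with a small, deliberately artificial data set. In the sketch below (all numbers are hypothetical), genotype and trait are uncorrelated within each of two populations, yet pooling the populations produces a clear correlation because both the allele frequency and the trait mean differ between them.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical population A: allele common, trait values high.
pop_a_genotypes = [2, 1, 2, 1]
pop_a_trait = [10.0, 10.0, 12.0, 12.0]
# Hypothetical population B: allele rare, trait values low.
pop_b_genotypes = [1, 0, 1, 0]
pop_b_trait = [2.0, 2.0, 4.0, 4.0]

# Within each population there is no relationship between genotype and trait.
r_within_a = pearson_r(pop_a_genotypes, pop_a_trait)
r_within_b = pearson_r(pop_b_genotypes, pop_b_trait)
# Pooling the two populations creates a correlation driven purely by ancestry.
r_pooled = pearson_r(pop_a_genotypes + pop_b_genotypes, pop_a_trait + pop_b_trait)
```

The pooled correlation here reflects population membership only, which is exactly the confounding that stratification correction is meant to remove.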
Genetics does not play a large role in defining the population structure. Stratification
is not a significant problem in non-human animals, because their environments can be
controlled experimentally: spurious correlations between the occurrence of a particular
allele and exposure to a particular environment can be avoided by assigning environments at
random or by keeping the environment invariant. The best way to control population
stratification would therefore be to raise all individuals in the same common environment,
or to assign them to environments of our choice, which is not feasible in humans.
Types of population structure:
Population structure can be divided into three basic categories:
1. Discrete structure: comprises remotely related populations (e.g. Asians, Europeans,
Africans). The structure is easy to detect since the individuals are well separated.
2. Admixed structure: comprises individuals of mixed ancestry (e.g. Hispanic Americans,
African Americans). These individuals carry varying degrees of admixture, so it is
difficult to assign them to distinct clusters.
3. Hierarchical structure: comprises both discrete and admixed structure. It consists of
multi-ethnic individuals [27].
Controlling the stratification
Genomic control (GC) is one of the most useful approaches for controlling population
structure [28-30]. Its principle is as follows: for a large number of markers tested for
association with the phenotype, the chi-square statistic is computed and then adjusted by
an inflation factor estimated from a random set of markers that are not associated with the
phenotype of interest. A stringent genome-wide significance threshold is required because a
large number of SNP markers are tested in a genome-wide association scan. Even such a
threshold cannot control type 1 error [29, 31, 32], because the assumption of uniform
variance inflation is violated across the SNPs. The main problem with GC is that a uniform
adjustment may not be sufficient if the allele frequencies of some markers differ more than
others across the ancestral populations.
Structured association is another approach to controlling stratification. It uses the
STRUCTURE program [33], in which the individuals under study are assigned to discrete
subpopulations and the evidence of association is collected within each subpopulation. This
approach is computationally demanding, and the assignment of individuals to clusters
depends on the number of clusters, which is not well defined in many studies.
A more recently developed approach is stratification control by EIGENSTRAT [65]. It
identifies population structure by calculating the principal components of the SNP
genotypes across the genome. The first few (top) principal components represent the axes of
greatest genetic variation among the individuals, and are used as covariates in the
regression analysis.
Another interesting approach is Multidimensional Scaling (MDS) analysis [34], which is an
extension of EIGENSTRAT. “It is a clustering approach which attempts to recognize the
patterns of genetic variation (both discrete and Admixed). The positions of a subject along
the axis of genetic variation are identified along with the cluster membership (Figure 2).
Each of these positions is then adjusted in order to correct for the potential confounding
effects. On the other hand, EIGENSTRAT measures the genetic correlation between the
subjects. The results obtained from simulations clearly demonstrate that MDS offers a better
control over the stratification as compared to the EIGENSTRAT approach” [34].
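The genomic control adjustment described above can be sketched as follows. This is a minimal illustration under common conventions that the text does not spell out: the inflation factor is taken as the median of the null-marker chi-square statistics divided by the median of a chi-square distribution with one degree of freedom (about 0.455), and it is floored at 1 so that statistics are never inflated. The function names and toy values are hypothetical.

```python
from statistics import median

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation_factor(null_chi2_stats):
    """Inflation factor estimated from markers assumed unassociated with the trait."""
    return max(1.0, median(null_chi2_stats) / CHI2_1DF_MEDIAN)


def genomic_control(chi2_stats, null_chi2_stats):
    """Divide every test statistic by the same inflation factor (uniform adjustment)."""
    lam = genomic_inflation_factor(null_chi2_stats)
    return [stat / lam for stat in chi2_stats]


# Hypothetical null-marker statistics whose median is twice the expected value,
# so the estimated inflation factor is 2.
toy_null = [0.3, 0.6, 2 * CHI2_1DF_MEDIAN, 1.5, 4.0]
toy_lambda = genomic_inflation_factor(toy_null)
adjusted = genomic_control([10.0, 4.0], toy_null)
```

Because every statistic is divided by the same factor, markers whose allele frequencies differ unusually strongly between ancestral populations remain under-corrected, which is the limitation noted in the text.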
Two major classes of eQTLs
eQTLs are mapped using the abundance of each transcript measured in many subjects from the
population. The mean abundance of a transcript in each cell line is used to map the QTLs
responsible for the expression level of that transcript by standard mapping approaches [37].
eQTLs can be classified into two main categories, cis (proximal) and trans (distal), based
on their physical distance from the regulated gene. A gene expression trait linked to a
locus is said to show cis-linkage if the locus is located near the target gene itself;
otherwise it shows trans-linkage (figure 3). Such a distance-based criterion provides a
broad classification of regulatory elements. In their studies of yeast, Brem et al. [38]
defined linkage as cis if the gene and the marker were less than 10 kb apart. Schadt et
al. [39] and Wang et al. [40] defined linkage as cis if the gene and the marker were less
than 20 Mb apart. Cis-linkage mainly arises from a first-order effect, in which a DNA
polymorphism in the gene itself affects its expression and produces a strong linkage
signal. Trans-linkage, on the other hand, arises from a second-order effect, in which the
transcription product (protein) of the gene containing the DNA polymorphism affects the
expression of a different gene, thereby producing a weaker signal.
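A distance-based cis/trans rule of this kind reduces to a single comparison. The sketch below is illustrative only: the function and parameter names are hypothetical, the window is left as an explicit argument because the cited studies use very different values (10 kb and 20 Mb), and a marker on a different chromosome is assumed to be trans.

```python
def classify_linkage(marker_chrom, marker_pos, gene_chrom, gene_pos, cis_window_bp):
    """Label a marker-gene pair 'cis' or 'trans' from their physical distance in base pairs."""
    same_chromosome = marker_chrom == gene_chrom
    if same_chromosome and abs(marker_pos - gene_pos) < cis_window_bp:
        return "cis"
    return "trans"


# The same hypothetical pair, 50 kb apart, under the two windows quoted in the text.
with_10kb_window = classify_linkage("chr1", 150_000, "chr1", 100_000, 10_000)
with_20mb_window = classify_linkage("chr1", 150_000, "chr1", 100_000, 20_000_000)
```

The example shows how strongly the classification depends on the chosen window: the same pair is trans under a 10 kb criterion and cis under a 20 Mb criterion.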
Measure of heritability
Significant association of SNPs with expression traits forms the basis of genome-wide
association studies in human populations. However, only a small proportion of the variation
in gene expression levels is explained by variation in genotype across individuals. The
important question here is: what about the missing heritability [58, 59]? Possible
explanations are: 1) gene × gene or gene × environment interactions [60]; 2) epigenetic
factors [61]; 3) the rare variant and common disease hypothesis [62, 63]. The variation in
gene expression explained by the SNPs is generally lower than the narrow-sense heritability
(which is based on additive genetic effects). Hence non-additive effects cannot explain the
missing heritability. When only a small proportion of variation is explained by causal
variants, their effects fail to reach strict significance thresholds, and they may not be
in complete linkage disequilibrium with the SNP markers. Thus the heritability of the
expression trait would depend heavily on these factors.
The heritability of gene expression can thus be defined as “the ratio of genetic variance
to the total phenotypic variance” [66].
Genetic variance refers to “the variation in the gene expression phenotypes occurring due
to the presence of different genotypes across the population” [79]. Environmental variance,
on the other hand, explains “the proportion of the variance in the gene expression
phenotypes caused due to the differences in the environments to which the individuals in a
population have been exposed” [80].
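In symbols, the verbal definition above can be written as follows. The notation is an assumption introduced here for illustration (V_G for genetic variance, V_E for environmental variance, V_P for total phenotypic variance), as is the simple additive decomposition of V_P, which ignores gene-environment interaction and covariance terms.

```latex
h^2 = \frac{V_G}{V_P}, \qquad V_P = V_G + V_E
```

Under this decomposition h^2 lies between 0 and 1, and a trait whose variation is entirely environmental has h^2 = 0.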
Similarly, eQTL heritability can be defined as [equation not recovered from the source].
Although a particular SNP and the loci in LD with it can have a substantial effect on the
variation in mRNA expression levels, the average heritability of eQTLs has been reported to
be ~25% in many studies [67, 68, 69]. However, the gene expression study by Schadt et
al. [70] on human liver concluded that the variation explained by the SNPs can be as low as
2% and as high as 90%, suggesting a large variability in gene expression heritability and
its dependence on the particular type of cell or tissue.
Main Aim of the study
The genetics of gene expression, with a main focus on variation in gene expression, has
been a hot topic for the past few years. It is now well known that regulatory polymorphisms
are present in the human genome and regulate genes in cis and in trans. Most studies have
focussed on the effects of single genetic variants and on expression measured in a single
cell type. Many eQTL studies have focussed mainly on experimental crosses in yeast and
mice [41, 42]. Recently, several studies in a variety of organisms have clearly shown that
gene expression levels are heritable [43-47]. Considering these aspects, the main
objectives of our study are:
1. To detect cis- and trans-acting expression QTLs (eQTLs) in a mixture of HapMap
populations.
2. To estimate the heritability of the gene expression traits.
3. To explore the relationship between the heritability of a gene's expression level and
the power to identify cis- and trans-acting eQTLs.
4. To predict the presence or absence of eQTLs for a particular gene based on the
heritability estimates of that gene.
Methods
Selecting the genes to study for the analysis
The gene expression levels of 210 unrelated HapMap individuals (CEU: 60; CHB: 45; JPT: 45;
YRI: 60) were measured in four technical replicates using Illumina's human whole-genome
expression array (Box 1). “Quantile normalization was performed on the gene expression data
within the replicates and this was then followed by the median normalization across the
Hapmap individuals” [6]. The normalized gene expression values were readily available in
the Gene Expression Omnibus database under accession number GSE6536 [6]. The array
consisted of 47,294 Illumina probes, and a few genes were targeted by multiple probes. A
BLAT [48] search was carried out to map the probes to human RNA sequences from RefSeq
(hg19) [49]. All probes that mapped to more than one gene with more than 90% similarity
were discarded. Only probes that matched exactly to a unique gene were selected.
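For readers unfamiliar with quantile normalization, the sketch below shows the basic idea on replicate measurements: each replicate is forced to share the same distribution by replacing every value with the mean of the values holding the same rank across replicates. It is a generic illustration rather than the procedure used to produce the GSE6536 values; the function name is hypothetical, ties are broken by order of appearance, and the subsequent median normalization across individuals is not shown.

```python
def quantile_normalize(replicates):
    """Quantile-normalize a list of equal-length replicate measurement lists.

    Each value is replaced by the mean, across replicates, of the values that
    share its rank, so all replicates end up with an identical distribution.
    """
    n = len(replicates[0])
    if any(len(rep) != n for rep in replicates):
        raise ValueError("all replicates must have the same number of probes")
    sorted_reps = [sorted(rep) for rep in replicates]
    rank_means = [sum(rep[r] for rep in sorted_reps) / len(replicates) for r in range(n)]
    normalized = []
    for rep in replicates:
        # Probe indices ordered from lowest to highest value in this replicate.
        order = sorted(range(n), key=lambda i: rep[i])
        out = [0.0] * n
        for rank, probe_index in enumerate(order):
            out[probe_index] = rank_means[rank]
        normalized.append(out)
    return normalized


# Hypothetical toy data: two replicates, three probes each.
toy_normalized = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
```

After normalization both replicates contain the same set of values (1.5, 3.5 and 5.5 here), while each probe keeps its rank within its own replicate.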
From these, I retained the probes for genes that had only one RNA accession number in
the RefSeq database. Genes with multiple splice forms or multiple annotated start sites
were avoided, which made the analysis simpler [50]. Of the remaining genes, the majority
had only one probe, while a few still had multiple probes. For these I checked whether
the probes mapped to the 3' end or the 5' end of the gene. If the majority of the probes
were biased towards the 3' end of the gene, the bias was reduced by selecting, out of the
multiple probes, the probe nearest to the 5' end. I then searched for SNP positions
within the probe sequences and discarded the affected probes, as a SNP inside a probe
could have a significant impact on the measured expression level. The probes containing
SNPs were identified in the UCSC Genome Browser [51]. Finally, probes that mapped to the
X and Y chromosomes were excluded from the analysis. After these steps, a data set of
11,446 gene expression values was left for the final data analyses.
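The filtering steps above can be sketched as a single pass over annotated probes. This is only an illustrative sketch: the `Probe` record and all of its field names are hypothetical, and the 3'-bias rule is simplified to "keep the probe nearest the 5' end whenever a gene has several probes".

```python
from dataclasses import dataclass


@dataclass
class Probe:
    # All field names are hypothetical annotations, not the array's real schema.
    probe_id: str
    genes_hit: tuple            # genes matched at > 90% similarity
    exact_unique_match: bool    # probe matches exactly to one gene
    refseq_accessions: int      # RNA accessions recorded for that gene
    chrom: str
    overlaps_snp: bool          # a known SNP lies inside the probe sequence
    distance_to_5prime: int     # bp from the probe to the 5' end of the gene


def select_probes(probes):
    """Return one retained probe per gene, keyed by gene name."""
    kept = {}
    for probe in probes:
        if len(probe.genes_hit) != 1 or not probe.exact_unique_match:
            continue  # ambiguous or inexact mapping
        if probe.refseq_accessions != 1:
            continue  # multiple splice forms or annotated start sites
        if probe.overlaps_snp or probe.chrom in {"X", "Y"}:
            continue  # SNP inside the probe, or sex chromosome
        gene = probe.genes_hit[0]
        current = kept.get(gene)
        # Simplification: among several probes, keep the one nearest the 5' end.
        if current is None or probe.distance_to_5prime < current.distance_to_5prime:
            kept[gene] = probe
    return kept
```

Keeping the filters as independent early exits makes it easy to count how many probes each criterion removes.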
Investigation and Correction of population structure
The EIGENSTRAT method [65] was used to correct for stratification. Principal components
analysis (PCA) was applied to the SNP genotypes, and the PC axes (eigenvectors)
contributing to the genetic variation were inferred from it. These eigenvectors were used
as covariates in the association and heritability analyses (Figure 5). The axes of
variation greatly reduced the complexity and dimensionality of the data while explaining
the maximum variability in it. The eigenvectors were obtained by carrying out the PCA in
the Golden Helix software (webref 1). The scores of the 10 eigenvectors obtained from the
PCA are shown in Appendix-S2.
Performing Association studies between the gene expression traits and SNP marker
Genotypes
Each of the ~2.2 million SNPs was tested against each of the selected probes in a simple
linear regression (LR) model [71], also called the single-SNP model, in which the
expression of each probe is regressed on one SNP at a time. Let Xi be the ith
individual's genotype for a given SNP. The genotype takes one of three values: Xi = 2, 1,
0 for the common homozygote (AA), the heterozygote (Aa) and the rare homozygote (aa)
respectively. A linear regression was fitted for this additive model:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
where Yi is the quantitative trait (i.e. the log-normalized probe expression value) of
the ith individual, i = 1, 2, 3, …, 210, β0 is the intercept, β1 is the additive effect
of the SNP, and εi are random residuals that follow a normal distribution with mean 0 and
constant variance. Only SNPs with a minor allele frequency (MAF) greater than 0.05 were
used for association testing. To classify the significant association signals as cis or
trans, the distance between the midpoint of the probe and the genomic location of the SNP
was assessed. An association between a SNP and a probe was called a 'cis association' if
this distance was less than or equal to 1 Mb [72]; otherwise it was considered a trans
association. The association analysis was performed using the PLINK software [54].
The screenshots and a few of the commands used in the association studies are detailed
in Appendix-S3. The association testing was carried out separately on each of the
population groups and also on the pooled samples (multipopulation). In the
multipopulation analysis, PCA axes were incorporated as covariates in the linear
regression model to control for population structure. The pooled samples were used
mainly to increase the power of the analysis (especially to detect hidden
cis-association signals).
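The single-SNP model and the cis/trans rule can be illustrated with a few lines of plain Python. This is a minimal sketch, not the PLINK implementation: it fits the regression without covariates, and the function and argument names are assumptions made for the example.

```python
from statistics import mean

CIS_WINDOW_BP = 1_000_000  # 1 Mb, the cis threshold stated in the text


def fit_additive_model(genotypes, expression):
    """Least-squares fit of expression = b0 + b1 * genotype.

    genotypes: 0/1/2 allele counts, one per individual (must not be constant).
    expression: log-normalized expression values in the same order.
    Returns (intercept, slope, r_squared).
    """
    x_bar, y_bar = mean(genotypes), mean(expression)
    sxx = sum((x - x_bar) ** 2 for x in genotypes)
    syy = sum((y - y_bar) ** 2 for y in expression)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(genotypes, expression))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = (sxy * sxy) / (sxx * syy)
    return intercept, slope, r_squared


def classify_association(snp_chrom, snp_pos, probe_chrom, probe_start, probe_end):
    """Label a SNP-probe pair 'cis' or 'trans' by distance to the probe midpoint."""
    if snp_chrom != probe_chrom:
        return "trans"
    midpoint = (probe_start + probe_end) / 2
    return "cis" if abs(snp_pos - midpoint) <= CIS_WINDOW_BP else "trans"
```

In the multipopulation setting the eigenvector scores would enter as additional regressors, which turns the fit into a multiple regression rather than the closed form used here.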
Correcting the association signals by multiple testing
Multiple-test correction was carried out using the following three methods:
1. Bonferroni correction [73],
2. False Discovery Rate (FDR) approach [74],
3. Sidak correction [75].
Each of these three approaches was used for both the cis and the genome-wide analysis.
The Bonferroni correction was too conservative: it failed to identify many true
association signals, thereby giving rise to type II errors. This can be clearly seen in
Table 2. The Sidak method was less conservative than the Bonferroni correction, but it
could be used only if all the genes were independently regulated. Hence the FDR
correction was considered to be the better approach.
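The three corrections can be compared on the same list of p-values with the sketch below. It assumes the FDR approach is the Benjamini-Hochberg step-up procedure, which the text does not state explicitly; the function names are illustrative.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def sidak(p_values):
    """Sidak adjustment, exact when the tests are independent."""
    m = len(p_values)
    return [1.0 - (1.0 - p) ** m for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (FDR), in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

For any input, the Sidak and Benjamini-Hochberg adjusted values are never larger than the Bonferroni ones, which is why Bonferroni rejects the fewest hypotheses.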
Heritability analysis
The heritability estimates of the gene expression traits, and of the eQTLs associated
with those traits, were obtained by Restricted Maximum Likelihood (REML) analysis [57,76]
in the GCTA program [77]. For the heritability of the eQTLs, a simple approach was used
to estimate the proportion of the phenotypic variation explained by a SNP. For each
locus, only the SNP showing the highest significance (lowest FDR-corrected p-value) was
chosen, and the variance explained by it was calculated by the following equation:
However, the above equation could explain only a fraction of the variance in the gene
expression levels, and this was not necessarily the same as heritability (because the
denominator var(y) did not account for population structure), although the two were
related to some extent. The heritability estimates obtained for some of the eQTLs with
the help of the above equation and REML are shown in Table 3.
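For a single SNP in a simple linear regression, the proportion of phenotypic variance explained equals the squared correlation between genotype and expression. A sketch of that standard identity is given below. The symbols are assumptions introduced here: β̂ is the estimated additive effect, x the genotype coded 0, 1, 2, and p the allele frequency; the second form additionally assumes Hardy-Weinberg equilibrium. It is not necessarily the exact form used in this analysis.

```latex
\[
R^2_{\mathrm{SNP}}
  \;=\; \frac{\hat{\beta}^{2}\,\mathrm{var}(x)}{\mathrm{var}(y)}
  \;\approx\; \frac{2p(1-p)\,\hat{\beta}^{2}}{\mathrm{var}(y)}
\]
```

Because var(y) is the raw phenotypic variance of the pooled sample, any expression differences between populations inflate the denominator, which is the caveat noted above.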
During the heritability analysis, the population structure arising from the HapMap
populations was corrected by estimating relationships within each individual population
separately. These relationship matrices were merged by setting the relationships between
individuals of different populations to zero [77]. The adjusted relationship matrix was
used to analyse the heritability of the gene expression traits in the mixed populations.
The first two eigenvectors (from the PCA) [65], which showed the maximum stratification,
were used as covariates in the heritability analysis. The screenshots and commands used
in the GCTA program for the calculation of the heritability estimates are given in
Appendix-S6.
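Merging the per-population matrices with zero between-population relationships amounts to building a block-diagonal matrix. A minimal sketch, assuming each matrix is a plain list of lists and that individuals are ordered population by population (the function name is illustrative, not a GCTA command):

```python
def merge_block_diagonal(matrices):
    """Combine per-population relationship matrices into one matrix.

    matrices: list of square matrices (lists of lists), one per population.
    Relationships between individuals of different populations are set to 0.
    """
    total = sum(len(m) for m in matrices)
    merged = [[0.0] * total for _ in range(total)]
    offset = 0
    for m in matrices:
        n = len(m)
        for i in range(n):
            for j in range(n):
                merged[offset + i][offset + j] = m[i][j]
        offset += n
    return merged
```

For the four HapMap groups this would give a 210 x 210 matrix whose only non-zero entries lie in the four within-population blocks.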
Statistical approach to REML
The gene expression value of a particular individual depends on the total genetic effect
of that individual, which is determined by the number of causal variants and their scaled
additive effects. This can be represented by the following equation [57].
In general, the equation becomes:
where g = z*u
The scaled additive effect (u) is taken as a random effect. The variance of the total
additive genetic effects depends on the number of causal variants and the variance of
the causal additive effects. This can be explained by the following equation:
Thus the variance in the gene expression values can be broken down into two parts: the
variance due to the additive genetic effects (σg²) and the residual variance (σe²).
G is a very important term in the above equation. G represents the genetic relationship
matrix between individuals of the different population groups. This matrix is very
important for adjusting for stratification effects. Equation 3 is very similar to the
classical description of heritability.
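The model described in this section corresponds to the standard mixed linear model used for SNP-based heritability estimation. A sketch in conventional notation follows. The symbols μ (mean), e (residual), m (number of causal variants), the standardized genotype z_ij of individual j at causal variant i, and the equation numbering are assumptions made here to match the description, not the original typesetting.

```latex
\begin{align}
y_j &= \mu + g_j + e_j, \qquad g_j = \sum_{i=1}^{m} z_{ij}\,u_i \tag{1}\\
\mathbf{y} &= \mu\mathbf{1} + \mathbf{g} + \mathbf{e}, \qquad
  \mathbf{g} = \mathbf{Z}\mathbf{u}, \quad
  \mathbf{u} \sim N(\mathbf{0}, \mathbf{I}\sigma_u^2), \quad
  \mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma_e^2) \tag{2}\\
\mathrm{var}(\mathbf{y}) &= \mathbf{Z}\mathbf{Z}'\sigma_u^2 + \mathbf{I}\sigma_e^2
  = \mathbf{G}\sigma_g^2 + \mathbf{I}\sigma_e^2, \qquad
  \mathbf{G} = \frac{\mathbf{Z}\mathbf{Z}'}{m}, \quad
  \sigma_g^2 = m\,\sigma_u^2 \tag{3}
\end{align}
```

Under this decomposition the heritability is the usual ratio $h^2 = \sigma_g^2 / (\sigma_g^2 + \sigma_e^2)$, which is why the variance partition resembles the classical definition.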
However, it is very difficult to predict anything about the causal variants: is a given
SNP the actual causal variant, or is it merely in LD with the causal variant? We cannot
tell how many causal variants are present among the ~2.2 million genotyped SNPs, or
which ones they are. Therefore a relationship matrix based on the genotyped SNPs (A) is
calculated. This matrix is then used in the REML analysis to obtain the heritability
estimates.
A is calculated for each SNP and the weighted average is taken across all the SNPs.
Finally a common genome-wide relationship matrix (A) is obtained by combining the
individual matrices and is used in the REML analysis. For eQTL heritability, a separate
relationship matrix was constructed from the particular locus or chromosomal region with
which the gene is associated. This region of the genome consisted of the SNPs showing
significant association with that gene. For example, if 10 SNPs (covering a region of
100 kb in the genome) showed a strong association signal with a particular gene (say,
Gene X), that segment of the chromosome was used to construct the relationship matrix.
This relationship matrix was fitted in the REML analysis to estimate the variation
explained by that particular locus, thereby giving the eQTL heritability estimate.
Results
Based on the strategy described in the Methods for selecting the genes to study in the
GWAS, I was left with a data set of 11,446 gene expression values. Each of these
expression values was treated as a phenotype and used in the further analyses. As
mentioned before, the main aim of the study was to find the SNPs that were significantly
associated with the variation in the gene expression phenotypes and to estimate the
heritability of these phenotypes. Along with the heritability of the gene expression
traits, the heritability of the eQTLs (cis and trans) significantly associated with the
expression variation was also measured. About 2.2 million SNPs were selected from the
HapMap Phase II project [23,24] for the association and heritability analyses. The minor
allele frequencies of the selected SNPs were greater than 5% in each of the four
unrelated populations (CEU, YRI, CHB, JPT).
Addressing the problem of Population Stratification
Since the HapMap samples used in this study comprised individuals from four different
populations, there was a high possibility of population stratification giving rise to
false association signals and biased heritability results. Hence the principal component
method [65] was employed to control the effects of population stratification. In this
approach the principal components of the SNPs across the genome were calculated and
treated as covariates in the regression analysis. The presence of population structure
was clearly detected by plotting the first two components (Figure 5). These two
components were used to adjust the test statistics for the markers that gave rise to the
stratification. The p-value distribution therefore takes account of the correction. The
adjustment made after the stratification correction is shown for the gene 'hmm28636'
(Figure 6): the distribution of the majority of the p-values is in agreement with random
expectation (i.e. a model without any significant associations, with the p-values
expected under the null hypothesis). The genomic inflation factor (based on the median
chi-squared) was 1 for all the genes, indicating that there was no residual population
stratification after the correction.
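The genomic inflation factor based on the median chi-squared can be computed from a list of association p-values as sketched below. The sketch assumes each p-value comes from a two-sided test with one degree of freedom and lies in the interval (0, 1]; the function name is illustrative.

```python
from statistics import NormalDist, median

_NORMAL = NormalDist()

# Median of a chi-squared distribution with 1 degree of freedom (about 0.455).
CHI2_1DF_MEDIAN = _NORMAL.inv_cdf(0.75) ** 2


def genomic_inflation_factor(p_values):
    """Ratio of the observed median chi-squared statistic to its null median."""
    # A two-sided p-value p corresponds to a 1-df chi-squared of z**2,
    # where z is the standard normal quantile at p / 2.
    chi2 = [_NORMAL.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN
```

A value close to 1 indicates that the bulk of the test statistics follows the null distribution, while values well above 1 point to residual stratification or other confounding.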
Cis Associations of SNPs with the gene Expression Phenotypes and the distribution
of Cis-Acting eQTLs.
The association analyses between the variation in the SNPs and the variation in gene
expression, for each of the 11,446 gene expression traits from 210 unrelated
individuals, were performed using a linear regression model (the additive model described
in the Methods). The association analysis was carried out using PLINK [54]. To analyse
the SNPs that had a pronounced effect on the measured mRNA levels in cis, only SNPs
situated within 1 Mb upstream or downstream of the midpoint of the expression probe were
considered. This is referred to as the cis candidate region approach [72]. In the case of
large genes (> 500 kb), the region from 500 kb upstream of the transcription start site
(TSS) to 500 kb downstream of the transcription end site (TES) was regarded as the cis
candidate region. To avoid false positives and to constrain the study-wise significance
level, the p-values obtained from the association testing were corrected for multiple
testing using the False Discovery Rate (FDR) method. An FDR of 5% was set as the
threshold for the association analysis, and all SNPs with a significant adjusted p-value
(FDR < 0.05) were recorded and assumed to be significantly associated with the variation
in the gene expression trait. Significant cis associations were observed in 579 genes
(Appendix-S4). A total of 611 (5.33%) genes were found to have at least one SNP with a
p-value < 1 x 10^-7. In the majority of cases, the location of the eQTL could be
predicted because the SNPs that were significantly associated with mRNA levels were
localized to a restricted region. Figure 7 shows the p-values for three such genes. In
cis association, the positions of the genes and the eQTLs coincide, producing the cis
diagonal (Figures 7 and 8).
The reasons for choosing Phase II HapMap Individuals
Significant cis associations for certain genes identified with the Phase I HapMap data
were compared with those identified with the Phase II HapMap data. The Phase I data
identified 82% of the genes detected with Phase II. Figure 9 shows an example of one such
gene. This could be due to the slow decay of linkage disequilibrium. Thus, instead of
detecting the diversity among the common haplotypes, the Phase II HapMap detected
additional variation in genotypes compared to the Phase I HapMap.
Positions of the SNPs relative to Transcription start Site.
Because of the high density of SNPs in the HapMap, it is possible that several SNPs
included in the analysis were in fact the causal variants. Hence the SNPs with
significant cis associations for each gene in the pooled (multipopulation) analysis were
mapped relative to the transcription start site (TSS) of the gene. These cis-associated
SNPs were found to lie very close to the TSS (Figure 11).
Usefulness of Multi-population over Single-population analysis.
The analysis focussed on four different unrelated population groups. The association
analysis was initially carried out on each population group separately (single-population
analysis). However, analysing the four population groups together (multipopulation
analysis) was considered more appropriate, because pooling the four groups makes it
possible to detect many hidden association signals. The multipopulation analysis detected
the majority of the population-shared cis-association signals that were identified in the
single-population analysis, and in addition it detected many further cis effects
(Figure 10a). The majority of the effects captured by the multipopulation analysis were
much smaller (R² = 0.10-0.60) than those captured by the single-population analysis
(Figure 10b). In total, 579 cis associations were detected in the pooled sample of four
population groups. Since the effects of population stratification were removed by using
principal components as covariates in the linear regression model, the linear regression
analysis yielded accurate cis-association signals.
Trans- associations of SNPs with the gene Expression Phenotypes
Genetic effects acting in trans could be explored owing to the availability of SNP and
whole-genome expression data. Testing approximately 2.2 million SNPs per population with
a minor allele frequency greater than 0.05 against all the gene expression traits was
statistically and computationally demanding. Once again the linear regression model was
used to test the association between the variation in gene expression and the variation
in the SNPs. To analyse the SNPs that had a pronounced effect on the measured mRNA levels
in trans, only SNPs situated more than 1 Mb upstream or downstream of the midpoint of the
expression probe were considered. These SNPs could be on the same chromosome as the gene
with which they were significantly associated, or on a different chromosome. The
regression analysis showed that 32 genes had significant trans associations
(Appendix-S5). One of the genes with a trans eQTL is shown in Figure 12.
Out of these 32 genes had one eQTL, 2 had two eQTLs and the rest had more than two
eQTLs. Only one gene (gene 'HLA-C') had trans associations on the same chromosome
(distance greater than 1 Mb). There were 8 SNPs that showed strong association with
multiple genes (Table 3). Of these, 2 SNPs showed both cis and trans association with
more than one gene; 4 SNPs were associated with multiple genes in cis; and the remaining
2 SNPs showed trans association with more than one gene. The effects of the trans
associations were much weaker than those of the cis associations. The majority of the
cis associations (60%) had a -logP score greater than 10 (Figure 13).
Heritability of Gene expression traits
The heritability of the gene expression traits and of the eQTLs was estimated by the
REML approach [57,76] (explained in the Methods section). Of the 611 genes that had
eQTLs (cis and trans), only 112 genes had heritability higher than 0.5, 507 genes had
heritability higher than 0.2, and the rest of the genes had heritability lower than 0.2
(Figure 14 and Appendix-S4&S5).
Although the heritability estimates show a modest positive correlation (r = 0.2) with the
cis-association significance (Figure 15), it is also clear from the figure that some of
the gene expression traits with very low heritability estimates (heritability < 0.1) show
significant cis associations, and that the maximum -log10P values of their associated
SNPs are greater than those of some SNPs showing strong cis associations with gene
expression traits that have high heritability (heritability > 0.5).
The heritability estimates of the genes showing significant cis associations were
compared with those of the genes showing strong trans-association signals (Figure 16,
Appendix-S4). There was no significant difference between the two groups (t = -1.0995,
p-value = 0.2792) (Figure 16).
eQTL heritability
The mean heritability of the SNP most strongly associated with each gene expression
trait was 0.16 (s.d. 0.16577, maximum 0.817), compared to a mean heritability of 0.36
(s.d. 0.161, maximum 0.86) for the overall gene expression trait. This suggests that
about 44% of the heritability of gene expression can be explained by the peak SNP.
However, the proportion of variation explained by the cis eQTLs (cis eQTL heritability)
is significantly higher than that explained by the trans eQTLs (trans eQTL heritability)
(p-value = 1.190e-10) (Figure 17 and Appendix-S4&S5).
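The 44% figure is consistent with the ratio of the two mean heritabilities reported above. As a worked check, assuming it was obtained as a ratio of means rather than as a mean of per-gene ratios:

```latex
\[
\frac{\bar{h}^2_{\text{peak SNP}}}{\bar{h}^2_{\text{trait}}}
  = \frac{0.16}{0.36} \approx 0.44
\]
```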
Discussion
A detailed analysis was carried out of the genetic effects giving rise to variation in
the mRNA expression levels of genes measured in lymphoblastoid cell lines. The results
showed that 579 genes had significant cis associations and 32 genes had significant trans
associations with SNPs. It can be assumed that only a small proportion of the functional
regulatory effects could be captured in these four population groups, owing to the
limited power of the analysis. Besides this limitation, the analysis focussed on a single
cell type, so variation arising in other cell types is not represented. It is therefore
possible that a plethora of cis-regulatory variants were obtained from the analysis. The
purpose of the analysis was not limited to the identification of cis-regulatory events
alone: it also attempted to characterise the cis-association signals. The four different
populations used in this analysis gave rise to significant stratification. To avoid
spurious association signals, the stratification was corrected by using the principal
component scores as covariates.
Although there are various ways by which stratification can be controlled (namely
Genomic Control (GC) and Structured Association (SA)), PCA seems to perform much better.
The computational effort required for PCA is much less than for SA and GC [78]. In PCA,
the SNPs showing significant differences in allele frequency between population groups
are corrected to a greater extent. This analysis provides some interesting insights with
respect to the variation in gene expression. The sensitivity of the analysis was greatly
improved by using the pooled sample of four population groups (multipopulation)
(Figure 10). Because of the small sample size in the single-population analysis, the
genetic effects captured by the association testing were large, and this could possibly
have given rise to a wider range of squared correlation coefficients (R² from 0.30 to 1).
With the multipopulation analysis, weaker regulatory effects shared across the four
population groups could be identified. It would have been possible to carry out
conditional permutation calculations, which allow the members of the populations to be
pooled if the population identities or their relationships are known. However, because
the pooled sample consisted of unrelated individuals and the permutation calculations are
complex and time consuming, linear regression was employed along with the PCA approach
for the association analysis. Although outliers can have a pronounced effect on the
p-values, this method allows prediction and the addition of further factors.
Calculations were also much simpler and less time consuming with simple linear
regression. The association testing was carried out for a few genes using the SNPs
belonging to both HapMap Phase I and Phase II. However, owing to the slow decay of
linkage disequilibrium with the causal variants and the inclusion of variants with low
minor allele frequency (less than 5%) [ref], the SNPs from the Phase II HapMap were used
for the association testing. Hence association signals for some of the genes could easily
pass the stringent significance threshold (Figure 9) when the Phase II genotypes were
used. The presence of the majority of the cis-associated SNPs near the TSS (Figure 11)
provides valuable information about the location of cis-regulatory variants in the
genome. It can be said that most of these variants with cis effects are in genic or
immediately intergenic regions of the human genome.
Compared to cis effects, very few genes had significant trans associations with SNPs.
This was expected because trans regulation is considered to be more indirect. It is
possible that a gene is regulated by a large number of SNPs, many of which fail to pass
the stringent significance threshold. The size of the sample used in this analysis also
probably provides less power to detect many trans-association signals. Previous studies
[ref] have also shown that it is more difficult to capture trans effects in humans than
in yeast. As yeast is a unicellular organism, the biological interactions studied in a
single cell can capture all of the other interactions. A human cell, however, is only a
minute part of the whole organism, so the majority of the trans effects mediated by
intercellular events can be difficult to detect. Trans effects may also be shared across
different cell types, thereby diluting them. Finally, the use of a strict significance
threshold could have made the detection of trans eQTLs difficult. A large number of
false negatives could have arisen from the strict significance threshold (FDR < 0.05),
giving rise to fewer trans effects compared to the large number of cis effects. However,
such a stringent threshold was necessary to avoid false positives and to make sure that
the reported eQTLs were indeed true eQTLs.
Our analysis indicates that much of the variation in complex phenotypes in humans can be
explained by cis-regulatory variants, because of their large enrichment relative to the
group of potential trans-regulatory variants. The analysis showed that trans effects were
not as strong as cis effects: most of the -log10P scores greater than 10 belonged to cis
associations (Figure 13). Previous studies in mice [ref] and humans [ref] have shown
similar observations. In spite of the weak trans effects, several distant associations
between SNPs and gene expression phenotypes were observed. Thus it could be said that a
larger sample size would make it possible to predict the trans regulation of transcripts
more accurately. The analysis showed that the median heritability of the gene expression
traits having eQTLs was 34.5%, which is consistent with previous studies [ref]. However,
it was surprising to note that there were about 104 genes with heritability estimates
below 0.2 (Figure 14). These genes were significantly associated with SNPs and had eQTLs.
This raised the important issue of missing heritability among the genes. The significant
associations of individual SNPs with a gene expression trait, and the power to capture
them, depend mainly on the variance that can be explained by the SNPs. This in turn
depends on the linkage disequilibrium between the actual causal variant and the SNPs.
Large effects of rare alleles or small effects of the causal variants are less likely to
explain a large proportion of the variation, and hence would not turn out to be
significant even if the assayed SNPs were in high LD with the actual causal variant.
However, the total effect of the SNPs genotyped in the Phase II HapMap was considered in
the heritability calculations. Thus for a few genes we expect only a small fraction of
the variance in gene expression level to be explained by the SNPs, giving rise to the low
heritability estimates for those genes.
Even with ~2.2 million markers spanning the entire genome, it is possible that several
causal variants still have a very low minor allele frequency (MAF) and poor LD with the
assayed SNPs. As a result, the power to identify such causal variants is greatly reduced
in association studies, which in turn reduces the heritability estimates (i.e. the
variance explained by SNPs) for some of the genes. If there are many causal variants for
a particular gene expression trait, it is likely that each of the majority of the causal
variants explains only a small percentage of the variance. Many GWASs have observed this
phenomenon as a wider distribution of high test statistics across most of the genome
[ref]. Could there be an ascertainment bias in the data analysis and a problem of
population structure in the heritability analysis? For this reason, it was necessary to
assume that individuals of different populations are unrelated and that their genetic
relationships are almost zero. Therefore, to analyse the heritability of gene expression
and eQTLs in the pooled population (multipopulation), the approach was to estimate
relationships within each individual population group separately and then merge all the
relationship matrices by setting the relationships between individuals of different
populations to zero. With this adjusted relationship matrix, heritability estimates were
obtained for the genes and their eQTLs in the mixed populations. The REML analysis was
also performed by fitting the first two eigenvectors from the mixed populations as
covariates. Hence the results were not biased by population stratification. Any
individual having a relationship score > 0.025 with another individual was excluded from
the analysis. The relationships obtained from the SNPs are based on the LD between the
SNPs [ref], and this LD gives rise to significant association signals between SNPs that
are not actual causal variants and the gene expression trait. Since the heritability
estimate of gene expression takes into account the variance explained by all the
genotyped SNPs from the Phase II HapMap project, it is not necessary that individual SNPs
pass strict significance thresholds. The large variation in the heritability of the gene
expression traits depends on the cell state. It is possible that under certain conditions
of stress (hypoxia) or at other developmental stages, the genetic effect on gene
expression is diluted to some extent [ref].
Since heritability is "the proportion of phenotypic variation caused by the additive
genetic factors" [ref], the additive effects of the SNPs were fitted in the REML
analysis. The variation caused by gene-environment interactions and non-additive effects
does not affect the heritability estimates. The main aim of the heritability analysis was
to find the variance in the gene expression traits explained by the SNPs. The analysis
does show that the SNPs explained from as low as 0% to as high as 86% of the variation
among the phenotypes, and that the missing heritability in some of the genes could be due
to the lack of complete LD between the causal variants and the SNPs. However, there were
a few genes which had high heritability ( ) but were not significantly associated with
any SNP. The variation in these gene expression traits could have been regulated by
multiple SNPs or loci, each with a small effect, that collectively regulate the overall
expression. From previous studies [ref], it is known that loci with small effects are
more difficult to capture than those with large effects. Previous studies [ref] have
suggested that the magnitude of the heritability of a gene expression trait depends
heavily on the precision and power of the gene mapping experiments. Thus the detection of
eQTLs is less likely if there is a large environmental variance. The analysis has shown,
however, that the heritability estimate of a gene expression trait does not necessarily
determine the presence or absence of eQTLs for a particular gene.
The analysis showed that cis-eQTLs have higher heritability estimates (median = 0.3592)
than trans-eQTLs (median = 0.1017). There were about 36 eQTLs (35 cis-eQTLs and 1
trans-eQTL) that had large allelic effects and explained more than 50% of the variance in
the gene expression levels (Appendix-S4&S5). This implies that variation in the
expression levels of these genes was also due to small effects of multiple other loci
together with the major cis-eQTL. Several studies in yeast have shown similar
observations [ref].
Conclusion
As the variation explained by individual SNPs is small, the results presented in this
work point to the need for larger association studies (with a larger number of
individuals) to identify significant associations between SNPs and gene expression
traits. The heritabilities of some of the gene expression traits are still low, possibly
due to weak LD between the SNPs and the causal variants. Such causal variants and rare
polymorphisms are likely to be identified by detailed resequencing studies and by the use
of advanced genotyping arrays. The low heritability of some of the genes strongly
associated with eQTLs could be due to a large number of causal variants with very small
effects; to show these effects as significant, it would be beneficial to carry out the
analysis with a large sample size. Therefore one needs to be careful when selecting a
phenotype for fine mapping based only on the heritability estimates. This is mainly
because genes with low heritability (< 0.2) may still have significant associations with
SNPs. Although some rare alleles with large effects can explain a small proportion of the
variation in gene expression, a large sample size would still be required for these
effects to be statistically significant. A comprehensive analysis of the variation in
gene expression phenotypes among 210 unrelated individuals from four different population
groups has been described. Detailed genetic characteristics and the approximate positions
of the cis and trans effects across the human genome have been studied in this work.
The results and the detailed analysis of the heritability and eQTL characteristics of
the gene expression traits could provide valuable insights and a robust framework for
further downstream analysis and for future studies of gene expression variation among
large population groups, each with a large number of individuals (large sample size),
using different types of cells and tissues and involving SNPs with MAF < 0.001 and
different SNP densities. I strongly believe that some of the observations presented in
this study can be used to interpret the association signals found in some diseases and
to identify the biological effects of the SNPs, or of a particular region (locus) of the
genome, that show significant association with the disease state. Thus it would be
possible to explain the functional variation taking place in the genome and how this
variation could lead to variation in phenotypes across human populations.
References:
1. Kliebenstein, D.J. (2009) "Quantitative Genomics; analyzing intra-specific variation
using global gene expression polymorphisms or eQTLs." Annual Reviews Plant Biology
60(1): 93-114.
2. Lewin, B. (2008). Genes IX. Sudbury, MA ; London, Jones and Bartlett.
3. Monks, S. A., A. Leonardson, et al. (2004). "Genetic inheritance of gene expression in
human cell lines." Am J Hum Genet 75(6): 1094-105.
4. Morley, M., C. M. Molony, et al. (2004). "Genetic analysis of genome-wide variation in
human gene expression." Nature 430(7001): 743-7.
5. Cheung, V. G., R. S. Spielman, et al. (2005). "Mapping determinants of human gene
expression by regional and genome-wide association." Nature 437(7063): 1365-9.
6. Stranger, B. E., M. S. Forrest, et al. (2007). "Relative impact of nucleotide and copy
number variation on gene expression phenotypes." Science 315(5813): 848-53.
7. Stranger, B. E., A. C. Nica, et al. (2007). "Population genomics of human gene
expression." Nat Genet 39(10): 1217-24.
8. Eckhardt, F., J. Lewin, et al. (2006). "DNA methylation profiling of human chromosomes
6, 20 and 22." Nat Genet 38(12): 1378-85.
9. Petronis, A. (2006). "Epigenetics and twins: three variations on the theme." Trends Genet
22(7): 347-50.
10. Gibson, G. (2008). "The environmental contribution to gene expression profiles." Nat
Rev Genet 9(8): 575-81.
11. Idaghdour, Y., J. D. Storey, et al. (2008). "A genome-wide gene expression signature of
environmental geography in leukocytes of Moroccan Amazighs." PLoS Genet 4(4):
e1000052.
12. Hartman, J. L. t., B. Garvik, et al. (2001). "Principles for the buffering of genetic
variation." Science 291(5506): 1001-4.
13. Jin, W., R. M. Riley, et al. (2001). "The contributions of sex, genotype and age to
transcriptional variance in Drosophila melanogaster." Nat Genet 29(4): 389-95.
14. Brem, R. B., G. Yvert, et al. (2002). "Genetic dissection of transcriptional regulation in
budding yeast." Science 296(5568): 752-5.
15. Schadt, E. E., S. A. Monks, et al. (2003). "Genetics of gene expression surveyed in maize,
mouse and man." Nature 422(6929): 297-302.
16. Stranger, B. E. and E. T. Dermitzakis (2005). "The genetics of regulatory variation in the
human genome." Hum Genomics 2(2): 126-31.
17. Stranger, B. E., M. S. Forrest, et al. (2005). "Genome-wide associations of gene
expression variation in humans." PLoS Genet 1(6): e78.
18. Boone, C., H. Bussey, et al. (2007). "Exploring genetic interactions and networks with
yeast." Nat Rev Genet 8(6): 437-49.
19. Storey, J. D., J. Madeoy, et al. (2007). "Gene-expression variation within and among
human populations." Am J Hum Genet 80(3): 502-9.
20. Hartl, D. L. and A. G. Clark (2007). Principles of population genetics. Sunderland, Mass.,
Sinauer Associates.
21. Levy, S., G. Sutton, et al. (2007). "The diploid genome sequence of an individual
human." PLoS Biol 5(10): e254.
22. International HapMap Consortium (2003). "The International HapMap Project." Nature
426(6968): 789-96.
23. International HapMap Consortium (2005). "A haplotype map of the human genome."
Nature 437(7063): 1299-320.
24. International HapMap Consortium (2007). "A second generation human haplotype map
of over 3.1 million SNPs." Nature 449(7164): 851-61.
25. Li MY, Reilly MP, Rader DJ, Wang LS (2010) Correcting population stratification in
genetic association studies using a phylogenetic approach. Bioinformatics 26(6): 798–806.
26. Hamer D, Sirota L (2000) Beware the chopsticks gene. Mol Psychiatry 5: 11–13.
27. Serre,D. et al. (2008) Correction of population stratification in large multi-ethnic
association studies. PLoS ONE, 1, e1382.
28. Devlin, B., K. Roeder, and L. Wasserman (2001). Genomic Control, a New Approach to
Genetic-Based Association Studies. Theoretical Population Biology 60 (3), 155-166.
29. Devlin, B., S. Bacanu, and K. Roeder (2004). Genomic Control to the extreme. Nature
Genetics 36 (11), 1129-1130.
30. Devlin, B. and K. Roeder (1999). Genomic Control for Association Studies. Biometrics
55 (4), 997-1004.
31. Marchini, J., L. Cardon, M. Phillips, and P. Donnelly (2004). The effects of human
population structure on large genetic association studies. Nature Genetics 36, 512-517.
32. Zhang, F., Y. Wang, and H.-W. Deng (2008). Comparison of population-based
association study methods correcting for population stratification. PLoS ONE 3 (10),
e3392
33. Pritchard,J.K. and Rosenberg,N.A. (1999) Use of unlinked genetic markers to detect
population stratification in association studies. Am. J. Hum. Genet., 65, 220–228.
34. Li,Q. and Yu,K. (2008) Improved correction for population stratification in genome wide
association studies by identifying hidden population structures. Genet. Epid., 32, 215–
226.
35. Pearson TA, Manolio TA (March 2008). "How to interpret a genome-wide association
study". J. Am. Med. Ass. 299 (11): 1335–44.
36. Hunter DJ, Altshuler D, Rader DJ (June 2008). "From Darwin's Finches to Canaries in
the Coal Mine — Mining the Genome for New Biology". N. Engl. J. Med. 358 (26):
2760–63.
37. Doerge RW (2002). "Mapping and analysis of quantitative trait loci in experimental
populations". Nat. Rev. Genet. 3:42-52.
38. Rachel B Brem, Gael Yvert, Rebecca Clinton, and Leonid Kruglyak (April 2002).
"Genetic dissection of transcriptional regulation in budding yeast." Science,
296(5568):752–755.
39. Eric E Schadt, Stephanie A Monks, Thomas A Drake, Aldons J Lusis, Nam Che,
Veronica Colinayo, Thomas G Ruff, Stephen B Milligan, John R Lamb, Guy Cavet, Peter
S Linsley, Mao Mao, Roland B Stoughton, and Stephen H Friend. (March 2003).
"Genetics of gene expression surveyed in maize, mouse and man." Nature,
422(6929):297–302.
40. Susanna Wang, Nadir Yehya, Eric E Schadt, Hui Wang, Thomas A Drake, and Aldons J
Lusis (February 2006). "Genetic and genomic analysis of a fat mass trait with complex
inheritance reveals marked sex specificity." PLoS Genet, 2(2):e15.
41. Ronald J, Brem R, Whittle J, Kruglyak L (2005). "Local Regulatory variation in
Saccharomyces cerevisiae". PLoS Genet 1:e25.
42. GuhaThakurta D, Xie T, Anand M, Edwards S, Li G, et al. (2006). "Cis-regulatory
variations: A study of SNPs around genes showing cis-linkage in segregating mouse
populations". BMC Genomics 7:235.
43. Cheung V, Conlin L, Weber T, Arcaro M, Jen K, et al. (2003) Natural variation in human
gene expression assessed in lymphoblastoid cells. Nat Genet 33: 422–425.
doi:10.1038/ng1094.
44. Dixon A, Liang L, Moffatt M, Chen W, Heath S, et al. (2007) A genome-wide
association study of global gene expression. Nature Genetics 39: 1202–1207.
45. Göring H, Curran J, Johnson M, Dyer T, Charlesworth J, et al. (2007) Discovery of
expression QTLs using large-scale transcriptional profiling in human lymphocytes.
Nature Genetics 39: 1208–1216.
46. Emilsson V, Thorleifsson G, Zhang B, Leonardson A, Zink F, et al. (2008) Genetics of
gene expression and its effect on disease. Nature 452: 423–428.
47. Dunning, M. J., M. L. Smith, et al. (2007). "beadarray: R classes and methods for
Illumina bead-based data." Bioinformatics 23(16): 2183-4
48. Kent WJ (2002) BLAT–the BLAST-like alignment tool. Genome Res 12:656–64.
49. Pruitt K, Tatusova T, Maglott D (2007) NCBI reference sequences (Ref-Seq): a curated
non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids
Research 35: D61–D65.
50. Veyrieras B, Kudaravalli S, Kim SY, Dermitzakis ET, Gilad Y, Stephens M, Pritchard
JK. 2008. High-resolution mapping of expression-QTLs yields insight into human gene
regulation. PLoS Genet 4:e1000214.
51. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. (June
2002) "The human genome browser at UCSC". Genome Res 12(6):996-1006.
52. McCarroll, S.A. et al. (2006) Common deletion polymorphisms in the human genome.
Nat. Genet. 38, 86-92.
53. Spielman RS, Bastone LA, Burdick JT, Morley M, Ewens WJ, et al. (2007) Common
genetic variants account for differences in gene expression among ethnic groups. Nat
Genet 39: 226–231.
54. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, et al. (2007) PLINK: a tool
set for whole-genome association and population-based linkage analyses. Am J Hum
Genet 81: 559–575.
55. Aulchenko YS, Ripke S, Isaacs A, van Duijn CM. (May 2007). "GenABEL: an R library
for genome-wide association analysis". Bioinformatics 23(10):1294-6.
56. Veyrieras J-B, Kudaravalli S, Kim SY, Dermitzakis ET, Gilad Y, Stephens M, Pritchard
JK (2008) High-Resolution Mapping of Expression-QTLs Yields Insight into Human
Gene Regulation. PLoS Genet 4:e1000214.
57. Yang, et al, (2010) Common SNPs explain a large proportion of the heritability for
human height, Nature Genetics online June 2010.
58. Maher, B. (2008) Personal genomes: The case of the missing heritability. Nature 456, 18–
21.
59. Manolio, T.A. et al. (2009) Finding the missing heritability of complex diseases. Nature
461, 747–753.
60. Frazer, K.A., Murray, S.S., Schork, N.J. & Topol, E.J. (2009) Human genetic variation
and its contribution to complex traits. Nat. Rev. Genet. 10, 241–251.
61. Pritchard, J.K. (2001) Are rare variants responsible for susceptibility to complex diseases?
Am. J. Hum. Genet. 69, 124–137.
62. Johannes, F., Colot, V. & Jansen, R.C. (2008) Epigenome dynamics: a quantitative
genetics perspective. Nat. Rev. Genet. 9, 883–890.
63. Johannes, F. et al. (2009) Assessing the impact of transgenerational epigenetic variation
on complex traits. PLoS Genet. 5, e1000530.
64. Hayes, B.J., Visscher, P.M. & Goddard, M.E. (2009) Increased accuracy of artificial
selection by using the realized relationship matrix. Genet. Res. 91, 47–60.
65. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, et al. (2006) Principal
components analysis corrects for stratification in genome-wide association studies. Nat
Genet 38: 904–909.
66. Kadarmideen H.N. (2008). Genetical systems biology in Livestock – Application to
GnRH and Reproduction. IET Systems Biology 2: 423-441.
67. Emilsson V, Thorleifsson G, Zhang B, et al. Genetics of gene expression and its effect on
disease. Nature 2008;452(7186):423–8.
68. Dixon AL, Liang L, Moffatt MF, et al. A genome-wide association study of global gene
expression. Nat Genet 2007;39(10):1202–7
69. Goring HH, Curran JE, Johnson MP, et al. Discovery of expression QTLs using large-
scale transcriptional profiling in human lymphocytes. Nat Genet 2007;39(10):1208–16.
70. Schadt EE, Molony C, Chudin E, et al. Mapping the genetic architecture of gene
expression in human liver. PLoS Biol 2008;6(5):e107.
71. Meng, J. F. & Fingerlin, T. E. 2008: Linear models for analysis of multiple single
nucleotide polymorphisms with quantitative traits in unrelated individuals. — Ann. Zool.
Fennici 45: 429–440
72. Stranger BE, Nica AC, Forrest MS, Dimas A, Bird CP, Beazley C, Ingle CE, Dunning M,
Flicek P, Koller D, Montgomery S, Tavare S, Deloukas P, Dermitzakis ET, 2007:
Population genomics of human gene expression. Nat Genet, vol. 39(10): 1217-1224.
73. Duggal P, Gillanders EM, Holmes TN, Bailey-Wilson JE. Establishing an adjusted p-
value threshold to control the family-wide type 1 error in genome wide association
studies. BMC Genomics. 2008 Oct 31;9:516.
74. Dudoit, S., J. P. Shaffer and J. C. Boldrick, 2003. Multiple hypothesis testing in
microarray experiments. Stat. Sci. 18 71–103.
75. Holm, S., 1979. A simple sequentially rejective multiple test procedure. Scand. J.
Stat. 6 65–70.
76. Patterson, H.D. & Thompson, R. Recovery of inter-block information when block sizes
are unequal. Biometrika 58, 545–554 (1971).
77. Yang J, Lee SH, Goddard ME and Visscher PM. GCTA: a tool for Genome-wide
Complex Trait Analysis. Am J Hum Genet. 2011 Jan 88(1): 76-82.
78. Price AL, Zaitlen NA, Reich D, Patterson N. New approaches to population stratification
in genome-wide association studies. Nat Rev Genet 2010;11(7):459-463.
79. Falconer, D. S. & Mackay TFC (1996). Introduction to Quantitative Genetics. Fourth
edition. Addison Wesley Longman, Harlow, Essex, UK
80. Daniel L. Hartl and Elizabeth W. Jones, Essential Genetics: A Genomics Perspective.
Sudbury, MA: Jones and Bartlett, 2002.
Webref
1 . http://www.goldenhelix.com/SNP_Variation/index.html
Appendix
S1: BLAT search
In the methods section of the report, it was reported that a BLAT [1] search was carried
out to map the probes to the human RNA sequences [2]. The identity scores were checked. The
location of the probe sequence in the human genome was searched with the help of
BLAT [1]. With the help of the search it was possible to find out whether the probes
mapped to a single or to multiple human RNA sequences. Only the probes with an exact
match to a unique gene were retained. The flowchart of the BLAT [1] search is described
below.
Flowchart of the BLAT search:
1. Paste the probe sequences into the BLAT query search.
2. The BLAT search results reveal the likely location of each probe sequence and the
identity score of the match with different regions of the human genome.
3. Discard the probe sequences with more than 90% identity to multiple regions of the
genome.
4. Of the remaining probe sequences, confirm the location of the probe sequence in the
human genome with the help of the genome browser.
This search can be done manually for a small number of probe sequences. However, for a
large number of probe sequences the process becomes tedious and painstaking. Therefore
the BLAT [1] search was automated. For this purpose the human RNA sequences from RefSeq
(hg19) [2] and the BLAT source and executables [1] were first downloaded. A script was
then used to automate the search process. After comparing the probes against the human
RNA sequences from RefSeq (hg19) [2], the script retained only the probes with
score > 20 and %identity >= 90%.
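The filtering step can be sketched as follows. This is a minimal illustration and not the script used in this work. It assumes that the BLAT hits have been written as a 12-column tab-separated table with the probe name, the target name and the percent identity in the first three columns and the score in the last column; the function name `unique_probes` and the sample rows are hypothetical. A probe is kept only when exactly one target passes both thresholds, which is one possible reading of the retention rule described above.

```python
import csv
import io
from collections import defaultdict

MIN_SCORE = 20.0      # threshold quoted in the text: score > 20
MIN_IDENTITY = 90.0   # threshold quoted in the text: %identity >= 90


def unique_probes(rows):
    """Return {probe: target} for probes whose passing hits share one target.

    `rows` is an iterable of 12-column records (assumed layout):
    probe, target, %identity, ..., score in the last column.
    """
    targets = defaultdict(set)
    for row in rows:
        probe, target = row[0], row[1]
        identity, score = float(row[2]), float(row[11])
        if score > MIN_SCORE and identity >= MIN_IDENTITY:
            targets[probe].add(target)
    # Keep a probe only if exactly one target passed the thresholds.
    return {p: next(iter(t)) for p, t in targets.items() if len(t) == 1}


# Hypothetical hits, for illustration only.
example = (
    "probe_A\tNM_000001\t100.0\t50\t0\t0\t1\t50\t10\t59\t1e-20\t99.0\n"
    "probe_B\tNM_000002\t98.0\t50\t1\t0\t1\t50\t5\t54\t1e-18\t92.0\n"
    "probe_B\tNM_000003\t96.0\t50\t2\t0\t1\t50\t7\t56\t1e-16\t88.0\n"
    "probe_C\tNM_000004\t85.0\t50\t7\t0\t1\t50\t3\t52\t1e-05\t40.0\n"
)
kept = unique_probes(csv.reader(io.StringIO(example), delimiter="\t"))
print(kept)
```

In this example probe_B is dropped because two targets pass the thresholds, and probe_C is dropped because its only hit falls below 90% identity.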
S2: Table showing 10 eigenvectors scores obtained from the PCA analyses.
S3: Association studies
Genome-wide association analysis was performed using PLINK [3]. Linear regression
analysis for the association testing was carried out with the help of this software. The
basic files required for the association testing were:
1. "Pedigree (PED) file – contains genotype information" [3].
2. "Map (MAP) files – position and name of the markers present in PED file" [3].
Running the association analysis with the help of these two files was time consuming and
computationally intensive. Hence, these files were first converted into binary files, which
were then used for association testing.
The commands used for the association testing in PLINK:
Here 'bfile hapmap23maf0.05' reads the genotypes, names and positions of the markers in
the binary format. 'df1.xls' contains all the gene expression values. Only the markers
with MAF > 0.05 are taken for the analysis. '210.cov' contains the eigenvectors in the
form of covariates. Only p-values <= 1e-5 are shown after the analysis. The p-values are
then adjusted for multiple testing.
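As an illustration of what such an adjustment does, the sketch below applies the Bonferroni and the Holm step-down corrections to a list of p-values. It is not PLINK's implementation and does not claim to reproduce the adjusted values reported in this work; the function names and the example p-values are hypothetical.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def holm(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is scaled by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * pvalues[i])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


# Hypothetical raw p-values for four SNP-gene tests.
raw = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(raw))
print(holm(raw))
```

The Holm procedure is never more conservative than Bonferroni, which is why it is often preferred when the number of tests is large.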
Output of an association test for gene 'ZNF266' (figure S3-1 and S3-2) from PLINK:
S4: Genes showing significant Cis associations and their heritability estimates
S5: Genes showing significant trans association and their heritability estimates:
S6: Estimating the heritability of the gene expression trait and eQTLs.
The heritability calculation by the GCTA program [4] requires the same input files as
PLINK [3] (i.e. 'PED' and 'MAP' files in binary format). The heritability calculations were
performed in 3 steps:
1. "Estimating the genetic relationship matrix (GRMs) between the individuals from the
SNP markers" [4]. Markers from the sex chromosomes were excluded from the
calculations.
Command used:
gcta --bfile hapmap23maf0.05 --autosome --maf 0.05 --make-grm --out hapmap23maf0.05grm
The above command uses the binary genome-wide SNPs (MAF > 0.05) from all 22
autosomes and constructs the GRM between each pair of individuals.
2. Estimating the genetic relationship matrix (GRMs) between the individuals from the
locus (100 – 1000 kb region) that is significantly associated with a particular gene
(for eQTL heritability). Markers from the sex chromosomes were excluded from the
calculations.
Commands used:
gcta --bfile hapmap23maf0.05 --extract gene.snp.list --autosome --maf 0.05 --make-grm --out hapmap23maf0.05genegrm
The above command uses only the autosomal SNPs (MAF > 0.05) from the selected locus
known to be associated with a particular gene and constructs the GRM between each pair
of individuals. The GRMs were constructed separately for each of the four population
groups and merged together in a file named 'multi_grm.txt'.
3. Estimating the variance explained by all the SNPs (heritability of the gene
expression trait) and by a significantly associated locus (~5 or more significantly
associated SNPs within a span of 1000 kb):
Commands used:
gcta --reml --mgrm multi_grm.txt --pheno gene.phen --qcovar hapmap23maf0.05_10PCs.txt --out hapmap23maf0.05gene_mgrm
The above command fits both the GRMs constructed in steps 1 and 2 into the REML
model. 'gene.phen' contains the expression values from all 210 individuals.
'hapmap23maf0.05_10PCs.txt' contains the 10 eigenvectors. The merged GRMs and the
eigenvectors (in the form of covariates) were used in the REML analysis to correct for
population structure.
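The genetic relationship matrix built in steps 1 and 2 can be sketched as below. This is a simplified illustration and not GCTA's code: it assumes genotypes coded as 0/1/2 allele counts with no missing values, uses the standardised-genotype estimator A_jk = (1/N) * sum_i (x_ij - 2p_i)(x_ik - 2p_i) / (2p_i(1 - p_i)) for every pair of individuals including the diagonal, and the function name and toy genotypes are hypothetical.

```python
def genetic_relationship_matrix(genotypes, min_maf=0.05):
    """Build a GRM from allele counts.

    `genotypes[i][j]` is the 0/1/2 allele count of SNP i in individual j.
    """
    n_ind = len(genotypes[0])
    grm = [[0.0] * n_ind for _ in range(n_ind)]
    n_used = 0
    for snp in genotypes:
        p = sum(snp) / (2.0 * n_ind)           # allele frequency of this SNP
        if p in (0.0, 1.0) or min(p, 1.0 - p) < min_maf:
            continue                           # skip monomorphic / low-MAF SNPs
        var = 2.0 * p * (1.0 - p)              # expected genotype variance
        centred = [x - 2.0 * p for x in snp]
        for j in range(n_ind):
            for k in range(n_ind):
                grm[j][k] += centred[j] * centred[k] / var
        n_used += 1
    if n_used == 0:
        raise ValueError("no SNP passed the MAF filter")
    # Average the per-SNP contributions over the SNPs that were used.
    return [[value / n_used for value in row] for row in grm]


# Toy data: 3 SNPs (rows) in 3 individuals (columns); the last SNP is monomorphic.
toy = [[0, 1, 2], [2, 1, 0], [0, 0, 0]]
for row in genetic_relationship_matrix(toy):
    print(row)
```

Restricting the rows of `genotypes` to the SNPs of one associated locus gives the locus-specific matrix of step 2, in the same way that `--extract` restricts the SNP set on the command line.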
Output obtained from the GCTA for gene ‘ZNF266’:
where V(1) = variance explained by the eQTL
V(2) = variance explained by all the SNPs
V(e) = Environmental variance.
V(p) = Phenotypic variance
V(1)/V(p) = heritability estimate of the eQTL
V(2)/V(p) = heritability estimate of the gene expression phenotypes.
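The two ratios defined above can be computed as in the following sketch. It assumes that the phenotypic variance is the sum of the three reported components, V(p) = V(1) + V(2) + V(e); the function name and the numbers are hypothetical and are not values from this study.

```python
def heritability_estimates(v_eqtl, v_all_snps, v_env):
    """Return (V(p), V(1)/V(p), V(2)/V(p)) from the variance components."""
    v_p = v_eqtl + v_all_snps + v_env   # assumed: V(p) is the sum of the components
    if v_p <= 0:
        raise ValueError("phenotypic variance must be positive")
    return v_p, v_eqtl / v_p, v_all_snps / v_p


# Hypothetical variance components, for illustration only.
v_p, h2_eqtl, h2_snps = heritability_estimates(0.2, 0.3, 0.5)
print(round(v_p, 3), round(h2_eqtl, 3), round(h2_snps, 3))
```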
References
1. Kent WJ (2002) BLAT–the BLAST-like alignment tool. Genome Res 12:656–64
2. Pruitt K, Tatusova T, Maglott D (2007) NCBI reference sequences (Ref-Seq): a
curated non-redundant sequence database of genomes, transcripts and proteins.
Nucleic Acids Research 35: D61–D65.
3. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, et al. (2007) PLINK: a
tool set for whole-genome association and population-based linkage analyses. Am J
Hum Genet 81: 559–575.
4. Yang J, Lee SH, Goddard ME and Visscher PM. GCTA: a tool for Genome-wide
Complex Trait Analysis. Am J Hum Genet. 2011 Jan 88(1): 76-82.

Masters Dissertation

  • 1. 1 School of Biosciences Detection of Expression Quantitative trait Loci and its role in explaining the variation in the gene expression phenotypes A research project submitted by Abhilash Krishna Kannan A thesis presented to the School of Biosciences of the University of Birmingham in partial fulfilment of the requirements for the degree of Master of Science in Analytical Genomics School of Biosciences University of Birmingham Birmingham,UK. September 2011 Supervisor: Prof Zewei Luo
  • 2. 2 To my Parents, Kannan Krishnaswamy and Chitra Kannan
  • 3. 3
  • 4. 4 Abstract The main influence of the variation in the gene expression and its genetic effects can be attributed to the specific regions of a genome and its variants. The detailed analysis of the mRNA expression levels from the lymphoblastoid cell lines of 210 individuals and the genotypes from Phase II HapMap project has revealed a strong correlation between the variation in the genotypes and the variation in the gene expression levels. REML analysis reveals that gene expression is a heritable phenotype. However for some of the genes only a small proportion of variation in the expression levels is explained by the SNPs. Of the total ~2.2 million SNPs (MAF > 5%) used for the association analysis by the simple Linear Regression models, significant cis – associations were observed in 579 genes and 32 genes were found to be significantly associated with the SNPs in trans. The results obtained from the association analysis confirm the presence of a large number of the genes with cis effects compared to genes with trans effects because the genes with trans effects may be regulated by a large number of SNPs, many of which fail to pass through the stringent significance threshold. However this also suggests that the variation in the gene expression phenotype may be primarily due to the regulatory variants. The results from the heritability analysis show that 44% of the heritability in the gene expression was explained by the peak SNP and the variation explained by the cis eQTL is significantly higher than the variation explained by trans eQTLs, but few genes with low heritability estimates still had eQTL, questioning the power of heritability to identify cis and trans acting eQTLs. The analyses also consider several additional approaches and methodologies along with heritability analysis and GWAS to accurately analyse the variation in the gene expression of all the 210 unrelated individuals.
  • 5. 5 Keywords: Single Nucleotide Polymorphisms, Genome Wide Association Studies, Expression QTL, Linear Regression, Restricted Maximum Likelihood, Principal Components Analysis, Heritability, Cis-eQTL, Trans-eQTL, Gene Expression, HapMap, Lymphoblastoid, Microarray, Population Stratification, BLAT, Multiple Testing, PLINK, GCTA.
  • 6. 6 Acknowledgement
    The first and foremost person I would like to thank is my mother, Mrs. Chitra Kannan, for her constant support throughout my life and for always encouraging me in difficult and testing times. I strongly believe that whatever I have achieved to date is only because of her. Concrete foundations in the Biological Sciences propelled me towards a four-year Bachelor of Technology degree in Biotechnology at the prestigious Padmashree Dr. D.Y. Patil Institute for Biotechnology and Bioinformatics. I developed a huge interest in computational genomics during my Bachelor's degree. After completing the four-year course, I searched for universities in the UK known for their excellence in research. During this time I came across an advertisement for the Analytical Genomics course at the University of Birmingham, UK. The course content was simply outstanding, and I had always dreamt of pursuing further education in a subject of my liking at such a prestigious university. I blinked in disbelief when I read through the list of professors and lecturers. That is where I came across the profile of Prof Zewei Luo, and I would never have guessed that my quest for postgraduate study in the UK would eventually lead me to the Analytical Genomics group headed by Prof Luo. Prof Luo was a great supervisor and I would like to thank him for his enthusiasm, support and ideas, and also for believing that I could do this even without prior knowledge or a strong computational background before the start of the course. With Prof Luo, I was able to share my fascination for human genetic variation, which I am sure has kept us both awake until the early morning hours. Human variation was the reason I studied biology in the first place. I was lucky enough to be taught about eQTLs by Dr. Lindsey Leach, from the Department of Plant Sciences, University of Oxford. Dr.
Leach was one of the most inspiring people I have met, and through her I discovered what fascinated me the most in science. I am sure that if it weren't for her, I would not be writing this now. I am also very grateful to Dr. Arpita Gupte, who taught me Biochemistry during my undergraduate studies, and Dr Ganapathy Subramaniam, who was my project supervisor during my undergraduate degree. Both were great teachers and wonderful people who always inspired me to go into research. Their positive view on work and life, as well as their exciting ideas, were very motivating throughout these years. I would also like to thank Minghui Wang and Ning Jiang for their guidance and reassuring comments, especially during project meetings.
  • 7. 7 Abbreviations
    A: Adenine
    ASW: African ancestry in the Southwest USA
    BLAT: BLAST-like alignment tool
    C: Cytosine
    CEU: Residents with ancestry from Northern and Western Europe
    CHB: Han Chinese in Beijing
    CEPH: Centre d'Etude du Polymorphisme Humain
    CHD: Chinese in metropolitan Denver, CO, USA
    DNA: Deoxyribonucleic acid
    eQTL: Expression Quantitative Trait Loci
    EV: Eigenvector
    FDR: False Discovery Rate
    G: Guanine
    GC: Genomic Control
    GWAS: Genome Wide Association Studies
    GCTA: Genome-Wide Complex Trait Analysis
    GIH: Gujarati Indians in Houston, TX, USA
    JPT: Japanese in Tokyo, Japan
    kb: Kilobase
    LR: Linear Regression
    LWK: Luhya in Webuye, Kenya
    LD: Linkage Disequilibrium
    MAF: Minor Allele Frequency
    mRNA: Messenger Ribonucleic Acid
    MKK: Maasai in Kinyawa, Kenya
  • 8. 8
    MEX: Mexican ancestry in Los Angeles, CA, USA
    MDS: Multidimensional Scaling
    Mb: Megabase
    PCA: Principal Components Analysis
    QTL: Quantitative Trait Loci
    REML: Restricted Maximum Likelihood Analysis
    RNA: Ribonucleic Acid
    rRNA: Ribosomal Ribonucleic Acid
    SNP: Single Nucleotide Polymorphism
    T: Thymine
    tRNA: Transfer Ribonucleic Acid
    TSI: Tuscans in Italy
    TSS: Transcription Start Site
    UCSC: University of California, Santa Cruz
    YRI: Yoruba from Nigeria
  • 9. 9 Table of Contents
    Declaration (p. 3)
    Abstract (p. 4)
    Keywords (p. 5)
    Acknowledgements (p. 6)
    Abbreviations (p. 7)
    Table of Contents (p. 9)
    1. Introduction (p. 12)
    1.1 Expression Quantitative Trait Loci (p. 13)
    1.2 Challenges with Microarray Technology (p. 14)
    1.3 Impact of Genetical Genomics (p. 14)
    1.4 Genome Wide Association Studies (GWAS) in eQTL Analysis (p. 15)
    1.5 Focus on Gene Expression (p. 15)
    1.6 Natural Variation in Gene Expression (p. 16)
    1.7 Associating the variation in gene expression with marker genotypes (p. 16)
    1.8 Focus on HapMap individuals (p. 17)
    1.8.1 Phase I (p. 18)
    1.8.2 Phase II (p. 18)
    1.8.3 Phase III (p. 18)
    1.9 Effect of Population Structure in GWAS (p. 19)
    1.10 Example of Population Stratification (p. 19)
    1.11 Controlling the stratification (p. 21)
    1.12 Two Major classes of eQTL (p. 22)
    1.13 Measure of Heritability (p. 23)
    1.14 Main Aim of the study (p. 26)
    2. Methods (p. 27)
  • 10. 10
    2.1 Selecting the genes to study for the analysis (p. 28)
    2.2 Investigation and Correction of population structure (p. 29)
    2.3 Performing Association studies between the gene expression traits and SNP marker genotypes (p. 30)
    2.4 Correcting the Association signals by Multiple testing (p. 31)
    2.5 Heritability Analysis (p. 32)
    2.6 Statistical approach to REML (p. 33)
    3. Results (p. 36)
    3.1 Addressing the problem of population stratification (p. 37)
    3.2 Cis-associations of SNPs with the gene expression phenotypes and the distribution of cis-acting eQTL (p. 38)
    3.3 Reasons for choosing phase II HapMap individuals (p. 39)
    3.4 Positions of SNPs relative to the Transcription Start Sites (p. 40)
    3.5 Usefulness of Multipopulation over Single population analysis (p. 40)
    3.6 Trans-associations of SNPs with the gene expression phenotypes (p. 44)
    3.7 Heritability of gene expression traits (p. 46)
    3.8 eQTL heritability (p. 48)
    4. Discussion (p. 49)
    5. Conclusion (p. 56)
    6. References (p. 59)
    7. Appendix (p. 70)
    7.1 S1: BLAT search (p. 71)
    7.2 S2: Table Showing 10 eigenvectors obtained from the PCA analysis (p. 72)
    7.3 S3: Association Studies (p. 76)
    7.4 S4: Table of genes showing significant cis-associations and their Heritability estimates (p. 78)
    7.5 S5: Table of genes showing significant trans-associations and their
  • 11. 11 Heritability estimates (p. 86)
    7.6 S6: Estimating the heritability of Gene expression traits and eQTLs (p. 87)
    7.7 References (p. 89)
  • 13. 13 Expression Quantitative Trait Loci
    An expression quantitative trait locus (eQTL) is a site or position in the genome where variation in the nucleotide sequence between two genotypes leads to a significant difference in gene expression levels between the individuals carrying those genotypes [1]. eQTL studies have gained importance in recent years, especially after the rapid advances in microarray expression profiling. Gene expression levels help to address several key questions, most notably susceptibility to disease, adaptive evolution, the characteristic features of cells, and regulatory mechanisms within cells. Initially, microarray technology was used primarily to compare the expression levels of genes under various physiological and environmental conditions. In the past few years, several research groups have started to relate variation in the DNA sequence to individual differences in gene expression (figure 1). Variation in gene expression was regarded as a quantitative trait and tested for association with SNP marker genotypes. This integration of genetic association with differences in gene expression led to the development of Genetical Genomics, which aims to study the genetic basis of gene expression by linking conventional genetic analysis with gene expression analysis.
  • 14. 14 Challenges with the Microarray technology
    Accurate measurement of mRNA quality and quantity is necessary to identify eQTLs. This is done with DNA microarrays, which can measure the expression levels of all genes in different tissues. This hybridization-based technology has a few disadvantages:
    1. Coding SNPs present within the probes could alter the hybridization efficiency, thereby giving false values for the mRNA expression levels attributed to the SNP.
    2. Choosing relevant probes for the hybridization experiments requires prior knowledge of the RNA sequences.
    3. There is significant background noise.
    Despite these limitations, gene expression studies using microarray technology have detected QTLs for many genes in various tissues with satisfactory power.
    Impacts of Genetical Genomics
    Genetical genomics is considered a promising approach for explaining the basis of complex traits and susceptibility to diseases. A complex trait is usually controlled by a few genes of major effect and many genes of minor effect, and is modified by the environment. Prominent examples include height, weight and intelligence. The genes involved are known as Quantitative Trait Loci (QTL). Genetical Genomics essentially attempts to address the relationship between genotype, gene expression and phenotype. It treats the gene expression level as a quantitative trait and links variation in the trait to genomic locations. The statistical power of such a combined genetic linkage analysis and expression profiling will be higher than either
  • 15. 15 approach alone. It gives a clear understanding of the organisation of gene networks by identifying the polymorphisms responsible for variation in gene expression.
    Genome Wide Association Studies (GWAS) in eQTL analysis:
    Genes vary from one individual to another. GWAS examines the genomes of different individuals of a species and associates this variation with traits such as disease or, in the case of eQTL analysis, gene expression. It helps to identify whether a particular gene is associated with a disease. It involves testing thousands of individuals for mutations and polymorphisms (SNPs), and is mainly used in epidemiological studies to find disease pathways, disease susceptibility, etc. A genetic variant is assumed to be associated with a particular trait (e.g. gene expression) if it occurs more frequently in people having that trait. The methods used to identify trait-associated mutations can be either hypothesis-driven or non-hypothesis-driven. Hypothesis-driven studies start from the hypothesis that a gene may be associated with a specific trait and attempt to find this association. Non-hypothesis-driven methods scan the entire genome and determine which variants show a significant association. Most GWAS are non-hypothesis-driven [35, 36].
    Focus on the Gene expression
    Gene expression is defined as the process by which information from a gene is used in the synthesis of a functional gene product [2]. The main products of gene expression are proteins or functional RNA molecules (e.g. rRNA, tRNA, microRNA). Expression levels of genes can be regarded as a complex trait since they are controlled mainly by genetic [3-7], epigenetic [8, 9] and environmental [10, 11] factors. For these reasons they can be considered a continuously varying phenotype.
  • 16. 16 Natural Variation in gene expression
    Regulation of gene expression is one of the key events in the developmental programme of an organism. Changes in the expression pattern are known to bring about considerable change in the normal functioning of the cell. Natural variation in gene expression has been studied in many species, such as yeast, fruit flies, mice and humans [10, 12-18]. In one experiment, approximately half (2,698 out of 6,215) of all the genes in the genome were differentially expressed in a cross between two different strains of yeast [13]. In a study of human populations, significant natural variation in gene expression was observed among 16 individuals of European and African descent: about 83% and 17% of the genes were differentially expressed among individuals and among populations respectively [19]. In another study [11], more than 65% of the genes were differentially expressed among three populations of healthy Moroccan Amazigh (Berbers). These results clearly suggest that there is a significant amount of natural variation in gene expression within a species.
    Associating variation in Gene expression with Marker Genotypes:
    Only about 0.1% of the 6 billion nucleotides making up the human genome vary between any two randomly chosen individuals [22]. This variation could be mainly due to single nucleotide polymorphisms, copy number variants, insertions, deletions, retroposon insertions or a combination of these. eQTL analysis mainly focuses on the association between Single Nucleotide Polymorphism (SNP) marker genotypes and variation in gene expression. A SNP is a common type of DNA sequence variation in which a single nucleotide (A, T, C or G) in the genome is changed [20]. For example, let the DNA sequences from two different individuals be AGTTACAGT and AGTTGCAGT. In this case we can clearly see a difference in a single nucleotide. In
  • 17. 17 other words, there are two alleles: A and G. SNPs account for approximately 75% of the total observed variation in humans [21]. There are more than ten million SNPs in the human genome. A DNA position is considered polymorphic when its allele frequency lies between 1% and 99% in the population. The main aim of the International HapMap Project was to group these variants in order to identify the genetic similarities and differences between humans [22]. Based on their position in the genome, SNPs can be classified as coding or non-coding. Only 1.5% of the genome encodes proteins, and there is a great deal of redundancy in the genetic code, with a particular amino acid being encoded by multiple codons. Coding SNPs can therefore be classified as synonymous or non-synonymous: the former do not alter the amino acid sequence, while the latter bring about a single amino acid change.
    Focus on the Hapmap Individuals:
    The HapMap project is an international project and a very useful public resource that brings together researchers and their groups from non-profit organizations [22]. Its main aim is to identify and group SNP variants in order to identify genetic similarities and differences between humans from four geographically distinct populations: samples from Nigeria (YRI, Yoruba), Japan (JPT), China (CHB, Han Chinese in Beijing) and the U.S. (CEU, residents with ancestry from Northern and Western Europe, collected by the Centre d'Etude du Polymorphisme Humain (CEPH)) [22-24]. The project generates a detailed haplotype map of the human genome and explores the common patterns of genetic variation. DNA from lymphoblastoid cell lines derived from blood samples of individuals from the four populations is genotyped [22-24]. The project consists of three phases:
  • 18. 18 Phase 1
    In this phase at least one common SNP was genotyped every five kb across the euchromatic portion of the genome. Phase 1 comprised over a million accurate and complete SNP genotypes from 269 individuals in four groups (30 mother-father-adult child trios from CEU, 45 unrelated CHB, 44 unrelated JPT, and 30 trios from YRI) [23]. This resource was released in 2005 [23].
    Phase 2:
    Phase 2 had more than 2.1 million SNPs from each of 270 individuals (every individual from Phase 1 was included, along with a few extra JPT individuals). The SNP density in this phase was close to one SNP per 1000 bp. Patterns of genetic variation were better captured with the help of tag SNPs, and the use of fixed marker sets increased the power of association through imputation. Detailed information about this resource was released in 2007.
    Phase 3
    Phase 3 added seven populations to the four initial ones: Luhya in Webuye, Kenya (LWK); Maasai in Kinyawa, Kenya (MKK); Gujarati Indians in Houston, TX, USA (GIH); Chinese in metropolitan Denver, CO, USA (CHD); Toscani in Italia (Tuscans in Italy, TSI); African ancestry in the Southwest USA (ASW); and Mexican ancestry in Los Angeles, CA, USA (MEX). Over 4 million SNPs were genotyped in 541 individuals from the four initial populations (CHB, CEU, YRI and JPT), and close to 1.5 million SNPs in 760 individuals from the seven additional populations.
  • 19. 19 Table 1 gives a brief summary of the SNPs and the individuals studied in each of the 3 phases [22-24].
    Effect of Population Structure in GWAS
    In tests of association between gene expression traits and SNP marker genotypes among unrelated individuals, population stratification can inflate the test statistics and thereby give rise to type I errors. "Population stratification occurs due to the systematic difference in the allele frequency between subpopulations in a study due to ancestry difference between study subjects" [25]. If unnoticed, population structure can produce false association signals. Hence it needs to be corrected before carrying out GWAS.
    Example of Population Stratification
    The use of chopsticks is more common in the Chinese population. There was once a study by an ethnogeneticist who wanted to find the reason for the high use of chopsticks in the Chinese population (a "chopsticks gene"). He included both American and Chinese participants in his study. His main aim was to find any gene that differed in the frequency
  • 20. 20 between these two population groups, and an association with the use of chopsticks was observed. However, the study had no relevance because the association was spurious [26]. The use of chopsticks among the Chinese is mainly due to traditional practice. This practice was correlated with gene frequencies, which led to a false correlation between genes and the use of chopsticks (i.e. the genes associated with chopstick use are simply those that differ in frequency between the Chinese and American populations). Genetics does not play a huge role in defining the population structure. Stratification is not a significant problem in non-human animals, because we can experimentally control their environments: spurious correlations between the occurrence of a particular allele and exposure to an environment can be avoided by choosing a random or an invariant environment. The best way to control population stratification would be to raise all individuals in the same environment, or to assign them to environments of our choice, which is not feasible in humans.
    Types of population structure:
    Population structure can be divided into 3 basic categories:
    1. Discrete structure: comprises remotely related populations (e.g. Asians, Europeans, Africans). This structure is easier to detect because the individuals are well separated.
    2. Admixed structure: comprises individuals of mixed ancestry (e.g. Hispanic Americans, African Americans). These individuals carry varying degrees of admixture, so it is difficult to assign them to distinct clusters.
  • 21. 21 3. Hierarchical structure: comprises both discrete and admixed structure. It consists of multi-ethnic individuals [27].
    Controlling the stratification
    Genomic control (GC) is one of the most useful approaches for controlling population structure [28-30]. Its principle is as follows: a chi-square association statistic is computed for a large number of markers, and this test statistic is then adjusted by an inflation factor estimated from a random set of markers that are not associated with the phenotype of interest. A stringent genome-wide significance threshold is required because a large number of SNP markers are tested in a genome-wide association scan (GWAS). Even such a threshold cannot control type I error [29, 31, 32] when the assumption of uniform variance inflation across the SNPs is violated. The main problem with GC is that a uniform adjustment may not be sufficient if the allele frequencies of some markers differ more than others across the ancestral populations. Structured association is another approach for controlling stratification. It uses the STRUCTURE program [33], in which the individuals under study are assigned to discrete subpopulations and the evidence of association is collected within each subpopulation. This approach is computationally demanding, and the assignment of individuals to clusters depends on the number of clusters, which is not well defined in many studies. A more recently developed approach is stratification control by EIGENSTRAT [65]. It identifies population structure by calculating the principal components of SNP genotypes across the genome. The first few (topmost) principal components represent the axes of greatest genetic variation among the individuals. These principal components are used as covariates in the regression analysis. Another
  • 22. 22 interesting approach is Multi Dimensional Scaling analysis (MDS) [34], which is an extension of EIGENSTRAT. It is a clustering approach which attempts to recognize the patterns of genetic variation (both discrete and admixed). The position of each subject along the axes of genetic variation is identified, along with cluster membership (Figure 2). Each of these positions is then adjusted in order to correct for potential confounding effects. EIGENSTRAT, on the other hand, measures the genetic correlation between the subjects. Results from simulations demonstrate that MDS offers better control of stratification than the EIGENSTRAT approach [34].
    Two major classes of eQTLs
    The abundance of each transcript in many subjects from the population is used to map eQTLs. The mean abundance of a transcript in each cell line is then used to map the QTLs responsible for the expression level of that transcript by standard mapping approaches [37]. eQTLs can be classified into two main
  • 23. 23 categories, cis (proximal) and trans (distal), based on their physical distance from the regulated gene. A gene expression trait linked to a locus is said to show cis-linkage if the locus is located near the target gene itself; otherwise it shows trans-linkage (figure 3). Such a distance-based criterion provides a broad classification of regulatory elements. In their yeast studies, Brem et al. [38] defined linkage as cis if the gene and the marker were less than 10 kb apart. Schadt et al. [39] and Wang et al. [40] defined linkage as cis if the gene and the marker were less than 20 Mb apart. Cis-linkage mainly arises from a first-order effect, in which a DNA polymorphism in the gene itself affects its expression and produces a strong linkage signal. Trans-linkage, on the other hand, arises from a second-order effect, in which the product (protein) of the gene containing the DNA polymorphism affects the expression of a different gene, thereby producing a weaker signal.
    Measure of heritability
    The significant association of SNPs with expression traits forms the basis of genome-wide association studies in human populations. However, only a small proportion
  • 24. 24 of the variation in gene expression levels is explained by variation in the genotypes across individuals. The important question here is: what about the missing heritability [58, 59]? Possible explanations are: 1) gene × gene or gene × environment interactions [60]; 2) epigenetic factors [61]; 3) the rare variant, common disease hypothesis [62, 63]. The variation in gene expression explained by the SNPs is generally lower than the narrow-sense heritability (which is based on additive genetic effects). Hence non-additive effects cannot explain the missing heritability. When only a small proportion of the variation is explained by causal variants, their effects fail to reach strict significance thresholds, and they may not be in complete linkage disequilibrium with the SNP markers. Thus the heritability of the expression trait would be heavily dependent on these factors. Heritability of gene expression can be defined as "the ratio of genetic variance to the total phenotypic variance" [66]. Genetic variance refers to "the variation in the gene expression phenotypes occurring due to the presence of different genotypes across the population" [79]. Environmental variance, on the other hand, explains "the proportion of the variance in the gene expression phenotypes caused due to the differences in the environments to which the individuals in a population have been exposed" [80]. Similarly, eQTL heritability can be defined as
  • 25. 25 Although a particular SNP and the loci in LD with it can have a substantial effect on the variation of mRNA expression levels, the average heritability of eQTLs has been reported to be ~25% in many studies [67, 68, 69]. However, the gene expression study of human liver by Schadt et al. [70] concluded that the variation explained by the SNPs can be as low as 2% and as high as 90%, suggesting large variability in gene expression heritability and its dependence on the particular type of cell or tissue.
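The definition of heritability quoted above, the ratio of genetic variance to total phenotypic variance, can be illustrated with a minimal sketch. The function name and the variance values are hypothetical and chosen only for illustration, and the sketch assumes that the total phenotypic variance is simply the sum of a genetic and an environmental component, with no covariance or interaction terms. It is not the REML procedure used in the thesis.

```python
def heritability(genetic_variance: float, environmental_variance: float) -> float:
    """Return genetic variance divided by total phenotypic variance.

    Simplifying assumption (not from the source): total phenotypic variance
    is the sum of the genetic and environmental variance components.
    """
    if genetic_variance < 0 or environmental_variance < 0:
        raise ValueError("variance components must be non-negative")
    total_variance = genetic_variance + environmental_variance
    if total_variance == 0:
        raise ValueError("total phenotypic variance must be positive")
    return genetic_variance / total_variance


# Hypothetical expression trait: a quarter of the variance is genetic.
example_h2 = heritability(genetic_variance=0.25, environmental_variance=0.75)
print(f"heritability = {example_h2:.2f}")
```

With these made-up components the trait has a heritability of one quarter, which is the same order as the average eQTL heritability quoted in the text.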
  • 26. 26 Main Aim of the study
    The genetics of gene expression, and in particular the variation in gene expression, has been an active research topic for the past few years. It is now well known that regulatory polymorphisms are present in the human genome, acting on genes in cis and in trans. Most studies have focussed on the effects of single genetic variants and on expression measured in a single cell type. Many eQTL studies have focussed on experimental crosses in yeast and mice [41, 42]. Recently, several studies in a variety of organisms have clearly shown that gene expression levels are heritable [43-47]. Considering these aspects, the main objectives of our study are:
    1. To detect cis- and trans-acting expression QTLs (eQTLs) in a mixture of HapMap populations.
    2. To estimate the heritability of the gene expression traits.
    3. To explore the relationship between the heritability of a gene's expression level and the power to identify cis- and trans-acting eQTLs.
    4. To predict the presence or absence of eQTLs for a particular gene based on the heritability estimates of that gene.
  • 28. 28 Selecting the genes to study for the analysis
    The gene expression levels of 210 unrelated HapMap individuals (CEU: 60; CHB: 45; JPT: 45; YRI: 60) were measured in four technical replicates using Illumina's human whole-genome expression array (Box 1). "Quantile normalization was performed on the gene expression data within the replicates and this was then followed by the median normalization across the Hapmap individuals" [6]. The normalized gene expression values were readily available in the Gene Expression Omnibus database under accession number GSE6536 [6]. The array consisted of 47,294 Illumina probes, and a few genes were targeted by multiple probes. A BLAT [48] search was carried out to map the probes to human RNA sequences from RefSeq (hg 19) [49]. All probes that mapped to more than one gene with more than 90% similarity were discarded; only probes that matched exactly to a unique gene were selected. From these, I retained the probes whose gene had only one RNA accession number in the RefSeq database. Genes with multiple splice forms or multiple annotated start sites were thereby avoided, making the analysis simpler [50]. Of the remaining genes, the majority had only one probe, while a few still had multiple probes. I checked whether these probes mapped to the 3' end or the 5' end of the gene. If the majority of the probes were biased towards the 3' end of the gene, the bias was reduced by selecting, out of the multiple probes, the probe nearest to the 5' end. I searched for SNP positions within the probe sequences and discarded the affected probes, as a SNP within a probe could have a significant impact on the
  • 29. 29 measured expression level. The probes containing SNPs were identified in the UCSC Genome Browser [51]. Finally, probes that mapped to the X and Y chromosomes were excluded from the analysis. After these steps, a data set of 11,446 gene expression values was left for the final analyses.
    Investigation and Correction of population structure
    The EIGENSTRAT [65] method was used to correct for stratification. Principal components analysis (PCA) was applied to the SNP genotypes, and the PC axes (eigenvectors) contributing to the genetic variation were inferred from it.
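The PCA step described above can be sketched in a few lines of pure Python. This is a toy illustration rather than the EIGENSTRAT or Golden Helix implementation: the function names, the simplified normalisation (each SNP centred at twice its allele frequency and scaled by the square root of p(1 - p)) and the six-individual genotype matrix are all assumptions made for the example. Only the first principal component is extracted, by power iteration on the individual-by-individual similarity matrix.

```python
import math


def normalize_genotypes(genotypes):
    """Centre and scale each SNP column as (g - 2p) / sqrt(p * (1 - p))."""
    n_individuals = len(genotypes)
    n_snps = len(genotypes[0])
    columns = []
    for j in range(n_snps):
        column = [row[j] for row in genotypes]
        p = sum(column) / (2 * n_individuals)  # frequency of the counted allele
        if p <= 0.0 or p >= 1.0:
            continue  # a monomorphic SNP carries no information on structure
        scale = math.sqrt(p * (1 - p))
        columns.append([(g - 2 * p) / scale for g in column])
    # Transpose back to an individuals x SNPs layout.
    return [[column[i] for column in columns] for i in range(n_individuals)]


def top_principal_component(genotypes, iterations=200):
    """Return unit-length scores of the individuals on the first PC."""
    x = normalize_genotypes(genotypes)
    n = len(x)
    # n x n similarity matrix between individuals (X times X transposed).
    k = [[sum(a * b for a, b in zip(x[i], x[j])) for j in range(n)]
         for i in range(n)]
    v = [1.0] + [0.0] * (n - 1)  # deterministic starting vector
    for _ in range(iterations):
        w = [sum(k[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            raise ValueError("no variation left to decompose")
        v = [c / norm for c in w]
    return v


# Hypothetical genotypes (rows: individuals, columns: SNPs, values: allele counts).
# The first three rows mimic one population, the last three another.
toy_genotypes = [
    [2, 2, 0, 0],
    [2, 1, 0, 0],
    [1, 2, 0, 1],
    [0, 0, 2, 2],
    [0, 1, 2, 2],
    [1, 0, 2, 1],
]
pc1_scores = top_principal_component(toy_genotypes)
print([round(score, 3) for score in pc1_scores])
```

In this toy data set the first axis separates the two groups of individuals by sign, which is the kind of ancestry signal that would then be supplied to the regression as a covariate.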
  • 30. 30 These eigenvectors were used as covariates in the association and heritability analyses (figure 5). The axes of variation greatly reduced the complexity and dimensionality of the data while explaining the maximum variability in it. The eigenvectors were obtained by carrying out the PCA in the Golden Helix software [webref1]. The scores of the 10 eigenvectors obtained from the PCA are shown in Appendix-S2.
    Performing Association studies between the gene expression traits and SNP marker Genotypes
    Each of the ~2.2 million SNPs and each of the selected probes were fitted in a simple Linear Regression (LR) model [71], also called the single-SNP model, in which the expression of each selected probe is regressed on every single SNP. Let Xi be the ith individual's genotype for a given SNP. This genotype takes one of three values: Xi = 2, 1 or 0 for individuals homozygous for the common allele (AA), heterozygous (Aa) and homozygous for the rare allele (aa) respectively. Linear regression was fitted for this additive model:
    Yi = β0 + β1 Xi + εi
    where Yi is the quantitative trait variable (i.e. the log-normalized probe expression value of the ith individual, i = 1, 2, 3, …, 210), β0 is the intercept, β1 is the additive effect of the SNP, and εi are random residuals following a normal distribution with mean 0 and constant variance. Only SNPs with Minor Allele Frequency (MAF) greater than 0.05 were used for association testing. To classify the significant association signals as cis or trans, the distance between the midpoint of the probe and the genomic location of the SNP was assessed. An association between a SNP and a probe was called a 'cis association' if this distance was less than or equal to 1 Mb [72]; otherwise it was considered a trans association. The association analysis was performed using the PLINK software [54].
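The single-SNP additive regression, the MAF filter and the 1 Mb cis/trans rule described above can be sketched as follows. This is a plain least-squares illustration, not the PLINK implementation: all function names and the example data are hypothetical, significance testing is omitted, and treating a SNP on a different chromosome from the probe as trans is an assumption added for the example.

```python
CIS_WINDOW_BP = 1_000_000  # 1 Mb threshold stated in the text
MAF_THRESHOLD = 0.05       # SNPs must exceed this minor allele frequency


def minor_allele_frequency(genotypes):
    """MAF from additive genotype codes (0, 1 or 2 copies of one allele)."""
    p = sum(genotypes) / (2 * len(genotypes))
    return min(p, 1 - p)


def fit_single_snp(genotypes, expression):
    """Least-squares fit of expression on genotype.

    Returns (intercept, slope, r_squared); the slope is the additive effect.
    """
    n = len(genotypes)
    mean_x = sum(genotypes) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in genotypes)
    if sxx == 0:
        raise ValueError("SNP is monomorphic in this sample")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(genotypes, expression))
    syy = sum((y - mean_y) ** 2 for y in expression)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return intercept, slope, r_squared


def classify_association(snp_chrom, snp_pos, probe_chrom, probe_midpoint):
    """Label an association cis if the SNP lies within 1 Mb of the probe midpoint.

    Assumption: a SNP on a different chromosome is always labelled trans.
    """
    if snp_chrom == probe_chrom and abs(snp_pos - probe_midpoint) <= CIS_WINDOW_BP:
        return "cis"
    return "trans"


# Hypothetical data for six individuals.
example_genotypes = [0, 0, 1, 1, 2, 2]
example_expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]

if minor_allele_frequency(example_genotypes) > MAF_THRESHOLD:
    intercept, slope, r_squared = fit_single_snp(example_genotypes, example_expression)
    label = classify_association("chr1", 1_500_000, "chr1", 1_000_000)
    print(f"effect per allele = {slope:.3f}, R^2 = {r_squared:.3f}, class = {label}")
```

In a real analysis this fit would be repeated for every SNP and probe pair, with the eigenvectors from the PCA included as covariates in the pooled sample.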
  • 31. 31 The screenshots and a few of the commands used in the association studies are detailed in Appendix-S3. The association testing was carried out separately on each of the population groups and also on the pooled samples (multipopulation). In the multipopulation analysis, PCA axes were incorporated as covariates in the linear regression model to control for population structure. The pooled samples were used mainly to increase the power of the analysis (especially to detect hidden cis-association signals). Correcting the association signals for multiple testing The multiple test correction was carried out using the following three methods: 1. Bonferroni correction [73], 2. False Discovery Rate (FDR) approach [74], 3. Sidak correction [75]. Each of these three approaches was used for both the cis and the genome-wide analysis. The Bonferroni correction was too conservative: it failed to identify many true association signals, thereby giving rise to type II errors. This can be clearly seen in Table 2. The Sidak method was less conservative than the Bonferroni correction, but could be used only if all the genes were independently regulated. Hence the FDR correction was considered to be the better approach.
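To make the difference between the three corrections concrete, here is a small self-contained sketch. It uses the single-step Sidak formula and the Benjamini-Hochberg procedure for the FDR; the four p-values are made up for illustration and the function names are not taken from any package.

```python
def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def sidak(pvals):
    """Single-step Sidak adjustment, which assumes independent tests."""
    m = len(pvals)
    return [1.0 - (1.0 - p) ** m for p in pvals]

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg FDR-adjusted p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.01, 0.02, 0.03, 0.04]  # hypothetical raw p-values
bonf = bonferroni(pvals)
sid = sidak(pvals)
fdr = benjamini_hochberg(pvals)
n_bonf = sum(p < 0.05 for p in bonf)
n_fdr = sum(p < 0.05 for p in fdr)
```

With these inputs only the smallest p-value survives the Bonferroni threshold of 0.05, while all four pass the FDR threshold, which illustrates why the Bonferroni correction is described as the more conservative choice.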
  • 32. 32 Heritability analysis The heritability estimates of the gene expression traits, and of the eQTLs associated with those traits, were obtained by Restricted Maximum Likelihood (REML) analysis [57,76] in the GCTA program [77]. For the eQTLs, a simple approach was also used to estimate the proportion of the phenotypic variation explained by a SNP. For each locus only the SNP showing the highest significance (lowest FDR-corrected p-value) was chosen, and the variance it explained was calculated by the following equation: However, the above equation could explain only a fraction of the variance in the gene expression levels, and this was not necessarily the same as heritability (because the denominator var(y) did not account for population structure), although the two were related
  • 33. 33 to some extent. The heritability estimates obtained for some of the eQTL traits with the above equation and with REML are shown in Table 3. During the heritability analysis, the population structure arising from the HapMap populations was corrected by estimating relationships in each individual population separately. These relationship matrices were merged by setting the relationships between individuals of different populations to zero [77]. The adjusted relationship matrix was used to analyse the heritability of the gene expression traits in the mixed populations. The first two eigenvectors from the PCA [65], which showed the maximum stratification, were used as covariates in the heritability analysis. The screenshots and commands used in the GCTA program to calculate the heritability estimates are given in Appendix-S6. Statistical approach to REML The gene expression value of a particular individual depends on the total genetic effect of that individual, which in turn depends on the number of causal variants and their scaled additive effects. This can be represented by the following equation [57].
  • 34. 34 In general, the equation becomes: where g = z*u Now the scaled additive effect (u) is taken as a random effect. The variance of the total additive genetic effects depends on the number of causal variants and the variance of the causal additive effects. This can be explained by the following equation: Thus the variance in the gene expression values can be broken down into two parts: the variance due to the additive genetic effects (σg²) and the residual variance (σe²). G is a very important term in the above equation. G represents the genetic relationship matrix between individuals of the different population groups. This matrix is essential for adjusting for stratification effects. Equation 3 closely resembles the classical description of heritability.
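One way to write out the model described above is sketched below. It assumes the standard linear mixed-model notation; the symbols μ (overall mean), m (number of causal variants), Z (matrix of standardized causal genotypes) and the normalisation of G by m are assumptions and are not stated in the text, while g, u, G, σg² and σe² are the quantities named above.

```latex
\begin{align}
  y_j &= \mu + g_j + e_j, \qquad g_j = \sum_{i=1}^{m} z_{ij}\,u_i
      && \text{(individual } j\text{)} \\
  \mathbf{y} &= \mu\mathbf{1} + \mathbf{g} + \mathbf{e}, \qquad
      \mathbf{g} = \mathbf{Z}\mathbf{u}, \quad
      \sigma_g^2 = m\,\sigma_u^2
      && \text{(all individuals)} \\
  \operatorname{var}(\mathbf{y}) &= \mathbf{G}\,\sigma_g^2 + \mathbf{I}\,\sigma_e^2,
      \qquad \mathbf{G} = \frac{\mathbf{Z}\mathbf{Z}^{\top}}{m} \\
  h^2 &= \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2}
\end{align}
```

Under these assumptions the third line is the decomposition into additive genetic and residual variance, and the last line is the classical narrow-sense heritability that it resembles.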
  • 35. 35 However, it is very difficult to predict anything about the causal variants. Is a given SNP the actual causal variant, or is it merely in LD with the causal variant? We cannot tell how many causal variants are present among the ~2.2 million genotyped SNPs, or which ones they are. Therefore the relationship matrix (A) is calculated from the genotyped SNPs and used in the REML analysis to obtain the heritability estimates. A is calculated for each SNP and the weighted average is taken across all the SNPs, so that a common genome-wide relationship matrix (A) is obtained by combining the individual matrices. For eQTL heritability, a separate relationship matrix was constructed from the particular locus or chromosomal region with which the gene is associated. This region of the genome consisted of the SNPs showing significant association with that gene. For example, if 10 SNPs (covering a region of 100 kb in the genome) showed a strong association signal with a particular gene (say Gene X), that segment of the chromosome was used to construct the relationship matrix. This relationship matrix was fitted in the REML analysis to estimate the variation explained by that particular locus, thereby giving the eQTL heritability estimate.
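The construction of a SNP-based relationship matrix, genome-wide or for a single region, can be sketched as below. The estimator shown (per-SNP standardisation by 2p(1-p), then an unweighted average over SNPs) is one common choice and is an assumption here; the text does not give the exact weights used. The simulated genotypes and the chosen column range are hypothetical.

```python
import numpy as np

def genetic_relationship_matrix(genotypes):
    """SNP-based relationship matrix from 0/1/2 allele counts.

    genotypes: (n_individuals, n_snps) array.
    """
    x = np.asarray(genotypes, dtype=float)
    p = x.mean(axis=0) / 2.0          # allele frequency per SNP
    keep = (p > 0) & (p < 1)          # drop monomorphic SNPs
    z = (x[:, keep] - 2 * p[keep]) / np.sqrt(2 * p[keep] * (1 - p[keep]))
    # Average the per-SNP contributions across all retained SNPs.
    return z @ z.T / keep.sum()

rng = np.random.default_rng(2)
freqs = rng.uniform(0.2, 0.8, size=200)
genotypes = rng.binomial(2, freqs, size=(40, 200))

genome_wide = genetic_relationship_matrix(genotypes)
# Region-specific matrix, e.g. 10 SNPs assumed to tag one eQTL locus.
regional = genetic_relationship_matrix(genotypes[:, 10:20])
```

Restricting the columns to the SNPs of one associated region, as in `regional`, mirrors the approach described for the eQTL heritability estimates.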
  • 37. 37 Based on the strategy described in the Methods for selecting the genes to study in the GWAS, I was left with a data set of 11,446 gene expression values. Each of these expression values was treated as a phenotype and used for the further analyses. As mentioned before, the main aim of the study was to find the SNPs that were significantly associated with the variation in the gene expression phenotypes, and to estimate the heritability of these phenotypes. Along with the heritability of the gene expression traits, the heritability of the eQTLs (cis and trans) significantly associated with the expression variants was also measured. About 2.2 million SNPs were selected from the HapMap Phase II project [23,24] for the association and heritability analyses. The minor allele frequencies of the selected SNPs were greater than 5% in each of the four unrelated populations (CEU, YRI, CHB, JPT). Addressing the problem of Population Stratification Since the HapMap samples used in this study comprised individuals from four different populations, there was a high possibility of population stratification giving rise to false association signals and heritability results. Hence the principal component method [65] was employed to control the effects of population stratification. In this approach the principal components of the SNPs across the genome were calculated and treated as covariates in the regression analysis. The presence of population structure was clearly detected by plotting the first two components (Figure 5). These two components were used to adjust the test statistics for the markers that gave rise to the stratification. The p-value distribution here therefore takes account of the correction. The figure clearly shows the adjustment made after the stratification correction for the gene 'hmm28636'. 
It is clear from the (Figure 6) that the distribution of the majority of the p-values is in agreement with random expectation (i.e. a model without any significant associations – expected p-values under null hypothesis). The Genomic
  • 38. 38 inflation factor (based on median chi-squared) was 1 for all the genes indicating that there was no residual population stratification after the correction. Cis Associations of SNPs with the gene Expression Phenotypes and the distribution of Cis-Acting eQTLs. The association analyses between the variation in the SNPs and the variation in the Gene Expression for each of 11,446 Gene Expression traits from 210 unrelated individuals were performed by using a linear regression model (additive model – equation). The
  • 39. 39 association analysis was carried out using PLINK [54]. To analyse the SNPs that had a pronounced effect on the measured mRNA levels in cis, only those SNPs situated within 1 Mb upstream or downstream of the midpoint of the expression probe were considered. Such an approach is referred to as the cis candidate region approach [72]. For large genes (> 500 kb), the region from 500 kb upstream of the transcription start site (TSS) to 500 kb downstream of the transcription end site (TES) was regarded as the cis candidate region. In order to avoid false positives and to constrain the study-wise significance level, the p-values obtained from the association testing were corrected for multiple testing using the stringent False Discovery Rate (FDR) method. An FDR of 5% was set as the threshold for the association analysis, and all the SNPs with significant adjusted p-values (FDR < 0.05) were recorded and assumed to be significantly associated with the variation in the gene expression trait. Significant cis associations were observed in 579 genes (Appendix-S4). A total of 611 (5.33%) genes were found to contain at least one SNP with a p-value < 1 × 10^-7. In the majority of cases the exact location of the eQTL could be predicted, because the SNPs that were significantly associated with mRNA levels were localized to a restricted region. Figure 7 shows the p-values for three such genes. In cis association, the positions of the genes and the eQTLs superimpose on each other to produce the cis diagonal (Figures 7 and 8). The reasons for choosing Phase II HapMap Individuals Significant cis associations shown by certain genes identified with the Phase I HapMap were compared with those identified with the Phase II HapMap data. It was observed that Phase I identified 82% of the genes detected with Phase II. Figure 9 shows an example of one such gene. 
This could be due to slow decay of Linkage Disequilibrium.
  • 40. 40 Thus instead of detecting the diversity among the common haplotypes, the Phase II HapMap detected additional variation in genotypes compared to the Phase I HapMap. Positions of the SNPs relative to Transcription start Site. Because of the high density of SNPs in the HapMap, it is possible that several SNPs included in the analysis were in fact the causal variants. Hence, the SNPs having significant cis associations for each of the genes in the pooled (multipopulation) analysis were mapped relative to the transcription start sites (TSS) of the genes. It was found that these cis-associated SNPs were located very close to the TSS (Figure 11). Usefulness of Multi-population over Single-population analysis. The analysis was mainly focussed on four different unrelated population groups. The association analysis was initially carried out on each of the population groups separately (single-population analysis). However, using the four population groups together (multi-population) was considered more appropriate because pooling them would reveal many hidden association signals. The multi-population analysis detected the majority of the population-shared cis-association signals that were identified in the single-population analysis. Along with these association signals, it was able to detect many additional cis effects (Figure 10a). The majority of the effects captured by the multi-population analysis were much smaller (R2 = 0.10-0.60) than those captured by the single-population analysis (Figure 10b). Thus 579 cis associations were detected in the pooled sample of four population groups. Since the effects of population stratification were removed by using the principal components as covariates in the linear regression model, the regression analysis yielded accurate cis-association signals.
  • 44. 44 Trans- associations of SNPs with the gene Expression Phenotypes Genetic effects acting in trans could be studied thanks to the availability of both SNP and whole-genome expression data. Testing approximately 2.2 million SNPs per population with a minor allele frequency of >= 0.05 against all the gene expression traits was statistically and computationally demanding. Once again the linear regression model was used to test the association between the variation in gene expression and the variation in the SNPs. To analyse the SNPs that had a pronounced effect on the measured mRNA levels in trans, only those SNPs situated more than 1 Mb upstream or downstream of the midpoint of the expression probe were considered. These SNPs could either be present in the same
  • 45. 45 chromosome as the gene to which they were significantly associated or could be present in a different chromosome. From the regression analysis it was found that 32 genes had significant trans associations (Appendix-S5). One of the genes having a trans eQTL is shown in Figure 12. Out of these 32 genes had one eQTL, 2 had two eQTLs and rest had more than two eQTLs. Only one gene ('HLA-C') had trans associations on the same chromosome (at a distance greater than 1 Mb). There were 8 SNPs that showed strong association with multiple genes (Table 3). Of these, 2 SNPs showed both cis and trans associations with more than one gene, 4 SNPs were associated with multiple genes in cis, and the remaining 2 SNPs showed trans associations with more than one gene. It was observed that the effects of trans associations were much weaker than those of cis associations. The majority of the cis associations (60%) had a –logP score greater than 10 (Figure 13).
  • 46. 46 Heritability of Gene expression traits The heritability of the gene expression traits and of the eQTLs was estimated by the REML approach [57,76] (explained in the Methods section). Of the 611 genes that had eQTLs (cis and trans), only 112 genes had heritability higher than 0.5, 507 genes had heritability higher than 0.2, and the rest of the genes had heritability lower than 0.2 (Figure 14 and Appendix-S4&S5).
  • 47. 47 Although the heritability measures show a reasonable correlation (r = 0.2) with the cis-association significance (Figure 15), it is also clear from the figure that some of the gene expression traits with very low heritability estimates (heritability < 0.1) show significant cis associations, and that the maximum –log10P values of their associated SNPs are greater than those of some SNPs showing strong cis associations with highly heritable gene expression traits (heritability > 0.5). The heritability measures of the genes showing significant cis associations were compared with those of the genes showing strong trans-association signals (Figure 16, Appendix-S4). There was no significant difference between the two groups in their heritability estimates (t = -1.0995, p-value = 0.2792) (Figure 16).
  • 48. 48 eQTL heritability The mean heritability of the SNP most strongly associated with each gene expression trait was 0.16 (s.d. 0.16577, maximum 0.817), compared to a mean heritability of 0.36 (s.d. 0.161, maximum 0.86) for the overall gene expression trait. This suggests that about 44% of the heritability in gene expression can be explained by the peak SNP. However, the proportion of variation explained by the cis eQTLs (cis eQTL heritability) is significantly higher than the variation explained by the trans eQTLs (trans eQTL heritability) (p-value = 1.190e-10) (Figure 17 and Appendix-S4&S5).
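The 44% figure quoted above appears to follow from the ratio of the two reported means; as a quick check of the arithmetic, using only the values given in the text:

```latex
\[
  \frac{\bar{h}^2_{\text{peak SNP}}}{\bar{h}^2_{\text{expression trait}}}
  = \frac{0.16}{0.36} \approx 0.44
\]
```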
  • 50. 50 A detailed analysis was carried out of the genetic effects giving rise to variation in the mRNA expression levels of genes in lymphoblastoid cell lines. The analysis showed that 579 genes had significant cis associations and 32 genes had significant trans associations with the SNPs. It can be assumed that only a small proportion of the functional regulatory effects could be explained in these four population groups, owing to the limited power of the analysis. Besides this limitation, the analysis was focussed on a single cell type, so variation arising in other cell types is not captured. Therefore it is possible that a plethora of cis-regulatory variants were obtained from the analysis. The purpose of the analysis was not limited to the identification of cis-regulatory events alone; it also attempted to depict the characteristics of the cis-association signals. The four different populations used in this analysis gave rise to significant stratification. In order to avoid spurious association signals, the stratification was corrected by using the principal component scores as covariates. Although there are various ways in which stratification can be controlled (namely Genomic Control and Structured Association), PCA seems to perform much better. The computational effort required for PCA is also much lower than for SA and GC [78]. In PCA, the SNPs showing significant differences in allele frequency between population groups are corrected to a greater extent. This analysis provides some interesting insights into the variation in gene expression. The sensitivity of the analysis was greatly improved by using the pooled sample of four population groups (multipopulation) (Figure 10). 
Because of the small sample size in the single-population analysis, the genetic effects captured by the association testing were large, and this could possibly have given rise to a wider range of squared correlation coefficient
  • 51. 51 scores (R2 from 0.30 to 1). With the multipopulation analysis, weaker regulatory effects shared across the four population groups could be identified. It would have been possible to carry out conditional permutation calculations, which allow the members of the populations to be pooled if the population identities or their relationships are known. However, because the pooled sample consisted of unrelated individuals and the permutation calculations are complex and time consuming, linear regression was employed along with the PCA approach for the association analysis. Although outliers can have a pronounced effect on the p-values, this method offers prediction capabilities and allows additional factors to be added. Calculations were also much simpler and less time consuming with simple linear regression. The association testing was carried out for a few genes using the SNPs belonging to both HapMap Phase I and Phase II. However, due to the slow decay of Linkage Disequilibrium with the causal variants and the inclusion of variants with low minor allele frequency (less than 5%) [ref], the SNPs from the Phase II HapMap were used for the association testing. Hence association signals in some of the genes could easily pass the stringent significance threshold (Figure 9) when the Phase II genotypes were used. The presence of the majority of the cis-associated SNPs near the TSS (Figure 11) provides valuable information about the location of cis-regulatory variants in the genome. It can be said that most of these variants with cis effects are in genic or immediately intergenic regions of the human genome. Compared to cis effects, very few genes had significant trans associations with the SNPs. This was expected because trans regulation is considered to be more indirect. It is possible that a gene is regulated by a large number of SNPs, many of which fail to pass the stringent significance threshold. 
Also, the size of the sample used in this analysis possibly provides too little power to detect many trans-association signals. The
  • 52. 52 previous studies [ref] have also shown that it has been quite difficult to capture trans effects in humans as compared to yeast. As yeast is a unicellular organism, the biological interactions studied in a single cell are capable of capturing all of the other interactions. A human cell, however, is only a minute part of the whole organism. Therefore the majority of the trans effects mediated by intercellular events can be difficult to detect. Trans effects may also be shared across different cell types, thereby diluting them. Finally, the use of a strict significance threshold (FDR < 0.05) could have made the detection of trans eQTLs difficult by producing a large number of false negatives, giving rise to few trans effects compared to the large number of cis effects. However, such a stringent threshold was necessary to avoid false positives and to make sure that the eQTLs reported were indeed true eQTLs. Our analysis indicates that much of the variation of complex phenotypes in humans can be explained by cis-regulatory variants, because of their large enrichment relative to the group of potential trans-regulatory variants. From the analysis it was observed that trans effects were not as strong as cis effects. Most of the –log10P scores greater than 10 belonged to cis associations (Figure 13). Previous studies in mice [ref] and humans [ref] have made similar observations. In spite of the weak trans effects, several distant associations between SNPs and gene expression phenotypes were observed. Thus a larger sample size should allow the trans-regulation effects on the transcripts to be predicted more accurately. The analysis showed that the median heritability of the gene expression traits having eQTLs was 34.5%, which is consistent with previous studies [ref]. However, it was surprising to note that there were about 104 genes with heritability estimates below 0.2 (Figure 14). 
These genes were significantly associated with the SNPs and had eQTLs. This raised an important issue of missing heritability among the genes. The significant
  • 53. 53 associations of the individual SNPs with the gene expression trait, and the power to capture them, mainly depend on the variance that can be explained by the SNPs. This in turn depends on the linkage disequilibrium between the actual causal variant and the SNPs. Large effects of rare alleles, or small effects of causal variants, are unlikely to explain much of the variation and hence would not turn out to be significant even if the assayed SNPs were in high LD with the actual causal variant. However, the total effect of the SNPs genotyped in the Phase II HapMap was considered during the heritability calculations. Thus for a few genes we expect only a small fraction of the variance in the gene expression level to be explained by the SNPs, giving rise to the low heritability estimates for those genes. Even with ~2.2 million markers spanning the entire genome, several causal variants could still have a very low minor allele frequency (MAF) and poor LD with the assayed SNPs. As a result, the power to identify such causal variants in association studies is greatly reduced, which in turn reduces the heritability measures (i.e. the variance explained by the SNPs) of some of the genes. If there are many causal variants for a particular gene expression trait, it is likely that most of them each explain only a small percentage of the variance. Many GWASs have observed this phenomenon as a wide distribution of high test statistics across most of the genome [ref]. Could there be an ascertainment bias in the data analysis and a problem of population structure in the heritability analysis? For this reason, it was necessary to assume that individuals of different populations are unrelated and that their genetic relationships are almost zero. 
Therefore, to analyze the heritability of gene expression and eQTLs from the pooled population (Multipopulation), the approach was to estimate relationships in each individual population group separately and then merge all the relationship matrices by
  • 54. 54 setting the relationships between individuals of different populations to zero. With this adjusted relationship matrix, heritability measures were obtained for the genes and their eQTLs in the mixed populations. The REML analysis was also performed by fitting the first two eigenvectors from the mixed populations as covariates. Hence the results obtained at the end of the analysis were not biased by population stratification. Any individual having a relationship score > 0.025 with another individual was excluded from the analysis. The relationships obtained from the SNPs are based on the LD between the SNPs [ref], and this LD gives rise to significant association signals between a SNP (which is not itself the causal variant) and the gene expression trait. Since the heritability measures of gene expression take into account the variance explained by all the genotyped SNPs from the Phase II HapMap project, it is not necessary for individual SNPs to pass strict significance thresholds. The large variation in the heritability of the gene expression traits depends on the cell state. It is possible that under certain conditions of stress (hypoxia) or at other developmental stages, the genetic effect on gene expression is diluted to some extent [ref]. Since heritability is "the proportion of phenotypic variation caused by the additive genetic factors" [ref], the additive effects of the SNPs were fitted into the REML analysis. The variation caused by gene-environment interactions and non-additive effects does not affect the heritability estimates. The main aim of the heritability analysis was to find out how much of the variance in the gene expression traits is explained by the SNPs. The analysis shows that the SNPs explained from as little as 0% to as much as 86% of the variation among the phenotypes, and that the missing heritability in some of the genes could be due to a lack of complete LD between the causal variants and the SNPs. 
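The merging of per-population relationship matrices described above, with between-population relationships set to zero, amounts to building a block-diagonal matrix. The sketch below illustrates that idea together with the 0.025 relatedness cut-off; it is not the GCTA procedure itself, and the function names and the toy matrices are hypothetical.

```python
import numpy as np

def merge_population_grms(grms):
    """Stack per-population relationship matrices block-diagonally.

    Entries between individuals of different populations stay at zero.
    """
    n = sum(g.shape[0] for g in grms)
    merged = np.zeros((n, n))
    start = 0
    for g in grms:
        end = start + g.shape[0]
        merged[start:end, start:end] = g
        start = end
    return merged

def related_pairs(grm, cutoff=0.025):
    """Index pairs whose off-diagonal relationship exceeds the cutoff."""
    rows, cols = np.where(np.triu(grm, k=1) > cutoff)
    return [(int(r), int(c)) for r, c in zip(rows, cols)]

# Toy matrices for two populations of 2 and 3 individuals.
grm_a = np.array([[1.00, 0.01],
                  [0.01, 1.00]])
grm_b = np.array([[1.00, 0.05, 0.00],
                  [0.05, 1.00, 0.01],
                  [0.00, 0.01, 1.00]])
merged = merge_population_grms([grm_a, grm_b])
pairs = related_pairs(merged)
```

Here `pairs` flags the one within-population pair above the cut-off, from which one individual would be dropped before the heritability analysis.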
However, there were a few genes which had high heritability ( ) but were not significantly associated with any SNP. The variation in these gene expression traits could have been regulated by
  • 55. 55 multiple SNPs or loci, each contributing a small effect and together regulating the overall expression. From previous studies [ref], it is known that loci with small effects are more difficult to capture than those with large effects. Previous studies [ref] have also suggested that the magnitude of the heritability of a gene expression trait depends heavily on the precision and power of the gene mapping experiments. Thus the detection of eQTLs is less likely if there is a large environmental variance. The analysis has, however, shown that the heritability estimate of a gene expression trait does not necessarily determine the presence or absence of eQTLs for a particular gene. The analysis showed that cis-eQTLs have higher estimates of heritability (median = 0.3592) than trans-eQTLs (median = 0.1017). There were about 36 eQTLs (35 cis-eQTLs and 1 trans-eQTL) that had big allelic effects and explained more than 50% of the variance in the gene expression levels (Appendix-S4&S5). This implies that the variation in the expression levels of these genes was also due to the small effects of multiple other loci together with the major cis-eQTL. Several studies in yeast have made similar observations [ref].
  • 57. 57 As the variation explained by individual SNPs is small, the results presented in this work point towards larger association studies (with a larger number of individuals) to identify significant associations between SNPs and gene expression traits. The heritabilities of some of the gene expression traits are still insignificant, possibly due to weak LD between the SNPs and the causal variants. Such causal variants and rare polymorphisms are likely to be identified by detailed resequencing studies and by the use of advanced genotyping arrays. The low heritability of some of the genes strongly associated with eQTLs could be due to a large number of causal variants with very small effects, and a larger sample size would be needed for these effects to reach significance. Therefore one needs to be careful when selecting a phenotype for fine mapping based only on the heritability estimates. This is mainly because genes with low heritability (< 0.2) may still have significant associations with SNPs. Although some rare alleles with large effects can explain a small proportion of the variation in gene expression, a large sample size would still be required for these effects to be statistically significant. A comprehensive analysis of the variation in the gene expression phenotypes among 210 unrelated individuals from four different population groups has been described. The detailed genetic characteristics and the approximate positions of the cis and trans effects across the human genome have been studied in this work. The results, and the detailed analysis of the heritability and eQTL characteristics of the gene expression traits, could provide valuable insights and a robust framework for further downstream analysis and for future studies of gene expression variation among large population groups, with each group having a large number of individuals
  • 58. 58 (large sample size), using different types of cells and tissues, and involving SNPs with low MAF (< 0.001) and higher SNP densities. I strongly believe that some of the observations presented in this study can be used to interpret the association signals found in some diseases and to identify the biological effects of the SNPs, or of a particular region (locus) in the genome, that show significant association in the disease states. Thus it would be possible to explain the functional variation taking place in the genome and how this variation could lead to variation in the phenotypes across human populations.
  • 59. 59 References: 1. Kliebenstein, D.J. (2009). "Quantitative Genomics; analyzing intra-specific variation using global gene expression polymorphisms or eQTLs." Annual Review of Plant Biology 60(1): 93-114. 2. Lewin, B. (2008). Genes IX. Sudbury, MA; London, Jones and Bartlett. 3. Monks, S. A., A. Leonardson, et al. (2004). "Genetic inheritance of gene expression in human cell lines." Am J Hum Genet 75(6): 1094-105. 4. Morley, M., C. M. Molony, et al. (2004). "Genetic analysis of genome-wide variation in human gene expression." Nature 430(7001): 743-7. 5. Cheung, V. G., R. S. Spielman, et al. (2005). "Mapping determinants of human gene expression by regional and genome-wide association." Nature 437(7063): 1365-9. 6. Stranger, B. E., M. S. Forrest, et al. (2007). "Relative impact of nucleotide and copy number variation on gene expression phenotypes." Science 315(5813): 848-53. 7. Stranger, B. E., A. C. Nica, et al. (2007). "Population genomics of human gene expression." Nat Genet 39(10): 1217-24. 8. Eckhardt, F., J. Lewin, et al. (2006). "DNA methylation profiling of human chromosomes 6, 20 and 22." Nat Genet 38(12): 1378-85. 9. Petronis, A. (2006). "Epigenetics and twins: three variations on the theme." Trends Genet 22(7): 347-50. 10. Gibson, G. (2008). "The environmental contribution to gene expression profiles." Nat Rev Genet 9(8): 575-81.
  • 60. 60 11. Idaghdour, Y., J. D. Storey, et al. (2008). "A genome-wide gene expression signature of environmental geography in leukocytes of Moroccan Amazighs." PLoS Genet 4(4): e1000052. 12. Hartman, J. L. t., B. Garvik, et al. (2001). "Principles for the buffering of genetic variation." Science 291(5506): 1001-4. 13. Jin, W., R. M. Riley, et al. (2001). "The contributions of sex, genotype and age to transcriptional variance in Drosophila melanogaster." Nat Genet 29(4): 389-95. 14. Brem, R. B., G. Yvert, et al. (2002). "Genetic dissection of transcriptional regulation in budding yeast." Science 296(5568): 752-5. 15. Schadt, E. E., S. A. Monks, et al. (2003). "Genetics of gene expression surveyed in maize, mouse and man." Nature 422(6929): 297-302. 16. Stranger, B. E. and E. T. Dermitzakis (2005). "The genetics of regulatory variation in the human genome." Hum Genomics 2(2): 126-31. 17. Stranger, B. E., M. S. Forrest, et al. (2005). "Genome-wide associations of gene expression variation in humans." PLoS Genet 1(6): e78. 18. Boone, C., H. Bussey, et al. (2007). "Exploring genetic interactions and networks with yeast." Nat Rev Genet 8(6): 437-49. 19. Storey, J. D., J. Madeoy, et al. (2007). "Gene-expression variation within and among human populations." Am J Hum Genet 80(3): 502-9. 20. Hartl, D. L. and A. G. Clark (2007). Principles of population genetics. Sunderland, Mass., Sinauer Associates.
  • 61. 61 21. Levy, S., G. Sutton, et al. (2007). "The diploid genome sequence of an individual human." PLoS Biol 5(10): e254. 22. International HapMap Consortium (2003). "The International HapMap Project." Nature 426(6968): 789-96. 23. International HapMap Consortium (2005). "A haplotype map of the human genome." Nature 437(7063): 1299-320. 24. International HapMap Consortium (2007). "A second generation human haplotype map of over 3.1 million SNPs." Nature 449(7164): 851-61. 25. Li MY, Reilly MP, Rader DJ, Wang LS (2010) Correcting population stratification in genetic association studies using a phylogenetic approach. Bioinformatics 26(6): 798–806. 26. Hamer D, Sirota L (2000) Beware the chopsticks gene. Mol Psychiatry 5: 11–13. 27. Serre,D. et al. (2008) Correction of population stratification in large multi-ethnic association studies. PLoS ONE, 1, e1382. 28. Devlin, B., K. Roeder, and L. Wasserman (2001). Genomic Control, a New Approach to Genetic-Based Association Studies. Theoretical Population Biology 60 (3), 155-166. 29. Devlin, B., S. Bacanu, and K. Roeder (2004). Genomic Control to the extreme. Nature Genetics 36 (11), 1129-1130.
  • 62. 62 30. Devlin, B. and K. Roeder (1999). Genomic Control for Association Studies. Biometrics 55 (4), 997-1004. 31. Marchini, J., L. Cardon, M. Phillips, and P. Donnelly (2004). The effects of human population structure on large genetic association studies. Nature Genetics 36, 512-517. 32. Zhang, F., Y. Wang, and H.-W. Deng (2008). Comparison of population-based association study methods correcting for population stratification. PLoS ONE 3 (10), e3392 33. Pritchard,J.K. and Rosenberg,N.A. (1999) Use of unlinked genetic markers to detect population stratification in association studies. Am. J. Hum. Genet., 65, 220–228. 34. Li,Q. and Yu,K. (2008) Improved correction for population stratification in genome wide association studies by identifying hidden population structures. Genet. Epid., 32, 215– 226. 35. Pearson TA, Manolio TA (March 2008). "How to interpret a genome-wide association study". J. Am. Med. Ass. 299 (11): 1335–44. 36. Hunter DJ, Altshuler D, Rader DJ (June 2008). "From Darwin's Finches to Canaries in the Coal Mine — Mining the Genome for New Biology". N. Engl. J. Med. 358 (26): 2760–63.
  • 63. 63 37. Doerge RW (2002). "Mapping and analysis of quantitative trait loci in experimental populations". Nat. Rev. Genet. 3:42-52. 38. Rachel B Brem, Gael Yvert, Rebecca Clinton, and Leonid Kruglyak (April 2002). "Genetic dissection of transcriptional regulation in budding yeast." Science, 296(5568):752–755. 39. Eric E Schadt, Stephanie A Monks, Thomas A Drake, Aldons J Lusis, Nam Che, Veronica Colinayo, Thomas G Ruff, Stephen B Milligan, John R Lamb, Guy Cavet, Peter S Linsley, Mao Mao, Roland B Stoughton, and Stephen H Friend (March 2003). "Genetics of gene expression surveyed in maize, mouse and man." Nature, 422(6929):297–302. 40. Susanna Wang, Nadir Yehya, Eric E Schadt, Hui Wang, Thomas A Drake, and Aldons J Lusis (February 2006). "Genetic and genomic analysis of a fat mass trait with complex inheritance reveals marked sex specificity." PLoS Genet, 2(2):e15. 41. Ronald J, Brem R, Whittle J, Kruglyak L (2005). "Local Regulatory variation in Saccharomyces cerevisiae". PLoS Genet 1:e25. 42. GuhaThakurta D, Xie T, Anand M, Edwards S, Li G, et al. (2006). "Cis-regulatory variations: A study of SNPs around genes showing cis-linkage in segregating mouse populations". BMC Genomics 7:235.
  • 64. 64 43. Cheung V, Conlin L, Weber T, Arcaro M, Jen K, et al. (2003) Natural variation in human gene expression assessed in lymphoblastoid cells. Nat Genet 33: 422–425. doi:10.1038/ng1094. 44. Dixon A, Liang L, Moffatt M, Chen W, Heath S, et al. (2007) A genome-wide association study of global gene expression. Nature Genetics 39: 1202–1207. 45. Göring H, Curran J, Johnson M, Dyer T, Charlesworth J, et al. (2007) Discovery of expression QTLs using large-scale transcriptional profiling in human lymphocytes. Nature Genetics 39: 1208–1216. 46. Emilsson V, Thorleifsson G, Zhang B, Leonardson A, Zink F, et al. (2008) Genetics of gene expression and its effect on disease. Nature 452: 423–428. 47. Dunning, M. J., M. L. Smith, et al. (2007). "beadarray: R classes and methods for Illumina bead-based data." Bioinformatics 23(16): 2183-4 48. Kent WJ (2002) BLAT–the BLAST-like alignment tool. Genome Res 12:656–64. 49. Pruitt K, Tatusova T, Maglott D (2007) NCBI reference sequences (Ref-Seq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Research 35: D61–D65. 50. Veyrieras B, Kudaravalli S, Kim SY, Dermitzakis ET, Gilad Y, Stephens M, Pritchard JK. 2008. High-resolution mapping of expression-QTLs yields insight into human gene regulation. PLoS Genet 4:e1000214.
  • 65. 65 51. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. (June 2002) "The human genome browser at UCSC". Genome Res 12(6):996-1006. 52. McCarroll, S.A. et al. (2006) Common deletion polymorphisms in the human genome. Nat. Genet. 38, 86-92. 53. Spielman RS, Bastone LA, Burdick JT, Morley M, Ewens WJ, et al. (2007) Common genetic variants account for differences in gene expression among ethnic groups. Nat Genet 39: 226–231. 54. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, et al. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81: 559–575. 55. Aulchenko YS, Ripke S, Isaacs A, van Duijn CM (May 2007). "GenABEL: an R library for genome-wide association analysis". Bioinformatics 23(10):1294-6. 56. Veyrieras J-B, Kudaravalli S, Kim SY, Dermitzakis ET, Gilad Y, Stephens M, Pritchard JK (2008) High-Resolution Mapping of Expression-QTLs Yields Insight into Human Gene Regulation. PLoS Genet 4:e1000214. 57. Yang, et al. (2010) Common SNPs explain a large proportion of the heritability for human height, Nature Genetics online June 2010.
  • 66. 66 58. Maher, B. (2008) Personal genomes: The case of the missing heritability. Nature 456, 18– 21. 59. Manolio, T.A. et al. (2009) Finding the missing heritability of complex diseases. Nature 461, 747–753. 60. Frazer, K.A., Murray, S.S., Schork, N.J. & Topol, E.J. (2009) Human genetic variation and its contribution to complex traits. Nat. Rev. Genet. 10, 241–251. 61. Pritchard, J.K. (2001) Are rare variants responsible for susceptibility to complex diseases? Am. J. Hum. Genet. 69, 124–137. 62. Johannes, F., Colot, V. & Jansen, R.C. (2008) Epigenome dynamics: a quantitative genetics perspective. Nat. Rev. Genet. 9, 883–890. 63. Johannes, F. et al. (2009) Assessing the impact of transgenerational epigenetic variation on complex traits. PLoS Genet. 5, e1000530. 64. Hayes, B.J., Visscher, P.M. & Goddard, M.E. (2009) Increased accuracy of artificial selection by using the realized relationship matrix. Genet. Res. 91, 47–60. 65. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, et al. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38: 904–909.
  • 67. 67 66. Kadarmideen H.N. (2008). Genetical systems biology in Livestock – Application to GnRH and Reproduction. IET Systems Biology 2: 423-441. 67. Emilsson V, Thorleifsson G, Zhang B, et al. Genetics of gene expression and its effect on disease. Nature 2008;452(7186):423–8. 68. Dixon AL, Liang L, Moffatt MF, et al. A genome-wide association study of global gene expression. Nat Genet 2007;39(10):1202–7. 69. Goring HH, Curran JE, Johnson MP, et al. Discovery of expression QTLs using large-scale transcriptional profiling in human lymphocytes. Nat Genet 2007;39(10):1208–16. 70. Schadt EE, Molony C, Chudin E, et al. Mapping the genetic architecture of gene expression in human liver. PLoS Biol 2008;6(5):e107. 71. Meng, J. F. & Fingerlin, T. E. 2008: Linear models for analysis of multiple single nucleotide polymorphisms with quantitative traits in unrelated individuals. Ann. Zool. Fennici 45: 429–440. 72. Stranger BE, Nica AC, Forrest MS, Dimas A, Bird CP, Beazley C, Ingle CE, Dunning M, Flicek P, Koller D, Montgomery S, Tavare S, Deloukas P, Dermitzakis ET, 2007: Population genomics of human gene expression. Nat Genet 39(10): 1217-1224.
  • 68. 68 73. Duggal P, Gillanders EM, Holmes TN, Bailey-Wilson JE. Establishing an adjusted p- value threshold to control the family-wide type 1 error in genome wide association studies. BMC Genomics. 2008 Oct 31;9:516. 74. Dudoit, S., J. P. Shaffer and J. C. Boldrick, 2003. Multiple hypothesis testing in microarray experiments. Stat. Sci. 18 71–103. 75. Holm, S., 1979. A simple sequentially rejective multiple test procedure. Scand. J. Stat. 6 65–70. 76. Patterson, H.D. & Thompson, R. Recovery of inter-block information when block sizes are unequal.Biometrika 58, 545–554 (1971). 77. Yang J, Lee SH, Goddard ME and Visscher PM. GCTA: a tool for Genome-wide Complex Trait Analysis. Am J Hum Genet. 2011 Jan 88(1): 76-82. 78. Price AL, Zaitlen NA, Reich D, Patterson N. New approaches to population stratification in genome-wide association studies. Nat Rev Genet 2010;11(7):459-463. 79. Falconer, D. S. & Mackay TFC (1996). Introduction to Quantitative Genetics. Fourth edition. Addison Wesley Longman, Harlow, Essex, UK 80. Daniel L. Hartl and Elizabeth W. Jones, Essential Genetics: A Genomics Perspective. Sudbury, MA: Jones and Bartlett, 2002.
  • 71. 71 S1: BLAT search
    As reported in the methods section, a BLAT [1] search was carried out to map the probes to the human RNA sequences [2], and the identity scores of the matches were checked. BLAT [1] was also used to locate each probe sequence in the human genome. The search results showed whether a probe mapped to a single human RNA sequence or to several. Only the probes with an exact match to a unique gene were retained. The steps of the BLAT [1] search flowchart are listed below.
    1. Paste the probe sequences into the BLAT query search.
    2. The BLAT search results reveal the likely location of each probe sequence and the identity score of its match with different regions of the human genome.
    3. Discard the probe sequences with more than 90% identity to multiple regions of the genome.
    4. For the remaining probe sequences, confirm the location of the probe sequence in the human genome with the help of the genome browser.
    This search can be done manually for a small number of probe sequences, but for a large number it becomes tedious and painstaking. The BLAT [1] search was therefore automated.
  • 72. 72 For this purpose, the human RNA sequences from RefSeq (hg19) [2] and the BLAT source and executables [1] were first downloaded. The following script was then used to automate the search process. After comparing the probes against the human RNA sequences from RefSeq (hg19) [2], the above script retained only the probes with score > 20 and % identity >= 90%.
    S2: Table showing the 10 eigenvector scores obtained from the PCA.
  • 76. 76 S3: Association studies
    Genome-wide association analysis was performed using PLINK [3], which carried out the linear regression analysis for the association testing. The basic files required for the association testing were:
    1. "Pedigree (PED) file – contains genotype information" [3]
    2. "Map (MAP) files – position and name of the markers present in PED file" [3]
    Running the association analysis directly on these two files was time-consuming and computationally intensive. Hence, the files were first converted into binary files, which were then used for the association testing.
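To make the association test above concrete, the following is a minimal sketch of a single-SNP linear regression of expression on allele dosage with covariates. It is not PLINK's implementation: the function name `snp_association`, the additive 0/1/2 coding, and the simulated data are assumptions made for illustration only.

```python
import numpy as np

def snp_association(expr, dosage, covars):
    """Regress expression on allele dosage (0/1/2) plus covariates.

    Returns the estimated SNP effect (beta) and its t-statistic.
    """
    expr = np.asarray(expr, dtype=float)
    n = expr.shape[0]
    # Design matrix: intercept, additive SNP dosage, covariates (e.g. PCA eigenvectors).
    X = np.column_stack([
        np.ones(n),
        np.asarray(dosage, dtype=float),
        np.asarray(covars, dtype=float).reshape(n, -1),
    ])
    coef, _, _, _ = np.linalg.lstsq(X, expr, rcond=None)
    resid = expr - X @ coef
    dof = n - X.shape[1]                      # residual degrees of freedom
    sigma2 = resid @ resid / dof              # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)     # covariance of the coefficients
    beta = coef[1]
    t_stat = beta / np.sqrt(cov[1, 1])
    return beta, t_stat

# Simulated example (hypothetical data): 210 individuals, 10 covariates,
# true SNP effect of 0.5 on expression.
rng = np.random.default_rng(0)
n = 210
dosage = rng.integers(0, 3, size=n)
pcs = rng.normal(size=(n, 10))
expr = 1.0 + 0.5 * dosage + rng.normal(scale=0.5, size=n)
beta, t_stat = snp_association(expr, dosage, pcs)
```

A p-value would follow from comparing `t_stat` against a t distribution with `n - 12` degrees of freedom (intercept, SNP and 10 covariates); that step is left out here to keep the sketch dependent on NumPy only.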
  • 77. 77 The commands used for the association testing in PLINK: Here, 'bfile hapmap23maf0.05' reads the genotypes, names and positions of the markers in the binary format. 'df1.xls' contains all the gene expression values. Only the markers with MAF > 0.05 are taken for the analysis. '210.cov' contains the eigenvectors in the form of covariates. Only p-values <= 1e-5 are reported after the analysis. The p-values are then adjusted for multiple testing. Figures S3-1 and S3-2 show the PLINK output of an association test for gene 'ZNF266'.
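The text does not state which multiple-testing adjustment was applied. As an illustration only, the sketch below shows two common choices, Bonferroni and Holm step-down adjustment; the function names and the example p-values are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def holm(p_values):
    """Holm step-down adjustment, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is scaled by m, the next by m - 1, and so on;
        # the running maximum keeps the adjusted values monotone.
        running_max = max(running_max, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

# Hypothetical p-values for three tests.
p = [0.01, 0.04, 0.03]
bonf = bonferroni(p)
holm_adj = holm(p)
```

Either adjustment should count every test that was performed, not only the associations that pass the 1e-5 reporting filter; otherwise the correction is too lenient.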
  • 78. 78 S4: Genes showing significant cis associations and their heritability estimates
  • 86. 86 S5: Genes showing significant trans associations and their heritability estimates
  • 87. 87 S6: Estimating the heritability of the gene expression trait and eQTLs
    The heritability calculation by the GCTA program [4] requires the same input files as PLINK [3] (i.e. 'PED' and 'MAP' files in binary format). The heritability calculations were performed in 3 steps:
    1. "Estimating the genetic relationship matrix (GRMs) between the individuals from the SNP markers" [4]. Markers from the sex chromosomes were excluded from the calculations. Command used:
       gcta --bfile hapmap23maf0.05 --autosome --maf 0.05 --make-grm --out hapmap23maf0.05grm
       The above command uses the binary genome-wide SNPs (MAF > 0.05) from all 22 autosomes and constructs the GRM between all pairs of individuals.
    2. Estimating the genetic relationship matrices (GRMs) between the individuals from the locus (100 – 1000 kb region) that is significantly associated with a particular gene (for eQTL heritability). Markers from the sex chromosomes were excluded from the calculations. Command used:
       gcta --bfile hapmap23maf0.05 --extract gene.snp.list --autosome --maf 0.05 --make-grm --out hapmap23maf0.05genegrm
       The above command uses only the autosomal SNPs (MAF > 0.05) from the selected locus known to be associated with a particular gene and constructs the GRM between all pairs of individuals.
  • 88. 88 The GRMs were constructed separately for each of the four population groups and merged together in a file named 'multi_grm.txt'.
    3. Estimating the variance explained by all the SNPs (heritability of the gene expression trait) and by a significantly associated locus (~5 or more significantly associated SNPs within a span of 1000 kb). Command used:
       gcta --reml --mgrm multi_grm.txt --pheno gene.phen --qcovar hapmap23maf0.05_10PCs.txt --out hapmap23maf0.05gene_mgrm
       The above command fits both the GRMs constructed in steps 1 and 2 into the REML model. 'gene.phen' contains the expression values from all the 210 individuals. 'hapmap23maf0.05_10PCs.txt' contains all the 10 eigenvectors. The merged GRMs and the eigenvectors (in the form of covariates) were used in the REML analysis to correct for population structure.
    Output obtained from GCTA for gene 'ZNF266', where:
       V(1) = variance explained by the eQTL
       V(2) = variance explained by all the SNPs
       V(e) = environmental variance
       V(p) = phenotypic variance
       V(1)/V(p) = heritability estimate of the eQTL
       V(2)/V(p) = heritability estimate of the gene expression phenotype
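The quantities in S6 can be illustrated with a short sketch. It assumes the standardised-genotype form of the genetic relationship matrix described for GCTA [4], applied here to the diagonal as well (GCTA treats the diagonal slightly differently), and it assumes V(p) is the sum of the fitted components. The function names `grm` and `variance_proportions`, and the toy data, are hypothetical.

```python
import numpy as np

def grm(genotypes):
    """Genetic relationship matrix from an (individuals x SNPs) 0/1/2 matrix."""
    G = np.asarray(genotypes, dtype=float)
    p = G.mean(axis=0) / 2.0                 # sample allele frequency per SNP
    keep = (p > 0) & (p < 1)                 # drop monomorphic SNPs (zero variance)
    G, p = G[:, keep], p[keep]
    # Standardise each SNP: centre by 2p, scale by sqrt(2p(1-p)).
    Z = (G - 2 * p) / np.sqrt(2 * p * (1 - p))
    return Z @ Z.T / Z.shape[1]              # average over SNPs

def variance_proportions(v_eqtl, v_all, v_e):
    """Return V(1)/V(p) and V(2)/V(p), taking V(p) = V(1) + V(2) + V(e)."""
    v_p = v_eqtl + v_all + v_e
    return v_eqtl / v_p, v_all / v_p

# Toy example: 6 individuals, 50 SNPs of random genotypes.
rng = np.random.default_rng(1)
A = grm(rng.integers(0, 3, size=(6, 50)))
h2_eqtl, h2_all = variance_proportions(1.0, 2.0, 7.0)
```

Because each SNP is centred on its sample allele frequency, every row of `A` sums to approximately zero, which gives a quick sanity check on the construction.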
  • 89. 89 References 1. Kent WJ (2002) BLAT–the BLAST-like alignment tool. Genome Res 12: 656–664. 2. Pruitt K, Tatusova T, Maglott D (2007) NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 35: D61–D65. 3. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, et al. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81: 559–575. 4. Yang J, Lee SH, Goddard ME, Visscher PM (2011) GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet 88(1): 76–82.