University of Agricultural Sciences
Department of Genetics and Plant Breeding
PG Seminar: GPB 581(0+1)
Genome wide association studies.
Submitted By: Varsha Gayatonde
Sr.MSc, PALB 2235
Dept. of Genetics & Plant Breeding
UAS, GKVK, Bangalore
Submitted To: Dr. R.Nandini
Dept. of Genetics & Plant Breeding
UAS, GKVK, Bangalore
Department of Genetics and Plant Breeding
University of Agricultural Sciences
SI. No. Title Content
1 Introduction to mapping.
3 Brief History
4 Association mapping
5 Comparison of GWAS and Biparental mapping
6 Concept of Linkage disequilibrium(LD)
7 Factors affecting LD and use in plant system.
8 Genome wide association studies.
10 Challenges while conducting GWAS
11 Advantages and disadvantages
12 GWAS studies in Arabidopsis
13 Studies on Rice
14 Maize smut studies
GWAS studies on MYB related traits and in other
16 Current association challenges
DEPARTMENT OF GENETICS AND PLANT BREEDING
GKVK, UNIVERSITY OF AGRICULTURAL SCIENCES, BANGALORE – 560 065
First Seminar: GPB 581 (0+1)
GENOME WIDE ASSOCIATION STUDIES
Genome wide association is a study design in which many markers spread across a genome, are
genotyped and test a statistical association with a phenotype are performed locally along the
genome. It is also an examination of many common genetic variants in different individuals to
see if any variant is associated with a trait.
The first prospects for whole genome association studies began in early 2002¹. This LD
based association mapping started with human beings, later in Arabidopsis, rice, grapevine,
wheat, soybean, maize, tomato and other model organisms.̒ HapMap,’ the multi-country effort to
identify, catalog common human genetic variant put a milestone to extend application to other
organisms in order to make GWAS powerful. SNPs need to be chosen widely distributed in a
way, that reflects the genetic variation. Selection of suitable and desirable markers yield fine
mapping² and the genome wide chips, which enabled increased coverage of markers improving
power in association signals. But this doesn’t necessarily imply increased power of detecting
association loci. The other drawbacks here are need of large population size, pooling and cost of
preparing DNA samples and less knowledge about the risk of a trait.
To overcome this drawback recently researchers upgraded the statistical approaches,
proper imputation of genotypes and advanced approaches like nested association mapping,
candidate gene association approach, and the web application of GWAS (GWAPP in
Arabidopsis)³. Despite of its drawbacks still GWAS is famous due to its dropping genotyping
costs, which is likely to drive association studies away from candidate gene based studies. This
will likely tinvolve whole genome resequencing of all the individuals in a population, will allow
assessment of the effect of point mutation,insertions,deletions and large structure variation.
Many studies conducted using GWAS as a tool worldwide on different factors like
temperature effect on cobs, agronomic variants, agroclimatic diversities, flowering and grain
yield traits, disease diversity etc. A major benefit of GWAS is one time genotyping and repeated
phenotyping in different environmental conditions help to study ̒n’ number of traits within a
short period over a large area. The rapid development of high throughput sequencing technology
is that, population choice for GWAS studies will no longer be restricted to current model
organisms and will slowly become more forced on which species are more relevant for
answering biological questions.
1. Ozoki, K., 2001, A high throughput SNP typing system for GWAS, Springer. 16:1134-
2. Tohn, P, A., 2009, Validating and refining GWAS signals, Nature. 10:318-329.
Time: 10:00 AM
The level of the genetic diversity is pivotal for world food security and
survival of human civilization on earth. Domestication resulted as improved
cultivars in several crops to produce food for the better supply of the human
diet. Presently 150 plant species cultivated in agriculture, twelve provide about
75% of human food and four produce 50% of human diet. According to FHO
report, ∼800 million people are suffering from food deficiency. An attention to
improve agricultural production to eliminate or, at least, reduce the feeding
The narrow genetic base of modern crop cultivars is the serious obstacle to
sustain and improve crop productivity due to rapid vulnerability to potentially
new biotic and abiotic stresses. Plant germplasm resources comprising of wild
plant species, modern cultivars, and their crop wild relatives, are the important
reservoirs of natural genetic variations.
Originated from a number of historical genetic events as a respond to
environmental stresses and selection through crop domestication
• The objective of genetic mapping is to identify simply inherited markers
in close proximity to genetic factors affecting quantitative traits
(Quantitative trait loci, or QTL).
• This localization relies on processes that create a statistical association
between marker and QTL alleles and processes that selectively reduce
that association as a function of the marker distance from the QTL.
Why we need genome mapping?
Gene mapping in the map of genes present inside our chromosome. In
Eukaryotes genes are condensed tightly inside the compact system. We have to
know which gene is answering for the trait of interest. Expression of genotypes
give us phenotypes. We can’t look in to gene and genotypes, though it is our
disability. So to know that we have to calculate mathematically. The further
extension of mapping technology is to know the traits in a more easier, cheaper
and within a short period of time.
• The genome is all the DNA in a cell. All the DNA on all the chromosomes
,includes genes, intergenic sequences, repeats, Specifically, it is all the
DNA in an organelle.
• Eukaryotes can have 2-3 genomes; Nuclear genome, Mitochondrial
genome, Plastid genome respectively. If not specified, “genome” usually
refers to the nuclear genome.
False negative: the declaration of an outcome as statistically non-significant,
when the effect is actually genuine.
False positive: the declaration of an outcome as statistically significant, when
there is no true effect.
Linkage: refers to coinheritance of different loci within a genetic distance on
Linkage equilibrium: LE is a random association of alleles at different loci and
equals the product of allele frequencies within haplotypes.
Linkage disequilibrium: LD is a non-random association of alleles at different
loci, describing the condition with non-equal frequency of haplotypes in a
Minor allele Frequency(MAF):The frequency of the less common alleles of a
polymorphic locus. Its value lies between 0 to 0.5,and can be vary between
Odd ratio: Measurement of association that is commonly used in case control
studies. Defined as odd of exposure to the susceptible genetic variant in case
compared with that in controls. If OR significantly greater than 1,then the
genetic variant is associated with a disease.
Association mapping, a high resolution method for mapping quantitative
trait loci based on linkage disequilibrium. Association refers to covariance of a
marker polymorphism and a trait of interest.
The first association study to attempt a genome scanning plants
was conducted in sea beet (Beta vulgaris ssp. maritima), a wild relative of sugar
beet (Beta vulgaris ssp. vulgaris).The first association study of a quantitative
trait based on a candidate gene was the analysis of flowering time and the
dwarf8 (d8) gene in maize. . Association mapping is based on the principle of
Linkage disequilibrium (LD) and is based on the entire population.
How it works?
A group of unrelated individuals normally presents variation for many
phenotypic aspects, thus several traits can be studied in the same population
using the same genotypic data. A higher proportion of molecular markers are
likely to be polymorphic, providing better genome coverage than any biparental
map. Elite lines are used for study, multi-year and multi-location phenotypic
data may be available at no additional cost.
Goal of association mapping
Identification of susceptibility variant, replication in differecohort/population, u
nderstanding of genetic function at cell level,this can lead to
identification of durable targets, development of drug for prevention better
understanding of the cellular processes that are involved in disease treatments.
Association mapping offers many advantages over linkage analysis:
• much higher mapping resolution;
• greater allele number and broader reference population;
• less research time in establishing an association
• Utilizes existing individuals.
• Multi-trial phenotypic data stored in databases can be used.
• Resources for phenotyping and statistical issues.
• Population structure results in spurious associations.
Two types of association mapping:
• Success of either methods depends on population size and degree of LD
• Genome wide scanning Markers spanned across the genome, Moderate
to extensive LD. If LD is high, GWA is useful with low resolution
• Candidate gene scanning Sequencing only candidate gene which has
Flowchart of a gene association study
Biparental mapping GWAS
What are genome-wide association studies?
Genome-wide association studies are a relatively new way for scientists to
identify genes involved in human disease. This method searches the genome for
small variations, called single nucleotide polymorphisms or SNPs (pronounced
“snips”), that occur more frequently in people with a particular disease than in
people without the disease. Each study can look at hundreds or thousands of
SNPs at the same time. Researchers use data from this type of study to
pinpoint genes that may contribute to a person’s risk of developing a certain
Because genome-wide association studies examine SNPs across the
genome, they represent a promising way to study complex, common diseases in
which many genetic variations contribute to a person’s risk. This approach has
already identified SNPs related to several complex conditions including
diabetes, heart abnormalities, Parkinson disease, and Crohn disease.
Researchers hope that future genome-wide association studies will identify
more SNPs associated with chronic diseases, as well as variations that affect a
person’s response to certain drugs and influence interactions between a
person’s genes and the environment.
1) No cross required, works with
2) Phenotypic data can be already
3) High resolution.
4) More than 2 alleles are tested.
5) Many loci for a single trait are
6) Comparatively low.
1) Experimental cross required.
2) Phenotypes to be collected.
3) Limited mapping resolution.
4) Essentially 2 alleles are tested
5) Constraints to segregating loci
6) between parental lines.
High detection power
Genome-wide case-control studies; Genome-wide genetic association analysis.
Genome-wide association studies (GWAS) are projects to investigate the
statistical association between phenotypes and a dense set of genetic markers
(Genetic Marker) that capture a substantial amount of genetic variations in the
genome, using a large number of matched samples.
Phenotypes can be qualitative traits such as disease status or quantitative
traits such as blood pressure. Statistical association between disease status
and alleles of a genetic marker is carried out by categorical data analysis.
Genetic markers are usually genotyped by microarray chips. Whether a
substantial genetic variation in the genome, including common, rare, and
structural variations, is captured by the set of markers depends on the number
of markers and their chromosome locations.
The typical number of single nucleotide polymorphism (SNP) markers
used in a current GWAS,
societies depends on the exploitation of genetic recombination and allelic
diversity for crop improvement, and many of the world’s farmers depend
directly on the harvests of the genetic diversity they sow for food and fodder as
well as the next seasons seed (Smale et al., 2004).
The considerable genetic diversity of traditional varieties of crops is the
most immediately useful and economically valuable part of global biodiversity.
Subsistence farmers use landraces as a key component of their cropping
systems. Such farmers account for about 60% of agricultural land use and
provide approximately 15-20% of the world’s food (Francis, 1986). In addition,
landraces are the basic raw materials used by plant breeders for developing
modern varieties. Over the last few decades, awareness of the rich diversity of
exotic or wild germplasm has increased. This has lead to a more intensive use
of this germplasm in breeding and thereby yields of many crops increased
Aim to identify which regions (or SNPs) in the genome are associated with
disease or certain phenotype.
Design: it identifies the population structure, Select case subjects (those
with disease),Select control subjects (healthy),Genotype a million SNPs for each
subject, Determine which SNP is associated, Encoded data ,Ranking SNPs.
History of GWAS
Successful study published in 2005, with investigating patients age
related molecular degeneration.
Prior to GWAS in 2000 Inheritance studies of linkage families. Then
the revolution occurred in HapMap2003, which is the variety of sequencing
techniques to discover and catalog SNPs in different population.
Human Genome Project
The Human Genome Project was declared complete in April 2003. An
initial rough draft of the human genome was available in June 2000 and by
February 2001 a working draft had been completed and published followed by
the final sequencing mapping of the human genome on April 14, 2003.
Although this was reported to be 99% of the human genome with 99.99%
accuracy a major quality assessment of the human genome sequence was
published on May 27, 2004 indicating over 92% of sampling exceeded 99.99%
accuracy which is within the intended goal. Further analyses and papers on the
HGP continue to occur.
Hap Map is a Multi-country effort to identify, catalog common human
genetic variants. Developed to better understand and catalogue LD patterns
across the genome in several populations. Genotyped ~4 million SNPs on
samples of African, east Asian, European ancestry. All genotype data in a
publicly available data base. here we can download the genotype data. It is able
to examine LD patterns across genome, Can estimate approximate coverage of
a given SNP chip and Can represent 80-90% of common SNPs with~300,000
tag SNPs for European or Asian samples and~500,000 tag SNPs for African
Thousand genome project
Another spinoff from Human genome project, the 1000 genome project
launched in 2008.A 3 year project covered most of the countries worldwide. It
mainly targeted the African countries. This 1000 genome project showing
researchers how dynamic the human genome really is and why it is so.
Concepts Underlying the Study Design
Single Nucleotide Polymorphisms
The modern unit of genetic variation is the single nucleotide
polymorphism or SNP. SNPs are single base-pair changes in the DNA
sequence that occur with high frequency in the human genome. For the
purposes of genetic studies, SNPs are typically used as markers of a genomic
region, with the large majority of them having a minimal impact on biological
systems. SNPs can have functional consequences, however, causing amino acid
changes, changes to mRNA transcript stability, and changes to transcription
factor binding affinity. SNPs are by far the most abundant form of genetic
variation in the human genome. SNPs are notably a type of common genetic
variation; many SNPs are present in a large proportion of human populations.
SNPs typically have two alleles, meaning within a population there are two
commonly occurring base-pair possibilities for a SNP location. The frequency of
a SNP is giving in terms of the minor allele frequency or the frequency of the
less common allele. For example, a SNP with a minor allele (G) frequency of
0.40 implies that 40% of a population has the G allele versus the more
common allele (the major allele), which is found in 60% of the population.
Linkage disequilibrium (LD) is a property of SNPs on a contiguous stretch
of genomic sequence that describes the degree to which an allele of one SNP is
inherited or correlated with an allele of another SNP within a population. The
term linkage disequilibrium was coined by population geneticists in an attempt
to mathematically describe changes in genetic variation within a population
over time. It is related to the concept of chromosomal linkage, where two
markers on a chromosome remain physically joined on a chromosome through
generations of a family. Recombination events within a family from generation
to generation break apart chromosomal segments. This effect is amplified
through generations, and in a population of fixed size undergoing random
mating, repeated random recombination events will break apart segments of
contiguous chromosome (containing linked alleles) until eventually all alleles in
the population are in linkage equilibrium or are independent. Thus, linkage
between markers on a population scale is referred to as linkage disequilibrium.
The rate of LD decay is dependent on multiple factors, including the
population size, the number of founding chromosomes in the population, and
the number of generations for which the population has existed. As such,
different human sub-populations have different degrees and patterns of LD.
African-descent populations are the most ancestral and have smaller regions of
LD due to the accumulation of more recombination events in that group.
European-descent and Asian descent populations were created by founder
events (a sampling of chromosomes from the African population), which altered
the number of founding chromosomes, the population size, and the
generational age of the population. These populations on average have larger
regions of LD than African-descent groups. Many measures of LD have been
proposed, though all are ultimately related to the difference between the
observed frequency of co-occurrence for two alleles (i.e. a two-marker
haplotype) and the frequency expected if the two markers are independent. The
two commonly used measures of linkage disequilibrium are D’ and r².
LD in animal system
LD in Humans:
LD has been studied extensively in humans (Homo sapiens) - Pritchard
There is tremendous heterogeneity in human LD estimates because of
differences in loci, marker types (microsatellites versus SNPs), sample
populations, and chromosome type (sex chromosomes versus
LD in other animal systems:
LD studies have also been conducted in cattle and Fruit flies
Extensive LD reported in the Dutch black and white dairy cattle
populations Globalization of semen trading.
LD in plant systems
MAIZE: In maize (Zea mays ssp. mays), several studies have been
conducted to investigate. LD over a wide range of population and marker types.
The patterns of LD vary substantially with the population chosen.
Ten million investigated sequence diversity at 21 loci on chromosome 1 in
a diverse group of maize germplasm.
ARABIDOPSIS: The LD pattern in Arabidopsis is a sharp contrast to the
pattern in maize. During last five years most studies described LD in Wheat
and Barley, besides single reports on rice, rye grass, soybean, sugarcane and
The factors, which lead to an increase in LD, include
Inbreeding, Small population size, Genetic isolation between lineages,
Population subdivision, Low recombination rate, Population admixture, Natural
and artificial selection, Balancing selection, etc.
The factors, which lead to a decrease/disruption in LD, include Outcrossing,
High recombination rate, High mutation rate, etc.
Stages of GWAS designs
1) One stage design-First time used in HapMap project where for this cost
involvement was more. It is a process of genotyping all samples on all the
2) Two stage design-It reduces the genotyping requirements and reduces
the false positive rate.
3) Multistage designs-Joint analysis has more power than replication. p-
value in Stage 1 must be liberal. CaTs power calculator. Here signals
from an initial, First-stage GWA are used to define a subset of SNPs that
are retyped in additional second stage samples. Lower cost do not gain
Analysis of GWAS
Most common approach: look at each SNP one-at-a-time. Possibly add in multi-
marker information. Further investigate / report top SNPs only Or backwards
replication…Most commonly trend test.Log additive model, logistic regression
are the foremost methods to analyze GWAS. Adjust for potential population
Basics for GWAS
Calculate the odd ratio
• If 2 events are considered
• odds of A and B is
OR = Odds(A)/Odds(B) = Pr(A)/Pr(~A) / Pr(B)/Pr(~B)
• Odd ratio of many Events comparing the 2 groups
OR = Odds(D|G=1)/Odds(D|G=0) =
= Pr(D|G=1)/Pr(~D|G=1) / Pr(D|G=0)/Pr(~D|G=0)
= Pr(D|G=1)/Pr(D|G=0) x Pr(~D|G=0)/Pr(~D|G=1)
= RR x Pr(~D|G=0)/Pr(~D|G=1).
• Symmetry in odds ratio
OR = Odds(D|G=1)/Odds(D|G=0) = Odds(G|D=1)/Odds(G|D=0).
Testing Significance by using Chi-square test and Rank SNP by P-
value. (Statistical test of association).
Challenges we are going to address while conducting GWAS
Multiple hypothesis testing. In GWAS the number of statistical tests is
commonly is on the order of 10⁶. At significance level of 0.01we would
expect 10,000 false positive. Thus individual p-value <0.01are not
significant anymore. Correction of multiple hypothesis testing is critical.
Population structure-Confounding structure leads to false positive. It
requires favorable conditions like Statistical power and resolution Small
samples, large number of hypothesis, increased power, testing compound
Association Test Single Locus Analysis
When a well-defined phenotype has been selected for a study population,
and genotypes are collected using sound techniques, the statistical
analysis of genetic data can begin. The de facto analysis of genome-wide
association data is a series of single-locus statistic tests, examining each
SNP independently for association to the phenotype. The statistical test
conducted depends on a variety of factors, but first and foremost,
statistical tests are different for quantitative traits versus case/control
studies. Quantitative traits are generally analyzed using generalized
linear model (GLM) approaches, most commonly the Analysis of Variance
(ANOVA), which is similar to linear regression with a categorical predictor
variable, in this case genotype classes. The null hypothesis of an ANOVA
using a single SNP is that there is no difference between the trait means
of any genotype group. The assumptions of GLM and ANOVA are 1) the
trait is normally distributed; 2) the trait variance within each group is
the same (the groups are homoscedastic); 3) the groups are inde-
Dichotomous case/control traits are generally analyzed using either
contingency table methods or logistic regression. Contingency table tests
examine and measure the deviation from independence that is expected
under the null hypothesis that there is no association between the
phenotype and genotype classes. The most ubiquitous form of this test is
the popular chi-square test (and the related Fisher’s exact test). Logistic
regression is an extension of linear regression where the outcome of a
linear model is transformed using a logistic function that predicts the
probability of having case status given a genotype class. Logistic
regression is often the preferred approach because it allows for
adjustment for clinical covariates (and other factors), and can provide
adjusted odds ratios as a measure of effect size. Logistic regression has
been extensively developed, and numerous diagnostic procedures are
available to aid interpretation of the model.
For both quantitative and dichotomous trait analysis (regardless
of the analysis method), there are a variety of ways that genotype data
can be encoded or shaped for association tests. The choice of data
encoding can have implications for the statistical power of a test, as the
degrees of freedom for the test may change depending on the number of
genotype-based groups that are formed. Allelic association tests examine
the association between one allele of the SNP and the phenotype.
Genotypic association tests examine the association between genotypes
(or genotype classes) and the phenotype. The genotypes for a SNP can
also be grouped into genotype classes or models, such as dominant,
recessive, multiplicative, or additive models. Each model makes different
assumptions about the genetic effect in the data assuming two alleles for
a SNP, A and a,a dominant model (for A) assumes that
having one or more copies of the A allele increases risk compared to a
(i.e. Aa or AA genotypes have higher risk). The recessive model (for A)
assumes that two copies of the A allele are required to alter risk, so
individuals with the AA genotype are compared to individuals with Aa
and aa genotypes. The multiplicative model (for A) assumes that if there
is 36risk for having a single A allele, there is a 96risk for having two
copies of the A allele: in this case if the risk for Aa is k, the risk for
AA is k2. The additive model (for A) assumes that there is a uniform,
linear increase in risk for each copy of the A allele, so if the risk is 36for
Aa, there is a 66risk for AA - in this case the risk for Aa is k and the risk
for AA is 2k.A common practice for GWAS is to examine additive models
only, as the additive model has reasonable power to detect both additive
and dominant effects, but it is important to note that an additive model
may be underpowered to detect some recessive effects . Rather than
choosing one model a priori, some studies evaluate multiple genetic
models coupled with an appropriate correction for multiple testing.
In addition to single-locus analyses, genome-wide association studies
provide an enormous opportunity to examine interactions among genetic
variants throughout the genome. Multi-locus analysis, however, is not
nearly as straightforward as conducting single-locus tests, and presents
numerous computational, statistical, and logistical challenges. Because
most GWAS genotype between 500,000 and one million SNPs, examining all
pair-wise combinations of SNPs is a computationally intractable approach,
even for highly efficient algorithms. One approach to this issue is to reduce
or filter the set of genotyped SNPs, eliminating redundant information.
A simple and common way to filter SNPs is to select a set of results
from a single-SNP analysis based on an arbitrary significance threshold and
exhaustively evaluate interactions in that subset. This can be perilous,
however, as selecting SNPs to analyze based on main effects will prevent
certain multi-locus models from being detected – so called ‘‘purely epistatic’’
models with statistically undetectable marginal effects. With these models, a
large component of the heritability is concentrated in the interaction rather
than in the main effects. In other words, a specific combination of markers
incurs a significant change in disease risk. The benefits of this analysis are
that it performs an unbiased analysis for interactions within the selected set
of SNPs. It is also far more computationally and statistically tractable than
analyzing all possible combinations of markers.
In many complex diseases there are numerous genetic variants which
have been identiﬁed. But for many of the recent studies these common
variants only explain a small fraction of the increased risk. Most of those
that have been identiﬁed have no established biological relevance to the
disease and often they are not located inside ’active’ genes . From the last
years of GWAS it is clear that the common variants fail to explain the
majority of the genetic heritability of most human diseases. This suggests
that the hypothesis of ’Common disease, common variant’ is not as valid as
was previously believed. The problem is that the biological reality does not
correspond to the study design and assumption of GWAS, and the solution
is not to increase the sample size even further but to improve the study
design and statistical methods. One possible explanation to the missing
heritability could be some kind of interaction between different
genes(epistasis). These interactions could be hard to detect when analyzing
one SNP at the time, as the marginal effect of a single SNP will be small.
Another explanation is that part of the increased risk can be explained by
many rare variants, which are present among less than 1 % of the
population. This suggests that there could be heterogeneity, where different
genetic proﬁles can cause diseases that are diagnostically the same.
A general deﬁnition of genetic interaction (epistasis) is that the effect (pene-
trance) of one locus varies according to the genotype present at another
locus. To detect interactions we need to deﬁne how a ’natural’ combined
effect of two risk loci would be expressed in the organism. The concept of
gene-gene interactions is not new, but still it is confusing since the term is
used in various ways. Biological interaction or epistasis was deﬁned ﬁrst by
Bateson in 1909 . In that example one of the alleles at one locus G is
preventing the alleles at locus B from being expressed in the organism. This
relation does not necessarily have to be symmetric. This deﬁnition is similar
to the deﬁnition biologists use to examine a biological interaction between
proteins, where proteins interact to regulate several cellular processes. In
statistics the deﬁnition of interaction is usually a deviation from a linear
model. In 1918 Fisher made a statistical deﬁnition of epistasis , as
deviation from additivity in effects of the alleles at different loci on a
This deﬁnition is more similar to the classical statistical deﬁnition of
interaction and do not quite correspond to the biological deﬁnition of
epistasis. These deﬁnitions get troublesome when the trait is binary, in
these cases the mathematical modelling often focus on the penetrances.
Hence the deﬁnitions of epistasis need to be modiﬁed. For binary traits an
example could be that both allele A and allele B at two different loci are
needed to develop the trait. In this case A is epistatic to B, and B is epistatic
to A, hence the epistasis is symmetric in contrast to the deﬁnition by
A classic way to represent lack of epistasis has been the heterogeneity
model, a person gets the trait by possessing (at least) one of the
predisposing genotypes. This deﬁnition actually falls under Bateson’s
deﬁnition of epistasis, for example if a person has both risk variants
(situated at different loci) the effect of allele A will be masked by allele B -
another confusing issue about these genetic interactions. There are two
types of genetic heterogeneity, allelic heterogeneity is when several
mutations on the same allele cause the same disease. Locus heterogeneity
means that mutations in several unrelated loci can cause the same
The above example of locus heterogenetity could be generalized to a
situation without full penetrance, that is 0 <fi,j < 1 for some of the
penetrances. Mathematically, locus heterogeneity can be expressed as
fij = i + $j # i$j
where ↵i and $j are the penetrance factors for the two genetic
variants.Locus heterogeneity is similar to a daisy chain, where it is enough
for one of the components to break (caused by having at least one of the risk
variants) for the entire system to malfunction, i.e. to obtain the disease.
There are two other common two-locus models for binary traits, the
multiplicative model and the additive model. The multiplicative model can
be expressed as
fij = ↵i$j ,this model is often considered as epistatic. Both the additive model
fij = ↵i + $j , and the heterogeneity model are thought of as non-epistatic by
most authors. However, some authors considers epistasis as departure from
the multiplicative model. Further problems appear when considering that
both the multiplicative and the heterogeneity models become additive with
suitable log transformations. It will be difﬁcult to really model the true
epistatic interactions in complex diseases, and discovered epistatic effects
may have limited input to the understanding of the disease. Still, models
that allow for interactions can improve the statistical power of detecting the
genetic risk variants .The main issue in ﬁnding interactions, independent of
how you deﬁne epistasis, is how you should detect it in complex diseases
when analyzing millions of genetic markers. Assume that the disease is
caused by different mutations on different loci in various families, and these
genes have a strong effect in each of the subpopulations. Then the
heterogenetic risk genes will probably show a very weak marginal effect
when the markers are analyzed one at the time. For epistatic interactions it
will be very computationally demanding to examine all possible gene-gene
interactions, in addition to the issue of correcting for testing multiple
hypotheses. One way to handle this is to ﬁrst test for marginal main effects
for each marker in the sample, and hope that the genes involved in
interactions will also show at least a modest marginal effect. Then the
results from this analysis is combined with biological knowledge to suggest
a number of candidates for interaction analysis.
To conduct a meta-analysis properly, the effect of the same allele
across multiple distinct studies must be assessed. This can prove difficult if
different studies use different genotyping platforms (which use different SNP
marker sets). As this is often the case, GWAS datasets can be imputed to
generate results for a common set of SNPs across all studies. Genotype
imputation exploits known LD patterns and haplotype frequencies from the
HapMap or 1000 Genomes project to estimate genotypes for SNPs not
directly genotyped in the study.
The concept is similar in principle to haplotype phasing algorithms,
where the contiguous set of alleles lying on a specific chromosome is
estimated. Genotype imputation methods extend this idea to human
populations. First, a collection of shared haplotypes within the study sample
is computed to estimate haplotype frequencies among the genotyped SNPs.
Phased haplotypes from the study sample are compared to reference
haplotypes from a panel of much more dense SNPs, such as the HapMap
data. The matched reference haplotypes contain genotypes for surrounding
markers that were not genotyped in the study sample. Because the study
sample haplotypes may match multiple reference haplotypes, surrounding
genotypes may be given a score or probability of a match based. On the
haplotype overlap. For example, rather than assign an imputed SNP a single
allele A, the probability of possible alleles is reported (0.85 A,0.12 C,0.03
T)based on haplotype frequencies. This information can be used in the
analysis of imputed data to take into account uncertainty in the genotype
estimation process, typically using Bayesian analysis approaches. Popular
algorithms for genotype imputation include BimBam , IMPUTE , MaCH ,and
Beagle . Much like conducting a meta-analysis, genotype imputation must
be conducted with great care. The reference panel (i.e. the 1000 Genomes
data or the HapMap project) must contain haplotypes drawn from the same
population as the study sample in order to facilitate a proper haplotype
match. If a study was conducted using individuals of Asian descent, but
only European descent populations are represented in the reference panel,
the genotype imputation quality will be poor as there is a lower probability
of a haplotype match. Also, the reference allele for each SNP must be
identical in both the study sample and the reference panel. Finally, the
analysis of imputed genotypes should account for the uncertainty in
genotype state generated by the imputation process.
Statistical methods in GWAS
If a genetic marker is associated to a particular disease, then the
genotype or allele frequencies will be different among affected and healthy
individuals. A commonly used test for searching for associated SNPs in case-
control studies is a Pearson % test applied to a 2-by-2 table of allele counts in
the two groups. For complex traits it is commonly assumed that the
contribution to the genetic effect from each SNP is roughly additive, i.e. the
penetrance for heterozygous are somewhere in between the penetrance for the
two homozygotes. This test is powerful for additive models, whereof the
popularity of this test in these studies. Other common tests include a Pearson
%test comparing the genotype frequencies instead of allele frequencies,
Cochran Armitage test for trend in penetrances, and logistic regression. The
Transmission Disequilibrium Test (TDT) is an association test using data from
families with at least one affected child. This test was introduced by Spielman
et al. , and the test evaluates the transmission of an allele from a heterozygous
parent to the offspring. The TDT is based on the assumption that each of the
two alleles M1 and M2 at a locus is transmitted with equal probability to the
offspring, hence for a sample of heterozygous parents we expect approximately
half of them to transmit the alleleM1. If one of the alleles is transmitted more
often among families where the children have a genetic disease, we suspect
that the allele is associated to the disease. Let b denote the number of
heterozygous parents who transmits alleleM1 to their offspring, and c the
number of heterozygous parents who transmits allele M2. Conditioned on b + c,
b is is binomially distributed, but usually the test statistic has the following
T = (b # c)2 b + c, This test asymptotically follows a % distribution and is
equivalent to a Pearson%2-test.
Generalized Linear Models (GLMs) extend the ordinary regression model to
other response variables than the Normal distributed. GLMs are applicable if
the response variable has a distribution which belongs to the natural
exponential family. One of those distributions is the Binomial distribution, and
with Logistic Regression we model the binomial probability p(x)= P(Y =1|x) as
logp(x) 1 # p(x)= ↵ +Xj$jxj Here xj denotes the value of the jth element in the
predictor x. In the simple Logistic regression with one binary predictor x, $ is
equal to the log odds ratio$ = p(x = 1)/(1 # p(x = 1)) p(x = 0)/(1 # p(x = 0)).
In retrospective (individuals are sampled based on their affection status)
studies the effect parameter $ will be the same as in the prospective (sampling
based on the predictors) design, if we assume that the sampling probability is
independent of x. This is one of the main reasons for using this method in
biomedical studies. Another advantage with the logistic regression is that it is
easy to include several predictor in the analysis and make inference for
interactions between genes and environment, as well as gene-gene interactions.
Schaid described a univariate method for case-parent data, modelling genotype
relative risks with conditional logistic regression using three pseudo controls
based on the parents’ untransmitted alleles. This method can be generalized to
two loci. For case-control data logistic regression can be used to analyse
interactions by comparing the saturated model to an additive model,
speciﬁed on the form of. The additive logistic model is roughly equivalent to the
heterogeneity mode if the relative risk (RR) or odds ratio (OR) is of moderate
size. However, North et. al show examples of heterogeneity models which are
marginally recessive (marginal RR⇡ 150), in this case the logistic regression
yields non-zero interaction estimates. Hence, to really examine deviations from
the heterogeneity model (and not the multiplicative or logistic model) more
advanced methods need to be applied.
GWAS studied in Various crops:
Genome-Wide Association Mapping in Arabidopsis Identifies Previously
Known Flowering Time and Pathogen Resistance Genes.
A very large number of spurious genotype–phenotype correlations are
found, especially for traits that vary geographically. For example, plants from
northern latitudes flower later; however, in addition to sharing genetic variants
that make them flower late, they also tend to share variants across the genome,
making it difficult to determine which genes are responsible for flowering. This
notwithstanding, several previously known genes were successfully identified in
this study, and the researchers are optimistic about the prospects for
association mapping in this species.
They checked flowering time and pathogen resistance in a sample of 95
accessions for which genome wide polymorphism data were available. In spite
of an extremely high rate of false positives due to population structure, we were
able to identify known major genes for all phenotypes tested, thus
demonstrating the potential of genome-wide association mapping in A. thaliana
and other species with similar patterns of variation. The rate of false positives
differed strongly between traits, with more clinal traits showing the highest
rate. However, the false positive rates were always substantial regardless of the
trait, highlighting the necessity of an appropriate genomic control in
The columns on the left give the genotype and associated phenotype for four
loci, for each of the 95 accessions. The four loci are the flowering time locus
FRI (þ, wild-type; 1, Ler null allele; 2, Col null allele, for which the associated
phenotype is flowering time in long-day conditions without vernalization (late
flowering is indicated by height and color of bar), and the three pathogen
resistance loci Rps5, Rpm1, and Rps2 (þ, wild-type;, null allele, for which the
associated phenotypes are hypersensitive response to the appropriate bacterial
avr gene (red indicates resistance, black indicates susceptibility, and missing
data are indicated by missing bar). The tree on the right illustrates the genetic
relationships between the accessions.
GWAPP: A Web Application for Genome-Wide Association Mapping in
GWAPP, an interactive Web-based application for conducting GWAS in A.
thaliana. Using an efﬁcient implementation of a linear mixed model, traits
measured for a subset of 1386 publicly available ecotypes can be uploaded and
mapped with a mixed model and other methods in just a couple of minutes.
GWAPP features an extensive, interactive, and user-friendly interface that
includes interactive Manhattan plots and linkage disequilibrium plots. It also
facilitates exploratory data analysis by implementing features such as the
inclusion of candidate polymorphisms in the model as cofactors.
Fig 1(A) The ﬁlter box allows the user to exclude speciﬁc accessions as well
as change the name and the description of the data set.(B) The data set list
displays information for each accession in the data set. In edit mode, the user
can use the checkbox to add and remove accessions from the data set.(C) A
Google map shows the locations of all accessions in the data set. Clicking on
one marker will show a pop-up with information about the name and ID of the
selected accession.(D) The geographic distribution map (GeoMap) shows the
geographic distribution of the accessions in the data set. Moving the mouse
over a country will show the number of accessions located in that region.
Fig2.The result view displays GWAS plots for each of the ﬁve chromosomes.
Each GWAS plot itself consists of three panels. The top panel (A) contains a
scatterplot. The positions on the chromosome are on the x axis and the score
on the y axis. The dots in the scatterplot represent SNPs (E).A horizontal
dashed line (H) shows the 5% FDR threshold. At the top of the GWAS results
view, a search box for genes is displayed (D). These genes will be displayed as a
colored band (red in the ﬁgure). The second panel (B) shows the gene
annotation and is only shown for a speciﬁc zoom range (<1.5Mb). It will display
genes, gene features, and gene names. Moving the mouse over a gene will
display additional information in a pop-up (F), and clicking on a gene will open
the TAIR page for the speciﬁc gene. Panel (C) displays various chromosome-
wide statistics. The region highlighted by a yellow band (I) is shown in the
scatterplot and in the gene annotation. The gear icon opens a pop-up (G) with
the available statistics the user can choose from.
Genome-wide association mapping reveals a rich genetic architecture
of complex traits in Oryza sativa.o
Here the earlier approaches revealed that the Biparental and QTL approaches
are not scalable to investigate the genetic potential and tremendous phenotypic
variation of more than 12000 accessions available in public germplasm
Here they took global collection of 413 diverse(sativa) varieties from 82
contries using high quality custom designed 441000 oligonucleotides
phenotyping array.For these accesions they phenotyped 34 morphological,
developmental and agronomic traits over 2 consicutive field seasons.
This mapping stretegy evaluated variation both within and among 4 of the
major subgroups of rice, revealing significant heterogenity of genetic
architechture among groups as well as gene by environmental effect.
Fig. Phenotypic distribution and genome-wide association scan for plant
height. ( a ) Quantile – Quantile plots for both na ï ve and mixed model for
plant height in all samples. ( b ) Boxplot showing the differences in plant
height among subpopulations. Box edges represent the upper and lower
quantile with median value shown as bold line in the middle of the box.
Whiskers represent 1.5 times the quantile of the data. Individuals falling
outside the range of the whiskers shown as open dots. ( c ) Histogram of plant
height in all samples. Dashed black line represents the null distribution. ( d )
Genome-wide P -values from the mixed model and na ï ve method. x axis
shows the SNPs along each chromosome; y axis is the− log 10 (P -value) for
the association. Dots in ( a ) and ( c ) indicate SNPs with P -values <1 × 10
−4 in the mixed model and the top 50 SNPs in the naïve method; SNPs within
200 kb range of known genes are in red; other signiﬁcant SNPs are in blue.
Candidate gene locations shown as red vertical dashed lines with names on
A Genome-Wide Association Study Identifies Genomic Regions for
Virulence in the Non-Model Organism Heterobasidion annosum s.s
The dense single nucleotide polymorphisms (SNP) panels needed for
genome wide association (GWA) studies have hitherto been expensive to
establish and use on non-model organisms. To overcome this, we used a next
generation sequencing approach to both establish SNPs and to determine
genotypes. We conducted a GWA study on a fungal species, analyzing the
virulence of Heterobasidion annosum s.s., a necrotrophic pathogen, on its
hosts Picea abies and Pinus sylvestris. From a set of 33,018 single nucleotide
polymorphisms (SNP) in 23 haploid isolates, twelve SNP markers distributed on
seven contigs were associated with virulence (P,0.0001). Four of the contigs
harbour known virulence genes from other fungal pathogens and the remaining
three harbour novel candidate genes. Two contigs link closely to virulence
regions recognized previously by QTL mapping in the congeneric hybrid H.
irregulare6H. occidentale. The study demonstrates the efficiency of GWA
studies for dissecting important complex traits of small populations of non-
model haploid organisms with small genomes.
Genome-wide association study (GWAS) of resistance to head smut
Head smut, caused by the fungus Sphacelotheca reiliana (Kühn)
Clint, is a devastating global disease in maize, leading to severe quality
and yield loss each year. The present study is the ﬁrst to conduct a
genome-wide association study (GWAS) of head smut resistance using
the Illumina MaizeSNP50 array. Out of 45,868 single nucleotide
polymorphisms in a panel of 144 inbred lines, 18 novel candidate genes
were associated with head smut resistance in maize. These candidate
genes were classiﬁed into three groups, namely, resistance genes, disease
response genes, and other genes with possible plant disease resistance
functions. The data suggested a complicated molecular mechanism of
maize resistance against S. reiliana. This study also suggested that
GWAS is a useful approach for identifying causal genetic factors for head
smut resistance in maize.
Fig. Manhattan plots of a mixed linear model (MLM) for resistance to
head smut. Plots above the blue horizontal dashed line show the
genome-wide signiﬁcance with a moderately stringent threshold of −log
(1/45,868). Plots above the red horizontal dashed line show the genome-
wide signiﬁcance with stringent threshold of −log(0.05/45,868). The
different colors indicate plots for different chromosomes, which follow the
order: chromosome 1–chromosome 10. The plots with the −log10 (P)
Value above 8 were not shown
Advantages of GWAS
1) Biological pathway of the trait does not have to be known.
2) Potential to discover novel candidate genes, not identified through other
3) Encourage the formation of collaborative consortia to recruit sufficient
number of participants for analysis, which tend to continue their
collaboration with subsequent analysis.
4) Rules act at specific genetic association.
5) Provides data on ancestry of each subject, which assists in matching
case subject with control subject.
6) Provides data on 2 types of structural variants-sequence and copy
number variations- which provides more robust data.
7) It is large enough to identify mutations explaining a few percent of
1) Results need replication in independent samples in different population.
2) A large study of population is required.
3) GWAS detect association not causation.
4) Identifying specific location not complete gene. Many variants identified
are nowhere near a protein coding gene or are within genes that were not
previously believed to associate with a trait or condition.
5) Falls on common variants.
6) Detect any variant that are common(>5%) in a population.
7) Typically for any particular trait, the cumulative effect of multiple SNPs
only explain a small function of an individual risk of a train.
Still why GWAS is popular?
The dropping genotyping costs are likely to drive association studies
away from candidate genes. It involves whole genome resequencing of all
the individuals in a population, will allow an assessment of point
mutation, insertions deletions and large structure variation such as copy
number variation Eg. Resequencing of Arabidopsis lyrata. In future this
will help in RNA-seq data to include in e-QTL mapping in GWAS studies.
Population choice for GWAS studies will no longer restricted to
model organisms will slowly become more focused on the spp which are
more relevant in answering biological questions. The accuracy of GWAS
depends on 1 time genotyping and repeated phenotyping in different
Output of GWAS
• To ensure greatest utility of GWAS result in the future,all phenotype and
genotype data should to be made public and be deposited in public
• As such file format and minimum information standards should to be
established, such as those available for sequence data or microarray
experiments. Priority to storage and dissemination of phenotypic and
Despite the caveats outlined above, it seems that genome-wide
association studies of the role of common variants in complex disease will be
carried out in the near future. Initial studies will define more accurately the
principal factors, which have been summarized above, that can reduce the
power of such studies. In these studies, large sample sizes should be used,
biases taken into account, multiple-testing issues addressed and replication
studies carried out, therefore optimizing experimental design, statistical power
and cost efficiency. Close evaluation of the yields of true susceptibility loci in
relation to the cost of such rigorously designed studies will determine whether
the genome-wide analyses of common SNPs is a worthwhile approach in the
continuing dissection of the genetic basis of common disease.
Summary and conclusions
The past year has seen a remarkable shift in our capacity to dissect the
genetic basis of common diseases and continuous traits of biomedical
significance. The GWA approach has proven itself extremely well-suited to the
identification of common SNP-based variants with modest to large effects on
phenotype. Careful implementation and appropriate interpretation has resulted
in discoveries that have proven more robust than many had anticipated.
Growing numbers of novel susceptibility loci have been identified, shedding
light on the fundamental mechanisms that influence disease predisposition,
and much is being learned about the complex relationships between changes in
genome sequence and phenotypic variation.
However, we are far from the end of this particular voyage, and recent
discoveries are nothing more than initial forays into the terra incognita of our
genomes. We remain unable to explain more than a small proportion of
observed familial clustering for most multifactorial traits, a fact that
emphasizes the need to extend analysis to a more complete range of potential
susceptibility variants, and to support more explicit modelling of the joint
effects of genes and environment. Many of the greatest challenges to be
faced in the years ahead lie not so much in the identification of the association
signals themselves, but in defining the molecular mechanisms through which
they influence disease risk and/or phenotypic expression.
Altshuler DM, Gibbs RA, Peltonen L, Altshuler DM, Gibbs RA, et al. (2010)
Integrating common and rare genetic variation in diverse human
populations. Nature 467: 52–58. doi: 10.1038/ nature09298.
Atwell S, Huang YS, Vilhjalmsson BJ, Willems G, Horton M, et al.
(2010)Genome-wide association study of 107 phenotypes in Arabidopsis
thaliana inbred lines. Nature 465: 627–631.
Connelly CF, Akey JM (2012) On the prospects of whole-genome association
mapping in Saccharomyces cerevisiae. Genetics.
Cooper GM, Johnson JA, Langaee TY, Feng H, Stanaway IB, et al. (2008) A
genome-wide scan for common genetic variants with a large influence on
warfarin maintenance dose. Blood 112: 1022–1027. doi: 10.1182/blood-
Cumagun CJR, Bowden RL, Jurgenson JE, Leslie JF, Miedaner T (2004)
Genetic mapping of pathogenicity and aggressiveness of Gibberella zeae
(Fusarium graminearum) toward wheat. Phytopathology 94: 520–526.
Edwards AO, Ritter R, III, Abel KJ, Manning A, Panhuysen C, et al. (2005)
Complement factor H polymorphism and age-related macular degener-
ation. Science 308: 421–424. doi: 10.1126/ science.1110189.
Ellison CE, Hall C, Kowbel D, Welch J, Brem RB, et al. (2011) Population
genomics and local adaptation in wild isolates of a model microbial
Freedman, M. et al. Assessing the impact of population stratiﬁcation on
genetic association studies. Nat. Genet. 36, 388–393 (2004).
Freimer, N. & Sabatti, C. The use of pedigree, sib-pair and association studies
of common diseases for genetic mapping and epidemiology. Nature
Genet. 36, 1045–1051 (2004).A clear and unbiased review of the main
current genetic mapping strategies that discusses analyses using
extended pedigrees, affected sib-pairs and association.
Genomes Project Consortium (2010) A map of human genome variation from
population-scale sequencing. Nature 467: 1061–1073. Doi:10.1038
Griffith OL, Montgomery SB, Bernier B, Chu B, Kasaian K, et al. (2008)
ORegAnno: an open- access community-driven resource for regulatory
annotation. Nucleic Acids Res 36: D107-D113. doi: 10.1093/nar/gkm967
Haines JL, Hauser MA, Schmidt S, Scott WK, Olson LM, et al. (2005)
Complement factor H variant increases the risk of age-related macular
degeneration. Science 308: 419–421. doi: 10.1126/science.1110359.
Hall D, Tegstrom C, Ingvarsson PK (2010) Using association mapping to
dissect the genetic basis of complex traits in plants. Brief Funct Genomics
Hawthorne B, Rees-George J, Bowen J, Ball R (1997) A single locus with a
large effect on virulence in Nectria haematococca MPI. Fungal Genet
Newsl 44: 24–26.
Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, et al.
(2009)Potential etiologic and functional implications of genome-wide
association loci for human diseases and traits. Proc Natl Acad Sci U S A
IRGSP .he map-based sequence of the rice genome. Nature 436 , 793– 800(
Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, et al. (2005) Complement
factor H polymorphism in age-related macular degeneration. Science 308:
385–389. doi: 10.1126/science.1109557.
Lander, E.S. & Schork, N.J.Genetic dissection of complex traits. Science
Li, Y., Huang, Y.S., Bergelson, J., Nordborg, M., and Borevitz,
J.O.(2010).Association mapping of local climate-sensitive quantitative
trait loci in Arabidopsis thaliana. Proc.Natl.Acad.Sci.USA 107: 21199–
Lind M, Dalman K, Stenlid J, Karlsson B, Olson A (2007) Identification of
quantitative trait loci affecting virulence in the basidiomycete
Heterobasidion annosum s.l. Curr Genet 52: 35–44.
Lind M, van der Nest M, Olson A ˚ , Brandstro ¨m-Durling M, Stenlid J (2012)
A 2nd generation linkage map of Heterobasidion annosum s.l. based on in
silico anchoring of AFLP markers. PLoS One 7: e48347.
Liti G, Carter DM, Moses AM, Warringer J, Parts L, et al. (2009) Population
genomics of domestic and wild yeasts. Nature 458: 337–341.
Lohmueller, K. et al. Meta-analysis of genetic association studies supports a
contribution of common variants to susceptibility to common disease.
Nat. Genet. 33,177–182 (2003).
Muller LAH, Lucas JE, Georgianna DR, McCusker JH (2011) Genome-wide
association analysis of clinical vs. nonclinical origin provides insights
into Saccharomyces cerevisiae pathogenesis. Mol Ecol 20: 4085–4097.
Neafsey DE, Barker BM, Sharpton TJ, Stajich JE, Park DJ, et al.
(2010)Population genomic sequencing of Coccidioides fungi reveals
recent hybridization and transposon control. Genome Res 20: 938–946.
Olson A, Stenlid J (2001) Plant pathogens - Mitochondrial control of
fungalhybrid virulence. Nature 411: 438–438.
Ozoki, K., 2001, A high throughput SNP typing system for GWAS, Springer.
Pandelova I, Ciuffetti LM (2005) A proteomics-based approach for identification
of the Tox D gene. Fungal Genet Newsl 52.
Price,A. L. et al. Principal components analysis corrects for stratification in
genome-wide association studies .Nat. Genet.38, 906 –909 (2006).
Santoyo F, Gonzalez AE, Terron MC, Ramirez L, Pisabarro AG
(2008)Quantitative linkage mapping of lignin-degrading enzymatic
activities in Pleurotus ostreatus. Enzyme Microb Technol 43: 137–143.
Tohn, P, A., 2009, Validating and refining GWAS signals, Nature. 10:318-329.
Umit Seren. And Bjarni., 2012, GWAPP:A web application for genome wide
association mapping in Arabidopsis, The plant cell J. 24:4793-4805.
Yamamoto , T., Yonemaru J.&Yano,M .Towards the understanding of
complex traits in rice: substantially or superficially? DNA Res.16, 141–
154 ( 2009).
Yan J, Shah T, Warburton ML, Buckler ES, McMullen MD, et al. (2009)
Genetic characterization and linkage disequilibrium estimation of a
global maize collection using SNP markers. PLoS One 4.
Zhang H, Zhao Q, Liu K, Zhang Z, Wang Y, et al. (2009) MgCRZ1,a
transcription factor of Magnaporthe grisea, controls growth, development
and is involved in full virulence. FEMS Microbial Lett 293: 160–169.