7. differentially to adaptive changes. In D. melanogaster, the X
chromosome contains greater repetitive content (Mackay
et al. 2012), displays different gene density (Adams et al.
2000), has potentially smaller population sizes (Wright 1931;
Andolfatto 2001), lower levels of background selection
(Charlesworth 2012), and an excess of genes involved in
female-specific expression (Ranz et al. 2003) in comparison
to the autosomes. Furthermore, the X chromosome is hemi-
zygous in males, exposing recessive mutations to the full
effects of selection more often than comparable loci on the
autosomes (Charlesworth et al. 1987). Hence, the incidence of
duplications on the X and the types of genes affected may
each species (as a control for genome quality and false pos-
itives) (Drosophila Twelve Genomes Consortium 2007; Hu
et al. 2013). Genomes are sequenced to high coverage of
50–150× for a total of 42 complete genomes (supplementary
tables S1–S5, Supplementary Material online, see Materials
and Methods). We have used mapping orientation of
paired-end reads to identify recently derived, segregating du-
plications in these samples <25kb in length that are sup-
ported by three or more divergently oriented read pairs (see
Materials and Methods, supplementary text S1, tables S6 and
S7, Supplementary Material online). We limit analysis to re-
gions of the genome, which can be assayed with coverage
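The detection rule described above (duplications under 25 kb supported by three or more divergently oriented read pairs) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `ReadPair` record, the `call_tandem_duplications` function, and the clustering tolerance `max_gap` are hypothetical, and real input would come from an alignment file rather than hand-built records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPair:
    """Hypothetical record for one mapped read pair."""
    chrom: str
    left_pos: int        # mapped start of the upstream read
    left_reverse: bool   # True if the upstream read maps to the reverse strand
    right_pos: int       # mapped start of the downstream read
    right_reverse: bool  # True if the downstream read maps to the reverse strand


def is_divergent(pair):
    # A normal pair points inward (upstream forward, downstream reverse).
    # A divergent pair points outward (upstream reverse, downstream forward).
    return pair.left_reverse and not pair.right_reverse


def call_tandem_duplications(pairs, min_support=3, max_span=25_000, max_gap=500):
    """Cluster divergent pairs and keep clusters with enough support.

    max_gap is an assumed tolerance (bp) for grouping pairs that share
    approximately the same two breakpoints.
    """
    candidates = sorted(
        (p for p in pairs
         if is_divergent(p) and 0 < p.right_pos - p.left_pos < max_span),
        key=lambda p: (p.chrom, p.left_pos),
    )
    clusters = []
    for p in candidates:
        last = clusters[-1] if clusters else None
        if (last
                and last[-1].chrom == p.chrom
                and p.left_pos - last[-1].left_pos <= max_gap
                and abs(p.right_pos - last[-1].right_pos) <= max_gap):
            last.append(p)
        else:
            clusters.append([p])
    # Report (chromosome, start, end, number of supporting pairs).
    return [
        (c[0].chrom, min(p.left_pos for p in c), max(p.right_pos for p in c), len(c))
        for c in clusters
        if len(c) >= min_support
    ]
```

Comparing each pair only with the last member of the current cluster keeps the sketch short; a production caller would also need to handle mapping quality and repetitive regions.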
FIG. 1. Tandem duplications in 20 sample strains of Drosophila yakuba. Regions spanned by divergently oriented reads are shown with sample strains
plotted on different rows, whereas axes list genomic location in Mbp. Duplications are more common around the centromeres, especially on
chromosome 2. Frequencies are shaded in grayscale according to frequency, with high-frequency variants shown in solid black. The D. simulans X
chromosome appears to have an excess of high-frequency variants in comparison to the D. simulans autosomes and the D. yakuba X chromosome.
Tandem Duplications in Nonmodel Drosophila doi:10.1093/molbev/msu124 MBE
Rogers, R. L. et al. doi:10.1093/molbev/msu124
               D. yakuba   D. simulans
Whole gene     248         296
Partial gene   745         462
Intergenic     745         577
8. mapping patterns indicative of a modified duplication surrounding jingwei in Drosophila yakuba line NY66-2. Duplications
ly oriented paired-end reads (blue) as well as with split read mapping of long molecule sequencing (purple). Deletions in
apped read mapping of long molecule reads (red) as well as multiple long-spanning read pairs at the tail of mapping distan
encing (green) just upstream from jgw. Up to 20% of duplicates observed have long-spanning read pairs indicative of putativ
Rogers, R. L. et al. doi:10.1093/molbev/msu124
9. yakuba and 76 in D. simulans where both breakpoints fall
derived from parental genes in parallel orientation
a result of 10.4% of tandem duplications that capture g
D. yakuba and 9.5% of tandem duplications that
coding sequences in D. simulans. These numbers are
eral agreement with rates of chimeric genes formati
mated from a within-genome study of D. melanog
16.0% compared with the rate of formation of du
genes (Rogers et al. 2009).
FIG. 8. Abnormal gene structures. Duplicated sequence is highlighted
with bold colors and is framed by the dashed box. (A) The partial du-
plication of a coding sequence (blue) results in the recruitment of pre-
viously upstream noncoding sequence (dashed lines) to create a novel
open reading frame (blue and turquoise). (B) Tandem duplication where
both boundaries fall within coding sequences results in a chimeric gene.
FIG. 9. Dual promoter genes. Duplicated sequence is highligh
bold colors and is framed by the dashed box. Tandem duplicatio
both boundaries fall within coding sequences results in a chim
which contains two promoters, one which facilitates transcr
one direction, the other facilitating transcription from the
strand. The chimera is capable of making partial antisense tra
Rogers, R. L. et al. doi:10.1093/molbev/msu124
                           D. yakuba   D. simulans
Chimeric gene structures   78          38
Recruited ncDNA            143         96
10. [Figure panels, y-axis ticks 0.0 to 0.8: (A) SFS for Duplications in D. yakuba; (B) SFS for Duplications in D. simulans; (C) SFS for X-linked muta…]
Figure 1: SFS for tandem duplications in D. yakuba and D. simulans, co
ascertainment bias. A. Site frequency spectra on the autosomes (black) and on t
in D. yakuba. B. SFS on the autosomes (black) and on the X (grey) in D. si
SFS for X-linked intronic SNPs (black) and duplicates (white). The excess of hig
variants on the X in D. simulans suggests widespread selection for tandem duplic
D. simulans X.
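As a reminder of what the plotted spectra summarize, a site frequency spectrum can be tabulated from the number of sample strains carrying each variant. The sketch below is a minimal illustration with hypothetical names (`site_frequency_spectrum`, `variant_counts`); it does not include the correction for ascertainment bias mentioned in the caption.

```python
from collections import Counter


def site_frequency_spectrum(variant_counts, n_strains):
    """Return the proportion of variants found in exactly k strains, k = 1..n_strains.

    variant_counts holds one integer per variant: the number of sampled
    strains in which that variant was observed.
    """
    if not variant_counts:
        return [0.0] * n_strains
    tally = Counter(variant_counts)
    total = len(variant_counts)
    return [tally.get(k, 0) / total for k in range(1, n_strains + 1)]
```

An excess of high-frequency variants, as described for the D. simulans X, would appear as extra weight in the last entries of the returned list.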
Figure 5: A) Gene ontology classes overrepresented by species among singly duplicated
genes or among multiply duplicated genes. B) Number of genes duplicated by species. Most
Rogers, R. L. et al. Submitted (1)
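11. [Figure: phylogeny of D. yakuba, D. simulans, and D. melanogaster (12 MY), annotated with the values below, listed in the order extracted]
μwholegene: 1.17 × 10^−9, 6.03 × 10^−10
μchim, μrecruit: 3.46 × 10^−10, 3.70 × 10^−10, 2.42 × 10^−10, 8.52 × 10^−11
Ne: 1.21 × 10^6, 5.93 × 10^5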
Figure 6: Genomewide population mutation rates for all duplic
sizes (Ne), and per gene mutation rates (µ) for gene structures pr
duplication, recruitment of non-coding sequence, and chimeric ge
mutation rates and mutation limited evolution leads to low levels
Rogers, R. L. et al. Submitted (1)
Schrider, D. R. et al. doi:10.1534/genetics.115.174912
12. Table 1: Activated genes
Chimeras
Tissue           Upregulated   Total
Female Carcass   5             76
Female Ovary     11            76
Male Carcass     10            76
Male Testes      7             76
All              24            76
Whole Gene
Tissue           Upregulated   Total
Female Carcass   3             66
Female Ovary     2             66
Male Carcass     1             66
Male Testes      0             66
All              5             66
Whole Gene and 100 bp Intergenic
Tissue           Upregulated   Total
Female Carcass   3             58
Female Ovary     2             58
Male Carcass     1             58
Male Testes      0             58
All              5             58
Rogers, R. L. et al. doi:10.1534/g3.114.013532
Rogers, R. L. et al. Submitted (2)
13. [Figure labels: GE18451, GE18452, GE18453, GE18452’, Chimera]
Figure 2: Chimeric gene structures result in novel expression patterns. A tandem duplication
that does not respect gene boundaries unites the 5′ end of GE18453 with the 3′ end of
GE18451 to produce a chimeric gene on chromosome 2L. Plot shows quantile normalized
coverage in RNA seq data for sample (red) and reference (grey) with HMM output (blue)
Rogers, R. L. et al. Submitted (2)
14. Hu, X., & Worton, R. G., doi:10.1002/humu.1380010103
TABLE 1. A Summary on Reported Cases of Partial Gene Duplication Associated With Human Diseases
Columns: Genes; Number of independent duplications; Exon(s) duplicated^a; Translational reading frame^b; Disorders^c; References
HPRT 1
LDL receptor 3
Dystrophin 10
1
13
1
2
α-Galactosidase A 1
Factor VIII 1
LPL 1
2.3
2-8
9-12
13-15
8.9
3-11
38-43
50-52
3, 4
45-51
20-41
3,4
2-7
22-27
ND
ND
13-42
5-11
17
13
2-6
6 (partial)
In-frame
In-frame
Shift
Shift
Shift
Shift
Shift
Shift
In-frame
In-frame
Shift
In-frame
In-frame
ND
ND
ND
In-frame
shift
shift
ND
Lesch-Nyhan syndrome
Familial hypercholesterolemia
DMD
DMD
DMD
DMD
DMD
DMD
Intermediate
Intermediate
BMD
BMD
ND
DMD/BMD
BMD
DMD
DMD
Fabry disease
Yang et al., 1984, 1988
Lehrman et al., 1987a
Top et al., 1990
Lelli et al., 1991
Hu et al., 1988, 1990
Greenberg et al., 1988
Den Dunnen et al., 1989
Angelini et al., 1990
Roberts et al., 1991
Bernstein et al., 1989
In-frame Hemophilia A Casula et al., 1990
ND Lipoprotein lipase deficiency Devlin et al., 1990
Type II collagen 1 bbp In-frame Spondyloepiphyseal dysplasia Tiller et al., 1990
C1 inhibitor 1 4 In-frame Hereditary angioedema Stoppa-Lyonnet et al., 1990
β-Galactosidase 1 165 bp In-frame GM1-gangliosidosis Yoshida et al., 1991
^a ND, not defined.
^b In the majority of cases, the reading frame status of the mRNA was not actually determined but was predicted based on the assumption
that the exons contained in the duplicated segment were spliced correctly to the exons flanking the duplicated segment. ND, not
determined.
^c DMD, Duchenne muscular dystrophy. BMD, Becker muscular dystrophy. Intermediate, intermediate phenotype of the muscular
dystrophy.
(within exon 48)
the original copy in a head-to-tail direct orienta- tandem duplication of a limited number of nucle-
18. most extreme example of this was the gene nessy, where the
average expression level of the line with the TE was 40
standard deviations lower than in TE-free lines. This is likely
an effectively null mutation at this gene segregating in nature.
Comparing the average standard deviation in mean
expression per DGRP line for TE-associated transcripts to
transcripts with no TEs in or within 10 kb indicates that
TE-associated transcripts have a larger mean standard deviation,
0.33 vs. 0.28 (t-test; P = 1.0e-22). Since the DGRP
line with the TE insertion is not used in calculating the
standard deviation, this suggests that transcripts with TEs
may be those that are more tolerant of variation in expres-
incorporating TE information into work on expression and
phenotypic analyses. Given that our analysis has focused on
insertions of large effect, this is a conservative estimate of
the number of TEs that may contribute to expression differ-
Table 3 Mean z-scores for transcripts with TEs
Category                          Mean z-score   N
Within exon                       −3.44          249
Introns ≤400 bp                   −1.03          72
Within 200 bp of acceptor site    −0.90          64
Within 200 bp of donor site       −0.67          64
Within first intron               −0.37          545
Not within first intron           −0.11          852
≤500 bp of TSS                    −0.43          186
501 bp to 2 kb of TSS             −0.01          418
>2 kb of TSS                      −0.05          2121
≤500 bp of TES                    −0.52          213
501 bp to 2 kb of TES             −0.04          347
>2 kb of TES                      −0.02          1976
Mean z-scores are calculated from the transcript/TE pairs for all transcripts with an
insertion in each location category.
Figure 2 Transposable elements as a class of variation. Probability–probability
plot of observed and expected P-values from t-tests of all cases where four or
more lines show an independent TE insertion in the same location category
for the same transcript.
Cridland, J. M., et al. doi:10.1534/genetics.114.170837
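The z-scores in Table 3 express how far the expression of a line carrying a TE insertion lies from the lines lacking it, with the TE-bearing line left out when the standard deviation is computed. A minimal sketch of that calculation is given below; the function name, the use of the sample standard deviation, and the example values are assumptions for illustration, not details taken from the paper.

```python
from statistics import mean, stdev


def te_z_score(te_line_expression, te_free_expression):
    """z-score of the TE-bearing line relative to TE-free lines.

    The TE-bearing line is excluded from both the mean and the standard
    deviation, so an extreme value cannot inflate its own denominator.
    Assumes at least two TE-free lines with non-identical expression.
    """
    mu = mean(te_free_expression)
    sd = stdev(te_free_expression)  # sample standard deviation (assumed)
    return (te_line_expression - mu) / sd


# Hypothetical expression values for one transcript.
z = te_z_score(6.0, [9.0, 10.0, 11.0])
```

A strongly negative value, as reported for nessy, indicates that the insertion is associated with much lower expression than in lines without it.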
19. Summary, part 1
• Structural variants are typically rare
• Duplications have non-additive effects on gene
expression
• Low chance of precise convergence at molecular level
• TEs are strong candidates for “RALE” in flies
• All of these variants are poorly-tagged in current-
generation association studies
21. [Figure: effect size (low, modest, intermediate, high; 1.1, 1.5, 3.0, 50.0) against allele frequency (very rare, rare, low frequency, common; boundaries at 0.001, 0.005, 0.05). Labeled regions:]
• Rare alleles causing Mendelian disease
• Few examples of high-effect common variants influencing common disease
• Common variants implicated in common disease by GWA
• Rare variants of small effect, very hard to identify by genetic means
• Low-frequency variants with intermediate effect
Figure 1 | Feasibility of identifying genetic variants by risk allele frequency
and strength of genetic effect (odds ratio). Most emphasis and interest lies
in identifying associations with characteristics shown within diagonal dotted
lines. Adapted from ref. 42.
Manolio, T. A., et al. doi:10.1038/nature08494
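25. [Figure: relative density (y-axis, 0.0 to 1.0) against value (x-axis, −4 to 4) for the fitness function (variance $\sigma_S^2$), Gaussian noise (variance $\sigma_e^2$), and effect sizes (mean = $\lambda$)]
$$V_G = 4\mu_d\sigma_S^2, \qquad H^2 = \frac{V_G}{V_G + \sigma_e^2}$$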
Thornton, K. R., et al. doi:10.1371/journal.pgen.1003258
Turelli, M. (1984). Theoretical Population Biology, 25(2), 138–193.
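The two expressions on this slide can be combined into a short worked calculation. The sketch below reads the slide's formula as V_G = 4 × μ_d × σ_S² (with μ_d a mutation rate to causative variants, which is an interpretation of the slide's notation) and H² = V_G / (V_G + σ_e²). The function names and the parameter values are hypothetical.

```python
def expected_genetic_variance(mu_d, sigma_s_sq):
    """Approximate genetic variance, read from the slide as 4 * mu_d * sigma_S^2."""
    return 4.0 * mu_d * sigma_s_sq


def broad_sense_heritability(v_g, sigma_e_sq):
    """H^2 = V_G / (V_G + sigma_e^2)."""
    return v_g / (v_g + sigma_e_sq)


# Hypothetical parameter values, chosen only to make the arithmetic easy to follow.
v_g = expected_genetic_variance(mu_d=0.00125, sigma_s_sq=1.0)  # 0.005
h2 = broad_sense_heritability(v_g, sigma_e_sq=0.075)           # 0.005 / 0.080
```

With these illustrative inputs the heritability is about 0.06, showing how a small genetic variance relative to the environmental noise gives a low H².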
26. $$H_1 = \sum_i e_i, \qquad H_2 = \sum_j e_j$$
$$G = \sqrt{H_1 \times H_2}$$
$$P = G + N(0, \sigma_e)$$
Thornton, K. R., et al. doi:10.1371/journal.pgen.1003258
$$V_G \approx 2\mu\sigma_s^2$$
Slatkin, M. (1987). Genetical Research, 50(1), 53–62.
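The model on this slide can be read as follows: each haplotype's effect is the sum of the effect sizes of the causative mutations it carries, the genetic value is the geometric mean of the two haplotype effects, and the phenotype adds Gaussian noise. The sketch below is a minimal illustration of that reading. The function names are hypothetical, effect sizes are assumed to be non-negative, and `sigma_e` is treated as a standard deviation.

```python
import math
import random


def genetic_value(maternal_effects, paternal_effects):
    """G = sqrt(H1 * H2), where each H is the sum of effects on one haplotype."""
    h1 = sum(maternal_effects)
    h2 = sum(paternal_effects)
    return math.sqrt(h1 * h2)


def phenotype(maternal_effects, paternal_effects, sigma_e, rng=random):
    """P = G + Gaussian noise with mean 0 and standard deviation sigma_e (assumed)."""
    return genetic_value(maternal_effects, paternal_effects) + rng.gauss(0.0, sigma_e)


# A mutation-free haplotype masks mutations on the other haplotype (G = 0),
# which is what makes the model recessive at the level of the gene.
masked = genetic_value([], [1.0])
# Mutations on both haplotypes contribute: sqrt(1.0 * 4.0) = 2.0.
exposed = genetic_value([0.5, 0.5], [4.0])
```

Because the product H1 × H2 is zero whenever either haplotype is free of causative mutations, only individuals carrying mutations on both copies of the gene are affected, regardless of where in the gene those mutations fall.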
27. Kaul, R., et al. (1994). Journal of Inherited Metabolic Disease, 17(3), 356–358.
32. [Figure: −log10(p) (y-axis, 0 to 15) against position in kbp (x-axis, 0 to 100); legend: common; common, causative; rare; rare, causative]
Thornton, K. R., et al. doi:10.1371/journal.pgen.1003258
33. xtent of the gene of interest;
power to detect, at levels of
We turn now to a discussio
focusing here only on the m
[QQ plot panels: CAD and RA; axis ticks from 0 to 30]
Burton, P. R., et al. doi:10.1038/nature05911
35. [Histogram: frequency (y-axis, 0.00 to 0.10) of the MAF of the most associated SNP (x-axis, 0.0 to 0.5)]
Wray, N. R., et al. PLoS Biology, 9(1), e1000579.
Gibson, G. doi:10.1038/nrg3118
See also:
36. Wray, N. R., et al. PLoS Biology, 9(1), e1000579.
Gibson, G. doi:10.1038/nrg3118
See also:
37. e of common variation genome-wide on the Affymetrix chip;
verage (by design) of rare variants, including many structural
(thereby reducing power to detect rare, penetrant, alleles)25
;
es with defining the full genomic extent of the gene of interest;
pite the sample size, relatively low power to detect, at levels of
and fine-mapping wo
required before such in
ments about the molec
We turn now to a dis
focusing here only on
[QQ plot panels: BD, CAD, HT, RA; y-axis: observed test statistic; axis ticks from 0 to 30]
QQ plot from Burton, P. R., et al. doi:10.1038/nature05911
Thornton, K. R., et al. doi:10.1371/journal.pgen.1003258
$$\mathrm{ESM}_K = \sum_{i=1}^{K}\left(-\log_{10} P_i + \log_{10}\frac{i}{M}\right)$$
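Taking the ESM statistic to be the sum, over the K smallest of M marker P-values, of −log10(P_i) + log10(i/M), it measures how far the most significant markers rise above the diagonal of a QQ plot. The sketch below is a minimal illustration of that reading; the function name is hypothetical, and treating M as the total number of markers and P_i as the i-th smallest P-value is an assumption.

```python
import math


def esm_statistic(p_values, k):
    """Sum of observed minus expected -log10 P over the k smallest P-values.

    The expected i-th smallest P-value among M markers is taken to be i / M,
    so each term is -log10(P_i) + log10(i / M).
    """
    m = len(p_values)
    if not 1 <= k <= m:
        raise ValueError("k must be between 1 and the number of markers")
    ranked = sorted(p_values)[:k]
    return sum(
        -math.log10(p) + math.log10(i / m)
        for i, p in enumerate(ranked, start=1)
    )
```

P-values lying exactly on the expected diagonal give a statistic of zero, and an excess of small P-values gives a positive value.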
38.
39. Summary, part 2
• GWAS observations consistent with a simple model
of loss of function (recessive) mutations in genes
• The (genetic) model matters much more than the
demographic assumptions!
• Standard models are rejected by the data.
• We need to get our hands on better GWAS data
sets!
40. [Figure: TE insertions in the DGRP and DSPR at (A) klarsicht, (B) Notch, (C) Delta, and (D) Cyp6a20, each labeled with a frequency given as a fraction (such as 1/15, 1/122, 3/14, 11/95, 4/121). Insertions in an exon are marked. * Same insertion in both resources]
Cridland, J. M., et al. doi:10.1093/molbev/mst129
King, E. G., et al. doi:10.1371/journal.pgen.1004322
McClellan, J., King, M.-C. doi:10.1016/j.cell.2010.03.032
ary factors, including the impact of the illness on selection (Pritchard and Cox,
2002). In order to be maintained at polymorphic frequencies worldwide, common
variants with even modest influence on disease must withstand selective
pressure in every generation. Not surprisingly, therefore, the common alleles
with the best documented relationship to disease are associated with disorders
that arise later in life, i.e., Alzheimer’s disease or age-related macular
degeneration. For illnesses that impact reproductive fitness, balancing positive
selection is often demonstrable. Illness in these cases may arise from
interaction between genetic and environmental factors, such that an otherwise
adaptive mechanism or trait is deleterious in certain individuals. For example,
adaptive inflammatory responses can cause autoimmune disorders when turned
against the host, or efficient storage of calories can lead to type II diabetes
or to obesity in food-rich cultures.
Both common and rare alleles may
lead to the same disease. For example,
common traits remains to be explained
(Goldstein, 2009). We further suggest
A recently published genome-wide
association study of autism (Wang et al.,
Figure 4. Genetic Heterogeneity of Severe Mental Illness
Recent genomic analyses have revealed many individually rare, or even de novo, micro-deletions, micro-
duplications, and point mutations associated with schizophrenia (red), autism (blue), or both (black). The
most frequently replicated genes and loci include DISC1, NRXN1 (neurexin), CNTNAP2, and SHANK3, as
well as genomic hotspots at 1q21.1, 15q13.3, 16p11.2, and 22q11.2.
42. Acknowledgements
• Julie Cridland, Andrew Foran, Rebekah Rogers,
Jaleal Sanjak, Ling Shao
• Peter Andolfatto, Tony Long, Stuart MacDonald
• Joseph Farran, Harry Mangalam
• NIH, UCI Center for Complex Biological Systems