Table of Contents
Materials and Methods

Supplementary Figures
Fig. S1. Map of populations.

Materials and Methods
Sequencing
We prepared genomic libraries (26) from cell lines (HGDP) and blood (Gabonese), then
se...

393 pass the regional and mapping quality filters, and of these, just one failed the
missingness filter and a further tw...

mtDNA Analysis
mtDNA Pipeline
To call mitochondrial haplogroups, we converted sequences from the GRCh37 to the
rCRS coor...

We estimated the {Di} using a maximum likelihood phylogeny (Fig. 2), and we estimate
the yearly mutation rate, µly, as...

their MRCA are independent. However, they share all mutations possessed by their
MRCA. Thus,
where Dij is the number of ...

An alternative frequentist estimator defines D as half the average mutational distance dij
between pairs of individuals ...

where $\rho = \mathrm{Corr}[\hat{M} \mid M, \hat{Y} \mid Y]$. We cannot disregard the correlation term in this case. If the
TMRCA of male and fema...

Coalescent theory measures time in units of Ne generations. To convert to years, we use
the maximum likelihood estimate...

Supplementary Text
Novel Y Chromosome Phylogenetic Structure
Haplogroup B2
Within hgB2, we identify one clade and three...

missing data is the most diverged in just three. Thus, we correctly impute five-sixths of
quadrupletons.
Polarizing Var...

For reference and comparison, Table S3 summarizes mutation rate point estimates on
four scales. The Y chromosome mutati...

inferred to possess. Finally, the maximum observed tip-to-root height (1188) could be
considered a conservative upper ...

10). With perfect data, these nine SNPs would have been classified as doubletons, but
they were instead misclassified a...

classified as HGDP00877 singletons, so accounting for type 2 errors reduces this branch
length by 7.9 (9 – 1.1).
Puttin...

The 1000 Genomes lineages are inappropriate to calibrate upon due to lower sequencing
coverage (average = 2.9×; Supplem...

Fig. S1. Map of populations.
We sampled Y chromosomes and mtDNAs from nine populations including Baka Pygmies from Gabo...

Fig. S2. Sequencing read mapping on Xq21.
Total read depth and the depth of MQ0 reads are plotted for 24 HGDP females. ...
  1. G. David Poznik et al., Sequencing Y Chromosomes Resolves Discrepancy in Time to Common Ancestor of Males Versus Females. Science 341, 562 (2013). DOI: 10.1126/science.1237619

     This copy is for your personal, non-commercial use only.

     The following resources related to this article are available online at www.sciencemag.org (this information is current as of August 7, 2013):

     Updated information and services, including high-resolution figures, can be found in the online version of this article at: http://www.sciencemag.org/content/341/6145/562.full.html

     Supporting Online Material can be found at: http://www.sciencemag.org/content/suppl/2013/08/01/341.6145.562.DC1.html

     A list of selected additional articles on the Science Web sites related to this article can be found at: http://www.sciencemag.org/content/341/6145/562.full.html#related

     This article cites 46 articles, 22 of which can be accessed free: http://www.sciencemag.org/content/341/6145/562.full.html#ref-list-1

     This article has been cited by 1 article hosted by HighWire Press; see: http://www.sciencemag.org/content/341/6145/562.full.html#related-urls

     Science (print ISSN 0036-8075; online ISSN 1095-9203) is published weekly, except the last week in December, by the American Association for the Advancement of Science, 1200 New York Avenue NW, Washington, DC 20005. Copyright 2013 by the American Association for the Advancement of Science; all rights reserved. The title Science is a registered trademark of AAAS.

     Downloaded from www.sciencemag.org on August 7, 2013.
  2. Sequencing Y Chromosomes Resolves Discrepancy in Time to Common Ancestor of Males Versus Females

     G. David Poznik,1,2 Brenna M. Henn,3,4 Muh-Ching Yee,3 Elzbieta Sliwerska,5 Ghia M. Euskirchen,3 Alice A. Lin,6 Michael Snyder,3 Lluis Quintana-Murci,7,8 Jeffrey M. Kidd,3,5 Peter A. Underhill,3 Carlos D. Bustamante3*

     The Y chromosome and the mitochondrial genome have been used to estimate when the common patrilineal and matrilineal ancestors of humans lived. We sequenced the genomes of 69 males from nine populations, including two in which we find basal branches of the Y-chromosome tree. We identify ancient phylogenetic structure within African haplogroups and resolve a long-standing ambiguity deep within the tree. Applying equivalent methodologies to the Y chromosome and the mitochondrial genome, we estimate the time to the most recent common ancestor (TMRCA) of the Y chromosome to be 120 to 156 thousand years and the mitochondrial genome TMRCA to be 99 to 148 thousand years. Our findings suggest that, contrary to previous claims, male lineages do not coalesce significantly more recently than female lineages.

     The Y chromosome contains the longest stretch of nonrecombining DNA in the human genome and is therefore a powerful tool with which to study human history. Estimates of the time to the most recent common ancestor (TMRCA) of the Y chromosome have differed by a factor of about 2 from TMRCA estimates for the mitochondrial genome. Y-chromosome coalescence time has been estimated in the range of 50 to 115 thousand years (ky) (1–3), although larger values have been reported (4, 5), whereas estimates for mitochondrial DNA (mtDNA) range from 150 to 240 ky (3, 6, 7). However, the quality and quantity of data available for these two uniparental loci have differed substantially. Whereas the complete mitochondrial genome has been resequenced thousands of times (6, 8), fully sequenced diverse Y chromosomes have only recently become available. Previous estimates of the Y-chromosome TMRCA relied on short resequenced segments, rapidly mutating microsatellites, or single-nucleotide polymorphisms (SNPs) ascertained in a small panel of individuals and then genotyped in a global panel. These approaches likely underestimate genetic diversity and, consequently, TMRCA (9).

     We sequenced the complete Y chromosomes of 69 males from seven globally diverse populations of the Human Genome Diversity Panel (HGDP) and two additional African populations: San (Bushmen) from Namibia, Mbuti Pygmies from the Democratic Republic of Congo, Baka Pygmies and Nzebi from Gabon, Mozabite Berbers from Algeria, Pashtuns (Pathan) from Pakistan, Cambodians, Yakut from Siberia, and Mayans from Mexico (fig. S1). Individuals were selected without regard to their Y-chromosome haplogroups.

     The Y-chromosome reference sequence is 59.36 Mb, but this includes a 30-Mb stretch of constitutive heterochromatin on the q arm, a 3-Mb centromere, 2.65-Mb and 330-kb telomeric pseudoautosomal regions (PAR) that recombine with the X chromosome, and eight smaller gaps. We mapped reads to the remaining 22.98 Mb of assembled reference sequence, which consists of three sequence classes defined by their complexity and degree of homology to the X chromosome (10): X-degenerate, X-transposed, and ampliconic. Both the high degree of self-identity within the ampliconic tracts and the X-chromosome homology of the X-transposed region render portions of the Y chromosome ill suited for short-read sequencing. To address this, we constructed filters that reduced the data to 9.99 million sites (11)

     1 Program in Biomedical Informatics, Stanford University School of Medicine, Stanford, CA, USA.
     2 Department of Statistics, Stanford University, Stanford, CA, USA.
     3 Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA.
     4 Department of Ecology and Evolution, Stony Brook University, Stony Brook, NY, USA.
     5 Department of Human Genetics and Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA.
     6 Department of Psychiatry, Stanford University, Stanford, CA, USA.
     7 Institut Pasteur, Unit of Human Evolutionary Genetics, 75015 Paris, France.
     8 Centre National de la Recherche Scientifique, URA3012, 75015 Paris, France.
     *Corresponding author. E-mail: cdbustam@stanford.edu

     [Fig. 1 plot: filtered depth EWMA (0 to 500) and MQ0/unfiltered depth EWMA (0.0 to 1.0) versus position (2 to 29 Mb). Legend: depth filter, MQ0 ratio filter, exclusion mask, inclusion mask, compatible site, incompatible site. Chromosome ideogram from 0 Mb to 59.36 Mb with sequence classes X-degenerate, X-transposed, ampliconic, heterochromatic, pseudoautosomal, other.]

     Fig. 1. Callability mask for the Y chromosome. Exponentially weighted moving averages of read depth (blue line) and the proportion of reads mapping ambiguously (MQ0 ratio; violet line) versus physical position. Regions with values outside the envelopes defined by the dashed lines (depth) or dotted lines (MQ0) were flagged (blue and violet boxes) and merged for exclusion (gray boxes). The complement (black boxes) defines the regions within which reliable genotype calls can be made. Below, a scatter plot indicates the positions of all observed SNVs. Those incompatible with the inferred phylogenetic tree (red) are uniformly distributed. The X-degenerate regions yield quality sequence data, ampliconic sequences tend to fail both filters, and mapping quality is poor in the X-transposed region.

     Science, vol. 341, 2 August 2013, p. 562 (Reports), www.sciencemag.org
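     The masking idea in the Fig. 1 caption can be sketched in a few lines of Python: smooth per-position depth and MQ0 ratio with an exponentially weighted moving average, flag positions that leave the allowed envelope, and merge flagged runs into exclusion intervals. This is only an illustration. The function names, the smoothing weight `alpha`, the use of an upper bound only for the MQ0 ratio, and all thresholds are assumptions made here; the authors' actual filter parameters are given in their supplementary materials (11) and are not reproduced.

```python
from typing import List, Sequence, Tuple


def ewma(values: Sequence[float], alpha: float) -> List[float]:
    """Exponentially weighted moving average (alpha = weight on the newest value)."""
    smoothed: List[float] = []
    if not values:
        return smoothed
    current = float(values[0])
    for value in values:
        current = alpha * value + (1.0 - alpha) * current
        smoothed.append(current)
    return smoothed


def exclusion_intervals(
    depth: Sequence[float],
    mq0_ratio: Sequence[float],
    depth_min: float,
    depth_max: float,
    mq0_max: float,
    alpha: float = 0.1,  # hypothetical smoothing weight
) -> List[Tuple[int, int]]:
    """Return merged half-open [start, end) index intervals that fail either filter."""
    if len(depth) != len(mq0_ratio):
        raise ValueError("depth and mq0_ratio must have the same length")
    depth_s = ewma(depth, alpha)
    mq0_s = ewma(mq0_ratio, alpha)
    # A position is flagged if smoothed depth leaves its envelope
    # or the smoothed MQ0 ratio exceeds its ceiling.
    flagged = [
        not (depth_min <= d <= depth_max) or m > mq0_max
        for d, m in zip(depth_s, mq0_s)
    ]
    intervals: List[Tuple[int, int]] = []
    start = None
    for i, bad in enumerate(flagged):
        if bad and start is None:
            start = i
        elif not bad and start is not None:
            intervals.append((start, i))
            start = None
    if start is not None:
        intervals.append((start, len(flagged)))
    return intervals


# Toy data: a depth spike at positions 5..9, such as a collapsed repeat might produce.
depth = [30.0] * 5 + [300.0] * 5 + [30.0] * 5
mq0 = [0.0] * 15
mask = exclusion_intervals(depth, mq0, depth_min=10.0, depth_max=60.0,
                           mq0_max=0.5, alpha=1.0)
print(mask)  # with alpha=1.0 (no smoothing): [(5, 10)]
```

     The complement of the returned intervals plays the role of the inclusion mask (black boxes) in the figure. With `alpha` below 1 the flagged region extends past the spike, because the average takes several positions to decay.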
  3. (Fig. 1 and fig. S2). We then implemented a haploid model expectation-maximization algorithm to call genotypes (11).

     We identified 11,640 single-nucleotide variants (SNVs) (fig. S3). A total of 2293 (19.7%) are present in dbSNP (v135), and we assigned haplogroups on the basis of the 390 (3.4%) present in the International Society of Genetic Genealogy (ISOGG) database (12) (fig. S4). At SNVs, median haploid coverage was 3.1x (interquartile range 2.6 to 3.8x) (table S1 and fig. S5), and sequence validation suggests a genotype calling error rate on the order of 0.1% (11).

     Because mutations accumulate over time along a single lengthy haplotype (13), the male-specific region of the Y chromosome provides power for phylogenetic inference. We constructed a maximum likelihood tree from 11,640 SNVs using the Tamura-Nei nucleotide substitution model (Fig. 2) and, in agreement with (14), observe strong bootstrap support (500 replicates) for the major haplogroup branching points. The tree both recapitulates and adds resolution to the previously inferred Y-chromosome phylogeny (fig. S6), and it characterizes branch lengths free of ascertainment bias. We identify extraordinary depth within Africa, including lineages sampled from the San hunter-gatherers that coalesce just short of the root of the entire tree. This stands in contrast to a tree from autosomal SNP genotypes (15), wherein African branches were considerably shorter than others; genotyping arrays primarily rely on SNPs ascertained in European populations and therefore undersample diversity within Africa. Two regions of reduced branch length in our tree correspond to rapid expansions: the out-of-Africa event (downstream of F-M89) and the agriculture-catalyzed Bantu expansions (downstream of E-M2). Among the three hunter-gatherer populations, we find a relatively high number of B2 lineages. Within this haplogroup, six Baka B-M192 individuals form a distinct clade that does not correspond to extant definitions (11) (fig. S7). We estimate this previously uncharacterized structure to have arisen ~35 thousand years ago (kya).

     We resolve the polytomy of the Y macro-haplogroup F (16) by determining the branching order of haplogroups G, H, and IJK (Fig. 2 and fig. S6). We identified a single variant (rs73614810, a C→T transition dubbed “M578”) for which haplogroup G retains the ancestral allele, whereas its brother clades (H and IJK) share the derived allele. Genotyping M578 in a diverse panel confirmed the finding (table S2). We thereby infer more recent common ancestry between hgH and hgIJK than between either and hgG. M578 defines an early diversification episode of the Y phylogeny in Eurasia (11).

     [Fig. 2 tree: scale from 0 to 1200. Leaves are labeled with the most derived mutation and population (for example H-M138 Cambodian, Q-M3 Maya, A-P28 San). Internal branch labels: CT-M168, N-Page56, B-M150, P-M45, O-P186, E-U290, A-M6, B-P6, G-P287, B-M182, E-M2/M180, Q-L54, B-M211, E-M191, E-L514, BT-M42, E-P179, KxLT-M526, B-M192, E-U175/P277, N-L708, A-M14, B-M30, F-M89, E-M183, E-P252, A-L419, K-M9, NO-M214. Haplogroup legend: A, B, E, FT (Non-African). Inset branch: HIJK-M578.]

     Fig. 2. Y-chromosome phylogeny inferred from genomic sequencing. This tree recapitulates the previously known topology of the Y-chromosome phylogeny; however, branch lengths are now free of ascertainment bias. Branches are drawn proportional to the number of derived SNVs. Internal branches are labeled with defining ISOGG variants inferred to have arisen on the branch. Leaves are colored by major haplogroup cluster and labeled with the most derived mutation observed and the population from which the individual was drawn. Previously uncharacterized structure within African hgB2 is indicated in orange. (Inset) Resolution of a polytomy was possible through the identification of a variant for which hgG retains the ancestral allele, whereas hgH and hgIJK share the derived allele.

  4. To account for missing genotypes, we assigned each SNV to the root of the smallest subtree containing all carriers of one allele or the other and inferred that the allele specific to the subtree was derived (fig. S8). We used the chimpanzee Y-chromosome sequence to polarize 398 variants assigned to the deepest split, a task complicated by substantial structural divergence (11, 17).

     We estimated the coalescence time of all Y chromosomes using both a molecular clock–based frequentist estimator and an empirical Bayes approach that uses a prior distribution of TMRCA from coalescent theory and conducts Markov chain simulation to estimate the likelihood of parameters given a set of DNA sequences (GENETREE) (11, 18) (Table 1). To directly compare the TMRCA of the Y chromosome to that of the mtDNA, we estimated their respective mutation rates by calibrating phylogeographic patterns from the initial peopling of the Americas, a recent human event with high-confidence archaeological dating.

     Archaeological evidence indicates that humans first colonized the Americas ~15 kya via a rapid coastal migration that reached Monte Verde II in southern Chile by 14.6 kya (19). The two Native American Mayans represent Y-chromosome hgQ lineages, Q-M3 and Q-L54*(xM3), that likely diverged at about the same time as the initial peopling of the continents. Q is defined by the M242 mutation that arose in Asia. A descendant haplogroup, Q-L54, emerged in Siberia and is ancestral to Q-M3. Because the M3 mutation appears to be specific to the Americas (20), it likely occurred after the initial entry, and the prevalence of M3 in South America suggests that it emerged before the southward migratory wave. Consequently, the divergence between these two lineages provides an appropriate calibration point for the Y mutation rate.
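     The calibration logic reduces to simple arithmetic: mutations accumulated on a lineage since divergence, divided by the number of callable sites times the elapsed years. The Python sketch below is a back-of-the-envelope check, not the authors' estimator, whose details and error corrections are in the supplementary materials (11). It uses figures reported in this report: 120 and 126 variants on the two Mayan Q lineages, 9.99 million callable sites, and an entry to the Americas ~15 kya. Averaging the two counts and applying the full 9.99 million sites to both samples are assumptions made here, as is the function name.

```python
from typing import Sequence


def calibrated_mutation_rate(
    mutation_counts: Sequence[int],
    callable_sites: float,
    years_since_divergence: float,
) -> float:
    """Mutations per site per year, averaging the counts over lineages."""
    mean_mutations = sum(mutation_counts) / len(mutation_counts)
    return mean_mutations / (callable_sites * years_since_divergence)


# Values taken from the report; treating them this way is an assumption.
rate = calibrated_mutation_rate(
    mutation_counts=[120, 126],       # variants accumulated on Q-M3 and Q-L54*(xM3)
    callable_sites=9.99e6,            # sites retained after filtering
    years_since_divergence=15_000,    # approximate entry to the Americas
)
print(f"{rate:.2e} per bp per year")  # 8.21e-10 per bp per year
```

     The result, about 0.82 × 10^-9 per bp per year, agrees with the point estimate the report gives for the Y chromosome.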
     The large number of variants that have accumulated since divergence, 120 and 126, contrasts with the pedigree-based estimate of the Y-chromosome mutation rate, which is based on just 4 mutations (21). Using entry to the Americas as a calibration point, we estimate a mutation rate of 0.82 × 10^-9 per base pair (bp) per year [95% confidence interval (CI): 0.72 × 10^-9 to 0.92 × 10^-9/bp/year] (table S3). False negatives have minimal effect on this estimate due to the low probability, at 5.7x and 8.5x coverage, of observing fewer than two reads at a site (observed proportions: 3.1% and 0.6%) and due to the fact that the number of unobserved singletons possessed by one individual is offset by a similar number of Q doubletons unobserved in the same individual and thereby misclassified as singletons possessed by the other (11) (figs. S9 and S10). This calibration approach assumes approximate coincidence between the expansion throughout the Americas and the divergence of Q-M3 and Q-L54*(xM3), but we consider deviation from this assumption and identify a strict lower bound on the point of divergence using sequences from the 1000 Genomes Project (11). As a comparison point, we consider the out-of-Africa expansion of modern humans, which dates to approximately 50 kya (22) and yields a similar mutation rate of 0.79 × 10^-9/bp/year.

     We constructed an analogous pipeline for high coverage (>250x) mtDNA sequences from the 69 male samples and an additional 24 females from the seven HGDP populations (11) (fig. S11). As in the Y-chromosome analysis, we calibrated the mtDNA mutation rate using divergence within the Americas. We selected the pan-American hgA2, one of several initial founding haplogroups among Native Americans. The star-shaped phylogeny of hgA2 subclades suggests that its divergence was coincident with the rapid dispersal upon the initial colonization of the continents (23). Calibration on 108 previously analyzed hgA2 sequences (11) (fig. S12) yields a point estimate equivalent to that from our seven Mayan mtDNAs, but within a narrower confidence interval. From this within-human calibration, we estimate a mutation rate of 2.3 × 10^-8/bp/year (95% CI: 2.0 × 10^-8 to 2.5 × 10^-8/bp/year), higher than that from human-chimpanzee divergence but similar to other estimates using within-human calibration points (24, 25).

     The global TMRCA estimate for any locus constitutes an upper bound for the time of human population divergence under models without gene flow. We estimate the Y-chromosome TMRCA to be 138 ky (120 to 156 ky) and the mtDNA TMRCA to be 124 ky (99 to 148 ky) (Table 1) (11). Our mtDNA estimate is more recent than many previous studies, the majority of which used mutation rates extrapolated from between-species divergence. However, mtDNA mutation rates are subject to a time-dependent decline, with pedigree-based estimates on the faster end of the spectrum and species-based estimates on the slower. Because of this time dependency and the need to calibrate the Y and mtDNA in a comparable manner, it is more appropriate here to use within-human clade estimates of the mutation rate.

     Rather than assume the mutation rate to be a known constant, we explicitly account for the uncertainty in its estimation by modeling each TMRCA as the ratio of two random variables. We estimate the ratio of the mtDNA TMRCA to that of the Y chromosome to be 0.90 (95% CI: 0.68 to 1.11) (fig. S13). If, as argued above, the divergence of the Y-chromosome Q lineages occurred at approximately the same time as that of the mtDNA A2 lineages, then the TMRCA ratio is invariant to the specific calibration time used. Regardless, the conclusion of parity is robust to possible discrepancy between the divergence times within the Americas (11). Using comparable calibration approaches, the Y and

     Table 1. TMRCA and Ne estimates for the Y chromosome and mtDNA. Pop., population.

                        Y chromosome                        mtDNA
     Method             Pop.   n    TMRCA*         Ne       Pop.    n    TMRCA*         Ne
     Molecular clock    All    69   139 (120–156)  4500†    All     93   124 (99–148)   9500†
     GENETREE‡          San    6    128 (112–146)  3800     Nzebi   18   105 (91–119)   11,500
                        Baka   11   122 (106–137)  1800     Mbuti   6    121 (100–143)  3700

     *Employs mutation rate estimated from within-human calibration point. Times measured in ky.
     †Uses Watterson’s estimator, $\hat{\theta}_w$.
     ‡Each coalescent analysis restricted to a single population spanning the ancestral root (11).

     [Fig. 3 plot: probability density (0.000 to 0.010) versus time (0 to 800 ky).]

     Fig. 3. Similarity of TMRCA does not imply equivalent Ne of males and females. The TMRCA for a given locus is drawn from a predata (i.e., prior) distribution that is a function of Ne, generation time, sample size, and demographic history. Consider the distribution of possible TMRCAs for a set of 100 uniparental chromosomes. Although the Mbuti mtDNA Ne is twice as large as that of the Baka Y chromosome, the corresponding predata TMRCA distributions overlap considerably.
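     The point of Fig. 3 can be illustrated with a small simulation of the standard constant-size coalescent for a uniparental (haploid) locus: while k lineages remain, the waiting time to the next coalescence is exponential with rate k(k-1)/2 per Ne generations, so the expected TMRCA of n chromosomes is 2·Ne·(1 − 1/n) generations. The Python sketch below is not the authors' analysis. The constant population size, the 25-year generation time, the number of draws, and all names are assumptions made here; only the two Ne values (1800 for the Baka Y chromosome, 3700 for the Mbuti mtDNA) and the sample size of 100 come from Table 1 and the Fig. 3 caption.

```python
import random
from statistics import mean


def expected_tmrca_generations(ne: float, n: int) -> float:
    """Expected TMRCA of n haploid lineages under a constant-size coalescent."""
    return 2.0 * ne * (1.0 - 1.0 / n)


def simulate_tmrca_years(ne: float, n: int, generation_years: float,
                         rng: random.Random) -> float:
    """Draw one TMRCA by summing exponential waiting times from n lineages down to 1."""
    generations = 0.0
    for k in range(n, 1, -1):
        pairs = k * (k - 1) / 2.0
        # Coalescence rate per generation with k lineages is pairs / Ne.
        generations += rng.expovariate(pairs / ne)
    return generations * generation_years


GENERATION_YEARS = 25.0   # assumed generation time, not taken from the report
N_CHROMOSOMES = 100       # sample size used in the Fig. 3 caption
N_DRAWS = 2000            # arbitrary number of simulated genealogies

rng = random.Random(0)
baka_y = sorted(simulate_tmrca_years(1800, N_CHROMOSOMES, GENERATION_YEARS, rng)
                for _ in range(N_DRAWS))
mbuti_mt = sorted(simulate_tmrca_years(3700, N_CHROMOSOMES, GENERATION_YEARS, rng)
                  for _ in range(N_DRAWS))

# Share of Mbuti mtDNA draws that fall below the 95th percentile of the Baka Y draws.
baka_95th = baka_y[int(0.95 * N_DRAWS)]
overlap = sum(t <= baka_95th for t in mbuti_mt) / N_DRAWS
print(f"mean Baka Y TMRCA:      {mean(baka_y) / 1000:.0f} ky")
print(f"mean Mbuti mtDNA TMRCA: {mean(mbuti_mt) / 1000:.0f} ky")
print(f"Mbuti draws below Baka 95th percentile: {overlap:.0%}")
```

     Because the final coalescence between the last two lineages dominates the variance, both simulated distributions are wide and a large share of the draws from the larger-Ne locus falls inside the range of the smaller-Ne locus, which is the qualitative behaviour the caption describes.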
  5. 5. mtDNA coalescence times are not significantly different. This conclusion would hold whether or not an alternative approach would yield more definitive TMRCA estimates. Our observation that the TMRCA of the Y chromosome is similar to that of the mtDNA does not imply that the effective population sizes (Ne) of males and females are similar. In fact, we observe a larger Ne in females than in males (Table 1). Although, due to its larger Ne, the dis- tribution from which the mitochondrial TMRCA has been drawn is right-shifted with respect to that of the Y-chromosome TMRCA, the two dis- tributions have large variances and overlap (Fig. 3). Dogma has held that the common ancestor of human patrilineal lineages, popularly referred to as the Y-chromosome “Adam,” lived considera- bly more recently than the common ancestor of female lineages, the so-called mitochondrial “Eve.” However, we conclude that the mitochon- drial coalescence time is not substantially greater than that of the Y chromosome. Indeed, due to our moderate-coverage sequencing and the ex- istence of additional rare divergent haplogroups, our analysis may yet underestimate the true Y-chromosome TMRCA. References and Notes 1. J. K. Pritchard, M. T. Seielstad, A. Perez-Lezaun, M. W. Feldman, Mol. Biol. Evol. 16, 1791–1798 (1999). 2. R. Thomson, J. K. Pritchard, P. Shen, P. J. Oefner, M. W. Feldman, Proc. Natl. Acad. Sci. U.S.A. 97, 7360–7365 (2000). 3. H. Tang, D. O. Siegmund, P. Shen, P. J. Oefner, M. W. Feldman, Genetics 161, 447–459 (2002). 4. M. F. Hammer, Nature 378, 376–378 (1995). 5. F. Cruciani et al., Am. J. Hum. Genet. 88, 814–818 (2011). 6. M. Ingman, H. Kaessmann, S. Pääbo, U. Gyllensten, Nature 408, 708–713 (2000). 7. R. L. Cann, M. Stoneking, A. C. Wilson, Nature 325, 31–36 (1987). 8. P. A. Underhill, T. Kivisild, Annu. Rev. Genet. 41, 539–564 (2007). 9. M. A. Jobling, C. Tyler-Smith, Nat. Rev. Genet. 4, 598–612 (2003). 10. H. Skaletsky et al., Nature 423, 825–837 (2003). 11. 
Materials and methods are available as supplementary materials on Science Online. 12. ISOGG, International Society of Genetic Genealogy (2013); available at www.isogg.org/. 13. P. A. Underhill et al., Ann. Hum. Genet. 65, 43–62 (2001). 14. W. Wei et al., Genome Res. 23, 388–395 (2013). 15. J. Z. Li et al., Science 319, 1100–1104 (2008). 16. T. M. Karafet et al., Genome Res. 18, 830–838 (2008). 17. J. F. Hughes et al., Nature 463, 536–539 (2010). 18. R. C. Griffiths, S. Tavaré, Philos. Trans. R. Soc. London B Biol. Sci. 344, 403–410 (1994). 19. T. Goebel, M. R. Waters, D. H. O’Rourke, Science 319, 1497–1502 (2008). 20. M. C. Dulik et al., Am. J. Hum. Genet. 90, 229–246 (2012). 21. Y. Xue et al.; Asan, Curr. Biol. 19, 1453–1457 (2009). 22. R. G. Klein, Evol. Anthropol. 17, 267–281 (2008). 23. S. Kumar et al., BMC Evol. Biol. 11, 293 (2011). 24. S. Y. W. Ho, M. J. Phillips, A. Cooper, A. J. Drummond, Mol. Biol. Evol. 22, 1561–1568 (2005). 25. B. M. Henn, C. R. Gignoux, M. W. Feldman, J. L. Mountain, Mol. Biol. Evol. 26, 217–230 (2009). Acknowledgments: We thank O. Cornejo, S. Gravel, D. Siegmund, and E. Tsang for helpful discussions; M. Sikora and H. Costa for mapping reads from Gabonese samples; and H. Cann for assistance with HGDP samples. This work was supported by National Library of Medicine training grant LM-07033 and NSF graduate research fellowship DGE-1147470 (G.D.P.); NIH grant 3R01HG003229 (B.M.H. and C.D.B.); NIH grant DP5OD009154 (J.M.K. and E.S.); and Institut Pasteur, a CNRS Maladies Infectieuses Émergentes Grant, and a Foundation Simone et Cino del Duca Research Grant (L.Q.M.). P.A.U. consulted for, P.A.U. and B.M.H. have stock in, and C.D.B. is on the advisory board of a project at 23andMe. C.D.B. is on the scientific advisory boards of Personalis, Inc.; InVitae (formerly Locus Development, Inc.); and Ancestry.com. M.S. 
is a scientific advisory member and founder of Personalis, a scientific advisory member for Genapsys Former, and a consultant for Illumina and Beckman Coulter Society for American Medical Pathology. B.M.H. formerly had a paid consulting relationship with Ancestry.com. Variants have been deposited to dbSNP (ss825679106–825690384). Individual level genetic data are available, through a data access agreement to respect the privacy of the participants for transfer of genetic data, by contacting C.D.B. Supplementary Materials www.sciencemag.org/cgi/content/full/341/6145/562/DC1 Materials and Methods Supplementary Text Figs. S1 to S13 Tables S1 to S3 Data File S1 References (26–51) 11 March 2013; accepted 25 June 2013 10.1126/science.1237619 Low-Pass DNA Sequencing of 1200 Sardinians Reconstructs European Y-Chromosome Phylogeny Paolo Francalacci,1 * Laura Morelli,1 † Andrea Angius,2,3 Riccardo Berutti,3,4 Frederic Reinier,3 Rossano Atzeni,3 Rosella Pilu,2 Fabio Busonero,2,5 Andrea Maschio,2,5 Ilenia Zara,3 Daria Sanna,1 Antonella Useli,1 Maria Francesca Urru,3 Marco Marcelli,3 Roberto Cusano,3 Manuela Oppo,3 Magdalena Zoledziewska,2,4 Maristella Pitzalis,2,4 Francesca Deidda,2,4 Eleonora Porcu,2,4,5 Fausto Poddie,4 Hyun Min Kang,5 Robert Lyons,6 Brendan Tarrier,6 Jennifer Bragg Gresham,6 Bingshan Li,7 Sergio Tofanelli,8 Santos Alonso,9 Mariano Dei,2 Sandra Lai,2 Antonella Mulas,2 Michael B. Whalen,2 Sergio Uzzau,4,10 Chris Jones,3 David Schlessinger,11 Gonçalo R. Abecasis,5 Serena Sanna,2 Carlo Sidore,2,4,5 Francesco Cucca2,4 * Genetic variation within the male-specific portion of the Y chromosome (MSY) can clarify the origins of contemporary populations, but previous studies were hampered by partial genetic information. Population sequencing of 1204 Sardinian males identified 11,763 MSY single-nucleotide polymorphisms, 6751 of which have not previously been observed. 
We constructed a MSY phylogenetic tree containing all main haplogroups found in Europe, along with many Sardinian-specific lineage clusters within each haplogroup. The tree was calibrated with archaeological data from the initial expansion of the Sardinian population ~7700 years ago. The ages of nodes highlight different genetic strata in Sardinia and reveal the presumptive timing of coalescence with other human populations. We calculate a putative age for coalescence of ~180,000 to 200,000 years ago, which is consistent with previous mitochondrial DNA–based estimates. N ew sequencing technologies have pro- vided genomic data sets that can recon- struct past events in human evolution more accurately (1). Sequencing data from the male-specific portion of the Y chromosome (MSY) (2), because of its lack of recombination and low mutation, reversion, and recurrence rates, can be particularly informative for these evolution- ary analyses (3, 4). Recently, high-coverage Y chromosome sequencing data from 36 males from different worldwide populations (5) assessed 6662 phylogenetically informative variants and estimated the timing of past events, including a putative coalescence time for modern humans of ~101,000 to 115,000 years ago. MSY sequencing data reported to date still represent a relatively small number of individuals from a few populations. Furthermore, dating esti- mates are also affected by the calibration of the 1 Dipartimento di Scienze della Natura e del Territorio, Uni- versitàdiSassari,07100Sassari,Italy.2 IstitutodiRicercaGenetica e Biomedica (IRGB), CNR, Monserrato, Italy. 3 Center for Ad- vanced Studies, Research and Development in Sardinia (CRS4), Pula, Italy. 4 Dipartimento di Scienze Biomediche, Università di Sassari, 07100 Sassari, Italy. 5 Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA. 6 DNA Sequencing Core, University of Michigan, Ann Arbor, MI 48109, USA. 
7 Center for Human Genetics Research, Department of Molecular Physiology and Biophysics, Vanderbilt University, Nashville, TN 37235, USA. 8 Dipartimento di Biologia, Università di Pisa, 56126 Pisa, Italy. 9 Departamento de Genética, Antropología Física y Fisiología Animal, Universidad del País Vasco/Euskal Herriko Unibertsitatea, 48080 Bilbao, Spain. 10 Porto Conte Ricerche, Località Tramariglio, Alghero, 07041 Sassari, Italy. 11 Laboratory of Genetics, National Institute on Aging, Baltimore, MD 21224, USA.
*Corresponding author. E-mail: pfrancalacci@uniss.it (P.F.); fcucca@uniss.it (F.C.)
†Laura Morelli prematurely passed away on 20 February 2013. This work is dedicated to her memory.
www.sciencemag.org SCIENCE VOL 341 2 AUGUST 2013 565
www.sciencemag.org/cgi/content/341/6145/562/DC1

Supplementary Materials for
Sequencing Y Chromosomes Resolves Discrepancy in Time to Common Ancestor of Males Versus Females

G. David Poznik, Brenna M. Henn, Muh-Ching Yee, Elzbieta Sliwerska, Ghia M. Euskirchen, Alice A. Lin, Michael Snyder, Lluis Quintana-Murci, Jeffrey M. Kidd, Peter A. Underhill, Carlos D. Bustamante*

*Corresponding author. E-mail: cdbustam@stanford.edu

Published 2 August 2013, Science 341, 562 (2013)
DOI: 10.1126/science.1237619

This PDF file includes:
Materials and Methods
Supplementary Text
Figs. S1 to S13
Tables S1 to S3
References

Other Supplementary Material for this manuscript includes the following:
(available at www.sciencemag.org/cgi/content/full/341/6145/562/DC1)
Data File S1. Sample, phylogeny, and variant data (zipped archive).
Data File S2. Y chromosome genotype calls. To protect participant privacy, this zipped archive is available through a data access agreement (DAA) for transfer of genetic data by contacting C.D.B.
Data File S3. Y chromosome mapped sequencing reads. This BAM file is also available via the DAA described above. Mapping, quality score recalibration, and indel realignment are described in Materials and Methods.
Table of Contents

Materials and Methods
  Sequencing
  Genotypes
  Validation
  Phylogenetic Inference
  mtDNA Analysis
  Frequentist Estimation of TMRCA
  Empirical Bayesian Estimation of TMRCA and Ne: GENETREE
  Predata Distribution of TMRCA

Supplementary Text
  Novel Y Chromosome Phylogenetic Structure
  Imputation
  Calibration and Mutation Rate Estimation
  Impact of Sequencing Error and Sequence Coverage on TMRCA Estimation
  Calibration Time
  Existence of Rare Yet More Basal Lineages
  Effective Population Size
  Additional Acknowledgements

Supplementary Figures
  Fig. S1. Map of populations.
  Fig. S2. Sequencing read mapping on Xq21.
  Fig. S3. Quality control and genotype calling on the Y chromosome.
  Fig. S4. Cross-tabulation of populations and Y haplogroups.
  Fig. S5. Call rate and mean sequencing coverage on the Y chromosome.
  Fig. S6. Y chromosome phylogenetic backbone.
  Fig. S7. Novel structure in Y hgB2.
  Fig. S8. Phylogeny-aware imputation.
  Fig. S9. Y chromosome hgQ clade with Phase 1 1000 Genomes samples included.
  Fig. S10. Sequencing coverage for Mayan HGDP00856 at singleton sites.
  Fig. S11. mtDNA phylogeny.
  Fig. S12. mtDNA calibration tree.
  Fig. S13. Comparing the Y chromosome TMRCA to that of mtDNA.

Supplementary Tables
  Table S1. Y chromosome summary of samples.
  Table S2. M578 genotyping results.
  Table S3. Mutation rate point estimates.

Supplementary Data
  Data File S1. Sample, phylogeny, and variant data.
  Data File S2. Y chromosome genotype calls.
  Data File S3. Y chromosome mapped sequencing reads.

FTP Addresses and Accession Numbers for External Data
  Y Chromosome hgQ Sequences from the 1000 Genomes Project
  Complete mtDNA hgA2 Sequences: GenBank Accession Numbers

References and Notes
Materials and Methods

Sequencing

We prepared genomic libraries (26) from cell lines (HGDP) and blood (Gabonese), then sequenced the libraries on Illumina HiSeq 2000 machines at the Stanford Center for Genomics and Personalized Medicine. We used BWA (27) to map paired 101 bp reads to the GRCh37 human reference, removed PCR duplicates with Picard (28), and then utilized the Genome Analysis Tool Kit (GATK) (29, 30) to recalibrate quality scores, perform local realignment around candidate indels, and compute genotype likelihoods.

Genotypes

Callability Mask

To learn directly from the read data the boundaries of the regions within which short-read sequencing could yield reliable variant calls, we calculated average filtered read depth across all samples in contiguous 1 kb windows and computed an exponentially-weighted moving average (EWMA) of these values (Fig. 1). Regions for which the EWMA deviated from a narrow envelope were identified as problematic. Those of depressed depth corresponded to ampliconic sequences, within which reads do not map uniquely and were thus filtered out. Regions of inflated depth corresponded to heterochromatin, where naïve application of standard genotype calling methods would give the impression of abundant heterozygosity due to the pileup of highly similar reads around the borders of unassembled regions. After constructing the depth-based filter, we repeated this procedure for the MQ0 ratio, the proportion of unfiltered reads with fully ambiguous mapping. Although the X-transposed region showed no deviation in the depth-based mask, it failed the MQ0 ratio based mask. In females we found depressed read depth in the homologous region of the X chromosome (Fig. S2); we hypothesize that in males, each of whom possesses one X and one Y, there is an equal exchange of mismapped reads between the two chromosomes. The depth and MQ0 masks were merged and smoothed, leaving 10.45 Mb of sequence for downstream quality control.
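The depth-based part of the callability mask can be illustrated with a minimal sketch. It assumes a one-directional EWMA over per-window mean depths and an envelope defined as a fixed fraction around the median depth; the function names, the smoothing weight `alpha`, and the `tolerance` are hypothetical choices for illustration, since the text does not state the values or the envelope definition actually used.

```python
from statistics import median


def ewma(values, alpha=0.1):
    """Exponentially weighted moving average of a sequence (left to right)."""
    smoothed = []
    for v in values:
        # The first window has no history, so it is smoothed against itself.
        prev = smoothed[-1] if smoothed else v
        smoothed.append(alpha * v + (1.0 - alpha) * prev)
    return smoothed


def depth_mask(window_depths, alpha=0.1, tolerance=0.25):
    """Return True for windows whose smoothed depth leaves the envelope.

    The envelope (median * (1 +/- tolerance)) is an assumption made for
    this sketch, not the definition used in the study.
    """
    center = median(window_depths)
    lower = center * (1.0 - tolerance)
    upper = center * (1.0 + tolerance)
    return [not (lower <= s <= upper) for s in ewma(window_depths, alpha)]
```

For example, a run of windows at depth 100 followed by a long run at depth 10 and a later run at depth 400 would have both anomalous runs flagged, with a short lag at each boundary because the average is one-directional. The same routine could in principle be applied to a per-window MQ0 ratio.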
Site-Level Quality Control

With the regional mask in hand, we defined a series of site-level quality control filters (Fig. S3A). Of the 22,974,737 mapped coordinates, 12,532,580 fell within the bounds of the regional exclusion mask. A further 129,411 were excluded due to an MQ0 ratio greater than or equal to 0.10, and 170,144 were excluded because more than 20 samples had missing genotypes, either due to an absence of sequencing reads or to a heterozygous maximum likelihood genotype (Fig. S3B). The remaining polymorphic sites had a median depth (across all samples) of 265, and we filtered out all sites whose depth was outside three median absolute deviations of this value, thus excluding 12,425 with depth above 371 and 141,512 below 159 (Fig. S3C). Finally, we culled 547 sites with a heterozygous maximum likelihood genotype in more than seven samples (Fig. S3D). This left 9,988,118 callable sites. Of 432 ISOGG SNPs with observed variation in our data,
393 pass the regional and mapping quality filters, and of these, just one failed the missingness filter and a further two the depth filter.

Genotype Calling

To call genotypes, we implemented a haploid model EM algorithm that treated allele frequency as the latent variable and used the homozygous state genotype likelihoods calculated by GATK. Genotypes with a heterozygous maximum likelihood state were classified as missing because calls in such cases were found to be disproportionately incompatible with the inferred phylogeny.

Validation

The false positive rate is kept low primarily by the fact that GATK generally requires at least 2 reads of support to identify a site as variable. In addition, we exclude sites incompatible with the phylogeny. Though this filter discards some genuine homoplasic variants, the class is enriched for false positives, and we have chosen to err on the side of conservatism. We consider three means of validation.

Sanger Sequencing

We validated Y chromosome genotypes for the 29 male HGDP samples at 46 sites using a combination of targeted PCR and Sanger sequencing (3 sites), and exome capture followed by Illumina sequencing (43 sites). Validation failed to yield data for two genotypes, and we compared the remaining 1,245 genotypes to the main data set to find a concordance rate of 99.92%. Just one genotype was discordant (M150, hg19 position 21869519, in HGDP00462). The genotype had zero sequencing reads of support, and the individual had been imputed to carry the reference allele whereas the validation data indicated that this sample actually carries the non-reference allele. Only one other sample, the nearest neighbor to HGDP00462, also carried the non-reference allele, and this illustrates the fact that it is impossible to properly impute missing genotypes for sites otherwise identified as singletons (Supplementary Text, “Imputation” section).
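As an illustration of the haploid EM step described under "Genotype Calling" above, the following is a minimal sketch of one common formulation, not the authors' implementation. It assumes that, at a single site, each sample contributes a pair of likelihoods for the reference and non-reference alleles (standing in for the homozygous-state genotype likelihoods), that the non-reference allele frequency is the quantity being iterated, and that per-sample allele posteriors are computed in the E-step. The function name, the starting value of 0.5, and the stopping rule are hypothetical. The sketch does not include the rule that sets genotypes with a heterozygous maximum likelihood state to missing.

```python
def haploid_em(likelihoods, n_iter=50, tol=1e-9):
    """Estimate the non-reference allele frequency at one site.

    likelihoods: list of (L_ref, L_alt) pairs, one per sample, all positive.
    Returns (frequency, posteriors), where posteriors[i] is the posterior
    probability that sample i carries the non-reference allele.
    """
    p = 0.5  # illustrative starting frequency (an assumption)
    post = []
    for _ in range(n_iter):
        # E-step: posterior of the non-reference allele for each sample.
        post = [p * la / (p * la + (1.0 - p) * lr) for lr, la in likelihoods]
        # M-step: the frequency is the mean posterior across samples.
        new_p = sum(post) / len(post)
        converged = abs(new_p - p) < tol
        p = new_p
        if converged:
            break
    return p, post
```

A haploid call for each sample can then be made by thresholding its posterior, for instance at 0.5; the threshold is likewise an illustrative choice.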
Minimally Diverged Samples

We also consider private variation among minimally diverged individuals to argue that sequencing errors are minimized in our study. Specifically, we observe a cluster of five Baka hgB2 samples with just a handful of singletons per lineage. This group approximates a replication set and thus gives tight upper bounds on the false positive variant rate.

Haplogroup Assignments

All HGDP haplogroup assignments were consistent with prior ISOGG designations.

Phylogenetic Inference

We used MEGA5 (31) to construct maximum likelihood phylogenetic trees.
mtDNA Analysis

mtDNA Pipeline

To call mitochondrial haplogroups, we converted sequences from the GRCh37 to the rCRS coordinate system and imported them into HaploGrep (32), which draws on the Phylotree database (33). We explicitly utilized data presented in Table 1 of Behar et al. (34) to polarize alleles for variants assigned to the most ancient split, that between hgL0 and the rest of the tree (Fig. S11). Whereas the mutation rate on the Y chromosome is sufficiently low that we could regard base substitutions as unique events and simply discard sites that were incompatible with the phylogeny, excluding sites would have been inappropriate for the mitochondrial genome, in which a much higher mutation rate has led to considerable homoplasy. To account for this, we split sites with multiple substitutions into pseudo-sites, each of which constitutes a unique event. We discarded a few mutational hotspot sites with evidence for more than four unique substitution events.

Calibration Based on mtDNA hgA2

Since there are far fewer segregating sites in the mitochondrial genome, and we only had seven hgA2 lineages, we used 108 publicly available hgA2 Native American sequences to calibrate. Kumar et al. (23) list 568 accession numbers for mitochondrial genomes, 134 of which belong to hgA2 and are of American descent. We downloaded the subset of 108 entries that included the full mtDNA sequence and, along with the GRCh37 reference sequence, conducted a multiple alignment using MUSCLE (35). We then called haplogroups, built a tree (Fig. S12), assigned variants to branches, and resolved homoplasies as described above.

Frequentist Estimation of TMRCA

The Molecular Clock

Under the infinite sites model, mutations accumulate in a Poisson process of rate µl, the locus-wide mutation rate. To estimate TMRCA, molecular clock approaches first estimate the mean number of derived mutations per lineage and then divide by an estimate of the mutation rate.
For both the Y chromosome and the mtDNA, we estimate TMRCA with:

$$\hat{T} = \frac{\bar{D}}{\hat{\mu}_{ly}},$$

where $\bar{D}$ is the sample average of { Di }, the inferred number of mutations accumulated by each lineage since the global MRCA:

$$\bar{D} = \frac{1}{n} \sum_{i=1}^{n} D_i.$$
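The point estimator above amounts to a mean followed by a division, as the following minimal sketch shows. The function name and the numbers in the usage note are hypothetical and chosen only to make the arithmetic easy to follow; they are not values from the study, and the yearly locus-wide mutation rate is taken as given here.

```python
def tmrca_point_estimate(lineage_mutations, mu_per_year):
    """Molecular clock estimate of TMRCA in years.

    lineage_mutations: inferred mutation counts D_i, one per lineage,
        accumulated since the global MRCA.
    mu_per_year: estimate of the locus-wide mutation rate per year.
    """
    d_bar = sum(lineage_mutations) / len(lineage_mutations)  # sample mean of D_i
    return d_bar / mu_per_year
```

With illustrative counts of 900, 1000, and 1100 mutations and an illustrative rate of 0.01 mutations per year across the locus, the mean is 1000 mutations and the estimate is about 100,000 years.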
We estimated the { Di } using a maximum likelihood phylogeny (Fig. 2), and we estimate the yearly mutation rate, µly, as:

$$\hat{\mu}_{ly} = \frac{\bar{C}}{t},$$

where t is the known TMRCA of the calibration subclade and $\bar{C}$ is the sample average of { Ci }, the number of derived mutations acquired by each lineage since the common ancestor of the subtree:

$$\bar{C} = \frac{1}{n_c} \sum_{i=1}^{n_c} C_i.$$

Here nc is the number of individuals within the calibration subclade. $\hat{T}$ is therefore a scaled ratio of two random variables:

$$\hat{T} = t \, \frac{\bar{D}}{\bar{C}}.$$

TMRCA Confidence Intervals

From the frequentist perspective, we consider T a fixed but unknown constant, and we are interested in the sampling variance of our estimator conditional on its true value. Since the calibration subtree is a small fraction of the total tree, $\bar{D}$ and $\bar{C}$ are approximately uncorrelated. This fact simplifies the expression for the standard deviation of a ratio of random variables, which is obtained using the δ method (36):

$$\sigma_{\hat{T}|T} \approx \frac{t}{\bar{C}} \sqrt{\left( \frac{\bar{D}}{\bar{C}} \, \sigma_{\bar{C}} \right)^{2} + \sigma^{2}_{\bar{D}|T}}.$$

Since both $\bar{D}$ and $\bar{C}$ are sums of Poisson random variables with a large number of total events, each is well approximated by the normal distribution. Consequently, their ratio is also approximately normally distributed (37). Therefore, if we are able to compute $\sigma_{\bar{D}|T}$ and $\sigma_{\bar{C}}$, we can construct a confidence interval for T. We first consider $\sigma_{\bar{D}|T}$. The { Di } are identically Poisson distributed, but they are not independent due to the shared internal branches (3). Thus,

$$\sigma^{2}_{\bar{D}|T} = \mathrm{Var}[\bar{D} \mid T] = \frac{1}{n^{2}} \left[ \sum_{i} \mathrm{Var}[D_i \mid T] + 2 \sum_{i} \sum_{j>i} \mathrm{Cov}[D_i, D_j \mid T] \right].$$

Since each Di is a Poisson random variable, its variance is equal to its mean. Now consider samples i and j. The numbers of mutations that have accumulated in each since
their MRCA are independent. However, they share all mutations possessed by their MRCA. Thus,

$$\mathrm{Cov}[D_i, D_j \mid T] = D_{ij},$$

where Dij is the number of derived variants possessed by the common ancestor of i and j. Let I denote the set of internal branches, and let bs and bl be the number of descendants and the length of a branch, b, respectively. Each internal branch will be shared by bs choose 2 pairs of individuals. Thus,

$$2 \sum_{i} \sum_{j>i} \mathrm{Cov}[D_i, D_j \mid T] = 2 \sum_{b \in I} \binom{b_s}{2} b_l = \sum_{b \in I} b_s (b_s - 1) b_l,$$

which gives:

$$\sigma_{\bar{D}|T} = \frac{1}{n} \sqrt{\sum_{i} D_i + \sum_{b \in I} b_s (b_s - 1) b_l}.$$

An identical argument applies to $\sigma_{\bar{C}}$ within the calibration subtree. We, therefore, construct a 95% confidence interval for TMRCA as:

$$T = \hat{T} \pm z_{0.025} \cdot \sigma_{\hat{T}|T}$$

$$T = t \left[ \frac{\bar{D}}{\bar{C}} \pm z_{0.025} \cdot \frac{1}{\bar{C}} \sqrt{\left( \frac{\bar{D}}{\bar{C}} \, \sigma_{\bar{C}} \right)^{2} + \frac{1}{n^{2}} \left( \sum_{i} D_i + \sum_{b \in I} b_s (b_s - 1) b_l \right)} \right].$$

The bias of the point estimator is minimal (36).

Precision of TMRCA Estimation

The standard error for the mean estimate of a Poisson random variable with mean µlT is $\sqrt{\mu_l T / n}$, so the coefficient of variation (the ratio of the standard error to the mean) declines in inverse proportion to $\sqrt{n \mu_l T}$. On the Y chromosome, T is large and, because the non-recombining locus is so long, µl is quite large as well. Consequently, the standard error for estimating the mean branch length is relatively small, and the greater source of uncertainty lies in estimating the mutation rate, where the time intervals over which mutations have accumulated are shorter, and the number of lineages is smaller. However, µl is sufficiently large that we could derive a narrow confidence interval based solely on the two hgQ lineages we had sequenced. In contrast, for the mtDNA, the uncertainty due to $\sigma_{\bar{D}|T}$ exceeds that due to $\sigma_{\bar{C}}$.

An Alternative Frequentist Estimator
An alternative frequentist estimator defines $\bar{D}$ as half the average mutational distance dij between pairs of individuals that span the ancestral root (3):

$$\bar{D} = \frac{1}{2 |L| |R|} \sum_{i \in L} \sum_{j \in R} d_{ij}.$$

Here, L and R represent sets of individuals on the left and right side of the root. This estimator is less well-suited to our data set. We have four Y hgA individuals on the left side of the tree and 65 individuals on the right side. This partition-based approach effectively upweights information from the hgA samples, since all distances are measured with respect to a member of this clade. However, we have lower effective coverage on the internal branches of hgA than elsewhere in the tree. This is due to both the lower number of samples and the fact that hgA lineages are highly diverged. Consequently, these are exactly the samples for which false negatives are of greatest potential impact. For the sake of comparison, the TMRCA point estimates from this approach are 134 ky and 118 ky for the Y chromosome and mtDNA, respectively.

Estimating the Ratio of mtDNA TMRCA to Y TMRCA

To compare the TMRCA of the Y chromosome to that of the mtDNA, we estimate the ratio:

$$\gamma = \frac{T_m}{T_y} = \frac{t_m M}{t_y Y} = \tau R,$$

where we define M and Y as the fixed but unknown unscaled TMRCA of the mtDNA and Y respectively, and R as the ratio M / Y. The quantity τ = tm / ty is the ratio of coalescence times of the Native American lineages, mtDNA hgA2 and Y chromosome hgQ. Our estimator of γ is:

$$\hat{\gamma} = \tau \hat{R} = \tau \, \frac{\hat{M}}{\hat{Y}},$$

where

$$\hat{M} = \bar{D}_m / \bar{C}_m, \qquad \hat{Y} = \bar{D}_y / \bar{C}_y, \qquad \hat{R} = \hat{M} / \hat{Y}.$$

The standard error is:

$$\sigma_{\hat{\gamma}|\gamma} = \tau \, \sigma_{\hat{R}|M,Y}.$$

Since $\hat{R}$ is the ratio of two random variables, its standard error is:
$$\sigma_{\hat{R}|M,Y} \approx \frac{1}{E[\hat{Y} \mid Y]} \sqrt{\left( \frac{E[\hat{M} \mid M]}{E[\hat{Y} \mid Y]} \, \sigma_{\hat{Y}|Y} \right)^{2} + \sigma^{2}_{\hat{M}|M} - 2 \rho \, \sigma_{\hat{M}|M} \, \sigma_{\hat{Y}|Y} \, \frac{E[\hat{M} \mid M]}{E[\hat{Y} \mid Y]}},$$

where $\rho = \mathrm{Corr}[\hat{M} \mid M, \hat{Y} \mid Y]$. We cannot disregard the correlation term in this case. If the TMRCA of male and female lineages are correlated, their estimates will be as well, though the correlation of the estimates would necessarily be less than that of the true values due to the uncertainty in both variables. Confidence bands for γ are defined by:

$$\gamma = \tau \left[ \frac{\hat{M}}{\hat{Y}} \pm z_{0.025} \cdot \sigma_{\hat{R}|M,Y} \right].$$

To assume zero correlation would be conservative, as positive correlation reduces the variance. We consider representative values of ρ for the sake of comparison (Fig. S13). Again, the bias of the point estimator is minimal (36).

Empirical Bayesian Estimation of TMRCA and Ne: GENETREE

As distributed, GENETREE can handle only 99 sites per run, but we modified the source code to enable runs of several thousand SNPs. First, we perform a grid search to obtain a maximum likelihood estimate for the scaled mutation rate, θ = 2Neµlg, where µlg is the locus-wide per generation mutation rate. We then simulate the posterior distribution of TMRCA, conditional on this estimate. We restricted each analysis to a single population so that the assumption of exchangeability of lineages (38) would hold. As the TMRCA is determined by the deepest coalescence in a sample, we exclusively analyzed populations that sample from both sides of the tree (Fig. 2): the San and Baka for the Y chromosome and the Mbuti and Nzebi for the mitochondrial genome. Results from the Baka and Mbuti Pygmy populations are the most directly comparable (Table 1).

We excluded several lineages from the GENETREE analyses. In the Baka, we excluded three samples possessing high levels of autosomal identity by descent with another individual, as inferred with Illumina Omni SNP arrays. We also excluded six Baka hgE samples, as these likely represent West African agriculturalist lineages that have introgressed into the Baka a few thousand years ago (39) in violation of the exchangeability assumption of coalescent theory. In the mitochondrial analysis we removed two Nzebi and one Mbuti because GENETREE does not allow for identical lineages.

Point estimates for the Baka Y chromosomes reflect averages of multiple coalescent runs. Each run subsampled 1500 (of 2927) segregating sites to overcome computational limitations for the full dataset. Estimates for the Mbuti mtDNAs reflect averages of multiple coalescent runs, each with a different random seed, as these runs were more variable due to a smaller Poisson mean (nµl).
Coalescent theory measures time in units of Ne generations. To convert to years, we use the maximum likelihood estimate of θ, the gender-specific generation time (g; Table S3), and the Native American calibration estimate for µly, the locus-wide per year mutation rate:

$$\hat{N}_e = \frac{\hat{\theta}}{2 \hat{\mu}_{lg}} = \frac{\hat{\theta}}{2 g \hat{\mu}_{ly}},$$

$$\hat{T}_{MRCA} = \hat{T}_c \, \hat{N}_e \, g = \frac{\hat{T}_c \, \hat{\theta}}{2 \hat{\mu}_{ly}}.$$

GENETREE is suboptimal for our data set. Due to the exchangeability assumption and computational limitations, each analysis draws information from just a subset of the data. Because the full sequence data is highly informative about the underlying gene genealogy, very few random trees are compatible with it. This makes GENETREE a highly inefficient approach to estimating population genetic parameters. Thus, we emphasize the point estimates and confidence intervals derived from the frequentist approach.

Predata Distribution of TMRCA

For a constant population size, the TMRCA of a locus, measured in Ne generations, is given by:

$$T_{MRCA} = \sum_{i=2}^{n} T_i,$$

where Ti is the time during which i ancestral lineages of the sample existed. Coalescent theory (38) models Ti as an exponential random variable with parameter:

$$\lambda_i = \binom{i}{2}.$$

To obtain the distributions presented in Fig. 3, we simulated five million draws of TMRCA for n = 100 lineages and scaled each value by a factor of Ne·g to convert to years.
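The predata simulation described above can be sketched in a few lines. This is an illustrative reimplementation under the stated constant-size model, not the authors' code; the function names are hypothetical, and any values of Ne and g supplied by a caller are the caller's assumptions. The closed-form mean, 2(1 − 1/n) in units of Ne generations, follows from summing the exponential means 1/λi and serves as a sanity check on the simulation.

```python
import random


def simulate_tmrca(n, rng):
    """Draw one TMRCA, in units of Ne generations, for n sampled lineages."""
    total = 0.0
    for i in range(2, n + 1):
        rate = i * (i - 1) / 2.0  # "i choose 2" pairs of lineages can coalesce
        total += rng.expovariate(rate)  # waiting time while i lineages remain
    return total


def expected_tmrca(n):
    """Closed-form mean: sum over i of 1 / C(i, 2) = 2 * (1 - 1/n)."""
    return 2.0 * (1.0 - 1.0 / n)


def to_years(t_coalescent, effective_size, generation_time):
    """Rescale coalescent time by Ne * g to obtain years."""
    return t_coalescent * effective_size * generation_time
```

Calling `simulate_tmrca(100, random.Random(0))` repeatedly and averaging the draws should approach `expected_tmrca(100)`, which is 1.98; each draw can then be passed through `to_years` with a chosen Ne and g.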
Supplementary Text

Novel Y Chromosome Phylogenetic Structure

Haplogroup B2

Within hgB2, we identify one clade and three additional lineages that represent previously uncharacterized structure (Figs. 2, S7). Each lineage represents an ancient divergence within the Y chromosome phylogeny and carries no known differentiating mutations downstream of M192 and Page72, which define hgB2b1. First, in the main text we describe a subclade of B2b1a that encompasses six Baka individuals. Previously, B2b1a2 was associated with the P70 variant, but because these six Baka individuals carry the ancestral allele for P70, we propose reassociating P70 with a new label, “B2b1a2a,” and labeling the new clade “B2b1a2b.” Second, B2b1b was previously associated with P6, but we have identified a Mbuti individual carrying the ancestral allele for this variant. Thus, we propose associating P6 with a new label, “B2b1b1,” and designating the new lineage “B2b1b2.” Finally, we identify two new lineages within B2b1a1. The individuals representing both of these lineages carry the ancestral T allele for the M169 variant that defines B2b1a1a, the only extant sublineage of B2b1a1 not represented.

Haplogroup F

Table S2 presents genotyping results for the M578 variant in a separate panel of individuals. The results confirm the (G, H, IJK) → (G, (H, IJK)) polytomy resolution. The demographic fates of hgG and hgHIJK were geographically asymmetric, with the spread zone of hgG (40) considerably more restricted than that of hgHIJK (Fig. S6). The latter now spans all continents, including Africa due to the back migration of some haplogroups (41).

Imputation

We used our phylogeny-aware algorithm (Fig. S8) to impute approximately 5.3 missing genotypes per Y chromosome variant site and a median of 826 per individual.

Imputation Limitations

It is not possible to impute singletons: when the carrier of a unique allele has zero reads of support, there is no evidence for variation at the site.
Doubletons pose a similar problem. Let A and B be nearest neighbors in the phylogeny. Consider the case where, at a given site, A possesses an allele not observed in any other sample, and B has zero reads. It is impossible to distinguish whether the site is an A singleton or an A/B doubleton. However, conditional on one sample missing data at a particular site, our imputation strategy correctly imputes two thirds of tripletons; it fails only in the case where the lineage of the missing sample is the last to coalesce. For four lineages, there are 18 possible trees. Of these, twelve consist of stepwise coalescence, and the lineage with
missing data is the most diverged in just three. Thus, we correctly impute five-sixths of quadrupletons.

Polarizing Variants on the Branch Spanning the Ancestral Root

Our method to infer the ancestral state at a given site was inapplicable to the 398 variants assigned to the most ancient (basal) split, as no outgroup for these branches was present within the data set. For these, we first conducted a LiftOver (42) to map GRCh37 coordinates to those of the chimpanzee reference (PanTro3). Due to the abundance of large-scale inversions between the two chromosomes (17), it was necessary to BLAT (43) 101 bp chunks of DNA surrounding each human variant to infer relative orientation. Ancestral states were thereby inferred for 322 variants, and those of the remaining 76, for which the corresponding chimpanzee allele could not be inferred, were randomly assigned in the corresponding proportion.

Homoplasy and the Infinite Sites Model

We deemed a SNV consistent with the tree when we observed no ancestral alleles in the subtree rooted at the branch to which the SNV was assigned. Most variants (11,279) were consistent with the tree, and we imputed missing genotypes for those that were. Sites incompatible with the phylogeny were uniformly distributed across the callable regions (Fig. 1) and were excluded from downstream analyses. Just 199 (of 361) incompatibilities were supported by more than one sequencing read. This lack of homoplasy on the Y chromosome justifies usage of the infinite sites model.

Calibration and Mutation Rate Estimation

Mutation rate estimates are typically based on family pedigrees (14) or species phylogenies, such as the human-chimpanzee divergence (2, 3). However, just one pedigree-based rate is available for the Y chromosome, and, though the mutation process is highly stochastic, this rate is based on a single pedigree.
Furthermore, precise alignment between the human Y chromosome and that of the chimpanzee is difficult due to extreme structural divergence. Finally, if the Y is subject to a time-dependent mutation rate, as is mtDNA (24, 25), then neither estimation approach is ideal for dating human population events. Instead, we estimate mutation rates using a within-human calibration point, the initial migration into and expansion throughout the Americas. Well-dated archaeological sites include Paisley Cave in Oregon, which dates to 14.3 kya (19); Buttermilk Creek in Central Texas, at 13.2–15.5 kya (44); and Monte Verde II in Southern Chile, 14.6 kya (45). To date the expansion of genetic lineages unique to the Americas, we follow Goebel et al. who state that the most parsimonious estimate is that “humans colonized the Americas around 15 kya” (19). We show that a lack of parity between the expansion event and the divergence of lineages used for calibration would have minimal effect on the difference between the TMRCA of the Y and mtDNA if the divergences are within a few thousand years of one another (Fig. S13, Materials and Methods).
For reference and comparison, Table S3 summarizes mutation rate point estimates on four scales. The Y chromosome mutation rates are similar to previous autosomal phylogenetic-based mutation rates and extended pedigree-based rates, but they are almost two-fold higher than autosomal mutation rates based on trios (46).

Impact of Sequencing Error and Sequence Coverage on TMRCA Estimation

We developed a method to estimate the variance in estimated TMRCA that is due to the stochastic nature of the mutation process (Materials and Methods, “Frequentist Estimation of TMRCA” section). Here we discuss the potential impact of bias due to sequencing error and modest sequencing coverage. We have estimated TMRCA by calculating the ratio of two quantities, divergence and the mutation rate, each of which depends on experimental measurements. The numerator is the average tip-to-root height of the tree, and we estimate the denominator as the ratio of average branch length within the calibration subtree to the calibration time. Data for each of the three measurements is imperfect. In this section, we consider potential biases in the first two, and we consider calibration time in the next section.

Tip-to-Root Height

We measure tip-to-root height as the total number of SNVs assigned to all branches separating an individual from the common ancestor of all individuals. This sum includes the singletons of the terminal branch and the shared variants on the internal branches. Two factors act in opposition to stretch and shrink an observed branch length with respect to its true value: sequencing error and the total sequencing coverage of the branch, which itself is influenced both by sequencing coverage of individuals and by sampling density of the clade rooted at the branch. The primary effect of sequencing error is to stretch terminal branches, as it is unlikely that random sequencing errors will cluster phylogenetically.
We have demonstrated that genotype error is minimal (Materials and Methods, “Validation” section). Consequently, branch lengths are not significantly inflated by sequencing error. Though modest sequencing coverage translates to unobserved variants near the tips of the tree, thereby shortening observed heights, the internal branches of the tree, which constitute the overwhelming majority of any tip-to-root path, have quite high coverage due to the superposition of sequencing from all descending lineages. Thus, most observed internal branch lengths cannot differ significantly from their true lengths. Fortunately, the most divergent sample with the longest terminal branch, the San individual in the hgA-M51 clade, had higher than average sequencing coverage (6.15×) and, consequently, call rate (0.985). We observed 1012 private variants in this individual, and we estimate approximately 22 false negatives: unobserved variants with either a no-call genotype or just one sequencing read, an event insufficient to identify a site as variable. This worst-case scenario is less than 2% of the average tip-to-root height. We likely have very few false negatives in other individuals, even among those of lower coverage, since the lower coverage samples are clustered in the densely sampled portions of the tree, such as in hgE and portions of hgB, and the imputation strategy we’ve implemented enables these lineages to receive credit for variation detected in neighbors and which they can be
  20. 20. 15 inferred to possess. Finally, the maximum observed tip-to-root height (1188), could be considered a conservative upper bound on the true mean, and it differs from the observed mean by just 5%. Branch Lengths in the Calibration Subtree We now consider how sequencing coverage affects branch lengths in the Y chromosome hgQ subtree used to estimate the mutation rate. We sequenced Mayan HGDP00856, a representative of hgQ-M3, to 5.7× coverage and Mayan HGDP00877, whose haplogroup is labeled hgQ-L54*(xM3) because it carries the L54 mutation but is ancestral at the M3 SNP, to an average depth of 8.5×. Had we sequenced the two Mayan lineages to lower coverage, we would have artificially boosted TMRCA estimates by underestimating the mutation rate. However, haploid coverage for the Mayan samples are high enough that false negatives have little impact on our calibration. The rate of false negatives is dominated by sites in the terminal branches of the tree with either zero or one sequencing read for a sample. When an individual has zero or one read at a shared SNP, we can usually impute its genotype, but it is not possible to impute singletons or to distinguish a singleton from a doubleton in the presence of missing data (Supplementary Text, “Imputation” section). Although missing singletons and misclassified doubletons have little impact on total branch length from the tips to the root of the entire tree, they are quite important for calibration because singletons constitute a significant portion of branch length within the calibration subtree. In our study, the shared hgQ branch is of approximately the same length as the Q-M3 and Q-L54*(xM3) terminal branches. Consequently, no-call genotypes at singletons sites, which lead to missing singletons, are counterbalanced by no-call genotypes in the shared hgQ branch, which lead to doubletons misclassified as singletons. 
This relies on the fact that at 5.7× and 8.5× coverage, the no-call rates on the doubleton and singleton branches are comparable. In general, a no-call due to the presence of just a single sequencing read is less likely to occur on the doubleton branch than on the singleton branch, but of the 9,988,118 callable sites only 194,966 (2.0%) and 23,989 (0.2%) are covered by just one read in HGDP00856 and HGDP00877, respectively.

To empirically estimate the false negative rate within the hgQ subtree used for calibration, we incorporated data from the 1000 Genomes Project (47). We downloaded genotype calls (VCF files) for 525 males from Phase 1, called haplogroups, and identified eleven individuals belonging to hgQ¹. We then downloaded aligned sequence data (BAM files) for these samples, converted from the GRCh37 to hg19 reference, and applied our pipeline to the combined set of 80 individuals (Fig. S9). In the combined analysis, the branch shared by all hgQ lineages grew from 136 to 146 SNPs². One SNP had not been called in either HGDP sample (hg19 position 15825218), and nine SNPs were no-calls in HGDP00856: three due to the absence of reads, and six due to one erroneous read (of 4–10). With perfect data, these nine SNPs would have been classified as doubletons, but they were instead misclassified as HGDP00877 singletons. Thus, for HGDP00856, we can estimate the no-call rate within the hgQ subtree, β0 ≈ 6.8% (10 / 146). Partly because the coverage is higher, we observed no doubletons misclassified as singletons due to missingness in HGDP00877³. Thus, for HGDP00877, β0 ≈ 0.7% (1 / 146).

¹ A twelfth, NA19753, was sequenced using SOLiD. We did not include this sequence in our analysis since it is likely to have different error and mapping properties than those generated by Illumina technology.

² The exact length is 149, but the difference includes two SNPs that were on the borderline of the depth-based filter in the main study and a net of one SNP discarded due to homoplasy: two in the main study and one in the combined analysis.

Whereas on the shared doubleton branch the no-call rate should sufficiently inform the type 2 error rate (βd ≈ β0), the no-call rate does not provide complete information for the terminal branches since GATK, prudently, will most often not designate a site as variable if there is just one sequencing read with the alternative allele in the entire sample. Thus, to fully model the singleton type 2 error rate, βs, we must also consider the probability of observing just one read, β1, since when this occurs at a singleton site, a false negative will most often result. To do so, we computed the sequencing read depth distribution over all ten million callable sites for each sample. Scaling this empirical probability mass function by the number of singletons observed in the individual and censoring to discard the zero-read and one-read bins, we observe that when coverage exceeds 4×, the expected read-depth distribution among singletons closely mirrors the observed distribution (Fig. S10). This suggests that there are few false negatives at sites for which at least two sequencing reads are observed. Thus, βs ≈ β0 + β1.

When a branch with false negative rate β has true length L and observed length Y, the number of unobserved variants, X, is given by:

\[ X = \beta L = \frac{\beta}{1 - \beta}\, Y. \]

On the HGDP00856 singleton branch, we have Y = 126 and, from the empirical read-depth distribution, β1 = 2.0%. Thus, βs ≈ β0 + β1 = 6.8% + 2.0% = 8.8%, which gives X ≈ 12.2 missing singletons.
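The correction for unobserved variants can be checked numerically with a short Python sketch. It assumes the relation X = βL with Y = (1 − β)L, so that X = βY / (1 − β), and uses the HGDP00856 values quoted in the text; the function and variable names are illustrative and not part of the original pipeline.

```python
def unobserved_variants(observed_length, false_negative_rate):
    """Expected number of missed variants, X = beta / (1 - beta) * Y."""
    if not 0.0 <= false_negative_rate < 1.0:
        raise ValueError("false_negative_rate must lie in [0, 1)")
    return false_negative_rate / (1.0 - false_negative_rate) * observed_length


# Values quoted in the text for the HGDP00856 singleton branch.
beta_0 = 0.068         # no-call rate within the hgQ subtree (10 / 146)
beta_1 = 0.020         # fraction of callable sites covered by exactly one read
beta_s = beta_0 + beta_1
observed_length = 126  # observed singleton branch length, Y

missing_856 = unobserved_variants(observed_length, beta_s)
true_length_856 = observed_length + missing_856  # implied true length, L

print(round(missing_856, 1), round(true_length_856, 1))
```

With these inputs the sketch reproduces the estimate of roughly 12.2 missing singletons for HGDP00856.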
This is likely an overestimate because the no-call rate across all variable sites, 2.2% (Table S1), is lower than the empirical rate within the subtree, 6.8%. The branch shared by all hgQ-M3 lineages (branch 18 in Fig. S9) affords an opportunity to empirically check the singleton false negative rate for HGDP00856, since this individual should possess each of these variants. We had correctly called 16 of 17 in our main analysis. This suggests a singleton false negative rate for this sample of 1/17 = 5.9%⁴, but the variance for this particular estimate is quite high since it is based on just 17 sites, so to be conservative, we use the value of 8.8% estimated above.

For HGDP00877, we have Y = 120 and β1 = 0.2%, which give βs ≈ 0.7% + 0.2% = 0.9%, and X ≈ 1.1 missing singletons. This prediction cannot be tested empirically with these data because the lineage is an outgroup to the two hgQ-L54*(xM3) sequences from the 1000 Genomes Project. As discussed above, there were nine doubletons previously classified as HGDP00877 singletons, so accounting for type 2 errors reduces this branch length by 7.9 (9 − 1.1).

³ It is possible that one such SNP exists and is missing in all three hgQ-L54*(xM3) sequences, but this is a low probability event.

⁴ The lone false negative occurred at hg19 position 22613361. Prior to imputation, we do make the correct call in the combined analysis, because one read was present, and it carried the derived A allele.

Putting these two together, we compute the average branch length since the MRCA of the two samples as 125 SNPs, which differs from the observed value of 123 by 1.6%. Thus, one might wish to scale our Y chromosome TMRCA estimates by a factor of 123 / 125 = 0.984. However, the effect of false negatives would be offset by false positives, should one or two exist, so we choose not to. False negatives are not an issue for mitochondria, where all sequences are complete.

Calibration Time

In light of the above, the largest potential source of bias is the calibration time: the dating of the arrival of humans into the Americas and the approximation of synchronicity of this arrival with phylogenetic divergences.

Timing of Expansion into the Americas

Archaeological dates for the time of first arrival in the Americas range from 14.3 to 16.5 ky. Goebel, et al. (19) conclude that the most parsimonious estimate is that “humans colonized the Americas around 15 kya,” so we elect 15 ky as a reasonable figure for both the maternal and paternal loci. If the true divergence time of American lineages were 14.3 ky, one would need to scale down the TMRCA ranges we report by about 5%. Likewise, for 16.5 ky, an increase of 10% would be required. However, the specific number used will have no effect on the relative TMRCA estimates for the two loci, provided the divergences of the two loci were contemporaneous. We consider the case of unequal split times in Fig. S13 (Materials and Methods, “Estimating the Ratio of mtDNA TMRCA to Y TMRCA” subsection).

Y Chromosome Calibration Point

With 108 sampled lineages, the point of rapid expansion within the Americas among mtDNA hgA2 lineages is clear. However, the corresponding point within Y hgQ is less so.
Though we have argued that M3 most likely occurred shortly subsequent to initial entry to the Americas, it remains possible that hgQ-M3 and hgQ-L54*(xM3) diverged within Siberia or Beringia. When we include lower coverage 1000 Genomes hgQ lineages, we observe a star-like diversification among the Q-M3 derived lineages (Figure S9, below branch #18). It is possible that some subset of the 17 M3-equivalent mutations accumulated prior to entry—within Beringia, for example, as has been proposed for mtDNA founding lineages (48). However, 12 of the 13 sequenced individuals are from Mexico, and this sampling bias could obscure a more upstream initiation of the expansion. For example, it is possible that hgQ-M3 lineages within Greenland do not share all 17 of these mutations. Because just three sequences represent hgQ-L54*(xM3), the phylogenetic structure of this subhaplogroup remains largely unknown, but the root of the sampled hgQ-M3 lineages can be used to calculate a strict lower bound on the mutation rate, as entry to the Americas certainly happened no later than this point.
The 1000 Genomes lineages are inappropriate to calibrate upon due to lower sequencing coverage (average = 2.9×; Supplementary Text, “Branch Lengths in the Calibration Subtree” subsection), so we are left with a single lineage from our sample, HGDP00856, for this lower bound calculation. Accounting for false negatives had little effect when two samples were used for calibration, as the degree to which the hgQ-M3 branch grew was offset by a corresponding shrinkage of the hgQ-L54*(xM3) branch due to the hgQ doubletons that were unobserved in HGDP00856 and thereby misclassified as HGDP00877 singletons. However, it is important to correct for type 2 errors when considering this lineage alone. In the main analysis, the observed length of the M3 lineage was 126 mutations. This breaks down to 16 observed M3-equivalent SNPs and 110 post-M3 SNPs. Using a singleton false negative rate of 8.8%, this translates to approximately 10.6 (0.088 × 110 / (1 − 0.088)) unobserved post-M3 SNPs, which gives a calibration length of 120.6 SNPs. This differs from the calibration used in the main text by 1.9%.

Existence of Rare Yet More Basal Lineages

We emphasize that the estimates we derive refer to the coalescence times within our sample. For the mitochondrial genome, we have likely sampled the most divergent branches in the tree (34). However, for the Y chromosome, our estimate of the TMRCA reaches only as far back as the A1b clade. Inclusion of samples from hgA1a or the newly discovered hgA0 (5) or hgA00 (49) would push the date further back. However, these haplogroups are very rare, and it is difficult to assess whether correspondingly divergent but singular mitochondrial genomes may also await discovery.

Effective Population Size

The Ne differences we observe between males and females are most likely due to a greater variance in reproductive success among males, a phenomenon influenced by cultural and demographic factors, such as the practice of polygyny (50). Both purifying and positive selection could also act to reduce the Ne along the linked regions of the Y chromosome. However, both forms of selection may have also acted on the mitochondrial genome. Additional information would be necessary before one could invoke natural selection as the primary cause of reduced male Ne, and the hypothesis is neither necessary nor sufficient.

Additional Acknowledgements

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1147470. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Fig. S1. Map of populations. We sampled Y chromosomes and mtDNAs from nine populations including Baka Pygmies from Gabon, Cambodians, Maya from Mexico’s Yucatán Peninsula, Mbuti Pygmies from the Democratic Republic of Congo, Mozabite Berbers from Algeria, Nzebi from Gabon, Pashtuns (Pathan) from Pakistan’s North-West Frontier Province, San from Namibia, and Yakut from Siberia.
[Map legend: Baka, Cambodian, Maya, Mbuti, Mozabite, Nzebi, Pashtun, San, Yakut.]
Fig. S2. Sequencing read mapping on Xq21. Total read depth and the depth of MQ0 reads are plotted for 24 HGDP females. Mean values in contiguous 5 kb windows are shown along chrXq21. Dashed gray lines indicate the region that corresponds to the “X-transposed” segment of the Y chromosome.
[Plot: x-axis, chrX Position (Mb), 85 to 96; y-axis, Depth in HGDP Females, 0 to 300; legend, Filtered Depth and MQ0 Depth; annotation, Homologue of X-transposed Region.]
