The American Journal of Human Genetics - Best of 2011 & 2012


The best publications of The American Journal of Human Genetics, 2011-2012



Transcript

  • 7. Foreword
We are pleased to introduce a new series of “Best of…” reprint collections from Cell Press, which give us a chance to reflect on what has caught the attention of AJHG readers in late 2011 and early 2012. This collection includes a selection of eight of the most-accessed research articles across a range of topics and the most highly accessed review article of 2012. To select the articles, we considered the number of requests for PDF and full-text HTML versions of a given article. Half of the articles were published in the last six months of 2011 and half were published between January and June of 2012; in doing so, we are able to capture the full spectrum of articles that have been published during the past 12 months.
We acknowledge that no single measurement can truly be indicative of “the best” research papers over a given period of time. This is especially true when sufficient time has not necessarily passed to allow one to fully appreciate the relative importance of a discovery. That said, we think it is still informative to look back at the scientific community’s interests in what has been published in AJHG over the past year. In this collection, you will see a range of the exciting topics that have widely captured the attention and enthusiasm of our readers, including genome-wide association studies, evolutionary and population genetics, genetics of disease, and new approaches for analyzing sequencing data.
We hope that you will enjoy reading this special collection and that you will visit http://www.cell.com/AJHG/home to check out the latest findings that we have had the privilege to publish. To stay on top of what your colleagues have been reading over the past 30 days, check out http://www.cell.com/AJHG/top20. Also be sure to visit http://www.cell.com to find other high-quality papers published in the full collection of Cell Press journals.
Finally, we are grateful for the generosity of our sponsors, who helped make this reprint collection possible.
For information about the Best of series, please contact: Jonathan Christison, Program Director, Best of Cell Press, jchristison@cell.com, 617-397-2893
  • 9. Best of 2011 and 2012 (Volume 89 and Volume 90)
Denisova Admixture and the First Modern Human Dispersals into Southeast Asia and Oceania
    David Reich, Nick Patterson, Martin Kircher, Frederick Delfin, Madhusudan R. Nandineni, Irina Pugach, Albert Min-Shan Ko, Ying-Chin Ko, Timothy A. Jinam, Maude E. Phipps, Naruya Saitou, Andreas Wollstein, Manfred Kayser, Svante Pääbo, and Mark Stoneking
Rare-Variant Association Testing for Sequencing Data with the Sequence Kernel Association Test
    Michael C. Wu, Seunggeun Lee, Tianxi Cai, Yun Li, Michael Boehnke, and Xihong Lin
Expansion of Intronic GGCCTG Hexanucleotide Repeat in NOP56 Causes SCA36, a Type of Spinocerebellar Ataxia Accompanied by Motor Neuron Involvement
    Hatasu Kobayashi, Koji Abe, Tohru Matsuura, Yoshio Ikeda, Toshiaki Hitomi, Yuji Akechi, Toshiyuki Habu, Wanyang Liu, Hiroko Okuda, and Akio Koizumi
A Mutation in a Skin-Specific Isoform of SMARCAD1 Causes Autosomal-Dominant Adermatoglyphia
    Janna Nousbeck, Bettina Burger, Dana Fuchs-Telem, Mor Pavlovsky, Shlomit Fenig, Ofer Sarig, Peter Itin, and Eli Sprecher
Five Years of GWAS Discovery
    Peter M. Visscher, Matthew A. Brown, Mark I. McCarthy, and Jian Yang
Mitochondrial DNA and Y Chromosome Variation Provides Evidence for a Recent Common Ancestry between Native Americans and Indigenous Altaians
    Matthew C. Dulik, Sergey I. Zhadanov, Ludmila P. Osipova, Ayken Askapuli, Lydia Gau, Omer Gokcumen, Samara Rubinstein, and Theodore G. Schurr
A “Copernican” Reassessment of the Human Mitochondrial DNA Tree from its Root
    Doron M. Behar, Mannis van Oven, Saharon Rosset, Mait Metspalu, Eva-Liis Loogväli, Nuno M. Silva, Toomas Kivisild, Antonio Torroni, and Richard Villems
Age-Related Somatic Structural Changes in the Nuclear Genome of Human Blood Cells
    Lars A. Forsberg, Chiara Rasi, Hamid R. Razzaghian, Geeta Pakalapati, Lindsay Waite, Krista Stanton Thilbeault, Anna Ronowicz, Nathan E. Wineinger, Hemant K. Tiwari, Dorret Boomsma, Maxwell P. Westerman, Jennifer R. Harris, Robert Lyle, Magnus Essand, Fredrik Eriksson, Themistocles L. Assimes, Carlos Iribarren, Eric Strachan, Terrance P. O’Hanlon, Lisa G. Rider, Frederick W. Miller, Vilmantas Giedraitis, Lars Lannfelt, Martin Ingelsson, Arkadiusz Piotrowski, Nancy L. Pedersen, Devin Absher, and Jan P. Dumanski
Rare Mutations in XRCC2 Increase the Risk of Breast Cancer
    D.J. Park, F. Lesueur, T. Nguyen-Dumont, M. Pertesi, F. Odefrey, F. Hammet, S.L. Neuhausen, E.M. John, I.L. Andrulis, M.B. Terry, M. Daly, S. Buys, F. Le Calvez-Kelm, A. Lonie, B.J. Pope, H. Tsimiklis, C. Voegele, F.M. Hilbers, N. Hoogerbrugge, A. Barroso, A. Osorio, the Breast
On the cover: Whole-mount preparation of a mouse cochlea, immunolabeled with myosin VIIa in green, DAPI in blue, and phalloidin in red to stain hair cells, nuclei, and actin, respectively. The background sequence is that of connexin 26, the most commonly mutated gene in deaf individuals. Image courtesy of Shaked Shivatzki and Karen Avraham, Tel Aviv University, Tel Aviv, Israel. Support: grant R01 DC011835 from the National Institute on Deafness and Other Communication Disorders, National Institutes of Health. This image was the winner of the 2012 ASHG GenArt competition.
  • 10. ARTICLE
Denisova Admixture and the First Modern Human Dispersals into Southeast Asia and Oceania
David Reich,1,2,* Nick Patterson,2 Martin Kircher,3 Frederick Delfin,3 Madhusudan R. Nandineni,3,4 Irina Pugach,3 Albert Min-Shan Ko,3 Ying-Chin Ko,5 Timothy A. Jinam,6 Maude E. Phipps,7 Naruya Saitou,6 Andreas Wollstein,8,9 Manfred Kayser,9 Svante Pääbo,3 and Mark Stoneking3,*
It has recently been shown that ancestors of New Guineans and Bougainville Islanders have inherited a proportion of their ancestry from Denisovans, an archaic hominin group from Siberia. However, only a sparse sampling of populations from Southeast Asia and Oceania were analyzed. Here, we quantify Denisova admixture in 33 additional populations from Asia and Oceania. Aboriginal Australians, Near Oceanians, Polynesians, Fijians, east Indonesians, and Mamanwa (a “Negrito” group from the Philippines) have all inherited genetic material from Denisovans, but mainland East Asians, western Indonesians, Jehai (a Negrito group from Malaysia), and Onge (a Negrito group from the Andaman Islands) have not. These results indicate that Denisova gene flow occurred into the common ancestors of New Guineans, Australians, and Mamanwa but not into the ancestors of the Jehai and Onge and suggest that relatives of present-day East Asians were not in Southeast Asia when the Denisova gene flow occurred. Our finding that descendants of the earliest inhabitants of Southeast Asia do not all harbor Denisova admixture is inconsistent with a history in which the Denisova interbreeding occurred in mainland Asia and then spread over Southeast Asia, leading to all its earliest modern human inhabitants. Instead, the data can be most parsimoniously explained if the Denisova gene flow occurred in Southeast Asia itself. Thus, archaic Denisovans must have lived over an extraordinarily broad geographic and ecological range, from Siberia to tropical Asia.
Introduction
The history of the earliest arrival of modern humans in Southeast Asia and Oceania from Africa remains controversial. Archaeological evidence has been interpreted to support either a single wave of settlement [1] or, alternatively, multiple waves of settlement, the first leading to the initial peopling of Southeast Asia and Oceania via a southern route and subsequent dispersals leading to the peopling of all of East Asia [2]. Mitochondrial DNA studies have been interpreted as supporting a single wave of migration via a southern route [3–5], although other interpretations are possible [6,7], and single-locus studies are unlikely to resolve this issue [8]. The largest genetic study of the region to date, based on 73 populations genotyped at 55,000 SNPs, concluded that the data were consistent with a single wave of settlement of Asia that moved from south to north and gave rise to all of the present-day inhabitants of the region [9]. However, another study of genome-wide SNP data argued for two waves of settlement [10], as did an analysis of diversity in the bacterium Helicobacter pylori [11].
The recent finding that Near Oceanians (New Guineans and Bougainville Islanders) have received 4%–6% of their genetic material from archaic Denisovans [12] in principle provides a powerful tool for understanding the earliest human migrations to the region and thus for resolving the question of the number of waves of settlement. The Denisova genetic material in Southeast Asians should be easily recognizable because it is very divergent from modern human DNA. Thus, the presence or absence of Denisova genetic material in particular populations should provide an informative probe for the migration history of Southeast Asia and Oceania, in addition to being interesting in its own right. However, the populations previously analyzed for signatures of Denisova admixture [12] comprise a very thin sampling of Southeast Asia and Oceania.
In particular, no groups from island Southeast Asia or Australia were surveyed. Here, we report an analysis of genome-wide data from an additional 33 populations from south Asia, Southeast Asia, and Oceania; analyze the data for signatures of Denisova admixture; and use the results to infer the history of human migration(s) to this part of the world.
Material and Methods
SNP Array Data
We analyzed data for modern humans genotyped on Affymetrix 6.0 SNP arrays. We began by assembling previously published data for YRI (Yoruba in Ibadan, Nigeria) West Africans, CHB (Han Chinese in Beijing, China) Han Chinese, and CEU (Utah residents with Northern and Western European ancestry from the CEPH collection) European Americans from HapMap 3 [13]; Onge Andaman “Negritos” [14]; and New Guinea highlanders, Fijians, one Bornean population, and Polynesians from seven islands [10].
Author affiliations:
1 Department of Genetics, Harvard Medical School, Boston, MA 02115, USA
2 Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
3 Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Leipzig D-04103, Germany
4 Laboratory of DNA Fingerprinting, Centre for DNA Fingerprinting and Diagnostics, Nampally, Hyderabad 500 001, India
5 Center of Excellence for Environmental Medicine, Kaohsiung Medical University, Kaohsiung City 807, Taiwan
6 Division of Population Genetics, National Institute of Genetics, Yata 1111, Mishima, Shizuoka 411-8540, Japan
7 School of Medicine and Health Sciences, Monash University (Sunway Campus), Selangor 46150, Malaysia
8 Cologne Center for Genomics, University of Cologne, Cologne D-50931, Germany
9 Department of Forensic Molecular Biology, Erasmus MC University Medical Center Rotterdam, 3000 CA Rotterdam, The Netherlands
*Correspondence: reich@genetics.med.harvard.edu (D.R.), stoneking@eva.mpg.de (M.S.)
DOI 10.1016/j.ajhg.2011.09.005. ©2011 by The American Society of Human Genetics. All rights reserved.
516 The American Journal of Human Genetics 89, 516–528, October 7, 2011
  • 11. We also assembled data including two aboriginal Australian populations: one from the Northern Territories [15] and one from a human diversity cell line panel in the European Collection of Cell Cultures. The data also include nine Indonesian populations: four from the Nusa Tenggaras, two from the Moluccas, one from Borneo, and two from Sumatra. Finally, the data include three Malaysian populations (Temuan and Jehai [a Negrito group] both from the Malay peninsula, and Bidayuh from Sarawak on the island of Borneo), two Philippine populations (Manobo and a Negrito group, the Mamanwa), six aboriginal Taiwanese populations, one Dravidian population from southern India, and San Bushmen from southern Africa from the Centre d’Étude du Polymorphisme Humain (CEPH)-Human Genome Diversity Panel [16]. All volunteers provided informed consent for research into population history and the approval of appropriate local ethical review boards was obtained. This project was approved by the ethical review boards of the University of Leipzig Medical Faculty and Harvard Medical School. The genotype data that we analyzed for this study are available from the authors on request.
Merging Genotyping Data with Chimpanzee, Denisova, and Neandertal
We merged the SNP array data from modern humans with genome sequence data from chimpanzee (CGSC 2.1/PanTro2 [17]), Denisova [12], and Neandertal [18]. We eliminated A/T and C/G SNPs to minimize strand misidentification. After removing SNPs with low genotyping completeness, we had data for 353,143 autosomal SNPs.
Removal of Outlier Samples
We carried out principal components analysis by using EIGENSOFT [19]. We removed samples that were visual outliers relative to others from the same population on eigenvectors that were statistically significant by using a Tracy-Widom statistic (p < 0.05) [19], resulting in the removal of three YRI, two CHB, five Polynesians, one New Guinea highlander, two Jehai, and three Mamanwa.
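The strand-ambiguity filter mentioned under "Merging Genotyping Data" can be illustrated with a minimal sketch. A/T and C/G SNPs are dropped because each allele is the complement of the other, so the strand cannot be inferred from the alleles alone. The tuple layout and the SNP identifiers below are hypothetical and are not taken from the study's pipeline.

```python
# Complementary base pairs: an A/T or C/G SNP looks identical on both strands.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(allele1: str, allele2: str) -> bool:
    """Return True for A/T and C/G SNPs, whose two alleles are complements."""
    return COMPLEMENT[allele1.upper()] == allele2.upper()


def drop_ambiguous(snps):
    """Keep only SNPs whose strand can be inferred from the alleles.

    `snps` is an iterable of (snp_id, allele1, allele2) tuples
    (a hypothetical layout chosen for this illustration).
    """
    return [s for s in snps if not is_strand_ambiguous(s[1], s[2])]


# Made-up identifiers, for illustration only.
example = [("snp1", "A", "T"), ("snp2", "A", "G"), ("snp3", "C", "G"), ("snp4", "T", "C")]
kept = drop_ambiguous(example)
print([s[0] for s in kept])
```

Only the two SNPs whose alleles are not complementary survive the filter in this toy input.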
Sequencing Data
We prepared DNA sequencing libraries with 300 bp insert sizes from a Papua New Guinea highlander (SH10) and Mamanwa Negrito (ID36) individual by using a previously described protocol [12]. The two libraries were sequenced on an Illumina Genome Analyzer IIx instrument with 2 × 101 + 7 cycles according to the manufacturer’s instructions for multiplex sequencing (FC-104-400x v4 sequencing chemistry and PE-203-4001 cluster generation kit v4). Bases and quality scores were generated with the Ibis base caller [20], and the reads were aligned with the Burrows-Wheeler Aligner (BWA) software [21] to the human (NCBI 36/hg18) and chimpanzee (CGSC 2.1/pantro2) genomes with default parameters. The resulting BAM files were filtered as follows: (1) a mapping quality of at least 30 was required; (2) we removed duplicated reads with the same outer coordinates; and (3) we removed reads with sequence entropy < 1.0, calculated by summing −p·log2(p) for each of the four nucleotides. The sequencing data are publicly available from the European Nucleotide Archive (Project ID ERP000121), and summary statistics are provided in Table S1, available online.
Estimating Denisova pD(X), Near Oceanian pN(X), and Australian pA(X) Ancestry
We define the frequency of one of the alleles at a SNP i in population X as $z^i_X$.
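As an aside on read filter (3) in the Sequencing Data section, the entropy calculation can be sketched as follows. This is an illustration of the stated formula (the sum of −p·log2(p) over the four nucleotides), not the authors' code; ignoring non-ACGT characters such as N is an assumption made here.

```python
import math
from collections import Counter


def sequence_entropy(read: str) -> float:
    """Shannon entropy (in bits) of the nucleotide composition of a read."""
    # Assumption: characters other than A, C, G, T (e.g. N) are ignored.
    counts = Counter(base for base in read.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def passes_entropy_filter(read: str, threshold: float = 1.0) -> bool:
    """Reads with entropy below the threshold are removed; others are kept."""
    return sequence_entropy(read) >= threshold


print(passes_entropy_filter("ACGTACGTACGT"))  # balanced composition
print(passes_entropy_filter("AAAAAAAAAAAT"))  # low-complexity read
```

The maximum possible value is 2 bits (all four bases equally frequent), so a threshold of 1.0 discards reads dominated by one base.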
We can then compute three statistics for a given population X that are informative about admixture:

\[
p_D(X) = \frac{\sum_{i=1}^{n} \left(z^i_{\text{Outgroup}} - z^i_{\text{Archaic}}\right)\left(z^i_{\text{East Asian}} - z^i_{X}\right)}{\sum_{i=1}^{n} \left(z^i_{\text{Outgroup}} - z^i_{\text{Archaic}}\right)\left(z^i_{\text{East Asian}} - z^i_{\text{New Guinea}}\right)} = \frac{f_4(\text{Outgroup}, \text{Archaic}; \text{East Asian}, X)}{f_4(\text{Outgroup}, \text{Archaic}; \text{East Asian}, \text{New Guinea})} \tag{Equation 1}
\]

\[
p_N(X) = 1 - \frac{\sum_{i=1}^{n} \left(z^i_{\text{Outgroup}} - z^i_{\text{Australia}}\right)\left(z^i_{X} - z^i_{\text{New Guinea}}\right)}{\sum_{i=1}^{n} \left(z^i_{\text{Outgroup}} - z^i_{\text{Australia}}\right)\left(z^i_{\text{East Asia}} - z^i_{\text{New Guinea}}\right)} = 1 - \frac{f_4(\text{Outgroup}, \text{Australia}; X, \text{New Guinea})}{f_4(\text{Outgroup}, \text{Australia}; \text{East Asia}, \text{New Guinea})} \tag{Equation 2}
\]

\[
p_A(X) = 1 - \frac{\sum_{i=1}^{n} \left(z^i_{\text{Outgroup}} - z^i_{\text{New Guinea}}\right)\left(z^i_{X} - z^i_{\text{Australia}}\right)}{\sum_{i=1}^{n} \left(z^i_{\text{Outgroup}} - z^i_{\text{New Guinea}}\right)\left(z^i_{\text{East Asia}} - z^i_{\text{Australia}}\right)} = 1 - \frac{f_4(\text{Outgroup}, \text{New Guinea}; X, \text{Australia})}{f_4(\text{Outgroup}, \text{New Guinea}; \text{East Asia}, \text{Australia})} \tag{Equation 3}
\]

The right side of each equation shows that these statistics can also be expressed as ratios of f4 statistics [14], which provide unbiased estimates of admixture proportions even in the absence of populations that are closely related to the analyzed populations (Appendix A). For the ancestry estimates reported in Table 1, we use Outgroup = YRI (West Africans), Archaic = Denisova, and East Asian = CHB (Han Chinese). Table S2 and Table S3 demonstrate that consistent values are obtained when we replace these choices with a variety of distantly related populations. Further details are provided in Appendix A.
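The ratio in Equation 1 can be sketched in a few lines. The allele frequencies below are invented toy values, and the function names `f4` and `p_d` are chosen here for illustration. The test population is built as a 60/40 mixture of the East Asian and New Guinean frequencies, so the ratio should recover the New Guinean share of 0.4.

```python
def f4(z_a, z_b, z_c, z_d):
    """Sum over SNPs of (z_a - z_b) * (z_c - z_d), as in the equations above."""
    return sum((a - b) * (c - d) for a, b, c, d in zip(z_a, z_b, z_c, z_d))


def p_d(outgroup, archaic, east_asian, x, new_guinea):
    """Ratio of two f4 statistics, following Equation 1."""
    numerator = f4(outgroup, archaic, east_asian, x)
    denominator = f4(outgroup, archaic, east_asian, new_guinea)
    return numerator / denominator


# Toy allele frequencies at four SNPs (made up for this sketch).
outgroup = [0.10, 0.50, 0.90, 0.30]
archaic = [1.00, 0.00, 0.00, 1.00]
east_asian = [0.20, 0.40, 0.80, 0.35]
new_guinea = [0.40, 0.30, 0.70, 0.55]

# Hypothetical test population: 60% East Asian-like, 40% New Guinean-like.
mixed = [0.6 * e + 0.4 * g for e, g in zip(east_asian, new_guinea)]

print(round(p_d(outgroup, archaic, east_asian, mixed, new_guinea), 6))
```

Because the statistic is linear in the frequencies of X, the toy mixture returns its mixing fraction; with real data the estimate carries sampling noise, which is why a standard error is needed.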
Block Jackknife Standard Error and Statistical Testing
We used a block jackknife [22,23] to compute standard errors, dropping each nonoverlapping 5 cM stretch of the genome in turn and studying the variance of each statistic of interest to obtain an approximately normally distributed standard error [12,18]. To test whether pD(X), pN(X), pA(X), and pD(X) − pN(X) are statistically consistent with zero for any tested population X, we computed the statistics along with a standard error from the block jackknife, and then used a two-sided Z test that computes the number of standard errors from zero. To implement the 4 Population Test [14] for whether an unrooted phylogenetic tree ([A,B],[C,D]) relating four populations is consistent with the data, we computed the statistic f4(A,B;C,D) and assessed the number of standard errors from zero.
Results
Quantifying Denisova Admixture from Genome-wide SNP Data
To investigate which modern humans have inherited genetic material from Denisovans, we assembled SNP data from 33 populations from mainland East Asia, island Southeast Asia, New Guinea, Fiji, Polynesia, Australia, and India, and genotyped all of them on Affymetrix 6.0 arrays. After removing samples that were outliers with respect to
  • 12. Table 1. Estimates of Denisovan and Near Oceanian Ancestry from SNP Data
pD(X): Denisovan ancestry as % of New Guinea. pN(X): Near Oceanian ancestry. Est. = estimated ancestry; SE = standard error in the estimate; Z = Z score; p = p value for the difference pN(X) − pD(X).

                                                 ------ pD(X) ------   ------ pN(X) ------
Broad Grouping  Detailed              Code    N   Est.   SE     Z       Est.   SE     Z     p
New Guinea      Highlander            SH     24   100%   0%   n/a       100%   0%   n/a     n/a
Australian      all                          10   103%   6%  17.1        n/a  n/a   n/a     n/a
                Northern Territories  AU1     8   103%   6%  16.6        n/a  n/a   n/a     n/a
                Cell Cultures         AU2     2   103%   7%  14.1        n/a  n/a   n/a     n/a
Fiji            Fiji                  FI     25    56%   3%  17.7        58%   1%  94.6     0.38
Nusa Tenggaras  all                          10    40%   3%  12.8        38%   1%  54.7     0.34
                Alor                  AL      2    51%   6%   8.3        49%   1%  35.6     0.69
                Flores                FL      1    40%   8%   5.0        37%   2%  19.8     0.68
                Roti                  RO      4    27%   4%   6.4        27%   1%  29.4     0.85
                Timor                 TI      3    50%   5%   9.8        45%   1%  41.7     0.29
Philippines     all                          27    28%   3%   8.2         6%   1%  10.6     3.4 × 10^−10
                Mamanwa (N)           MA     11    49%   5%   9.2        11%   1%  11.4     1.5 × 10^−12
                Manobo                MN     16    13%   3%   4.2         4%   1%   5.7     0.0018
Moluccas        all                          10    35%   4%  10.1        34%   1%  46.0     0.59
                Hiri                  HI      7    35%   4%   9.0        32%   1%  38.4     0.36
                Ternate               TE      3    36%   5%   7.2        38%   1%  33.7     0.67
Polynesia       all                   PO     19    20%   4%   5.1        27%   1%  34.8     0.052
                Cook                          2    16%   6%   2.5        24%   1%  17.3     0.21
                Futuna                        4    28%   5%   5.3        29%   1%  26.9     0.87
                Niue                          1    27%   8%   3.3        30%   2%  16.3     0.72
                Samoa                         5    13%   5%   2.6        24%   1%  23.3     0.024
                Tokelau                       2    22%   6%   3.5        31%   1%  23.8     0.14
                Tonga                         2    17%   7%   2.5        31%   1%  22.5     0.027
                Tuvalu                        3    21%   6%   3.6        28%   1%  22.8     0.28
Andamanese      Onge (N)              AN     10    10%   6%   1.6         3%   1%   1.8     0.27
Taiwan          all                   TA     12     4%   3%   1.2         1%   1%   1.5     0.35
                Puyuma                        2     4%   6%   0.6         2%   1%   1.8     0.79
                Rukai                         2     0%   6%   0.0         2%   1%   1.6     0.74
                Paiwan                        2     5%   6%   0.8         3%   1%   2.2     0.67
                Atayal                        2    −5%   5%  −0.9         0%   1%   0.3     0.34
                Bunun                         2    12%   6%   2.1        −2%   1%  −1.6     0.01
                Pingpu                        2     7%   6%   1.2         1%   1%   1.1     0.30
Malaysia        all                          18     5%   3%   1.4         0%   1%  −0.2     0.16
                Jehai (N)             JE      8     7%   5%   1.4         1%   1%   0.8     0.21
                Temuan                TM     10     3%   4%   0.8        −1%   1%  −0.9     0.32
Sumatra         All                          17     4%   3%   1.4         0%   1%   0.3     0.17
                Besemah               BE      8     5%   3%   1.5         1%   1%   0.9     0.20
                Semende               SM      9     3%   4%   0.9         0%   1%  −0.3     0.31
  • 13. their own populations (reflecting admixture in the last few generations or genotyping error), we had data from 243 individuals (Table 1). We restricted the analysis to autosomal SNPs with high genotyping completeness and with data from the Denisova genome, leaving 353,143 SNPs.
To quantify the proportion of Denisova genes in each population X, we computed a statistic pD(X), which measures the proportion of Denisova genetic material in a population as a fraction of that in New Guineans. Our main analyses in Figure 1 and Table 1 compute pD(X) as a ratio of two f4 statistics [14], each of which measures the correlation in allele frequency differences between the two populations used as outgroups (Yoruba and Denisova) and two East or Southeast Asian populations (Han and X = tested population).

Table 1. Continued
                                                 ------ pD(X) ------   ------ pN(X) ------
Broad Grouping  Detailed              Code    N   Est.   SE     Z       Est.   SE     Z     p
Borneo          all                          49     1%   2%   0.6         1%   1%   1.3     0.79
                Bidayuh               BI     10     6%   4%   1.7         1%   1%   1.4     0.80
                Barito River          BO     23     0%   3%   0.2         1%   1%   1.7     0.18
                Land Dayak            DY     16     0%   3%  −0.1         0%   1%   0.2     0.94
India           Dravidian             SI     12    −7%   5%  −1.5        n/a  n/a   n/a     n/a
We provide each population’s estimated ancestry, the standard error in the estimate, and the Z score for deviation from zero (Z). Negrito populations are marked with (N). The New Guinea highlanders by definition have 100% Denisovan and 100% Near Oceanian ancestry because they are used as a reference population for computations. Results are not provided for Australians and Dravidians for whom the phylogenetic relationships do not allow the estimate (n/a). The last column reports the two-sided p value for a difference based on a block jackknife and a Z test.

Figure 1. Denisovan Genetic Material as a Fraction of that in New Guineans
[Map figure not reproduced in this transcript.] Populations are only shown as having Denisova ancestry if the estimates are more than two standard errors from zero (we combine estimates for populations in this study with analogous estimates from CEPH-Human Genome Diversity Panel populations reported previously [12]). No population has an estimate of Denisova ancestry that is significantly more than that in New Guineans, and hence we at most plot 100%. The sampling location of the AU2 population is unknown and hence the position of this population is not precise.
Population codes: AL Alor; AN Andaman (Onge); AU Australian; BE Besemah; BG Bougainville; BI Bidayuh; BO Borneo; CA Cambodia; DA Dai; DR Daur; DY Dayak; FI Fiji; FL Flores; HA Han; HE Hezhen; HI Hiri; JA Japan; JE Jehai; LA Lahu; MA Mamanwa; MI Miao; MN Manobo; MO Mongola; NA Naxi; NG New Guinea; OR Oroqen; PO Polynesia; RO Roti; SE She; SH S. Highlands; SI Southern India; SM Semende; TA Taiwan; TE Ternate; TI Timor; TJ Tujia; TM Temuan; TU Tu; UY Uygur; XI Xibo; YI Yi.
  • 14. If Han and X descend from a single ancestral population without any subsequent admixture from Denisova, then the allele frequency differences between Han and X must have arisen solely since their separation from their common ancestor, and the two frequency differences should be uncorrelated; thus, the f4 statistic has an expected value of zero. However, if population X inherited some of its ancestry from an archaic population related to Denisovans, then the allele frequency differences between Han and X will be correlated: the higher the admixture from the archaic population, the higher the correlation. Because the f4 statistic in the numerator uses X as the test population, and the f4 statistic in the denominator uses New Guinea as the test population, the ratio pD(X) estimates a quantity proportional to the percentage of Denisova ancestry qX; that is, the Denisova admixture fraction in X divided by that in New Guinea, qX/qNew Guinea (Appendix A).
We computed pD(X) for a range of non-African populations and found that for mainland East Asians, western Negritos (Jehai and Onge), or western Indonesians, pD(X) is within two standard errors of zero when a standard error is computed from a block jackknife (Table 1 and Figure 1). Thus, there is no significant evidence of Denisova genetic material in these populations. However, there is strong evidence of Denisovan genetic material in Australians (1.03 ± 0.06 times the New Guinean proportion; one standard error), Fijians (0.56 ± 0.03), Nusa Tenggaras islanders of southeastern Indonesia (0.40 ± 0.03), Moluccas islanders of eastern Indonesia (0.35 ± 0.04), Polynesians (0.20 ± 0.04), Philippine Mamanwa, who are classified as a “Negrito” group (0.49 ± 0.05), and Philippine Manobo (0.13 ± 0.03) (Table 1 and Figure 1).
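The "two standard errors" criterion above rests on a block jackknife. A minimal sketch of a delete-one-block jackknife for a ratio statistic follows. It assumes equally weighted blocks, which may differ from the weighting used in the study, and the per-block sums are invented toy numbers rather than real genomic data.

```python
import math


def jackknife_ratio(block_num, block_den):
    """Delete-one-block jackknife for a ratio of sums.

    block_num[i] and block_den[i] are the numerator and denominator
    contributions of genomic block i (e.g. per-block f4 sums).
    Returns (estimate, standard_error). Blocks are weighted equally,
    which is a simplifying assumption of this sketch.
    """
    n = len(block_num)
    total_num, total_den = sum(block_num), sum(block_den)
    estimate = total_num / total_den
    # Recompute the ratio with each block dropped in turn.
    leave_one_out = [
        (total_num - a) / (total_den - b) for a, b in zip(block_num, block_den)
    ]
    mean_loo = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((t - mean_loo) ** 2 for t in leave_one_out)
    return estimate, math.sqrt(variance)


# Toy per-block sums (made up): five blocks of equal denominator size.
num = [0.9, 1.1, 1.0, 1.2, 0.8]
den = [2.0, 2.0, 2.0, 2.0, 2.0]

estimate, se = jackknife_ratio(num, den)
z_score = estimate / se  # number of standard errors from zero
print(estimate, se, z_score)
```

A population would be reported as having detectable archaic ancestry under the criterion above when the absolute Z score exceeds 2.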
The New Guineans and Australians are estimated to have indistinguishable proportions of Denisovan ancestry (within the statistical error), suggesting Denisova gene flow into the common ancestors of Australians and New Guineans prior to their entry into Sahul (Pleistocene New Guinea and Australia), that is, at least 44,000 years ago [24,25]. These results are consistent with the Common Origin model of present-day New Guineans and Australians [26,27]. We further confirmed the consistency of the Common Origin model with our data by testing for a correlation in the allele frequency difference of two populations used as outgroups (Yoruba and Han) and the two tested populations (New Guinean and Australian). The f4 statistic that measures their correlation is only |Z| = 0.8 standard errors from zero, as expected if New Guineans and Australians descend from a common ancestral population after they split from East Asians, without any evidence of a closer relationship of one group or the other to East Asians. Two alternative histories, in which either New Guineans or Australians have a common origin with East Asians, are inconsistent with the data (both |Z| > 52).
To assess the robustness of these estimates of Denisova admixture proportion, we recomputed pD(X) for diverse choices of A (YRI, San, and chimpanzee), B (Denisova, Neandertal, and chimpanzee), C (CHB and Borneo), and X (17 different populations). For any population X, we obtain consistent estimates of the archaic mixture proportion, regardless of the choice of A, B, and C. Thus, the method is robust to the choice of comparison populations, suggesting that the underlying model of population relationships (Appendix A) provides a reasonable fit to the data and that our pD(X) ancestry estimates are reliable. For our main estimates of admixture proportion, we report results for A = YRI, B = Denisova, and C = CHB because Table S2 shows that the standard errors are smallest (in part because of larger sample sizes).
To test whether our estimates of pD(X) are robust to ascertainment bias (the complex ways that SNPs were chosen for inclusion on genotyping arrays originally designed for medical genetics studies), we also estimated Denisova admixture by using sequencing data. For this purpose, we generated new shotgun sequencing data from a Philippine Mamanwa individual (~1×) and a New Guinea highlander (~3×, from a different New Guinean group than the one sampled in the Human Genome Diversity Panel [16]). We merged these with data from Neandertal, Denisova, chimpanzee, and 12 present-day humans analyzed as part of the Neandertal and Denisova genome sequencing studies [12,18]. We then computed the same pD(X) statistics for the sequencing as for the genotyping data, replacing YRI with a Yoruba (HGDP00927), CHB with a Han (HGDP00778), and New Guinea with a Papuan sample (Papuan2; HGDP00551). Both the full sequence data and the SNP data produce consistent estimates of pD(X) (Table 2), suggesting that ascertainment bias is not influencing the pD(X) estimates from genome-wide SNP data.
Near Oceanian Ancestry Explains Denisovan Genes Outside of Australia and the Philippines
A parsimonious explanation for the Denisova genetic material that we detect in the non-Australian populations is the well-documented admixture that has occurred in many Southeast Asian and Oceanian groups between (1) Near Oceanian populations related to New Guineans and (2) populations from island Southeast Asia related to mainland East Asians, who are the primary populations of Taiwan and Indonesia today [28–31]. Thus, many groups might have Denisova admixture as an indirect consequence of their history of Near Oceanian admixture. For those populations whose Denisova ancestry is explained in this way, their fraction of Denisovan ancestry is predicted to be exactly proportional to their fraction of Near Oceanian ancestry.
To test this hypothesis, we designed a second statistic, pN(X), to estimate the fraction of a population's Near Oceanian ancestry, defined here as the proportion of its ancestry inherited from a population that is more closely related to New Guineans than to Australians (Appendix A). A virtue of pN(X) is that it provides an unbiased estimate of a population's Near Oceanian ancestry proportion even without access to close relatives of the ancestral populations (Appendix A), whereas previous estimators10,30 depend on the accuracy of the surrogate contemporary populations used to approximate the ancestral populations.
  • 15. We compared pD(X) and pN(X) for all relevant populations (Table 1, Figure 2, and Figure S1) and found that, allowing for sampling error, they occur in a one-to-one ratio for the populations from the Nusa Tenggaras, Moluccas, Polynesia, and Fiji. Common ancestry with Near Oceania thus can account for the Denisova genetic material in these groups.

A striking exception is observed in the two Philippine populations, neither of which conforms to this relationship: pD(Mamanwa) = 0.49 ± 0.05 versus pN(Mamanwa) = 0.11 ± 0.01 (p = 1.5 × 10⁻¹² for the difference) and pD(Manobo) = 0.13 ± 0.03 versus pN(Manobo) = 0.04 ± 0.01 (p = 0.0018) (Figure 2). An alternative hypothesis that could account for the Denisovan genetic material in the Philippines is common ancestry with Australians.32,33 We thus computed a third statistic, pA(X), that estimates the relative proportion of Australian ancestry (Appendix A). However, Australian ancestry cannot explain these patterns either: pD(Mamanwa) = 0.49 ± 0.05 versus pA(Mamanwa) = 0.13 ± 0.01 and pD(Manobo) = 0.13 ± 0.03 versus pA(Manobo) = 0.05 ± 0.01. The estimates of pN(X) and pA(X) are consistent for a variety of outgroups (Appendix A and Table S3). Thus, the Denisova genetic material in Mamanwa, as well as the smaller proportion in their Manobo neighbors, cannot be due to common ancestry with Near Oceanians or Australians after the two groups diverged from one another. In the following section, we focus on the Mamanwa because they have a higher proportion of Denisova genetic material and allow us to study the pattern at a higher resolution.
Modeling Denisova Admixture and Population History

To test whether the patterns observed in the Philippine populations might reflect a history of Denisova gene flow into a population that was ancestral to New Guineans, Australians, and Mamanwa, followed by separation of the Mamanwa first and then divergence of the New Guineans from Australians, we fit f statistics summarizing the allele frequency correlations among all possible sets of populations to admixture graphs.14 Admixture graphs are formal models of population relationships with the important feature that simply by specifying a topology of population relationships, admixture proportions, and genetic drift values on each lineage, they produce precise predictions of the values that will be observed for the f4, f3, and f2 statistics (Appendix B). These predictions can then be compared to the empirically observed values (with standard errors from a block jackknife) to assess the fit to the data.14

Figure 2. Denisovan and Near Oceanian Ancestry Are Proportional Except in the Philippines
We plot pD(X), the estimated percentage of Denisova ancestry as a fraction of that seen in New Guineans, against the estimated percentage of Near Oceanian ancestry pN(X) by using the values from Table 1 (horizontal and vertical bars specify ±1 standard error). The Mamanwa deviate significantly from the pD(X) = pN(X) line, indicating that their Denisova genetic material does not owe its origin to gene flow from a population related to Near Oceanians. A weaker deviation is seen in the Manobo, who live near the Mamanwa on the island of Mindanao.

Table 2. Denisovan Admixture pD(X) Estimated from Sequencing versus Genotyping Data
(Est. = estimated ancestry; SE = standard error in the estimate; Z = Z score; HGDP ID refers to the sequence data)

Sample                   HGDP ID      Sequencing data          Genotyping data
                                      Est.    SE     Z         Est.    SE     Z
Papuan                   HGDP00542    105%    9%     11.8      100%    n/a    n/a
New Guinea Highlander    none         104%    9%     11.7      100%    n/a    n/a
Bougainville             HGDP00491    83%     10%    8.3       82%     5%     15.9
Mamanwa                  none         28%     10%    2.9       49%     5%     9.2
Cambodian                HGDP00711    19%     9%     2.0       −3%     3%     −0.8
Karitiana                HGDP00998    9%      12%    0.7       4%      6%     0.7
Mongolian                HGDP01224    −6%     12%    −0.5      3%      3%     1.1

For the sequencing data, we present the ratio f4(Yoruba, Denisova; Han, X)/f4(Yoruba, Denisova; Han, Papuan2), estimating the proportion of Denisova ancestry in a population X as a fraction of that in the Papuan2 sample (for the first line, the Papuan sample in the numerator is Papuan1, HGDP00542). For the genotyping data, we present the ratio f4(YRI, Denisova; CHB, X)/f4(YRI, Denisova; CHB, Papuan). No standard errors are given for the genotyping-based estimates in the first two rows because the Papuans and New Guineans are the reference populations, and so by definition those fractions are 100%.
  • 16. The best-fitting admixture graph for seven populations (Neandertal, Denisova, Yoruba, Han Chinese, Mamanwa, Australians, and New Guineans) specifies Denisova gene flow into a population ancestral to New Guineans, Australians, and Mamanwa, followed by the splitting of the ancestors of the Mamanwa and much more recent admixture between them and populations related to East Eurasians (Figure 3 and Figure S2). For this model, the admixture graph predicts the values of 91 allele frequency correlation statistics (f statistics) relating the seven analyzed populations, and only one f statistic has an observed value more than three standard errors from the prediction (Appendix B).

Encouraged by the fit of the admixture graph to the data from the seven populations, we extended the model to include two additional populations, Andaman Islanders (Onge) and Negrito groups from Malaysia (Jehai), both of which have been hypothesized to descend from the same migration that gave rise to Australians and New Guineans4,5 (Figure 3 and Figure S3). This analysis provides overwhelming support for common ancestry for the Onge and Jehai: an admixture graph specifying such a history is an excellent fit to the joint data in the sense that only one of the 246 possible f statistics is more than three standard errors from expectation (Appendix B). The analysis also suggests that after their separation from the Onge, the Jehai received substantial admixture (about three-quarters of their genome) from populations related to mainland East Asians (Appendix B).
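The comparison of predicted and observed f statistics described above rests on two small computations: a block jackknife standard error and a count of statistics whose Z score exceeds 3. The following is a minimal sketch of both, not the authors' software. It assumes a simple unweighted delete-one-block jackknife (a weighted variant is often preferred when blocks differ in size), and the function names `jackknife_se` and `count_outliers` are hypothetical.

```python
import math


def jackknife_se(block_sums, block_counts):
    """Delete-one-block jackknife for a statistic of the form sum / count.

    block_sums[j]   : sum of per-SNP terms (e.g. f4 products) in block j
    block_counts[j] : number of SNPs in block j
    Returns (point_estimate, standard_error).
    Assumption: unweighted jackknife; adequate when blocks are similar in size.
    """
    total_sum = sum(block_sums)
    total_n = sum(block_counts)
    g = len(block_sums)
    estimate = total_sum / total_n
    # Recompute the statistic with each block removed in turn.
    leave_one_out = [
        (total_sum - s) / (total_n - n)
        for s, n in zip(block_sums, block_counts)
    ]
    loo_mean = sum(leave_one_out) / g
    variance = (g - 1) / g * sum((x - loo_mean) ** 2 for x in leave_one_out)
    return estimate, math.sqrt(variance)


def count_outliers(observed, predicted, std_errs, threshold=3.0):
    """Count statistics lying more than `threshold` standard errors from prediction."""
    return sum(
        1
        for o, p, s in zip(observed, predicted, std_errs)
        if abs(o - p) / s > threshold
    )
```

With these helpers, a proposed graph would be judged by how many of its predicted f statistics `count_outliers` flags, in the spirit of the "one of 91" and "one of 246" summaries reported in the text.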
In contrast, a model in which the Onge have no recent East Asian admixture is a good fit to the data, providing further evidence that the Onge have been unadmixed (at least with non-South Asians8) since their initial arrival in the region.14

A striking finding that emerges from the admixture graph model fitting is the evidence of an episode of additional gene flow into Australian and New Guinean ancestors (after their ancestors separated from those of the Mamanwa) from a modern human population that did not have Denisova genetic material. A model in which this admixture accounts for half of the genetic material in Australians and New Guineans is an excellent fit to the data (Figure 3, Figures S2 and S3, and Appendix B). Admixture graphs that do not model a second admixture event are much poorer fits, producing 11 f statistics at |Z| > 3 standard errors from expectation (Appendix B). Our analysis further suggests that the modern humans who admixed with the ancestors of Australians and New Guineans were closer to Andamanese and Malaysian Negritos than to mainland East Asians (Figure 3), although this is a weaker signal (1 f statistic with |Z| > 3 versus 3) (Figure S3). This suggests that populations with Denisova admixture could have been in proximity to the ancestors of the Onge and Jehai during the earliest settlement of the region but provides no evidence for ancestors of present-day East Asians in the region at that time (Appendix B). Thus, these findings suggest that the present-day East Asian and Indonesian populations are primarily descended from more recent migrations to the region.

Discussion

This study has shown that Southeast Asia was settled by modern humans in multiple waves: One wave contributed the ancestors of present-day Onge, Jehai, Mamanwa, New Guineans, and Australians (some of whom admixed with Denisovans), and a second wave contributed much of the ancestry of present-day East Asians and Indonesians.
This scenario of human dispersals is broadly consistent with the archaeologically motivated hypothesis of an early southern route migration leading to the colonization of Sahul and East Asia2 but also further clarifies this scenario.

[Figure 3 graphic: admixture graph relating Neandertal, Denisova, Yoruba, Chinese, Onge (N), Jehai (N), Mamanwa (N), New Guinea, and Australian; admixture edges are labeled 1.3%/98.7%, 7%/93%, 51%/49%, 24%/76%, 24%/76%, and 27%/73%.]

Figure 3. A Model of Population Separation and Admixture that Fits the Data
The admixture graph suggests Denisova-related gene flow into a common ancestral population of Mamanwa, New Guineans, and Australians, followed by admixture of New Guinean and Australian ancestors with another population that did not experience Denisova gene flow. We cannot distinguish the order of population divergence of the ancestors of Chinese, Onge/Jehai, and Mamanwa/New Guineans/Australians, and hence show a trifurcation. Admixture proportion estimates (red) are potentially affected by ascertainment bias and hence should be viewed with caution. In addition, although admixture graphs are precise about the topology of population relationships, they are not informative regarding timing. Thus, the lengths of lineages should not be interpreted in terms of population split times and admixture events.
  • 17. In particular, our data provide no evidence for multiple dispersals of modern humans out of Africa, as all non-Africans have statistically indistinguishable amounts of Neandertal genetic material.12,18 Instead, our data are consistent with a single dispersal out of Africa (as proposed in some versions of the early southern route hypothesis1) from which there were multiple dispersals to South and East Asia.

This study is also important in providing a clue about the geographic location of the Denisova gene flow. Given the high mobility of human populations, it is difficult to use genetic data from present-day populations to infer the location of past demographic events with high confidence. Nevertheless, the fact that Denisova genetic material is present in eastern Southeast Asians and Oceanians (Mamanwa, Australians, and New Guineans), but not in the west (Onge and Jehai) or northwest (the Eurasian continent) suggests that interbreeding might have occurred in Southeast Asia itself. Further evidence for a Southeast Asian location comes from our evidence of ancient gene flow from relatives of the Onge and Jehai into the common ancestors of Australians and New Guineans after the initial Denisova gene flow (Figure 3); this suggests that ancestors of both of these groups (but not of East Asians) were present in the region at the time. Although some of the observed patterns could alternatively be explained by a history in which there was initially some Denisova genetic material throughout Southeast Asia, which was subsequently displaced by major migrations of people related to present-day East Asians, such a history cannot parsimoniously explain the absence of Denisova genetic material in the Onge and Jehai. Our evidence of a Southeast Asian location for the Denisovan admixture thus suggests that Denisovans were spread across a wider ecological and geographic region, from the deciduous forests of Siberia to the tropics, than any other hominin with the exception of modern humans.
Finally, this study is methodologically important in showing that there is much to learn about the relationships among modern humans by analyzing patterns of genetic material contributed by archaic humans. Because the archaic genetic material is highly divergent, it is easily detected in a modern human even if it contributes only a small proportion of the ancestry; this makes it possible to use archaic genetic material to study subtle and ancient gene flow much as a medical imaging dye injected into a patient allows the tracing of blood vessels. A priority for future research should be to obtain direct estimates for the dates of the Denisova and Neandertal gene flow, as these will provide a better understanding of the interactions among Denisovans, Neandertals, and the ancestors of various present-day human populations.

Appendix A: Statistics Used for Estimating Admixture Proportions

pD(X) Statistic Used for Estimating Denisova Admixture Proportion

We first discuss the pD(X) statistic that we use for estimating the Denisova admixture proportion in any population X. Define the frequency of allele i in a sample from population Y as z^i_Y. Then pD(X) is defined as in Equation 1. The rightmost part of Equation 1 shows that pD(X) can also be expressed as a ratio of f4 statistics, which we introduced previously14 to measure the correlation in allele frequency differences between pairs of populations.
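Equation 1 itself is not reproduced in this transcript. From the definitions just given and from the f4 ratio stated in the notes to Table 2, it can be reconstructed as follows. This is a reconstruction rather than a verbatim copy of the published equation; the sums run over the analyzed SNPs i, and A, B, and C are the outgroup, archaic, and East Asian comparison populations named in the text.

```latex
\begin{equation}
p_D(X)
  = \frac{\sum_i \left(z^i_A - z^i_B\right)\left(z^i_C - z^i_X\right)}
         {\sum_i \left(z^i_A - z^i_B\right)\left(z^i_C - z^i_{\text{New Guinea}}\right)}
  = \frac{f_4(A, B;\, C, X)}{f_4(A, B;\, C, \text{New Guinea})}
\tag{1}
\end{equation}
```

With A = YRI, B = Denisova, and C = CHB, the right-hand side is the genotyping-based ratio f4(YRI, Denisova; CHB, X)/f4(YRI, Denisova; CHB, Papuan) quoted under Table 2.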
We previously reported simulations showing that the expected values of f4 statistics are in practice robust to ascertainment bias (how the polymorphisms are chosen for inclusion in an analysis), making them useful for learning about history with SNP array data.14 The expected values of f4 statistics can be understood visually by following the arrows through the phylogenetic trees with admixture relating sets of samples, assuming that these are accurate models for the relationships among the populations.14

Figure 4 illustrates how the ratio of f4 statistics computed in Equation 1 estimates an admixture proportion. Both the numerator and denominator can be viewed as a correlation of two allele frequency differences:

z^i_A − z^i_B is the allele frequency difference between an outgroup "A" that did not experience admixture and an archaic group "B" hypothesized to be related to the admixing group (e.g., A = {chimpanzee, Yoruba, or San} and B = {Denisova or Neandertal}). This follows the blue arrows in Figure 4.

z^i_C − z^i_X is the allele frequency difference between a modern non-African population "C" and a test population "X" (e.g., C = {Chinese or Bornean}). This follows the red arrows in Figure 4.

If populations C and X are sister groups that descend from a homogeneous non-African ancestral population, then the allele frequency differences are expected to have arisen entirely since the split from that common ancestral population, and thus the correlation to A and B is expected to be zero (no overlap of the arrows). In contrast, if population X has inherited some proportion qX of its lineages from an archaic population, then the expected value of the product of the frequency differences is proportional to qX times the overlap of the paths of A and B and C and X in Figure 4, which corresponds to genetic drift α + β.
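The f4 statistic and the ratio described above can be sketched in a few lines. This is an illustrative sketch with made-up allele frequencies, not the authors' code; the function names `f4` and `f4_ratio` and the toy populations are hypothetical. The test population is constructed as an exact blend of C and the reference population, so the ratio should recover the blending fraction.

```python
def f4(freq_a, freq_b, freq_c, freq_d):
    """Mean over SNPs of (a - b) * (c - d), given per-SNP allele frequencies."""
    products = [
        (a - b) * (c - d)
        for a, b, c, d in zip(freq_a, freq_b, freq_c, freq_d)
    ]
    return sum(products) / len(products)


def f4_ratio(freq_a, freq_b, freq_c, freq_x, freq_ref):
    """f4(A, B; C, X) / f4(A, B; C, ref): the form of pD(X) with ref = New Guinea."""
    return f4(freq_a, freq_b, freq_c, freq_x) / f4(freq_a, freq_b, freq_c, freq_ref)


# Toy allele frequencies at four SNPs (invented for illustration only).
outgroup = [0.1, 0.5, 0.9, 0.3]    # A
archaic = [0.8, 0.2, 0.4, 0.6]     # B
east_asian = [0.2, 0.6, 0.7, 0.5]  # C
reference = [0.6, 0.4, 0.5, 0.6]   # stand-in for New Guinea

# Build X as an exact 50/50 blend of C and the reference population.
mix = 0.5
test_pop = [(1 - mix) * c + mix * r for c, r in zip(east_asian, reference)]

ratio = f4_ratio(outgroup, archaic, east_asian, test_pop, reference)
```

Here `ratio` equals the blending fraction 0.5 up to floating-point error, because the (A − B) term is common to numerator and denominator. Real data add sampling noise and genetic drift, which is why the estimates in the text carry standard errors.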
While we do not know the value of α + β, when we take the ratio of the numerator and denominator to compute the pD(X) statistic, this unknown quantity cancels, and we obtain qX/qNew Guinea, the proportion of archaic ancestry in a population as a fraction of that in New Guineans (Figure 4).

Two issues merit further discussion. First, Figure 4 is an oversimplification in that it does not show two archaic gene-flow events (corresponding to Denisovans and Neandertals). However, we have previously reported that the data are consistent with the same amount of Neandertal gene flow into the ancestors of East Asians (C, such as CHB) and populations with Denisovan ancestry (X).12,18 As a result, the same genetic drift terms are added to the numerator and denominator, which then cancel in the ratio pD(X) so that they do not affect results.
  • 18. Second, pD(X) is expected to provide an unbiased estimate of the admixture proportion even if the genetic drift on various lineages has been large. This contrasts with previous methods for estimating admixture, which have required accurate proxies for the ancestral populations.10

pN(X) and pA(X) Statistics for Estimating Near Oceanian and Australian Admixture

We next discuss the statistics that we use for estimating the New Guinean pN(X) or Australian pA(X) mixture proportion in any East Eurasian or island Southeast Asian population X, which are defined in Equations 2 and 3, respectively. Figure 5 shows the admixture graph corresponding to the computation of pN(X). Both the numerator and the denominator are of the form f4(A,Australia; X,New Guinea). The first term measures the correlation in allele frequency differences between (A − Australia) and (X − New Guinea). If X and New Guinea descended from a common ancestral population since the split from Australians, then they are perfect sister groups, and the expected value of f4 is zero (the sample is consistent with 100% Near Oceanian ancestry). On the other hand, if X has a proportion (1 − qX) of non-Near Oceanian ancestry, then the two terms will have a nonzero correlation, which as shown in Figure 5 is proportional to the genetic drift shared between the two population comparisons and has an expected value of (1 − qX)[(1 − pX)β + γ] (the proportions of ancestry flowing along various genetic drift paths times the genetic drift on each of these lineages, indicated by the overlap of the red and blue arrows). When we take one minus the ratio pN(X) = 1 − f4(A,Australia; X,New Guinea)/f4(A,Australia; CHB,New Guinea), the complicated term on the right side of this expectation cancels, and we obtain E[pN(X)] = qX. As with Figure 4, we do not show the independent Neandertal admixture because the effect of this term is to cancel from the numerator and denominator.
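Equations 2 and 3 are not reproduced in this transcript. Equation 2 can be written down directly from the formula quoted in the paragraph above. Equation 3 is reconstructed on the assumption, stated in the appendix, that pA(X) is identical to pN(X) with the positions of Australia and New Guinea exchanged. The last two lines restate the cancellation argument, with β and γ denoting the shared drift terms of Figure 5.

```latex
\begin{align}
p_N(X) &= 1 - \frac{f_4(A, \text{Australia};\, X, \text{New Guinea})}
                   {f_4(A, \text{Australia};\, \text{CHB}, \text{New Guinea})} \tag{2} \\[4pt]
p_A(X) &= 1 - \frac{f_4(A, \text{New Guinea};\, X, \text{Australia})}
                   {f_4(A, \text{New Guinea};\, \text{CHB}, \text{Australia})} \tag{3} \\[8pt]
E\!\left[f_4(A, \text{Australia};\, X, \text{New Guinea})\right]
       &= (1 - q_X)\left[(1 - p_X)\beta + \gamma\right] \notag \\
\frac{E\!\left[f_4(A, \text{Australia};\, X, \text{New Guinea})\right]}
     {E\!\left[f_4(A, \text{Australia};\, \text{CHB}, \text{New Guinea})\right]}
       &= \frac{(1 - q_X)\left[(1 - p_X)\beta + \gamma\right]}
               {(1 - 0)\left[(1 - p_X)\beta + \gamma\right]}
        = 1 - q_X \notag
\end{align}
```

The denominator uses q = 0 because CHB is assumed to carry no Near Oceanian ancestry, so one minus the ratio has expectation qX, as the text states.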
In Table S3 we report the pN(X) estimates for diverse choices of outgroup populations A (Yoruba, San, and chimpanzee) and E (China and Borneo). The estimates are consistent whatever the choice of A and E, suggesting that our inferences are robust. (We do not report pN(X) estimates in Table S3 for the Australians because this population is not expected to conform to the population relationships shown in Figure 5; indeed, the pN(X) estimates for Australians, when we do compute them, are significantly greater than 1.) Further evidence for the usefulness of the pN(X) estimates comes from the fact that they are consistent with the pD(X) estimates for nearly all the populations in Table 1 (except for the Philippine populations, in which the Denisova ancestry does not appear to be explainable by Near Oceanian gene flow as described in the main text).

We also computed a statistic pA(X) that is identical to pN(X) except for the transposition of the positions of Australia and New Guinea in the statistics (Equations 2 and 3). Once again, we obtain consistent inferences of pA(X) in Table S3 regardless of the choice of outgroup populations. Because New Guinea and Australia are sister groups, descending from a common ancestral population, the justifications for the two statistics are very similar.

Figure 4. Computation of the Estimate of Denisovan Ancestry pD(X)
The black lines show the model for how populations are related that is the basis for the pD(X) ancestry estimate. Population X arose from an admixture of a proportion (1 − qX) of ancestry from an ancestral non-African population C′ and (qX) from archaic population B′ (C and B are their unmixed descendants). The expected value of f4(A,B;C,X) is proportional to the correlation in the allele frequency differences A − B and C − X, and can be computed as the overlap in the drift paths separating A − B (blue arrows) and C − X (red arrows). These paths only overlap over the branches α and β, in proportion to the percentage qX of the lineages of population X that are of archaic ancestry, and so the expected value is qX(α + β). When we compute the ratio pD(X), (α + β) cancels from both the numerator and denominator, and we obtain qX/qNew Guinea, the fraction of archaic ancestry in a population X divided by that in New Guinea. This provides unbiased estimates of the mixture proportion even if populations C and B have experienced a large amount of genetic drift since splitting from their ancestors, that is, even if we do not have good surrogates for the ancestral populations. This robustness arises because the genetic drift on the branches B/B′ and C/C′ does not contribute to the expectations.
  • 19. The only problem we found with the pN(X) estimation procedure is that when X is any non-African population known to have West Eurasian ancestry (e.g., Europeans or South Asians), we often obtained negative pN(X) statistics. Two hypotheses could be consistent with this observation: (1) In unpublished data, we have attempted to write down a model of population separation and mixture analogous to that in Figure 3 that jointly fits the genetic data comparing eastern and western Eurasian populations and have so far not succeeded in developing a model that passes goodness-of-fit tests. This suggests that the population relationships between eastern and western Eurasians might be more complex than we have been able to model to date, and therefore we cannot use them in the pN(X) computation. (2) An alternative possibility is that the negative pN(X) statistics reflect an artifact of ascertainment bias on SNP arrays. Ascertainment bias is likely to be particularly complex with regard to the joint information from Europeans and East Asians because these populations were heavily used in choices of SNPs for medical genetics arrays. Thus, it might be difficult to make inferences using populations from both regions together with data from conventional SNP arrays developed for medical genetic studies. Whatever the explanation, we have some reason to believe that estimates of Near Oceanian admixture by using data from populations with West Eurasian ancestry might be unreliable. Thus, we have excluded West Eurasians from the estimates reported in Table 1.

Appendix B: Admixture Graphs

Overview of Admixture Graphs

A key finding from this study is that there is Denisova genetic material in the Mamanwa, a Negrito group from the Philippines, which cannot be explained by a history of recent gene flow from relatives of New Guineans (Near Oceanians) or Australians. To further understand this history, we use the admixture graph methodology that we initially developed for a study of Indian genetic variation14 to test whether various hypotheses about population relationships are consistent with the data. Specifically, we tested the hypothesis of a single episode of Denisovan gene flow into the ancestors of New Guineans, Australians, and Mamanwa, prior to the separation of New Guineans and Australians.
Admixture graphs refer to generalizations of phylogenetic trees that incorporate the possibility of gene flow. Like phylogenetic trees, admixture graphs describe the topology of population relationships without specifying the timing of events (such as population splits or gene-flow events), or the details of population size changes on different lineages. While this can be a disadvantage in that fitting admixture graphs to data does not allow inferences of these important details, it is also an advantage in that one can fit genetic data to an admixture graph without having to specify a demographic history. This allows for inferences that are more robust to uncertainties about important parameters of history. Once the topology of the population relationships is inferred, one can in principle use other methods to make inferences about the timing of events and population size changes. This makes the problem of learning about history simpler than if one had to simultaneously infer topology, timing, and demography.

An admixture graph makes precise predictions about the patterns of correlation in allele frequency differences across all subsets of two, three, and four populations in an analysis, as measured for example by the f2, f3, and f4 statistics of Reich et al.14 Given n populations, there are n(n − 1)/2 f2 statistics, n(n − 1)(n − 2)/6 f3 statistics, and n(n − 1)(n − 2)(n − 3)/24 f4 statistics. To fit an admixture graph to data, one first proposes a topology, then identifies the set of admixture proportions and genetic drift values on each lineage (variation in allele frequency corresponding to random sampling of alleles from generation to generation in a population of finite size) that are the best match to the data under that model.

Figure 5. Computation of the Estimate of Near Oceanian Ancestry pN(X)
The test population X is assumed to have arisen from a mixture of a proportion (1 − qX) of ancestry from ancestral East Asians E′ and (qX) of ancestral Near Oceanians N′. The Near Oceanians are, in turn, assumed to have received a proportion pX of their ancestry from the Denisovans (E and New Guinea are assumed to be unmixed descendants of these two). The expected value of f4(A,Australia; X,New Guinea) can be computed from the correlation in the allele frequency differences A − Australia (blue arrows) and X − New Guinea (red arrows). These paths only overlap along the proportion (1 − qX) of the ancestry of population X that takes the East Asian path, where the expected shared drift is (1 − pX)β + γ as shown in the figure. Thus, the expected value of the f4 statistic is (1 − qX)[(1 − pX)β + γ]. Because qX = 0 for the denominator of pN(X) (no Near Oceanian ancestry), the ratio of f4 statistics has an expected value of (1 − qX) and E[pN(X)] = qX.
  • 20. The admixture graph topology, admixture proportions, and genetic drift values on each lineage together generate expected values for the f2, f3, and f4 statistics14 that can be compared to the observed values (which have empirical standard errors from a block jackknife) to assess the adequacy of the best fit under the proposed topology. As we showed previously,14 the topology relating populations in an admixture graph can be accurately inferred even if the polymorphisms used in an analysis are affected by substantial ascertainment bias. The software that we have developed for fitting admixture graphs carries out a hill-climb to find the genetic drift values and admixture proportions that minimize the discrepancy between the observed and expected f2, f3, and f4 statistics for a given topology relating a set of populations.

A complication in fitting admixture graphs to data is that we do not know how many effectively independent f statistics there are, out of the [n(n − 1)/2][1 + (n − 2)/3 + (n − 2)(n − 3)/12] that are computed (the sum of the three counts given above). These statistics are highly correlated, and in fact can be related algebraically to each other; for example, all the f3 and f4 statistics are linear combinations of the f2 statistics. Although we believe that it is possible to construct a reasonable score for how well the model fits the data by studying the covariance matrix of the f statistics (and indeed a score of this type is the basis for our hill-climbing software), we have not yet found a formal way to assess how many independent hypotheses are being tested, and thus we do not at present have a goodness-of-fit test. Instead, we simply compute all possible f statistics and search for extreme outliers (e.g., Z scores of 3 or more from expectation). A large number of Z scores greater than 3 is not likely to be observed if the admixture graph topology is an accurate description of a set of population relationships.
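Two claims in the overview above are easy to check numerically: the counts of f2, f3, and f4 statistics for n populations, and the statement that f3 and f4 statistics are linear combinations of f2 statistics. The sketch below does both. It is an illustration under stated assumptions: the statistics are the simple per-SNP averages without any sampling correction, the identities used are the standard expansions of the products into squared differences, and all function names and toy frequencies are hypothetical.

```python
from math import comb


def count_f_statistics(n):
    """Counts of f2, f3, f4 statistics for n populations: one per subset of size 2, 3, 4."""
    return comb(n, 2), comb(n, 3), comb(n, 4)


def f2(x, y):
    """Mean squared allele frequency difference between two populations."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def f3(c, a, b):
    """Mean of (c - a) * (c - b)."""
    return sum((ci - ai) * (ci - bi) for ci, ai, bi in zip(c, a, b)) / len(c)


def f4(a, b, c, d):
    """Mean of (a - b) * (c - d)."""
    return sum((ai - bi) * (ci - di) for ai, bi, ci, di in zip(a, b, c, d)) / len(a)


def f3_from_f2(c, a, b):
    """f3(C; A, B) = [f2(C, A) + f2(C, B) - f2(A, B)] / 2."""
    return 0.5 * (f2(c, a) + f2(c, b) - f2(a, b))


def f4_from_f2(a, b, c, d):
    """f4(A, B; C, D) = [f2(A, D) + f2(B, C) - f2(A, C) - f2(B, D)] / 2."""
    return 0.5 * (f2(a, d) + f2(b, c) - f2(a, c) - f2(b, d))


# Totals for the seven- and nine-population analyses.
total_seven = sum(count_f_statistics(7))
total_nine = sum(count_f_statistics(9))

# Toy allele frequencies (invented) to check the identities.
pop_a = [0.1, 0.5, 0.9, 0.3]
pop_b = [0.8, 0.2, 0.4, 0.6]
pop_c = [0.2, 0.6, 0.7, 0.5]
pop_d = [0.6, 0.4, 0.5, 0.6]
f3_gap = abs(f3(pop_c, pop_a, pop_b) - f3_from_f2(pop_c, pop_a, pop_b))
f4_gap = abs(f4(pop_a, pop_b, pop_c, pop_d) - f4_from_f2(pop_a, pop_b, pop_c, pop_d))
```

The totals come to 91 for seven populations and 246 for nine, matching the numbers of f statistics quoted for the two admixture graph analyses, and both gaps are zero up to floating-point error.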
Denisova Gene Flow into Mamanwa/New Guinean/Australian Ancestors

We initially fit an admixture graph to the data from Mamanwa, New Guineans, Australians, Denisova, Neandertal, West Africans (YRI), and Han Chinese (CHB), basing some of the proposed population relationships on previous work that hypothesized a model of an out-of-Africa migration of modern humans, Neandertal gene flow into the ancestors of all non-Africans, and sister group status for Neandertals and Denisovans.12 A complication in fitting an admixture graph to these data is that because of the low coverage of the Neandertal and Denisova genomes, we could not accurately infer the diploid genotype at each SNP. Thus, we sampled a single read from Neandertal and Denisova to represent each site and (incorrectly) assumed that these individuals were homozygous for the observed allele at each analyzed SNP. This means that the estimates of genetic drift on the Neandertal and Denisova branches are not reliable (the genetic drift values are overestimated). However, these sources of error do not introduce a correlation in allele frequencies across populations and hence are not expected to generate a false inference about the population relationships.

Figure S2 shows an admixture graph that proposes that the Mamanwa, New Guineans, and Australians descend from a common ancestral population; the Mamanwa split first and the New Guinean and Australian ancestors split later. This is an excellent fit to the data in the sense that only one of 91 f statistics is more than three standard errors from expectation (|Z| = 3.4). An interesting feature of this admixture graph is that it specifies an additional admixture event, after the Mamanwa lineage separated, into the ancestors of Australians and New Guineans that contributed about half of their ancestry and involved a population without Denisova admixture. A model that does not include such a secondary admixture event is strongly rejected (see below).
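The single-read sampling step described for the Neandertal and Denisova genomes can be sketched as follows. This is a hypothetical illustration, not the authors' pipeline: the input format (a list of observed alleles per site, coded 0 or 1) and the function name `sample_pseudo_haploid` are assumptions made for the example.

```python
import random


def sample_pseudo_haploid(reads_per_site, rng):
    """Pick one read at random per site and treat the individual as homozygous for it.

    reads_per_site : list of lists; each inner list holds the alleles (0 or 1)
                     seen in the reads covering that site
    rng            : a random.Random instance, passed in for reproducibility
    Returns a list of allele frequencies (0.0 or 1.0), or None where no read exists.
    """
    calls = []
    for reads in reads_per_site:
        if not reads:
            calls.append(None)  # site not covered; must be skipped downstream
        else:
            calls.append(float(rng.choice(reads)))
    return calls


# Four toy sites: two consistent, one uncovered, one with conflicting reads.
rng = random.Random(0)
calls = sample_pseudo_haploid([[0, 0], [1], [], [0, 1]], rng)
```

At a truly heterozygous site the call is 0.0 or 1.0 at random, never 0.5, which is one way to see why the drift on the archaic branches is overestimated while correlations across populations are left unaffected.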
The estimated proportion of Neandertal ancestry in all non-Africans from the admixture graph fitting in Figure 3, at 1.3%, is at the low end of the 1%–4% previously estimated from sequencing data.18 Similarly, we infer a proportion of Denisova ancestry in New Guineans of 3.5% = 6.6% × 53%, which is lower than the 4%–6% previously estimated based on sequencing data but not significantly so when one takes into account the standard errors quoted in that study.12 These low numbers could reflect statistical uncertainty from the previously reported analyses of sequencing data or in the admixture graph estimates (the latter possibility is especially important to consider because we do not at present understand how to compute standard errors on the admixture estimates derived from admixture graphs).

Another possible explanation for the low estimates of mixture proportions is ascertainment bias affecting the way SNPs were selected, which can affect estimates of mixture proportions and branch lengths (while having much less impact on the inference of topology). Further support for the hypothesis that ascertainment bias might be contributing to our lower estimates of mixture proportions comes from the fact that in unpublished work we have found that the polymorphisms most enriched for signals of archaic admixture are those in which the derived allele is present in the archaic population, absent in West Africans, and present at low minor allele frequency in the studied population. In our admixture graph fitting, we filtered out this class of SNPs, as the f statistics used in the admixture graph have denominators that require frequency estimates from a polymorphic reference population, and we used YRI as our reference. Indeed, when we refitted the same admixture graph with CHB instead of YRI as the reference population, we obtained the same topology but the Neandertal mixture proportion increased to 1.9%.
We have chosen to use YRI as the reference population in all of our reported admixture graphs because they are a better outgroup for the modern populations whose history we are studying than the CHB (populations related to the Chinese were directly involved in admixture events in Southeast Asia).

Adding Onge and Jehai

The Andamanese Negrito group (Onge) and Malaysian Negrito group (Jehai) have been proposed to share ancient
526 The American Journal of Human Genetics 89, 516–528, October 7, 2011
  • 21. common ancestry with Philippine Negritos (e.g., Mamanwa). The fact that neither the Onge nor the Jehai have evidence of Denisova genetic material, however, suggests that any common ancestry must date to before the Denisova gene flow into the ancestors of the Mamanwa, New Guineans, and Australians. To explore the relationship between the Onge and Jehai and the other populations, we added them into the admixture graph. The only family of admixture graphs that we could identify as fitting the data have the Onge as a deep lineage of modern humans, with the Jehai deriving ancestry from the same lineage but also harboring a substantial additional contribution of East Asian-related admixture (Figure S3). A striking feature of the family of admixture graphs shown in Figure S3 is that both the Jehai and Mamanwa are inferred to have up to about three-quarters of their ancestry due to recent East Eurasian admixture, which is not too surprising given that these populations have been living side by side with populations of East Eurasian ancestry for thousands of years. Moreover, both Y-chromosome and mtDNA analyses strongly suggest recent East Asian admixture in the Mamanwa.32,34 In contrast, the genome-wide SNP data for the Onge are consistent with having no non-Negrito admixture within the limits of our resolution, perhaps reflecting their greater geographic isolation. We next sought to resolve how the lineage including Onge and Jehai ancestors, the mainland East Asian (e.g., Chinese), and the eastern group (including Mamanwa, Australian and New Guinean ancestors) are related. Three relationships are all consistent with the data. Specifically, for all three of the admixture graphs shown in Figure S3, only one of the 246 possible f statistics has a score of |Z| > 3. Thus, we cannot discern the order of splitting of these three lineages and represent the relationships as a trifurcation in Figure 3.
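The fit criterion used here, counting f statistics whose Z score exceeds 3 in absolute value, can be sketched as follows. This is an illustration only: it assumes each Z score is the observed statistic minus the value predicted by the fitted graph, divided by a standard error supplied by the caller, and the four numbers are made up rather than taken from the paper.

```python
def count_outliers(observed, expected, std_errors, threshold=3.0):
    """Return (number of statistics with |Z| > threshold, largest |Z|)."""
    abs_z = [abs((o - e) / s) for o, e, s in zip(observed, expected, std_errors)]
    n_outliers = sum(1 for z in abs_z if z > threshold)
    return n_outliers, max(abs_z)

# Made-up f statistics: observed values, values predicted by a fitted
# admixture graph, and standard errors of the observed values.
observed = [0.0120, -0.0031, 0.0450, 0.0007]
expected = [0.0115, -0.0030, 0.0382, 0.0010]
std_errors = [0.0010, 0.0008, 0.0020, 0.0005]

n_bad, worst = count_outliers(observed, expected, std_errors)
print(n_bad, round(worst, 2))
```

Comparing candidate graphs then amounts to comparing these counts across fits, which is how the alternative histories are ranked in the text.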
The actual estimates of mixture proportions are similar for all three figures as well.

Perturbing the Best-Fitting Admixture Graph to Assess the Robustness of Our Inferences

To assess the robustness of the admixture graphs, we perturbed Figure S3 (in practice, we perturbed Figure 3A, but given the fact that the graphs are statistically indistinguishable we expected that results would be similar for all three). First, we considered the possibility that after the initial Denisova gene flow into the ancestors of Mamanwa, New Guineans, and Australians, the New Guinean and Australian ancestors did not experience an additional gene-flow event with a population without Denisovan admixture. However, when we try to fit this simpler model to the data, we find that instead of one f statistic with |Z| > 3 (more than three standard errors from expectation), there are now 11, and all but one of them involve the Mamanwa, suggesting that this population is poorly fit by such a model. Thus, an additional admixture event in the ancestry of New Guineans and Australians (resulting in a decrease in their proportion of Denisova ancestry) results in a major improvement in the fit. Second, we considered the possibility that the secondary gene-flow event into the ancestors of Australians and New Guineans came from relatives of Chinese (CHB) rather than western Negritos such as the Onge. However, when we fit this alternative history to the data, we find three f statistics (rather than one) with scores of |Z| > 3, a substantially worse fit. We conclude that the modern human population with which the ancestors of Australians and New Guineans interbred was likely to have been more closely related to western Negritos than to mainland East Asians.

Supplemental Data

Supplemental Data include three figures and three tables and can be found with this article online at http://www.cell.com/AJHG/.

Acknowledgments

We thank the volunteers who donated DNA samples. We acknowledge F.A. Almeda Jr., J.P.
Erazo, D. Gil, the late J. Kuhl, E.S. Larase, I. Motinola, G. Patagan, W. Sinco, A. Sofro, U. Tadmor, and R. Trent for assistance with sample collections. We thank M. Meyer for preparing DNA libraries for high-throughput sequencing; A. Barik and P. Nürnberg for assistance with genotyping; and O. Bar-Yosef, K. Bryc, R.E. Green, J.-J. Hublin, J. Kelso, D. Lieberman, B. Pakendorf, M. Slatkin, and B. Viola for comments on the manuscript. T.A. Jinam was supported by a grant from the SOKENDAI Graduate Student Overseas Travel Fund. This work was supported by the Max Planck Society and by a National Science Foundation HOMINID grant (1032255).

Received: August 11, 2011
Revised: September 8, 2011
Accepted: September 8, 2011
Published online: September 22, 2011

Web Resources

The URLs for data presented herein are as follows:
Burrows-Wheeler Aligner, http://bio-bwa.sourceforge.net/index.shtml
CEPH-Human Genome Diversity Cell Line Panel, http://www.cephb.fr/en/hgdp/diversity.php
EIGENSOFT, http://genepath.med.harvard.edu/~reich/Software.htm
European Collection of Cell Cultures, http://www.hpacultures.org.uk/pages/Ethnic_DNA_Panel.pdf
European Nucleotide Archive (Project ID ERP000121), http://www.ebi.ac.uk/ena/
Ibis, http://bioinf.eva.mpg.de/Ibis/
SAMtools, http://samtools.sourceforge.net/

References

1. Mellars, P. (2006). Going east: New genetic and archaeological perspectives on the modern human colonization of Eurasia. Science 313, 796–800.
2. Lahr, M., and Foley, R. (1994). Multiple dispersals and modern human origins. Evol. Anthropol. 3, 48–60.
  • 22. 3. Endicott, P., Gilbert, M.T., Stringer, C., Lalueza-Fox, C., Willerslev, E., Hansen, A.J., and Cooper, A. (2003). The genetic origins of the Andaman Islanders. Am. J. Hum. Genet. 72, 178–184.
4. Macaulay, V., Hill, C., Achilli, A., Rengo, C., Clarke, D., Meehan, W., Blackburn, J., Semino, O., Scozzari, R., Cruciani, F., et al. (2005). Single, rapid coastal settlement of Asia revealed by analysis of complete mitochondrial genomes. Science 308, 1034–1036.
5. Thangaraj, K., Chaubey, G., Kivisild, T., Reddy, A.G., Singh, V.K., Rasalkar, A.A., and Singh, L. (2005). Reconstructing the origin of Andaman Islanders. Science 308, 996.
6. Cordaux, R., and Stoneking, M. (2003). South Asia, the Andamanese, and the genetic evidence for an early human dispersal out of Africa. Am. J. Hum. Genet. 72, 1586–1590; author reply 1590–1593.
7. Palanichamy, M.G., Agrawal, S., Yao, Y.G., Kong, Q.P., Sun, C., Khan, F., Chaudhuri, T.K., and Zhang, Y.P. (2006). Comment on "Reconstructing the origin of Andaman islanders". Science 311, 470, author reply 470.
8. Barik, S.S., Sahani, R., Prasad, B.V.R., Endicott, P., Metspalu, M., Sarkar, B.N., Bhattacharya, S., Annapoorna, P.C.H., Sreenath, J., Sun, D., et al. (2008). Detailed mtDNA genotypes permit a reassessment of the settlement and population structure of the Andaman Islands. Am. J. Phys. Anthropol. 136, 19–27.
9. Abdulla, M.A., Ahmed, I., Assawamakin, A., Bhak, J., Brahmachari, S.K., Calacal, G.C., Chaurasia, A., Chen, C.H., Chen, J., Chen, Y.T., et al.; HUGO Pan-Asian SNP Consortium; Indian Genome Variation Consortium. (2009). Mapping human genetic diversity in Asia. Science 326, 1541–1545.
10. Wollstein, A., Lao, O., Becker, C., Brauer, S., Trent, R.J., Nürnberg, P., Stoneking, M., and Kayser, M. (2010). Demographic history of Oceania inferred from genome-wide data. Curr. Biol. 20, 1983–1992.
11.
Moodley, Y., Linz, B., Yamaoka, Y., Windsor, H.M., Breurec, S., Wu, J.Y., Maady, A., Bernhöft, S., Thiberge, J.M., Phuanukoonnon, S., et al. (2009). The peopling of the Pacific from a bacterial perspective. Science 323, 527–530.
12. Reich, D., Green, R.E., Kircher, M., Krause, J., Patterson, N., Durand, E.Y., Viola, B., Briggs, A.W., Stenzel, U., Johnson, P.L., et al. (2010). Genetic history of an archaic hominin group from Denisova Cave in Siberia. Nature 468, 1053–1060.
13. Altshuler, D.M., Gibbs, R.A., Peltonen, L., Dermitzakis, E., Schaffner, S.F., Yu, F., et al.; International HapMap 3 Consortium. (2010). Integrating common and rare genetic variation in diverse human populations. Nature 467, 52–58.
14. Reich, D., Thangaraj, K., Patterson, N., Price, A.L., and Singh, L. (2009). Reconstructing Indian population history. Nature 461, 489–494.
15. Redd, A.J., and Stoneking, M. (1999). Peopling of Sahul: mtDNA variation in aboriginal Australian and Papua New Guinean populations. Am. J. Hum. Genet. 65, 808–828.
16. Cann, H.M., de Toma, C., Cazes, L., Legrand, M.F., Morel, V., Piouffre, L., Bodmer, J., Bodmer, W.F., Bonne-Tamir, B., Cambon-Thomsen, A., et al. (2002). A human genome diversity cell line panel. Science 296, 261–262.
17. Chimpanzee Sequencing and Analysis Consortium. (2005). Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437, 69–87.
18. Green, R.E., Krause, J., Briggs, A.W., Maricic, T., Stenzel, U., Kircher, M., Patterson, N., Li, H., Zhai, W., Fritz, M.H., et al. (2010). A draft sequence of the Neandertal genome. Science 328, 710–722.
19. Patterson, N., Price, A.L., and Reich, D. (2006). Population structure and eigenanalysis. PLoS Genet. 2, e190.
20. Kircher, M., Stenzel, U., and Kelso, J. (2009). Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biol. 10, R83.
21. Li, H., and Durbin, R. (2009).
Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760.
22. Busing, F., Meijer, E., and Van Der Leeden, R. (1999). Delete-m jackknife for unequal m. Stat. Comput. 9, 3–8.
23. Kunsch, H.K. (1989). The jackknife and the bootstrap for general stationary observations. Ann. Stat. 17, 1217–1241.
24. O'Connell, J., and Allen, J. (2004). Dating the colonization of Sahul (Pleistocene Australia - New Guinea): A review of recent research. J. Archaeol. Sci. 31, 835–853.
25. Summerhayes, G.R., Leavesley, M., Fairbairn, A., Mandui, H., Field, J., Ford, A., and Fullagar, R. (2010). Human adaptation and plant use in highland New Guinea 49,000 to 44,000 years ago. Science 330, 78–81.
26. McEvoy, B.P., Lind, J.M., Wang, E.T., Moyzis, R.K., Visscher, P.M., van Holst Pellekaan, S.M., and Wilton, A.N. (2010). Whole-genome genetic diversity in a sample of Australians with deep Aboriginal ancestry. Am. J. Hum. Genet. 87, 297–305.
27. Roberts-Thomson, J.M., Martinson, J.J., Norwich, J.T., Harding, R.M., Clegg, J.B., and Boettcher, B. (1996). An ancient common origin of aboriginal Australians and New Guinea highlanders is supported by alpha-globin haplotype analysis. Am. J. Hum. Genet. 58, 1017–1024.
28. Friedlaender, J.S., Friedlaender, F.R., Reed, F.A., Kidd, K.K., Kidd, J.R., Chambers, G.K., Lea, R.A., Loo, J.H., Koki, G., Hodgson, J.A., et al. (2008). The genetic structure of Pacific Islanders. PLoS Genet. 4, e19.
29. Kayser, M., Brauer, S., Cordaux, R., Casto, A., Lao, O., Zhivotovsky, L.A., Moyse-Faurie, C., Rutledge, R.B., Schiefenhoevel, W., Gil, D., et al. (2006). Melanesian and Asian origins of Polynesians: mtDNA and Y chromosome gradients across the Pacific. Mol. Biol. Evol. 23, 2234–2244.
30. Kayser, M., Lao, O., Saar, K., Brauer, S., Wang, X., Nürnberg, P., Trent, R.J., and Stoneking, M. (2008). Genome-wide analysis indicates more Asian than Melanesian ancestry of Polynesians. Am. J. Hum. Genet. 82, 194–198.
31.
Mona, S., Grunz, K.E., Brauer, S., Pakendorf, B., Castrì, L., Sudoyo, H., Marzuki, S., Barnes, R.H., Schmidtke, J., Stoneking, M., and Kayser, M. (2009). Genetic admixture history of Eastern Indonesia as revealed by Y-chromosome and mitochondrial DNA analysis. Mol. Biol. Evol. 26, 1865–1877.
32. Delfin, F., Salvador, J.M., Calacal, G.C., Perdigon, H.B., Tabbada, K.A., Villamor, L.P., Halos, S.C., Gunnarsdóttir, E., Myles, S., Hughes, D.A., et al. (2011). The Y-chromosome landscape of the Philippines: Extensive heterogeneity and varying genetic affinities of Negrito and non-Negrito groups. Eur. J. Hum. Genet. 19, 224–230.
33. Matsumoto, H., Miyazaki, T., Omoto, K., Misawa, S., Harada, S., Hirai, M., Sumpaico, J.S., Medado, P.M., and Ogonuki, H. (1979). Population genetic studies of the Philippine Negritos. II. gm and km allotypes of three population groups. Am. J. Hum. Genet. 31, 70–76.
34. Gunnarsdóttir, E.D., Li, M., Bauchet, M., Finstermeier, K., and Stoneking, M. (2011). High-throughput sequencing of complete human mtDNA genomes from the Philippines. Genome Res. 21, 1–11.
  • 24. ARTICLE

Rare-Variant Association Testing for Sequencing Data with the Sequence Kernel Association Test

Michael C. Wu,1,5 Seunggeun Lee,2,5 Tianxi Cai,2 Yun Li,1,3 Michael Boehnke,4 and Xihong Lin2,*

Sequencing studies are increasingly being conducted to identify rare variants associated with complex traits. The limited power of classical single-marker association analysis for rare variants poses a central challenge in such studies. We propose the sequence kernel association test (SKAT), a supervised, flexible, computationally efficient regression method to test for association between genetic variants (common and rare) in a region and a continuous or dichotomous trait while easily adjusting for covariates. As a score-based variance-component test, SKAT can quickly calculate p values analytically by fitting the null model containing only the covariates, and so can easily be applied to genome-wide data. Using SKAT to analyze a genome-wide sequencing study of 1000 individuals, by segmenting the whole genome into 30 kb regions, requires only 7 hr on a laptop. Through analysis of simulated data across a wide range of practical scenarios and triglyceride data from the Dallas Heart Study, we show that SKAT can substantially outperform several alternative rare-variant association tests. We also provide analytic power and sample-size calculations to help design candidate-gene, whole-exome, and whole-genome sequence association studies.

Introduction

Genome-wide association studies (GWASs) have identified more than 1000 genetic loci associated with many human diseases and traits,1 yet common variants identified through GWASs often explain only a small proportion of trait heritability.
The advent of massively parallel sequencing2 has transformed human genetics3,4 and has the potential to explain some of this missing heritability through identification of trait-associated rare variants.5 Although considerable resources have been devoted to sequence mapping and genotype calling,6–9 successful application of sequencing to the study of complex traits requires novel statistical methods that allow researchers to test efficiently for association given data on rare variants10 and to perform sample-size and power calculations to help design sequencing-based association studies. Rare genetic variants, here defined as alleles with a frequency less than 1%–5%, can play key roles in influencing complex disease and traits.11 However, standard methods used to test for association with single common genetic variants are underpowered for rare variants unless sample sizes or effect sizes are very large.12,13 A logical alternative approach is to employ burden tests that assess the cumulative effects of multiple variants in a genomic region.12–18 Burden tests proposed to date are based on collapsing or summarizing the rare variants within a region by a single value, which is then tested for association with the trait of interest. For example, the cohort allelic sum test (CAST)14 collapses information on all rare variants within a region (e.g., the exons of a gene) into a single dichotomous variable for each subject by indicating whether or not the subject has any rare variants within the region and then applies a univariate test. Instead of collapsing by dichotomizing the number of rare variants within a region, collapsing by counting them is also possible.18 The combined multivariate and collapsing method12 extends CAST by collapsing rare variants within a region into subgroups on the basis of allele frequency, collapsing subgroups as in CAST, and applying a multivariate test to the subgroups.
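The two simplest collapsing schemes described above (a CAST-style carrier indicator and a rare-allele count) can be sketched as follows. This is a toy illustration, not code from any of the cited methods: the function name, the 1% MAF cutoff, and the genotype matrix are assumptions chosen for the example, and the downstream univariate test is omitted.

```python
def collapse_rare_variants(genotypes, mafs, maf_threshold=0.01):
    """Collapse rare variants in a region to one value per subject.

    genotypes: one list per subject of minor-allele counts (0, 1, or 2).
    mafs: minor-allele frequency of each variant.
    Returns (indicator, counts): whether each subject carries any rare
    allele, and how many rare alleles each subject carries.
    """
    rare = [j for j, maf in enumerate(mafs) if maf < maf_threshold]
    indicator, counts = [], []
    for g in genotypes:
        n_rare = sum(g[j] for j in rare)
        counts.append(n_rare)
        indicator.append(1 if n_rare > 0 else 0)
    return indicator, counts

# Four subjects, three variants; only the first two variants are rare.
genotypes = [[0, 0, 1], [1, 0, 2], [1, 1, 0], [0, 0, 0]]
mafs = [0.004, 0.008, 0.20]
indicator, counts = collapse_rare_variants(genotypes, mafs)
print(indicator, counts)
```

Either vector would then be tested against the phenotype as a single predictor; both discard which variant each subject carries.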
The weighted sum test (WST)13 specifically considers the case-control setting and collapses a set of SNPs into a single weighted average of the number of rare alleles for each individual. Numerous alternative methods are largely variations on these approaches.16,17,19 A limitation for all these burden tests is that they implicitly assume that all rare variants influence the phenotype in the same direction and with the same magnitude of effect (after incorporating known weights). However, one would expect most variants (common or rare) within a sequenced region to have little or no effect on phenotype, whereas some variants are protective and others deleterious, and the magnitude of each variant's effect is likely to vary (e.g., rarer variants might have larger effects). Hence, collapsing across all variants is likely to introduce substantial noise into the aggregated index, attenuate evidence for association, and result in power loss. Furthermore, burden tests require either specification of thresholds for collapsing or the use of permutation to estimate the threshold.16–20 Permutation tests are computationally expensive, especially on the whole-genome scale, and are difficult for covariate adjustment because permutation

1 Department of Biostatistics, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA; 2 Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA; 3 Department of Genetics, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA; 4 Department of Biostatistics and Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA
5 These authors contributed equally to this work
*Correspondence: xlin@hsph.harvard.edu
DOI 10.1016/j.ajhg.2011.05.029. ©2011 by The American Society of Human Genetics. All rights reserved.
82 The American Journal of Human Genetics 89, 82–93, July 15, 2011
  • 25. requires independence between the genotype and the covariates. The recently proposed C-alpha test21 is a non-burden-based test and is hence robust to the direction and magnitude of effect. For case-control data, it compares the expected variance to the actual variance of the distribution of allele frequencies. These important advantages allow the C-alpha test to have improved power over burden-based tests, especially when the effects are in different directions. Despite these attractive features, the C-alpha test does not allow for easy covariate adjustment, such as for controlling population stratification, which is important in genetic association studies. The C-alpha test also uses permutation to obtain a p value when linkage disequilibrium is present among the variants, which is, as noted earlier, computationally expensive for whole-genome experiments. The approach has not been generalized to analysis of continuous phenotypes. We propose in this paper the sequence kernel association test (SKAT), a flexible, computationally efficient, regression approach that tests for association between variants in a region (both common and rare) and a dichotomous (e.g., case-control) or continuous phenotype while adjusting for covariates, such as principal components, to account for population stratification.22 The kernel machine regression framework was previously considered for common variants.23,24 In this paper, we provide several essential methodological improvements necessary for testing rare variants. SKAT uses a multiple regression model to directly regress the phenotype on genetic variants in a region and on covariates, and so allows different variants to have different directions and magnitude of effects, including no effects; SKAT also avoids selection of thresholds.
We develop a kernel association test to test the regression coefficients of the variants by using a variance-component score test in a mixed-model framework by accounting for rare variants. SKAT is computationally efficient. This quality is especially important in genome-wide studies because SKAT only requires fitting the null model in which phenotypes are regressed on the covariates alone; p values are easily computed with simple analytic formulae. Additional features of SKAT include exploitation of local correlation structure, incorporation of flexible weights to boost power (e.g., by increasing the weight of rarer variants or incorporating functionality), and allowance for epistatic variant effects. As discussed in more detail below, under special cases, the SKAT, C-alpha test, and individual variant test statistics are closely related. We demonstrate through simulation and analysis of resequencing data from the Dallas Heart Study that SKAT is often more powerful than existing tests across a broad range of models for both continuous and dichotomous data. We also investigate the factors that influence power for sequence association studies. Finally, we describe analytic tools to estimate statistical power and sample sizes to guide the design of new sequence association studies of rare variants with SKAT.

Material and Methods

Sequencing Kernel Association Test

SKAT is a supervised test for the joint effects of multiple variants in a region on a phenotype. Regions can be defined by genes (in candidate-gene or whole-exome studies) or moving windows across the genome (in whole-genome studies). For each region, SKAT analytically calculates a p value for association while adjusting for covariates. Adjustments for multiple comparisons are necessary for analyzing multiple regions, for example with the Bonferroni correction or FDR control.

Notation

Assume n subjects are sequenced in a region with p variant sites observed.
Covariates might include age, gender, and top principal components of genetic variation for controlling population stratification.22 For the i-th subject, $y_i$ denotes the phenotype variable, $X_i = (X_{i1}, X_{i2}, \ldots, X_{im})$ denotes the covariates, and $G_i = (G_{i1}, G_{i2}, \ldots, G_{ip})$ denotes the genotypes for the p variants within the region. Typically, we assume an additive genetic model and let $G_{ij} = 0$, 1, or 2 represent the number of copies of the minor allele. Dominant and recessive models can also be considered.

SKAT Model and Test for Linear SNP Effects

For a simple illustration of SKAT, we focus here on testing for a relationship between the variants and the phenotype by using classical multiple linear and logistic regression. We describe how SKAT can incorporate epistatic effects later. To relate the sequence variants in a region to the phenotype, consider the linear model

$y_i = \alpha_0 + \alpha' X_i + \beta' G_i + \varepsilon_i$, (Equation 1)

when the phenotypes are continuous traits, and the logistic model

$\operatorname{logit} P(y_i = 1) = \alpha_0 + \alpha' X_i + \beta' G_i$, (Equation 2)

when the phenotypes are dichotomous (e.g., y = 0/1 for case or control). Here $\alpha_0$ is an intercept term, $\alpha = [\alpha_1, \ldots, \alpha_m]'$ is the vector of regression coefficients for the m covariates, $\beta = [\beta_1, \ldots, \beta_p]'$ is the vector of regression coefficients for the p observed gene variants in the region, and for continuous phenotypes $\varepsilon_i$ is an error term with a mean of zero and a variance of $\sigma^2$. Under both linear and logistic models, evaluating whether the gene variants influence the phenotype, adjusting for covariates, corresponds to testing the null hypothesis $H_0\colon \beta = 0$, that is, $\beta_1 = \beta_2 = \cdots = \beta_p = 0$. The standard p-DF likelihood ratio test has little power, especially for rare variants. To increase the power, SKAT tests $H_0$ by assuming each $\beta_j$ follows an arbitrary distribution with a mean of zero and a variance of $w_j \tau$, where $\tau$ is a variance component and $w_j$ is a prespecified weight for variant j.
One can easily see that testing $H_0\colon \beta = 0$ is equivalent to testing $H_0\colon \tau = 0$, which can be conveniently tested with a variance-component score test in the corresponding mixed model; this is known to be a locally most powerful test.25 A key advantage of the score test is that it only requires fitting the null model $y_i = \alpha_0 + \alpha' X_i + \varepsilon_i$ for continuous traits and $\operatorname{logit} P(y_i = 1) = \alpha_0 + \alpha' X_i$ for dichotomous traits. Specifically, the variance-component score statistic is

$Q = (y - \hat{\mu})' K (y - \hat{\mu})$, (Equation 3)

where $K = G W G'$, $\hat{\mu}$ is the predicted mean of y under $H_0$, that is $\hat{\mu} = \hat{\alpha}_0 + X\hat{\alpha}$ for continuous traits and $\hat{\mu} = \operatorname{logit}^{-1}(\hat{\alpha}_0 + X\hat{\alpha})$ for dichotomous traits; and $\hat{\alpha}_0$ and $\hat{\alpha}$ are estimated under the null model by regressing y on only the covariates X. Here G is an n × p matrix with the (i, j)-th element being the genotype of
  • 26. variant j of subject i, and $W = \operatorname{diag}(w_1, \ldots, w_p)$ contains the weights of the p variants. In fact, K is an n × n matrix with the (i, i')-th element equal to $K(G_i, G_{i'}) = \sum_{j=1}^{p} w_j G_{ij} G_{i'j}$. $K(\cdot,\cdot)$ is called the kernel function, and $K(G_i, G_{i'})$ measures the genetic similarity between subjects i and i' in the region via the p markers. This particular form of $K(\cdot,\cdot)$ is called the weighted linear kernel function. We later discuss other choices of the kernel to model epistatic effects. Good choices of weights can improve power. Each weight $w_j$ is prespecified using only the genotypes, covariates, and external biological information (that is, it is estimated without using the outcome) and reflects the relative contribution of the j-th variant to the score statistic: if $w_j$ is close to zero, then the j-th variant makes only a small contribution to Q. Thus, decreasing the weight of noncausal variants and increasing the weight of causal variants can yield improved power. Because in practice we do not know which variants are causal, we propose to set $\sqrt{w_j} = \mathrm{Beta}(\mathrm{MAF}_j; a_1, a_2)$, the beta distribution density function with prespecified parameters $a_1$ and $a_2$ evaluated at the sample minor-allele frequency (MAF) (across cases and controls combined) for the j-th variant in the data. The beta density is flexible and can accommodate a broad range of scenarios. For example, if rarer variants are expected to be more likely to have larger effects, then setting $0 < a_1 \le 1$ and $a_2 \ge 1$ allows for increasing the weight of rarer variants and decreasing the weight of common variants. We suggest setting $a_1 = 1$ and $a_2 = 25$ because it increases the weight of rare variants while still putting decent nonzero weights on variants with MAF 1%–5%. All simulations were conducted with this default choice unless stated otherwise. Note that a smaller $a_1$ results in more strongly increasing the weight of rarer variants.
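The beta-density weighting described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' software; the function names and the example MAF values are assumptions, and the weight is taken to be the square of the density because the text defines the square root of the weight as the density.

```python
import math

def beta_density(x, a1, a2):
    """Beta(a1, a2) probability density at x, for 0 < x < 1."""
    log_norm = math.lgamma(a1 + a2) - math.lgamma(a1) - math.lgamma(a2)
    return math.exp(log_norm + (a1 - 1) * math.log(x) + (a2 - 1) * math.log(1 - x))

def skat_weight(maf, a1=1.0, a2=25.0):
    """Weight w_j, taking sqrt(w_j) to be the beta density at the sample MAF."""
    return beta_density(maf, a1, a2) ** 2

# Hypothetical sample MAFs, from very rare to common.
mafs = [0.001, 0.01, 0.05, 0.2]
weights = [skat_weight(m) for m in mafs]
for maf, w in zip(mafs, weights):
    print(f"MAF={maf:<6} weight={w:.4g}")
```

With the suggested defaults the density reduces to 25(1 - MAF)^24, so the weight falls steadily as MAF grows while staying well above zero in the 1%–5% range. Monomorphic sites (MAF of exactly 0) are outside the domain of this sketch and would need to be dropped beforehand.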
Examples of weights across a range of $a_1$ and $a_2$ values are presented in Figure S1, available online. Note that $a_1 = a_2 = 1$ corresponds to $w_j = 1$, that is all variants are weighted equally, and $a_1 = a_2 = 0.5$ corresponds to $\sqrt{w_j} = 1/\sqrt{\mathrm{MAF}_j (1 - \mathrm{MAF}_j)}$, that is $w_j$ is the inverse of the variance of the genotype of marker j, which puts almost zero weight for MAFs > 1% and can be used if one believes only variants with MAF < 1% are likely to be causal. Note that SKAT calculated with this weight is identical to the unweighted SKAT test with the standardized genotypes in Equations 1 and 2. Other forms of the weight as a function of MAF can also be used. Because SKAT is a score test, the type I error is protected for any choice of prechosen weights. Note that the weights used in the weighted sum test13 involve phenotype information and will therefore alter the null distribution of SKAT if such weights are used. Under the null hypothesis, Q follows a mixture of chi-square distributions, which can be closely approximated with the computationally efficient Davies method.26 See Appendix A for details. A special case of SKAT arises when the outcome is dichotomous, no covariates are included, and all $w_j = 1$. Under these conditions, we show in Appendix A that the SKAT test statistic Q is equivalent to the C-alpha test statistic T. Hence, the C-alpha test can be seen as a special case of SKAT, or alternatively, SKAT can be seen as a generalized C-alpha test that does not require permutation but calculates the p value analytically, allows for covariate adjustment, and accommodates either dichotomous or continuous phenotypes.
Because SKAT under flat weights is also equivalent to the kernel machine regression test23,24 and because the kernel machine regression test is in turn related to the SSU test,27 it follows transitively that SKAT under flat weights, the kernel machine regression test, the SSU test, and the C-alpha test are all equivalent and special cases of SKAT. Note that the null distribution is calculated differently via these methods, and SKAT gives more accurate analytic p values, especially in the extreme tail, when sample sizes are sufficient.

Relationship between Linear SKAT and Individual Variant Test Statistics

One can efficiently compute the test statistic Q by exploiting a close connection between the SKAT score test statistic Q and the individual variant test statistics. In particular, Q is a weighted sum of the individual score statistics for testing for individual variant effects. Hence, by letting $g_j = [G_{1j}, G_{2j}, \ldots, G_{nj}]'$ denote the n × 1 vector containing the genotypes of the n subjects for variant j, it is straightforward to see that $Q = \sum_{j=1}^{p} w_j S_j^2$, where $S_j = g_j'(y - \hat{\mu}_0)$ is the individual score statistic for testing the marginal effect of the j-th marker ($H_0\colon \beta_j = 0$) under the individual linear or logistic regression model of $y_i$ on $X_i$ and only the j-th variant $G_{ij}$: $y_i = \alpha_0 + X_i'\alpha + \beta_j G_{ij} + \varepsilon_i$ for continuous phenotypes and $\operatorname{logit} P(y_i = 1) = \alpha_0 + X_i'\alpha + \beta_j G_{ij}$ for dichotomous phenotypes. $\hat{\mu}_0$ is estimated as $\hat{\mu}_0 = \hat{\alpha}_0 + X_i'\hat{\alpha}$ for continuous traits and $\hat{\mu}_0 = \operatorname{logit}^{-1}(\hat{\alpha}_0 + X_i'\hat{\alpha})$ for dichotomous traits. Because SKAT is a score test, one needs to fit the null model only a single time to be able to compute the $S_j$ for all individual variants j as well as all regions to be tested. Similarly, if multiple regions are under consideration, then the same $\hat{\mu}_0$ can be used to compute the SKAT Q statistics for each region.
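The identity between the kernel form of Q and the weighted sum of squared single-variant scores can be checked numerically. The sketch below makes simplifying assumptions that are not in the paper: a continuous trait, no covariates (so the null model is an intercept only and the fitted null mean is the sample mean of y), a hypothetical function name, and toy data. It computes the statistic only; obtaining a p value from the mixture of chi-square distributions is not shown.

```python
def skat_q_linear(y, genotypes, weights):
    """SKAT Q for a continuous trait under an intercept-only null model.

    y: phenotype per subject; genotypes: one list per subject of
    minor-allele counts; weights: one weight per variant.
    Returns Q computed two ways: from per-variant scores and from the kernel.
    """
    n, p = len(y), len(weights)
    mu0 = sum(y) / n                      # fitted mean under the null model
    resid = [yi - mu0 for yi in y]

    # Route 1: Q = sum_j w_j * S_j^2 with S_j = g_j' (y - mu0)
    scores = [sum(genotypes[i][j] * resid[i] for i in range(n)) for j in range(p)]
    q_scores = sum(w * s * s for w, s in zip(weights, scores))

    # Route 2: Q = (y - mu0)' K (y - mu0) with K = G W G'
    q_kernel = 0.0
    for i in range(n):
        for k in range(n):
            k_ik = sum(weights[j] * genotypes[i][j] * genotypes[k][j] for j in range(p))
            q_kernel += resid[i] * k_ik * resid[k]
    return q_scores, q_kernel

# Toy data: four subjects, two variants.
y = [1.0, 2.0, 4.0, 5.0]
genotypes = [[0, 1], [1, 0], [0, 0], [2, 0]]
weights = [2.0, 0.5]
q_scores, q_kernel = skat_q_linear(y, genotypes, weights)
print(q_scores, q_kernel)
```

Route 1 needs only p inner products of length n, whereas route 2 builds every entry of the n × n kernel, which is why the score form is the cheaper way to evaluate the linear kernel.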
Accommodating Epistatic Effects and Prior Information under the SKAT

An attractive feature of SKAT is the ability to model the epistatic effects of sequence variants on the phenotype within the flexible kernel machine regression framework.28–30 To do so, we replace $\beta' G_i$ by a more flexible function $f(G_i)$ in the linear and logistic models (1) and (2) where $f(G_i)$ allows for rare variant by rare variant and common variant by rare-variant interactions. Specifically, for continuous traits we use the semiparametric linear model23,29

$y_i = \alpha_0 + \alpha' X_i + f(G_i) + \varepsilon_i$, (Equation 4)

and for dichotomous traits, we use the semiparametric logistic model24,30

$\operatorname{logit} P(y_i = 1) = \alpha_0 + \alpha' X_i + f(G_i)$. (Equation 5)

Here the variants, $G_i$, are related to the phenotype through a possibly nonparametric function $f(\cdot)$, which is assumed to lie in a functional space generated by a positive semidefinite kernel function $K(\cdot,\cdot)$. Models (1) and (2) assume linear genetic effects and are specified by $K(G_i, G_{i'}) = \sum_{j=1}^{p} w_j G_{ij} G_{i'j}$. By changing $K(\cdot,\cdot)$, one can allow for more complex models. Intuitively, $K(G_i, G_{i'})$ is a function that measures genetic similarity between the i-th and i'-th subjects via the p variants in the region, and any positive semidefinite function $K(G_i, G_{i'})$ can be used as a kernel function. We tailored several useful and commonly used kernels specifically for the purpose of rare-variant analysis: the weighted linear kernel, the weighted quadratic kernel, and the weighted identity by state (IBS) kernel. The weighted linear kernel function $K(G_i, G_{i'}) = \sum_{j=1}^{p} w_j G_{ij} G_{i'j}$ implies that the trait depends on the variants in a linear fashion and is equivalent to the classical linear and logistic model presented in Equations 1 and 2. The weighted quadratic kernel $K(G_i, G_{i'}) = \left(1 + \sum_{j=1}^{p} w_j G_{ij} G_{i'j}\right)^2$ implicitly assumes that the model depends on the main effects and quadratic terms for the gene
  • 27. variants and the first-order variant by variant interactions. The weighted IBS kernel, $K(G_i, G_{i'}) = \sum_{j=1}^{p} w_j\, \mathrm{IBS}(G_{ij}, G_{i'j})$, defines similarity between individuals as the number of alleles that share IBS. For additively coded autosomal genotype data, $K(G_i, G_{i'}) = \sum_{j=1}^{p} w_j \left(2 - |G_{ij} - G_{i'j}|\right)$. The model implied by the weighted IBS kernel models the SNP effects nonparametrically.31 Consequently, this allows for epistatic effects because the function $f(\cdot)$ does not assume linearity or interactions of a particular order (e.g., the second order). Using the weighted IBS kernel removes the assumption of additivity because the number of alleles that are identical by state is a physical quantity that does not change on the basis of different genotype encodings.
We note that a kernel function that better captures both the similarity between individuals and the causal variant effects will increase power. In particular, if relationships are linear and no interactions are present, then the weighted linear kernel will have the highest power. If interactions are present, the weighted quadratic and weighted IBS kernels can increase power. Our experience suggests using the IBS kernel when the number of interacting variants within the region is modest. As our understanding of genetic architecture improves, so too will our knowledge of which kernel to use.
In each of the above kernels, $w_j$ is an allele-specific weight that controls the relative importance of the j-th variant and might be a function of factors such as allele frequency or anticipated functionality. Without prior information, we suggest using $\sqrt{w_j} = \mathrm{Beta}(\mathrm{MAF}_j; 1, 25)$, as proposed earlier. However, if prior information is available, for example if some variants are predicted to be functional or damaging via Polyphen32 or Sift,33 weights can be chosen to upweight the variants that are more likely to be functional.
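The three kernels and the default beta-density weights described above can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, genotypes are assumed to be additively coded (0, 1, 2) with no missing values, and every variant is assumed to have a MAF strictly between 0 and 1.

```python
import math
import numpy as np

def beta_density(x, a, b):
    """Beta(a, b) density evaluated at x (requires 0 < x < 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return np.exp(log_norm + (a - 1) * np.log(x) + (b - 1) * np.log1p(-x))

def skat_weights(maf, a=1.0, b=25.0):
    """w_j such that sqrt(w_j) = Beta(MAF_j; a, b)."""
    return beta_density(np.asarray(maf, dtype=float), a, b) ** 2

def weighted_linear_kernel(G, w):
    return (G * w) @ G.T                            # sum_j w_j G_ij G_i'j

def weighted_quadratic_kernel(G, w):
    return (1.0 + (G * w) @ G.T) ** 2               # (1 + sum_j w_j G_ij G_i'j)^2

def weighted_ibs_kernel(G, w):
    diff = np.abs(G[:, None, :] - G[None, :, :])    # |G_ij - G_i'j|, shape (n, n, p)
    return ((2.0 - diff) * w).sum(axis=2)           # sum_j w_j (2 - |G_ij - G_i'j|)

# Hypothetical toy genotypes: 3 subjects, 3 variants.
G = np.array([[0, 1, 0],
              [1, 0, 2],
              [0, 0, 0]], dtype=float)
maf = G.mean(axis=0) / 2.0                          # observed allele frequencies
w = skat_weights(maf)

K_lin = weighted_linear_kernel(G, w)
K_quad = weighted_quadratic_kernel(G, w)
K_ibs = weighted_ibs_kernel(G, w)
```

The IBS kernel here materializes an n × n × p array, which is fine for a sketch but would need a per-variant loop for large samples.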
To test for the effects of gene variants in a region on a phenotype, one tests the null hypothesis $H_0: f(G) = 0$. SKAT tests this null hypothesis by assuming that the $n \times 1$ vector $f = [f(G_1), \ldots, f(G_n)]'$ of the genetic effects of the n subjects follows a distribution with mean zero and covariance $\tau K$, where $\tau$ is a variance component that indexes the effects of the variants.29,30 Hence, we can test the null hypothesis by testing $H_0: \tau = 0$ with a variance-component score test. In particular, we simply replace K in Equation 3 with the K discussed in this section, for example, the weighted IBS kernel, for epistatic effects. All subsequent calculations for computing a p value remain the same. Because SKAT evaluates significance via a score test, which operates under the null hypothesis, SKAT is valid (in terms of protecting type I error) irrespective of the kernel and the weights used. Good choices of the kernel and the weights simply increase power.
Planning New Sequencing-Based Association Studies: Estimation of Power and Sample Size
Power and sample-size calculations are important in designing sequencing studies of complex traits. Using a modification of the higher-order moment-approximation method,34 we provide an analytic method to carry out such calculations efficiently for SKAT.35 Specifically, for a fixed sample size and $\alpha$ level, given a prior hypothesis on the genetic architecture of a particular region, the effect size, and the proportion and number of causal variants within a region, our method provides the power to detect the region as significant with SKAT. Similarly, if the desired power is fixed, the approach can be used to find the necessary sample size. There are key differences between the power and sample-size estimation for single-variant- and region (set)-based tests.
For a region (set)-based test, the power depends strongly on the underlying genetic architecture, and its estimation requires modeling this genetic architecture and the linkage disequilibrium (LD) between variants. Therefore, estimating the power to detect a particular region as associated with a phenotype requires specifying the significance level, the sample size, which variants in the region are causal and their corresponding effect sizes, and the LD structure of the variants in the region. Ideally, one could use prior data to assess the LD and MAF. Because prior data can be difficult to obtain, we currently recommend the use of either 1000 Genomes Project data36 or data simulated under a population genetics model.37 Relevant preliminary data will become increasingly available as sequencing studies become more common. Our SKAT software uses simulated data based on the coalescent population genetic model (released with the software package) as a default in performing sample-size and power calculations. Instead of directly specifying the effects of any given variant, the user can input an MAF threshold for determining which variants are regarded as rare and also a proportion determining how many of the rare variants are causal. The causal variants are then randomly selected from the alleles with true MAF (based on simulated or preliminary data) less than the threshold. The magnitudes of the effects $|\beta_j|$ for causal variants are set equal to $c \times |\log_{10} \mathrm{MAF}_j|$, where c is determined on the basis of the maximum effect size the user would like to allow (described below in the power simulations section) at $\mathrm{MAF} = 10^{-4}$. This makes the effects of causal variants decrease as MAF increases. Because these parameters can be difficult to choose a priori, power and sample size can be reasonably estimated by averaging results over a range of parameter values.
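The rule for selecting causal variants and assigning effect sizes can be sketched as follows. This is a hedged illustration rather than the SKAT software's implementation: the function name, its arguments, the random-number handling, and the way signs are assigned are all assumptions, and every MAF is assumed to be strictly positive.

```python
import numpy as np

def assign_effects(maf, max_effect, maf_threshold=0.03, causal_fraction=0.05,
                   frac_positive=1.0, ref_maf=1e-4, seed=0):
    """Pick causal rare variants at random and set |beta_j| = c * |log10 MAF_j|.

    c is chosen so that a variant with MAF = ref_maf has |beta| = max_effect.
    """
    rng = np.random.default_rng(seed)
    maf = np.asarray(maf, dtype=float)
    c = max_effect / abs(np.log10(ref_maf))
    rare = np.flatnonzero(maf < maf_threshold)         # candidate causal variants
    n_causal = int(round(causal_fraction * rare.size))
    causal = rng.choice(rare, size=n_causal, replace=False)
    beta = np.zeros_like(maf)
    beta[causal] = c * np.abs(np.log10(maf[causal]))
    n_negative = int(round((1.0 - frac_positive) * n_causal))
    beta[causal[:n_negative]] *= -1.0                  # protective variants
    return c, beta

# Hypothetical region with 40 variants; a quarter of the rare ones are causal.
maf = np.array([1e-4, 1e-3, 0.01, 0.02, 0.2] * 8)
c, beta = assign_effects(maf, max_effect=1.6, causal_fraction=0.25,
                         frac_positive=0.8)
```

With a maximum effect of 1.6 at MAF = 10^-4, the constant works out to c = 0.4, which matches the value quoted for the continuous-trait simulations.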
Similarly, because the regional architecture can vary across different regions, for genome-wide studies, one can average over multiple randomly selected regions, as currently implemented in the SKAT software.
Numerical Experiments and Simulations
To validate SKAT in terms of protecting type I error and to assess its power compared to burden tests and the accuracy of our power and sample-size tools, we carried out simulation studies under a range of configurations. For all simulations, we determined sequence genotypes by simulating 10,000 chromosomes for a 1 Mb region on the basis of a coalescent model that mimics the LD pattern, local recombination rate, and population history for Europeans by using COSI.37
Type I Error Simulations
To investigate whether SKAT preserves the desired type I error rate at the near genome-wide threshold level, for example $\alpha = 10^{-6}$, it is necessary to conduct simulations with hundreds of millions of simulated datasets. Although SKAT is computationally efficient, generating such a large number of datasets is challenging. To reduce the computational burden, we took the following approach. Using 10,000 randomly selected sets of 30 kb subregions within a 1 Mb chromosome, we first generated 10,000 sets of genotypes $G_{n \times p}$ from the coalescent model, with p variants on n subjects. Then, for each of the 10,000 simulated genotype data sets, we simulated 10,000 sets of continuous phenotypes, such that we obtained $10^8$ individual genotype-phenotype data sets, by using the model
$$y = 0.5 X_1 + 0.5 X_2 + \varepsilon,$$
where $X_1$ is a continuous covariate generated from a standard normal distribution, $X_2$ is a dichotomous covariate taking values 0 and 1 with a probability of 0.5, and $\varepsilon$ follows a standard normal distribution. Note that the continuous trait values are not related to the genotype so that the null model holds. The 30 kb regions on
  • 28. which the genotype values are based contained 605 variants on average, but the number of observed variants for any given data set was considerably less and depended on the sample size n, which we set to 500, 1000, 2500, and 5000. We repeated the type I error simulations for dichotomous phenotypes as above, except the dichotomous outcomes were generated via the model
$$\mathrm{logit}\, P(y = 1) = \alpha_0,$$
where $\alpha_0$ was determined to set the prevalence to 1% and case-control sampling is used. For both continuous and dichotomous simulations, we applied SKAT by using the default weighted linear kernel to each of the $10^8$ data sets and estimated the empirical type I error rate as the proportion of p values less than $\alpha = 10^{-4}$, $10^{-5}$, or $10^{-6}$. We note that the estimated type I error from this approach is not the same as the empirical type I error when genotypes are generated randomly for each simulation, because for each of the 10,000 genotype data sets, only the outcomes are resampled. However, our type I error estimator is still unbiased and results in very accurate type I error estimates. For larger $\alpha$ levels (0.05 and 0.01), we directly computed the empirical type I error rate by using data sets in which genotypes were randomly generated for each simulation.
Empirical Power Simulations
We simulated data sets in which 30 kb subregions were randomly selected from the generated 1 Mb chromosomes and used to create causal variants and a phenotype variable as well as additional simulated covariates. We generated continuous phenotypes by
$$y = 0.5 X_1 + 0.5 X_2 + \beta_1 G^c_1 + \beta_2 G^c_2 + \cdots + \beta_s G^c_s + \varepsilon,$$
where $X_1$, $X_2$, and $\varepsilon$ are as defined for the type I error simulations, $G^c_1, G^c_2, \ldots, G^c_s$ are the genotypes of the s causal rare variants (a randomly selected subset of the simulated rare variants, for example 5% of variants that have MAF < 3% in Figure 1), and the $\beta$s are effect sizes for the causal variants. 
[Figure 1: six panels of power (0.0 to 1.0) versus total sample size (0.5k, 1k, 2.5k, 5k). Columns: β +/− = 100/0, 80/20, and 50/50. Top row: continuous trait; bottom row: dichotomous trait. Curves: SKAT, SKAT_M, rSKAT, W, N, C.]
Figure 1. Simulation-Study-Based Power Comparisons of SKAT and Burden Tests
Empirical power at $\alpha = 10^{-6}$ under an assumption that 5% of the rare variants with MAF < 3% within random 30 kb regions were causal. Top panel: continuous phenotypes with maximum effect size ($|\beta|$) equal to 1.6 when $\mathrm{MAF} = 10^{-4}$; bottom panel: case-control studies with maximum OR = 5 when $\mathrm{MAF} = 10^{-4}$. Regression coefficients for the s causal variants were assumed to be a decreasing function of MAF as $|\beta_j| = c\,|\log_{10} \mathrm{MAF}_j|$ ($j = 1, \ldots, s$ [see Figure S2]), where c was chosen to result in these maximum effect sizes. From left to right, the plots consider settings in which the coefficients for the causal rare variants are 100% positive (0% negative), 80% positive (20% negative), and 50% positive (50% negative). Total sample sizes considered are 500, 1000, 2500, and 5000, with half being cases in case-control studies. For each setting, six methods are compared: SKAT, SKAT in which 10% of the genotypes were set to missing and then imputed (SKAT_M), restricted SKAT (rSKAT) in which unweighted SKAT is applied to variants with MAF < 3%, the weighted sum burden test (W) with the same weights as used by SKAT, counting-based burden test (N), and the CAST method (C). All the burden tests used MAF < 3% as the threshold. For each method, power was estimated as the proportion of p values < $\alpha$ among 1000 simulated data sets.
Similarly, we
  • 29. generated dichotomous phenotypes for case-control data under the logistic model
$$\mathrm{logit}\, P(y = 1) = \alpha_0 + 0.5 X_1 + 0.5 X_2 + \beta_1 G^c_1 + \beta_2 G^c_2 + \cdots + \beta_s G^c_s,$$
where $G^c_1, G^c_2, \ldots, G^c_s$ are again the genotypes for the causal rare variants and the $\beta$s are log ORs for the causal variants. We controlled prevalence by $\alpha_0$ and set it to 1% unless otherwise stated. Under both models, we set the magnitude of each $\beta_j$ to $c\,|\log_{10} \mathrm{MAF}_j|$ such that rarer variants had larger effects. In the simulation studies, for continuous traits, c = 0.4, which gives the maximum effect size $|\beta_j| = 1.6$ for variants with $\mathrm{MAF} = 10^{-4}$ and small effects $|\beta_j| = 0.28$ for MAF = 0.2. For dichotomous traits, c = ln 5 / 4 = 0.402, which gives the "maximum" OR = 5.0 ($|\beta_j| = \ln 5$) for variants with $\mathrm{MAF} = 10^{-4}$ and a smaller OR = 1.32 for MAF = 0.2. The effect size curves are given in Figure S2.
We compared SKAT, an unsupervised variation on the WST13 that uses weighted-count-based collapsing, counting-based collapsing,18 and CAST.14 For each of these tests, we considered variants with observed MAF < 3% as rare: CAST collapses on the basis of whether an individual carries any variant with allele frequency < 3%, the counting method counts the number of variants with MAF < 3%, and the weighted count inflates the contribution of each rare variant by multiplying the genotype with the same beta-density-based weights as used in SKAT. To accommodate missing genotypes commonly observed in sequence data, we considered the effect of imputing missing values by randomly setting 10% of the genotypes as missing, imputing genotypes on the basis of observed allele frequencies and Hardy-Weinberg equilibrium, and then applying SKAT to the imputed data. We also performed restricted SKAT (rSKAT) by applying unweighted SKAT to rare variants with MAF < 3%. Note that for dichotomous phenotypes, rSKAT is essentially equivalent to a covariate-adjusted C-alpha test with the p value calculated analytically instead of via permutation.
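The three collapsing scores and the simple imputation step described above can be sketched as below. Several details are assumptions that the text does not pin down: the counting score is taken to be the sum of rare-variant genotypes, missing genotypes are coded as NaN and imputed by a random draw from Binomial(2, observed allele frequency), the coded allele is assumed to be the minor allele, and all function names are hypothetical.

```python
import numpy as np

def observed_maf(G):
    """Per-variant allele frequency, ignoring missing (NaN) genotypes."""
    return np.nanmean(G, axis=0) / 2.0

def burden_scores(G, w, maf_threshold=0.03):
    """Per-subject collapsing scores over variants with observed MAF below the threshold."""
    rare = observed_maf(G) < maf_threshold
    G_rare = G[:, rare]
    count = G_rare.sum(axis=1)              # counting-based score (N)
    cast = (count > 0).astype(float)        # CAST indicator: carries any rare variant (C)
    weighted = G_rare @ w[rare]             # weighted count (W)
    return cast, count, weighted

def impute_hwe(G, seed=0):
    """Fill NaN genotypes with draws from Binomial(2, observed allele frequency)."""
    rng = np.random.default_rng(seed)
    G = G.copy()
    p_hat = observed_maf(G)
    rows, cols = np.nonzero(np.isnan(G))
    G[rows, cols] = rng.binomial(2, p_hat[cols])
    return G

# Hypothetical toy data: 40 subjects, two rare variants and one common variant.
G = np.zeros((40, 3))
G[0, 0] = 1                                 # variant 0: 1 of 80 alleles
G[0, 1] = G[1, 1] = 1                       # variant 1: 2 of 80 alleles
G[:20, 2] = 1                               # variant 2: common (MAF 0.25)
w = np.array([3.0, 2.0, 1.0])               # arbitrary illustrative weights
cast, count, weighted = burden_scores(G, w)

G_missing = G.copy()
G_missing[5, 2] = np.nan
G_imputed = impute_hwe(G_missing)
```

Because each score is a single number per subject, opposite-direction effects within a region can cancel, which is the sensitivity to direction that the burden tests show in the simulations.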
For each of the methods, power was estimated as the proportion of p values < $\alpha$, where $\alpha = 10^{-6}$ to mimic genome-wide studies.
Power and Sample-Size Formulae
To demonstrate the utility and accuracy of our power and sample-size calculation method, we conducted several numerical experiments. We first illustrated the use of the methods by computing the sample size necessary to detect a 30 kb region with 5% of the variants with MAF < 3% being causal. We assume effect size (OR) increases with decreasing MAF, and seek 80% power at significance levels $\alpha = 10^{-6}$, $10^{-3}$, and $10^{-2}$, corresponding to approximate genome-wide sequencing significance and candidate-gene-sequencing studies of 50 and five genes, respectively. We considered both continuous and dichotomous traits.
To show that the power estimated from our sample-size formula is accurate, we compared empirical power for SKAT under simulations to power estimated via our analytic method. Specifically, we simulated continuous and case-control data under the same setting as that used in the power simulations, and we estimated power as a function of the sample size by computing the proportion of p values < $\alpha = 10^{-6}$ and compared the empirical power curve to the power estimated by using our analytical method.
Results
Simulation of the Type I Error
The empirical type I error rates estimated for SKAT are presented in Table 1 for $\alpha = 10^{-4}$, $10^{-5}$, and $10^{-6}$ and suggest the type I error rate is protected for continuous phenotypes, though for smaller sample sizes SKAT can be slightly conservative. For dichotomous phenotypes, SKAT is conservative for smaller sample sizes and very small $\alpha$ levels. Additional results from simulations of the type I error for SKAT and the competing methods are presented in Figure S3 for both continuous traits and dichotomous traits and show that at larger $\alpha$ levels, all of the considered tests correctly control the type I error at the $\alpha = 0.05$ and 0.01 levels.
These results show that SKAT is a valid method, and despite being conservative at low $\alpha$ levels, SKAT maintains good power relative to existing methods (see below). However, if sample sizes are small or sharp control of type I error is necessary, then standard permutation-based procedures can be used to generate a Monte Carlo p value for significance, though this can be computationally expensive, and in the presence of covariates (for example, when controlling for population stratification) standard permutation does not work and requires careful modification.
Table 1. Type I Error Estimates of SKAT Aimed at Testing an Association between Randomly Selected 30 kb Regions with a Continuous Trait at Type I Error Rates as Low as the Genome-wide $\alpha = 10^{-6}$ Level
| Total Sample Size (n) | Continuous, $\alpha = 10^{-4}$ | Continuous, $\alpha = 10^{-5}$ | Continuous, $\alpha = 10^{-6}$ | Dichotomous, $\alpha = 10^{-4}$ | Dichotomous, $\alpha = 10^{-5}$ | Dichotomous, $\alpha = 10^{-6}$ |
| 500 | $7.4 \times 10^{-5}$ | $6.5 \times 10^{-6}$ | $5.9 \times 10^{-7}$ | $2.2 \times 10^{-5}$ | $1.0 \times 10^{-6}$ | $1.0 \times 10^{-8}$ |
| 1000 | $8.5 \times 10^{-5}$ | $8.2 \times 10^{-6}$ | $8.0 \times 10^{-7}$ | $5.0 \times 10^{-5}$ | $3.5 \times 10^{-6}$ | $2.3 \times 10^{-7}$ |
| 2500 | $9.6 \times 10^{-5}$ | $9.1 \times 10^{-6}$ | $8.4 \times 10^{-7}$ | $7.6 \times 10^{-5}$ | $6.3 \times 10^{-6}$ | $5.6 \times 10^{-7}$ |
| 5000 | $9.8 \times 10^{-5}$ | $9.6 \times 10^{-6}$ | $8.8 \times 10^{-7}$ | $8.9 \times 10^{-5}$ | $7.8 \times 10^{-6}$ | $7.0 \times 10^{-7}$ |
Each entry represents type I error rate estimates as the proportion of p values < $\alpha$ under the null hypothesis based on $10^8$ simulated phenotypes.
Statistical Power of SKAT and Competing Methods
We compared the power of SKAT with three burden tests in a series of simulation studies for both continuous traits and dichotomous traits by generating sequence data in randomly selected 30 kb regions with a coalescent model.37 For our primary power simulation, within each region, 5% of variants with population MAF < 3% were randomly chosen as causal, the effect size of causal variants was a decreasing function of MAF, and 50%–100% of the causal variants were positively associated with the trait
  • 30. (See Materials and Methods and Figure S2). The simulated regions for our power analysis contained on average 605 variants (26 causal), of which 530.9 (88%), 502.9 (83%), and 422.8 (70%) had population MAF < 3%, < 1%, and < 0.1%, respectively. The average allele frequency spectrum across the samples is similar to that of the Dallas Heart Study data (Figure S4). Because the majority of variants have a low MAF, they might not be observed in any particular sample. The average number of observed variants (assuming no genotyping error) and the average number of observed causal variants are presented in Table 2.
For continuous traits, SKAT had much higher power than all the burden tests, and the weighted count method tended to outperform the count and CAST methods (Figure 1). SKAT's power was robust to the proportion of causal variants that were positively associated with the trait, whereas the burden tests suffered substantial loss of power when causal variants had opposite effects.
The simulation results examining dichotomous traits were qualitatively similar in that SKAT dominated the competing methods. However, here the power of SKAT decreased when both protective and harmful variants were present, although less so than for the burden tests. The difference in power for SKAT for different proportions of protective variants is due to the fact that, given fixed population MAFs, protective variants imply negative log ORs and lower disease risk, and hence lower MAFs in cases and more difficulty in observing rare variants in cases. The larger decrease in power for the competing methods is additionally driven by sensitivity to direction of effect due to aggregation of genotypes.
Across all configurations, using imputed genotypes instead of the true genotype for 10% missing genotype data led to a very small reduction in power, despite the use of a very simple Hardy-Weinberg-based imputation strategy. This is true in part because most variants are rare.
Note that SKAT increases the weight of rare variants but does not require thresholding. To show that the superior performance of SKAT is intrinsic and is not driven by the particular choice of the weight used, we calculated rSKAT, which does not weight the rare variants but instead uses the same threshold as the burden tests. Our results, presented in Figure 1, show that rSKAT is still substantially more powerful than all three burden tests. Power simulation results for other type I error rates ($\alpha$ = 0.01, 0.001), lower causal variant frequencies (population MAF < 1%), and other region sizes (10 kb and 60 kb) yielded the same conclusions (Figures S5–S8).
In the 30 kb genomic regions considered, reflecting analysis of genome-wide sequencing data, it is unlikely that a large proportion of the rare variants are all causal. However, for exome-scale sequencing, the number of observed rare variants can be considerably smaller and the proportion of causal rare variants can be greater. Hence, we also conducted power simulations for smaller region sizes (3 kb and 5 kb) and larger proportions of causal variants (10%, 20%, and 50%). Results for both continuous and dichotomous phenotypes are presented in Figures S9–S12 and show that if 50% of the rare variants are causal and all of the causal variants have effects in the same direction, then SKAT and rSKAT are less powerful than collapsing methods, with count-based collapsing having the greatest power. This result held for both 3 kb and 5 kb regions and is expected because the collapsing methods implicitly assume that all of the variants are causal and have unidirectional effects. In all other settings we considered, SKAT was the most powerful method.
Power and Sample-Size Estimation
To illustrate our power and sample-size calculation method, in Figure 2 we present the estimated sample-size curves as a function of maximum effect sizes (ORs for dichotomous traits) necessary to detect a 30 kb region with 5% of the variants with MAF < 3% being causal. Table 3 presents estimated sample sizes for several configurations of practical interest. Additional sample-size curves when causal variants are rarer (MAF < 1%) or occur more frequently (10% of variants are causal) or when prevalence is varied (5%, 0.1%) can be found in Figures S13–S15. These results show that, for a given region, one will have more power (and a lower required sample size) to detect rare causal variants if the percentage of variants that are causal is higher, the causal rare variants have higher MAFs and/or larger effect sizes (e.g., odds ratios [ORs]), and the effects are more consistently in the same direction. For case-control designs, lower prevalence yields higher power: given the same OR and population MAF, lower prevalence enriches the sample for harmful variants (ORs > 1), which therefore have higher MAFs across cases and controls combined. In other words, for rarer diseases, harmful rare variants are more likely to be observed. Conversely, if the prevalence is low, protective variants (ORs < 1) have lower MAFs in the sample, and fewer of them are likely to be observed.
We also compared the power and sample-size formulae estimates to the empirical, simulation-based power estimates for both continuous and dichotomous traits. The curves plotted in Figure 3 show that the empirical power is accurately approximated by our analytical formula.
Table 2. Characteristics of the 30 kb Region Data Sets Used in the Simulation Studies
Average Number of Observed Variants
| Sample Size (n) | 500 | 1000 | 2500 | 5000 |
| All traits* | 255 | 330 | 438 | 512 |
| Continuous trait** | 9.6 | 13.3 | 18.6 | 22.3 |
| Dichotomous trait (β +/− = 100/0)** | 14.4 | 18.7 | 23.5 | 25.2 |
| Dichotomous trait (β +/− = 80/20)** | 13.3 | 17.1 | 22.0 | 24.3 |
| Dichotomous trait (β +/− = 50/50)** | 11.1 | 14.9 | 19.7 | 22.6 |
The number of observed variants* and the number of observed causal variants** within the region are averaged over the 1000 simulated data sets.
  • 31. Application to Dallas Heart Study Data
We analyzed sequence data on 93 variants in ANGPTL3 (MIM 604774), ANGPTL4 (MIM 605910), and ANGPTL5 (MIM 607666) in 3476 individuals from the Dallas Heart Study38 to test for association between log-transformed serum triglyceride (logTG) levels and rare variants in these genes. We adjusted for sex and ethnicity (black, Hispanic, or white) but did not adjust for age because a large number of subjects have missing ages. In addition to testing for association via SKAT and the three burden tests considered earlier, we also applied the permutation-based varying-threshold method (VT) and the Polyphen-score-adjusted VT (VTP),16 which are based on the residuals obtained from regressing the phenotype on the covariates and assume gene-covariate independence. Because VT and VTP require permutation, they are computationally expensive when applied genome wide. For VTP, we used the Polyphen score for rare variants (MAF < 0.01) and assigned a constant score of 0.5 to all other variants. We also analyzed a dichotomized phenotype on the highest and lowest quartiles of each of the six sex-ethnicity groups (Table 4).
Table 3. Required Total Sample Size to Achieve 80% Power to Detect Rare Variants Associated with a Continuous or Dichotomous Case-Control Phenotype at the Genome-wide Level $\alpha = 10^{-6}$
| Total Sample Size | Maximum β = 1.6 / Maximum OR = 5, 5% Causal | Maximum β = 1.6 / Maximum OR = 5, 10% Causal | Maximum β = 1.9 / Maximum OR = 7, 5% Causal | Maximum β = 1.9 / Maximum OR = 7, 10% Causal |
| Continuous trait | 5,990 | 1,800 | 4,260 | 1,290 |
| Dichotomous trait with prevalence 10% | 15,120 | 4,810 | 9,650 | 3,120 |
| Dichotomous trait with prevalence 1% | 12,030 | 3,870 | 7,010 | 2,290 |
Power was estimated via the analytical formulae assuming 5% or 10% of variants with MAF < 3% are causal. Regression coefficients for the s causal variants were assumed to be a decreasing function of MAF, $|\beta_j| = c\,|\log_{10} \mathrm{MAF}_j|$ ($j = 1, \ldots, s$), where 80% of $\beta_j$'s are positive and 20% are negative; see Figure S2. 
Required total sample sizes (cases and controls) are given for different "maximum" effect sizes (or ORs) when $\mathrm{MAF} = 10^{-4}$ and different prevalences for case-control studies. Estimated sample sizes were averaged over 100 random 30 kb regions.
[Figure 2: six panels of total sample size (0 to 10,000) versus maximum effect size. Columns: β +/− = 100/0, 80/20, and 50/50. Top row: continuous trait, x axis max β (1.4 to 2.2); bottom row: dichotomous trait, x axis max OR (5 to 11). Curves: $\alpha = 10^{-6}$, $10^{-3}$, $10^{-2}$.]
Figure 2. Sample Sizes Required for Reaching 80% Power
Analytically estimated sample sizes required for reaching 80% power to detect rare variants associated with a continuous (top panel) or dichotomous phenotype in case-control studies (half are cases) (bottom panel) at the $\alpha = 10^{-6}$, $10^{-3}$, and $10^{-2}$ levels, under the assumption that 5% of rare variants with MAF < 3% within the 30 kb regions are causal. Plots correspond to 100%, 80%, and 50% of the causal variants associated with increase in the continuous phenotype or risk of the dichotomous phenotype. Regression coefficients for the s causal variants were assumed to be the same decreasing function of MAF as that in Figure 1. Required total sample sizes are plotted against the absolute values of the maximum effect sizes (ORs) when $\mathrm{MAF} = 10^{-4}$. Estimated total sample sizes were averaged over 100 random 30 kb regions.
  • 32. SKAT was by far the most powerful test for the dichotomous trait. For continuous traits, SKAT has much smaller p values than two burden methods (CAST and WST) and VT, and has a slightly higher p value than the counting-based burden test (N) and VTP. Note that SKAT was easier to apply because it did not require prior functional information (available for only a subset of variants) or permutation, and it adjusted for covariates without assuming gene-covariate independence.
Computation Time
The computation time for SKAT depends on the sample size and the number of markers. To analyze a 30 kb region sequenced on 1000, 2500, or 5000 individuals, SKAT required 0.21, 0.73, and 2.3 s, respectively, for continuous traits and ~20% longer for dichotomous traits, on a 2.33 GHz laptop with 6 GB of memory. Analyzing 300 kb, 3 Mb, or 3 Gb (the entire genome) on 1000 individuals requires 2.5 s, 25 s, and 7 hr, respectively.
Discussion
We propose SKAT as a supervised, flexible, and computationally efficient statistical method that tests for association between a continuous or dichotomous phenotype and rare and common genetic variants in sequencing-based association studies. We demonstrate that SKAT's power is greater than that of several burden tests over a range of genetic models. Furthermore, we have developed analytical power and sample-size calculations for SKAT that assist in designing sequencing-based association studies.
[Figure 3: two panels of power (0.0 to 1.0) versus total sample size (2000 to 10,000), one for the continuous trait and one for the dichotomous trait. Curves: Theoretical, Empirical.]
Figure 3. Power Comparisons Based on Simulation and Analytic Estimation
Power as a function of total sample size estimated by simulation with 1000 replicates and by the proposed power formula for continuous and dichotomous case-control traits. 
Simulation configurations correspond to those used in Figure 1, in which 80% of the regression coefficients for the causal rare variants were positive.
Table 4. Analysis of the Dallas Heart Study Sequencing Data
| | SKAT | C | N | W | VT(a) | VTP(a) |
| Continuous TG level | $9.5 \times 10^{-5}$ | $1.9 \times 10^{-3}$ | $7.2 \times 10^{-5}$ | $2.3 \times 10^{-4}$ | $3.5 \times 10^{-4}$ | $2.0 \times 10^{-5}$ |
| Dichotomized TG level | $1.3 \times 10^{-4}$ | $3.2 \times 10^{-2}$ | $2.2 \times 10^{-3}$ | $3.1 \times 10^{-3}$ | $8.6 \times 10^{-3}$ | $2.1 \times 10^{-3}$ |
Analysis of the Dallas Heart Study sequencing data with SKAT, the weighted sum burden test (W), the counting-based burden test (N), the CAST method (C), the varying-threshold method (VT), and the Polyphen-score adjusted VT (VTP) method. Beta (1, 25) is used as the weight in the SKAT and the weighted sum test.
(a) p values are estimated on the basis of $10^6$ permutations.
Like burden tests, SKAT performs region-based testing. However, SKAT has several major advantages over the existing tests. As a supervised method, SKAT directly performs multiple regressions of a phenotype on genotypes for all variants in the region, adjusting for covariates. Hence, as with conventional multiple regression models, neither directionality nor magnitudes of the associations are assumed a priori but are instead estimated from the data. To test efficiently for the joint effects of rare variants in the region on the phenotype, SKAT assumes a distribution for the regression coefficients of the markers, whose variances depend on flexible weights. SKAT performs a score-based variance-component test, whose calculation only requires fitting the null model by regressing phenotypes on covariates alone and computing p values analytically. The flexible regression framework also lets us accommodate epistatic effects. Besides region-based analysis, SKAT can also be applied to any biologically meaningful SNP set. As SKAT is a regression-based method, it can be easily extended to survival, longitudinal, and multivariate phenotypes and hence provides a comprehensive framework for a wide variety of sequencing-based association studies.
The ability to obtain a p value directly without the need for permutation is an attractive feature of SKAT, and allows for rapid estimation of p values in exome and genome-wide sequencing studies. Our simulations showed that for continuous phenotypes, the p values are accurate when the sample size is moderate or large; for dichotomous phenotypes, the p values are conservative at lower $\alpha$ levels (e.g., < $10^{-4}$) if the sample size is modest or small. Permutation can be used to obtain a more accurate estimate in the absence of covariates. In the presence of covariates, for example population stratification, standard
  • 33. permutations fail and require careful modifications. Despite the conservative nature of the score test, SKAT often still has higher power than competing methods at small $\alpha$ levels.
SKAT can be combined with collapsing strategies to form a hybrid testing approach. If most of the variants within a range of allele frequencies are causal and have the same directionality (i.e., under settings that are optimal for burden-based tests), collapsing these variants and then applying SKAT to the collapsed variants can improve power. For example, because singletons are common in sequencing studies (57 of 93 variants in the Dallas Heart Study data), a possible hybrid strategy is to first collapse all of the singletons into a single value and then apply SKAT to the collapsed value and the other 36 variants. Compared to the original SKAT, this strategy gives a slightly lower p value, $3.1 \times 10^{-5}$, for the continuous trait and a slightly higher p value, $1.6 \times 10^{-4}$, for the dichotomous trait. Simulation studies showed that the two methods are of similar power under the settings we used to generate Figure 1.
An important feature of SKAT is that it allows for incorporation of flexible weight functions to boost analysis power, for example by increasing the weight of variants with lower MAFs and decreasing the weight of information from variants inferred with lower confidence. Good choices of weights are likely to improve the power of the association test with SKAT, although simulations show that even equal weights can yield high power when combined with thresholding. In our simulation studies, we employed a class of flexible continuous weights as a function of MAF by using the beta function, which increases the weight of rare variants and does not require thresholding. Users can define other types of weight functions. 
To further improve analysis power, one can estimate weights by incorporating information besides MAF, for example by using the PolyPhen score or integrating other annotation information, which will become increasingly available as our understanding of genome variation improves. Therefore, because of its flexibility, SKAT has the capacity to mature, and its power to increase, as the field progresses.
Appendix A
Estimating the Null Distribution for Q
Under the null hypothesis, Q follows a mixture of chi-square distributions.29,30 More specifically, we define $P_0 = V - V\tilde{X}(\tilde{X}'V\tilde{X})^{-1}\tilde{X}'V$, where $\tilde{X}$ is the $n \times (p + 1)$ matrix equal to $[1, X]$. For continuous phenotypes, $V = \hat{\sigma}_0^2 I$, where $\hat{\sigma}_0$ is the estimator of $\sigma$ under the null model where $f(G) = 0$, and $I$ is an $n \times n$ identity matrix. For dichotomous phenotypes, $V = \mathrm{diag}(\hat{\mu}_{01}(1 - \hat{\mu}_{01}), \hat{\mu}_{02}(1 - \hat{\mu}_{02}), \ldots, \hat{\mu}_{0n}(1 - \hat{\mu}_{0n}))$, where $\hat{\mu}_{0i} = \mathrm{logit}^{-1}(\hat{\alpha}_0 + \hat{\alpha}'X_i)$ is the estimated probability that the i-th subject is a case under the null model. Then under the null model
$$Q \sim \sum_{i=1}^{n} \lambda_i \chi^2_{1,i}, \qquad \text{(Equation 6)}$$
where $(\lambda_1, \lambda_2, \ldots, \lambda_n)$ are the eigenvalues of $P_0^{1/2} K P_0^{1/2}$, and the $\chi^2_{1,i}$ are independent $\chi^2_1$ random variables. Several approximation and exact methods have been suggested to obtain the distribution of Q.39 Among these, the Davies exact method,26 based on inverting the characteristic function of Equation 6, appears to work well in practice and is used here.
SKAT Is a Generalization of the C-Alpha Test
The recently proposed C-alpha test has advantages over burden tests in that it explicitly models the possibility that minor alleles can be deleterious or protective. However, it does not currently allow for the analysis of quantitative outcomes or the inclusion of covariates, and p value calculation requires permutation.
We demonstrate that for a dichotomous trait in the absence of covariates, the C-alpha test statistic is equivalent to the SKAT statistic with unweighted linear kernel, which is the same as the kernel machine test in Wu et al.24 Suppose the j-th variant is observed $d_j$ times in the cases, out of $n_j$ times total in cases and controls, and that $p_0 = \sum_{i=1}^{n} y_i / n$. For a dichotomous trait and no covariates, the C-alpha test statistic is
$$T_\alpha = \sum_{j=1}^{p} \left[ (d_j - n_j p_0)^2 - n_j p_0 (1 - p_0) \right]. \qquad \text{(Equation 7)}$$
Denote $T_\alpha^1 = \sum_{j=1}^{p} (d_j - n_j p_0)^2$. Because $\sum_{j=1}^{p} n_j p_0 (1 - p_0)$ is the mean of $T_\alpha^1$ under the null hypothesis of no association, $T_\alpha^1$ is the C-alpha test statistic without mean centering. Because $d_j = y'G_{\cdot j}$ and $n_j = J'G_{\cdot j}$, where $G_{\cdot j}$ is the j-th column of the genotype matrix $G$ and $J = (1, 1, \ldots, 1)'$, it can be easily shown that
$$T_\alpha^1 = (y - p_0 J)' G G' (y - p_0 J). \qquad \text{(Equation 8)}$$
Note that under the unweighted linear kernel, $K = GG'$ and $\hat{\mu}_0 = p_0 J$ if no covariates are present. Hence, Equation 8 is identical to Equation 3; that is, $T_\alpha^1$ is equivalent to the SKAT test statistic with unweighted linear kernel.
Although the SKAT statistic with unweighted linear kernel and the C-alpha test statistic are equivalent, SKAT and the C-alpha test use different null distributions to assess significance: the C-alpha test uses a normal approximation, whereas we use a mixture of chi-squares. The normal approximation gives a valid p value when the tested rare variants are independent and sample sizes are large, and so requires an assumption of linkage equilibrium. In the presence of LD, permutation is used by the C-alpha test for significance testing. One can easily see that the test statistic takes a quadratic form of y, which follows a mixture of chi-square distributions. SKAT approximates this distribution directly with the Davies method and hence gives accurate estimation of significance regardless of the LD structure when sample size is sufficient.
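The identity in Equation 8 can be checked numerically. The sketch below computes the uncentered C-alpha statistic from per-variant counts and, separately, the quadratic form with the unweighted linear kernel, then compares the two. The toy phenotype vector, genotype matrix, and function names are hypothetical and serve only to illustrate the algebra.

```python
def calpha_uncentered(y, G):
    """T1 = sum_j (d_j - n_j * p0)^2 computed from per-variant counts."""
    n = len(y)
    p0 = sum(y) / n
    total = 0.0
    for j in range(len(G[0])):
        d_j = sum(y[i] * G[i][j] for i in range(n))  # minor alleles seen in cases
        n_j = sum(G[i][j] for i in range(n))         # minor alleles seen overall
        total += (d_j - n_j * p0) ** 2
    return total


def linear_kernel_quadratic_form(y, G):
    """(y - p0 J)' G G' (y - p0 J) with the unweighted linear kernel K = GG'."""
    n = len(y)
    p0 = sum(y) / n
    resid = [yi - p0 for yi in y]
    # G'(y - p0 J) is a length-p vector; the quadratic form is its squared norm.
    s = [sum(G[i][j] * resid[i] for i in range(n)) for j in range(len(G[0]))]
    return sum(v * v for v in s)


# Hypothetical toy data: 6 subjects (1 = case), 3 variants, minor-allele counts.
y = [1, 1, 1, 0, 0, 0]
G = [
    [1, 0, 0],
    [1, 0, 1],
    [0, 2, 0],
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]
t_counts = calpha_uncentered(y, G)
t_kernel = linear_kernel_quadratic_form(y, G)
print(t_counts, t_kernel)  # both equal 1.25 for this toy data set
```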
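To make the mixture of chi-squares in Equation 6 concrete, the following sketch estimates an upper-tail probability by plain Monte Carlo simulation. This is a stand-in for illustration only: it is not the Davies method used in the paper, and the eigenvalues, the observed statistic, the number of draws, and the function name are all assumed values.

```python
import random


def mixture_chi2_pvalue(q, lambdas, n_draws=20000, seed=0):
    """Monte Carlo estimate of P(sum_i lambda_i * chi2_1 >= q)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        # A squared standard normal draw is a chi-square(1) draw.
        draw = sum(lam * rng.gauss(0.0, 1.0) ** 2 for lam in lambdas)
        if draw >= q:
            hits += 1
    return hits / n_draws


lambdas = [2.0, 1.0, 0.5]  # hypothetical eigenvalues of P0^(1/2) K P0^(1/2)
p_mc = mixture_chi2_pvalue(6.0, lambdas)
print(f"Monte Carlo p value: {p_mc:.3f}")
```

Simulation of this kind cannot resolve very small p values without an impractical number of draws, which is one reason an analytic approach such as inverting the characteristic function is attractive for genome-wide significance levels.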
  • 34. Supplemental Data
Supplemental Data include 15 figures and can be found with this article online at http://www.cell.com/AJHG/.
Acknowledgments
This work was supported by grants P30 ES010126 (to M.C.W.), DMS 0854970 and R01 GM079330 (to T.C.), R01 HG000376 (to M.B.), and R37 CA076404 and P01 CA134294 (to S.L. and X.L.). We thank Jonathan Cohen, Alkes Price, and Shamil Sunyaev for providing the Dallas Heart Study data and Larisa Miropolsky for help with the software development.
Received: March 16, 2011
Revised: May 27, 2011
Accepted: May 30, 2011
Published online: July 7, 2011
Web Resources
The URLs for data presented herein are as follows:
1000 Genomes Project, http://www.1000genomes.org/
Online Mendelian Inheritance in Man (OMIM), http://www.omim.org
SKAT software, http://www.hsph.harvard.edu/~xlin/software.html
References
1. Hindorff, L.A., Sethupathy, P., Junkins, H.A., Ramos, E.M., Mehta, J.P., Collins, F.S., and Manolio, T.A. (2009). Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl. Acad. Sci. USA 106, 9362–9367.
2. Margulies, M., Egholm, M., Altman, W.E., Attiya, S., Bader, J.S., Bemben, L.A., Berka, J., Braverman, M.S., Chen, Y.J., Chen, Z., et al. (2005). Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376–380.
3. Mardis, E.R. (2008). Next-generation DNA sequencing methods. Annu. Rev. Genomics Hum. Genet. 9, 387–402.
4. Ansorge, W.J. (2009). Next-generation DNA sequencing techniques. New Biotechnol. 25, 195–203.
5. Eichler, E.E., Flint, J., Gibson, G., Kong, A., Leal, S.M., Moore, J.H., and Nadeau, J.H. (2010). Missing heritability and strategies for finding the underlying causes of complex disease. Nat. Rev. Genet. 11, 446–450.
6. Ley, T.J., Mardis, E.R., Ding, L., Fulton, B., McLellan, M.D., Chen, K., Dooling, D., Dunford-Shore, B.H., McGrath, S., Hickenbotham, M., et al. (2008). DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature 456, 66–72.
7. Li, H., Ruan, J., and Durbin, R. (2008). Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18, 1851–1858.
8. Li, R.Q., Li, Y.R., Fang, X.D., Yang, H.M., Wang, J., Kristiansen, K., and Wang, J. (2009). SNP detection for massively parallel whole-genome resequencing. Genome Res. 19, 1124–1132.
9. Bansal, V., Harismendy, O., Tewhey, R., Murray, S.S., Schork, N.J., Topol, E.J., and Frazer, K.A. (2010). Accurate detection and genotyping of SNPs utilizing population sequencing data. Genome Res. 20, 537–545.
10. Carvajal-Carmona, L.G. (2010). Challenges in the identification and use of rare disease-associated predisposition variants. Curr. Opin. Genet. Dev. 20, 277–281.
11. Schork, N.J., Murray, S.S., Frazer, K.A., and Topol, E.J. (2009). Common vs. rare allele hypotheses for complex diseases. Curr. Opin. Genet. Dev. 19, 212–219.
12. Li, B., and Leal, S.M. (2008). Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am. J. Hum. Genet. 83, 311–321.
13. Madsen, B.E., and Browning, S.R. (2009). A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 5, e1000384.
14. Morgenthaler, S., and Thilly, W.G. (2007). A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST). Mutat. Res. 615, 28–56.
15. Li, B., and Leal, S.M. (2009). Discovery of rare variants via sequencing: implications for the design of complex trait association studies. PLoS Genet. 5, e1000481.
16. Price, A.L., Kryukov, G.V., de Bakker, P.I., Purcell, S.M., Staples, J., Wei, L.J., and Sunyaev, S.R. (2010). Pooled association tests for rare variants in exon-resequencing studies. Am. J. Hum. Genet. 86, 832–838.
17. Han, F., and Pan, W. (2010). A data-adaptive sum test for disease association with multiple common or rare variants. Hum. Hered. 70, 42–54.
18. Morris, A.P., and Zeggini, E. (2010). An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genet. Epidemiol. 34, 188–193.
19. Zawistowski, M., Gopalakrishnan, S., Ding, J., Li, Y., Grimm, S., and Zöllner, S. (2010). Extending rare-variant testing strategies: analysis of noncoding sequence and imputed genotypes. Am. J. Hum. Genet. 87, 604–617.
20. Asimit, J., and Zeggini, E. (2010). Rare variant association analysis methods for complex traits. Annu. Rev. Genet. 44, 293–308.
21. Neale, B.M., Rivas, M.A., Voight, B.F., Altshuler, D., Devlin, B., Orho-Melander, M., Kathiresan, S., Purcell, S.M., Roeder, K., and Daly, M.J. (2011). Testing for an unusual distribution of rare variants. PLoS Genet. 7, e1001322.
22. Price, A.L., Patterson, N.J., Plenge, R.M., Weinblatt, M.E., Shadick, N.A., and Reich, D. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909.
23. Kwee, L.C., Liu, D., Lin, X., Ghosh, D., and Epstein, M.P. (2008). A powerful and flexible multilocus association test for quantitative traits. Am. J. Hum. Genet. 82, 386–397.
24. Wu, M.C., Kraft, P., Epstein, M.P., Taylor, D.M., Chanock, S.J., Hunter, D.J., and Lin, X. (2010). Powerful SNP-set analysis for case-control genome-wide association studies. Am. J. Hum. Genet. 86, 929–942.
25. Lin, X. (1997). Variance component testing in generalised linear models with random effects. Biometrika 84, 309–326.
26. Davies, R. (1980). The distribution of a linear combination of chi-square random variables. J. R. Stat. Soc. Ser. C Appl. Stat. 29, 323–333.
27. Pan, W. (2009). Asymptotic tests of association with multiple SNPs in linkage disequilibrium. Genet. Epidemiol. 33, 497–507.
28. Cristianini, N., and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods (Cambridge: Cambridge Univ Press).
29. Liu, D., Lin, X., and Ghosh, D. (2007). Semiparametric regression of multidimensional genetic pathway data: least-squares
  • 35. kernel machines and linear mixed models. Biometrics 63, 1079–1088.
30. Liu, D., Ghosh, D., and Lin, X. (2008). Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed models. BMC Bioinformatics 9, 292.
31. Fleuret, F., and Sahbi, H. (2003). Scale-invariance of support vector machines based on the triangular kernel. In 3rd International Workshop on Statistical and Computational Theories of Vision. (ftp://ftp.inria.fr/INRIA/publication/publi-pdf/RR/RR-4601.pdf).
32. Ramensky, V., Bork, P., and Sunyaev, S. (2002). Human non-synonymous SNPs: server and survey. Nucleic Acids Res. 30, 3894–3900.
33. Kumar, P., Henikoff, S., and Ng, P.C. (2009). Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat. Protoc. 4, 1073–1081.
34. Liu, H., Tang, Y., and Zhang, H. (2009). A new chi-square approximation to the distribution of non-negative definite quadratic forms in non-central normal variables. Comput. Stat. Data Anal. 53, 853–856.
35. Lee, S., Wu, M.C., Cai, T., Li, Y., Boehnke, M., and Lin, X. (2011). Power and sample size calculations for designing rare variant sequencing association studies. In Harvard University Technical Report. (http://www.hsph.harvard.edu/~xlin).
36. Durbin, R.M., Abecasis, G.R., Altshuler, D.L., Auton, A., Brooks, L.D., Gibbs, R.A., Hurles, M.E., and McVean, G.A.; 1000 Genomes Project Consortium. (2010). A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073.
37. Schaffner, S.F., Foo, C., Gabriel, S., Reich, D., Daly, M.J., and Altshuler, D. (2005). Calibrating a coalescent simulation of human genome sequence variation. Genome Res. 15, 1576–1583.
38. Romeo, S., Yin, W., Kozlitina, J., Pennacchio, L.A., Boerwinkle, E., Hobbs, H.H., and Cohen, J.C. (2009). Rare loss-of-function mutations in ANGPTL family members contribute to plasma triglyceride levels in humans. J. Clin. Invest. 119, 70–79.
39. Duchesne, P., and Lafaye De Micheaux, P. (2010). Computing the distribution of quadratic forms: Further comparisons between the Liu-Tang-Zhang approximation and exact methods. Comput. Stat. Data Anal. 54, 858–862.
  • 37. REPORT
Expansion of Intronic GGCCTG Hexanucleotide Repeat in NOP56 Causes SCA36, a Type of Spinocerebellar Ataxia Accompanied by Motor Neuron Involvement
Hatasu Kobayashi,1,4 Koji Abe,2,4 Tohru Matsuura,2,4 Yoshio Ikeda,2 Toshiaki Hitomi,1 Yuji Akechi,2 Toshiyuki Habu,3 Wanyang Liu,1 Hiroko Okuda,1 and Akio Koizumi1,*
Autosomal-dominant spinocerebellar ataxias (SCAs) are a heterogeneous group of neurodegenerative disorders. In this study, we performed genetic analysis of a unique form of SCA (SCA36) that is accompanied by motor neuron involvement. Genome-wide linkage analysis and subsequent fine mapping for three unrelated Japanese families in a cohort of SCA cases, in whom molecular diagnosis had never been performed, mapped the disease locus to the region of a 1.8 Mb stretch (LOD score of 4.60) on 20p13 (D20S906–D20S193) harboring 37 genes with definitive open reading frames. We sequenced 33 of these and observed a large expansion of an intronic GGCCTG hexanucleotide repeat in NOP56 and an unregistered missense variant (Phe265Leu) in C20orf194, but we found no mutations in PDYN and TGM6. The expansion showed complete segregation with the SCA phenotype in family studies, whereas Phe265Leu in C20orf194 did not. Screening of the expansions in the SCA cohort cases revealed four additional occurrences, but none were revealed in the cohort of 27 Alzheimer disease cases, 154 amyotrophic lateral sclerosis cases, or 300 controls. In total, nine unrelated cases were found in 251 cohort SCA patients (3.6%). A founder haplotype was confirmed in these cases. RNA foci formation was detected in lymphoblastoid cells from affected subjects by fluorescence in situ hybridization. Double staining and gel-shift assay showed that (GGCCUG)n binds the RNA-binding protein SRSF2 but that (CUG)6 does not. In addition, transcription of MIR1292, a neighboring miRNA, was significantly decreased in lymphoblastoid cells of SCA patients.
Our findings suggest that SCA36 is caused by hexanucleotide repeat expansions through RNA gain of function.
Autosomal-dominant spinocerebellar ataxias (SCAs) are a heterogeneous group of neurodegenerative disorders characterized by loss of balance and progressive gait and limb ataxia.1–3 We recently encountered two unrelated patients with intriguing clinical symptoms from a community in the Chugoku region in western mainland Japan.4 These patients both showed complicated clinical features, with ataxia as the first symptom, followed by characteristic late-onset involvement of the motor neuron system that caused symptoms similar to those of amyotrophic lateral sclerosis (ALS [MIM 105400]).4 Some SCAs (SCA1 [MIM 164400], SCA2 [MIM 183090], SCA3 [MIM 607047], and SCA6 [MIM 183086]) are known to slightly affect motor neurons; however, their involvement is minimal and the patients usually do not develop skeletal muscle and tongue atrophies.4 Of particular interest is that RNA foci have been recently demonstrated in hereditary disorders caused by microsatellite repeat expansions or insertions in the noncoding regions of their gene.5–7 The unique clinical features in these families have seldom been described in previous reports; therefore, we undertook a genetic analysis.
A similar form of SCA was observed in five Japanese cases from a cohort of 251 patients with SCA, in whom molecular diagnosis had not been performed, who were followed by the Department of Neurology, Okayama University Hospital. These five cases originated from a city of 450,000 people in the Chugoku region. Thus, we suspected the presence of a founder mutation common to these five cases, prompting us to recruit these five families (pedigrees 1–5) (Figure 1, Table 1). This study was approved by the Ethics Committee of Kyoto University and the Okayama University institutional review board. Written informed consent was obtained from all subjects.
One index case per family was investigated in depth: IV-4 in pedigree 1, II-1 in pedigree 2, III-1 in pedigree 3, II-1 in pedigree 4, and II-1 in pedigree 5. The mean age at onset of cerebellar ataxia was 52.8 ± 4.3 years, and the disease was transmitted by an autosomal-dominant mode of inheritance. All affected individuals started their ataxic symptoms, such as gait and truncal instability, ataxic dysarthria, and uncoordinated limbs, in their late forties to fifties. MRI revealed relatively confined and mild cerebellar atrophy (Figure 2A). Unlike individuals with previously known SCAs, all affected individuals with longer disease duration showed obvious signs of motor neuron involvement (Table 1). Characteristically, all affected individuals exhibited tongue atrophy with fasciculation, although its degree of severity varied (Figure 2B). Despite severe tongue atrophy in some cases, their swallowing function was relatively preserved, and they were allowed oral intake even at a later point after onset. In addition to tongue atrophy, skeletal muscle atrophy and fasciculation in the limbs and trunk appeared in advanced cases.4 Tendon reflexes were generally mildly to severely hyperreactive in most
1 Department of Health and Environmental Sciences, Graduate School of Medicine, Kyoto University, Kyoto, Japan; 2 Department of Neurology, Graduate School of Medicine, Dentistry and Pharmaceutical Science, Okayama University, Okayama, Japan; 3 Radiation Biology Center, Kyoto University, Kyoto, Japan
4 These authors contributed equally to this work
*Correspondence: koizumi.akio.5v@kyoto-u.ac.jp
DOI 10.1016/j.ajhg.2011.05.015. ©2011 by The American Society of Human Genetics. All rights reserved.
The American Journal of Human Genetics 89, 121–130, July 15, 2011
  • 38. Figure 1. Pedigree Charts of the Five SCA Families
Haplotypes are shown for nine markers from D20S906 (1,505,576 bp) to D20S193 (3,313,494 bp), spanning 1.8 Mb on chromosome 20p13. NOP56 is located at 2,633,254–2,639,039 bp (NCBI build 37.1). Filled and unfilled symbols indicate affected and unaffected individuals, respectively. Squares and circles represent males and females, respectively. A slash indicates a deceased individual. The putative founder haplotypes among patients, constructed by GENEHUNTER,8 are shown in boxes. Arrows indicate the index case. The pedigrees were slightly modified for privacy protection.
  • 39. affected individuals, none of whom displayed severe lower limb spasticity or extensor plantar response. Electrophysiological studies were performed in an affected individual. Nerve conduction studies revealed normal findings in all of the cases that were examined; however, an electromyogram showed neurogenic changes only in cases with skeletal muscle atrophy, indicating that lower motor neuropathy existed in this particular disease. Progression of motor neuron involvement in this SCA was typically limited to the tongue and main proximal skeletal muscles in both upper and lower extremities, which is clearly different from typical ALS, which usually involves most skeletal muscles over the course of a few years, leading to fatal results within several years.
We conducted genome-wide linkage analysis for nine affected subjects and eight unaffected subjects in three informative families (pedigrees 1–3; Figure 1). For genotyping, we used an ABI Prism Linkage Mapping Set (Version 2; Applied Biosystems, Foster City, CA, USA) with 382 markers, 10 cM apart, for 22 autosomes. Fine-mapping markers (approximately 1 cM apart) were designed according to information from the uniSTS reference physical map in the NCBI database. A parametric linkage analysis was carried out in GENEHUNTER8 with the assumption of an autosomal-dominant model. The disease allele frequency was set at 0.000001, and a phenocopy frequency of 0.000001 was assumed. Marker alleles were assigned equal population frequencies. We performed multipoint analyses for autosomes and obtained LOD scores. We considered LOD scores above 3.0 to be significant.8
Genome-wide linkage analysis revealed a single locus on chromosome 20p13 with a LOD score of 3.20. Fine mapping increased the LOD score to 4.60 (Figure 3). Haplotype analysis revealed two recombination events in pedigree 3, delimiting a 1.8 Mb region (D20S906–D20S193) (Figure 1).
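As a rough guide to what a LOD score of 3.0 means, the sketch below computes the textbook two-point LOD score for phase-known, fully informative meioses. This is a deliberately simplified illustration and not the multipoint parametric analysis run in GENEHUNTER above; the function name and the meiosis counts are hypothetical.

```python
import math


def two_point_lod(n_meioses, n_recombinants, theta):
    """LOD(theta) = log10[theta^k (1 - theta)^(n - k) / 0.5^n], phase-known meioses."""
    n, k = n_meioses, n_recombinants
    if not 0 <= k <= n:
        raise ValueError("n_recombinants must lie between 0 and n_meioses")
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    if theta == 0.0:
        # A single observed recombinant rules out complete linkage.
        return n * math.log10(2.0) if k == 0 else float("-inf")
    return (k * math.log10(theta)
            + (n - k) * math.log10(1.0 - theta)
            + n * math.log10(2.0))


# Hypothetical example: 10 informative meioses, no recombinants observed.
for theta in (0.0, 0.05, 0.1, 0.5):
    print(f"theta={theta:<4} LOD={two_point_lod(10, 0, theta):.2f}")
```

Under these simplifying assumptions, ten non-recombinant meioses give a LOD just above 3 at a recombination fraction of zero, that is, odds of roughly 1000:1 in favor of linkage.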
We further tested whether the five cases shared the haplotype. As shown in Figure 1, pedigrees 4 and 5 were confirmed to have the same haplotype as pedigrees 1, 2, and 3, indicating that the 1.8 Mb region is very likely to be derived from a common ancestor.
The 1.8 Mb region harbors 44 genes (NCBI, build 37.1). We eliminated two pseudogenes and five genes (LOC441938, LOC100289473, LOC100288797, LOC100289507, and LOC100289538) from the candidates. Evidence view showed that the first, fourth, and fifth genes were not found in the contig in this region, whereas the second and third
Table 1. Clinical Characteristics of Affected Subjects
Pedigree No. | Patient ID | Gender | Onset Age (yr) | Current Age (yr) | Ataxia | Motor Neuron Involvement: Skeletal Muscle Atrophy | Motor Neuron Involvement: Skeletal Muscle Fasciculation | Motor Neuron Involvement: Tongue Atrophy/Fasciculation | Genotype of GGCCTG Repeats
1 | III-5 | M | 50 | 70 | +++ | N.D. | N.D. | N.D. | g.263397_263402[6]+(1800)
  | III-6 | F | 52 | 68 | ++ | + | + | + | g.263397_263402[6]+(2300)
  | IV-2 | F | 57 | 63 | + | - | - | + | g.263397_263402[6]+(2300)
  | IV-4 | M | 50 | 59 | + | - | - | + | g.263397_263402[6]+(2300)
2 | II-1 | M | 55 | 77 | +++ | ++ | + | + | g.263397_263402[6]+(2200)
  | II-2 | F | 53 | 70 | ++ | N.D. | N.D. | N.D. | g.263397_263402[6]+(2200)
3 | II-3 | M | 58 | 77 | ++ | ++ | + | + | g.263397_263402[3]+(2300)
  | III-1 | M | 56 | 62 | + | - | - | ± | g.263397_263402[8]+(2200)
  | III-2 | M | 51 | 61 | ++ | + | + | + | g.263397_263402[6]+(1800)
4 | I-1 | M | 57 | died in 2001 at 83 | ++ | N.D. | N.D. | N.D. | g.263397_263402[5]+(1800)
  | II-1 | F | 48 | 61 | ++ | + | ± | ++ | g.263397_263402[6]+(2000)
5 | I-1 | M | 57 | 86 | ++ | +++ | + | + | g.263397_263402[5]+(2000)
  | II-1 | F | 47 | 58 | ++ | + | + | + | g.263397_263402[8]+(1700)
  | SCA#1 | M | 52 | 69 | +++ | +++ | +++ | +++ | g.263397_263402[5]+(2200)
  | SCA#2 | F | 43 | 53 | +++ | - | - | + | g.263397_263402[6]+(1800)
  | SCA#3 | M | 55 | 60 | ++ | - | - | ++ | g.263397_263402[8]+(1700)
  | SCA#4 | M | 57 | 81 | +++ | + | + | +++ | g.263397_263402[5]+(2200)
Mean |  |  | 52.8 |  |  |  |  |  |
SD |  |  | 4.3 |  |  |  |  |  |
N.D., not determined.
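The summary row of Table 1 can be reproduced directly from the onset ages listed in the table. In the sketch below, the use of the sample standard deviation (n − 1 denominator) is an assumption; the text does not say which estimator the authors used, although this choice matches the reported value after rounding.

```python
import statistics

# Onset ages (yr) of the 17 affected subjects, in the row order of Table 1.
onset_ages = [50, 52, 57, 50, 55, 53, 58, 56, 51, 57, 48, 57, 47, 52, 43, 55, 57]

mean_onset = statistics.mean(onset_ages)
sd_onset = statistics.stdev(onset_ages)  # sample SD; the estimator is assumed

print(f"{mean_onset:.1f} ± {sd_onset:.1f} years (n = {len(onset_ages)})")
# prints "52.8 ± 4.3 years (n = 17)"
```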
  • 40. genes are not assigned to orthologous loci in the mouse genome. Sequence similarities among paralog genes defied direct sequencing of four genes: SIRPD [NM_178460.2], SIRPB1 [NM 603889], SIRPG [NM 605466], and SIRPA [NM 602461]. Thus, we sequenced 33 of 37 genes (PDYN [MIM 131340], STK35 [MIM 609370], TGM3 [MIM 600238], TGM6 [NM_198994.2], SNRPB [MIM 182282], SNORD119 [NR_003684.1], ZNF343 [NM_024325.4], TMC2 [MIM 606707], NOP56 [NM_006392.2], MIR1292 [NR_031699.1], SNORD110 [NR_003078.1], SNORA51 [NR_002981.1], SNORD86 [NR_004399.1], SNORD56 [NR_002739.1], SNORD57 [NR_002738.1], IDH3B [MIM 604526], EBF4 [MIM 609935], CPXM1 [NM_019609.4], C20orf141 [NM_080739.2], FAM113A [NM_022760.3], VPS16 [MIM 608550], PTPRA [MIM 176884], GNRH2 [MIM 602352], MRPS26 [MIM 611988], OXT [MIM 167050], AVP [MIM 192340], UBOX5 [NM_014948.2], FASTKD5 [NM_021826.4], ProSAPiP1 [MIM 610484], DDRGK1 [NM_023935.1], ITPA [MIM 147520], SLC4A11 [MIM 610206], and C20orf194 [NM_001009984.1]) (Figure 2C). All noncoding and coding exons, as well as the 100 bp up- and downstream of the splice junctions of these genes, were sequenced in two index cases (IV-4 in pedigree 1 and III-1 in pedigree 3) and in three additional cases (II-1 in pedigree 2, II-1 in pedigree 4, and II-1 in pedigree 5) with the use of specific primers (Table S1 available online). Eight unregistered variants were found among the two index cases. Among these, there was a coding variant, c.795C>G
Figure 2. Motor Neuron Involvement and (GGCCTG)n Expansion in the First Intron of NOP56
(A) MRI of an affected subject (SCA#3) showed mild cerebellar atrophy (arrow) but no other cerebral or brainstem pathology.
(B) Tongue atrophy (arrow) was observed in SCA#1.
(C) Physical map of the 1.8-Mb linkage region from D20S906 (1,505,576 bp) to D20S193 (3,313,494 bp), with 33 candidate genes shown, as well as the direction of transcription (arrows).
(D) The upper portion of the panel shows the scheme of primer binding for repeat-primed PCR analysis. In the lower portion, sequence traces of the PCR reactions are shown. Red lines indicate the size markers. The vertical axis indicates arbitrary intensity levels. A typical saw-tooth pattern is observed in an affected pedigree.
(E) Southern blotting of LCLs from SCA cases and three controls. Genomic DNA (10 μg) was extracted from Epstein-Barr virus (EBV)-immortalized LCLs derived from six affected subjects (Ped2_II-1, Ped3_III-1, Ped3_III-2, Ped5_I-1, Ped5_II-1, and SCA#1) and digested with 2 U of AvrII overnight (New England Biolabs, Beverly, MA, USA). A probe covering exon 4 of NOP56 (452 bp) was subjected to PCR amplification from human genomic DNA with the use of primers (Table S3) and labeled with 32P-dCTP.
  • 41. (p.Phe265Leu), in C20orf194, whereas the other seven included one synonymous variant, c.1695T>A (p.Leu565Leu), in ZNF343 and six non-splice-site intronic variants (Table S2). We tested segregation by sequencing exon 11 of C20orf194 in IV-2 and III-5 in pedigree 1. Neither IV-2 nor III-5 had this variant. We thus eliminated C20orf194 as a candidate. Missense mutations in PDYN and TGM6, which have been recently reported as causes of SCA, mapped to 20p12.3-p13,9,10 but none were detected in the five index cases studied here (Table S2).
Possible expansions of repetitive sequences in these 33 genes were investigated when intragenic repeats were indicated in the database (UCSC Genome Bioinformatics). Expansions of the hexanucleotide repeat GGCCTG (rs68063608) were found in intron 1 of NOP56 (Figure 2D) in all five index cases through the use of a repeat-primed PCR method.11–13 An outline of the repeat-primed PCR experiment is described in Figure 2D. In brief, the fluorescent-dye-conjugated forward primer corresponded to the region upstream of the repeat of interest. The first reverse primer consisted of four units of the repeat (GGCCTG) and a 5′ tail used as an anchor. The second reverse primer was an "anchor" primer. These primers are described in Table S3. Complete segregation of the expanded hexanucleotide was confirmed in all pedigrees, and the maximum repeat size in nine unaffected members was eight (data not shown).
In addition to the SCA cases in five pedigrees, four unrelated cases (SCA#1–SCA#4) were found to have a (GGCCTG)n allele through screening of the cohort SCA patients (Table 1). Neurological examination was reevaluated in these four cases, revealing both ataxia and motor neuron dysfunction with tongue atrophy and fasciculation (Table 1). In total, nine unrelated cases were found in the 251 cohort patients with SCA (3.6%).
For confirmation of the repeat expansions, Southern blot analysis was conducted in six affected subjects (Ped2_II-1, Ped3_III-1, Ped3_III-2, Ped5_I-1, Ped5_II-1, and SCA#1). The data showed >10 kb of repeat expansions in the lymphoblastoid cell lines (LCLs) obtained from the SCA patients (Figure 2E). Furthermore, the numbers of GGCCTG repeat expansion were estimated by Southern blotting in 11 other cases. The expansion analysis revealed approximately 1500 to 2500 repeats in 17 cases (Table 1). There was no negative association between age at onset and the number of GGCCTG repeats (n = 17, r = 0.42, p = 0.09; Figure S1) and no obvious anticipation in the current pedigrees.
To investigate the disease specificity and disease spectrum of the hexanucleotide repeat expansions, we tested the repeat expansions in an Alzheimer disease (MIM 104300) cohort and an ALS cohort followed by the Department of Neurology, Okayama University Hospital. We also recruited Japanese controls, who were confirmed to be free from brain lesions through MRI and magnetic resonance angiography, which was performed as described previously.14 Screening of the 27 Alzheimer disease cases and 154 ALS cases failed to detect additional cases with repeat expansions. The GGCCTG repeat sizes ranged from 3 to 8 in 300 Japanese controls (5.9 ± 0.8 repeats), suggesting that the >10 kb repeat expansions were mutations.
Expression of Nop56, an essential component of the splicing machinery,15 was examined by RT-PCR with the use of primers for wild-type mouse Nop56 cDNA (Table S3). Expression of Nop56 mRNA was detected in various tissues, including CNS tissue, and a very weak signal was detected in spinal cord tissue (Figure 4A).
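The correlation between age at onset and repeat number reported above (n = 17, r = 0.42) can be recomputed from the values listed in Table 1. The sketch below assumes a Pearson correlation on the approximate repeat numbers given in parentheses in the genotype column; the text does not name the coefficient used, and the helper function is written out here for illustration.

```python
import math

# Values in the row order of Table 1 (17 affected subjects).
onset_ages = [50, 52, 57, 50, 55, 53, 58, 56, 51, 57, 48, 57, 47, 52, 43, 55, 57]
repeats = [1800, 2300, 2300, 2300, 2200, 2200, 2300, 2200, 1800,
           1800, 2000, 2000, 1700, 2200, 1800, 1700, 2200]


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(onset_ages, repeats)
print(f"n = {len(onset_ages)}, r = {r:.2f}")  # prints "n = 17, r = 0.42"
```

The positive sign means that, in this small sample, larger expansions were not associated with earlier onset.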
Immunohistochemistry using an anti-mouse Nop56 antibody (Santa Cruz Biotechnology, Santa Cruz, CA, USA) detected the Nop56 protein in Purkinje cells of the cerebellum as well as motor neurons of the hypoglossal nucleus and the spinal cord anterior horn (Figure 4B), suggesting that these cells may be responsible for tongue and muscle atrophy in the trunk and limbs, respectively. Immunoblotting also confirmed the presence of Nop56 in neural tissues (Figure 4C), where Nop56 is localized in both the nucleus and cytoplasm.
Alterations of NOP56 RNA expression and protein levels in LCLs from patients were examined by real-time RT-PCR and immunoblotting. The primers for quantitative PCR of human NOP56 cDNA are described in Table S3. Immunoblotting was performed with the use of an anti-human NOP56 antibody (Santa Cruz Biotechnology, Santa Cruz, CA, USA). We found no decrease in NOP56 RNA expression or protein levels in LCLs from these patients (Figure 5A). To investigate abnormal splicing variants of NOP56, we performed RT-PCR using the primers covering the region from the 5′ UTR to exon 4 around the repeat expansion (Table S3); however, no splicing variant was observed in LCLs from the cases (Figure 5B). We also performed immunocytochemistry for NOP56 and coilin, a marker of the Cajal body, where NOP56 functions.16 NOP56 and coilin distributions were not altered in LCLs of the SCA patients (Figure 5C), suggesting that qualitative or quantitative changes in the Cajal body did not occur. These results indicated that haploinsufficiency could not explain the observed phenotype.
Figure 3. Multipoint Linkage Analysis with Ten Markers on Chromosome 20p13
  • 42. We performed fluorescence in situ hybridization to detect RNA foci containing the repeat transcripts in LCLs from patients, as previously described.17,18 Lymphoblastoid cells from two SCA patients (Ped2_II-2 and Ped5_I-1) and two control subjects were analyzed. An average of 2.1 ± 0.5 RNA foci per cell were detected in 57.0% of LCLs (n = 100) from the SCA subjects through the use of a nuclear probe targeting the GGCCUG repeat, whereas no RNA foci were observed in control LCLs (n = 100) (Figure 6A). In contrast, a probe for the CGCCUG repeat, another repeat sequence in intron 1 of NOP56, detected no RNA foci in either SCA or control LCLs (n = 100 each) (Figure 6A), indicating that the GGCCUG repeat was specifically expanded in the SCA subjects. The specificity of the RNA foci was confirmed by sensitivity to RNase A treatment and resistance to DNase treatment (Figure 6A).
Several reports have suggested that RNA foci play a role in the etiology of SCA through sequestration of specific RNA-binding proteins.5–7 In silico searches (ESEfinder 3.0) predicted an RNA-binding protein, SRSF2 (MIM 600813), as a strong candidate for binding of the GGCCUG repeat. Double staining with the probe for the GGCCUG repeat and an anti-SRSF2 antibody (Sigma-Aldrich, Tokyo, Japan) was performed. The results showed colocalization of RNA foci with SRSF2, whereas NOP56 and coilin were not colocalized with the RNA foci (Figure 6B), suggesting a specific interaction of endogenous SRSF2 with the RNA foci in vivo.
To further confirm the interaction, gel-shift assays were carried out for investigation of the binding activity of SRSF2 with (GGCCUG)n.
Synthetic RNA oligonucleotides (200 pmol), (GGCCUG)4 or (CUG)6, which is the latter part of the hexanucleotide, as well as the repeat RNA involved in myotonic dystrophy type 1 (DM1 [MIM 160900])18 and SCA8 (MIM 608768),5 were denatured and immediately mixed with different amounts (0, 0.2, or 0.6 μg) of recombinant full-length human SRSF2 (Abcam, Cambridge, UK). The mixtures were incubated, and the protein-bound probes were separated from the free forms by electrophoresis on 5%–20% native polyacrylamide gels. The separated RNA probes were detected with SYBR Gold staining (Invitrogen, Carlsbad, CA, USA). We found a strong association of (GGCCUG)4 with SRSF2 in vitro in comparison to (CUG)6 (Figure 6C). Collectively, we concluded that (GGCCUG)n interacts with SRSF2.
Figure 4. Nop56 in the Mouse Nervous System (A) RT-PCR analysis of Nop56 (422 bp) in various mouse tissues. cDNA (25 ng) collected from various organs of C57BL/6 mice was purchased from GenoStaf (Tokyo, Japan). (B) Immunohistochemical analysis of Nop56 in the cerebellum, hypoglossal nucleus, and spinal cord anterior horn in wild-type male Slc:ICR mice at 8 wks of age (Japan SLC, Shizuoka, Japan). The arrows indicate anti-Nop56 antibody staining. The negative control was the cerebellar sample without the Nop56 antibody treatment. Scale bar represents 100 μm. (C) Immunoblotting of Nop56 (66 kDa) in the cerebellum and cerebrum. Protein sample (10 μg) was subjected to immunoblotting. LaminB1, a nuclear protein, and beta-tubulin were used as loading controls.
It is notable that MIR1292 is located just 19 bp 3′ of the GGCCTG repeat (Figure 2D). MiRNAs such as MIR1292 are small noncoding RNAs that regulate gene expression by inhibiting translation of specific target mRNAs.19,20 MiRNAs are believed to play important roles in key molecular
  • 43. pathways by fine-tuning gene expression.19,20 Recent studies have revealed that miRNAs influence neuronal survival and are also associated with neurodegenerative diseases.21,22 In silico searches (Target Scan Human 5.1) predicted glutamate receptors (GRIN2B [MIM 138252] and GRIK3 [MIM 138243]) to be potential target genes. Real-time RT-PCR using TaqMan probes for miRNA (Invitrogen, Carlsbad, CA, USA) revealed that the levels of both mature and precursor MIR1292 were significantly decreased in SCA LCLs (Figure 6D), indicating that the GGCCTG repeat expansion decreased the transcription of MIR1292. A decrease in MIR1292 expression may upregulate glutamate receptors in particular cell types; e.g., GRIK3 in stellate cells in the cerebellum,23 leading to ataxia because of perturbation of signal transduction to the Purkinje cells. In addition, it has been suggested, on the basis of ALS mouse models,24,25 that excitotoxicity mediated by a type of glutamate receptor, the NMDA receptor including GRIN2B, is involved in loss of spinal neurons. A very slowly progressing and mild form of the motor neuron disease, such as that described here, which is limited to mostly fasciculation of the tongue, limbs and trunk, may also be compatible with such a functional dysregulation rather than degeneration. In the present study, we have conducted genetic analysis to find a genetic cause for the unique SCA with motor neuron disease. With extensive sequencing of the 1.8 Mb linked region, we found large hexanucleotide repeat expansions in NOP56, which were completely segregated with SCA in five pedigrees and were found in four unrelated cases with a similar phenotype. The expansion was not found in 300 controls or in other neurodegenerative diseases. We further proved that repeat expansions of NOP56 induce RNA foci and sequester SRSF2.
We thus conclude that hexanucleotide repeat expansions likely cause SCA through a toxic RNA gain-of-function mechanism, and we name this unique SCA SCA36. Haplotype analysis indicates that hexanucleotide expansions are derived from a common ancestor. The prevalence of SCA36 was estimated at 3.6% in the SCA cohort in Chugoku district, suggesting that prevalence of SCA36 may be geographically limited to the western part of Japan and is rare even in Japanese SCAs. Expansion of tandem nucleotide repeats in different regions of respective genes (most often the triplets CAG and CTG) has been shown to cause a number of inherited diseases over the past decades. An expansion in the coding region of a gene causes a gain of toxic function and/or reduces the normal function of the corresponding protein at the protein level. RNA-mediated noncoding repeat expansions have also been identified as causing eight other neuromuscular disorders: DM1, DM2 (MIM 602668), fragile X tremor/ataxia syndrome (FXTAS [MIM 300623]), Huntington disease-like 2 (HDL2 [MIM 606438]), SCA8, SCA10 (MIM 603516), SCA12 (MIM 604326), and SCA31 (MIM 117210).26 The repeat numbers in affected alleles of SCA36 are among the largest seen in this group of diseases (i.e., there are thousands of repeats). Moreover, SCA36 is not merely a nontriplet repeat expansion disorder similar to SCA10, DM2, and SCA31, but is now proven to be a human disease caused by a large hexanucleotide repeat expansion. In addition, no or only weak anticipation has been reported for noncoding repeat expansion in SCA, whereas clear anticipation has been reported for most polyglutamine expansions in SCA.2
Figure 5. Analysis of NOP56 in LCLs from SCA Patients (A) mRNA expression (upper panel) and protein levels (lower panel) in LCLs from cases (n = 6) and controls (n = 3) were measured by RT-PCR and immunoblotting, respectively. cDNA (10 ng) was transcribed from total RNA isolated from LCLs and used for RT-PCR. Immunoblotting was performed with the use of a protein sample (40 μg) extracted from LCLs. The data indicate the mean ± SD relative to the levels of PP1A and GAPDH, respectively. There was no significant difference between LCLs from controls and cases. (B) Analysis for splicing variants of NOP56 cDNA. RT-PCR with 10 ng of cDNA and primers corresponding to the region from the 5′ UTR to exon 4 around the repeat expansion was performed. The PCR product has an expected size of 230 bp. (C) Immunocytochemistry for NOP56 and coilin. Green signals represent NOP56 or coilin. Shown are representative samples from 100 observations of controls or cases.
As such, absence of anticipation in SCA36 is in accord with previous studies
  • 44. on SCAs with noncoding repeat expansions. The common hallmark in these noncoding repeat expansion disorders is nuclear accumulation of the transcribed repeats together with their respective RNA-binding proteins, which is considered to primarily trigger and develop the disease at the RNA level. However, multiple different mechanisms are likely to be involved in each disorder. There are at least two possible explanations for the motor neuron involvement of SCA36: gene- and tissue-specific splicing specificity of SRSF2 and involvement of miRNA. In SCA36, there is the possibility that the adverse effect of the expansion mutation is mediated by downregulation of miRNA expression. The biochemical implication of miRNA involvement cannot be evaluated in this study, because availability of tissue samples from affected cases was limited to LCLs. Given definitive downregulation of miRNA 1292 in LCLs, we should await further study to substantiate its involvement in affected tissues. Elucidating which mechanism(s) plays a critical role in the pathogenesis will be required for determining whether cerebellar degeneration and motor neuron disease occur through a similar scenario.
    Figure 6. RNA Foci Formation and Decreased Transcription of MIR1292 (A) Cells were fixed on coverslips and then hybridized with solutions containing either a Cy3-labeled C(CAGGCC)2CAG or G(CAGGCG)2CAG oligonucleotide probe (1 ng/μl). For controls, the cells were treated with 1000 U/ml DNase or 100 μg/ml RNase for 1 hr at 37°C prior to hybridization, as indicated. After a wash step, coverslips were placed on the slides in the presence of ProLong Gold with DAPI mounting media (Molecular Probes, Tokyo, Japan) and photographed with a fluorescence microscope. The upper panels indicate LCLs from an SCA case and a control hybridized with C(CAGGCC)2CAG (left) or G(CAGGCG)2CAG (right). Red and blue signals represent RNA foci and the nucleus (DAPI staining), respectively. Similar RNA foci formation was confirmed in LCLs from another index case. The lower panels show RNA foci in SCA LCLs treated with DNase or RNase. (B) Double staining was performed with the probe for (GGCCUG)n (red) and anti-SRSF2, NOP56, or coilin antibody (green). (C) Gel-shift assays revealed specific binding of SRSF2 to (GGCCUG)4 but little to (CUG)6. (D) RNA samples (10 ng) were extracted from LCLs of controls (n = 3) and cases (n = 6). MiRNAs were measured with the use of a TaqMan probe for precursor (Pri-) and mature MIR1292. The data indicate the mean ± SD, relative to the levels of PP1A or RNU6. *: p < 0.05.
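The two probes named in the Figure 6A caption can be checked against their intended targets in a few lines. The sketch below is not part of the original study: it assumes DNA notation (T in place of U), and the helper names `reverse_complement` and `is_antisense_to_repeat` are hypothetical. It tests whether each probe is antisense to a run of the stated hexanucleotide repeat.

```python
# Sketch only: checks probe/target complementarity for the two FISH probes
# quoted in the Figure 6A caption. Helper names are illustrative assumptions.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def is_antisense_to_repeat(probe: str, unit: str, copies: int = 6) -> bool:
    """True if the probe would base-pair with a tandem run of `unit`."""
    target = unit * copies
    return reverse_complement(probe) in target


# Probe sequences as written in the caption: C(CAGGCC)2CAG and G(CAGGCG)2CAG.
GGCCUG_PROBE = "C" + "CAGGCC" * 2 + "CAG"
CGCCUG_PROBE = "G" + "CAGGCG" * 2 + "CAG"

if __name__ == "__main__":
    for name, probe in [("GGCCUG probe", GGCCUG_PROBE), ("CGCCUG probe", CGCCUG_PROBE)]:
        for unit in ("GGCCTG", "CGCCTG"):
            print(name, "vs", unit, "repeat:", is_antisense_to_repeat(probe, unit))
```

Under these assumptions each probe pairs only with its own repeat, which is consistent with the use of the second probe as a specificity control.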
  • 45. In conclusion, expansion of the intronic GGCCTG hexanucleotide repeat in NOP56 causes a unique form of SCA, SCA36, which shows not only ataxia but also motor neuron dysfunction. This characteristic disease phenotype can be explained by the combination of RNA gain of function and MIR1292 suppression. Additional studies are required to investigate the roles of each mechanistic component in the pathogenesis of SCA36.
    Supplemental Data
    Supplemental Data include one figure and three tables and can be found with this article online at http://www.cell.com/AJHG/.
    Acknowledgments
    This work was supported mainly by grants to A.K. and partially by grants to T.M., Y.I., H.K., and K.A. We thank Norio Matsuura, Kokoro Iwasawa, and Kouji H. Harada (Kyoto University Graduate School of Medicine).
    Received: February 23, 2011 Revised: May 8, 2011 Accepted: May 18, 2011 Published online: June 16, 2011
    Web Resources
    The URLs for data presented herein are as follows:
    ESEfinder 3.0, http://rulai.cshl.edu/cgi-bin/tools/ESE3/esefinder.cgi?process=home
    NCBI, http://www.ncbi.nlm.nih.gov/
    Target Scan Human 5.1, http://www.targetscan.org/
    UCSC Genome Bioinformatics, http://genome.ucsc.edu
    References
    1. Harding, A.E. (1982). The clinical features and classification of the late onset autosomal dominant cerebellar ataxias. A study of 11 families, including descendants of ‘the Drew family of Walworth’. Brain 105, 1–28.
    2. Matilla-Dueñas, A., Sánchez, I., Corral-Juan, M., Dávalos, A., Alvarez, R., and Latorre, P. (2010). Cellular and molecular pathways triggering neurodegeneration in the spinocerebellar ataxias. Cerebellum 9, 148–166.
    3. Schöls, L., Bauer, P., Schmidt, T., Schulte, T., and Riess, O. (2004). Autosomal dominant cerebellar ataxias: clinical features, genetics, and pathogenesis. Lancet Neurol. 3, 291–304.
    4. Ohta, Y., Hayashi, T., Nagai, M., Okamoto, M., Nagotani, S., Nagano, I., Ohmori, N., Takehisa, Y., Murakami, T., Shoji, M., et al. (2007).
Two cases of spinocerebellar ataxia accompanied by involvement of the skeletal motor neuron system and bulbar palsy. Intern. Med. 46, 751–755.
5. Daughters, R.S., Tuttle, D.L., Gao, W., Ikeda, Y., Moseley, M.L., Ebner, T.J., Swanson, M.S., and Ranum, L.P. (2009). RNA gain-of-function in spinocerebellar ataxia type 8. PLoS Genet. 5, e1000600.
6. Sato, N., Amino, T., Kobayashi, K., Asakawa, S., Ishiguro, T., Tsunemi, T., Takahashi, M., Matsuura, T., Flanigan, K.M., Iwasaki, S., et al. (2009). Spinocerebellar ataxia type 31 is associated with “inserted” penta-nucleotide repeats containing (TGGAA)n. Am. J. Hum. Genet. 85, 544–557.
7. White, M.C., Gao, R., Xu, W., Mandal, S.M., Lim, J.G., Hazra, T.K., Wakamiya, M., Edwards, S.F., Raskin, S., Teive, H.A., et al. (2010). Inactivation of hnRNP K by expanded intronic AUUCU repeat induces apoptosis via translocation of PKCdelta to mitochondria in spinocerebellar ataxia 10. PLoS Genet. 6, e1000984.
8. Kruglyak, L., Daly, M.J., Reeve-Daly, M.P., and Lander, E.S. (1996). Parametric and nonparametric linkage analysis: a unified multipoint approach. Am. J. Hum. Genet. 58, 1347–1363.
9. Bakalkin, G., Watanabe, H., Jezierska, J., Depoorter, C., Verschuuren-Bemelmans, C., Bazov, I., Artemenko, K.A., Yakovleva, T., Dooijes, D., Van de Warrenburg, B.P., et al. (2010). Prodynorphin mutations cause the neurodegenerative disorder spinocerebellar ataxia type 23. Am. J. Hum. Genet. 87, 593–603.
10. Wang, J.L., Yang, X., Xia, K., Hu, Z.M., Weng, L., Jin, X., Jiang, H., Zhang, P., Shen, L., Guo, J.F., et al. (2010). TGM6 identified as a novel causative gene of spinocerebellar ataxias using exome sequencing. Brain 133, 3510–3518.
11. Cagnoli, C., Michielotto, C., Matsuura, T., Ashizawa, T., Margolis, R.L., Holmes, S.E., Gellera, C., Migone, N., and Brusco, A. (2004). Detection of large pathogenic expansions in FRDA1, SCA10, and SCA12 genes using a simple fluorescent repeat-primed PCR assay. J. Mol. Diagn. 6, 96–100.
12.
Matsuura, T., and Ashizawa, T. (2002). Polymerase chain reaction amplification of expanded ATTCT repeat in spinocerebellar ataxia type 10. Ann. Neurol. 51, 271–272.
13. Warner, J.P., Barron, L.H., Goudie, D., Kelly, K., Dow, D., Fitzpatrick, D.R., and Brock, D.J. (1996). A general method for the detection of large CAG repeat expansions by fluorescent PCR. J. Med. Genet. 33, 1022–1026.
14. Hashikata, H., Liu, W., Inoue, K., Mineharu, Y., Yamada, S., Nanayakkara, S., Matsuura, N., Hitomi, T., Takagi, Y., Hashimoto, N., et al. (2010). Confirmation of an association of single-nucleotide polymorphism rs1333040 on 9p21 with familial and sporadic intracranial aneurysms in Japanese patients. Stroke 41, 1138–1144.
15. Wahl, M.C., Will, C.L., and Lührmann, R. (2009). The spliceosome: design principles of a dynamic RNP machine. Cell 136, 701–718.
16. Lechertier, T., Grob, A., Hernandez-Verdun, D., and Roussel, P. (2009). Fibrillarin and Nop56 interact before being co-assembled in box C/D snoRNPs. Exp. Cell Res. 315, 928–942.
17. Liquori, C.L., Ricker, K., Moseley, M.L., Jacobsen, J.F., Kress, W., Naylor, S.L., Day, J.W., and Ranum, L.P. (2001). Myotonic dystrophy type 2 caused by a CCTG expansion in intron 1 of ZNF9. Science 293, 864–867.
18. Taneja, K.L., McCurrach, M., Schalling, M., Housman, D., and Singer, R.H. (1995). Foci of trinucleotide repeat transcripts in nuclei of myotonic dystrophy cells and tissues. J. Cell Biol. 128, 995–1002.
19. Winter, J., Jung, S., Keller, S., Gregory, R.I., and Diederichs, S. (2009). Many roads to maturity: microRNA biogenesis pathways and their regulation. Nat. Cell Biol. 11, 228–234.
20. Zhao, Y., and Srivastava, D. (2007). A developmental view of microRNA function. Trends Biochem. Sci. 32, 189–197.
21. Eacker, S.M., Dawson, T.M., and Dawson, V.L. (2009). Understanding microRNAs in neurodegeneration. Nat. Rev. Neurosci. 10, 837–841.
  • 46. 22. Hébert, S.S., and De Strooper, B. (2009). Alterations of the microRNA network cause neurodegenerative disease. Trends Neurosci. 32, 199–206.
    23. Tsuzuki, K., and Ozawa, S. (2005). Glutamate Receptors. Encyclopedia of life sciences. John Wiley and Sons, Ltd., http://onlinelibrary.com/doi/10.1038/npg.els.0005056.
    24. Nutini, M., Frazzini, V., Marini, C., Spalloni, A., Sensi, S.L., and Longone, P. (2011). Zinc pre-treatment enhances NMDAR-mediated excitotoxicity in cultured cortical neurons from SOD1(G93A) mouse, a model of amyotrophic lateral sclerosis. Neuropharmacology 60, 1200–1208.
    25. Sanelli, T., Ge, W., Leystra-Lantz, C., and Strong, M.J. (2007). Calcium mediated excitotoxicity in neurofilament aggregate-bearing neurons in vitro is NMDA receptor dependant. J. Neurol. Sci. 256, 39–51.
    26. Todd, P.K., and Paulson, H.L. (2010). RNA-mediated neurodegeneration in repeat expansion disorders. Ann. Neurol. 67, 291–300.
  • 48. REPORT
    A Mutation in a Skin-Specific Isoform of SMARCAD1 Causes Autosomal-Dominant Adermatoglyphia
    Janna Nousbeck,1 Bettina Burger,2 Dana Fuchs-Telem,1,4 Mor Pavlovsky,1 Shlomit Fenig,1 Ofer Sarig,1 Peter Itin,2,3 and Eli Sprecher1,4,*
    Monogenic disorders offer unique opportunities for researchers to shed light upon fundamental physiological processes in humans. We investigated a large family affected with autosomal-dominant adermatoglyphia (absence of fingerprints) also known as the “immigration delay disease.” Using linkage and haplotype analyses, we mapped the disease phenotype to 4q22. One of the genes located in this interval is SMARCAD1, a member of the SNF subfamily of the helicase protein superfamily. We demonstrated the existence of a short isoform of SMARCAD1 exclusively expressed in the skin. Sequencing of all SMARCAD1 coding and noncoding exons revealed a heterozygous transversion predicted to disrupt a conserved donor splice site adjacent to the 3′ end of a noncoding exon uniquely present in the skin-specific short isoform of the gene. This mutation segregated with the disease phenotype throughout the entire family. Using a minigene system, we found that this mutation causes aberrant splicing, resulting in decreased stability of the short RNA isoform as predicted by computational analysis and shown by RT-PCR. Taken together, the present findings implicate a skin-specific isoform of SMARCAD1 in the regulation of dermatoglyph development.
    Epidermal ridges are characteristic features of the human skin1 and in wide use in the modern era as almost unsurpassed identification tools. The physiological role of epidermal ridges remains controversial.
Recent data have dismissed the theory that fingerprints might improve the grip by ramping up friction levels.2 Instead, epidermal ridges might amplify vibratory signals to deeply embedded nerves involved in fine texture perception.3 The factors underlying the formation of epidermal ridges during embryonic development and their pattern remain unknown but are likely to include both genetically determined traits4 and environmental elements5 and to involve some form of interaction among mesenchymal, dermal, and epidermal elements. At 24 weeks postfertilization, the epidermal-ridge system displays an adult morphology6 that remains permanent without any modification throughout life. The congenital absence of epidermal ridges is a rare condition known as adermatoglyphia (ADG). To date only four families with congenital absence of fingerprints have been described.7–10 In three of these families,7–9 additional features such as congenital facial milia, skin blisters, and fissures associated with heat or trauma were reported. A number of more complex syndromes such as Naegeli-Franceschetti-Jadassohn syndrome (MIM 161000) and dyskeratosis congenita (MIM 305000) also feature abnormal development of epidermal ridges,11,12 as detailed in a recent review of the topic.13 In the present study we investigated a large Swiss kindred presenting with autosomal-dominant adermatoglyphia recently coined as the “immigration delay disease”13 because affected individuals report significant difficulties entering countries that require fingerprint recording. All affected members of this family have displayed an absence of fingerprints since birth (Figure 1A); histological analysis13 revealed that this absence was associated with a reduced number of sweat glands, and a sweat test showed a reduced ability for hand transpiration (Figure 1B).
All affected (n = 9) and healthy (n = 7) family members or their legal guardian provided written and informed consent according to a protocol approved by the institutional review board of University Hospital Basel in adherence with the principles of the Declaration of Helsinki. DNA was extracted from peripheral blood lymphocytes. We initially genotyped all family members by using the Illumina Human Linkage-12 chip comprising 6000 tagged SNPs distributed across the genome. Two hundred ng of DNA were hybridized according to the Infinium II assay (Illumina, San Diego, CA) and scanned with an Illumina BeadArray reader. The scanned images were imported into BeadStudio 3.1.3.0 (Illumina) for extraction and quality control, with an average call rate of 99.9%. Multipoint linkage analysis with the Superlink software14 generated a LOD score of 2.85 at marker rs1509948 (Figure 2). Fine mapping of the disease interval was performed with polymorphic microsatellite markers that were selected from the National Center for Biotechnology Information (NCBI) database. Genotypes were established with fluorescently labeled primer pairs (Research Genetics, Invitrogen, Carlsbad, CA) according to the manufacturer’s recommendations. PCR products were separated by PAGE on an automated sequencer (ABI PRISM 3100 Genetic Analyzer; Applied Biosystems, Foster City, CA), and allele sizes were determined with Gene Mapper v4.0 software. Haplotype analysis refined the disease locus to a 5.1 Mb interval between markers D4S423 and D4S1560 (Figure 2).
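For readers unfamiliar with LOD scores, the following sketch shows how the statistic is defined in the simplest textbook case. It is an illustration under stated assumptions, not the multipoint computation performed by Superlink in the study: it assumes phase-known meioses and a single marker, and the function name `two_point_lod` is hypothetical. The LOD is the base-10 logarithm of the likelihood at recombination fraction theta divided by the likelihood under free recombination (theta = 0.5).

```python
import math


def two_point_lod(recombinants: int, meioses: int, theta: float) -> float:
    """Two-point LOD score for phase-known meioses (textbook simplification).

    LOD(theta) = log10( theta^r * (1 - theta)^(n - r) / 0.5^n )
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must lie between 0 and meioses")
    likelihood = theta ** recombinants * (1.0 - theta) ** (meioses - recombinants)
    if likelihood == 0.0:
        return float("-inf")  # data impossible at this theta
    return math.log10(likelihood / 0.5 ** meioses)


def lod_to_odds(lod: float) -> float:
    """Convert a LOD score to odds in favor of linkage."""
    return 10.0 ** lod


if __name__ == "__main__":
    # Ten informative meioses with no recombinants, evaluated at theta = 0.
    print(round(two_point_lod(0, 10, 0.0), 2))
    # Odds corresponding to the LOD of 2.85 reported in the text.
    print(round(lod_to_odds(2.85)))
```

Read this way, a LOD of 2.85 corresponds to odds of roughly 700 to 1 in favor of linkage at that marker.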
1 Department of Dermatology, Tel Aviv Sourasky Medical Center, Tel Aviv 64239, Israel; 2 Department of Biomedicine, University Hospital Basel, Basel 4051, Switzerland; 3 Department of Dermatology, University Hospital Basel, Basel 4051, Switzerland; 4 Department of Human Molecular Genetics and Biochemistry, Sackler Faculty of Medicine, Tel-Aviv University, Ramat Aviv 61390, Israel
*Correspondence: elisp@tasmc.health.gov.il
DOI 10.1016/j.ajhg.2011.07.004. ©2011 by The American Society of Human Genetics. All rights reserved.
302 The American Journal of Human Genetics 89, 302–307, August 12, 2011
  • 49. We found the disease interval contained 17 genes. All coding and noncoding exons of the disease interval genes were fully sequenced. Initially, no mutation was identified. We therefore carefully scrutinized all currently available databases for rare transcripts. We identified one minor transcript (ENST00000509418, NM_001128430.1), sharing a common nucleotide sequence with the 3′ end of SMARCAD1 (MIM 612761). SMARCAD1 encodes a protein that is structurally related to the SWI2/SNF2 superfamily of DNA-dependent ATPases, which function as catalytic subunits of chromatin-remodeling complexes and are consequently considered to be major regulators of transcriptional activity.15 The two SMARCAD1 isoforms differ in lengths and sites of transcription initiation. The shortest SMARCAD1 isoform is predicted to contain a unique 5′-nontranslated exon (Figure 3A). It is of interest that, in contrast with the major large isoform, which was found to be expressed ubiquitously as previously shown,16 the SMARCAD1 short isoform was mainly identifiable by RT-PCR in skin fibroblasts and to a lesser extent in keratinocytes and esophageal tissue (Figure 4), suggesting that it might represent an attractive candidate gene for a skin condition such as ADG. To assess the possible involvement of SMARCAD1 in ADG, genomic DNA was amplified by PCR with primer pairs spanning the entire coding sequence of both SMARCAD1 isoforms (Table S1, available online) and Taq polymerase (QIAGEN, Valencia, CA). Cycling conditions were 94°C for 2 min followed by three cycles at 94°C for 40 s, 61°C for 40 s, and 72°C for 40 s; three cycles at 94°C for 40 s, 59°C for 40 s, and 72°C for 40 s; three cycles at 94°C for 40 s, 57°C for 40 s, and 72°C for 40 s; 33 cycles at 94°C for 40 s, 55°C for 40 s, and 72°C for 40 s; and a final extension step at 72°C for 10 min. DNA was extracted from gel and purified with QIAquick Gel Extraction kit (QIAGEN).
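The cycling conditions above describe a stepped-down annealing program, which is easier to audit when written as data. The sketch below is an illustration only: the `Stage` class and function names are hypothetical, and the run-time estimate assumes the stated hold times and ignores ramping between temperatures.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Stage:
    """One block of the program: `steps` are (temperature_C, seconds) pairs."""
    cycles: int
    steps: Tuple[Tuple[int, int], ...]


def touchdown_program() -> List[Stage]:
    """Encode the cycling conditions quoted in the text."""
    denature, extend = (94, 40), (72, 40)
    stages = [Stage(1, ((94, 120),))]  # initial denaturation, 2 min
    # Annealing temperature steps down 61 -> 59 -> 57 -> 55 degrees C.
    for anneal_temp, cycles in [(61, 3), (59, 3), (57, 3), (55, 33)]:
        stages.append(Stage(cycles, (denature, (anneal_temp, 40), extend)))
    stages.append(Stage(1, ((72, 600),)))  # final extension, 10 min
    return stages


def total_cycles(stages: List[Stage]) -> int:
    """Count amplification cycles (three-step stages only)."""
    return sum(s.cycles for s in stages if len(s.steps) == 3)


def total_seconds(stages: List[Stage]) -> int:
    """Sum of hold times; ramp times between temperatures are not included."""
    return sum(s.cycles * sum(sec for _, sec in s.steps) for s in stages)


if __name__ == "__main__":
    program = touchdown_program()
    print("amplification cycles:", total_cycles(program))
    print("hold time (min):", total_seconds(program) / 60)
```

Written this way, the program amounts to 42 amplification cycles and 96 minutes of hold time.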
Direct sequencing of the resulting PCR products with the BigDye terminator system on an automated sequencer (Applied Biosystems) revealed a heterozygous G>T transversion in the first intron of the skin-specific SMARCAD1 short isoform. The mutation, c.378+1G>T, was predicted to abolish the donor splice site adjacent to the 3′ end of the first unique exon of the short SMARCAD1 isoform. To confirm the existence of the mutation, we used a PCR-RFLP assay. A 537 bp long DNA fragment was amplified with the forward primer 5′-AGCTGATTGGCTGGGAATAC-3′ and reverse primer 5′-GGCATTCATAAAACTCAAAATGC-3′ (Figure 3B). The mutation creates a recognition site for MseI endonuclease (New England Biolabs, Ipswich, MA). Using this assay, we confirmed segregation of the mutation with the disease phenotype throughout the entire family and also excluded the mutation from a panel of 100 healthy Swiss individuals and 100 healthy Jewish individuals (data not shown); this suggests that the mutation does not represent a common neutral polymorphism but rather is a disease-causing mutation. To assess the consequences of the mutation on the SMARCAD1-splicing pattern, we initially used RT-PCR to amplify cDNA derived from the RNA extracted from the fibroblast cell cultures that were established from a patient and a healthy individual. Total RNA was extracted with RNeasy Extraction Kit (QIAGEN). cDNA was synthesized (Thermo Scientific Verso cDNA Synthesis Kit, ABgene, Surrey, UK) and amplified by PCR with exon-crossing primers, 5′-GAAAGCAAGAATGTGGCAG-3′ and 5′-GGGCTTGAGTGACAAACT-3′, located in exons 1 and 3 of the short SMARCAD1 isoform, respectively. DNA was extracted from gel, purified with QIAquick Gel Extraction kit (QIAGEN), and directly sequenced as described above. Only the wild-type splice product was identified, suggesting that aberrant splice variants might undergo degradation.
To obtain further support for this possibility, we generated a minigene construct17 by subcloning exon 1, parts of intron 1 (because the first intron is very large [~10.5 kb], we trimmed the intronic sequence) and exon 2 of the SMARCAD1 short isoform into the pEGFP-C3 vector (Figure 5A). More specifically, a 1.7 kb genomic DNA fragment comprising exon 1 and the first 1358 bp of intron 1 was cloned into the EcoRI and KpnI restriction sites of the pEGFP-C3 vector with primers 5′-AAAAAGAATTCAAGAAATTAGAGCTTACATTTAG-3′ and 5′-AAAAAGGTACCTCACTGATTAACAGGGAAAAAG-3′, respectively.
Figure 1. Clinical Features (A and B) Absence of fingerprints (A) and reduced hand perspiration demonstrated by sweat test (B) in a patient with adermatoglyphia.
Then, a 0.7 kb genomic fragment comprising the last 500 bp of intron 1 followed by exon 2 was cloned into the KpnI and BamHI sites of the first construct with primers 5′-AAAAAGGTACCTATACTTTGATGATAGATGTGG-3′ and
  • 50. 5′-AAAAGGATCCCTTTGGTTTAGAATGGAAGG-3′, respectively. We sequenced the entire insert to verify the authenticity of the construct. Next, we introduced the c.378+1G>T mutation into the minigene by using the Quick Change Site-Directed Mutagenesis kit (Stratagene, Santa Clara, CA). Both the wild-type and the mutant minigene constructs were transiently transfected into HeLa cells with Lipofectamine 2000 (Invitrogen).
    Figure 2. Genetic Mapping of ADG (A) Multipoint LOD score analysis was performed with the SuperLink software. LOD scores are plotted against all SNP markers distributed across the genome. (B) Haplotype analysis with polymorphic markers on chromosomal region 4q22 reveals a heterozygous 5.1 Mb interval between markers D4S423 and D4S1560 uniquely shared by all patients (boxed in red).
    Cells were
  • 51. harvested 48 hr after transfection; total RNA was extracted and subjected to RT-PCR and direct sequencing. Transfection of the wild-type minigene resulted, as expected, in the formation of a single, abundant spliced variant containing exons 1 and 2 of the short SMARCAD1 isoform; this was confirmed by sequencing analysis. In contrast, transfection of the mutation-carrying minigene generated two aberrant splice variants through the utilization of cryptic donor splice sites: the first contained an extra 51 bp from intron 1, and the second lacked one G at the end of exon 1.
    Figure 3. Mutation Analysis (A) Bioinformatics analysis indicated the existence of two SMARCAD1 isoforms differing both in lengths and sites of transcription start site. The short SMARCAD1 isoform contains a unique nontranslated exon (red arrow). (B) Sequence analysis revealed a heterozygous transversion, c.378+1G>T, in the short SMARCAD1 isoform (red arrow, left panel). The wild-type sequence is given for comparison (right panel). (C) PCR-RFLP analysis confirmed segregation of the mutation in the family. Mutation c.378+1G>T creates a recognition site for MseI endonuclease; thus, healthy individuals display fragments of 163 bp and 46 bp, whereas affected heterozygous patients show in addition fragments of 73 bp and 90 bp.
    Figure 4. Tissue Expression of SMARCAD1 Isoforms SMARCAD1 isoform expression was assessed with Clontech tissue blot cDNA array. Quantitative RT-PCR analysis showed that the long SMARCAD1 isoform is expressed ubiquitously at low level. In contrast, the short SMARCAD1 isoform was found to be expressed mainly in skin fibroblasts, keratinocytes, and the esophagus. Expression of SMARCAD1 was normalized to that of ACTB. Results are provided as fold change relative to expression of the SMARCAD1 long isoform in keratinocytes ± standard deviation.
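The logic of the PCR-RFLP assay can be sketched with a small in-silico digest. MseI recognizes TTAA and cuts after the first T, so a G>T change that turns GTAA into TTAA introduces a new cut. The sequences below are short, hypothetical stand-ins and not the real SMARCAD1 sequence (the source does not give the bases flanking the mutation); the function name `digest` is also an assumption. The final lines check only that the extra patient bands reported in the text sum to the size of the band they derive from.

```python
from typing import List


def digest(seq: str, site: str = "TTAA", cut_offset: int = 1) -> List[int]:
    """Return fragment lengths after cutting `seq` at every `site`.

    The defaults model MseI (T^TAA): the cut falls one base into the site.
    """
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


# Hypothetical toy amplicon: "exon" followed by an "intron" starting with GT.
EXON = "CCATGCAG"
WILD_TYPE = EXON + "GTAAGCCTGACT"
# +1G>T: the first intronic base changes from G to T, creating TTAA.
MUTANT = EXON + "T" + WILD_TYPE[len(EXON) + 1:]

if __name__ == "__main__":
    print("wild type fragments:", digest(WILD_TYPE))
    print("mutant fragments:", digest(MUTANT))
    # Band sizes from the text: the 163 bp band is split into 73 bp and 90 bp.
    print("73 + 90 == 163:", 73 + 90 == 163)
```

A heterozygous sample carries both alleles, which is why affected individuals show the uncut band alongside its two cleavage products.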
Of interest, the abnormal splice products were only marginally detectable as compared with the wild-type RNA, both in HeLa cells (Figure 5B) and in primary human fibroblasts (data not shown). These results are in line with the fact that aberrant splice variants were not detectable in patient fibroblasts (see above). Two main mechanisms, alone or in combination, might explain this observation. First, authentic splicing is typically more efficient than splicing activated at cryptic sites.18 Therefore, it is possible that the significantly reduced level of aberrant splice variants is due to a decrease in splicing efficiency. Another possibility is that the abnormal 5′ UTR variants affect RNA stability. Indeed, alteration in the secondary structure of an RNA molecule has been shown to inhibit translation initiation directly, by preventing the 40S subunit binding or scanning, or indirectly, by preventing the action of regulatory RNA-binding proteins. This in turn has been shown to foster mRNA degradation by increasing decapping and the deadenylation rate.19 To assess this possibility, we initially compared via computational analysis the secondary structure of wild-type and aberrant splice RNA variants by using the GeneBee RNA secondary-structure prediction software. As shown in Figure 5C, computational analysis predicts that both aberrant splice variations are likely to significantly affect RNA secondary configuration; this prediction is in agreement with the fact that the 5′ UTR region of the gene affected by the abnormal splicing is highly conserved across species at the nucleotide level (data not shown). To obtain experimental support for the possibility that aberrantly spliced variants of the SMARCAD1 short isoform
  • 52. undergo degradation, we treated cells transfected with both the wild-type and mutation-carrying constructs with cycloheximide at a concentration of 50 μg/ml for 24 hr, which is known to inhibit decapping of mRNA.20 As a result, we observed a significant increase in the aberrant splice variant levels but not in the wild-type splice variant (Figure 5D). In conclusion, we have identified in a large family with ADG a splice site mutation causing aberrant splicing of a skin-specific isoform of SMARCAD1, implicating this gene in dermatoglyph ontogenesis. The mutation is likely to exert a loss-of-function effect. Little is known about the function of the full-length SMARCAD1, and virtually nothing is known regarding the physiological role of the skin-specific isoform of this gene. Clearly, the tissue-specific pattern of expression of the short isoform is likely to underlie the very limited phenotype displayed by our patients, as attested by the severe phenotype observed in mice knocked out for the ubiquitous SMARCAD1 large isoform of the gene;21 those mice feature retarded growth, perinatal mortality, decreased fertility, and various skeletal defects. The full-length SMARCAD1 seems to control the expression of a large spectrum of target genes encoding transcriptional factors and histone modifiers as well as regulators of the cell cycle and development.16 It is tempting to speculate that the skin-specific isoform of SMARCAD1 might target genes involved in dermatoglyph and sweat gland development, two structures jointly affected in the present family and in additional disorders such as Naegeli-Franceschetti-Jadassohn and Rapp-Hodgkin (MIM 129400) syndromes.11,22 Regardless of the exact mechanisms mediating the activity of the skin-specific isoform of SMARCAD1 in the skin, the present results once again underscore the fact that rare monogenic traits represent an invaluable tool for the investigation of concealed aspects of our biology.
Supplemental Data
Supplemental Data include one table and can be found with this article online at http://www.cell.com/AJHG/.
Acknowledgments
We would like to acknowledge the participation of all family members in this study. We would like to thank Sylvia Kiese for her help. We wish to thank Gil Ast, Hadas Keren, and Mordechai Choder for helpful discussions.
Figure 5. Consequences of Mutation c.378+1G>T
To assess the consequences of mutation c.378+1G>T on SMARCAD1 splicing, we used a minigene system.
(A) Schematic representation of the SMARCAD1 short isoform wild-type and mutation-carrying minigenes.
(B) Sequence analysis of RT-PCR products generated from HeLa cells transfected with wild-type and mutant minigene constructs. Transfection of wild-type minigene resulted in the formation of one spliced variant containing exons 1 and 2 of the SMARCAD1 short isoform. In contrast, transfection of the mutant minigene resulted in two aberrant splice variants, containing an extra 51 bp from intron 1 or missing one G at the end of exon 1. A marked decrease in the level of expression of the spliced variants was also observed.
(C) Computational modeling predicts an altered mRNA secondary structure of both aberrant splice variants.
(D) Treatment with cycloheximide (at a concentration of 50 μg/ml for 24 hr), known to inhibit mRNA decapping, resulted in significantly increased levels of aberrant (but not wild-type) splice variants.
  • 53. Received: June 7, 2011 Revised: July 4, 2011 Accepted: July 8, 2011 Published online: August 4, 2011
Web Resources
The URLs for data presented herein are as follows:
dbSNP, http://www.ncbi.nlm.nih.gov/SNP/
Ensembl, http://www.ensembl.org/
GenBank, http://www.ncbi.nlm.nih.gov/Genbank/
GeneBee, http://www.genebee.msu.su/
Online Mendelian Inheritance in Man (OMIM), http://www.omim.org
Superlink, http://bioinfo.cs.technion.ac.il/superlink-online-twoloci/makeped/TwoLociMultiPoint.html
UCSC Genome Browser, http://genome.ucsc.edu/
References
1. Verbov, J. (1970). Clinical significance and genetics of epidermal ridges—a review of dermatoglyphics. J. Invest. Dermatol. 54, 261–271.
2. Warman, P.H., and Ennos, A.R. (2009). Fingerprints are unlikely to increase the friction of primate fingerpads. J. Exp. Biol. 212, 2016–2022.
3. Scheibert, J., Leurent, S., Prevost, A., and Debrégeas, G. (2009). The role of fingerprints in the coding of tactile information probed with a biomimetic sensor. Science 323, 1503–1506.
4. Reed, T., Viken, R.J., and Rinehart, S.A. (2006). High heritability of fingertip arch patterns in twin-pairs. Am. J. Med. Genet. A. 140, 263–271.
5. Bokhari, A., Coull, B.A., and Holmes, L.B. (2002). Effect of prenatal exposure to anticonvulsant drugs on dermal ridge patterns of fingers. Teratology 66, 19–23.
6. Babler, W.J. (1991). Embryologic development of epidermal ridges and their configurations. Birth Defects Orig. Artic. Ser. 27, 95–112.
7. Baird, H.W. (1968). Absence of fingerprints in four generations. Lancet 2, 1250.
8. Basan, M. (1965). Ectodermal dysplasia. Missing papillary pattern, nail disorders and furrows on 4 fingers. Arch. Klin. Exp. Dermatol. 222, 546–557.
9. Reed, T., and Schreiner, R.L. (1983). Absence of dermal ridge patterns: Genetic heterogeneity. Am. J. Med. Genet. 16, 81–88.
10. Límová, M., Blacker, K.L., and LeBoit, P.E. (1993). Congenital absence of dermatoglyphs. J. Am. Acad. Dermatol. 29, 355–358.
11. Lugassy, J., Itin, P., Ishida-Yamamoto, A., Holland, K., Huson, S., Geiger, D., Hennies, H.C., Indelman, M., Bercovich, D., Uitto, J., et al. (2006). Naegeli-Franceschetti-Jadassohn syndrome and dermatopathia pigmentosa reticularis: Two allelic ectodermal dysplasias caused by dominant mutations in KRT14. Am. J. Hum. Genet. 79, 724–730.
12. Sirinavin, C., and Trowbridge, A.A. (1975). Dyskeratosis congenita: Clinical features and genetic aspects. Report of a family and review of the literature. J. Med. Genet. 12, 339–354.
13. Burger, B., Fuchs, D., Sprecher, E., and Itin, P. (2011). The immigration delay disease: Adermatoglyphia-inherited absence of epidermal ridges. J. Am. Acad. Dermatol. 64, 974–980.
14. Fishelson, M., and Geiger, D. (2002). Exact genetic linkage computations for general pedigrees. Bioinformatics 18 (Suppl 1), S189–S198.
15. Adra, C.N., Donato, J.L., Badovinac, R., Syed, F., Kheraj, R., Cai, H., Moran, C., Kolker, M.T., Turner, H., Weremowicz, S., et al. (2000). SMARCAD1, a novel human helicase family-defining member associated with genetic instability: Cloning, expression, and mapping to 4q22-q23, a band rich in breakpoints and deletion mutants involved in several human diseases. Genomics 69, 162–173.
16. Okazaki, N., Ikeda, S., Ohara, R., Shimada, K., Yanagawa, T., Nagase, T., Ohara, O., and Koga, H. (2008). The novel protein complex with SMARCAD1/KIAA1122 binds to the vicinity of TSS. J. Mol. Biol. 382, 257–265.
17. Singh, G., and Cooper, T.A. (2006). Minigene reporter for identification and analysis of cis elements and trans factors affecting pre-mRNA splicing. Biotechniques 41, 177–181.
18. Roca, X., Sachidanandam, R., and Krainer, A.R. (2003). Intrinsic differences between authentic and cryptic 5′ splice sites. Nucleic Acids Res. 31, 6321–6333.
19. Day, D.A., and Tuite, M.F. (1998). Post-transcriptional gene regulatory mechanisms in eukaryotes: An overview. J. Endocrinol. 157, 361–371.
20. Schwartz, D.C., and Parker, R. (1999). Mutations in translation initiation factors lead to increased rates of deadenylation and decapping of mRNAs in Saccharomyces cerevisiae. Mol. Cell. Biol. 19, 5247–5256.
21. Schoor, M., Schuster-Gossler, K., Roopenian, D., and Gossler, A. (1999). Skeletal dysplasias, growth retardation, reduced postnatal survival, and impaired fertility in mice lacking the SNF2/SWI2 family member ETL1. Mech. Dev. 85, 73–83.
22. Atasu, M., Akesi, S., Elçioglu, N., Yatmaz, P.I., and Ertas, E.B. (1999). A Rapp-Hodgkin like syndrome in three sibs: Clinical, dental and dermatoglyphic study. Clin. Dysmorphol. 8, 101–110.
  • 55. REVIEW
Five Years of GWAS Discovery
Peter M. Visscher,1,2,* Matthew A. Brown,1 Mark I. McCarthy,3,4 and Jian Yang5
The past five years have seen many scientific and biological discoveries made through the experimental design of genome-wide association studies (GWASs). These studies were aimed at detecting variants at genomic loci that are associated with complex traits in the population and, in particular, at detecting associations between common single-nucleotide polymorphisms (SNPs) and common diseases such as heart disease, diabetes, auto-immune diseases, and psychiatric disorders. We start by giving a number of quotes from scientists and journalists about perceived problems with GWASs. We will then briefly give the history of GWASs and focus on the discoveries made through this experimental design, what those discoveries tell us and do not tell us about the genetics and biology of complex traits, and what immediate utility has come out of these studies. Rather than giving an exhaustive review of all reported findings for all diseases and other complex traits, we focus on the results for auto-immune diseases and metabolic diseases. We return to the perceived failure or disappointment about GWASs in the concluding section.
Introduction: Have GWASs Been a Failure?
In the past five years, genome-wide association studies (GWASs) have led to many scientific discoveries, and yet at the same time, many people have pointed to various problems and perceived failures of this experimental design. Let us begin by considering a number of criticisms that have been made against GWASs. We do not list these quotes to discredit any of the scientists or journalists involved, nor to deliberately cite them out of context. Rather, they serve to confirm that the points we discuss in this review are related to beliefs held by a significant number of scientific commentators and therefore warrant consideration.
From an interview with Sir Alec Jeffreys, ESHG Award Lecturer 2010:
“One of the great hopes for GWAS was that, in the same way that huge numbers of Mendelian disorders were pinned down at the DNA level and the gene and mutations involved identified, it would be possible to simply extrapolate from single gene disorders to complex multigenic disorders. That really hasn’t happened. Proponents will argue that it has worked and that all sorts of fascinating genes that predispose to or protect against diabetes or breast cancer, for example, have been identified, but the fact remains that the bulk of the heritability in these conditions cannot be ascribed to loci that have emerged from GWAS, which clearly isn’t going to be the answer to everything.”
From McClellan and King, Cell 2010 (ref. 1):
“To date, genome-wide association studies (GWAS) have published hundreds of common variants whose allele frequencies are statistically correlated with various illnesses and traits. However, the vast majority of such variants have no established biological relevance to disease or clinical utility for prognosis or treatment.”
“An odds ratio of 3.0, or even of 2.0 depending on population allele frequencies, would be robust to such population stratification. However, odds ratios of the magnitude generally detected by GWAS (<1.5) can frequently be explained by cryptic population stratification, regardless of the p value associated with them.”
“More generally, it is now clear that common risk variants fail to explain the vast majority of genetic heritability for any human disease, either individually or collectively (Manolio et al., 2009).”
“The general failure to confirm common risk variants is not due to a failure to carry out GWAS properly. The problem is underlying biology, not the operationalization of study design. The common disease–common variant model has been the primary focus of human genomics over the last decade.
Numerous international collaborative efforts representing hundreds of important human diseases and traits have been carried out with large well-characterized cohorts of cases and controls. If common alleles influenced common diseases, many would have been found by now. The issue is not how to develop still larger studies, or how to parse the data still further, but rather whether the common disease–common variant hypothesis has now been tested and found not to apply to most complex human diseases.”
From Nicholas Wade in the New York Times, March 20 2011:
“More common diseases, like cancer, are thought to be caused by mutations in several genes, and finding the causes was the principal goal of the $3 billion
1 University of Queensland Diamantina Institute, Princess Alexandra Hospital, Brisbane, Queensland 4102, Australia; 2 The Queensland Brain Institute, The University of Queensland, Brisbane, Queensland 4072, Australia; 3 Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK; 4 Oxford Centre for Diabetes, Endocrinology and Metabolism, Churchill Hospital, Old Road, Headington, Oxford OX3 7LJ, UK; 5 Queensland Institute of Medical Research, 300 Herston Road, Brisbane, Queensland 4006, Australia
*Correspondence: peter.visscher@uq.edu.au
DOI 10.1016/j.ajhg.2011.11.029. ©2012 by The American Society of Human Genetics. All rights reserved.
The American Journal of Human Genetics 90, 7–24, January 13, 2012 7
  • 56. human genome project. To that end, medical geneticists have invested heavily over the last eight years in an alluring shortcut. But the shortcut was based on a premise that is turning out to be incorrect. Scientists thought the mutations that caused common diseases would themselves be common. So they first identified the common mutations in the human population in a $100 million project called the HapMap. Then they compared patients’ genomes with those of healthy genomes. The comparisons relied on ingenious devices called SNP chips, which scan just a tiny portion of the genome. (SNP, pronounced ‘snip,’ stands for single nucleotide polymorphism.) These projects, called genome-wide association studies, each cost around $10 million or more. The results of this costly international exercise have been disappointing. About 2,000 sites on the human genome have been statistically linked with various diseases, but in many cases the sites are not inside working genes, suggesting there may be some conceptual flaw in the statistics. And in most diseases the culprit DNA was linked to only a small portion of all the cases of the disease. It seemed that natural selection has weeded out any disease-causing mutation before it becomes common.”
From Tim Crow, Molecular Psychiatry 2011 (ref. 2):
“There comes a point at which the genetic skeptic can be pardoned the suggestion that if the genes are so small and so multiple, what they are hardly matters, the dividing line between polygenes and no genes is of little practical consequence. Have we reached this point?”
From a commentary article by Jonathan Latham, on guardian.co.uk, 17 April 2011:
“Among all the genetic findings for common illnesses, such as heart disease, cancer and mental illnesses, only a handful are of genuine significance for human health. Faulty genes rarely cause, or even mildly predispose us, to disease, and as a consequence the science of human genetics is in deep crisis.
Since the Collins paper [Manolio et al. 2009 (ref. 3)] was published nothing has happened to change that conclusion. It now seems that the original twin-study critics were more right than they imagined. The most likely explanation for why genes for common diseases have not been found is that, with few exceptions, they do not exist.”
These quotes raise a number of different issues about the methodology, research outcomes, and utility of the research findings. The pertinent points made in these quotes are: (1) GWASs are founded on a flawed assumption that genetics plays an important role in the risk to common diseases; (2) GWASs have been disappointing in not explaining more genetic variation in the population; (3) GWASs have not delivered meaningful, biologically relevant knowledge or results of clinical or any other utility; and (4) GWAS results are spurious.
In this review we will briefly give the history of GWASs and then focus on the discoveries made through this experimental design, what those discoveries tell us and do not tell us about the genetics and biology of complex traits, and what immediate utility has come out of these studies. We will focus on the results for auto-immune diseases and metabolic diseases, although there have been important findings for other diseases and complex traits. In the concluding section, we will again consider the perceived failure or disappointment of GWASs.
What Are GWASs, and How Did We Get There?
Attempts to use linkage analysis to map genomic loci that have an effect on disease or other complex traits have been ubiquitous in the last two decades. Gene mapping by linkage relies on the cosegregation of causal variants with marker alleles within pedigrees. We define and discuss what we mean by “causal” in Box 1. Because the number of recombination events per meiosis is relatively small, tagging a causal variant requires only a few genetic markers per chromosome.
The downside of the small number of recombination events is that the mapping resolution, i.e., how close to the causal variant one can get through linked markers, is typically low. Linkage mapping has been extremely successful in mapping genes and gene variants affecting Mendelian traits (e.g., single-gene disorders).4 Mapping loci underlying common diseases and, in particular, identifying causative mutations have had much less success. There are many reasons for the failure of linkage analyses to reliably identify complex-trait loci in human pedigrees. One reason is that the effect sizes (“penetrance”) of individual causal variants are too small to allow detection via cosegregation within pedigrees.
GWASs are based upon the principle of linkage disequilibrium (LD) at the population level. LD is the nonrandom association between alleles at different loci. It is created by evolutionary forces such as mutation, drift, and selection and is broken down by recombination.5 Generally, loci that are physically close together exhibit stronger LD than loci that are farther apart on a chromosome. The larger the (effective) population size, the weaker the LD for a given distance.6 (Linkage analysis exploits the large LD within pedigrees.) The genomic distance at which LD decays determines how many genetic markers are needed to “tag” a haplotype, and the number of such tagging markers is much smaller than the total number of segregating variants in the population. For example, a selection of approximately 500,000 common SNPs in the human genome is sufficient to tag common variation
  • 57. in non-African populations, even though the total number of common SNPs exceeds 10 million.7
Geneticists realized some time ago that they could exploit population-based LD to map genes. For example, Bodmer suggested in 1986 that fine-mapping using population association could lead to closer linkage between a causative mutation and a linked marker.82 However, fine-mapping still relied on having an initial genomic location that is obtained from linkage analysis in family studies. What if we do not have any prior information on genomic loci or, alternatively, we deliberately want an unbiased scan of the genome? In a landmark paper, Risch and Merikangas83 showed that performing an association scan involving one million variants in the genome and a sample of unrelated individuals could be more powerful than performing a linkage analysis with a few hundred markers.
It took only 10 years before this theoretical design became reality. What was needed was the discovery (accelerated by the sequencing of the human genome) of hundreds of thousands of single-nucleotide variants, the quantification of the correlation (LD) structure of those markers in the human genome, and the ability to accurately genotype hundreds of thousands of markers in an automated and affordable manner. The LD structure was investigated in the HapMap project,7 and the outcome was a list of tag SNPs that captured most of the common genomic variation in a number of human populations. Concurrently, commercial companies produced dense SNP arrays that could genotype many markers in a single assay. The technological advances together with biobanks of either population cohorts or case-control samples facilitated the ability to conduct GWASs.
Although GWASs are unbiased with respect to prior biological knowledge (or prior beliefs) and with respect to genome location, they are not unbiased in terms of what is detectable. GWASs rely on LD between genotyped SNPs and ungenotyped causal variants.
The strength of statistical association between alleles at two loci in the genome strongly depends on their allele frequencies, such that a rare variant (say, one with a frequency <0.01) will be in low LD (as measured by r²) with a nearby common variant, even if they map to the same recombination interval.84 But the SNPs that are on the SNP chips have been selected to be common (most have a minor allele frequency >0.05). Therefore, GWASs are by design powered to detect association with causal variants that are relatively common in the population. Is it realistic to assume that common causal variants for disease segregate in the population? This is discussed in Box 2.
(Nearly) Five Years of Discovery
Although the first results from a GWAS were reported in 2005 (ref. 8) and 2006 (ref. 9), we take the 2007 Wellcome Trust Case Control Consortium (WTCCC) paper in Nature10 as a starting point. The reason for this is that the WTCCC study was the first large, well-designed GWAS for complex diseases to employ a SNP chip that had good coverage of the genome. There are many ways to summarize the discoveries based on GWASs in the last five years. We have tried to separate the discoveries quantitatively and to focus on the biology. There are now well over 2000 loci that are significantly and robustly associated with one or more complex traits (see GWAS catalog in Web Resources), as shown in Figure 1. The vast majority of the loci identified are new, i.e., before 2007 their association with disease or other complex traits
Box 1. What Is a Causal Variant?
New mutations that contribute to an increase or decrease in risk to disease arise in populations all the time. Some of these mutations can reach an appreciable frequency in the population, for example by random drift or by natural selection. As discussed in the main text, these mutations will be associated with other variants in the genome through LD.
Such associations will include those with SNPs that are genotyped on “SNP chips.” Because there are many more segregating variants in the population than those genotyped in GWASs, it is unlikely, but not impossible, that a mutation is genotyped itself, and so its effect usually will be detected through an association with a genotyped variant. This genotyped variant can be robustly associated with disease in multiple samples from the same population, or even across populations, but it is not the mutation that causes variation in risk. The results from GWASs have shown that variants at many genetic loci in the genome are associated with disease, and these also reflect many ancestral mutations with an effect on susceptibility to disease. Therefore, the effect size (in terms of increasing or decreasing the absolute probability of disease) is, on average, small, and individual variants are neither necessary nor sufficient to cause disease. Herein lies the problem of defining “causal”: How do we prove that a particular mutation causes the observed effect on variation in the population? Engineering the same mutation in a cell or animal model might give a relevant phenotype, but that is not a proof. The mutation can have a direct effect on gene expression in human tissues or be functional in another way, but that doesn’t prove it has a causal effect on disease risk. Operationally, in this review what we mean by “causal variant” is an (unknown) variant that has a direct or indirect functional effect on disease risk, rather than a variant that is associated with disease risk through LD, even if we don’t have the tools available at present to prove causality beyond reasonable doubt. Hence, it is the variant that causes the observed association signal.
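The main text's point that a rare variant is in low LD (as measured by r²) with a nearby common variant can be illustrated with a short calculation. The sketch below is not taken from the review: it assumes two biallelic loci and the standard definition r² = D²/[pA(1 − pA)pB(1 − pB)], and the function name max_r2 and the example frequencies are hypothetical.

```python
def max_r2(p_a: float, p_b: float) -> float:
    """Largest r^2 attainable between two biallelic loci.

    p_a and p_b are the frequencies of one chosen allele at each locus
    (illustrative inputs, not values taken from the review).
    """
    # D = P(AB) - p_a * p_b is bounded because every haplotype
    # frequency has to stay between 0 and 1.
    d_pos = min(p_a * (1 - p_b), (1 - p_a) * p_b)  # largest positive D
    d_neg = min(p_a * p_b, (1 - p_a) * (1 - p_b))  # largest |negative D|
    d_max = max(d_pos, d_neg)
    return d_max ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))


# A variant at frequency 0.01 against a common SNP at frequency 0.30,
# compared with two variants of matched frequency.
rare_vs_common = max_r2(0.01, 0.30)
matched = max_r2(0.30, 0.30)
print(f"{rare_vs_common:.3f} {matched:.3f}")
```

Under these assumptions the rare variant can reach an r² of only about 0.02 with the common SNP, whereas two variants of equal frequency can reach an r² of 1, which is why arrays of common SNPs tag rare causal variants poorly.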
  • 58. was not known. Essentially, these are 2000 new biological leads. The number of loci identified per complex trait varies substantially, from a handful for psychiatric diseases to a hundred or more for inflammatory bowel disease (IBD1 [MIM 266600], including Crohn disease [CD]11 and ulcerative colitis [UC]12) and stature.13 Importantly, the number of discovered variants is strongly correlated with experimental sample size (Figure 2), which predicts that an ever-increasing discovery sample size will increase the number of discovered variants: very roughly, after a minimum sample-size threshold below which no variants are detected is reached, a doubling in sample size leads
Box 2. The CDCV Hypothesis
Currently, the allele frequency of variants that contribute to cause common disease is a subject of some debate.85,86 The common disease-common variant (CDCV) hypothesis is sometimes said to be one side of this debate; the other side holds that disease-causing alleles are typically rare. But what is the precise “hypothesis” in the CDCV hypothesis? We tried to find the origin of the CDCV hypothesis. Many researchers cite either Lander87 or Risch and Merikangas.83 We will add Chakravarti88 and Reich and Lander89 as key studies. Lander87 noted from the then-available data that there is a limited diversity in coding regions at genes, in that most variants are very rare, and therefore the effective number of alleles is small. In addition, he provided “tantalizing examples” of common alleles with large effects (for example, such alleles include APOE [MIM 107741], MTHFR [MIM 607093], and ACE [MIM 106180]). Reich and Lander89 presented a theoretical population-genetics model that predicted a relatively simple spectrum of the frequency of disease risk alleles at a particular disease locus.
They (re)phrased the CDCV hypothesis as the prediction that the expected allelic identity is high for those disease loci that are responsible for most of the population risk for disease. These studies did not appear to make any prediction about the number of disease loci or, therefore, about the effect size. What the authors stated was that if a disease was common, there was likely to be one disease-causing allele that was much more common than all the other disease-causing alleles at the same locus.87,89 Risch and Merikangas83 quantified two important points regarding the detection of disease loci: first, that detection by association is more powerful than linkage when the genotype-relative risk is modest or small and the risk-allele frequency is large (say, >10%); and second, that the multiple-testing burden of a genome scan by association does not prevent the detection of genome-wide-significant findings. This paper was essentially about experimental design and statistical power (and hence feasibility), not about the CDCV hypothesis as such. Finally, Chakravarti88 pointed out that if individuals with disease needed to be homozygous for risk variants at multiple loci, then the risk alleles at those loci must be more common than they would be in a model in which homozygosity at any risk locus is sufficient to cause disease. We note that without the assumption of strong epistasis on the scale of liability, there is no need for risk variants to be common. For example, Risch’s multilocus multiplicative model,90 which implies an additive model on the log (risk) scale (it is one of the “exchangeable” models91), does not rely on a particular allelic spectrum of risk-allele frequencies. What all these landmark papers have in common is a remarkable foresight in predicting the GWAS era well before the publication of the full draft of the human genome sequence, the HapMap project, or the availability of commercial genotyping.
But what can we conclude about the origin and specifics of the CDCV hypothesis? As implicitly or explicitly stated in these key papers, there is no strong prediction about the exact allele-frequency spectrum of risk variants in the genome, nor a prediction about the effect size at any disease loci and hence about the total number of risk alleles in the genome. The current debate is about the frequency spectrum of disease-causing alleles. Phrasing the debate as an either/or question is not very helpful because examples of both common and rare alleles are already known, but there is still an open question as to whether most genetic variation contributing to complex traits in the population is caused by rare variants or common variants. A more general question regards the spectrum of allele frequencies of disease-causing alleles and the joint distribution between risk-allele frequency and effect size. In the special case of an evolutionarily neutral model and a constant effective population size, most causal variants that are segregating in the population will be rare, but most heritability will be due to common variants.79,92 The reason for this apparent paradox is that the number of segregating variants is proportional to 1/[p(1 − p)], where p is the allele frequency of a risk-increasing allele (so the smaller p, the more variants of that frequency), whereas the heritability contributed at that frequency is proportional to p(1 − p). The net effect is that the heritability is distributed equally over all frequencies, and cumulatively most heritability is contributed by common variants.
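The neutral-model argument that closes Box 2 can be checked with a few lines of arithmetic. The sketch below is an illustration rather than part of the review: it assumes that the density of segregating variants is exactly proportional to 1/[p(1 − p)], that effect size does not depend on frequency, and that allele frequencies lie between a hypothetical lower bound p_min and 1 − p_min (some bound is needed because the density cannot be integrated all the way to 0 or 1). The 1% cut-off for "rare" and the function names are also assumptions.

```python
import math


def logit(p: float) -> float:
    return math.log(p / (1 - p))


def rare_shares(p_min: float = 1e-4, rare_cutoff: float = 0.01):
    """Share of variants, and of heritability, with minor allele
    frequency below rare_cutoff under the assumed neutral model."""
    # The number of variants with frequency in [a, b] is proportional to
    # the integral of 1/[p(1 - p)], which equals logit(b) - logit(a).
    total_variants = logit(1 - p_min) - logit(p_min)
    rare_variants = 2 * (logit(rare_cutoff) - logit(p_min))  # both tails
    # Heritability per variant is proportional to p(1 - p), so density
    # times heritability is constant and heritability in [a, b] ~ b - a.
    total_h2 = (1 - p_min) - p_min
    rare_h2 = 2 * (rare_cutoff - p_min)
    return rare_variants / total_variants, rare_h2 / total_h2


variant_share, h2_share = rare_shares()
print(f"{variant_share:.2f} {h2_share:.3f}")
```

With these illustrative bounds, roughly half of all segregating variants have a minor allele frequency below 1%, yet together they account for only about 2% of the heritability, which is the apparent paradox the box describes. The exact shares depend on the assumed p_min.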
  • 59. to a doubling of the number of associated variants discovered. The proportion of genetic variation explained by significantly associated SNPs is usually low (typically less than 10%) for many complex traits, but for diseases such as CD and multiple sclerosis (MS [MIM 126200]), and for quantitative traits such as height and lipid traits, between 10% and 20% of genetic variance has been accounted for (Table 1). In comparison to the pre-GWAS era, the proportion of genetic variation accounted for by newly discovered variants that are segregating in the population is large.
It is clear that for most complex traits that have been investigated by GWAS, multiple identified loci have genome-wide statistical significance, and thus it is likely that there are (many) other loci that have not been identified because of a lack of statistical significance (false negatives). Recently, researchers have developed and applied methods to quantify the proportion of phenotypic variation that is tagged when one considers all SNPs simultaneously.12–14 These methods focus on estimation rather than hypothesis testing and do not suffer from false negatives caused by small effect sizes.15 Whole-genome approaches to estimating genetic variation have shown that approximately one-third to one-half of additive genetic variation in the population is being tagged when all GWAS SNPs are considered simultaneously.12–14 This is a surprisingly large proportion given that evolutionary theory predicts that most variants affecting disease risk ought to be found at a low frequency in the population if they affect fitness,16,17 and such risk variants would not be in sufficient LD with the common SNPs to be detected in GWASs.
Autoimmune Diseases
We concentrate on seven auto-immune diseases: ankylosing spondylitis (AS [MIM 106300]), rheumatoid arthritis (RA [MIM 180300]), systemic lupus erythematosus (SLE [MIM 152700]), type 1 diabetes (T1D [MIM 222100]), MS, CD, and UC.
Table 2 summarizes the number of genes that have been identified for these diseases. Across these diseases, 19 loci (mainly related to human leukocyte antigen) were known prior to 2007, and 277 have been discovered from 2007 onward. The total of 277 includes multiple counts of loci that have been implicated across a number of diseases; such loci include BLK (MIM 191305), TNFAIP3 (MIM 191163) and CD40 (MIM 109535).
Inflammatory bowel disease (IBD, not to be confused here with identity by descent) is thought to arise from dysregulation of intestinal homeostasis.18 GWASs of IBD (CD and UC) have been highly successful in terms of the number of loci identified (99 nonoverlapping loci in
Figure 1. GWAS Discoveries over Time
Data obtained from the Published GWAS Catalog (see Web Resources). Only the top SNPs representing loci with association p values < 5 × 10⁻⁸ are included and, so that multiple counting is avoided, SNPs identified for the same traits with LD r² > 0.8 estimated from the entire HapMap samples are excluded.
Figure 2. Increase in Number of Loci Identified as a Function of Experimental Sample Size
(A) Selected quantitative traits.
(B) Selected diseases.
The coordinates are on the log scale. The complex traits were selected with the criteria that there were at least three GWAS papers published on each in journals with a 2010–2011 journal impact factor >9 (e.g., Nature, Nature Genetics, the American Journal of Human Genetics, and PLoS Genetics) and that at least one paper contained more than ten genome-wide significant loci. These traits are a representative selection among all complex traits that fulfilled these criteria.
  • 60. total18), and a substantial proportion of familial risk, about 20%, has been accounted for.11,12,18 Twenty-eight risk loci are shared between CD and UC, despite the fact that these diseases display distinct clinical features, and it has been suggested that the two diseases share pathways and are part of a mechanistic continuum.18 There are also strong overlaps between genes involved in CD and UC, AS,19 and psoriasis (MIM 177900), again suggesting shared aetiopathogenic mechanisms in these conditions. Pleiotropic genetic effects are becoming increasingly widely identified, including in classical autoimmune diseases.20 For example, a coding variant in the gene PTPN22 (MIM 600716) confers strong risk for T1D and RA as well as protection against CD.18
Metabolic Diseases
In terms of metabolic diseases, we focus here specifically on type 2 diabetes (T2D [MIM 125853]); fasting glucose and insulin levels; body-mass index (BMI) and obesity; and fat distribution. A recent review21 already covered these complex traits, but we have updated that review wherever necessary. Table 3 gives an overview of the number of loci identified. More than 20 major GWASs for T2D have been published to date,21–24 and there has been a cumulative tally of around 50 genome-wide-significant hits,21,23,24 only three of which were known before the GWAS era. Most of these studies have involved individuals of European descent; the latest published effort is from the DIAGRAM (Diabetes Genetics Replication and Meta-analysis) Consortium and includes more than 47,000 GWAS individuals and 94,000 samples for replication. More recently, equivalent studies have emerged from samples of East Asians,23,25–27 South Asians,22 and Hispanics,28,29 and large studies involving African Americans and other major ethnic groups are underway. 
Notwithstanding differences in allele frequency and LD patterns, most of the signals found in one ethnic group show some evidence of association in others, indicating that the common-variant signals identified by GWASs are likely to be the result of widely distributed causal alleles that are of relatively high frequency. This is an important observation because it indicates that most of the GWAS-identified associations for T2D reflect high LD with a causal variant that has a small effect size rather than low LD with a causal variant that has a large effect size. The largest common-variant signal identified for T2D remains TCF7L2 (MIM 602228) (detected just prior to the GWAS era30), which has a per-allele odds ratio (OR) of around 1.35. The remaining signals detected by GWAS have allelic ORs in the range between 1.05 and 1.25. Collectively, the most strongly associated variants at these loci are estimated to explain around 10% of familial aggregation of T2D in European populations. The MAGIC (Meta-Analysis of Glucose- and Insulin-Related Traits Consortium) investigators have been carrying out equivalent analyses focused on the identification of variants influencing variation in glucose and insulin levels in healthy nondiabetic individuals.31–33 Prior to the GWAS era, the only compelling association signal for fasting glucose levels was known at GCK (MIM 138079) (glucokinase),34 but GWASs in European samples (46,000 GWAS and 76,000 replication samples) have expanded that number to 16 (ref. 32). These variants explain around 10% of the inherited variation in fasting glucose levels. Only two signals (near GCKR [MIM 600842] and IGF1 [MIM 147440]) were shown to influence fasting insulin levels in the same analysis. Equivalent analyses for 2 h glucose33 (15,000 GWAS samples and up to 30,000 replication samples) identified further signals, including variants near the GIP (MIM 137240) receptor (GIPR [MIM 137241]). 
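The per-allele odds ratios quoted above for T2D can be related to familial aggregation with the standard single-locus calculation under a multiplicative model. The following is a minimal sketch, not the method used in the cited studies: the risk-allele frequencies, the number of loci, and the overall sibling relative risk of 3 are hypothetical placeholder values, and the odds ratio is treated as an approximation to the per-allele relative risk.

```python
import math

def locus_sibling_risk(p, gamma):
    """Sibling relative risk contributed by one biallelic locus.

    Multiplicative model: p is the risk-allele frequency and gamma the
    per-allele relative risk (approximated here by the odds ratio).
    """
    q = 1.0 - p
    w = p * q * (gamma - 1.0) ** 2 / (p * gamma + q) ** 2
    return (1.0 + w / 2.0) ** 2

def fraction_of_familial_risk(loci, lambda_s_total):
    """Share of the log sibling relative risk explained by independent loci."""
    log_sum = sum(math.log(locus_sibling_risk(p, g)) for p, g in loci)
    return log_sum / math.log(lambda_s_total)

# Hypothetical inputs: one TCF7L2-like locus (OR 1.35) plus 40 loci with
# OR 1.10, all at an assumed risk-allele frequency of 0.30.
example_loci = [(0.30, 1.35)] + [(0.30, 1.10)] * 40
assumed_lambda_s = 3.0  # assumed overall sibling relative risk, not from the text

share = fraction_of_familial_risk(example_loci, assumed_lambda_s)
print(f"Share of familial aggregation explained: {share:.1%}")
```

Under these assumptions each locus adds only a small amount to the sibling relative risk, which is why many loci with ORs between 1.05 and 1.25 still account for a modest share of familial aggregation.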
Before the GWAS era, the only robust association between DNA sequence variation and either BMI or weight involved low-frequency variants in MC4R (MIM 155541).35 Now, there are more than 30. In the most recent study from the GIANT consortium,36 these analyses extended to almost 250,000 samples, half of them in the stage 1 GWAS, the remainder for replication. The largest signal remains that at FTO (MIM 610966),37 where the

Table 1. Population Variation Explained by GWAS for a Selected Number of Complex Traits
Columns: Trait or Disease; h² Pedigree Studies; h² GWAS Hits (a); h² All GWAS SNPs (b). Bracketed numbers are reference citations; a dash marks a blank cell.
Type 1 diabetes; 0.9 [98]; 0.6 [99] (c); 0.3 [12]
Type 2 diabetes; 0.3–0.6 [100]; 0.05–0.10 [34]; -
Obesity (BMI); 0.4–0.6 [101,102]; 0.01–0.02 [36]; 0.2 [14]
Crohn’s disease; 0.6–0.8 [103]; 0.1 [11]; 0.4 [12]
Ulcerative colitis; 0.5 [103]; 0.05 [12]; -
Multiple sclerosis; 0.3–0.8 [104]; 0.1 [45]; -
Ankylosing spondylitis; >0.90 [105]; 0.2 [106]; -
Rheumatoid arthritis; 0.6 [107]; -; -
Schizophrenia; 0.7–0.8 [108]; 0.01 [79]; 0.3 [109]
Bipolar disorder; 0.6–0.7 [108]; 0.02 [79]; 0.4 [12]
Breast cancer; 0.3 [110]; 0.08 [111]; -
Von Willebrand factor; 0.66–0.75 [112,113]; 0.13 [114]; 0.25 [14]
Height; 0.8 [115,116]; 0.1 [13]; 0.5 [13,14]
Bone mineral density; 0.6–0.8 [117]; 0.05 [118]; -
QT interval; 0.37–0.60 [119,120]; 0.07 [121]; 0.2 [14]
HDL cholesterol; 0.5 [122]; 0.1 [57]; -
Platelet count; 0.8 [123]; 0.05–0.1 [58]; -
(a) Proportion of phenotypic variance or variance in liability explained by genome-wide-significant and validated SNPs. For a number of diseases, other parameters were reported, and these were converted and approximated to the scale of total variation explained. Blank cells indicate that these parameters have not been reported in the literature.
(b) Proportion of phenotypic variance or variance in liability explained when all GWAS SNPs are considered simultaneously. Blank cells indicate that these parameters have not been reported in the literature.
(c) Includes pre-GWAS loci with large effects.
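The statement that roughly one-third to one-half of additive genetic variation is tagged by all GWAS SNPs can be checked against the rows of Table 1 that report both a pedigree estimate and an all-SNP estimate. The sketch below is illustrative: the values are transcribed from Table 1, and using the midpoint of each pedigree range as the denominator is an assumption made here for simplicity, not a choice made by the authors.

```python
# (pedigree h2 range, h2 explained by all GWAS SNPs), transcribed from Table 1.
TABLE1 = {
    "Type 1 diabetes": ((0.9, 0.9), 0.3),
    "Obesity (BMI)": ((0.4, 0.6), 0.2),
    "Crohn's disease": ((0.6, 0.8), 0.4),
    "Schizophrenia": ((0.7, 0.8), 0.3),
    "Bipolar disorder": ((0.6, 0.7), 0.4),
    "Von Willebrand factor": ((0.66, 0.75), 0.25),
    "Height": ((0.8, 0.8), 0.5),
    "QT interval": ((0.37, 0.60), 0.2),
}

def tagged_fraction(pedigree_range, h2_all_snps):
    """Fraction of pedigree heritability tagged by all SNPs.

    Assumption: the midpoint of the reported pedigree range is used
    as the denominator.
    """
    low, high = pedigree_range
    return h2_all_snps / ((low + high) / 2.0)

fractions = {
    trait: tagged_fraction(ped, snp) for trait, (ped, snp) in TABLE1.items()
}

for trait, frac in sorted(fractions.items(), key=lambda item: item[1]):
    print(f"{trait:24s} {frac:.2f}")
```

With these inputs every ratio falls between about one-third and two-thirds, broadly in line with the range quoted in the text.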
  • 61. average between-homozygotes difference in weight is around 2.5 kg. The effects at other loci are smaller, and in combination, these variants explain no more than 1%–2% of overall variation in adult BMI (although this percentage rises to almost 20% if the analysis is extended to all GWA variants, not just those that reach genome-wide significance14). As well as these studies of BMI and obesity in population samples, there have been several studies focused on extreme obesity phenotypes.38,39 The genome-wide-significant loci thrown up by these efforts only partially overlap with those emerging from population-based studies, raising the possibility that some of

Table 2. Summary of GWAS Findings for Seven Autoimmune Diseases (a)
Each row gives the disease, the number of loci (and the loci) known prior to 2007, and the number of loci (with some or all of the loci) identified from 2007 onward.
Ankylosing spondylitis; prior to 2007: 1 (HLA-B27); 2007 onward: 13 (IL23R, ERAP1, 2p15, 21q22, CARD9 (MIM 607212), IL12B (MIM 161561), PTGER4 (MIM 601586), IL1R2 (MIM 147811), TNFR1, TBKBP1 (MIM 608476), ANTXR2 (MIM 608041), RUNX3 (MIM 600210), KIF21B (MIM 608322))
Rheumatoid arthritis; prior to 2007: 3 (HLA-DRB1, PADI4, CTLA4); 2007 onward: 30 (AFF3 (MIM 601464), BLK, CCL21 (MIM 602737), CD2/CD58 (MIM 186990/153420), CD28, CD40, FCGR2A (MIM 146790), HLA-DRB1, IL2/IL21 (MIM 147680/605384), IL2RA, IL2RB (MIM 146710), KIF5A/PIP4K2C, PRDM1 (MIM 603423), PRKCQ (MIM 600448), PTPRC (MIM 151460), REL (MIM 164910), STAT4 (MIM 600558), TAGAP, TNFAIP3, TNFRSF14, TRAF1/C5 (MIM 120900/601711), TRAF6 (MIM 602355), IL6ST (MIM 600694), SPRED2 (MIM 609292), RBPJ (MIM 147183), CCR6 (MIM 601835), IRF5 (MIM 607218), PXK (MIM 611450))
Systemic lupus erythematosus; prior to 2007: 3 (HLA, PTPN22, IRF5 (MIM 607218)); 2007 onward: 31 (BANK1 (MIM 610292), BLK (MIM 191305), C1q, C2 (MIM 613927), C4A/B (MIM 120820/120810), CRP (MIM 123260), ETS1 (MIM 164720), FcGR2A–FcGR3A (MIM 146790/146740), FcGR3B (MIM 610665), HIC2-UBE2L3 (MIM 607712/603721), IKZF1 (MIM 603023), IL10 (MIM 124092), IRAK1 (MIM 300283), ITGAM–ITGAX (MIM 120980/151510), JAZF1, KIAA1542/PHRF1, LRRC18-WDFY4, LYN (MIM 165120), NMNAT2 (MIM 608701), PRDM1 (MIM 603423), PTTG1 (MIM 604147), PXK (MIM 611450), RASGRP3 (MIM 609531), SLC15A4, STAT1 (MIM 600555), TNFAIP3, TNFSF4 (MIM 603594), TNIP1 (MIM 607714), TREX1 (MIM 606609), UHRF1BP1, XKR6)
Type 1 diabetes; prior to 2007: 4 (HLA, INS (MIM 176730), PTPN22, CTLA4); 2007 onward: 40 (RGS1, IL18RAP (MIM 604509), IFIH1 (MIM 606951), CCR5 (MIM 601373), IL2 (MIM 147680), IL7R, MHC, BACH2 (MIM 605394), TNFAIP3, TAGAP, IL2RA, PRKCQ (MIM 600448), INS (MIM 176730), ERBB3 (MIM 190151), 12q13.3, SH2B3 (MIM 605093), CTSH (MIM 116820), CLEC16A (MIM 611303), PTPN2 (MIM 176887), CD226 (MIM 605397), UBASH3A (MIM 605736), C1QTNF6, IL10 (MIM 124092), 4p15.2, C6orf173, 7p15.2, COBL (MIM 610317), GLIS3 (MIM 610192), C10orf59, CD69 (MIM 107273), 14q24.1, 14q32.2, IL27 (MIM 608273), 16q23.1, ORMDL3 (MIM 610075), 17q21.2, 19q13.32, 20p13, 22q12.2, Xq28)
Multiple sclerosis; prior to 2007: 1 (HLA); 2007 onward: 52 (BACH2 (MIM 605394), BATF (MIM 612476), CBLB, CD40, CD58, CD6 (MIM 186720), CD86, CLEC16A (MIM 611303), CLECL1, CYP24A1, CYP27B1, DKKL1 (MIM 605418), EOMES (MIM 604615), EVI5 (MIM 602942), GALC (MIM 606890), HHEX (MIM 604420), IL12A, IL12B, IL22RA2, IL2RA, IL7, IL7R, IRF8, KIF21B (MIM 608322), MALT1, MAPK1 (MIM 176948), MERTK (MIM 604705), MMEL1, MPHOSPH9 (MIM 605501), MPV17L2, MYB (MIM 189990), MYC (MIM 190080), OLIG3 (MIM 609323), PLEK (MIM 173570), PTGER4 (MIM 601586), PVT1 (MIM 165140), RGS1, SCO2 (MIM 604272), SP140 (MIM 608602), STAT3, TAGAP, THEMIS (MIM 613607), TMEM39A, TNFRSF1A, TNFSF14 (MIM 604520), TYK2, VCAM1, ZFP36L1 (MIM 601064), ZMIZ1 (MIM 607159), ZNF767)
Crohn’s disease; prior to 2007: 4 (NOD2 (MIM 605956), IBD5 (MIM 606348), DRB1*0103, IL23R); 2007 onward: 67 (SMAD3 (MIM 603109), ERAP2 (MIM 609497), IL10 (MIM 124092), IL2RA, TYK2, FUT2 (MIM 182100), DNMT3A (MIM 602769), DENND1B (MIM 613292), BACH2 (MIM 605394), ATG16L1 (MIM 610767))
Ulcerative colitis; prior to 2007: 3 (DRB1*1502, DRB1*0103, IL23R); 2007 onward: 44 (IL1R2 (MIM 147811), IL8RA-IL8RB, IL7R, IL12B, DAP (MIM 600954), PRDM1 (MIM 603423), JAK2 (MIM 147796), IRF5 (MIM 607218), GNA12 (MIM 604394), LSP1 (MIM 153432), ATG16L1 (MIM 610767))
Total; prior to 2007: 19; 2007 onward: 277
(a) The names of the loci are signposts and do not indicate that these loci are necessarily biologically relevant. A number of associated variants are distant from protein-coding genes.
  • 62. the most extreme cases of obesity are driven by highly penetrant, low-frequency variants. Variation at copy-number variants (CNVs) has some impact on BMI. This is true of common CNVs (the NEGR1 association seems likely to be driven by a common CNV40) and also rarer CNVs for which evidence is starting to accumulate (e.g., 16p CNV and effect on morbid obesity and developmental delay41). The adverse metabolic effects of obesity depend not only on the overall level of adiposity but also on the distribution of fat around the body; visceral (abdominal) fat has particularly adverse consequences for overall health. GWASs of fat-distribution phenotypes (including waist circumference, waist:hip ratio, and body-fat percentage studied in close to 200,000 individuals) have revealed almost 20 loci with genome-wide significance40,42–44 and relatively little overlap with those loci influencing overall adiposity. As with BMI, the proportion of variance explained by these loci is small (around 1% after adjustment for BMI, age, and sex).
New Biology Arising from GWAS Discoveries
Autoimmune Diseases
Thus far nearly all genes associated with MS have been involved in autoimmune pathways rather than in neurologic degenerative diseases.45 Indeed, of the two MS-associated genes involved in neurodegeneration, one (KIF21B) is also associated with AS and CD, suggesting that it is actually an autoimmunity gene. 
The genes involved in MS include genes coding for components of the cytokine pathway (CXCR5 [MIM 601613], IL2RA [MIM 147730], IL7R [MIM 146661], IL7 [MIM 146660], IL12RB1 [MIM 601604], IL22RA2 [MIM 606648], IL12A [MIM 161560], IL12B [MIM 161561], IRF8 [MIM 601565], TNFRSF1A [MIM 191190], TNFRSF14 [MIM 602746], and TNFSF14 [MIM 604520]), costimulatory molecules (CD37 [MIM 151523], CD40, CD58 [MIM 153420], CD80 [MIM 112203], CD86 [MIM 601020], and CLECL1 [MIM 607467]), and signal-transduction molecules of immunological relevance (CBLB [MIM 604491], GPR65 [MIM 604620], MALT1 [MIM 604860], RGS1 [MIM 600323], STAT3 [MIM 102582], TAGAP [MIM 609667], and TYK2 [MIM 176941]). Interestingly, these genes mainly implicate T-helper cells in MS pathogenesis. Genetic findings have had a major impact on AS research and therapeutics. The associations of the genes IL23R (MIM 607562)46 and IL12B19 have pointed to the involvement of the IL-23R pathway, and hence IL-17-producing

Table 3. Summary of GWAS Findings for Metabolic Traits (a)
Each row gives the trait, the number of loci (and the loci) known prior to 2007, and the number of loci (with some or all of the loci) identified from 2007 onward.
Type 2 diabetes; prior to 2007: 3 (PPARG, KCNJ11 (MIM 600937), TCF7L2); 2007 onward: 50 (NOTCH2 (MIM 600275), PROX1 (MIM 601546), GCKR, THADA (MIM 611800), BCL11A (MIM 606557), RBMS1 (MIM 602310), IRS1, ADAMTS9, ADCY5 (MIM 600293), IGF2BP2 (MIM 608289), WFS1, ZBED3, CDKAL1, DGKB (MIM 604070), JAZF1, GCK, KLF14, TP53INP1 (MIM 606185), SLC30A8 (MIM 611145), PTPRD (MIM 601598), CDKN2A, CHCHD9, CDC123, HHEX (MIM 604420), DUSP8 (MIM 602038), KCNQ1, CENTD2, MTNR1B, HMGA2 (MIM 600698), TSPAN8 (MIM 600769), HNF1A, ZFAND6 (MIM 610183), PRC1 (MIM 603484), FTO, SRR (MIM 606477), HNF1B (MIM 189907), DUSP9 (MIM 300134), CDCD4A, UBE2E2 (MIM 602163), GRB14 (MIM 601524), ST6GAL1 (MIM 109675), VPS26A (MIM 605506), HMG20A (MIM 605534), AP3S2 (MIM 602416), HNF4A (MIM 600281), SPRY2 (MIM 602466))
Body-mass index; prior to 2007: 1 (MC4R); 2007 onward: 30 (NEGR1 (MIM 613173), TNNI3K (MIM 613932), PTBP2 (MIM 608449), TMEM18 (MIM 613220), POMC, FANCL (MIM 608111), LRP1B (MIM 608766), CADM2 (MIM 609938), ETV5 (MIM 601600), GNPDA2 (MIM 613222), SLC39A8 (MIM 608732), HMGCR (MIM 142910), PCSK1, ZNF608, NCR3 (MIM 611550), HMGA1 (MIM 600701), LRRN6C, TUB (MIM 601197), BDNF, MTCH2 (MIM 613221), FAIM3 (MIM 606015), MTIF3, PRKD1 (MIM 605435), MAP2K5 (MIM 602520), FTO, SH2B1, GPRC5B (MIM 605948), KCTD15, GIPR, TMEM160)
Glucose or insulin; prior to 2007: 1 (GCK); 2007 onward: 15 (GCKR, G6PC2, IGF1, ADCY5 (MIM 600293), MADD (MIM 603584), ADRA2A, CRY2 (MIM 603732), FADS1 (MIM 606148), GLIS3 (MIM 610192), SLC2A2, PROX1 (MIM 601546), C2CD4B (MIM 610344), DGKB (MIM 604070), GIPR, VPS13C (MIM 608879))
Fat distribution; prior to 2007: 0; 2007 onward: 20 (TBX15 (MIM 604127), LYPLAL1, IRS1, SPRY2 (MIM 602466), GRB14 (MIM 601524), STAB1 (MIM 608560), ADAMTS9, CPEB4 (MIM 610607), VEGFA (MIM 192240), TFAP2B (MIM 601601), LY86 (MIM 605241), RSPO3 (MIM 610574), NFE2L3 (MIM 604135), MSRA (MIM 601250), ITPR2 (MIM 600144), HOXC13 (MIM 142976), NRXN3 (MIM 600567), ZNRF3 (MIM 612062), PIGC (MIM 601730))
Total; prior to 2007: 5; 2007 onward: 107
(a) The names of the loci are signposts and do not indicate that these loci are necessarily biologically relevant. A number of associated variants are distant from protein-coding genes.
  • 63. proinflammatory cell populations, in the aetiopathogenesis of AS. The involvement of this pathway in AS was not considered until the genetic discoveries were reported. The recent demonstration that ERAP1 (MIM 606832) polymorphisms are associated with HLA-B27-positive but not HLA-B27-negative AS has shed important light on research into the mechanism by which HLA-B27 induces AS; this mechanism has remained an enigma since the discovery of the association of HLA-B27 with AS in the early 1970s. ERAP1 is involved in peptide processing before HLA class I molecule presentation; the restriction of the association of ERAP1 variants to HLA-B27-positive disease indicates that HLA-B27 operates to cause AS by a mechanism that involves peptide presentation. Protective variants of ERAP1 have been shown to have lower peptide-processing capacity and thus to reduce the amount of peptide available to HLA-B27.47 Thus HLA-B27 is more likely to cause AS when it is processing more peptides. The finding that PADI4 (MIM 605347) is associated with RA focused research interest on the role of anti-citrullinated peptide antibodies (ACPAs) and disease.48 PADI4 is involved in the citrullination of peptides against which ACPAs develop. The association of PADI4 variants with RA therefore indicated that ACPAs are directly involved in RA pathogenesis, not an indirect manifestation of immune dysregulation in the disease. Subsequently, it was discovered that the association of HLA-DRB1 (MIM 142857) with RA was restricted to ACPA-positive disease and that there was a strong gene-environment interaction, such that cigarette smoking increases the risk of ACPA-positive but not ACPA-negative RA.49 Because ACPA-positive disease is more severe than ACPA-negative disease and has a greater propensity toward joint-damaging erosion, this provided further evidence supporting public-health measures against cigarette smoking. 
The genetic loci identified for IBD through GWASs have highlighted a number of pathways, including antibacterial autophagy and signaling pathways (e.g., IL-10 signaling, T-cell-negative regulators, and pathways involving B cells and innate sensors).18 Some of these pathways were previously not suspected to be important for these diseases. The roles of a number of pathways, for example the IL-23R pathway, the autophagy pathway, and innate immunity, have all come from hypothesis-generating genetics research, not from immunology or hypothesis-driven research. Similar advances could be described for many other autoimmune diseases but are beyond the scope of this review.
Metabolic Traits
Most loci affecting T2D and fasting glucose levels map to regulatory sequences, and in many cases, the ‘‘causal’’ transcript, i.e., the transcript responsible for mediating the effect of the associated variants, is not yet known. At other loci, a combination of coding variants, strong biological candidates, and/or cis expression QTL data has defined the transcript through which the effect is mediated (HNF1A [MIM 142410], GCK, IRS1 [MIM 147545], WFS1 [MIM 606201], PPARG [MIM 601487], CAMK1D [MIM 607957], JAZF1 [MIM 606246], KLF14 [MIM 609393], and others) as a first step to inferring biology.50 Some of these stories are now starting to be fleshed out into biological mechanisms (e.g., KLF1451). There is incomplete overlap with the loci influencing physiological variation in glucose and insulin. Some loci (e.g., MTNR1B [MIM 600804]) have a relatively large effect on both, whereas others (e.g., G6PC2 [MIM 608058]) influence fasting glucose levels but have a minimal effect on T2D risk. Still others (e.g., CDKN2A and CDKN2B [MIM 600160 and 600431]) impact T2D and have surprisingly modest effects on fasting glucose levels in healthy, nondiabetic individuals.32,33,50 
Most of these loci appear to have their primary effect on the function of beta cells rather than on insulin resistance, highlighting the importance of the former with respect to normal and abnormal glucose homeostasis.50 Of the subset of loci (including PPARG, KLF14, and ADAMTS9 [MIM 605421]) shown to influence T2D risk through a primary effect on insulin resistance, only FTO seems to act primarily through an effect on obesity.50 Several of the T2D loci overlap genes that are known to harbor rare variants responsible for penetrant, monogenic forms of diabetes (such genes include KCNQ1 [MIM 607542], PPARG, HNF1A, GCK, and WFS1), indicating that multiple causal variants at the same locus segregate in the population at different frequencies. There is overlap between signals influencing T2D risk and those influencing body weight (CDKAL1 [MIM 611259] and ADCY5 [MIM 600293]), indicating that some of the observed epidemiological associations between these traits are attributable to shared susceptibility variants.52 Whereas many of the fasting-glucose and fasting-insulin signals map near strong biological candidates for relevant traits (such candidate genes include IRS1, IGF1, ADRA2A [MIM 104210], SLC2A2 [MIM 138160], GCK, and GCKR) and fit within established models of our understanding of islet biology, this is far from the case with the loci identified for T2D. Efforts to demonstrate that the genes mapping close to T2D risk loci are enriched for particular pathways or processes have met with only limited success; the most robust finding yet has been in relation to cell-cycle regulation (and was consistent with a model in which the regulation of islet mass is a key component of risk50). Either T2D is especially heterogeneous or else key aspects of its pathophysiology are as yet poorly codified in existing databases. 
As for T2D and fasting glucose, most of the signals for obesity and fat distribution map to regulatory signals, and the causal transcript is known at only a minority of the loci. Signals influencing BMI appear to be enriched for genes implicated in neuronal processes, whereas those influencing fat distribution seem to be more closely related to adipose development.36,43 Overlap with signals and genes implicated in more severe forms of disease (morbid obesity,
  • 64. lipodystrophy) is seen at some loci (PCSK1 [MIM 162150], POMC [MIM 176830], BDNF [MIM 113505], MC4R, and SH2B1 [MIM 608937]) but is far from complete (some loci implicated in extreme obesity case-control studies show no association with BMI at the population level36). The strongest signal for overall adiposity is the one mapping to FTO.37 FTO is thought to be a DNA demethylase,53 but its function is poorly understood. Murine models demonstrate that modulation of Fto expression is associated with changes in body weight,54–56 but no direct evidence linking coding variants in FTO in humans to body-weight variation has been demonstrated. For the time being, FTO remains the strongest candidate, but the role of other genes (e.g., RPGRIP1L [MIM 610937]) in the region cannot be discounted. This example demonstrates the difficulties that remain in relating GWAS signals to downstream biology. Fat distribution is a strongly gender-dimorphic phenotype, and many of the signals associated with fat distribution seem to have a selective effect on this phenotype in women.43
Quantitative Traits
In addition to having been performed on the quantitative traits discussed previously (e.g., BMI and fasting-glucose and -insulin levels), GWASs have been done on a number of quantitative risk factors for disease and for traits that are models for the genetic architecture of complex traits. For bone mineral density (BMD), a risk factor for osteoporotic fracture, a total of 34 loci, together explaining ~5% of narrow-sense heritability, have been identified (Estrada et al., abstract presented at the American Society for Bone and Mineral Research 2010 Annual Meeting, published in J. Bone Miner. Res. 25 [Suppl S1], p. 1243). Among these genes, there is a major over-representation of genes in the Wnt-signaling pathway, which was first implicated in osteoporosis (MIM 166710) from studies in families with high or low BMD phenotypes. 
Many other examples exist in osteoporosis and other human diseases in which GWASs have demonstrated that more-prevalent but less-severe variants in genes initially identified from studies of severe familial diseases are important in the risk of disease in the general population. For human height, a combined discovery and validation cohort of ~180,000 samples identified 180 robustly associated loci, many in meaningful biological pathways and with evidence for multiple segregating variants at the same loci.13 Together these loci explain approximately 12%–14% of additive genetic variation (~10% of phenotypic variation). A meta-analysis of more than 100,000 individuals of European ancestry detected a total of 95 loci significantly associated with plasma concentrations of cholesterol and triglycerides, known risk factors for coronary artery disease,57 and it provided evidence that the GWAS loci were of biological and clinical relevance. A meta-analysis from the HaemGen consortium on platelet count and platelet volume, which are endophenotypes for myocardial infarction (MIM 608446), discovered 68 loci.58 When the genes of a number of these loci were silenced in Drosophila, 11 showed a clear platelet phenotype. These genes are previously unknown regulators of blood cell formation. The identification of so many loci has uncovered new gene functions in megakaryopoiesis and platelet formation. That is, new biology has resulted directly from the identification of SNPs that are associated with variation in platelet phenotypes. Across these quantitative traits, a number of loci discovered through GWASs were known to be a mutational target for those traits because Mendelian forms with extreme phenotypes existed. 
Taken together, the inference from quantitative traits in terms of the (large) number of loci involved, the allelic frequency spectrum of associated variants, and the nature of the candidate genes suggests that models arising from quantitative traits appropriately reflect the genetic architecture of disease and reinforces the emerging evidence that it is the cumulative effect of many loci that underlies susceptibility to disease.
From GWAS to Translation: Clinical Relevance
Autoimmune Diseases
Many of the MS-associated genes discovered by GWASs represent excellent potential therapeutic targets. Of particular note is the identification of two genes involved in vitamin D metabolism (CYP27B1 [MIM 609506] and CYP24A1 [MIM 126065]). This identification might help to explain the latitudinal variation in MS incidence, i.e., higher MS prevalence at more extreme latitudes is most likely due to higher rates of vitamin D deficiency. Two other identified genes are already targets of MS therapies, highlighting the relevance of the findings to the disease pathogenesis (natalizumab targets VCAM1 [MIM 192225], and daclizumab targets IL2RA). The findings for AS have stimulated the trial of therapies against identified pathways. Anti-IL-17 treatment has been shown in a phase 2 trial to have efficacy equivalent to that of the current gold-standard treatment, TNF inhibition, in the treatment of AS. The relevance of the RA-related genetic findings to therapeutic development is highlighted by the fact that some existing therapies already target genes or gene pathways highlighted by the genetic associations with RA; such therapies include those involving TNF inhibitors (e.g., infliximab) and co-stimulation inhibitors (e.g., abatacept). Abatacept is a fusion protein of CTLA-4 and immunoglobulin. It acts by preventing costimulation of T-helper cells by the binding of the T cell’s CD28 protein to the B7 protein on the antigen-presenting cell. 
CTLA4 (MIM 123890) and CD28 (MIM 186760) polymorphisms are associated with RA. The RA-associated genes include many involved in the NF-κB signaling pathway and place this pathway at the center of RA pathogenesis. As in MS, mouse research prior to the genetic discoveries had implicated the IL-23-dependent Th17-lymphocyte pathway in RA pathogenesis. To date there has been very little genetic support for this with regard to human diseases, in contrast to the situation in seronegative
  • 65. diseases such as AS, psoriasis, and IBD, where strong genetic associations exist and treatments targeting the pathway are in clinical use.
Metabolic Diseases
The main relevance of GWASs lies in the insights into disease biology (see above) and the potential for clinical translation through novel approaches to the diagnosis, prevention, treatment, and monitoring of disease. This will take some time, in particular given that most GWAS discoveries were made in the last few years. The predictive power of disease risk ascertained from genetic data remains poor because for most diseases only a small proportion of additive genetic variation has been accounted for. Although for T2D it is possible to identify individuals who are at the extremes of the genotype risk score distribution and who differ appreciably in T2D risk (they have twice or half the average risk for the upper and lower 1%–2%, respectively), many of these would already be identifiable on the basis of classical risk factors. In fact, in receiver operating characteristic (ROC) analyses, BMI and age do a far better job of discrimination than the genetic variants so far discovered.59 This may change as low-frequency and rare causal alleles are found. Although individual prediction is not yet practical with the variants at hand, it should be possible to identify groups of individuals who are at a substantially greater-than-average risk for diabetes, and this might be of value, for example, with respect to clinical-trial enrichment. One obvious route to early translation involves the identification of diagnostic biomarkers on the basis of the processes that have been uncovered. These may have predictive impact well beyond the genetic variants that led to their discovery. 
This was recently demonstrated by a GWAS of C-reactive protein (CRP) levels; that study found that common variants near the HNF1A gene were associated with variation in CRP.60 The authors asked whether rare HNF1A mutations that are causal for the Mendelian MODY (MIM 606391) subtype of diabetes are also associated with differences in CRP levels and whether it would be possible to use CRP levels as a diagnostic marker to help identify individuals who have early-onset diabetes and who are likely to have HNF1A-MODY (and to direct those individuals to sequence-based diagnostics). They were able to show marked differences in CRP levels between HNF1A-MODY and other types of diabetes and demonstrated that diagnosis based on CRP levels has a discriminative accuracy of more than 80% for this diagnostic classification.61,62 Otherwise, GWAS findings have as yet had no impact on therapeutic optimization. Recent studies have identified variants that influence therapeutic response to metformin63 and might herald better understanding of how these drugs work.
New Science Facilitated by GWASs
Although the GWAS approach was designed for the detection of associations between DNA markers and disease, as a by-product such studies have generated new scientific discoveries. A detailed description and discussion is outside the scope of this review, and we highlight only a few of these advances: the discovery of genes affecting genetic recombination and their correlation with natural selection64–66 and new insight into human population structure and evolution.67–73
Interpretation of GWAS Results
GWASs conducted in the last five years were designed and powered to detect associations through LD between genotyped (or imputed) common SNP markers and unknown causal variants. What do the results imply in terms of variance explained in the population, common versus rare variants underlying complex traits, and the nature of complex-trait variation and evolution? 
It is too early to be able to quantify the joint distribution of risk-allele frequencies and their effect sizes because there are very few causal variants identified by GWAS and because systematic study of rare variants (through exome or whole-genome sequencing) is in an early stage. To understand the allelic spectrum of risk variants and thereby inform optimal design of experiments aiming to detect causal variants, one must differentiate between two explanations for observed associations between genotyped common SNPs and disease: the association can be caused by one or more causal variants that have large effect sizes and are in low LD with the genotyped SNPs, or it can be caused by causal variants that have small effects and are in high LD with the genotyped SNPs. Low LD occurs when the allele frequencies of the unknown causal variants and those at the genotyped SNPs are very different from each other, for example when the allele frequency of causal variants is much lower than that of the SNPs. For a single robustly associated SNP in a homogeneous population, we cannot distinguish between the hypotheses that the association signal is caused by a rare variant of large effect or a common variant with small effect. However, variants at multiple loci and GWASs in other ethnic populations help to narrow the boundaries of the genetic architecture of diseases. At this point in time, we can conclude that
(1) Many loci contribute to complex-trait variation (e.g., Figure 2).
(2) At a number of identified risk loci, there are multiple alleles associated with disease at a wide range of frequencies.
(3) There is evidence for pleiotropy, i.e., that the same variants are associated with multiple traits.66,74,75
(4) A number of variants associated with disease or complex traits in one ethnic population are also associated with the same disease or traits in other populations (see above for T2D examples). 
(5) The hypothesis76 that the causal variants leading to the association between common SNPs and disease are mostly rare (say, have an allele frequency of 1% or lower) is not consistent with theoretical and empirical results.77,78 In particular, there is no widespread evidence for the existence of “synthetic associations” (see Box 3). Numerically, we expect that most causal variants that segregate in the population are rare, consistent with evolutionary theory, but the proportion of genetic variation that these variants cumulatively explain depends on their correlation with fitness.79
(6) A surprisingly large proportion of additive genetic variation is tagged when all SNPs are considered simultaneously.12–14

The Cost of GWASs
If we assume that the GWAS results from Figure 1 represent a total of 500,000 SNP chips and that on average a chip costs $500, then this is a total investment of $250 million. If there are a total of ~2,000 loci detected across all traits, then this implies an investment of $125,000 per discovered locus. Is that a good investment? We think so: the total amount of money spent on candidate-gene studies and linkage analyses in the 1990s and 2000s probably exceeds $250 million, and in total these have had little to show for it. It is also worthwhile to put these amounts in context. $250 million is on the order of the cost of one or two stealth fighter jets and much less than the cost of a single navy submarine. It is a fraction of the ~$9 billion cost of the Large Hadron Collider. It would also pay for about 100 R01 grants. Would those 100 non-funded R01 grants have made breakthrough discoveries in biology and medicine? We simply can’t answer this question, but we can conclude that a tremendous number of genuinely new discoveries have been made in a period of only five years.

Concluding Comments
In this review we have attempted to summarize the tremendous quality and quantity of discoveries that have been made by GWASs in the last five years.
Because of space limitations, we have been able to discuss only a subset of diseases and have not mentioned the discoveries made in common cancers, pediatric diseases, and ophthalmological diseases, to name but a few. We now return to the perceived failure of GWASs as summarized in the introductory section:
(1) Is the GWAS approach founded on a flawed assumption that genetics plays an important role in the risk for common diseases? Pedigree studies, including those involving twins, suggest that a substantial proportion of variation in susceptibility for common disease is due to genetic factors. The proportion of total variation explained by genome-wide-significant variants has reached 10%–20% for a number of diseases, and clearly there are additional variants with effect sizes so small that they have not been detected at stringent significance thresholds. As reviewed here, many of the detected loci are in biologically meaningful pathways for the diseases investigated. Whole-genome analyses involving GWAS data have estimated that 20%–50% of phenotypic variation is captured when all SNPs are considered simultaneously for a number of complex diseases and traits. These estimates are based on population-wide studies and provide a lower limit of the total proportion of phenotypic variation due to genetic factors. Inference from GWASs is independent of inference drawn from close relatives (pedigree/family studies), and therefore these studies have provided independent evidence for the role of genetics in common diseases.
(2) Have GWASs been disappointing in not explaining more genetic variation in the population? This criticism implies that the aim of GWASs is to explain all genetic variation. This is a misrepresentation of the objective of GWASs. As was the aim of linkage studies in pedigrees for complex diseases prior to the GWAS era, the aim of GWASs is to detect loci that are associated with complex traits. The detection of such loci has led to the discovery of new biological knowledge about disease, knowledge that was absent only five years ago. But even ignoring the aim of GWASs, for a number of complex traits the proportion of genetic variation uncovered by GWASs is actually substantial. For example, for T2D, MS, and CD, approximately 10%, 20%, and 20%, respectively, of genetic variation in the population has been accounted for. Apart from diseases with a known major locus (which is usually the major histocompatibility locus), the baseline of variation explained five years ago was essentially zero.
(3) Have GWASs delivered meaningful biologically relevant knowledge or results of clinical or any other utility? As we have highlighted in this review, the answer to this question is a definite “yes.” For example, the discoveries of the importance of the autophagy pathway in Crohn disease, the IL-23R pathway in rheumatoid arthritis, and factor H in age-related macular degeneration (MIM 610149)9 have given important biological insight with direct clinical relevance. Hunter and Kraft put it this way back in 2007: “There have been few, if any, similar bursts of discovery in the history of medical research.”80
(4) Are GWAS results spurious? The combination of large sample sizes and stringent significance testing has led to a large number of robust and replicable associations between complex traits and genetic variants, many of which are in meaningful biological pathways. A number of variants or different variants at the same loci have been shown to be associated with the same trait in different ethnic populations, and some loci are even replicated across species.81 The combination of multiple variants with small effect sizes has been shown to predict disease status or phenotype in independent samples from the same population. Clearly, these results are not consistent with flawed inferences from GWASs.

Box 3. Synthetic Associations
Dickson and colleagues suggested that the observed association between a common SNP and a complex trait might result when one or more rare variants at the locus are in LD with that SNP.76,93 Because the properties of LD prevent common SNP alleles and rare causal variants from being highly correlated,84 the hypothesis of “synthetic” associations implies that the effect sizes of the causal variants are much larger than the effect size observed at the common SNP and suggests that (re)sequencing studies might detect such variants. The hypothesis is not about whether GWASs work as an experimental design but about the likely interpretation of GWAS hits in terms of the allelic spectrum of causal risk variants.
Are empirical data consistent with this hypothesis? Several lines of evidence suggest that associations observed with common SNPs are rarely due to synthetic associations with rare variants. First, because the squared LD correlation (r2) between common and rare variants is so low (typically 0.01–0.02), synthetic associations imply that the variance explained by the causal variants at the locus is 50–100 times larger than the variance explained at the genotyped SNP.78 So, if the SNP explains 0.1% of phenotypic variation in the population, the causal variant would explain 5%–10%. But as shown in this review, for many complex traits and diseases tens to hundreds of common variants are identified, and so their combined effects would explain too much variation if synthetic associations were the norm. Second, empirical data from (re)sequencing studies and trans-ethnic mapping suggest that both common and rare variants contribute to disease risk.77 At most loci detected by GWASs, there is no evidence (despite extensive genotyping and/or re-sequencing) that the common-variant signal is driven by low-frequency or rarer variants. Where rare risk alleles are uncovered at the same loci, they seem much more likely to be independent signals.94–96
Together these observations point to a highly polygenic model of disease susceptibility with causal variants across the entire range of the allele-frequency spectrum. By “polygenic,” we mean that segregating variants at many genomic loci (tens, hundreds, or even thousands) contribute to genetic variation for susceptibility in the population. The observations imply that, for most common complex diseases, nearly everyone in the population carries some risk alleles and that affected individuals are likely to have a different portfolio of risk alleles.79 They also imply that any single risk allele is neither necessary nor sufficient to cause disease. For the etiology of disease, these observations provide empirical evidence to support a threshold or burden model involving multiple variants and environmental factors, and they appear to be inconsistent with a single cause (e.g., a single mutation). A rare-variant only model of disease, characterized by locus heterogeneity and rare mutations of large effects and proposed by, for example, McClellan and King,1 is not consistent with empirical observations.77,79,97
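The variance argument in Box 3 can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the review. The function names `max_r2` and `implied_causal_variance` are hypothetical. The bound used is the standard upper limit on r2 for two biallelic loci whose positively correlated alleles have frequencies p_rare <= p_common, and the scenario of 100 associated SNPs that each explain 0.1% of variance is an assumed round number standing in for the "tens to hundreds" of loci mentioned in the box.

```python
def max_r2(p_rare: float, p_common: float) -> float:
    """Upper bound on r^2 between two biallelic loci.

    Assumes the rare causal allele (frequency p_rare) always sits on the
    background of the SNP allele with frequency p_common, so that
    D_max = p_rare * (1 - p_common).
    """
    if not 0.0 < p_rare <= p_common < 1.0:
        raise ValueError("need 0 < p_rare <= p_common < 1")
    return (p_rare * (1.0 - p_common)) / (p_common * (1.0 - p_rare))


def implied_causal_variance(snp_variance: float, r2: float) -> float:
    """Variance the causal variant must explain if the SNP only tags it with r^2."""
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r2 must be in (0, 1]")
    return snp_variance / r2


if __name__ == "__main__":
    # A 1% causal allele tagged by SNP alleles of frequency 0.5 and 0.3:
    # the bounds are about 0.010 and 0.024, the order of magnitude quoted in Box 3.
    for p_common in (0.5, 0.3):
        print(f"p_common={p_common}: max r2 = {max_r2(0.01, p_common):.4f}")

    # A SNP explaining 0.1% of variance implies 5%-10% at the causal variant.
    for r2 in (0.02, 0.01):
        share = implied_causal_variance(0.001, r2)
        print(f"r2={r2}: causal variant explains {share:.1%}")

    # Assumed scenario: 100 such SNPs, all of them synthetic associations.
    total = 100 * implied_causal_variance(0.001, 0.02)
    print(f"total variance implied across 100 loci: {total:.0%}")
```

Under these assumptions the implied total exceeds 100% of phenotypic variance, which is the contradiction the box describes when it says the combined effects "would explain too much variation."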
In conclusion, in a period of less than five years, the GWAS experimental design in human populations has led to new discoveries about genes and pathways involved in common diseases and other complex traits, has provided a wealth of new biological insights, has led to discoveries with direct clinical utility, and has facilitated basic research in human genetics and genomics. For the future, technological advances enabling the sequencing of entire genomes in large samples at affordable prices are likely to reveal additional genes, pathways, and biological insights, as well as to identify causal mutations.

Acknowledgments
We acknowledge funding from the Australian National Health and Medical Research Council (NHMRC grants 389892, 496667, 613672, 613601, and 1011506) and the Australian Research Council (ARC grant DP1093502). P.M.V. and M.A.B. are funded by NHMRC Senior Principal Research Fellowships. We thank two referees for many helpful comments.

Web Resources
The URLs for data presented herein are as follows:
Online Mendelian Inheritance in Man (OMIM), http://www.omim.org
GWAS Catalog, http://www.genome.gov/26525384

References
1. McClellan, J., and King, M.C. (2010). Genetic heterogeneity in human disease. Cell 141, 210–217.
2. Crow, T.J. (2011). ‘The missing genes: what happened to the heritability of psychiatric disorders?’. Mol. Psychiatry 16, 362–364.
3. Manolio, T.A., Collins, F.S., Cox, N.J., Goldstein, D.B., Hindorff, L.A., Hunter, D.J., McCarthy, M.I., Ramos, E.M., Cardon, L.R., Chakravarti, A., et al. (2009). Finding the missing heritability of complex diseases. Nature 461, 747–753.
4. Botstein, D., and Risch, N. (2003). Discovering genotypes underlying human phenotypes: Past successes for mendelian disease, future approaches for complex disease. Nat. Genet. Suppl. 33, 228–237.
5. Hartl, D.L., and Clark, A.G. (1997). Principles of population genetics (Sunderland: Sinauer Associates).
6. Hill, W.G., and Robertson, A. (1968). The effects of inbreeding at loci with heterozygote advantage. Genetics 60, 615–628.
7. Altshuler, D., Brooks, L.D., Chakravarti, A., Collins, F.S., Daly, M.J., and Donnelly, P.; International HapMap Consortium. (2005). A haplotype map of the human genome. Nature 437, 1299–1320.
8. Dewan, A., Liu, M., Hartman, S., Zhang, S.S., Liu, D.T., Zhao, C., Tam, P.O., Chan, W.M., Lam, D.S., Snyder, M., et al. (2006). HTRA1 promoter polymorphism in wet age-related macular degeneration. Science 314, 989–992.
9. Klein, R.J., Zeiss, C., Chew, E.Y., Tsai, J.Y., Sackler, R.S., Haynes, C., Henning, A.K., SanGiovanni, J.P., Mane, S.M., Mayne, S.T., et al. (2005). Complement factor H polymorphism in age-related macular degeneration. Science 308, 385–389.
10. Wellcome Trust Case Control Consortium. (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678.
11. Franke, A., McGovern, D.P., Barrett, J.C., Wang, K., Radford-Smith, G.L., Ahmad, T., Lees, C.W., Balschun, T., Lee, J., Roberts, R., et al. (2010). Genome-wide meta-analysis increases to 71 the number of confirmed Crohn’s disease susceptibility loci. Nat. Genet. 42, 1118–1125.
12. Anderson, C.A., Boucher, G., Lees, C.W., Franke, A., D’Amato, M., Taylor, K.D., Lee, J.C., Goyette, P., Imielinski, M., Latiano, A., et al. (2011). Meta-analysis identifies 29 additional ulcerative colitis risk loci, increasing the number of confirmed associations to 47. Nat. Genet. 43, 246–252.
13. Lango Allen, H., Estrada, K., Lettre, G., Berndt, S.I., Weedon, M.N., Rivadeneira, F., Willer, C.J., Jackson, A.U., Vedantam, S., Raychaudhuri, S., et al. (2010). Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature 467, 832–838.
14. Yang, J., Manolio, T.A., Pasquale, L.R., Boerwinkle, E., Caporaso, N., Cunningham, J.M., de Andrade, M., Feenstra, B., Feingold, E., Hayes, M.G., et al. (2011). Genome partitioning of genetic variation for complex traits using common SNPs. Nat. Genet. 43, 519–525.
15. Yang, J., Benyamin, B., McEvoy, B.P., Gordon, S., Henders, A.K., Nyholt, D.R., Madden, P.A., Heath, A.C., Martin, N.G., Montgomery, G.W., et al. (2010). Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42, 565–569.
16. Eyre-Walker, A. (2010). Evolution in health and medicine Sackler colloquium: Genetic architecture of complex traits and its implications for fitness and genome-wide association studies. Proc. Natl. Acad. Sci. USA 107 (Suppl 1), 1752–1756.
17. Pritchard, J.K. (2001). Are rare variants responsible for susceptibility to complex diseases? Am. J. Hum. Genet. 69, 124–137.
18. Khor, B., Gardet, A., and Xavier, R.J. (2011). Genetics and pathogenesis of inflammatory bowel disease. Nature 474, 307–317.
19. Danoy, P., Pryce, K., Hadler, J., Bradbury, L.A., Farrar, C., Pointon, J., Ward, M., Weisman, M., Reveille, J.D., Wordsworth, B.P., et al; Australo-Anglo-American Spondyloarthritis Consortium; Spondyloarthritis Research Consortium of Canada. (2010). Association of variants at 1q32 and STAT3 with ankylosing spondylitis suggests genetic overlap with Crohn’s disease. PLoS Genet. 6, e1001195.
20. Cotsapas, C., Voight, B.F., Rossin, E., Lage, K., Neale, B.M., Wallace, C., Abecasis, G.R., Barrett, J.C., Behrens, T., Cho, J., et al; FOCiS Network of Consortia. (2011). Pervasive sharing of genetic effects in autoimmune disease. PLoS Genet. 7, e1002254.
21. McCarthy, M.I. (2010). Genomics, type 2 diabetes, and obesity. N. Engl. J. Med. 363, 2339–2350.
22. Kooner, J.S., Saleheen, D., Sim, X., Sehmi, J., Zhang, W., Frossard, P., Been, L.F., Chia, K.S., Dimas, A.S., Hassanali, N., et al; DIAGRAM; MuTHER. (2011). Genome-wide association study in individuals of South Asian ancestry identifies six new type 2 diabetes susceptibility loci. Nat. Genet. 43, 984–989.
23. Yamauchi, T., Hara, K., Maeda, S., Yasuda, K., Takahashi, A., Horikoshi, M., Nakamura, M., Fujita, H., Grarup, N., Cauchi, S., et al. (2010). A genome-wide association study in the Japanese population identifies susceptibility loci for type 2 diabetes at UBE2E2 and C2CD4A-C2CD4B. Nat. Genet. 42, 864–868.
24. Shu, X.O., Long, J., Cai, Q., Qi, L., Xiang, Y.B., Cho, Y.S., Tai, E.S., Li, X., Lin, X., Chow, W.H., et al. (2010). Identification of new genetic risk variants for type 2 diabetes. PLoS Genet. 6, e1001127.
25. Yasuda, K., Miyake, K., Horikawa, Y., Hara, K., Osawa, H., Furuta, H., Hirota, Y., Mori, H., Jonsson, A., Sato, Y., et al. (2008). Variants in KCNQ1 are associated with susceptibility to type 2 diabetes mellitus. Nat. Genet. 40, 1092–1097.
26. Unoki, H., Takahashi, A., Kawaguchi, T., Hara, K., Horikoshi, M., Andersen, G., Ng, D.P., Holmkvist, J., Borch-Johnsen, K., Jørgensen, T., et al. (2008). SNPs in KCNQ1 are associated with susceptibility to type 2 diabetes in East Asian and European populations. Nat. Genet. 40, 1098–1102.
27. Tsai, F.J., Yang, C.F., Chen, C.C., Chuang, L.M., Lu, C.H., Chang, C.T., Wang, T.Y., Chen, R.H., Shiu, C.F., Liu, Y.M., et al. (2010). A genome-wide association study identifies susceptibility variants for type 2 diabetes in Han Chinese. PLoS Genet. 6, e1000847.
28. Below, J.E., Gamazon, E.R., Morrison, J.V., Konkashbaev, A., Pluzhnikov, A., McKeigue, P.M., Parra, E.J., Elbein, S.C., Hallman, D.M., Nicolae, D.L., et al. (2011). Genome-wide association and meta-analysis in populations from Starr County, Texas, and Mexico City identify type 2 diabetes susceptibility loci and enrichment for expression quantitative trait loci in top signals. Diabetologia 54, 2047–2055.
29. Parra, E.J., Below, J.E., Krithika, S., Valladares, A., Barta, J.L., Cox, N.J., Hanis, C.L., Wacher, N., Garcia-Mena, J., Hu, P., et al; Diabetes Genetics Replication and Meta-analysis (DIAGRAM) Consortium. (2011). Genome-wide association study of type 2 diabetes in a sample from Mexico City and a meta-analysis of a Mexican-American sample from Starr County, Texas. Diabetologia 54, 2038–2046.
30. Grant, S.F., Thorleifsson, G., Reynisdottir, I., Benediktsson, R., Manolescu, A., Sainz, J., Helgason, A., Stefansson, H., Emilsson, V., Helgadottir, A., et al. (2006). Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes. Nat. Genet. 38, 320–323.
31. Prokopenko, I., Langenberg, C., Florez, J.C., Saxena, R., Soranzo, N., Thorleifsson, G., Loos, R.J., Manning, A.K., Jackson, A.U., Aulchenko, Y., et al. (2009). Variants in MTNR1B influence fasting glucose levels. Nat. Genet. 41, 77–81.
32. Dupuis, J., Langenberg, C., Prokopenko, I., Saxena, R., Soranzo, N., Jackson, A.U., Wheeler, E., Glazer, N.L., Bouatia-Naji, N., Gloyn, A.L., et al; DIAGRAM Consortium; GIANT Consortium; Global BPgen Consortium; Anders Hamsten on behalf of Procardis Consortium; MAGIC investigators. (2010). New genetic loci implicated in fasting glucose homeostasis and their impact on type 2 diabetes risk. Nat. Genet. 42, 105–116.
33. Saxena, R., Hivert, M.F., Langenberg, C., Tanaka, T., Pankow, J.S., Vollenweider, P., Lyssenko, V., Bouatia-Naji, N., Dupuis, J., Jackson, A.U., et al; GIANT consortium; MAGIC investigators. (2010). Genetic variation in GIPR influences the glucose and insulin responses to an oral glucose challenge. Nat. Genet. 42, 142–148.
34. Weedon, M.N., Clark, V.J., Qian, Y., Ben-Shlomo, Y., Timpson, N., Ebrahim, S., Lawlor, D.A., Pembrey, M.E., Ring, S., Wilkin, T.J., et al. (2006). A common haplotype of the glucokinase gene alters fasting glucose and birth weight: Association in six studies and population-genetics analyses. Am. J. Hum. Genet. 79, 991–1001.
35. Larsen, L.H., Echwald, S.M., Sørensen, T.I., Andersen, T., Wulff, B.S., and Pedersen, O. (2005). Prevalence of mutations and functional analyses of melanocortin 4 receptor variants identified among 750 men with juvenile-onset obesity. J. Clin. Endocrinol. Metab. 90, 219–224.
36. Speliotes, E.K., Willer, C.J., Berndt, S.I., Monda, K.L., Thorleifsson, G., Jackson, A.U., Allen, H.L., Lindgren, C.M., Luan, J., Mägi, R., et al; MAGIC; Procardis Consortium. (2010). Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index. Nat. Genet. 42, 937–948.
37. Frayling, T.M., Timpson, N.J., Weedon, M.N., Zeggini, E., Freathy, R.M., Lindgren, C.M., Perry, J.R., Elliott, K.S., Lango, H., Rayner, N.W., et al. (2007). A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Science 316, 889–894.
38. Meyre, D., Delplanque, J., Chèvre, J.C., Lecoeur, C., Lobbens, S., Gallina, S., Durand, E., Vatin, V., Degraeve, F., Proença, C., et al. (2009). Genome-wide association study for early-onset and morbid adult obesity identifies three new risk loci in European populations. Nat. Genet. 41, 157–159.
39. Scherag, A., Dina, C., Hinney, A., Vatin, V., Scherag, S., Vogel, C.I., Müller, T.D., Grallert, H., Wichmann, H.E., Balkau, B., et al. (2010). Two new Loci for body-weight regulation identified in a joint analysis of genome-wide association studies for early-onset extreme obesity in French and German study groups. PLoS Genet. 6, e1000916.
40. Willer, C.J., Speliotes, E.K., Loos, R.J., Li, S., Lindgren, C.M., Heid, I.M., Berndt, S.I., Elliott, A.L., Jackson, A.U., Lamina, C., et al; Wellcome Trust Case Control Consortium; Genetic Investigation of ANthropometric Traits Consortium. (2009). Six new loci associated with body mass index highlight a neuronal influence on body weight regulation. Nat. Genet. 41, 25–34.
41. Walters, R.G., Jacquemont, S., Valsesia, A., de Smith, A.J., Martinet, D., Andersson, J., Falchi, M., Chen, F., Andrieux, J., Lobbens, S., et al. (2010). A new highly penetrant form of obesity due to deletions on chromosome 16p11.2. Nature 463, 671–675.
42. Heard-Costa, N.L., Zillikens, M.C., Monda, K.L., Johansson, A., Harris, T.B., Fu, M., Haritunians, T., Feitosa, M.F., Aspelund, T., Eiriksdottir, G., et al. (2009). NRXN3 is a novel locus for waist circumference: A genome-wide association study from the CHARGE Consortium. PLoS Genet. 5, e1000539.
43. Heid, I.M., Jackson, A.U., Randall, J.C., Winkler, T.W., Qi, L., Steinthorsdottir, V., Thorleifsson, G., Zillikens, M.C., Speliotes, E.K., Mägi, R., et al; MAGIC. (2010). Meta-analysis identifies 13 new loci associated with waist-hip ratio and reveals sexual dimorphism in the genetic basis of fat distribution. Nat. Genet. 42, 949–960.
44. Kilpelainen, T.O., Zillikens, M.C., Stancakova, A., Finucane, F.M., Ried, J.S., Langenberg, C., Zhang, W., Beckmann, J.S., Luan, J., Vandenput, L., et al. (2011). Genetic variation near IRS1 associates with reduced adiposity and an impaired metabolic profile. Nat. Genet. 43, 753–760.
45. Sawcer, S., Hellenthal, G., Pirinen, M., Spencer, C.C., Patsopoulos, N.A., Moutsianas, L., Dilthey, A., Su, Z., Freeman, C., Hunt, S.E., et al; International Multiple Sclerosis Genetics Consortium; Wellcome Trust Case Control Consortium 2. (2011). Genetic risk and a primary role for cell-mediated immune mechanisms in multiple sclerosis. Nature 476, 214–219.
46. Burton, P.R., Clayton, D.G., Cardon, L.R., Craddock, N., Deloukas, P., Duncanson, A., Kwiatkowski, D.P., McCarthy, M.I., Ouwehand, W.H., Samani, N.J., et al; Wellcome Trust Case Control Consortium; Australo-Anglo-American Spondylitis Consortium (TASC); Biologics in RA Genetics and Genomics Study Syndicate (BRAGGS) Steering Committee; Breast Cancer Susceptibility Collaboration (UK). (2007). Association scan of 14,500 nonsynonymous SNPs in four diseases identifies autoimmunity variants. Nat. Genet. 39, 1329–1337.
47. Evans, D.M., Spencer, C.C., Pointon, J.J., Su, Z., Harvey, D., Kochan, G., Oppermann, U., Dilthey, A., Pirinen, M., Stone, M.A., et al; Spondyloarthritis Research Consortium of Canada (SPARCC); Australo-Anglo-American Spondyloarthritis Consortium (TASC); Wellcome Trust Case Control Consortium 2 (WTCCC2). (2011). Interaction between ERAP1 and HLA-B27 in ankylosing spondylitis implicates peptide handling in the mechanism for HLA-B27 in disease susceptibility. Nat. Genet. 43, 761–767.
48. Suzuki, A., Yamada, R., Chang, X., Tokuhiro, S., Sawada, T., Suzuki, M., Nagasaki, M., Nakayama-Hamada, M., Kawaida, R., Ono, M., et al. (2003). Functional haplotypes of PADI4, encoding citrullinating enzyme peptidylarginine deiminase 4, are associated with rheumatoid arthritis. Nat. Genet. 34, 395–402.
49. Padyukov, L., Silva, C., Stolt, P., Alfredsson, L., and Klareskog, L. (2004). A gene-environment interaction between smoking and shared epitope genes in HLA-DR provides a high risk of seropositive rheumatoid arthritis. Arthritis Rheum. 50, 3085–3092.
50. Voight, B.F., Scott, L.J., Steinthorsdottir, V., Morris, A.P., Dina, C., Welch, R.P., Zeggini, E., Huth, C., Aulchenko, Y.S., Thorleifsson, G., et al; MAGIC investigators; GIANT Consortium. (2010). Twelve type 2 diabetes susceptibility loci identified through large-scale association analysis. Nat. Genet. 42, 579–589.
51. Small, K.S., Hedman, A.K., Grundberg, E., Nica, A.C., Thorleifsson, G., Kong, A., Thorsteindottir, U., Shin, S.Y., Richards, H.B., Soranzo, N., et al; GIANT Consortium; MAGIC Investigators; DIAGRAM Consortium; MuTHER Consortium. (2011). Identification of an imprinted master trans regulator at the KLF14 locus related to multiple metabolic phenotypes. Nat. Genet. 43, 561–564.
52. Freathy, R.M., Mook-Kanamori, D.O., Sovio, U., Prokopenko, I., Timpson, N.J., Berry, D.J., Warrington, N.M., Widen, E., Hottenga, J.J., Kaakinen, M., et al; Genetic Investigation of ANthropometric Traits (GIANT) Consortium; Meta-Analyses of Glucose and Insulin-related traits Consortium; Wellcome Trust Case Control Consortium; Early Growth Genetics (EGG) Consortium. (2010). Variants in ADCY5 and near CCNL1 are associated with fetal growth and birth weight. Nat. Genet. 42, 430–435.
53. Gerken, T., Girard, C.A., Tung, Y.C., Webby, C.J., Saudek, V., Hewitson, K.S., Yeo, G.S., McDonough, M.A., Cunliffe, S., McNeill, L.A., et al. (2007). The obesity-associated FTO gene encodes a 2-oxoglutarate-dependent nucleic acid demethylase. Science 318, 1469–1472.
54. Church, C., Lee, S., Bagg, E.A., McTaggart, J.S., Deacon, R., Gerken, T., Lee, A., Moir, L., Mecinovic, J., Quwailid, M.M., et al. (2009). A mouse model for the metabolic effects of the human fat mass and obesity associated FTO gene. PLoS Genet. 5, e1000599.
55. Church, C., Moir, L., McMurray, F., Girard, C., Banks, G.T., Teboul, L., Wells, S., Brüning, J.C., Nolan, P.M., Ashcroft, F.M., and Cox, R.D. (2010). Overexpression of Fto leads to increased food intake and results in obesity. Nat. Genet. 42, 1086–1092.
56. Freathy, R.M., Timpson, N.J., Lawlor, D.A., Pouta, A., Ben-Shlomo, Y., Ruokonen, A., Ebrahim, S., Shields, B., Zeggini, E., Weedon, M.N., et al. (2008). Common variation in the FTO gene alters diabetes-related metabolic traits to the extent expected given its effect on BMI. Diabetes 57, 1419–1426.
57. Teslovich, T.M., Musunuru, K., Smith, A.V., Edmondson, A.C., Stylianou, I.M., Koseki, M., Pirruccello, J.P., Ripatti, S., Chasman, D.I., Willer, C.J., et al. (2010). Biological, clinical and population relevance of 95 loci for blood lipids. Nature 466, 707–713.
58. Gieger, C., Radhakrishnan, A., Cvejic, A., Tang, W., Porcu, E., Pistis, G., Serbanovic-Canic, J., Elling, U., Goodall, A.H., Labrune, Y., et al. (2011). New gene functions in megakaryopoiesis and platelet formation. Nature 480, 201–208.
59. Mihaescu, R., Meigs, J., Sijbrands, E., and Janssens, A.C. (2011). Genetic risk profiling for prediction of type 2 diabetes. PLoS Curr. 3, RRN1208.
60. Elliott, P., Chambers, J.C., Zhang, W., Clarke, R., Hopewell, J.C., Peden, J.F., Erdmann, J., Braund, P., Engert, J.C., Bennett, D., et al. (2009). Genetic Loci associated with C-reactive protein levels and risk of coronary heart disease. JAMA 302, 37–48.
61. Owen, K.R., Thanabalasingham, G., James, T.J., Karpe, F., Farmer, A.J., McCarthy, M.I., and Gloyn, A.L. (2010). Assessment of high-sensitivity C-reactive protein levels as diagnostic discriminator of maturity-onset diabetes of the young due to HNF1A mutations. Diabetes Care 33, 1919–1924.
62. Thanabalasingham, G., Shah, N., Vaxillaire, M., Hansen, T., Tuomi, T., Gasperikova, D., Szopa, M., Tjora, E., James, T.J., Kokko, P., et al. (2011). A large multi-centre European study validates high-sensitivity C-reactive protein (hsCRP) as a clinical biomarker for the diagnosis of diabetes subtypes. Diabetologia 54, 2801–2810.
63. Zhou, K., Bellenguez, C., Spencer, C.C., Bennett, A.J., Coleman, R.L., Tavendale, R., Hawley, S.A., Donnelly, L.A., Schofield, C., Groves, C.J., et al; GoDARTS and UKPDS Diabetes Pharmacogenetics Study Group; Wellcome Trust Case Control Consortium 2; MAGIC investigators. (2011). Common variants near ATM are associated with glycemic response to metformin in type 2 diabetes. Nat. Genet. 43, 117–120.
64. Stefansson, H., Helgason, A., Thorleifsson, G., Steinthorsdottir, V., Masson, G., Barnard, J., Baker, A., Jonasdottir, A., Ingason, A., Gudnadottir, V.G., et al. (2005). A common inversion under selection in Europeans. Nat. Genet. 37, 129–137.
65. Kong, A., Barnard, J., Gudbjartsson, D.F., Thorleifsson, G., Jonsdottir, G., Sigurdardottir, S., Richardsson, B., Jonsdottir, J., Thorgeirsson, T., Frigge, M.L., et al. (2004). Recombination rate and reproductive success in humans. Nat. Genet. 36, 1203–1206.
66. Hinch, A.G., Tandon, A., Patterson, N., Song, Y., Rohland, N., Palmer, C.D., Chen, G.K., Wang, K., Buxbaum, S.G., Akylbekova, E.L., et al. (2011). The landscape of recombination in African Americans. Nature 476, 170–175.
67. Seldin, M.F., Tian, C., Shigeta, R., Scherbarth, H.R., Silva, G., Belmont, J.W., Kittles, R., Gamron, S., Allevi, A., Palatnik, S.A., et al. (2007). Argentine population genetic structure: Large variance in Amerindian contribution. Am. J. Phys. Anthropol. 132, 455–462.
68. Seldin, M.F., Shigeta, R., Villoslada, P., Selmi, C., Tuomilehto, J., Silva, G., Belmont, J.W., Klareskog, L., and Gregersen, P.K. (2006). European population substructure: Clustering of northern and southern populations. PLoS Genet. 2, e143.
69. Tian, C., Hinds, D.A., Shigeta, R., Kittles, R., Ballinger, D.G., and Seldin, M.F. (2006). A genomewide single-nucleotide-polymorphism panel with high ancestry information for African American admixture mapping. Am. J. Hum. Genet. 79, 640–649.
70. McEvoy, B.P., Montgomery, G.W., McRae, A.F., Ripatti, S., Perola, M., Spector, T.D., Cherkas, L., Ahmadi, K.R., Boomsma, D., Willemsen, G., et al. (2009). Geographical structure and differential natural selection among North European populations. Genome Res. 19, 804–814.
71. Heath, S.C., Gut, I.G., Brennan, P., McKay, J.D., Bencko, V., Fabianova, E., Foretova, L., Georges, M., Janout, V., Kabesch, M., et al. (2008). Investigation of the fine structure of European populations with applications to disease association studies. Eur. J. Hum. Genet. 16, 1413–1429.
72. Novembre, J., Johnson, T., Bryc, K., Kutalik, Z., Boyko, A.R., Auton, A., Indap, A., King, K.S., Bergmann, S., Nelson, M.R., et al. (2008). Genes mirror geography within Europe. Nature 456, 98–101.
73. Price, A.L., Butler, J., Patterson, N., Capelli, C., Pascali, V.L., Scarnicci, F., Ruiz-Linares, A., Groop, L., Saetta, A.A., Korkolopoulou, P., et al. (2008). Discerning the ancestry of European Americans in genetic association studies. PLoS Genet. 4, e236.
74. Manolio, T.A. (2010). Genomewide association studies and assessment of the risk of disease. N. Engl. J. Med. 363, 166–176.
75. Sivakumaran, S., Agakov, F., Theodoratou, E., Prendergast, J.G., Zgaga, L., Manolio, T., Rudan, I., McKeigue, P., Wilson, J.F., and Campbell, H. (2011). Abundant pleiotropy in human complex diseases and traits. Am. J. Hum. Genet. 89, 607–618.
76. Dickson, S.P., Wang, K., Krantz, I., Hakonarson, H., and Goldstein, D.B. (2010). Rare variants create synthetic genome-wide associations. PLoS Biol. 8, e1000294.
77. Anderson, C.A., Soranzo, N., Zeggini, E., and Barrett, J.C. (2011). Synthetic associations are unlikely to account for many common disease genome-wide association signals. PLoS Biol. 9, e1000580.
78. Wray, N.R., Purcell, S.M., and Visscher, P.M. (2011). Synthetic associations created by rare variants do not explain most GWAS results. PLoS Biol. 9, e1000579.
79. Visscher, P.M., Goddard, M.E., Derks, E.M., and Wray, N.R. (2011). Evidence-based psychiatric genetics, AKA the false dichotomy between common and rare variant hypotheses. Molecular Psychiatry, in press. Published online 14 June 2011. 2010.1038/mp.2011.2065.
80. Hunter, D.J., and Kraft, P. (2007). Drinking from the fire hose—Statistical issues in genomewide association studies. N. Engl. J. Med. 357, 436–439.
81. Pryce, J.E., Hayes, B.J., Bolormaa, S., and Goddard, M.E. (2011). Polymorphic regions affecting human height also control stature in cattle. Genetics 187, 981–984.
82. Bodmer, W.F. (1986). Human genetics: The molecular challenge. Cold Spring Harb. Symp. Quant. Biol. 51, 1–13.
83. Risch, N., and Merikangas, K. (1996). The future of genetic studies of complex human diseases. Science 273, 1516–1517.
84. Wray, N.R. (2005). Allele frequencies and the r2 measure of linkage disequilibrium: impact on design and interpretation of association studies. Twin Res. Hum. Genet. 8, 87–94.
85. McClellan, J.M., Susser, E., and King, M.C. (2007). Schizophrenia: A common disease caused by multiple rare alleles. Br. J. Psychiatry 190, 194–199.
86. Craddock, N., O’Donovan, M.C., and Owen, M.J. (2007). Phenotypic and genetic complexity of psychosis. Invited commentary on . Schizophrenia: a common disease caused by multiple rare alleles. Br. J. Psychiatry 190, 200–203.
87. Lander, E.S. (1996). The new genomics: Global views of biology. Science 274, 536–539.
88. Chakravarti, A. (1999). Population genetics—Making sense out of sequence. Nat. Genet. 21 (1, Suppl), 56–60.
89. Reich, D.E., and Lander, E.S. (2001). On the allelic spectrum of human disease. Trends Genet. 17, 502–510.
90. Risch, N. (1990). Linkage strategies for genetically complex traits. I. Multilocus models. Am. J. Hum. Genet. 46, 222–228.
91. Slatkin, M. (2008). Exchangeable models of complex inherited diseases. Genetics 179, 2253–2261.
92. Hill, W.G., Goddard, M.E., and Visscher, P.M. (2008). Data and theory point to mainly additive genetic variance for complex traits. PLoS Genet. 4, e1000008.
93. Wang, K., Dickson, S.P., Stolle, C.A., Krantz, I.D., Goldstein, D.B., and Hakonarson, H. (2010). Interpretation of association signals and identification of causal variants from genome-wide association studies. Am. J. Hum. Genet. 86, 730–742.
94. Nejentsev, S., Walker, N., Riches, D., Egholm, M., and Todd, J.A. (2009). Rare variants of IFIH1, a gene implicated in antiviral responses, protect against type 1 diabetes. Science 324, 387–389.
95. Momozawa, Y., Mni, M., Nakamura, K., Coppieters, W., Almer, S., Amininejad, L., Cleynen, I., Colombel, J.F., de Rijk, P., Dewit, O., et al. (2011). Resequencing of positional candidates identifies low frequency IL23R coding variants protecting against inflammatory bowel disease. Nat. Genet. 43, 43–47.
96. Rivas, M.A., Beaudoin, M., Gardet, A., Stevens, C., Sharma, Y., Zhang, C.K., Boucher, G., Ripke, S., Ellinghaus, D., Burtt, N., et al; National Institute of Diabetes and Digestive Kidney Diseases Inflammatory Bowel Disease Genetics Consortium (NIDDK IBDGC); United Kingdom Inflammatory Bowel Disease Genetics Consortium; International Inflammatory Bowel Disease Genetics Consortium. (2011). Deep resequencing of GWAS loci identifies independent rare variants associated with inflammatory bowel disease. Nat. Genet. 43, 1066–1073.
97. Wang, K., Bucan, M., Grant, S.F., Schellenberg, G., and Hakonarson, H. (2010). Strategies for genetic studies of complex diseases. Cell 142, 351–353, author reply 353–355.
98. Hyttinen, V., Kaprio, J., Kinnunen, L., Koskenvuo, M., and Tuomilehto, J. (2003). Genetic liability of type 1 diabetes and the onset age among 22,650 young Finnish twin pairs: A nationwide follow-up study. Diabetes 52, 1052–1055.
99. Polychronakos, C., and Li, Q. (2011). Understanding type 1 diabetes through genetics: Advances and prospects. Nat. Rev. Genet. 12, 781–792.
100. Poulsen, P., Kyvik, K.O., Vaag, A., and Beck-Nielsen, H. (1999). Heritability of type II (non-insulin-dependent) diabetes mellitus and abnormal glucose tolerance—A population-based twin study. Diabetologia 42, 139–145.
101. Magnusson, P.K., and Rasmussen, F. (2002). Familial resemblance of body mass index and familial risk of high and low body mass index. A study of young men in Sweden. Int. J. Obes. Relat. Metab. Disord. 26, 1225–1231.
102. Schousboe, K., Willemsen, G., Kyvik, K.O., Mortensen, J., Boomsma, D.I., Cornes, B.K., Davis, C.J., Fagnani, C., Hjelmborg, J., Kaprio, J., et al. (2003). Sex differences in heritability of BMI: A comparative study of results from twin studies in eight countries. Twin Res. 6, 409–421.
103. Tysk, C., Lindberg, E., Järnerot, G., and Flodérus-Myrhed, B. (1988).
Ulcerative colitis and Crohn’s disease in an unse- lected population of monozygotic and dizygotic twins. A study of heritability and the influence of smoking. Gut 29, 990–996. 104. Hawkes, C.H., and Macgregor, A.J. (2009). Twin studies and the heritability of MS: A conclusion. Mult. Scler. 15, 661–667. 105. Brown, M.A., Kennedy, L.G., MacGregor, A.J., Darke, C., Duncan, E., Shatford, J.L., Taylor, A., Calin, A., and Words- worth, P. (1997). Susceptibility to ankylosing spondylitis in twins: The role of genes, HLA, and the environment. Arthritis Rheum. 40, 1823–1828. 106. Brown, M.A. (2011). Progress in the genetics of ankylosing spondylitis. Brief. Funct. Genomics 10, 249–257. 107. MacGregor, A.J., Snieder, H., Rigby, A.S., Koskenvuo, M., Kaprio, J., Aho, K., and Silman, A.J. (2000). Characterizing the quantitative genetic contribution to rheumatoid arthritis using data from twins. Arthritis Rheum. 43, 30–37. 108. Lichtenstein, P., Yip, B.H., Bjo¨rk, C., Pawitan, Y., Cannon, T.D., Sullivan, P.F., and Hultman, C.M. (2009). Common The American Journal of Human Genetics 90, 7–24, January 13, 2012 23
  • 72. genetic determinants of schizophrenia and bipolar disorder in Swedish families: A population-based study. Lancet 373, 234–239. 109. Purcell, S.M., Wray, N.R., Stone, J.L., Visscher, P.M., O’Dono- van, M.C., Sullivan, P.F., and Sklar, P.; International Schizo- phrenia Consortium. (2009). Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460, 748–752. 110. Lichtenstein, P., Holm, N.V., Verkasalo, P.K., Iliadou, A., Kaprio, J., Koskenvuo, M., Pukkala, E., Skytthe, A., and Hem- minki, K. (2000). Environmental and heritable factors in the causation of cancer—Analyses of cohorts of twins from Sweden, Denmark, and Finland. N. Engl. J. Med. 343, 78–85. 111. Turnbull, C., Ahmed, S., Morrison, J., Pernet, D., Renwick, A., Maranian, M., Seal, S., Ghoussaini, M., Hines, S., Healey, C.S., et al; Breast Cancer Susceptibility Collaboration (UK). (2010). Genome-wide association study identifies five new breast cancer susceptibility loci. Nat. Genet. 42, 504–507. 112. Orstavik, K.H., Magnus, P., Reisner, H., Berg, K., Graham, J.B., and Nance, W. (1985). Factor VIII and factor IX in a twin population. Evidence for a major effect of ABO locus on factor VIII level. Am. J. Hum. Genet. 37, 89–101. 113. de Lange, M., Snieder, H., Arie¨ns, R.A., Spector, T.D., and Grant, P.J. (2001). The genetics of haemostasis: A twin study. Lancet 357, 101–105. 114. Smith, N.L., Chen, M.H., Dehghan, A., Strachan, D.P., Basu, S., Soranzo, N., Hayward, C., Rudan, I., Sabater-Lleal, M., Bis, J.C., et al; Wellcome Trust Case Control Consortium. (2010). Novel associations of multiple genetic loci with plasma levels of factor VII, factor VIII, and von Willebrand factor: The CHARGE (Cohorts for Heart and Aging Research in Genome Epidemiology) Consortium. Circulation 121, 1382–1392. 115. Visscher, P.M., Medland, S.E., Ferreira, M.A., Morley, K.I., Zhu, G., Cornes, B.K., Montgomery, G.W., and Martin, N.G. (2006). 
Assumption-free estimation of heritability from genome-wide identity-by-descent sharing between full siblings. PLoS Genet. 2, e41. 116. Silventoinen, K., Sammalisto, S., Perola, M., Boomsma, D.I., Cornes, B.K., Davis, C., Dunkel, L., De Lange, M., Harris, J.R., Hjelmborg, J.V., et al. (2003). Heritability of adult body height: A comparative study of twin cohorts in eight coun- tries. Twin Res. 6, 399–408. 117. Peacock, M., Turner, C.H., Econs, M.J., and Foroud, T. (2002). Genetics of osteoporosis. Endocr. Rev. 23, 303–326. 118. Duncan, E.L., Danoy, P., Kemp, J.P., Leo, P.J., McCloskey, E., Nicholson, G.C., Eastell, R., Prince, R.L., Eisman, J.A., Jones, G., et al. (2011). Genome-wide association study using extreme truncate selection identifies novel genes affecting bone mineral density and fracture risk. PLoS Genet. 7, e1001372. 119. Dalageorgou, C., Ge, D., Jamshidi, Y., Nolte, I.M., Riese, H., Savelieva, I., Carter, N.D., Spector, T.D., and Snieder, H. (2008). Heritability of QT interval: how much is explained by genes for resting heart rate? J. Cardiovasc. Electrophysiol. 19, 386–391. 120. Russell, M.W., Law, I., Sholinsky, P., and Fabsitz, R.R. (1998). Heritability of ECG measurements in adult male twins. J. Electrocardiol. Suppl. 30, 64–68. 121. Shah, S.H., and Pitt, G.S. (2009). Genetics of cardiac repolar- ization. Nat. Genet. 41, 388–389. 122. Hunt, S.C., Hasstedt, S.J., Kuida, H., Stults, B.M., Hopkins, P.N., and Williams, R.R. (1989). Genetic heritability and common environmental components of resting and stressed blood pressures, lipids, and body mass index in Utah pedi- grees and twins. Am. J. Epidemiol. 129, 625–638. 123. Evans, D.M., Frazer, I.H., and Martin, N.G. (1999). Genetic and environmental causes of variation in basal levels of blood cells. Twin Research: The Official Journal of the Inter- national Society for Twin Studies 2, 250–257. 24 The American Journal of Human Genetics 90, 7–24, January 13, 2012
  • 73. ARTICLE

Mitochondrial DNA and Y Chromosome Variation Provides Evidence for a Recent Common Ancestry between Native Americans and Indigenous Altaians

Matthew C. Dulik,1 Sergey I. Zhadanov,1,2 Ludmila P. Osipova,2 Ayken Askapuli,1,3 Lydia Gau,1 Omer Gokcumen,1,4 Samara Rubinstein,1,5 and Theodore G. Schurr1,*

The Altai region of southern Siberia has played a critical role in the peopling of northern Asia as an entry point into Siberia and a possible homeland for ancestral Native Americans. It has an old and rich history because humans have inhabited this area since the Paleolithic. Today, the Altai region is home to numerous Turkic-speaking ethnic groups, which have been divided into northern and southern clusters based on linguistic, cultural, and anthropological traits. To untangle Altaian genetic histories, we analyzed mtDNA and Y chromosome variation in northern and southern Altaian populations. All mtDNAs were assayed by PCR-RFLP analysis and control region sequencing, and the nonrecombining portion of the Y chromosome was scored for more than 100 biallelic markers and 17 Y-STRs. Based on these data, we noted differences in the origin and population history of Altaian ethnic groups, with northern Altaians appearing more like Yeniseian, Ugric, and Samoyedic speakers to the north, and southern Altaians having greater affinities to other Turkic-speaking populations of southern Siberia and Central Asia. Moreover, high-resolution analysis of Y chromosome haplogroup Q has allowed us to reshape the phylogeny of this branch, making connections between populations of the New World and Old World more apparent and demonstrating that southern Altaians and Native Americans share a recent common ancestor. These results greatly enhance our understanding of the peopling of Siberia and the Americas.

Introduction

The Altai Republic is located in south-central Russia, situated at the borders of Mongolia, China, and Kazakhstan.
It sits at a crossroads where the Eurasian steppe meets the Siberian taiga and serves as an entry point into northern Asia. Having been habitable throughout the last glacial maximum (LGM), the Altai region has had a human presence for some 45,000 years.1 The archaeology of the region shows that, during this time, a number of different cultures and peoples lived in and migrated through the area.2–4 The confirmation of Neanderthals and the recent discovery of a new hominin at the Denisova cave in the Altai region indicate that this area has long hosted extremely diverse populations.5–7 It is also the area from which the ancestors of Native American populations are thought to have arisen prior to their expansion into the New World.8–11 In addition, archaeological evidence suggests that a few of the later cultural horizons (Afanasievo and Andronovo) arose in western Eurasia and spread eastward to the Altai region during the Eneolithic and Bronze Ages, respectively.12,13 Such interactions increased during the Iron Age, as evidenced by the frozen Pazyryk kurgans in the southern Altai Mountains,14 which contained examples of the typical “Scytho-Siberian animal style” observed throughout the entire Eurasian steppe.3,15 These populations further intermingled with expanding Altaic-speaking groups, and specifically the movements involving the Xiongnu, Xianbei, and Yuezhi, as recorded by ancient Chinese historians in the second century BCE.16,17

Ethnographic studies of Turkic-speaking tribes indigenous to the Altai region of southern Siberia noted cultural differences among ethnic groups such that they could be classified into northern or southern Altaians.18,19 Northern Altaian ethnic groups include the Chelkan, Kumandin, and Tubalar. The Altai-kizhi, Teleut, and Telengit were grouped together as southern Altaians, along with a few other smaller populations.
Similarly, linguistic studies have shown that languages from northern and southern populations are mutually unintelligible, despite their having similar Turkic roots. The northern Altai languages also showed greater influences from Samoyedic, Yeniseian, and Ugric languages, possibly reflecting their origin among the ancestors of these present-day peoples. By contrast, southern Altaian languages belong to the Kipchak branch of the Turkic language family and have been greatly influenced by Mongolian, especially after the expansion of the Mongol Empire.16,20 These linguistic differences are further mirrored by differences in anthropometric traits, traditional subsistence strategies, religious traditions, and clan names for northern and southern Altaians.18,19,21

1 Department of Anthropology, University of Pennsylvania, Philadelphia, PA 19104-6398, USA
2 Institute of Cytology and Genetics, SB RAS, Novosibirsk 630090, Russia
3 Institute of General Genetics and Cytology, Almaty 050060, Kazakhstan
4 Present address: Harvard University Medical School, Brigham and Women’s Hospital, Boston, MA 02115, USA
5 Present address: Sackler Educational Laboratory for Comparative Genomics and Human Origins, American Museum of Natural History, New York, NY 10024-5192, USA
*Correspondence: tgschurr@sas.upenn.edu
DOI 10.1016/j.ajhg.2011.12.014. ©2012 by The American Society of Human Genetics. All rights reserved.
The American Journal of Human Genetics 90, 229–246, February 10, 2012

Genetic analysis of Altaian populations initially focused on protein polymorphisms to assess levels of diversity and the relationships between them and other Siberian populations by comparing relative proportions of West and East Eurasian genotypes.22–24 The role that the Altai region
  • 74. played in the dispersal of humans into northern Eurasia and subsequently into the Americas gained increasing importance with the search for the founding mitochondrial DNAs (mtDNAs) and Y chromosomes for the New World.8,25,26 As a result, the issue of where Native American progenitors originated became a hotly debated topic, with suggested source areas being Central Asia, Mongolia, and different parts of Siberia.8–10,27–46 However, much of the previous genetic research into this issue focused mainly on southern Altaian populations, leaving our understanding of the genetic diversity of northern Altaian groups incomplete.

Given the ethnographic and historical background of Altaian peoples, we characterized the mtDNA and Y chromosome variation in these populations to elucidate their genetic history. Our first objective was to determine whether the ethnographic classifications of northern and southern Altaians reflected their patterns of genetic variation, and specifically whether they shared a common ancestry. If differences were observed, we then wanted to know whether they were attributable to demographic factors, social organization, or some combination of the two. The second goal was to examine whether northern Altaians’ genetic variation is structured by tribe and clan identity. The third goal was to use these data to investigate larger questions concerning the peopling of Siberia (and the Americas). In particular, we were interested in learning whether these genetic data would reveal the effects of ancient and/or recent migrations into or out of the Altai region, including that giving rise to the ancestors of indigenous populations from America.
Overall, this paper attempts to understand the population history of Altaians by placing them into a Siberian genetic context and uses a phylogeographic approach to dissect the layers of history, uncovering the formation of these ethnic groups and their importance for understanding the peopling of Northern Asia and the Americas.

Subjects and Methods

Sample Collection

Between 1991 and 2002, we conducted ethnographic fieldwork and sample collection in a number of settlements within the southern part of the Altai Republic (Figure 1). During this period, a total of 267 self-identified Altai-kizhi individuals living in the villages of Mendur-Sokkon, Cherny Anuy, Turata, and Kosh-Agach participated in the study. In addition, another nine Altai-kizhi individuals from villages in the northern Altai Republic participated in the study (see below), bringing the total number of Altai-kizhi participants to 276, of whom 120 were men.

Figure 1. Map of the Altai Republic and Locations of Sample Collection
  • 75. In 2003, we worked with 214 northern Altaians living in the Turochak District of the Altai Republic. These persons included 91 Chelkans, 52 Kumandins, and 71 Tubalars living in nine different villages in the Biya and Lebed’ River basins and along Teletskoe Lake (Figure 1). The villages included Artybash, Biika, Dmitrievka, Kebezen, Kurmach-Baigol, Sank-Ino, Shunarak, Tandoshka, and Yugach. Of the northern Altaian participants, 69 were men.

Blood samples were drawn from all participants with informed consent written in Russian and approved by the University of Pennsylvania IRB and the Institute of Cytology and Genetics in Novosibirsk, Russia. Genealogical data were also obtained from each person at the time of sample collection to ensure that the individuals were unrelated through at least three generations and to assess the level of admixture in these communities. Individuals were categorized by self-identified ethnicity for this study.

Molecular Genetic Analysis

Sample Preparation

Blood samples were fractionated through low-speed centrifugation to obtain plasma and red cell fractions. Total genomic DNAs were isolated from buffy coats with a lysis buffer and standard phenol-chloroform extraction protocol modified from earlier studies.27,47

mtDNA Analysis

The mtDNA of each sample was characterized by high-resolution SNP analysis and control region sequencing. PCR-RFLP analysis was employed to assign individuals to West48–52 and East30,53–56 Eurasian mtDNA haplogroups by screening them for known diagnostic markers, as per previous studies57,58 (Table S1 available online), with the nomenclature used to classify the mitochondrial haplotype according to PhyloTree.org.59 The hypervariable segment 1 (HVS1) of the control region was directly sequenced for each sample by published methods,58 and hypervariable segment 2 (HVS2) was sequenced with the primers indicated in Table S2.
Sequences were read on ABI 3130xl Genetic Analyzers located in the Laboratory of Molecular Anthropology and the Department of Genetics Sequencing Core Facility at the University of Pennsylvania and aligned and edited with Sequencher 4.8 (Gene Codes Corporation). All polymorphic nucleotides were reckoned relative to the revised Cambridge reference sequence (rCRS).60,61 The combination of SNP data and control region sequences defined maternal haplotypes in these individuals.

Y Chromosome Analysis

The nonrecombining portion of the Y chromosome (NRY) from each male participant was characterized by assaying phylogenetically informative biallelic markers in a hierarchical fashion according to published information62,63 and previously published methods.64 A total of 116 biallelic markers were tested to define sample membership in respective NRY haplogroups. Most of the SNPs and fragment length polymorphisms were characterized by custom TaqMan assays read on an ABI Prism 7900 HT Real-Time PCR System (Applied Biosystems). These polymorphisms included L53, L54, L55, L56, L57, L213, L329, L330, L331, L332, L333, L365, L400, L456, L472, L474, L475, L476, L528, LLY22g, M3, M9, M12, M15, M18, M20, M25, M35, M45, M55, M56, M69, M70, M73, M81, M86, M89, M93, M96, M102, M117, M119, M120, M122, M123, M124, M128, M130, M134, M143, M147, M157, M162, M170, M172, M173, M174, M178, M186, M201, M204, M207, M214, M217, M223, M230, M242, M253, M265, M267, M269, M285, M304, M323, M335, M346, M410, M417, M434, M458, P15, P25, P31, P36.2, P37.2, P47, P60, P63, P105, P215, P256, P261, P297, and PK2.
Additional markers were detected through direct sequencing (L191, L334, L401, L527, L529, M17, M46 [Tat], M343, M407, MEH2, P39, P43, P48, P53.1, P62, P89, P98, P101, PageS000104, and PK5) and by PCR-RFLP analysis (M175).65 Seventeen short tandem repeats (STRs) were characterized with the AmpFlSTR Yfiler PCR Amplification Kit (ABI) and read on an ABI 3130xl Genetic Analyzer with GeneMapper ID v3.2 software. Each paternal haplotype was designated by its 17-STR profile. Y chromosome lineages were defined as the unique combinations of SNP and STR data present in the samples. DYS389b was calculated by subtracting DYS389I from DYS389II and was used for all statistical and network analyses.64

Comparative Data

To place Altaian genetic histories in a broader context, we compared Altaian mtDNA and NRY data with those from populations in southern Siberia, Central Asia, Mongolia, and East Asia. For the mtDNA analysis, the populations included Telengits, Teleuts, Shors, Khakass, Tuvinians, Todzhans, Tofalars, Soyots, Buryats, Khanty, Mansi, Ket, Nganasan, Western Evenks, Uyghurs, Kazakhs, Kyrgyz, Uzbeks, and Mongolians.41,43,44,66–71 For the NRY analysis, only populations that were represented by full Y-STR data sets (not just Y-STRs for specific haplogroups) were used for comparative purposes. These populations included Teleuts, Khakass, Mansi, Khanty, Kalmyks, Mongolians, and Uyghurs.68,72–75 The STR haplotypes were reduced to ten loci (DYS19, DYS389I, DYS389b, DYS390, DYS391, DYS392, DYS393, DYS437, DYS438, and DYS439) to allow for as broad a comparison as possible. In the coalescence analysis, we used the 15-locus Y-STR Q-M3 haplotypes from Geppert et al.76

Data Analysis

Summary statistics, including gene diversity and pairwise differences, were calculated with Arlequin v3.1177 for mtDNA HVS1 (np 16024-16400) and NRY Y-STRs. FST and RST values between populations were also calculated with Arlequin v3.11 for the HVS1 sequences and Y-STRs, respectively.
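The Y-STR handling described above (deriving DYS389b and trimming each profile to the ten comparison loci) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' pipeline: the function names, the dictionary layout, and the repeat counts in the example profile are all hypothetical and chosen only to show the arithmetic.

```python
# Ten loci retained for cross-population comparison, in the order listed in the text.
TEN_LOCI = ("DYS19", "DYS389I", "DYS389b", "DYS390", "DYS391",
            "DYS392", "DYS393", "DYS437", "DYS438", "DYS439")


def add_dys389b(profile):
    """Return a copy of a Y-STR profile with DYS389b = DYS389II - DYS389I."""
    out = dict(profile)
    out["DYS389b"] = profile["DYS389II"] - profile["DYS389I"]
    return out


def reduce_to_ten_loci(profile):
    """Project a profile onto the ten comparison loci as an ordered tuple."""
    full = add_dys389b(profile)
    return tuple(full[locus] for locus in TEN_LOCI)


# Hypothetical repeat counts, invented purely for illustration.
example_profile = {
    "DYS19": 13, "DYS389I": 13, "DYS389II": 29, "DYS390": 24,
    "DYS391": 10, "DYS392": 14, "DYS393": 13, "DYS437": 14,
    "DYS438": 11, "DYS439": 12, "DYS448": 20, "DYS456": 15,
    "DYS458": 17,
}

reduced = reduce_to_ten_loci(example_profile)
print(reduced)
```

Storing the reduced haplotype as a tuple makes it hashable, so identical haplotypes from different populations can be counted or compared directly.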
FST values were estimated with the Tamura and Nei model of sequence evolution.78 Pairwise genetic distances were visualized by multidimensional scaling (MDS) with SPSS 11.0.0.79 In addition, nucleotide diversity, Tajima’s D, and Fu’s FS were calculated with mtDNA HVS1 sequences.

We analyzed the phylogenetic relationships among Y-STR haplotypes and complete mtDNA genomes by using Network 4.6.0.0 (Fluxus Technology Ltd). These networks employed a reduced median-median joining approach and MP post-processing.80–82 The NRY haplotypes used to generate the networks consisted of 15 Y-STRs. DYS385 was excluded from the network analysis because differentiation between DYS385a and DYS385b is not possible with the Yfiler kit.83 The Y-STR loci were weighted based on the inverse of their variances. Mitogenomes used in this analysis came from the published literature and GenBank.

The time to the most recent common ancestor (TMRCA) for mitogenomes was estimated with the methods of Soares et al.84 The Y-STR diversity within each haplogroup was assessed by two methods.64 The first involved calculation of rho statistics with Network 4.6.0.0, where the founder haplotype was inferred as in Sengupta et al.85 The second used Batwing,86 a Bayesian analysis where the TMRCA and expansion time of each population (or haplogroup) were calculated by previously published methods.64,72,87 Both the evolutionary and the pedigree-based mutation rates were used to estimate coalescence dates with
  • 76. generation times of 25 and 30 years, respectively.88–90 Because a definitive consensus does not yet exist as to which rate should be used, the validity of the resulting estimates is discussed. In addition, Batwing was used to estimate the split or divergence times of several haplogroups. This method assumes that, after populations split, no further migration occurs between them. In this case, the haplogroups investigated were not shared between populations but derive from a common source, thereby justifying this approach. Duplicated loci and new STR variants detected in this study were excluded from statistical analysis.

Results

Mitochondrial DNA and Y Chromosome Diversity

The maternal genetic ancestry of northern and southern Altaian populations was explored by characterizing coding region SNPs and control region sequences from 490 inhabitants of the Altai Republic, which yielded 99 distinct mtDNA haplotypes defined by SNP and HVS1 mutations (Table S3). The majority of mtDNAs were of East Eurasian origin, although the relative proportion of these haplotypes was greater in Chelkans (91.5%) compared to other Altaian populations (75.2% in Tubalars, 75.6% in Kumandins, and 76.4% in Altai-kizhi) (Table 1). Despite exhibiting a lower overall frequency of West Eurasian haplogroups, Altaians (specifically, the Altai-kizhi, Tubalar, and Kumandins) had a higher proportion of them as compared to other southern Siberians.41,43

Differences in mtDNA haplogroup profiles were observed among northern Altaian ethnic groups and between northern Altaians and Altai-kizhi, with the Chelkans being extraordinarily distinct. Nevertheless, comparisons among other Altaian ethnic groups revealed some consistent patterns. mtDNA haplogroups B, C, D, and U4 were found in all Altaian populations, but at varying frequencies, whereas southern Altaians (Altai-kizhi, Telengits, and Teleuts) tended to have a greater variety of West Eurasian haplogroups at low frequencies.
Shors, who have sometimes been categorized as northern Altaians,18 exhibited a similar haplogroup profile to other northern Altaian ethnic groups, including moderate frequencies of C, D, and F1, although they lacked others (N9a and U).41

Haplogroups C and D were the most frequent mtDNA lineages in the Altaians, consistent with the overall picture of the Siberian mtDNA gene pool. However, phylogeographic analysis of these lineages showed a greater diversity of haplotypes in the southern Altaians compared to northern Altaians. Although haplotypes were shared between regions, northern Altaians largely had C4 with the root HVS1 motif (16223-16298-16327) and C5c, whereas the southern Altaians had C4a1 and C4a2. Although C5c is largely confined to Altaians, it has been suggested that an early migration from Siberia to Europe brought haplogroup C west, where the branch differentiated during the Neolithic and then was taken back into southern Siberia.83 Also noteworthy, D4j7 appears to be specific to Altaians and Shors.41,91 In addition, a D5a haplotype was shared by Tubalars and Altai-kizhi, and a rare D5c2 haplotype was shared by the Chelkans and Kumandins. Interestingly, complete mtDNA genome sequencing of a subset of our D5c2 samples showed few differences from those present in Japan,55 suggesting a possible connection resulting from the dispersal of Altaic-speaking populations.92 The remainder of the D haplotypes were found in other southern Siberian and Central Asian populations.

To explore the NRY variation in Altaian populations, 116 biallelic polymorphisms were characterized in 189 male individuals, resulting in 106 Y chromosome lineages (Table 2). Northern Altaian populations were composed largely of haplogroups Q and N-P43, whereas southern Altaians had a higher proportion of R-M417, C-M217/PK2, C-M86, and D-P47.
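A north-south contrast in haplogroup counts of this kind is commonly assessed with a chi-square test of homogeneity. The sketch below is illustrative only. It pools the Table 2 counts into six major clades by the first letter of the haplogroup label (northern = Chelkan + Kumandin + Tubalar, southern = Altai-kizhi), which is an assumption made here and not necessarily the grouping the authors used, so its statistic and degrees of freedom should not be expected to reproduce any value reported in the article. Several expected counts are small, so in practice an exact or permutation test might be preferred.

```python
def chi_square_homogeneity(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under the hypothesis of identical haplogroup proportions.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical pooling of Table 2 counts into major clades (illustration only).
CLADES = ("C", "D", "N", "Q", "R", "other")
northern = [0, 0, 17, 26, 22, 4]    # Chelkan + Kumandin + Tubalar, 69 men
southern = [24, 6, 5, 20, 60, 5]    # Altai-kizhi, 120 men

stat, dof = chi_square_homogeneity([northern, southern])
print(f"chi-square = {stat:.2f}, df = {dof}")
```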
Haplogroups typical of south Asia, western Europe, and East Asia were not found in appreciable frequencies.72,93–99 The haplogroup frequency differences between northern and southern Altaians were statistically significant (χ² = 66.03, df = 9, p = 9.09 × 10⁻¹¹).

As with the mtDNA data set, we also observed differences in NRY haplogroup composition among northern Altaian populations, where each ethnic group shared haplogroups with the other two, yet had distinct haplogroup profiles. Overall, Kumandins had the most disparate haplogroup frequencies of the northern Altaians, exhibiting a similar number of N-P43 chromosomes as the Chelkans, which were quite similar to those found in Khanty and Mansi populations in northwestern Siberia.68,100 In addition, a large proportion of Kumandin Y chromosomes belonged to R-M73. This haplogroup is largely restricted to Central Asia101 but has also been found in Altaian Kazakhs and other southern Siberians.64,102 In fact, Myres et al.101 noted two distinct clusters of R-M73 STR haplotypes, with one of them containing Y chromosomes bearing a 19 repeat allele for DYS390, which appears to be unique to R-M73. Interestingly, the majority of Kumandin R-M73 haplotypes fell into this category, although haplotypes from both clusters are found in southern Siberia.102

In all cases, the haplotypes present in Altaians fit into known modern human phylogenies. None of the Altaians had a mitochondrial lineage similar to those of Neanderthals or the Denisovan hominin. Although there are no ancient Denisovan or Neanderthal Y chromosome data to compare with the Altaian data set, the Altaian Y chromosomes clearly derived from more recent expansions of modern humans out of Africa.

Altaian Genetic Relationships

Summary statistics were calculated to assess the relative amounts of genetic diversity in Altaian populations (Table 3).
Gene diversities based on HVS1 of the mtDNA showed that, overall, the Altai-kizhi were more diverse than the northern Altaians. The average pairwise differences for the Altai-kizhi were also smaller. In fact, the estimates for the Altai-kizhi and Tubalars were comparable to other southern Siberians.43 By contrast, those for the
  • 77. Chelkans and Kumandins were lower and more similar to Soyots, but not as low as that of Tofalars.

Table 1. mtDNA Haplogroup Frequencies of Altaian Populations
Hg Chelkan Kumandin Tubalar1 Tubalar2 Shor Altai-kizhi1 Altai-kizhi2 Telengit Teleut
# 91 52 71 72 28 276 48 55 33
C 15.1 41.5 35.6 20.8 17.9 31.4 25.0 14.6 24.2
Z 2.7 3.6 4.3 4.2 3.0
M8 3.6 4.2
D4 13.9 15.1 24.7 15.3 25.0 13.0 6.3 18.2 24.2
D5 8.6 3.8 4.1 5.6 3.6 0.7 3.0
G 3.2 4.0 4.2 3.6
M7 1.8
M9 1.4
M10 1.1 3.6 0.4 2.1
M11 2.1 1.8 3.0
M* 1.8
A 1.9 11.1 3.6 2.9 4.7 7.3
I 3.6 1.4 2.1 1.8
N1a 1.8
N1b 0.4
W 1.1
X 3.8 1.4 2.2 2.1 3.0
N9a 19.4 1.9 2.7 6.9 1.8
B 3.2 3.8 2.7 4.2 3.6 1.4 6.3 14.6 6.1
F1 10.8 3.8 1.4 14.3 8.3 4.2 1.8 3.0
F2 15.1 2.7 3.6 2.5 2.1
H 1.1 2.7 1.4 3.6 2.5 8.3 9.1 9.1
H2 3.3 2.1
H8 5.7 2.7 4.2 3.6 1.4
HV 1.8
V 6.1
J 3.6 4.0 6.3 1.8
T 1.9 0.4 3.6 6.1
U2 2.8 0.7 1.8 3.0
U3 2.1
U4 4.3 3.8 15.1 18.1 3.6 0.7 2.1 1.8 3.0
U5 2.2 9.4 4.1 5.6 3.3 2.1 1.8
U8 1.8
K 3.6 3.3 6.3 3.0
R9 1.1 3.8 1.4 2.2 5.5
R11 2.1

Mismatch distributions were smooth and bell-shaped for all populations except the Chelkans, which had a significant raggedness index. This statistic indicated that Tubalars, Kumandins, and Altai-kizhi had experienced sudden expansions or expansions from population bottlenecks.103 Tests of neutrality confirmed these findings in yielding significantly negative Tajima’s D and Fu’s FS estimates for all populations, except the Chelkans, indicating that this
  • 78. particular population probably experienced a reduction in population size or was subdivided.

To understand Altaian maternal genetic background, we compared our data with those from other North Asian and Central Asian populations. FST values between populations were calculated with HVS1 sequences and viewed through multidimensional scaling (Figure 2). In this analysis, southern Siberians formed a rather diffuse cluster, with most Central Asian and Mongolian populations being separated from them. Altaian populations also did not constitute a distinct cluster unto themselves. Based on the FST values, the Chelkans were distinct from all other ethnic groups. Although falling closest to the Khakassians in the MDS plot, they shared a smaller genetic distance with the Tubalars2, which was expected because of the inclusion of some Chelkans in that sample set.44 Kumandins and Tubalars1 were not significantly different, and appeared close to Tuvinians and southern Altaians. In fact, both populations had smaller FST values with southern Altaians than they did with the Chelkans, although the genetic distances between Tubalars1 and Tubalars2, Altai-kizhi, and Teleuts were also nonsignificant.

Unlike northern Altaians, most of the southern Altaian populations clustered together. The Altai-kizhi, Teleuts, and Tubalars1 formed one small cluster with Kyrgyz, whereas the Telengits showed greater affinities with Central Asian populations. The southern Altaian cluster sat near a cluster of Tuvinian populations, suggesting a similar population history and likely gene flow between these groups.

Summary statistics were calculated to assess the genetic diversity of paternal lineages in Altaian populations (Table 4). Gene diversities based on Y-STR haplotypes (15-loci Y-STR haplotypes; Table S4) showed that the Altai-kizhi were more diverse than the northern Altaians.
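As a rough guide to what a gene (haplotype) diversity value measures, the sketch below computes the standard unbiased estimator H = n/(n − 1) × (1 − Σ pᵢ²), where pᵢ is the sample frequency of the i-th distinct haplotype and n is the sample size. It is assumed here, not confirmed by the text, that this matches the estimator behind the reported values; the function name and the toy haplotype labels are hypothetical.

```python
from collections import Counter


def gene_diversity(haplotypes):
    """Unbiased gene diversity: n/(n-1) * (1 - sum of squared haplotype frequencies)."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two sampled haplotypes are required")
    counts = Counter(haplotypes)
    sum_sq = sum((count / n) ** 2 for count in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


# Toy sample: four chromosomes carrying three distinct (hypothetical) haplotypes.
toy_sample = ["hapA", "hapA", "hapB", "hapC"]
print(round(gene_diversity(toy_sample), 4))
```

The estimator equals 0 when every sampled chromosome carries the same haplotype and 1 when all sampled haplotypes are distinct, so higher values indicate a more even spread over more haplotypes.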
Unlike the mtDNA data, within-group pairwise differences were greater in the southern Altaian and Tubalar Y-STR haplotypes than in the Chelkans and Kumandins.

Table 2. High-Resolution NRY Haplogroup Frequencies in Altaian Populations
Haplogroup   Chelkan      Kumandin     Tubalar      Altai-kizhi
C3*          -            -            -            19 (0.158)
C3c1         -            -            -            5 (0.042)
D3a          -            -            -            6 (0.050)
E1b1b1c      -            -            1 (0.037)    -
I2a          -            -            1 (0.037)    -
J2a          -            -            -            3 (0.025)
L            1 (0.040)    -            -            -
N1*          -            1 (0.059)    3 (0.111)    -
N1b*         5 (0.200)    8 (0.471)    -            2 (0.017)
N1c*         -            -            -            1 (0.008)
N1c1         -            -            -            2 (0.017)
O3a3c*       -            -            -            1 (0.008)
O3a3c1       -            -            1 (0.037)    1 (0.008)
Q1a2         -            -            1 (0.037)    -
Q1a3a*       15 (0.600)   -            10 (0.370)   -
Q1a3a1c*     -            -            -            20 (0.167)
R1a1a1*      4 (0.160)    2 (0.118)    10 (0.370)   60 (0.500)
R1b1a1       -            6 (0.353)    -            -
T            -            -            -            -
Total        25           17           27           120

Table 3. HVS1 Summary Statistics for Altaian Populations
                       Northern Altaian                                   Southern Altaian
Population             Chelkan          Kumandin         Tubalar1         Altai-kizhi1
# of samples           91               52               71               276
# of haplotypes        22               18               26               75
Haplotype diversity    0.923 ± 0.013    0.914 ± 0.021    0.953 ± 0.010    0.976 ± 0.003
Nucleotide diversity   0.020 ± 0.011    0.022 ± 0.011    0.019 ± 0.010    0.018 ± 0.009
Pairwise differences   7.68 ± 3.61      8.22 ± 3.87      7.03 ± 3.34      6.84 ± 3.23
Raggedness index       0.032            0.022            0.010            0.011
Raggedness p value     0.000            0.149            0.635            0.388
Tajima D               1.201            −0.644           −0.701           −1.180
Tajima D p value       0.000            0.000            0.000            0.000
Fu’s FS                3.417            −0.497           −3.877           −24.416
Fu’s FS p value        0.002            0.000            0.000            0.000

Y-chromosomal variation in the four populations in our data set provided a slightly different picture than the mitochondrial data. In this analysis, RST values were calculated with 15-loci Y-STR haplotypes (Table S6). These estimates indicated that only the Chelkans and Tubalars were not
  • 79. significantly different from each other. The Kumandins were quite distant from all populations, although these distances were slightly smaller among northern Altaians than with the Altai-kizhi. The Altai-kizhi were again closest to the Tubalars. These relationships were affirmed by the haplotype sharing between the four populations. The Chelkans and Tubalars shared a large proportion of their haplotypes, mostly those from haplogroups Q and R-M417, whereas the Kumandins shared only one haplotype with Tubalars (a rare N-LLY22g haplotype). In addition, the northern and southern Altaians shared only a single haplotype, belonging to haplogroup O-M117, which is more commonly found in southern China.104 In fact, these two Y chromosomes were the only occurrences of hap- logroup O in our data set. The Y-STR profiles were reduced to 10-loci STR haplo- types in order to compare Y chromosome diversity in several Siberian and Central Asian populations (Table 5; Figure 3). The genetic distances in our sample set remained high despite the greater haplotype sharing that resulted from this reduction. Overall, the genetic distances were much greater with the Y-STR haplotypes compared to mtDNA haplotypes, indicating greater genetic differentia- tion in paternal lineages compared to maternal lineages. In addition to the Chelkans and Tubalars, two other groups of populations exhibited nonsignificant RST values. One group included Uyghur (from Urumqi and Yili) and Mongolian (Kalmyks and Mongolians) populations, and the other included the Mansi and a Sagai population iden- tified as part of the Khakass ethnic group. In contrast with their position in the mtDNA MDS plot, northern Altaians were separated from all other populations, including other southern Siberians. The three groups of Khakass (Sagai, Sagai/Shor, and Kachin) fell much closer to the Khanty and Mansi, which probably indicates a common ancestry Figure 2. 
MDS Plot of FST Genetic Distances Generated from mtDNA HVS1 Sequences in Siberian and Central Asian Populations. Circle, southern Siberian; diamond, northwestern Siberian; square, Central Asian.

Table 4. Y-STR Summary Statistics for Altaian Populations
                       Northern Altaian                                   Southern Altaian
Population             Chelkan          Kumandin         Tubalar          Altai-kizhi
# of samples           25               17               27               120
# of haplotypes        14               9                18               62
Haplotype diversity    0.910 ± 0.043    0.912 ± 0.042    0.954 ± 0.025    0.978 ± 0.005
Pairwise differences   6.59 ± 3.22      6.39 ± 3.19      7.40 ± 3.57      7.58 ± 3.56
  • 80. for these populations. Unfortunately, more complete Y-STR data sets were not available for other southern Sibe- rian populations. Nonetheless, these results indicated a different history for northern Altaians compared to Central Asians and even other southern Siberians. A specific reason for this difference is that Mongolians had a much greater genetic impact on southern Altaians, which is expected given the historical and linguistic evidence.18,19,105 Altaian and Native American Connections To test the hypothesis that Native Americans share a more recent common ancestor with Altaians relative to other Siberian and East Asian populations, we specifi- cally examined the mtDNA and NRY haplogroups that appeared in both locations. For the mtDNA, it is well known that haplogroups A–D and X largely make up the maternal genetic heritage of indigenous peoples in the Americas.27,29,39,47,106 Complete mtDNA genome sequenc- ing has led to a greater comprehension of the phylogeny of Native American mtDNAs and, consequently, a better understanding of their origins.107–110 Although Altaians possess the five primary mtDNA haplogroups found in the Americas, these lineages are not exactly the same as those appearing in Native Americans at the subhaplogroup level. This is also true for other Siberian populations except in those few instances where gene flow across the Bering Strait brought some low frequency types back to north- eastern Siberians. An example of this pattern is haplogroup C1a. Southern Altaians possessed C1a, which is an exclusively Asian branch of the predominately American C1 haplo- group.107,108 To date, only four complete C1a genomes have been published. These sequences produced a more recent TMRCA than other genetic evidence had previously suggested for the peopling of the Americas. 
Although Tamm et al.107 viewed this haplogroup as representing a back migration into Siberia, it does not occur in Siberian populations that are geographically closest to the Americas, but rather in those living in southern and southeastern Siberia.41,89 However, given the small effective population sizes of the northeastern Siberian groups that have been studied thus far, this haplogroup could have been lost because of drift.

Table 5. Low-Resolution NRY Haplogroup Frequency Comparison of Altaians
Hg             Chelkan  Kumandin  Tubalar  Altai-kizhi1  Altai-kizhi2  Teleut1  Teleut2  Shor
C              -        -         -        20.0          13.0          8.5      5.7      2.0
D              -        -         -        5.0           3.3           -        -        -
E              -        -         3.7      -             -             -        -        -
F (xJ,K)       -        -         3.7      -             3.3           10.7     -        2.0
J              -        -         -        2.5           2.2           2.1      -        -
K (xN1c,O,P)   24.0     52.9      11.1     1.7           2.2           -        -        13.7
N1c            -        -         -        2.5           5.4           10.6     28.6     2.0
O              -        -         3.7      1.7           -             -        -        -
P (xR1a1a)     60.0     35.3      40.7     16.7          28.3          -        34.3     2.0
R1a1a          16.0     11.8      37.0     50.0          42.4          68.1     31.4     78.4
Total          25       17        27       120           92            47       35       51

Figure 3. MDS Plot of RST Genetic Distances Generated from Y Chromosome STR Haplotypes in Siberian and Central Asian Populations. Circle, southern Siberian; diamond, northwestern Siberian; square, Central Asian.

The other mtDNA haplogroup found in northern and southern Altaians that is a close relative of a Native American lineage is D4b1a2a1a. This haplogroup has been found in Altaians, Shors, and Uzbeks from northwestern China.41,44,70 Analysis of complete mtDNA genomes identified a sister branch (D4b1a2a1a1), which is found only in northeastern Siberian populations and Inuit from Canada and Greenland.42,45,54,91,111 TMRCAs were calculated from the complete mtDNA genomes of this branch and those from Native American D4b1a2a1a1. By analyzing only synonymous mutations from these sequences with the method of Soares et al.,84
  • 81. we estimated the TMRCAs of these two branches at 11.8 kya and 15.8 kya, respectively. For the Y chromosome, indigenous American lineages are derived mostly from haplogroups C and Q, and, as such, are crucial for understanding of the genetic histories of peoples from the Americas and how they relate to populations of Central Asia and Siberia.9,39,93,98,112,113 Just as Seielstad et al.114 and Bortolini et al.38 used M242 to clarify the genetic relationship between Asian and American Y chromosomes, the characterization of this haplogroup at an even higher level of resolution has led to a much greater understanding of the origins of Native American Y chromosomes and their connections to Asian types. In this regard, it was recently shown that the American Q-M3 SNP is located on an M346-positive background.63 The presence of M346 in Central Asia and Siberia has strengthened the argument for a southern Siberian or Central Asian origin for many American Y chro- mosomes.85,99,102,115 Given the importance of haplogroup Q for Native American origins, we subjected samples from this lineage to high-resolution SNP analysis involving 37 biallelic markers to better understand the relationship between Old and New World populations and the migration(s) that connect them. All Y chromosomes in this study that belonged to haplogroup Q (as indicated by the presence of M242) were also found to have the P36.2, MEH2, L472, and L528 markers (Figure S1). Thus, these haplo- types fell into the Q1a branch of the Y chromosome phylogeny. Because Q1b Y chromosomes were not found in Altaian samples, we were not able to definitively place the L472 and L528 SNPs at the same phylogenetic position as MEH2. For this reason, their placement is tentative until L275/L314/M378 Y chromosomes are screened for these markers. Furthermore, M120/M265-positive, P48-positive, and P89-positive samples were not found in the Altai region. 
Therefore, the placement of these branches at the same phylogenetic level as M25/M143 and M346/L56/ L57 should also be considered as provisional (although see Karafet et al.63 ). The M346, L56, and L57 SNPs were positioned as ances- tral to three derived branches in the Family Tree DNA phylogeny. We found that the L474, L475, and L476 SNPs were present in all of our M346-positive samples. However, because M323- and L527/L529-positive samples were not found in the Altaians, we could not confirm the exact position of these markers at either the Q1a3 or Q1a3a level. On the other hand, all Altaians that possessed the M346, L56, L57, L474, L475, and L476 SNPs also had L53, L55, L213, and L331. Interestingly, northern and southern Altaian Q Y chro- mosomes differed by three markers. L54, L330, and L333 were found in Q haplotypes in the southern Altaians and one Altaian Kazakh, whereas the northern Altaians Q haplotypes lacked these derived SNPs. Thus, according to the standard nomenclature set by the Y Chromosome Consortium62 and followed by others, the northern Altaian Q haplotypes belonged to Q1a3a* and the southern Altaians belonged to Q1a3a1c*. We have further confirmed that M3 haplotypes belong to L54-derived Y chromosomes (unpublished data). These alterations in the phylogeny change the haplogroup name of the Native American Q-M3 Y chromosomes from Q1a3a to Q1a3a1a. Moreover, the position of M3 and L330/L333 in the phylogeny indis- putably showed that the MRCA of most Native American Y chromosomes was shared with southern Altaians. The differences between the northern and southern Altaian Q Y chromosomes were also reflected in the anal- ysis of high-resolution Y-STR haplotypes (Figure S2).116 Comparisons of Altaian Q-M346 Y chromosomes with those from southern Siberian, Central Asian, and East Asian populations revealed affinities between southern Altaian and these other groups. 
However, the northern Altaians remained distinctive, even in networks con- structed from fewer Y-STR loci (Figure S3). The time required to evolve the extent of haplotypic diversity observed in each of the subhaplogroups can aid in determining when particular mutations arose and possibly when these mutations were carried to other loca- tions. The TMRCA for the northern Altaian Q1a3a* Y chro- mosomes indicated a relatively recent origin for them, one dating to either the Bronze Age or recent historical period, depending on the Y-STR mutation rate being used (Table 6). The southern Altaian/Altaian Kazakh Q1a3a1c* Y chromo- somes had a slightly older TMRCA that dated them to either the late Neolithic or early Bronze Age. By using Bayesian analysis, we further estimated the divergence time of the two Q haplogroups at about 1,000 years after the TMRCA of all Altaian Q lineages (~20 kya), indicating an ancient separation of northern and southern Altaian Q Y chromosomes (Table 7). A similar analysis was conducted to determine when the L54 haplogroup arose and gave rise to M3 and L330/L334 subbranches. The indigenous American Y chromosomes used in this analysis were more diverse than those of southern Altaians. The resulting TMRCA for the South American Q1a3a1a* samples was 22.2 kya or 7.6 kya, depending on the mutation rate used. The divergence between the M3 and L330/L334 Y chromosomes was ~13.4 kya, with a TMRCA of 22.0 kya, via the evolutionary rate. By contrast, the TMRCA and divergence time via a pedigree-based mutation rate were 7.7 kya and 4.9 kya, respectively. The time required to generate the haplotypic diversity in the L54-positive Y chromosomes clearly showed that the evolutionary rate provided a more reasonable estimate. The Americas were inhabited well before 5–8 kya, based on various lines of evidence, making the use of the pedi- gree-based mutation rate questionable. 
The estimates generated with the evolutionary-based mutation rate provided times that are more congruent with the known prehistory of the Americas.117 They are also similar to the TMRCAs calculated for Native American mtDNA haplogroups.107,108
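The dependence of these dates on the chosen Y-STR mutation rate can be illustrated with a small check on the rho-based (Network) ages reported in Table 6. A rho-based age is proportional to rho divided by the mutation rate, so for a fixed rho the ratio of the evolutionary-rate age to the pedigree-rate age should be about the same for every lineage. The sketch below is only an illustration under that assumption: the ages are copied from Table 6, while the names `RHO_AGES` and `rate_ratios` are hypothetical and not part of the original study.

```python
# Rho-based (Network) TMRCA point estimates in years, copied from Table 6.
# Each entry is (pedigree-rate age, evolutionary-rate age).
# The dictionary and function names are illustrative assumptions.
RHO_AGES = {
    "All Q1a3a": (5390, 14970),
    "Q1a3a*": (1410, 3910),
    "Q1a3a1a*": (5820, 16170),
    "Q1a3a1c*": (2420, 6750),
}


def rate_ratios(ages):
    """Return the evolutionary/pedigree age ratio for each lineage."""
    return {hg: evo / ped for hg, (ped, evo) in ages.items()}


if __name__ == "__main__":
    for hg, ratio in rate_ratios(RHO_AGES).items():
        print(f"{hg}: evolutionary/pedigree age ratio = {ratio:.2f}")
```

All four ratios come out at roughly 2.8, which is what one would expect if the two sets of rho-based ages differ only by the mutation rate applied. The same simple scaling is not expected to hold for the Batwing medians, which come from a Bayesian analysis.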
  • 82. Discussion Origins of Northern and Southern Altaians In this paper, we characterized mtDNA and NRY variation in northern and southern Altaians to better understand their population histories and elucidate the genetic relationship between Altaians and Native American popu- lations. The evidence from the mtDNA and NRY data supports the hypothesis that northern and southern Altaians generally formed out of separate gene pools. This complex genetic history involves repeated migrations into (and probably out of) the Altai-Sayan region. In addi- tion, the histories as revealed by these data added nuances that could not be attained through low-resolution charac- terization alone. The NRY data provided the clearest evidence for a signif- icant genetic difference between the two sets of Altaian ethnic groups. Although sharing certain NRY haplogroups, the two population groups differed in the frequencies of these lineages, and, more importantly, shared few haplo- types with them. By contrast, northern and southern pop- ulations shared considerably more mtDNA haplotypes, indicating that some degree of gene flow had occurred between them, albeit in a sex-specific manner. As seen in other populations from Siberia and Central Asia, the patri- lineality of these groups probably helped to shape this difference in patterns of mtDNA and Y-chromosomal vari- ation.64,118 In addition, each northern Altaian ethnic group showed different genetic relationships with the Altai-kizhi. The Tubalars consistently grouped closer to the Altai-kizhi than the other two northern Altaians based on both mtDNA and NRY data. Thus, the higher genetic diversity of mtDNA and NRY haplotypes in the Tubalars is probably the result of admixture with other groups, such as southern Altaians. The Chelkans, on the other hand, have the most divergent set of mtDNAs of the three popu- lations. 
Table 6. TMRCAs and Expansion Times for Altaian and Native American NRY Haplogroup Q Lineages
                     Network           Batwing - TMRCA              Batwing - Expansion
Hg          N        ρ ± σ             Median   95% C.I.            Median   95% C.I.
Pedigree-Based Mutation Rate
All Q1a3a   97       5,390 ± 1,000     8,420    [5,620–14,290]      7,230    [1,220–20,510]
Q1a3a*      25       1,410 ± 580       1,480    [680–3,060]         2,100    [380–6,830]
Q1a3a1a*    52       5,820 ± 1,280     7,630    [4,870–12,920]      4,680    [480–14,940]
Q1a3a1c*    20       2,420 ± 700       2,970    [1,500–5,960]       2,680    [450–8,610]
Evolutionary-Based Mutation Rate
All Q1a3a   97       14,970 ± 2,760    25,580   [14,230–51,140]     17,220   [1,380–54,950]
Q1a3a*      25       3,910 ± 1,610     5,320    [2,300–12,160]      4,340    [1,000–13,080]
Q1a3a1a*    52       16,170 ± 3,550    22,160   [11,960–44,340]     9,800    [620–39,543]
Q1a3a1c*    20       6,750 ± 1,950     8,720    [3,960–20,010]      5,600    [1,030–17,910]
Note: ρ, rho statistic; σ, standard error; Q1a3a*, Northern Altaians (this study); Q1a3a1a, Native Americans (Geppert et al.76); Q1a3a1c, Southern Altaians (this study).

Table 7. Divergence Times between Haplogroups/Populations
                                         TMRCA                                Split Time
                                         Median   95% Confidence Interval     Median   95% Confidence Interval
Pedigree-Based Mutation Rate
Northern and Southern Altaians           5,490    [3,000–11,100]              4,490    [1,730–10,070]
Southern Altaians and Native Americans   7,740    [5,170–12,760]              4,950    [2,360–9,490]
Evolutionary-Based Mutation Rate
Northern and Southern Altaians           21,890   [9,900–57,440]              19,260   [7,060–54,600]
Southern Altaians and Native Americans   21,960   [12,260–42,690]             13,420   [5,220–30,430]

Mismatch analysis and tests of neutrality indicated that the Chelkans show signs of decreasing population size or population structure. Long-term endogamy has probably also played a role in shifting the pattern of mtDNA diversity in Chelkans from that seen in other northern Altaians. Because of this endogamy (and genetic drift), only a few lineages attained high frequencies, resulting
  • 83. in reduced mtDNA diversity. Based on the NRY data, the Kumandins were distinct from both the Chelkans and Tubalars, who were composed of mostly the same set of lineages. Thus, the genetic diversity in northern Altaians is structured by ethnic group membership, and, therefore, can be viewed as reflecting distinctive histories for each population. Not much is known about the ethnogenesis of northern Altaians. However, it has been suggested that they descended from groups that historically lived around the Yenisei River and spoke either southern Samoyedic, Ugric, or Yeniseian languages.18,19 These populations are the same ones that later contributed to the formation of the Kets, Selk’ups, Shors, and Khakass in northwestern Siberia and the western Sayans of southern Siberia.4,105 Further- more, the Chelkans and Tubalars possess a large number of Q1a3a* Y chromosomes with dramatically different STR profiles compared to other southern Siberians (Altai- kizhi and Tuvinians) and Mongolians. Thus, it is possible that similar lineages will be found in the Kets and/or Sel’kups, where high frequencies of Q1-P36 have already been noted.119 Should this be the case, it would provide additional evidence for northern Altaians having common ancestry with Samoyedic, Yeniseian, and Ugric speakers. In fact, Chelkans and Kumandins also have N-P43 Y chromo- somes very similar to ones found in the Ugric-speaking Khanty. Regardless, there is notable genetic discontinuity between northern Altaians and other Turkic-speaking people of southern Siberia. Southern Altaians share greater affinities with Mongo- lians and Central Asians than they do with northern Altaians. This is partly because of the high frequencies of Y chromosome haplogroup C in these groups. 
In fact, present-day Kyrgyz are nearly indistinguishable from the Altai-kizhi based on their NRY haplogroup profile.120,121 They share similar C-M217 and R-M417 lineages with the Altai-kizhi, suggesting a recent common ancestry for the two groups, which further supports the theory of a recent common ancestry among southern Siberians and Kyrgyz.122 As evident in the disparities in genetic history between northern and southern Altaians, the Altai has served as a long-term genetic boundary zone. These disparities reflect the different sources of genetic lineages and spheres of interaction for both groups. The northern Altaians share clan names, similar languages, subsistence strategies, and other cultural elements with populations that today live farther to the north.4 By contrast, southern Altaians share these same features with populations in Central Asia, mostly with Turkic- (Kipchak) but also Mongolic-speaking peoples. Thus, the geography of the Altai (taiga versus steppe) has helped to maintain these cultural and biolog- ical (mtDNA, Y chromosome, and cranial-morphological) differences. Furthermore, no evidence of Denisovan or Neanderthal ancestry was found in the Altaian mtDNA and Y chromo- some data. However, this does not preclude such admix- ture in the autosomes of Altaian populations. Greater numbers of derived Denisovan SNPs were found in some southeastern Asian and Oceanian populations, although native Siberians were not included in that study.123 There- fore, this issue requires further investigation. Native American Origins Many earlier genetic studies looked for the origins of Native Americans among the indigenous peoples of Sibe- ria, Mongolia, and East Asia. Often, the identification of source populations conflicted between studies, depending largely on the loci or samples being studied. 
Cranial morphology has been used to demonstrate a connection between the Native Americans and Siberian popula- tions.124,125 Various researchers have suggested sources such as the Baikal region of southern Siberia, the Amur region of southeastern Siberia, and more generally Eurasia and East Asia.126–128 A study of autosomal loci also showed an affinity between populations in the New World and Siberian regions but did not attempt to pinpoint a partic- ular area of Siberia as the source area.129 In addition, mtDNA studies have suggested New World origins from a number of different locations including different parts of Siberia, Mongolia, and northern China.34,41–45,47,71,130 Our own analysis of Altaian mtDNAs showed that the five primary haplogroups (A–D, X) were present among these populations. However, Altaian populations (and generally all Siberian populations outside of Chukotka) lack mtDNA haplotypes that are identical to those appear- ing in the Americas. The only exceptions are the Selk’ups and Evenks who bear A2 haplotypes, with their presence in those groups being explained as a result of a back migra- tion to northeast Asia.107 Despite the general absence of Native American haplo- types in southern Siberia, there are sister branches whose MRCAs are shared with those in Native Americans. One such lineage is C1a, which was found in two Altai-kizhi individuals and has also been observed at low frequencies in Mongolia, southeastern Siberia, and Japan.44,46,55,71 Tamm et al.107 attribute its presence in northeast Asia to a back migration from the New World, where haplogroups C1b–d are prevalent, whereas Starikovskaya et al.44 argue that C1a and C1b arose in the Amur region, with C1b migrating to the Americas later. A similar lineage is D4b1a2a1a, a sister branch to D4b1a2a1a1, which is found in northern North America. 
Although both of these line- ages date to around 15,000 years ago, additional mitoge- nome sequences from these haplogroups are needed to estimate more precise TMRCAs for them and thereby delineate their putative Asian and American origins. Results obtained from the Y chromosome analysis support the view that southern Siberians and Native Americans share a common source.8,9,11,38,131 This con- nection was initially suggested by a low-level Y-SNP resolution and an alphoid heteroduplex system by Santos et al.8 Subsequently, Zegura et al.11 showed a similarity in NRY Q and C types among southern Altaians and Native The American Journal of Human Genetics 90, 229–246, February 10, 2012 239
  • 84. Americans by using only fast-evolving Y-STR loci and, again, low-level Y-SNP resolution. We focused on haplogroup Q in this study because of the greater number of new mutations published for this branch and corresponding levels of Y-STR resolution (15–17 loci), which are currently lacking for published Native American haplogroup C Y chromosomes. This high-resolution characterization is critical because it allows for a more accurate dating of TMRCAs and estimates of divergence between the ancestors of Native Americans and indigenous Siberians. For example, with this approach, Seielstad et al.114 dated the origin of M242, which defines the NRY haplogroup Q, and, in turn, provided a more accurate upper bound to the timing of the initial peopling of the Western Hemisphere. Several studies have shown that the American-specific Q-M3 arose on an M346-positive Y chromosome.63,115,132 The M346 marker was also discovered in Altaians and other Siberian populations.102,116 However, it has a broad geographic distribution, being found in Siberia, Central Asia, East Asia, India, and Pakistan, albeit at lower frequencies.85,99 We have shown that southern Altaian M346 Y chromosomes also possess L54, a SNP marker that is also shared by Native Americans who have the M3 marker and which is more derived than M346. Because L54 is found in both Siberia and the Americas, it most probably defines the initial founder haplogroup from which M3 later developed. Our coalescence analysis suggests that the two derived branches of L54 (M3 and L330/L334) diverged soon after this mutation arose. Estimates using the evolutionary Y-STR mutation rate place the origin of this marker at around 22,000 years ago, with the two branches diverging at roughly 13,400 years ago. Although the 95% confidence intervals for the Bayesian analyses are broad, the median values of the TMRCAs estimated with this method closely match those obtained through the analysis with rho statistics. 
In addition, the coalescence estimates of northern and southern Altaian Q Y chromosomes show that they, too, are similar to the overall TMRCA estimates. This concor- dance suggests that a rapid expansion probably occurred for this particular Y chromosome branch around 15,000– 20,000 years ago. Given previous estimates for the timing of the initial peopling of the Americas, this scenario seems plausible, because these estimates fall in line with recent estimates of indigenous American mitogenomes.107,133 As in any study, there are limitations to this analysis. The primary issues are the accuracy and precision of using microsatellites for dating origins and dispersals of haplo- types. The stochastic nature of mutational accumulation will continue to be a source of some uncertainty in any attempt at dating TMRCAs. For this reason, the question of which Y-STR mutation rate to use for coalescence esti- mates has been debated.88,134,135 In this study, the evolu- tionary rate seems the most realistic, because estimates generated with the pedigree rate provided times that are much too recent, given what is known about the peopling of the New World from nongenetic studies.117 There is no evidence that the majority of Native Americans (men with Q-M3 Y chromosomes) derived from a migration less than 8 kya, as would be suggested from the TMRCAs calculated with the pedigree rate. 
However, other studies have used the pedigree mutation rate to explore historical events with great effect—the most-well-known case being the Genghis Khan star cluster.136 It is possible that such rates are, like that of the mtDNA, time dependent or that the Y chromosomes to which the Y-STRs are linked have been affected by purifying selection.84,133,137,138 In this regard, the pedigree-based mutation rate would be more appropriately used with lower diversity estimates, reflect- ing recent historical events, while the evolutionary rate would be used in scenarios with higher diversity estimates, reflecting more ancient phenomena. Although beyond the scope of this paper, it is likely that the Y-STR mutation rate follows a similarly shaped curve as that of the mitochon- drial genome. Furthermore, haplogroup divergence dates need not (and mostly do not) equate with population divergence dates. In this case, however, the mutations defining the southern Altaian and Native American branches of the Q-L54 lineage most probably arose after their ancestral populations split, given the geographic exclusivity of each derived marker. Yet, sample sets that are not entirely representative of a derived branch could potentially skew the coalescent results. In all likelihood, the L54 marker will be found in other southern Siberian populations, because southern Altaians show some genetic affinities with Tuvinians and other populations from the eastern Sayan region. Even so, the consistency of TMRCA esti- mates and the divergence dates for the different Q branches examined here suggest that our data sets are suffi- ciently representative. 
Moreover, even though the M3 haplotypes used in this analysis came exclusively from indigenous Ecuadorian populations, the diversity found within this data set is similar to previous estimates of the age of the Q-M3 haplogroup.11 Although different lines of evidence point to different source populations for Native Americans, the alternatives need not be exclusive. The effects of historical and demo- graphic events and evolutionary processes, particularly recent gene flow, have shaped modern-day populations such that we should not expect that any one population in the Old World would show the same genetic composi- tion as populations in the New World. That (an) ancestral population(s) probably differentiated into the numerous populations of Siberia and Central Asia, which have inter- acted over the past 15,000 years, is not lost on us. Historical expansions of people and the effects of animal and plant domestication have played critical roles in shaping the genetics of both Old and New World populations, particu- larly in the past several thousand years. Modern popula- tions have complex, local histories that need to be under- stood if these are to be used in larger interregional (or biomedical) analyses. Through the use of phylogeographic 240 The American Journal of Human Genetics 90, 229–246, February 10, 2012
  • 85. methods, we can attain a better understanding of these populations for such purposes. It is through this type of approach that it becomes quite clear that southern Altaians and Native Americans share a recent common paternal ancestor.

Supplemental Data

Supplemental Data include three figures and six tables and can be found with this article online at http://www.cell.com/AJHG/.

Acknowledgments

The authors would like to thank all of the indigenous Altaian participants for their involvement in this study. We also thank Fabricio Santos for his careful review of and helpful suggestions for the manuscript, and two anonymous reviewers for their constructive comments. In addition, we would like to acknowledge the people who facilitated and provided assistance with our field research in the Altai Republic. They include Vasiliy Semënovich Palchikov, the staff of the Biochemistry Lab at the Turochak Hospital, Dr. Maria Nikolaevna Trishina, Vitaliy Trishin, Alexander A. Guryanov, the staff of the Native Affairs office in Gorniy Altaiask, Galina Nikolaevna Makhalina, and Tatiana Kunduchinovna Babrasheva. In addition, we received help from a number of people living in local villages around the Turochakskiy Raion, particularly Alexander Adonyov. This project was supported by funds from the University of Pennsylvania (T.G.S.), the National Science Foundation (BCS-0726623) (T.G.S., M.C.D.), the Social Sciences and Humanities Research Council of Canada (MCRI 412-2005-1004) (T.G.S.), and the Russian Basic Fund for Research (L.P.O.). T.G.S. would also like to acknowledge the infrastructural support provided by the National Geographic Society.
Received: September 15, 2011
Revised: December 6, 2011
Accepted: December 19, 2011
Published online: January 26, 2012

Web Resources

The URLs for data presented herein are as follows:
Arlequin, version 3.11, http://cmpg.unibe.ch/software/arlequin3/
Batwing, http://www.mas.ncl.ac.uk/~nijw/
Network, version 4.6.0.0, http://www.fluxus-engineering.com/sharenet.htm
Network Publisher, version 1.3.0.0, http://www.fluxus-engineering.com/nwpub.htm
Y-DNA Haplogroup Tree 2011, version 6.46, http://www.isogg.org/tree

References

1. Goebel, T. (1999). Pleistocene human colonization of Siberia and peopling of the Americas: An ecological approach. Evol. Anthropol. 8, 208–227.
2. Gryaznov, M.P. (1969). The Ancient Civilization of Southern Siberia (New York: Cowles Book Company, Inc.).
3. Okladnikov, A.P. (1964). Ancient population of Siberia and its culture. In The Peoples of Siberia, M.G. Levin and L.P. Potapov, eds. (Chicago: The University of Chicago Press), pp. 13–98.
4. Levin, M.G., and Potapov, L.P. (1964). The Peoples of Siberia (Chicago: University of Chicago Press).
5. Reich, D., Green, R.E., Kircher, M., Krause, J., Patterson, N., Durand, E.Y., Viola, B., Briggs, A.W., Stenzel, U., Johnson, P.L.F., et al. (2010). Genetic history of an archaic hominin group from Denisova Cave in Siberia. Nature 468, 1053–1060.
6. Krause, J., Fu, Q., Good, J.M., Viola, B., Shunkov, M.V., Derevianko, A.P., and Pääbo, S. (2010). The complete mitochondrial DNA genome of an unknown hominin from southern Siberia. Nature 464, 894–897.
7. Krause, J., Orlando, L., Serre, D., Viola, B., Prüfer, K., Richards, M.P., Hublin, J.J., Hänni, C., Derevianko, A.P., and Pääbo, S. (2007). Neanderthals in central Asia and Siberia. Nature 449, 902–904.
8. Santos, F.R., Pandya, A., Tyler-Smith, C., Pena, S.D., Schanfield, M., Leonard, W.R., Osipova, L., Crawford, M.H., and Mitchell, R.J. (1999). The central Siberian origin for native American Y chromosomes. Am. J. Hum. Genet. 64, 619–628.
9. 
Karafet, T.M., Zegura, S.L., Posukh, O., Osipova, L., Bergen, A., Long, J., Goldman, D., Klitz, W., Harihara, S., de Knijff, P., et al. (1999). Ancestral Asian source(s) of new world Y-chromosome founder haplotypes. Am. J. Hum. Genet. 64, 817–831. 10. Lell, J.T., Sukernik, R.I., Starikovskaya, Y.B., Su, B., Jin, L., Schurr, T.G., Underhill, P.A., and Wallace, D.C. (2002). The dual origin and Siberian affinities of Native American Y chro- mosomes. Am. J. Hum. Genet. 70, 192–206. 11. Zegura, S.L., Karafet, T.M., Zhivotovsky, L.A., and Hammer, M.F. (2004). High-resolution SNPs and microsatellite haplo- types point to a single, recent entry of Native American Y chromosomes into the Americas. Mol. Biol. Evol. 21, 164–175. 12. Anthony, D.W. (2007). The Horse, the Wheel, and Language: How Bronze-Age Riders from the Eurasian Steppes Shaped the Modern World (Princeton, N.J.: Princeton University Press). 13. Kuzmina, E.E., and Mair, V.H. (2008). The Prehistory of the Silk Road (Philadelphia: University of Pennsylvania Press). 14. Rudenko, S.I. (1970). Frozen Tombs of Siberia, the Pazyryk Burials of Iron Age Horsemen (Berkeley: University of California Press). 15. David-Kimball J., Bashilov V.A., and Yablonsky L.T., eds. (1995). Nomads of the Eurasian Steppes in the Early Iron Age (Berkeley, CA: Zinat Press). 16. Golden, P.B. (1992). An Introduction to the History of the Turkic Peoples: Ethnogenesis and State-Formation in Medieval and Early Modern Eurasia and the Middle East (Wiesbaden: Otto Harrassowitz). 17. Grousset, R. (1970). The Empire of the Steppes: A History of Central Asia (New Brunswick, N.J.: Rutgers University Press). 18. Potapov, L.P. (1962). The origins of the Altayans. In Studies in Siberian Ethnogenesis, H.N. Michael, ed. (Toronto: University of Toronto Press), pp. 169–196. 19. Potapov, L.P. (1964). The Altays. In The Peoples of Siberia, M.G. Levin and L.P. Potapov, eds. (Chicago: University of Chicago Press), pp. 305–341. 20. Menges, K.H. (1968). 
The Turkic Languages and Peoples: An Introduction to Turkic Studies (Wiesbaden: Otto Harras- sowitz). The American Journal of Human Genetics 90, 229–246, February 10, 2012 241
  • 86. 21. Levin, M.G. (1964). The anthropological types of Siberia. In The Peoples of Siberia, M.G. Levin and L.P. Potapov, eds. (Chicago: The University of Chicago Press), pp. 99–104. 22. Osipova, L.P., and Sukernik, R.I. (1978). [Polymorphism of immunoglobulin Gm- and Km-allotypes in northern Altaians (western Sibiria)]. Genetika 14, 1272–1275. 23. Posukh, O.L., Osipova, L.P., Kashinskaia, IuO., Ivakin, E.A., Kriukov, IuA., Karafet, T.M., Kazakovtseva, M.A., Skobel’tsina, L.M., Crawford, M.G., Lefranc, M.P., and Lefranc, G. (1998). [Genetic analysis of the South Altaian population of the Mendur-Sokkon village, Altai Republic]. Genetika 34, 106–113. 24. Sukernik, R.I., Karafet, T.M., Abanina, T.A., Korostyshevskiĭ, M.A., and Bashlaĭ, A.G. (1977). [Genetic structure of 2 iso- lated populations of native inhabitants of Sibiria (Northern Altaics) according to the results of a study of blood groups and isoenzymes]. Genetika 13, 911–918. 25. Sukernik, R.I., Shur, T.G., Starikovskaia, E.B., and Uolles, D.K. (1996). [Mitochondrial DNA variation in native inhabitants of Siberia with reconstructions of the evolutional history of the American Indians. Restriction polymorphism]. Genetika 32, 432–439. 26. Shields, G.F., Schmiechen, A.M., Frazier, B.L., Redd, A., Voevoda, M.I., Reed, J.K., and Ward, R.H. (1993). mtDNA sequences suggest a recent evolutionary divergence for Beringian and northern North American populations. Am. J. Hum. Genet. 53, 549–562. 27. Torroni, A., Schurr, T.G., Yang, C.C., Szathmary, E.J., Williams, R.C., Schanfield, M.S., Troup, G.A., Knowler, W.C., Lawrence, D.N., Weiss, K.M., et al. (1992). Native American mitochondrial DNA analysis indicates that the Amerind and the Nadene populations were founded by two independent migrations. Genetics 130, 153–162. 28. Wallace, D.C., and Torroni, A. (1992). American Indian prehistory as written in the mitochondrial DNA: a review. Hum. Biol. 64, 403–416. 29. 
Torroni, A., Schurr, T.G., Cabell, M.F., Brown, M.D., Neel, J.V., Larsen, M., Smith, D.G., Vullo, C.M., and Wallace, D.C. (1993). Asian affinities and continental radiation of the four founding Native American mtDNAs. Am. J. Hum. Genet. 53, 563–590. 30. Torroni, A., Sukernik, R.I., Schurr, T.G., Starikorskaya, Y.B., Cabell, M.F., Crawford, M.H., Comuzzie, A.G., and Wallace, D.C. (1993). mtDNA variation of aboriginal Siberians reveals distinct genetic affinities with Native Americans. Am. J. Hum. Genet. 53, 591–608. 31. Forster, P., Harding, R., Torroni, A., and Bandelt, H.J. (1996). Origin and evolution of Native American mtDNA variation: a reappraisal. Am. J. Hum. Genet. 59, 935–945. 32. Merriwether, D.A., and Ferrell, R.E. (1996). The four founding lineage hypothesis for the New World: a critical reevaluation. Mol. Phylogenet. Evol. 5, 241–246. 33. Bonatto, S.L., and Salzano, F.M. (1997). Diversity and age of the four major mtDNA haplogroups, and their implications for the peopling of the New World. Am. J. Hum. Genet. 61, 1413–1423. 34. Merriwether, D.A., Hall, W.W., Vahlne, A., and Ferrell, R.E. (1996). mtDNA variation indicates Mongolia may have been the source for the founding population for the New World. Am. J. Hum. Genet. 59, 204–212. 35. Neel, J.V., Biggar, R.J., and Sukernik, R.I. (1994). Virologic and genetic studies relate Amerind origins to the indigenous people of the Mongolia/Manchuria/southeastern Siberia region. Proc. Natl. Acad. Sci. USA 91, 10737–10741. 36. Karafet, T.M., Zegura, S.L., Vuturo-Brady, J., Posukh, O., Osipova, L., Wiebe, V., Romero, F., Long, J.C., Harihara, S., Jin, F., et al. (1997). Y chromosome markers and Trans-Bering Strait dispersals. Am. J. Phys. Anthropol. 102, 301–314. 37. Lell, J.T., Brown, M.D., Schurr, T.G., Sukernik, R.I., Starikov- skaya, Y.B., Torroni, A., Moore, L.G., Troup, G.M., and Wallace, D.C. (1997). 
Y chromosome polymorphisms in native American and Siberian populations: identification of native American Y chromosome haplotypes. Hum. Genet. 100, 536–543. 38. Bortolini, M.C., Salzano, F.M., Thomas, M.G., Stuart, S., Nasanen, S.P., Bau, C.H., Hutz, M.H., Layrisse, Z., Petzl-Erler, M.L., Tsuneto, L.T., et al. (2003). Y-chromosome evidence for differing ancient demographic histories in the Americas. Am. J. Hum. Genet. 73, 524–539. 39. Schurr, T.G., and Sherry, S.T. (2004). Mitochondrial DNA and Y chromosome diversity and the peopling of the Americas: evolutionary and demographic evidence. Am. J. Hum. Biol. 16, 420–439. 40. Derenko, M.V., Malyarchuk, B., Denisova, G.A., Wozniak, M., Dambueva, I., Dorzhu, C., Luzina, F., Miscicka-Sliwka, D., and Zakharov, I. (2006). Contrasting patterns of Y-chromo- some variation in South Siberian populations from Baikal and Altai-Sayan regions. Hum. Genet. 118, 591–604. 41. Derenko, M.V., Malyarchuk, B., Grzybowski, T., Denisova, G., Dambueva, I., Perkova, M., Dorzhu, C., Luzina, F., Lee, H.K., Vanecek, T., et al. (2007). Phylogeographic analysis of mito- chondrial DNA in northern Asian populations. Am. J. Hum. Genet. 81, 1025–1041. 42. Volodko, N.V., Starikovskaya, E.B., Mazunin, I.O., Eltsov, N.P., Naidenko, P.V., Wallace, D.C., and Sukernik, R.I. (2008). Mitochondrial genome diversity in arctic Siberians, with particular reference to the evolutionary history of Beringia and Pleistocenic peopling of the Americas. Am. J. Hum. Genet. 82, 1084–1100. 43. Derenko, M.V., Grzybowski, T., Malyarchuk, B.A., Dam- bueva, I.K., Denisova, G.A., Czarny, J., Dorzhu, C.M., Kakpa- kov, V.T., Miscicka-Sliwka, D., Wozniak, M., and Zakharov, I.A. (2003). Diversity of mitochondrial DNA lineages in South Siberia. Ann. Hum. Genet. 67, 391–411. 44. Starikovskaya, E.B., Sukernik, R.I., Derbeneva, O.A., Volodko, N.V., Ruiz-Pesini, E., Torroni, A., Brown, M.D., Lott, M.T., Hosseini, S.H., Huoponen, K., and Wallace, D.C. (2005). 
Mitochondrial DNA diversity in indigenous populations of the southern extent of Siberia, and the origins of Native American haplogroups. Ann. Hum. Genet. 69, 67–89. 45. Starikovskaya, Y.B., Sukernik, R.I., Schurr, T.G., Kogelnik, A.M., and Wallace, D.C. (1998). mtDNA diversity in Chukchi and Siberian Eskimos: implications for the genetic history of Ancient Beringia and the peopling of the New World. Am. J. Hum. Genet. 63, 1473–1491. 46. Schurr, T.G., and Wallace, D.C. (2003). Genetic prehistory of Paleoasiatic-speaking populations of northeastern Siberia and their relationships to Native Americans. In Constructing cultures then and now: celebrating Franz Boas and the Jesup North Pacific Expedition, L. Kendall and I. Krupnik, eds. (Washington, D.C.: Arctic Studies Center, National Museum of Natural History, Smithsonian Institution), pp. 239–258. 47. Schurr, T.G., Ballinger, S.W., Gan, Y.Y., Hodge, J.A., Merri- wether, D.A., Lawrence, D.N., Knowler, W.C., Weiss, K.M., 242 The American Journal of Human Genetics 90, 229–246, February 10, 2012
  • 87. and Wallace, D.C. (1990). Amerindian mitochondrial DNAs have rare Asian mutations at high frequencies, suggesting they derived from four primary maternal lineages. Am. J. Hum. Genet. 46, 613–623. 48. Macaulay, V., Richards, M., Hickey, E., Vega, E., Cruciani, F., Guida, V., Scozzari, R., Bonne´-Tamir, B., Sykes, B., and Torroni, A. (1999). The emerging tree of West Eurasian mtDNAs: a synthesis of control-region sequences and RFLPs. Am. J. Hum. Genet. 64, 232–249. 49. Richards, M., Macaulay, V., Hickey, E., Vega, E., Sykes, B., Guida, V., Rengo, C., Sellitto, D., Cruciani, F., Kivisild, T., et al. (2000). Tracing European founder lineages in the Near Eastern mtDNA pool. Am. J. Hum. Genet. 67, 1251–1276. 50. Torroni, A., Bandelt, H.J., D’Urbano, L., Lahermo, P., Moral, P., Sellitto, D., Rengo, C., Forster, P., Savontaus, M.L., Bonne´-Tamir, B., and Scozzari, R. (1998). mtDNA analysis reveals a major late Paleolithic population expansion from southwestern to northeastern Europe. Am. J. Hum. Genet. 62, 1137–1152. 51. Torroni, A., Huoponen, K., Francalacci, P., Petrozzi, M., Morelli, L., Scozzari, R., Obinu, D., Savontaus, M.L., and Wallace, D.C. (1996). Classification of European mtDNAs from an analysis of three European populations. Genetics 144, 1835–1850. 52. Torroni, A., Lott, M.T., Cabell, M.F., Chen, Y.S., Lavergne, L., and Wallace, D.C. (1994). mtDNA and the origin of Cauca- sians: identification of ancient Caucasian-specific haplo- groups, one of which is prone to a recurrent somatic duplica- tion in the D-loop region. Am. J. Hum. Genet. 55, 760–776. 53. Kivisild, T., Tolk, H.V., Parik, J., Wang, Y., Papiha, S.S., Bandelt, H.J., and Villems, R. (2002). The emerging limbs and twigs of the East Asian mtDNA tree. Mol. Biol. Evol. 19, 1737–1751. 54. Schurr, T.G., Sukernik, R.I., Starikovskaya, Y.B., and Wallace, D.C. (1999). Mitochondrial DNA variation in Koryaks and Itel’men: population replacement in the Okhotsk Sea-Bering Sea region during the Neolithic. 
Am. J. Phys. Anthropol. 108, 1–39. 55. Tanaka, M., Cabrera, V.M., Gonza´lez, A.M., Larruga, J.M., Takeyasu, T., Fuku, N., Guo, L.J., Hirose, R., Fujita, Y., Kurata, M., et al. (2004). Mitochondrial genome variation in eastern Asia and the peopling of Japan. Genome Res. 14 (10A), 1832– 1850. 56. Yao, Y.G., Kong, Q.P., Bandelt, H.J., Kivisild, T., and Zhang, Y.P. (2002). Phylogeographic differentiation of mitochon- drial DNA in Han Chinese. Am. J. Hum. Genet. 70, 635–651. 57. Gokcumen, O., Dulik, M.C., Pai, A.A., Zhadanov, S.I., Rubin- stein, S., Osipova, L.P., Andreenkov, O.V., Tabikhanova, L.E., Gubina, M.A., Labuda, D., and Schurr, T.G. (2008). Genetic variation in the enigmatic Altaian Kazakhs of South-Central Russia: insights into Turkic population history. Am. J. Phys. Anthropol. 136, 278–293. 58. Rubinstein, S., Dulik, M.C., Gokcumen, O., Zhadanov, S., Osipova, L., Cocca, M., Mehta, N., Gubina, M., Posukh, O., and Schurr, T.G. (2008). Russian Old Believers: genetic conse- quences of their persecution and exile, as shown by mito- chondrial DNA evidence. Hum. Biol. 80, 203–237. 59. van Oven, M., and Kayser, M. (2009). Updated comprehen- sive phylogenetic tree of global human mitochondrial DNA variation. Hum. Mutat. 30, E386–E394. 60. Anderson, S., Bankier, A.T., Barrell, B.G., de Bruijn, M.H., Coulson, A.R., Drouin, J., Eperon, I.C., Nierlich, D.P., Roe, B.A., Sanger, F., et al. (1981). Sequence and organization of the human mitochondrial genome. Nature 290, 457–465. 61. Andrews, R.M., Kubacka, I., Chinnery, P.F., Lightowlers, R.N., Turnbull, D.M., and Howell, N. (1999). Reanalysis and revi- sion of the Cambridge reference sequence for human mito- chondrial DNA. Nat. Genet. 23, 147. 62. Y Chromosome Consortium. (2002). A nomenclature system for the tree of human Y-chromosomal binary haplogroups. Genome Res. 12, 339–348. 63. Karafet, T.M., Mendez, F.L., Meilerman, M.B., Underhill, P.A., Zegura, S.L., and Hammer, M.F. (2008). 
New binary polymor- phisms reshape and increase resolution of the human Y chro- mosomal haplogroup tree. Genome Res. 18, 830–838. 64. Dulik, M.C., Osipova, L.P., and Schurr, T.G. (2011). Y-chro- mosome variation in Altaian Kazakhs reveals a common paternal gene pool for Kazakhs and the influence of Mongo- lian expansions. PLoS ONE 6, e17548. 65. Cox, M.P. (2006). Minimal hierarchical analysis of global human Y-chromosome SNP diversity by PCR-RFLP. Anthro- pol. Sci. 114, 69–74. 66. Derbeneva, O.A., Starikovskaia, E.B., Volod’ko, N.V., Wallace, D.C., and Sukernik, R.I. (2002). [Mitochondrial DNA varia- tion in Kets and Nganasans and the early peoples of Northern Eurasia]. Genetika 38, 1554–1560. 67. Derbeneva, O.A., Starikovskaya, E.B., Wallace, D.C., and Sukernik, R.I. (2002). Traces of early Eurasians in the Mansi of northwest Siberia revealed by mitochondrial DNA analysis. Am. J. Hum. Genet. 70, 1009–1014. 68. Pimenoff, V.N., Comas, D., Palo, J.U., Vershubsky, G., Kozlov, A., and Sajantila, A. (2008). Northwest Siberian Khanty and Mansi in the junction of West and East Eurasian gene pools as revealed by uniparental markers. Eur. J. Hum. Genet. 16, 1254–1264. 69. Comas, D., Calafell, F., Mateu, E., Pe´rez-Lezaun, A., Bosch, E., Martı´nez-Arias, R., Clarimon, J., Facchini, F., Fiori, G., Luiselli, D., et al. (1998). Trading genes along the silk road: mtDNA sequences and the origin of central Asian popula- tions. Am. J. Hum. Genet. 63, 1824–1838. 70. Yao, Y.G., Kong, Q.P., Wang, C.Y., Zhu, C.L., and Zhang, Y.P. (2004). Different matrilineal contributions to genetic struc- ture of ethnic groups in the silk road region in china. Mol. Biol. Evol. 21, 2265–2280. 71. Kolman, C.J., Sambuughin, N., and Bermingham, E. (1996). Mitochondrial DNA analysis of Mongolian populations and implications for the origin of New World founders. Genetics 142, 1321–1334. 72. Xue, Y., Zerjal, T., Bao, W., Zhu, S., Shu, Q., Xu, J., Du, R., Fu, S., Li, P., Hurles, M.E., et al. (2006). 
Male demography in East Asia: a north-south contrast in human population expansion times. Genetics 172, 2431–2439. 73. Khar’kov, V.N., Medvedeva, O.F., Luzina, F.A., Kolbasko, A.V., Gafarov, N.I., Puzyrev, V.P., and Stepanov, V.A. (2009). [Comparative characteristics of the gene pool of Teleuts inferred from Y-chromosomal marker data]. Genetika 45, 1132–1142. 74. Khar’kov, V., Khamina, K., Medvedeva, O., Shtygasheva, O., and Stepanov, V. (2011). Genetic diversity of the Khakass gene pool: Subethnic differentiation and the structure of Y-chromosome haplogroups. Mol. Biol. (Mosk.) 45, 446–458. 75. Roewer, L., Kru¨ger, C., Willuweit, S., Nagy, M., Rodig, H., Kokshunova, L., Rotha¨mel, T., Kravchenko, S., Jobling, M.A., Stoneking, M., and Nasidze, I. (2007). Y-chromosomal STR The American Journal of Human Genetics 90, 229–246, February 10, 2012 243
  • 88. haplotypes in Kalmyk population samples. Forensic Sci. Int. 173, 204–209. 76. Geppert, M., Baeta, M., Nu´n˜ez, C., Martı´nez-Jarreta, B., Zwey- nert, S., Cruz, O.W., Gonza´lez-Andrade, F., Gonza´lez-Solo- rzano, J., Nagy, M., and Roewer, L. (2011). Hierarchical Y-SNP assay to study the hidden diversity and phylogenetic relationship of native populations in South America. Forensic Sci. Int. Genet. 5, 100–104. 77. Excoffier, L., Laval, G., and Schneider, S. (2005). Arlequin (version 3.0): an integrated software package for population genetics data analysis. Evol. Bioinform. Online 1, 47–50. 78. Tamura, K., and Nei, M. (1993). Estimation of the number of nucleotide substitutions in the control region of mitochon- drial DNA in humans and chimpanzees. Mol. Biol. Evol. 10, 512–526. 79. SPSS Inc. (2001). SPSS for Windows Release 11.0.0 (Chicago, IL: SPSS Inc.). 80. Polzin, T., and Daneschmand, S.V. (2003). On Steiner trees and minimum spanning trees in hypergraphs. Oper. Res. Lett. 31, 12–20. 81. Bandelt, H.J., Forster, P., and Ro¨hl, A. (1999). Median-joining networks for inferring intraspecific phylogenies. Mol. Biol. Evol. 16, 37–48. 82. Bandelt, H.J., Forster, P., Sykes, B.C., and Richards, M.B. (1995). Mitochondrial portraits of human populations using median networks. Genetics 141, 743–753. 83. Gusma˜o, L., Butler, J.M., Carracedo, A., Gill, P., Kayser, M., Mayr, W.R., Morling, N., Prinz, M., Roewer, L., Tyler-Smith, C., and Schneider, P.M.; DNA Commission of the Interna- tional Society of Forensic Genetics. (2006). DNA Commis- sion of the International Society of Forensic Genetics (ISFG): an update of the recommendations on the use of Y-STRs in forensic analysis. Forensic Sci. Int. 157, 187–197. 84. Soares, P., Ermini, L., Thomson, N., Mormina, M., Rito, T., Ro¨hl, A., Salas, A., Oppenheimer, S., Macaulay, V., and Richards, M.B. (2009). Correcting for purifying selection: an improved human mitochondrial molecular clock. Am. J. Hum. Genet. 84, 740–759. 85. 
Sengupta, S., Zhivotovsky, L.A., King, R., Mehdi, S.Q., Edmonds, C.A., Chow, C.E., Lin, A.A., Mitra, M., Sil, S.K., Ramesh, A., et al. (2006). Polarity and temporality of high- resolution y-chromosome distributions in India identify both indigenous and exogenous expansions and reveal minor genetic influence of Central Asian pastoralists. Am. J. Hum. Genet. 78, 202–221. 86. Wilson, I., Balding, D., and Weale, M. (2003). Inferences from DNA data: population histories, evolutionary processes and forensic match probabilities. J. R. Stat. Soc. [Ser A] 166, 155–188. 87. Xue, Y., Zerjal, T., Bao, W., Zhu, S., Shu, Q., Xu, J., Du, R., Fu, S., Li, P., Hurles, M.E., et al. (2008). Modelling male prehis- tory in east Asia using BATWING. In Simulations, Genetics and Human Prehistory, S. Matsumura, P. Forster, and C. Ren- frew, eds. (Cambridge: McDonald Institute for Archaeolog- ical Research), pp. 79–88. 88. Zhivotovsky, L.A., Underhill, P.A., Cinnioglu, C., Kayser, M., Morar, B., Kivisild, T., Scozzari, R., Cruciani, F., Destro-Bisol, G., Spedini, G., et al. (2004). The effective mutation rate at Y chromosome short tandem repeats, with application to human population-divergence time. Am. J. Hum. Genet. 74, 50–61. 89. Dupuy, B.M., Stenersen, M., Egeland, T., and Olaisen, B. (2004). Y-chromosomal microsatellite mutation rates: differ- ences in mutation rate between and within loci. Hum. Mutat. 23, 117–124. 90. Fenner, J.N. (2005). Cross-cultural estimation of the human generation interval for use in genetics-based population divergence studies. Am. J. Phys. Anthropol. 128, 415–423. 91. Derenko, M., Malyarchuk, B., Grzybowski, T., Denisova, G., Rogalla, U., Perkova, M., Dambueva, I., and Zakharov, I. (2010). Origin and post-glacial dispersal of mitochondrial DNA haplogroups C and D in northern Asia. PLoS ONE 5, e15214. 92. Zhadanov, S.I., Dulik, M.C., Markley, M., Jennings, G.W., Gaieski, J.B., Elias, G., and Schurr, T.G.; Genographic Project Consortium. (2010). 
Genetic heritage and native identity of the Seaconke Wampanoag tribe of Massachusetts. Am. J. Phys. Anthropol. 142, 579–589. 93. Hammer, M.F., Karafet, T.M., Redd, A.J., Jarjanazi, H., Santa- chiara-Benerecetti, S., Soodyall, H., and Zegura, S.L. (2001). Hierarchical patterns of global human Y-chromosome diver- sity. Mol. Biol. Evol. 18, 1189–1203. 94. Kivisild, T., Rootsi, S., Metspalu, M., Mastana, S., Kaldma, K., Parik, J., Metspalu, E., Adojaan, M., Tolk, H.V., Stepanov, V., et al. (2003). The genetic heritage of the earliest settlers persists both in Indian tribal and caste populations. Am. J. Hum. Genet. 72, 313–332. 95. Wells, R.S., Yuldasheva, N., Ruzibakiev, R., Underhill, P.A., Evseeva, I., Blue-Smith, J., Jin, L., Su, B., Pitchappan, R., Shanmugalakshmi, S., et al. (2001). The Eurasian heartland: a continental perspective on Y-chromosome diversity. Proc. Natl. Acad. Sci. USA 98, 10244–10249. 96. Rosser, Z.H., Zerjal, T., Hurles, M.E., Adojaan, M., Alavantic, D., Amorim, A., Amos, W., Armenteros, M., Arroyo, E., Barbu- jani, G., et al. (2000). Y-chromosomal diversity in Europe is clinal and influenced primarily by geography, rather than by language. Am. J. Hum. Genet. 67, 1526–1543. 97. Quintana-Murci, L., Krausz, C., Zerjal, T., Sayar, S.H., Hammer, M.F., Mehdi, S.Q., Ayub, Q., Qamar, R., Mohyud- din, A., Radhakrishna, U., et al. (2001). Y-chromosome line- ages trace diffusion of people and languages in southwestern Asia. Am. J. Hum. Genet. 68, 537–542. 98. Underhill, P.A., Passarino, G., Lin, A.A., Shen, P., Mirazo´n Lahr, M., Foley, R.A., Oefner, P.J., and Cavalli-Sforza, L.L. (2001). The phylogeography of Y chromosome binary haplo- types and the origins of modern human populations. Ann. Hum. Genet. 65, 43–62. 99. Zhong, H., Shi, H., Qi, X.-B., Duan, Z.-Y., Tan, P.-P., Jin, L., Su, B., and Ma, R.Z. (2011). Extended Y chromosome investiga- tion suggests postglacial migrations of modern humans into East Asia via the northern route. Mol. Biol. Evol. 
28, 717–727. 100. Mirabal, S., Regueiro, M., Cadenas, A.M., Cavalli-Sforza, L.L., Underhill, P.A., Verbenko, D.A., Limborska, S.A., and Her- rera, R.J. (2009). Y-chromosome distribution within the geo-linguistic landscape of northwestern Russia. Eur. J. Hum. Genet. 17, 1260–1273. 101. Myres, N.M., Rootsi, S., Lin, A.A., Ja¨rve, M., King, R.J., Kutuev, I., Cabrera, V.M., Khusnutdinova, E.K., Pshenichnov, A., Yunusbayev, B., et al. (2011). A major Y-chromosome haplogroup R1b Holocene era founder effect in Central and Western Europe. Eur. J. Hum. Genet. 19, 95–101. 244 The American Journal of Human Genetics 90, 229–246, February 10, 2012
  • 89. 102. Malyarchuk, B., Derenko, M., Denisova, G., Maksimov, A., Wozniak, M., Grzybowski, T., Dambueva, I., and Zakharov, I. (2011). Ancient links between Siberians and Native Amer- icans revealed by subtyping the Y chromosome haplogroup Q1a. J. Hum. Genet. 56, 583–588. 103. Rogers, A.R., and Harpending, H. (1992). Population growth makes waves in the distribution of pairwise genetic differ- ences. Mol. Biol. Evol. 9, 552–569. 104. Shi, H., Dong, Y.L., Wen, B., Xiao, C.J., Underhill, P.A., Shen, P.D., Chakraborty, R., Jin, L., and Su, B. (2005). Y-chromo- some evidence of southern origin of the East Asian-specific haplogroup O3-M122. Am. J. Hum. Genet. 77, 408–419. 105. Forsyth, J. (1992). A History of the Peoples of Siberia: Russia’s North Asian Colony, 1581–1990 (Cambridge, England: Cambridge University Press). 106. Brown, M.D., Hosseini, S.H., Torroni, A., Bandelt, H.J., Allen, J.C., Schurr, T.G., Scozzari, R., Cruciani, F., and Wallace, D.C. (1998). mtDNA haplogroup X: An ancient link between Europe/Western Asia and North America? Am. J. Hum. Genet. 63, 1852–1861. 107. Tamm, E., Kivisild, T., Reidla, M., Metspalu, M., Smith, D.G., Mulligan, C.J., Bravi, C.M., Rickards, O., Martinez-Labarga, C., Khusnutdinova, E.K., et al. (2007). Beringian standstill and spread of Native American founders. PLoS ONE 2, e829. 108. Achilli, A., Perego, U.A., Bravi, C.M., Coble, M.D., Kong, Q.P., Woodward, S.R., Salas, A., Torroni, A., and Bandelt, H.J. (2008). The phylogeny of the four pan-American MtDNA haplogroups: implications for evolutionary and disease studies. PLoS ONE 3, e1764. 109. Perego, U.A., Achilli, A., Angerhofer, N., Accetturo, M., Pala, M., Olivieri, A., Kashani, B.H., Ritchie, K.H., Scozzari, R., Kong, Q.P., et al. (2009). Distinctive Paleo-Indian migration routes from Beringia marked by two rare mtDNA haplo- groups. Curr. Biol. 19, 1–8. 110. 
Perego, U.A., Angerhofer, N., Pala, M., Olivieri, A., Lancioni, H., Kashani, B.H., Carossa, V., Ekins, J.E., Go´mez-Carballa, A., Huber, G., et al. (2010). The initial peopling of the Americas: a growing number of founding mitochondrial genomes from Beringia. Genome Res. 20, 1174–1179. 111. Helgason, A., Pa´lsson, G., Pedersen, H.S., Angulalik, E., Gun- narsdo´ttir, E.D., Yngvado´ttir, B., and Stefa´nsson, K. (2006). mtDNA variation in Inuit populations of Greenland and Canada: migration history and population structure. Am. J. Phys. Anthropol. 130, 123–134. 112. Bortolini, M.C., Salzano, F.M., Bau, C.H., Layrisse, Z., Petzl- Erler, M.L., Tsuneto, L.T., Hill, K., Hurtado, A.M., Castro- De-Guerra, D., Bedoya, G., and Ruiz-Linares, A. (2002). Y-chromosome biallelic polymorphisms and Native Amer- ican population structure. Ann. Hum. Genet. 66, 255–259. 113. Underhill, P.A., Shen, P., Lin, A.A., Jin, L., Passarino, G., Yang, W.H., Kauffman, E., Bonne´-Tamir, B., Bertranpetit, J., Franca- lacci, P., et al. (2000). Y chromosome sequence variation and the history of human populations. Nat. Genet. 26, 358–361. 114. Seielstad, M., Yuldasheva, N., Singh, N., Underhill, P., Oef- ner, P., Shen, P., and Wells, R.S. (2003). A novel Y-chromo- some variant puts an upper limit on the timing of first entry into the Americas. Am. J. Hum. Genet. 73, 700–705. 115. Schurr, T.G., Osipova, L.P., Zhadanov, S.I., and Dulik, M.C. (2010). Genetic diversity in Native Siberians: Implications for the prehistoric settlement of te Cis-Baikal region. In Prehistoric Hunter-Gatherers of the Baikal Region, Siberia, A.W. Weber, M.A. Katzenberg, and T.G. Schurr, eds. (Philadel- phia: University of Pennsylvania Press), pp. 121–134. 116. Dulik, M.C. (2011). A molecular anthropological study of Altaian histories utilizing population genetics and phylogeography. PhD thesis, University of Pennsylvania, Philadelphia, PA. 117. Fiedel, S.J. (2000). 
The peopling of the New World: present evidence, new theories, and future directions. J. Archaeol. Res. 8, 39–103. 118. Martı´nez-Cruz, B., Vitalis, R., Se´gurel, L., Austerlitz, F., Georges, M., The´ry, S., Quintana-Murci, L., Hegay, T., Alda- shev, A., Nasyrova, F., and Heyer, E. (2011). In the heartland of Eurasia: the multilocus genetic landscape of Central Asian populations. Eur. J. Hum. Genet. 19, 216–223. 119. Karafet, T.M., Osipova, L.P., Gubina, M.A., Posukh, O.L., Zegura, S.L., and Hammer, M.F. (2002). High levels of Y-chro- mosome differentiation among native Siberian populations and the genetic signature of a boreal hunter-gatherer way of life. Hum. Biol. 74, 761–789. 120. Balaresque, P., Parkin, E.J., Roewer, L., Carvalho-Silva, D.R., Mitchell, R.J., van Oorschot, R.A., Henke, J., Stoneking, M., Nasidze, I., Wetton, J., et al. (2009). Genomic complexity of the Y-STR DYS19: inversions, deletions and founder line- ages carrying duplications. Int. J. Legal Med. 123, 15–23. 121. Underhill, P.A., Myres, N.M., Rootsi, S., Metspalu, M., Zhivo- tovsky, L.A., King, R.J., Lin, A.A., Chow, C.E., Semino, O., Battaglia, V., et al. (2010). Separating the post-Glacial coan- cestry of European and Asian Y chromosomes within haplo- group R1a. Eur. J. Hum. Genet. 18, 479–484. 122. Soucek, S. (2000). A History of Inner Asia (Cambridge, New York: Cambridge University Press). 123. Reich, D., Patterson, N., Kircher, M., Delfin, F., Nandineni, M.R., Pugach, I., Ko, A.M., Ko, Y.C., Jinam, T.A., Phipps, M.E., et al. (2011). Denisova admixture and the first modern human dispersals into Southeast Asia and Oceania. Am. J. Hum. Genet. 89, 516–528. 124. Hrdlicka, A. (1942). Crania of Siberia. Am. J. Phys. Anthro- pol. 29, 435–481. 125. Gonza´lez-Jose´, R., Bortolini, M.C., Santos, F.R., and Bonatto, S.L. (2008). The peopling of America: craniofacial shape vari- ation on a continental scale and its interpretation from an interdisciplinary view. Am. J. Phys. Anthropol. 137, 175–187. 
126. Kozintsev, A.G., Gromov, A.V., and Moiseyev, V.G. (1999). Collateral relatives of American Indians among the Bronze Age populations of Siberia? Am. J. Phys. Anthropol. 108, 193–204. 127. Crawford, M.H. (1998). The Origins of Native Americans: Evidence from Anthropological Genetics (Cambridge: Cam- bridge University Press). 128. Brace, C.L., Nelson, A.R., Seguchi, N., Oe, H., Sering, L., Qifeng, P., Yongyi, L., and Tumen, D. (2001). Old World sour- ces of the first New World human inhabitants: a comparative craniofacial view. Proc. Natl. Acad. Sci. USA 98, 10017– 10022. 129. Wang, S., Lewis, C.M., Jakobsson, M., Ramachandran, S., Ray, N., Bedoya, G., Rojas, W., Parra, M.V., Molina, J.A., Gallo, C., et al. (2007). Genetic variation and population structure in native Americans. PLoS Genet. 3, e185. 130. Horai, S., Kondo, R., Nakagawa-Hattori, Y., Hayashi, S., Sonoda, S., and Tajima, K. (1993). Peopling of the Americas, founded by four major lineages of mitochondrial DNA. Mol. Biol. Evol. 10, 23–47. The American Journal of Human Genetics 90, 229–246, February 10, 2012 245
  • 90. 131. Kaessmann, H., Zo¨llner, S., Gustafsson, A.C., Wiebe, V., Laan, M., Lundeberg, J., Uhle´n, M., and Pa¨a¨bo, S. (2002). Extensive linkage disequilibrium in small human populations in Eurasia. Am. J. Hum. Genet. 70, 673–685. 132. Bailliet, G., Ramallo, V., Muzzio, M., Garcı´a, A., Santos, M.R., Alfaro, E.L., Dipierri, J.E., Salceda, S., Carnese, F.R., Bravi, C.M., et al. (2009). Brief communication: Restricted geo- graphic distribution for Y-Q* paragroup in South America. Am. J. Phys. Anthropol. 140, 578–582. 133. Ho, S.Y., and Endicott, P. (2008). The crucial role of calibra- tion in molecular date estimates for the peopling of the Americas. Am. J. Hum. Genet. 83, 142–146, author reply 146–147. 134. Zhivotovsky, L.A., and Underhill, P.A. (2005). On the evolu- tionary mutation rate at Y-chromosome STRs: comments on paper by Di Giacomo et al. (2004). Hum. Genet. 116, 529–532. 135. Di Giacomo, F., Luca, F., Popa, L.O., Akar, N., Anagnou, N., Banyko, J., Brdicka, R., Barbujani, G., Papola, F., Ciavarella, G., et al. (2004). Y chromosomal haplogroup J as a signature of the post-neolithic colonization of Europe. Hum. Genet. 115, 357–371. 136. Zerjal, T., Xue, Y., Bertorelle, G., Wells, R.S., Bao, W., Zhu, S., Qamar, R., Ayub, Q., Mohyuddin, A., Fu, S., et al. (2003). The genetic legacy of the Mongols. Am. J. Hum. Genet. 72, 717–721. 137. Zhivotovsky, L.A., Underhill, P.A., and Feldman, M.W. (2006). Difference between evolutionarily effective and germ line mutation rate due to stochastically varying haplo- group size. Mol. Biol. Evol. 23, 2268–2270. 138. Ho, S.Y., Phillips, M.J., Cooper, A., and Drummond, A.J. (2005). Time dependency of molecular rate estimates and systematic overestimation of recent divergence times. Mol. Biol. Evol. 22, 1561–1568. 246 The American Journal of Human Genetics 90, 229–246, February 10, 2012
  • 91. ARTICLE A “Copernican” Reassessment of the Human Mitochondrial DNA Tree from its Root Doron M. Behar,1,2,* Mannis van Oven,3,* Saharon Rosset,4 Mait Metspalu,1 Eva-Liis Loogvali,1 Nuno M. Silva,5 Toomas Kivisild,1,6 Antonio Torroni,7 and Richard Villems1,8 Mutational events along the human mtDNA phylogeny are traditionally identified relative to the revised Cambridge Reference Sequence, a contemporary European sequence published in 1981. This historical choice is a continuous source of inconsistencies, misinterpretations, and errors in medical, forensic, and population genetic studies. Here, after having refined the human mtDNA phylogeny to an unprecedented level by adding information from 8,216 modern mitogenomes, we propose switching the reference to a Reconstructed Sapiens Reference Sequence, which was identified by considering all available mitogenomes from Homo neanderthalensis. This “Copernican” reassessment of the human mtDNA tree from its deepest root should resolve previous problems and will have a substantial practical and educational influence on the scientific and public perception of human evolution by clarifying the core principles of common ancestry for extant descendants. Introduction Nested hierarchy of species, resulting from the descent with modification process,1 is fundamental to our understanding of the evolution of biological diversity and life in general. In molecular genealogy, the sequential accumulation of mutations since the time of the most recent common ancestor (MRCA) is reflected within the ever-evolving phylogeny of any genetic locus.
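The reference-dependence described in the abstract can be made concrete with a toy sketch. The Python below is illustrative only and is not the authors' method: the 12-base sequences, the names `ANCESTRAL`, `CONTEMPORARY_REF`, and `SAMPLE`, and the helper `call_variants` are all hypothetical. It assumes pre-aligned sequences of equal length and reports substitutions only.

```python
from __future__ import annotations

# Toy illustration: the 12-base sequences below are invented, not real mtDNA.


def call_variants(reference: str, sample: str) -> list[str]:
    """Return substitutions of `sample` relative to `reference`, e.g. 'G7T'.

    Positions are 1-based. Both sequences must already be aligned and of
    equal length; insertions and deletions are not handled in this sketch.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        f"{ref_base}{pos}{alt_base}"
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


# Hypothetical ancestral (root) sequence.
ANCESTRAL = "ACGTACGTACGT"
# Hypothetical contemporary lineage used as the reference; it carries two
# derived mutations of its own, at positions 3 and 10.
CONTEMPORARY_REF = "ACATACGTAGGT"
# Hypothetical sample from another lineage with one derived mutation, at position 7.
SAMPLE = "ACGTACTTACGT"

vs_contemporary = call_variants(CONTEMPORARY_REF, SAMPLE)
vs_ancestral = call_variants(ANCESTRAL, SAMPLE)

print("relative to contemporary reference:", vs_contemporary)
print("relative to ancestral reference:   ", vs_ancestral)
```

Scored against the contemporary reference, the sample appears to carry three differences, although two of them (positions 3 and 10) are mutations that arose on the reference's own lineage. Scored against the ancestral sequence, only the sample's single derived mutation remains.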
Accordingly, the reconstructed ancestral sequence of a locus should optimally serve as the reference point for its derived alleles.2 The human mtDNA phylogeny3–7 is an almost perfect molecular prototype for a nonrecombining locus, and knowledge of its variation has been and is extensively used in medical, genealogical, forensic, and population genetic studies.8–11 Boosted by rapid advances in sequencing and genotyping technology, its mode of inheritance, high mutation rate, lack of recombination, and high cellular copy number have proved critical in making this locus the primary choice in the field of archaeogenetics and ancient DNA.12–14 Although its early synthesis was based on restriction-fragment-length polymorphisms,15–18 control-region variation,19,20 or a combination of both,21 the human mtDNA phylogeny is now reconstructed from complete mtDNA sequences,4,6,7,22 thus stretching the phylogenetic resolution to its maximum. mtDNA also became the main target of ancient-DNA studies because it is much more abundant than nuclear DNA.13 The recently published Homo neanderthalensis mitogenomes23,24 represent the best available outgroup source for rooting the human mtDNA phylogeny, whose root is known to lie within the contemporary African variation.22,25,26 Despite these major advances, the extinct human mtDNA complete root sequence was never precisely determined, and mtDNA nomenclature remains cumbersome because it refers to the first completely sequenced mtDNA,27,28 labeled rCRS, which is now known to belong to the recently coalescing European haplogroup H2a2a1.7 The use of the rCRS as a reference resulted in a number of practical problems such as (1) the misidentification of derived versus ancestral states of alleles and (2) the count of nonsynonymous mutations that map to the path between the rCRS and the case sequences.29 For instance, clinical and functional studies frequently include among the putative nonsynonymous candidate mutations the haplogroup-HV-defining transition at position 14766 (CYTB) simply because the revised Cambridge Reference Sequence (rCRS) belongs to its derived haplogroup H.30

In this study, to definitively address these issues, we propose a “Copernican” reassessment of the human mtDNA phylogeny by switching to a Reconstructed Sapiens Reference Sequence (RSRS) as the phylogenetically valid reference point. To this end, the previously suggested root7,22,25 was updated to most parsimoniously incorporate the available mitogenomes from H. neanderthalensis.23,24 Moreover, we further refined the human mtDNA phylogeny to an unprecedented level by adding information from 8,216 mitogenomes and evaluated the ranges of nucleotide substitutions from the root RSRS rather than the rCRS28 as a reference point (Figure 1 and Figure S1, available online).

1 Estonian Biocentre and Department of Evolutionary Biology, University of Tartu, Tartu 51010, Estonia; 2 Molecular Medicine Laboratory, Rambam Health Care Campus, Haifa 31096, Israel; 3 Department of Forensic Molecular Biology, Erasmus MC, University Medical Center Rotterdam, 3000 CA Rotterdam, The Netherlands; 4 Department of Statistics and Operations Research, School of Mathematical Sciences, Tel Aviv University, Tel Aviv 69978, Israel; 5 Instituto de Patologia e Imunologia Molecular da Universidade do Porto, Porto 4200-465, Portugal; 6 Department of Biological Anthropology, University of Cambridge, Cambridge CB2 1QH, UK; 7 Dipartimento di Biologia e Biotecnologie “L. Spallanzani,” Università di Pavia, Pavia 27100, Italy; 8 Estonian Academy of Sciences, 6 Kohtu Street, Tallinn 10130, Estonia
*Correspondence: behardm@usernet.com (D.M.B.), m.vanoven@erasmusmc.nl (M.v.O.)
DOI 10.1016/j.ajhg.2012.03.002. ©2012 by The American Society of Human Genetics. All rights reserved.
The American Journal of Human Genetics 90, 675–684, April 6, 2012 675
  • 92. [Figure 1 graphic. Left panel: Hominini phylogeny (Pan paniscus, Pan troglodytes, Homo neanderthalensis, Homo sapiens) on a time axis in Mya, with the RSRS and RNRS marked. Right panel: human mtDNA tree from the RSRS through L0 and L1’2’3’4’5’6 to L0d1c1b (EU092832), the rCRS (H2a2a1, NC_012920), and H4a1a (HQ860291). The per-branch mutation lists are not recoverable from the extraction.]

Figure 1. Schematic Representation of the Human mtDNA Phylogeny within Hominini
(Left) Hominini phylogeny illustrating approximate divergence times of the studied species. The positions of the RSRS and the putative Reconstructed Neanderthal Reference Sequence (RNRS) are shown.
(Right) Magnification of the human mtDNA phylogeny. Mutated nucleotide positions separating the nodes of the two basal human haplogroups L0 and L1’2’3’4’5’6 and their derived states as compared to the RSRS are shown. The positions of the rCRS and the RSRS are indicated by a golden and a green five-pointed star, respectively. Accordingly, the number of mutations counted from the rCRS (NC_012920) or the RSRS (Sequence S1) to the L0d1c1b (EU092832) and H4a1a (HQ860291) haplotypes retrieved from a San and a German, respectively, are marked on the golden and green branches. The principle of equidistant star-like radiation from the common ancestor of all contemporary haplotypes is highlighted when the RSRS is preferred over the rCRS as the reference sequence.
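The contrast that Figure 1 draws between counting mutations from the rCRS and from the RSRS can be illustrated with a toy example. The sketch below is only an illustration: the 10-bp sequences, the variable names, and the `differences` helper are hypothetical and are not taken from the study. It shows that a sample identical to a modern reference scores zero differences against it yet still carries derived alleles relative to the ancestral sequence.

```python
def differences(reference: str, sample: str) -> list[str]:
    """List substitutions in `sample` relative to `reference` (1-based positions)."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        f"{ref_base}{pos}{sample_base}"
        for pos, (ref_base, sample_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != sample_base
    ]


# Hypothetical toy sequences (not real mtDNA).
ancestral = "ACGTACGTAC"   # stands in for an ancestral reference such as the RSRS
modern_ref = "ACGTATGTAT"  # stands in for a modern reference such as the rCRS
sample_a = "ACGTATGTAT"    # identical to the modern reference
sample_b = "ATGTACGTAC"    # one derived allele relative to the ancestral sequence

for name, sample in (("sample_a", sample_a), ("sample_b", sample_b)):
    print(name, "vs modern reference:", differences(modern_ref, sample))
    print(name, "vs ancestral reference:", differences(ancestral, sample))
```

Against the modern reference, `sample_b` appears to carry three changes, two of which are really ancestral states at positions where the reference itself is derived. Against the ancestral reference, every listed change is a derived allele.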
  • 93. Subjects and Methods

Updating the Human mtDNA Phylogeny and Inference of the Ancestral Root Haplotype

MtDNA Genomes Comprising the Phylogeny

A total of 18,843 complete mtDNA sequences were used to refine the human mtDNA phylogeny, of which 10,627 were previously reported and used for the mtDNA tree Build 13 (28 Dec 2011) as posted by PhyloTree.7 The remaining 8,216 sequences are mainly from the large complete mtDNA database available at FamilyTreeDNA and in part from data sets maintained by the authors. The sequences in the large FamilyTreeDNA database were privately obtained by the sample donors, usually for genealogical purposes. Most donors were of western Eurasian ancestry, but donors with matrilineal ancestry from other geographical regions have also contributed. Once the mtDNA sequences were obtained, donors had several options: keep them confidential, share them with peer genealogists, submit them to the National Center for Biotechnology Information (NCBI) GenBank, and/or consent to contribute them anonymously to a research database maintained by FamilyTreeDNA to improve the mtDNA phylogeny. In turn, this contribution rewards and enriches the genealogical experience as well as benefits the scientific community. All the procedures followed in this study were in accordance with the ethical standards of the responsible committee on human experimentation of the participating research centers.

Likewise, it is important to clarify that because the complete sequences were obtained privately, some donors have independently uploaded their sequence to NCBI. Currently (as of February 28, 2012), a total of 1,220 complete mtDNA sequences that were generated at FamilyTreeDNA were privately deposited in NCBI GenBank.
Most of these sequences were already considered in the previous PhyloTree Builds.7 Because we have no way to know which of the sequences were autonomously uploaded to NCBI, all duplicate sequences that matched precisely between NCBI and our database were excluded from our analysis. Therefore, even if multiple samples were excluded, no topological information was lost. Accordingly, out of the 8,216 sequences used to verify the phylogeny, a total of 4,265 sequences are released and deposited in NCBI GenBank under accession numbers JQ701803–JQ706067. The complete mtDNA sequences of the Neanderthals were retrieved from the literature.23,24

Complete mtDNA Sequencing

DNA was extracted from buccal swabs. MtDNA was amplified with 18 primers to yield nine overlapping fragments as previously reported.22 PCR products were cleaned with magnetic-particle technology (BioSprint 96; QIAGEN). After purification, the nine fragments were sequenced by means of 92 internal primers to obtain the complete mtDNA genome. Sequencing was performed on a 3730xl DNA Analyzer (Applied Biosystems), and the resulting sequences were analyzed with the Sequencher software (Gene Codes Corporation). Mutations were scored relative to the rCRS and the suggested RSRS. Sample quality control was assured as follows:
(1) After the PCR amplification of the nine fragments, DNA handling and distribution to the 96 sequencing reactions was aided by the Beckman Coulter Biomek FX liquid handler to minimize the chance for human pipetting errors.
(2) All 96 sequencing reactions of each sample were performed simultaneously in the same sequencing run. Most observed mutations were determined by at least two sequence reads. However, in a minority of the cases only one sequence read was available because of various technical reasons, usually related to the amount and quality of the DNA available.
(3) Any fragment that failed the first sequencing attempt or any ambiguous base call was tested by additional and independent PCR and sequencing reactions. In these cases, the first hypervariable segment (HVS-I) of the control region was resequenced too to assure that the correct sample was retrieved.
(4) Genotyping history for each sample was recorded to help in the search for DNA handling errors and artificial recombination events.
(5) All sequences were aligned with the software Sequencher (Gene Codes Corporation), and all positions with a Phred score less than 30 were manually evaluated by an operator. Two independent operators read each sequence. All positions that differed from the reference sequences were recorded electronically to minimize typographic errors.
(6) Any sequence that did not comfortably fit within the established human mtDNA phylogeny was highlighted and resequenced to exclude potential lab errors.
(7) Any comments and remarks raised by external investigators after release of the data will be addressed by reassessing the original sequences for accuracy. After that, any unresolved result will be further examined by resequencing and, if necessary, immediately corrected.

Tree Reconstruction and Notation of Mutations

The phylogeny was reconstructed by evaluating both the previously published and the herein released complete mtDNA sequences, aiming at the most parsimonious solution and aided by the software mtPhyl. Polymorphic positions are shown on the branches, and reticulations were resolved by considering the degree of mutability of individual positions as counted by their number of occurrences in the overall phylogeny. Both the ancestral and derived base status for each mutation appearing in the phylogeny are reported according to the International Union of Pure and Applied Chemistry (IUPAC) nucleotide code. We use capital letters for transitions (e.g., G73A) and lowercase letters for transversions (e.g., A73t).
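The capital-versus-lowercase convention for transitions and transversions can be sketched in a few lines. This is a minimal illustration, not the study's own software: the function names `is_transition` and `format_substitution` are hypothetical, and only the four unambiguous bases are handled.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ancestral: str, derived: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    pair = {ancestral, derived}
    return pair <= PURINES or pair <= PYRIMIDINES


def format_substitution(ancestral: str, position: int, derived: str) -> str:
    """Label a substitution: capital derived base for a transition,
    lowercase derived base for a transversion."""
    a, d = ancestral.upper(), derived.upper()
    if a not in PURINES | PYRIMIDINES or d not in PURINES | PYRIMIDINES:
        raise ValueError("only A, C, G, and T are handled in this sketch")
    if a == d:
        raise ValueError("ancestral and derived bases are identical")
    return f"{a}{position}{d if is_transition(a, d) else d.lower()}"


print(format_substitution("G", 73, "A"))  # a transition
print(format_substitution("A", 73, "T"))  # a transversion
```

The two calls reproduce the labels used as examples in the text, G73A and A73t.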
Although heteroplasmies are not noted in the phylogeny, we recommend labeling them by using IUPAC code and capital letters (e.g., G73R). Throughout the phylogeny, indels are given with respect to the RSRS and maintain the traditional nucleotide position numbering as in the rCRS. Sequencing alignment prefers 3′ placement for indels, except in cases where the phylogeny suggests otherwise.31 Deletions are indicated by a “d” after the deleted nucleotide position (e.g., T15944d). Insertions are indicated by the position number, followed by a dot, the ordinal of the inserted nucleotide, and the type of inserted nucleotide(s) (e.g., 5899.1C for a C insertion at the first inserted nucleotide position after position 5899 and 5899.2C for a subsequent C insertion; these are abbreviated as 5899.1CC when occurring on the same branch). We label polynucleotide stretches of unknown length as follows: 573.XC. In cases where an insertion occurred at an ancestral branch but a reversion of this insertion (= deletion) took place at a descendant branch, we noted the latter as follows: 5899.1Cd. An exclamation mark (!) at the end of a labeled position denotes a reversion to the ancestral state. The number of exclamation marks stands for the number of sequential reversions in the given position from the RSRS (e.g., C152T, T152C!, and
  • 94. C152T!!). Some indel positions have been a source of confusion because multiple alignment solutions enable alternative scoring. Notably, the dinucleotide repeat in hypervariable segment II (HVS-II) of the control region can be viewed either as a CA repeat starting at position 514 or as an AC repeat starting at position 515, leading to two different notations being in use for a repeat loss: 522–523d versus 523–524d. We adhered to the guidelines for consistent treatment of mtDNA-length variants that were established by the forensic genetic community31 and favor the AC interpretation. As the RSRS has one AC unit less compared to the rCRS, we filled positions 523 and 524 of the RSRS with “NN,” thereby preserving the historical genome annotation numbering. Consequently, an AC insertion compared to the RSRS is scored as 522.1AC, whereas an AC deletion is scored as 521–522d. Table S2 presents all common indel positions throughout the complete mtDNA sequence and the way we labeled them. Transitions at the hypervariable position 16519, insertions of one or two Cs at positions 309, 315, and 16193, A to C transversions at 16182 and 16183, as well as length variation of the AC dinucleotide repeat spanning 515–522, were excluded from the phylogeny.

Haplogroup labels were re-evaluated and the following suggestions were made:
(1) Monophyletic clades that are composed of two or more previously named haplogroups are labeled by concatenating their names and separating them by apostrophe (e.g., L0a’b). This is not applied in the case of capital-letter-only labeled haplogroups (e.g., JT);
(2) We suggest labeling an extant sample that matches a haplogroup root with the superscript letter n for “nodal” (e.g., Hⁿ);
(3) We note that when complete mtDNA sequences are considered, the inability to differentiate a nodal haplotype from an unresolved paraphyletic clade is eliminated.
Accordingly, the haplogroup label of each observed complete mtDNA sequence can: (a) mark it in a nodal position; (b) affiliate it with a previously labeled haplogroup; (c) suggest a, so far, unlabeled haplogroup; or (d) in the absence of two additional samples to justify the labeling of a, so far, unidentified haplogroup, affiliate it with the ancestral haplogroup. So, the label of a given sample as “H” means that it is an unlabeled descendant of haplogroup H that cannot be affiliated to any known H haplogroup clade at the time of report and based on complete mtDNA sequence. We suggest restricting the use of label “H*” to cases where the haplogroup labeling is based on partial mtDNA sequence;
(4) To aid the nonexpert in understanding the mtDNA haplogroup nomenclature system, we summarize in Table S3 the cases where haplogroup labels do not logically follow from the hierarchy and hence could lead to confusion. Changing these haplogroup labels to make them more logical is undesirable at this stage because they are already used extensively in the literature and therefore changing them would probably cause even more confusion. In addition, we note that for the most basal nodes of the phylogeny, historically the following shorthand names have been in use: L1’5 = L1’2’3’4’5’6; L2’5 = L2’3’4’5’6; L2’6 = L2’3’4’6; and L4’6 = L3’4’6, which we will herein refer to by their full name. One shorthand haplogroup name, M4’’67, is maintained because writing it in full (M4’18’30’37’38’43’45’63’64’65’66’67) seems impractical.

It is important to note that the aim of this study is to publish the most up-to-date human mtDNA phylogeny, and it cannot be regarded by any means as a population-level survey exploring the frequencies and distributions of the various haplogroups.
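The mutation labels defined earlier in this section (substitutions, deletions, insertions, and reversion marks) follow a regular enough pattern to be parsed mechanically. The following is a minimal sketch under stated assumptions: the function name `parse_mutation`, the returned dictionary keys, and the regular expressions are all hypothetical, and range deletions such as 521–522d and heteroplasmy codes such as G73R are deliberately left unhandled.

```python
import re

# Hypothetical patterns covering only the label shapes quoted in the text.
_SUBSTITUTION = re.compile(r"^([ACGT])(\d+)([ACGTacgt])(!*)$")  # G73A, A73t, C152T!!
_DELETION = re.compile(r"^([ACGT])(\d+)d$")                     # T15944d
_INSERTION = re.compile(r"^(\d+)\.(\d+|X)([ACGT]+)(d?)$")       # 5899.1CC, 573.XC, 5899.1Cd


def parse_mutation(label: str) -> dict:
    """Split one mutation label into its parts."""
    match = _SUBSTITUTION.match(label)
    if match:
        ancestral, position, derived, marks = match.groups()
        return {
            "kind": "substitution",
            "position": int(position),
            "ancestral": ancestral,
            "derived": derived.upper(),
            "transversion": derived.islower(),  # lowercase marks a transversion
            "reversion_marks": len(marks),      # number of trailing '!'
        }
    match = _DELETION.match(label)
    if match:
        return {"kind": "deletion", "position": int(match.group(2)),
                "deleted": match.group(1)}
    match = _INSERTION.match(label)
    if match:
        position, index, bases, reverted = match.groups()
        return {"kind": "insertion", "position": int(position),
                "index": index,                 # '1', '2', ... or 'X' for unknown length
                "bases": bases,
                "reverted": reverted == "d"}    # trailing 'd' = insertion lost again
    raise ValueError(f"unrecognized mutation label: {label!r}")


for example in ("G73A", "A73t", "C152T!!", "T15944d", "5899.1CC", "573.XC", "5899.1Cd"):
    print(example, parse_mutation(example))
```

Keeping the three label shapes as separate patterns makes it easy to see which parts of the notation the sketch covers and which it rejects.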
Therefore, although all sequences were used to establish the tree topology, the subset of sequences actually presented in the phylogeny is smaller because for each branch up to two representative example sequences are provided. In most cases, we labeled haplogroups only when supported by at least three distinct haplotypes to maximize the accuracy of the haplogroup-defining array of mutations and to avoid the establishment of haplogroups resulting from sequencing errors. Exceptions included previously established haplogroups or haplogroups supported by a particularly long array of mutations. Accordingly, the tips of the herein released phylogeny are in fact internal haplogroup nodes; thus, private mutations (if any) of individual haplotypes were not included.

Evaluation of the mtDNA Clock and Age Estimates

Substitution Counts and Molecular Clock

To calculate the substitution counts from the RSRS to every extant mitogenome (which is a tip in the mtDNA phylogeny), we summed up the number of mutations on the path leading to each noted haplogroup in the phylogeny and added to this the number of positions that differed between the tip and the root of the haplogroup. Thus, we are guaranteed to correctly count all parallel and back mutations, except for the case where two mutations affecting the same position occurred on a branch in the tree (in which case we either count zero instead of two, if the second is a back mutation, or one instead of two, if the second mutation is not back to the initial state). As has been argued in the past, such repeated mutations within a single branch in the highly resolved human mtDNA tree are highly unlikely,32 and are even more so if the fastest mutating sites (16519 and the A to C transversions and poly-C insertions around the HVS-I position 16189) are eliminated, as was done in our analysis.
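The counting rule just described (mutations summed along the path from the root to the haplogroup node, plus the private differences of the tip) can be sketched as follows. The tree, haplogroup names, and counts are hypothetical placeholders and do not come from the released phylogeny; the function name `substitutions_from_root` is likewise an assumption.

```python
# Hypothetical toy tree: node -> (parent, number of mutations on the branch
# leading from the parent to this node). The root has no parent.
toy_tree = {
    "root": (None, 0),
    "X": ("root", 5),
    "X1": ("X", 3),
    "X1a": ("X1", 2),
}


def substitutions_from_root(tree: dict, haplogroup: str, private: int = 0) -> int:
    """Sum branch mutations from the root down to `haplogroup`, then add the
    number of positions at which the sampled tip differs from that node."""
    total = private
    node = haplogroup
    while node is not None:
        parent, branch_mutations = tree[node]
        total += branch_mutations
        node = parent
    return total


# A tip assigned to X1a with four private substitutions.
print(substitutions_from_root(toy_tree, "X1a", private=4))
```

As in the text, this simple sum would undercount only if two mutations hit the same position within a single branch, which a per-branch count cannot detect.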
To test the validity of the molecular clock assumption on human mtDNA substitutions, we used PAML 4.4 with the HKY85 substitution model to generate maximum likelihood estimates of branch lengths with and without the molecular clock assumption. We chose to sample around 200–300 sequences and analyze their coalescent tree (a subtree of the complete tree) in each PAML run, to accommodate PAML’s computational limitations, and also to sample mostly deep branches (such as M44), rather than the recent and very short branches (such as D4a1b1) of the oversampled haplogroups such as H and D. Thus, we preferentially sampled haplogroups whose coalescence with other samples in the tree was more ancient. This ensured that even in such a sample, the deeper clades such as the basal M clades would be represented with high probability, whereas more recently coalescing haplogroups such as the ones of haplogroup D would be rarely sampled. The generalized likelihood ratio (GLR) test for validity of the clock assumption then uses the test statistic 2 × (log-likelihood of non-clock model − log-likelihood of clock model), which, under the null hypothesis of a molecular clock, has a χ² distribution with degrees of freedom equal to the number of parameters under no clock (= number of branches in the tree) minus the number of parameters under clock (= number of internal nodes in the tree). We performed the analyses on two sets of the mtDNA sequences: once by using the coding region alone and once on the entire molecule. This was done as another sanity check for
  • 95. the validity and generality of our results. All obtained p values are presented in Table S4.

Age Calculations Assuming a Molecular Clock

In spite of the discovered clock violations, we were still interested in applying the best available tools for estimating the ages of ancestral nodes in the tree assuming a molecular clock. We adopted the calculation approach and mutation rate estimate of a previous study,32 which suggests estimating ages in substitutions and then transforming them to years in a nonlinear manner accounting for the selection effect on nonsynonymous mutations. We used PAML 4.433 with the HKY85 substitution model to generate maximum likelihood estimates of internal node ages under a molecular clock assumption. Because PAML is computationally limited in the size of trees it can analyze, we performed estimation for the whole tree in several separate runs. We divided the tree into seven collections of haplogroups:
- All L haplogroups (i.e., the entire phylogeny excluding M and N)
- All of M excluding D
- D and JT
- H excluding H1 and H5
- B4’5 and HV excluding H but including H1 and H5
- U
- N excluding HV, U, JT and B4’5
For each PAML run, we selected all sequences belonging to one of these sets, and added a small random sample of other samples from the rest of the phylogeny to maintain “calibration.” Putting together the estimates from all seven runs provided us with age estimates for all nodes in our tree. Estimates are given in Table S5.

Data Transition

We are aware that the suggested change can raise difficulties and even antagonism from the scientific community. On the other hand, a scenario in which a reference sequence of a genetic locus does not represent its ancestral sequence should, indisputably, be corrected.
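The generalized likelihood ratio test of the clock assumption, as described under “Substitution Counts and Molecular Clock,” reduces to a short calculation once the two log-likelihoods are available. The sketch below is an illustration only: the function names and the toy log-likelihood values are hypothetical, no PAML output is parsed, and the χ² tail probability is computed with a generic incomplete-gamma routine (series expansion or continued fraction) rather than any method used in the study.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X >= x) for a chi-square variable with df > 0."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    log_prefactor = -z + a * math.log(z) - math.lgamma(a)
    if z < a + 1.0:
        # Series expansion of the lower regularized incomplete gamma function.
        term = total = 1.0 / a
        n = a
        while abs(term) > abs(total) * 1e-15:
            n += 1.0
            term *= z / n
            total += term
        return 1.0 - total * math.exp(log_prefactor)
    # Modified Lentz continued fraction for the upper incomplete gamma function.
    tiny = 1e-300
    b = z + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 10000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return h * math.exp(log_prefactor)


def glr_clock_test(lnl_no_clock: float, lnl_clock: float,
                   n_branches: int, n_internal_nodes: int) -> tuple[float, int, float]:
    """Return (test statistic, degrees of freedom, p value) for the clock test."""
    statistic = 2.0 * (lnl_no_clock - lnl_clock)
    df = n_branches - n_internal_nodes
    return statistic, df, chi2_sf(statistic, df)


# Hypothetical log-likelihoods for a three-tip rooted tree (4 branches, 2 internal nodes).
print(glr_clock_test(-1000.0, -1003.0, n_branches=4, n_internal_nodes=2))
```

A small p value indicates that the unconstrained model fits significantly better than the clock-constrained one, which is how a clock violation would show up.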
The realization of the superiority of complete mtDNA sequence analysis compared to other approaches, combined with the emergence of deep sequencing technologies, will possibly shift the entire field toward the use of only complete mtDNA sequences in the near future.34–36 Therefore, the sooner the change is made, the less “painful” it will be. As the common practice for reporting complete mtDNA sequences is by posting the sequences as FASTA files to NCBI, rather than reporting the substitutions with respect to a reference sequence (as in the case of many data sets restricted to control-region variation), no major change is needed. When a FASTA file is available or created, the only change needed is to switch the reference sequence to the RSRS. For control-region-based data sets, the conversion might be more problematic because the common practice for reporting the sequences in the literature did not involve FASTA files but recorded mutations as compared to the rCRS. Table S6 compares the classic diagnostic mutations for the major haplogroups relative to the rCRS or the RSRS. To facilitate data transition we release the tools “FASTmtDNA,” which allows transformation of Excel list-type reports of mtDNA haplotypes into FASTA files, and “mtDNAble,” which labels haplogroups, performs a phylogeny-based quality check, and identifies private substitutions. These features are fully supported in a web interface or as standalone versions, which can be freely downloaded from the website together with their manual and example files. In addition, the web interface allows private substitutions to be compared between submitted and previously stored mitogenomes to suggest the labeling of additional haplogroups. Following quality check and consent, the web interface enables the storing of complete mtDNA sequences by members of the mtDNA community to enrich a growing database.
This in turn is expected to strengthen the data set used by the website to label haplogroups, perform quality control, and refine the phylogeny. Additional tools will be periodically added and updated.

Results

The RSRS

Since the sub-Saharan haplogroup L0 was defined,37 it became clear that the root of the extant variation of human mitochondrial genomes is allocated between haplogroups L0 and L1’2’3’4’5’6, which are separated from each other by 14 coding and four control-region mutations22 (Figure 1). Until now, our understanding of the root of the human mtDNA tree was incomplete because of the absence of reliable closely related outgroup mitogenomes, and the exact placement of the 18 mutations separating the L0 and L1’2’3’4’5’6 nodes remained vague. In principle, ancient mtDNA from early human fossils might be informative, but it is out of reach because of considerable technical problems inherent to the analysis process.13 However, as the split between H. sapiens and H. neanderthalensis certainly predates the appearance of the RSRS,38 a resolution of the deepest node might be achieved by rooting the human phylogeny with H. neanderthalensis complete mtDNA sequences23,24 (Figure 1). Table S1 shows all substitutions separating haplogroup L0 from L1’2’3’4’5’6, their status in the six H. neanderthalensis mitogenomes, and their most parsimonious allocation around the human root. Accordingly, the ancestral mtDNA sequence of extant humans should correspond to the bifurcation of L0 and L1’2’3’4’5’6. Although it cannot be excluded that further sampling of the African mtDNA variation might reveal yet another more basal clade of the human mtDNA tree, it is at least equally valid to indicate that, in spite of the many thousands of reported complete mtDNA sequences,7 such a clade has not been found so far. Operating under this assumption we established the reference point, RSRS, which is made available as Sequence S1.
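The outgroup logic used to place the mutations around the root can be sketched as a simple polarity call: at each position where the two basal clades differ, the state shared with the outgroup is taken as ancestral. The sketch below is a simplified illustration, not the procedure behind Table S1: the site names, states, and the majority rule over outgroup sequences are assumptions made for the example.

```python
from collections import Counter
from typing import Optional


def infer_ancestral(state_a: str, state_b: str,
                    outgroup_states: list[str]) -> Optional[str]:
    """Pick whichever of the two clade states is better supported by the
    outgroup sequences; return None when the outgroup cannot decide."""
    counts = Counter(outgroup_states)
    support_a, support_b = counts[state_a], counts[state_b]
    if support_a > support_b:
        return state_a
    if support_b > support_a:
        return state_b
    return None


# Hypothetical sites: (state in clade A, state in clade B, six outgroup states).
toy_sites = {
    "site_1": ("A", "G", ["G", "G", "G", "G", "G", "G"]),
    "site_2": ("C", "T", ["C", "C", "C", "C", "C", "T"]),
    "site_3": ("T", "C", ["A", "A", "A", "A", "A", "A"]),
}

for name, (state_a, state_b, outgroup) in toy_sites.items():
    print(name, infer_ancestral(state_a, state_b, outgroup))
```

For `site_1` the mutation is assigned to the branch leading to clade A, for `site_2` to the branch leading to clade B, and `site_3` stays unresolved because the outgroup matches neither state.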
We present the most resolved human mtDNA phylogeny by compiling the information from 18,843 mitochondrial genomes, of which 10,627 were previously summarized in PhyloTree Build 13 (28 Dec 2011).7 We followed the established cladistic notation for haplogroup labeling adjusted for complete mtDNA genomes.7,39 Yet, in contrast with the previously reported phylogeny, all mutational changes noted on the branches of the tree indicate the actual descendant nucleotide state relative to the state in the RSRS. Although this has no effect on the tree topology per se, it is critical to emphasize its major consequences in the way of reporting the list of mutations
  • 96. denoting an mtDNA haplotype. Accordingly, although the HVS-I haplotype of a nodal haplogroup H2a2a1 mitogenome will show no differences when compared to the rCRS, its differentiation relative to the RSRS is now documented by the transitions A16129G, T16187C, C16189T, T16223C, G16230A, T16278C and C16311T. This common practice of expressing haplotypes as a string of differences from the rCRS (Figure 1) led many inexperienced readers, for instance, to incorrectly take the “fact” that African haplogroup L mitogenomes have more substitutions separating them from the rCRS than western Eurasian haplogroup H mitogenomes as “proof” of an African origin for all contemporary humans.

Indications for Violation of the Molecular Clock

The accepted notion of a molecular clock means that contemporary mtDNA haplotypes should show statistically insignificant differences in the number of accumulated mutations from the RSRS.40 Triggered by the suggested change in the reference sequence that facilitates substitution counts from the ancestral root, we further evaluated this hypothesis. The range of substitution counts separating contemporary mitogenomes belonging to major haplogroups from the RSRS is shown in Figure S2. The mean distance is 57.1 substitutions, the median is 56, and the empirical standard deviation is 5.9. Widely different distances ranging from 41 substitutions in some L0d1a1 mitogenomes to 77 in some L2b1a mitogenomes are observed. Interestingly, the ranges of substitution counts within haplogroups M and N, which are hallmarks of the relatively recent out-of-Africa exodus of humans, are also very large. For example, within M there are two mitogenomes with 43 substitutions (in M30a and M44) and two mitogenomes with as many as 71 substitutions (in M2b1b and M7b3a). This is especially striking because the path from the RSRS to the root of M already contains 39 substitutions.
Hence, the difference between the M root and its M44 descendant is only four substitutions (two in the coding region and two in the control region) as compared to 32 substitutions in the M2b1b and M7b3a mitogenomes. These observations raise the possibility that the tree in general, and haplogroup M in particular, might not adhere uniformly to the assumed molecular clock, under which substitutions occur at a fixed rate on all branches of the tree over time. We evaluated this scenario by performing generalized likelihood ratio tests of the molecular clock by using PAML33 on subsets of samples from the entire tree, on haplogroup L2 (following past evidence of clock violations in this haplogroup40) and on the sister haplogroups M and N. Our results demonstrate violations of the molecular clock in M (0.00015 ≤ p value ≤ 0.0003 for the χ² GLR test in three different analyses) and give mixed results for the entire tree (p = 0.005 and p = 0.018 for two analyses, which might be sensitive to the parts of the tree randomly sampled) and L2 (GLR χ² p value = 5 × 10⁻⁵ and p value = 0.033 for two analyses) and borderline results in N (GLR χ² p value = 0.049 and p value = 0.054 in two analyses). We are currently unable to offer well-founded explanations for these findings, which remain the scope of future studies.

As the clock violation was observed only in a restricted number of specified cases, we applied the best available tools for estimating the ages of ancestral nodes. We adopted a conventional calculation approach and mutation rate32 and used PAML 4.4 to generate maximum likelihood estimates for internal node ages under a molecular clock assumption.33 Figure 2 displays the phylogeny and density of extant haplogroups as a function of both the number of substitutions occurring since the RSRS and the estimated coalescence times.

Approaching a Perfect Phylogeny

The mitochondrial genomes released herein almost double the number of sequences that were previously available.
[Figure 2 graphic: rooted tree with branches L0, L1, L5, L2, L6, L4, L3, M, N, and R and the positions of the rCRS and RSRS marked; upper x axis: substitutions since RSRS; lower x axis: KYBP; y axis: mtDNA haplogroups.]

Figure 2. Human mtDNA Phylogeny
A schematic representation of the most parsimonious human mtDNA phylogeny inferred from 18,843 complete mtDNA sequences with the structure shown explicitly for bifurcations that occurred 40,000 years before present (YBP) or earlier, and a graph showing the explosion of haplogroups since then. The y axis indicates the approximate number of haplogroups from each time layer that have survived to the present. The upper and lower x axes of the rooted tree are scaled according to the number of accumulated mutations since the RSRS and the corresponding coalescence ages, respectively.

Despite the fact that the sequences released in this study are not equally representative of all human populations but are mainly from donors of western Eurasian matrilineal ancestry, a few additional advantages arise from this combined data. First, an almost final level of resolution for a number of western Eurasian clades was achieved, and the nodes of ancestral and derived haplogroups are often differentiated by a single mutation. For example, Figure 3
  • 97. compares the resolution of haplogroup H4 as first41 and as currently resolved. This comprehensive level of resolution minimizes the chance of additional nomenclature issues arising in future studies. Second, the highly resolved phy- logeny is a powerful tool for quality assessment.29,42–44 Mapping any additional complete mtDNA haplotype to such highly resolved phylogeny will highlight potential sequencing errors and problems such as sample mix- up, contamination, and typographical errors. Third, the phylogeny itself is a useful resource for future evolutionary, clinical, and forensic studies.45–51 Discussion Thirty-one years ago, Anderson and colleagues27 published the first complete sequence of human mtDNA. This became the reference sequence in multidisciplinary studies that revolutionized human genetics, leading, for instance, to the concept of ‘‘late-out-of-Africa’’ (‘‘African Eve’’) peopling of the world by modern humans,17,18 the identi- fication of a wide range of pathological mtDNA muta- tions,52,53 and the possibility of reconstructing the origins and the relationships of modern as well as ancient popula- tions.12,14,54 The publication of globally selected complete mtDNA genomes about 10 years ago marked the beginning of the genomic era in this field.4 Since then, progress has been impressive. Most admirable is the penetration of the principles applied in the field of archaeogenetics to hundreds of thousands of people around the world who became interested in their matrilineal descent. In fact, in this paper we add information from more than 8,000 complete mtDNA sequences resulting largely from the curiosity and enthusiasm of lay people to the ~10,000 publicly available complete mtDNA sequences. However, as discussed above, the entire field faces a problem: the traditional manner of reporting variation observed in human mitochondrial genome sequences is, to be blunt, conceptually incorrect. 
Supported by a consensus of many colleagues and after a few years of hesitation, we have reached the conclusion that on the verge of the deep-sequencing revolution,47,55 when perhaps tens of thousands of additional complete mtDNA sequences are expected to be generated over the next few years, the principal change we suggest cannot be postponed any longer: an ancestral rather than a ‘‘phylogenetically peripheral’’ and modern mitogenome from Europe should serve as the epicenter of the human mtDNA reference system. Inevitably, the proposed change could raise some temporary inconveniences. For this reason, we provide tables and software to aid data transition. What we propose is much more than a mere clerical change. We use the Ptolemaic geocentric versus Copernican heliocentric systems as a metaphor.

Figure 3. Haplogroup H4 Internal Cladistic Structure. (Left) Haplogroup H4 as first reported.41 Mutations in bold were considered diagnostic for the haplogroup. (Right) Haplogroup H4 as currently resolved with a total of 236 H4 mitogenomes. An almost perfect resolution of the nested hierarchy is achieved. Additional haplogroups suggested herein are shown in yellow. Control-region mutations are noted in blue.

  • 98. And the metaphor extends further: as the acceptance of the heliocentric system circumvented epicycles in the orbits of planets, switching the mtDNA reference to an ancestral RSRS will end an academically inadmissible conjuncture where virtually all mitochondrial genome sequences are scored in part from derived-to-ancestral states and in part from ancestral-to-derived states. We aim to trigger the radical but necessary change in the way mtDNA mutations are reported relative to their ancestral versus derived status, thus establishing an intellectual cohesiveness with the current consensus of shared common ancestry of all contemporary human mitochondrial genomes. Note that the problem is not restricted to mtDNA. Indeed, in the much larger perspective of complete nuclear genomes in which comparisons are often currently made relative to modern human reference sequences, often of European origin, it seems worthwhile to begin considering, as valuable alternatives, public reference sequences of ancestral alleles (common in all primates) whereby derived alleles (common to some human populations) would be distinguished.

Supplemental Data

Supplemental Data include two figures, six tables, and one sequence and can be found with this article online at http://www.cell.com/AJHG/.

Acknowledgments

We thank the genealogical community for donating their privately obtained complete mtDNA sequences for scientific studies and FamilyTreeDNA for compiling the data. We thank FamilyTreeDNA for supporting the establishment of the herein released website. We thank Eileen Krauss-Murphy of FamilyTreeDNA for help with assembly of the database. We thank Rebekah Canada and William R. Hurst for help with the assembly of haplogroup H and K samples, respectively. R.V. and D.M.B. thank the European Commission, Directorate-General for Research for FP7 Ecogene grant 205419. D.M.B. is a shareholder of FamilyTreeDNA and a member of its scientific advisory board. R.V. and M.M. thank the European Union, Regional Development Fund for a Centre of Excellence in Genomics grant, and R.V.
thanks the Swedish Collegium for Advanced Studies for support during the initial stage of this study. M.M. thanks Estonian Science Foundation for grant 8973. A.T. received support from Fondazione Alma Mater Ticinensis and the Italian Ministry of Education, University and Research: Progetti Ricerca Interesse Nazionale 2009. S.R. thanks the Israeli Science Foundation for grant 1227/09 and IBM for an Open Collaborative Research grant. FCT, the Portuguese Foundation for Science and Technology, partially supported this work through the personal grant N.M.S. (SFRH/BD/69119/2010). Instituto de Patologia e Imunologia Molecular da Universidade do Porto is an Associate Laboratory of the Portuguese Ministry of Science, Technology and Higher Education and is partially supported by the Portuguese Foundation for Science and Technology.

Received: January 9, 2012
Revised: February 22, 2012
Accepted: March 2, 2012
Published online: April 5, 2012

Web Resources

The URLs for data presented herein are as follows:
FASTmtDNA, http://www.mtdnacommunity.org
mtDNAble, http://www.mtdnacommunity.org
mtPhyl, http://eltsov.org/mtphyl.aspx
PhyloTree, http://www.phylotree.org

Accession Numbers

The 4,265 complete mtDNA sequences reported herein have been submitted to GenBank (accession numbers JQ701803–JQ706067).

References

1. Darwin, C. (1859). Natural Selection. On the Origin of Species by Means of Natural Selection, or, The Preservation of Favoured Races in the Struggle for Life, Chapter 4 (London: John Murray).
2. Delsuc, F., Brinkmann, H., and Philippe, H. (2005). Phylogenomics and the reconstruction of the tree of life. Nat. Rev. Genet. 6, 361–375.
3. Kivisild, T., Metspalu, E., Bandelt, H.J., Richards, M., and Villems, R. (2006). The world mtDNA phylogeny. In Human mitochondrial DNA and the evolution of Homo sapiens, H.J. Bandelt, V. Macaulay, and M. Richards, eds. (Berlin: Springer-Verlag), pp. 149–179.
4. Ingman, M., Kaessmann, H., Pääbo, S., and Gyllensten, U. (2000). Mitochondrial genome variation and the origin of modern humans. Nature 408, 708–713.
5. Richards, M., and Macaulay, V. (2001). The mitochondrial gene tree comes of age. Am. J. Hum. Genet. 68, 1315–1320.
6. Torroni, A., Achilli, A., Macaulay, V., Richards, M., and Bandelt, H.J. (2006). Harvesting the fruit of the human mtDNA tree. Trends Genet. 22, 339–345.
7. van Oven, M., and Kayser, M. (2009). Updated comprehensive phylogenetic tree of global human mitochondrial DNA variation. Hum. Mutat. 30, E386–E394.
8. Underhill, P.A., and Kivisild, T. (2007). Use of Y chromosome and mitochondrial DNA population structure in tracing human migrations. Annu. Rev. Genet. 41, 539–564.
9. Salas, A., Bandelt, H.J., Macaulay, V., and Richards, M.B. (2007). Phylogeographic investigations: The role of trees in forensic genetics. Forensic Sci. Int. 168, 1–13.
10. Shriver, M.D., and Kittles, R.A. (2004). Genetic ancestry and the search for personalized genetic histories. Nat. Rev. Genet. 5, 611–618.
11. Taylor, R.W., and Turnbull, D.M. (2005). Mitochondrial DNA mutations in human disease. Nat. Rev. Genet. 6, 389–402.
12. Gilbert, M.T., Kivisild, T., Grønnow, B., Andersen, P.K., Metspalu, E., Reidla, M., Tamm, E., Axelsson, E., Götherström, A., Campos, P.F., et al. (2008). Paleo-Eskimo mtDNA genome reveals matrilineal discontinuity in Greenland. Science 320, 1787–1789.
13. Gilbert, M.T., Hansen, A.J., Willerslev, E., Rudbeck, L., Barnes, I., Lynnerup, N., and Cooper, A. (2003). Characterization of genetic miscoding lesions caused by postmortem damage. Am. J. Hum. Genet. 72, 48–61.
14. Haak, W., Forster, P., Bramanti, B., Matsumura, S., Brandt, G., Tänzer, M., Villems, R., Renfrew, C., Gronenborn, D., Alt, K.W., and Burger, J. (2005). Ancient DNA from the first European farmers in 7500-year-old Neolithic sites. Science 310, 1016–1018.
  • 99. 15. Denaro, M., Blanc, H., Johnson, M.J., Chen, K.H., Wilmsen, E., Cavalli-Sforza, L.L., and Wallace, D.C. (1981). Ethnic variation in Hpa 1 endonuclease cleavage patterns of human mitochondrial DNA. Proc. Natl. Acad. Sci. USA 78, 5768–5772.
16. Brown, W.M. (1980). Polymorphism in mitochondrial DNA of humans as revealed by restriction endonuclease analysis. Proc. Natl. Acad. Sci. USA 77, 3605–3609.
17. Cann, R.L., Stoneking, M., and Wilson, A.C. (1987). Mitochondrial DNA and human evolution. Nature 325, 31–36.
18. Vigilant, L., Stoneking, M., Harpending, H., Hawkes, K., and Wilson, A.C. (1991). African populations and the evolution of human mitochondrial DNA. Science 253, 1503–1507.
19. Richards, M., Côrte-Real, H., Forster, P., Macaulay, V., Wilkinson-Herbots, H., Demaine, A., Papiha, S., Hedges, R., Bandelt, H.J., and Sykes, B. (1996). Paleolithic and neolithic lineages in the European mitochondrial gene pool. Am. J. Hum. Genet. 59, 185–203.
20. Torroni, A., Bandelt, H.J., D’Urbano, L., Lahermo, P., Moral, P., Sellitto, D., Rengo, C., Forster, P., Savontaus, M.L., Bonné-Tamir, B., and Scozzari, R. (1998). mtDNA analysis reveals a major late Paleolithic population expansion from southwestern to northeastern Europe. Am. J. Hum. Genet. 62, 1137–1152.
21. Torroni, A., Schurr, T.G., Cabell, M.F., Brown, M.D., Neel, J.V., Larsen, M., Smith, D.G., Vullo, C.M., and Wallace, D.C. (1993). Asian affinities and continental radiation of the four founding Native American mtDNAs. Am. J. Hum. Genet. 53, 563–590.
22. Behar, D.M., Villems, R., Soodyall, H., Blue-Smith, J., Pereira, L., Metspalu, E., Scozzari, R., Makkan, H., Tzur, S., Comas, D., et al; Genographic Consortium. (2008). The dawn of human matrilineal diversity. Am. J. Hum. Genet. 82, 1130–1140.
23. Briggs, A.W., Good, J.M., Green, R.E., Krause, J., Maricic, T., Stenzel, U., Lalueza-Fox, C., Rudan, P., Brajkovic, D., Kucan, Z., et al. (2009). Targeted retrieval and analysis of five Neandertal mtDNA genomes. Science 325, 318–321.
24. Green, R.E., Malaspinas, A.S., Krause, J., Briggs, A.W., Johnson, P.L., Uhler, C., Meyer, M., Good, J.M., Maricic, T., Stenzel, U., et al. (2008). A complete Neandertal mitochondrial genome sequence determined by high-throughput sequencing. Cell 134, 416–426.
25. Kivisild, T., Shen, P., Wall, D.P., Do, B., Sung, R., Davis, K., Passarino, G., Underhill, P.A., Scharfe, C., Torroni, A., et al. (2006). The role of selection in the evolution of human mitochondrial genomes. Genetics 172, 373–387.
26. Kivisild, T., Reidla, M., Metspalu, E., Rosa, A., Brehm, A., Pennarun, E., Parik, J., Geberhiwot, T., Usanga, E., and Villems, R. (2004). Ethiopian mitochondrial DNA heritage: Tracking gene flow across and around the gate of tears. Am. J. Hum. Genet. 75, 752–770.
27. Anderson, S., Bankier, A.T., Barrell, B.G., de Bruijn, M.H., Coulson, A.R., Drouin, J., Eperon, I.C., Nierlich, D.P., Roe, B.A., Sanger, F., et al. (1981). Sequence and organization of the human mitochondrial genome. Nature 290, 457–465.
28. Andrews, R.M., Kubacka, I., Chinnery, P.F., Lightowlers, R.N., Turnbull, D.M., and Howell, N. (1999). Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNA. Nat. Genet. 23, 147.
29. Yao, Y.G., Salas, A., Bravi, C.M., and Bandelt, H.J. (2006). A reappraisal of complete mtDNA variation in East Asian families with hearing impairment. Hum. Genet. 119, 505–515.
30. Pello, R., Martín, M.A., Carelli, V., Nijtmans, L.G., Achilli, A., Pala, M., Torroni, A., Gómez-Durán, A., Ruiz-Pesini, E., Martinuzzi, A., et al. (2008). Mitochondrial DNA background modulates the assembly kinetics of OXPHOS complexes in a cellular model of mitochondrial disease. Hum. Mol. Genet. 17, 4001–4011.
31. Bandelt, H.J., and Parson, W. (2008). Consistent treatment of length variants in the human mtDNA control region: A reappraisal. Int. J. Legal Med. 122, 11–21.
32. Soares, P., Ermini, L., Thomson, N., Mormina, M., Rito, T., Röhl, A., Salas, A., Oppenheimer, S., Macaulay, V., and Richards, M.B. (2009). Correcting for purifying selection: An improved human mitochondrial molecular clock. Am. J. Hum. Genet. 84, 740–759.
33. Yang, Z. (2007). PAML 4: Phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24, 1586–1591.
34. Tang, S., and Huang, T. (2010). Characterization of mitochondrial DNA heteroplasmy using a parallel sequencing system. Biotechniques 48, 287–296.
35. Li, M., Schönberg, A., Schaefer, M., Schroeder, R., Nasidze, I., and Stoneking, M. (2010). Detecting heteroplasmy from high-throughput sequencing of complete human mitochondrial DNA genomes. Am. J. Hum. Genet. 87, 237–249.
36. Zaragoza, M.V., Fass, J., Diegoli, M., Lin, D., and Arbustini, E. (2010). Mitochondrial DNA variant discovery and evaluation in human cardiomyopathies through next-generation sequencing. PLoS ONE 5, e12295.
37. Mishmar, D., Ruiz-Pesini, E., Golik, P., Macaulay, V., Clark, A.G., Hosseini, S., Brandon, M., Easley, K., Chen, E., Brown, M.D., et al. (2003). Natural selection shaped regional mtDNA variation in humans. Proc. Natl. Acad. Sci. USA 100, 171–176.
38. Green, R.E., Krause, J., Briggs, A.W., Maricic, T., Stenzel, U., Kircher, M., Patterson, N., Li, H., Zhai, W., Fritz, M.H., et al. (2010). A draft sequence of the Neandertal genome. Science 328, 710–722.
39. Richards, M.B., Macaulay, V.A., Bandelt, H.J., and Sykes, B.C. (1998). Phylogeography of mitochondrial DNA in western Europe. Ann. Hum. Genet. 62, 241–260.
40. Torroni, A., Rengo, C., Guida, V., Cruciani, F., Sellitto, D., Coppa, A., Calderon, F.L., Simionati, B., Valle, G., Richards, M., et al. (2001). Do the four clades of the mtDNA haplogroup L2 evolve at different rates? Am. J. Hum. Genet. 69, 1348–1356.
41. Achilli, A., Rengo, C., Magri, C., Battaglia, V., Olivieri, A., Scozzari, R., Cruciani, F., Zeviani, M., Briem, E., Carelli, V., et al. (2004). The molecular dissection of mtDNA haplogroup H confirms that the Franco-Cantabrian glacial refuge was a major source for the European gene pool. Am. J. Hum. Genet. 75, 910–918.
42. Parson, W., and Bandelt, H.J. (2007). Extended guidelines for mtDNA typing of population data in forensic science. Forensic Sci. Int. Genet. 1, 13–19.
43. Salas, A., Carracedo, A., Macaulay, V., Richards, M., and Bandelt, H.J. (2005). A practical guide to mitochondrial DNA error prevention in clinical, forensic, and population genetics. Biochem. Biophys. Res. Commun. 335, 891–899.
44. Bandelt, H.J., Lahermo, P., Richards, M., and Macaulay, V. (2001). Detecting errors in mtDNA data by phylogenetic analysis. Int. J. Legal Med. 115, 64–69.
45. Ballantyne, K.N., van Oven, M., Ralf, A., Stoneking, M., Mitchell, R.J., van Oorschot, R.A., and Kayser, M. (2011). MtDNA SNP multiplexes for efficient inference of matrilineal genetic ancestry within Oceania. Forensic Sci. Int. Genet., in press. Published online September 20, 2011. 10.1016/j.fsigen.2011.08.010.
  • 100. 46. Pereira, L., Soares, P., Radivojac, P., Li, B., and Samuels, D.C. (2011). Comparing phylogeny and the predicted pathogenicity of protein variations reveals equal purifying selection across the global human mtDNA diversity. Am. J. Hum. Genet. 88, 433–439.
47. Behar, D.M., Harmant, C., Manry, J., van Oven, M., Haak, W., Martinez-Cruz, B., Salaberria, J., Oyharçabal, B., Bauduer, F., Comas, D., and Quintana-Murci, L.; Genographic Consortium. (2012). The Basque paradigm: Genetic evidence of a maternal continuity in the Franco-Cantabrian Region since pre-Neolithic times. Am. J. Hum. Genet. 90, 486–493.
48. Zeviani, M., and Carelli, V. (2007). Mitochondrial disorders. Curr. Opin. Neurol. 20, 564–571.
49. Gunnarsdóttir, E.D., Nandineni, M.R., Li, M., Myles, S., Gil, D., Pakendorf, B., and Stoneking, M. (2011). Larger mitochondrial DNA than Y-chromosome differences between matrilocal and patrilocal groups from Sumatra. Nat. Commun. 2, 228.
50. Baum, D.A., Smith, S.D., and Donovan, S.S. (2005). Evolution. The tree-thinking challenge. Science 310, 979–980.
51. Behar, D.M., Metspalu, E., Kivisild, T., Rosset, S., Tzur, S., Hadid, Y., Yudkovsky, G., Rosengarten, D., Pereira, L., Amorim, A., et al. (2008). Counting the founders: The matrilineal genetic ancestry of the Jewish Diaspora. PLoS ONE 3, e2062.
52. Wallace, D.C., Singh, G., Lott, M.T., Hodge, J.A., Schurr, T.G., Lezza, A.M., Elsas, L.J., 2nd, and Nikoskelainen, E.K. (1988). Mitochondrial DNA mutation associated with Leber’s hereditary optic neuropathy. Science 242, 1427–1430.
53. MITOMAP. (2011). A Human Mitochondrial Genome Database. http://www.mitomap.org.
54. Quintana-Murci, L., Harmant, C., Quach, H., Balanovsky, O., Zaporozhchenko, V., Bormans, C., van Helden, P.D., Hoal, E.G., and Behar, D.M. (2010). Strong maternal Khoisan contribution to the South African coloured population: A case of gender-biased admixture. Am. J. Hum. Genet. 86, 611–620.
55. Schönberg, A., Theunert, C., Li, M., Stoneking, M., and Nasidze, I. (2011). High-throughput sequencing of complete human mtDNA genomes from the Caucasus and West Asia: High diversity and demographic inferences. Eur. J. Hum. Genet. 19, 988–994.
  • 101. ARTICLE

Age-Related Somatic Structural Changes in the Nuclear Genome of Human Blood Cells

Lars A. Forsberg,1 Chiara Rasi,1 Hamid R. Razzaghian,1 Geeta Pakalapati,1 Lindsay Waite,2 Krista Stanton Thilbeault,2 Anna Ronowicz,3 Nathan E. Wineinger,4 Hemant K. Tiwari,4 Dorret Boomsma,5 Maxwell P. Westerman,6 Jennifer R. Harris,7 Robert Lyle,8 Magnus Essand,1 Fredrik Eriksson,1 Themistocles L. Assimes,9 Carlos Iribarren,10 Eric Strachan,11 Terrance P. O’Hanlon,12 Lisa G. Rider,12 Frederick W. Miller,12 Vilmantas Giedraitis,13 Lars Lannfelt,13 Martin Ingelsson,13 Arkadiusz Piotrowski,3 Nancy L. Pedersen,14 Devin Absher,2 and Jan P. Dumanski1,*

Structural variations are among the most frequent interindividual genetic differences in the human genome. The frequency and distribution of de novo somatic structural variants in normal cells is, however, poorly explored. Using age-stratified cohorts of 318 monozygotic (MZ) twins and 296 single-born subjects, we describe age-related accumulation of copy-number variation in the nuclear genomes in vivo and frequency changes for both megabase- and kilobase-range variants. Megabase-range aberrations were found in 3.4% (9 of 264) of subjects ≥60 years old; these subjects included 78 MZ twin pairs and 108 single-born individuals. No such findings were observed in 81 MZ pairs or 180 single-born subjects who were ≤55 years old. Recurrent region- and gene-specific mutations, mostly deletions, were observed. Longitudinal analyses of 43 subjects whose data were collected 7–19 years apart suggest considerable variation in the rate of accumulation of clones carrying structural changes. Furthermore, the longitudinal analysis of individuals with structural aberrations suggests that there is a natural self-removal of aberrant cell clones from peripheral blood. In three healthy subjects, we detected somatic aberrations characteristic of patients with myelodysplastic syndrome. The recurrent rearrangements uncovered here are candidates for common age-related defects in human blood cells. We anticipate that extension of these results will allow determination of the genetic age of different somatic-cell lineages and estimation of possible individual differences between genetic and chronological age. Our work might also help to explain the cause of an age-related reduction in the number of cell clones in the blood; such a reduction is one of the hallmarks of immunosenescence.

Introduction

Structural changes in the human genome have been identified as one of the major types of interindividual genetic variation.1,2 Furthermore, the rate of formation of copy-number variants (CNVs) exceeds the corresponding rate of SNPs by 2–4 orders of magnitude.3–5 In spite of this, little is known about the rate of formation and distribution of de novo somatic CNVs in normal cells and whether these aberrations accumulate with age. There are, however, indications that chromosomal remodeling in the nuclear and mitochondrial genomes increases with age.6–12 Theoretical predictions suggest that somatic mosaicism should be widespread,13,14 and reviews in the field point out that somatic mosaicism, in both healthy and diseased cells, is an understudied aspect of human-genome biology.15–18 A recent estimate of 1.7% for the frequency with which somatic mosaicism causes large-scale structural aberrations in adult human samples is, however, a relatively low number.19 We have shown that adult monozygotic (MZ) twins and differentiated human tissues frequently display somatic CNVs.20,21 We therefore hypothesized that the nuclear genome of blood cells in vivo might accumulate CNVs with age, and we used age-stratified MZ twins as a starting point for testing this hypothesis. Because nuclear genomes of MZ twins are identical at conception, they represent a good model for studying somatic variation.
We replicated an MZ-twin-based analysis by using age-stratified cohorts of single-born subjects. Using these resources, we show age-related accumulation of CNVs in the nuclear genomes of blood cells in vivo. Age effects were found for both megabase- and kilobase-range variants.

1 Department of Immunology, Genetics and Pathology, Rudbeck Laboratory, Uppsala University, 75185 Uppsala, Sweden; 2 HudsonAlpha Institute for Biotechnology, 601 Genome Way, Huntsville, AL 35806, USA; 3 Department of Biology and Pharmaceutical Botany, Medical University of Gdansk, Hallera 107, 80-416 Gdansk, Poland; 4 Section on Statistical Genetics, Department of Biostatistics, Ryals Public Health Building, University of Alabama at Birmingham, Suite 327, Birmingham, AL 35294-0022, USA; 5 Department of Biological Psychology, VU University, Van der Boechorststraat 1, 1081 BT Amsterdam, The Netherlands; 6 Hematology Research, Mount Sinai Hospital Medical Center, 1500 S California Avenue, Chicago, IL 60608, USA; 7 Department of Genes and Environment, Division of Epidemiology, The Norwegian Institute of Public Health, P.O. Box 4404 Nydalen, N-0403 Oslo, Norway; 8 Department of Medical Genetics, Oslo University Hospital, Kirkeveien 166, 0407 Oslo, Norway; 9 Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA; 10 Kaiser Foundation Research Institute, Oakland, CA 94612, USA; 11 Department of Psychiatry and Behavioral Sciences and University of Washington Twin Registry, University of Washington, Box 359780 Seattle, WA 98104, USA; 12 Environmental Autoimmunity Group, National Institute of Environmental Health Sciences, National Institutes of Health Clinical Research Center, National Institutes of Health, Building 10, Room 4-2352, 10 Center Drive, MSC 1301, Bethesda, MD 20892-1301, USA; 13 Department of Public Health and Caring Sciences, Division of Molecular Geriatrics, Rudbeck Laboratory, Uppsala University, 751 85 Uppsala, Sweden; 14 Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, SE-171 77 Stockholm, Sweden

*Correspondence: jan.dumanski@igp.uu.se
DOI 10.1016/j.ajhg.2011.12.009. ©2012 by The American Society of Human Genetics. All rights reserved.
The American Journal of Human Genetics 90, 217–228, February 10, 2012 217
  • 102. Material and Methods

Studied Cohorts, DNA Isolation, and Quality Control

Samples were collected with informed consent from all subjects, and the study was approved by the respective local institutional review boards or research ethics committees. The information about studied cohorts of MZ twins and single-born subjects is provided in Tables S1 and S2, available online. We isolated DNA from peripheral blood by using the QIAGEN kit (QIAGEN, Hilden, Germany). The quality, quantity, and integrity of DNA samples were controlled with NanoDrop (Thermo Fisher Scientific, Waltham, MA, USA), picoGreen fluorescent assay (Invitrogen, Eugene, Oregon, USA), and agarose gels.

Sorting of Subpopulations of Cells from Peripheral Blood and Culturing of Fibroblasts

Peripheral blood mononuclear cells (PBMCs) were isolated from the whole blood with Ficoll-Paque centrifugation (Amersham Biosciences, Uppsala, Sweden), and a mixture of granulocytes was collected from under the PBMC layer. We isolated CD19+ cells from PBMCs by positive selection with CD19 MicroBeads (Miltenyi Biotech, Auburn, CA, USA). First, we negatively selected CD4+ cells by using the CD4+ T cell Isolation Kit II (Miltenyi Biotech, Auburn, CA, USA), and then we positively selected the cells by using CD4 MicroBeads (Miltenyi Biotech, Auburn, CA, USA). The CD19+ and CD4+ cells were incubated for 30 min at 4°C with phycoerythrin- and PerCP-conjugated antibodies (BD Biosciences, San Diego, CA, USA), respectively, for fluorescence-activated cell sorting (FACS) analysis. We measured purities of >90% for CD19+ and >98% for CD4+ cells by flow cytometry (FACS CantoII, BD Biosciences, San Diego, CA, USA). The skin-biopsy-derived fibroblasts were cultured in RPMI medium supplemented with Hams F-10 medium, fetal bovine serum (10%), penicillin, and L-glutamine (all cell culture reagents were from GIBCO, Invitrogen, Paisley, UK) in an incubator at 37°C.
After reaching ~90% confluence, the cells were trypsinized (Trypsin-EDTA, GIBCO, Invitrogen, Paisley, UK), and the fibroblasts were used for DNA isolation. We performed a standard phenol-chloroform extraction to isolate DNA from CD19+ cells, CD4+ cells, fibroblasts, and the crude granulocyte fraction.

Genotyping with Illumina SNP Arrays and Calling of Large-Scale CNVs

We performed the SNP genotyping experiments by using several types of Illumina beadchips according to the recommendations of the manufacturer. The experiments were performed at two facilities: Hudson Alpha Institute for Biotechnology (Huntsville, AL, USA) and the SNP Technology Platform (Uppsala University, Sweden). All Illumina genotyping experiments passed the following quality-control criteria: the SNP call rate for all samples was >98%, and the LogRdev value was <0.2. The results from Illumina SNP arrays consist of two main data tracks: log R ratio (LRR) and B-allele frequency (BAF)22 (see Figure 1). Deviations of consecutive probes from normal states are indicative of structural aberrations. We analyzed Illumina output files by using Nexus Copy Number version 5.1 (BioDiscovery, CA, USA), which applies a "Rank Segmentation" algorithm based on the circular binary segmentation (CBS) approach.23 The applied version, "SNPRank Segmentation," an extended algorithm in which BAF values are also included in the segmentation process, generated both copy-number and allelic-event calls. We applied the default calling parameters of the program. The array data for large-scale CNVs reported in this paper have been submitted to the Database of Genomic Structural Variation (dbVAR) under the accession number nstd58.

A Method for Detection of Small-Scale CNVs with Illumina SNP Array Data

We developed and applied an algorithm for testing whether smaller structural variants would also accumulate with age.
We used deviations in BAF as the main tool for detecting candidate CNV regions because BAF can detect mosaicism in as few as 5%–7% of cells24,25 and allows uncovering of deletions and duplications as well as copy-number-neutral loss of heterozygosity (CNNLOH). This method uses an in-house-developed R-script26 to perform scans for deviations in BAF values alone and in BAF values together with LRR values in MZ twins. Figure S1 describes this algorithm, which identifies CNV calls for each MZ pair at user-defined thresholds of either ΔBAF or both ΔBAF and ΔLRR. Our initial tests of the algorithm were based on the entire cohort of 159 MZ pairs. However, a series of "trial and error" tests suggested that the method is sensitive to the quality of input data, given that the results were heavily biased toward detection of putative CNV calls in MZ co-twins with lower quality of genotyping, as measured by the Nexus Quality (NQ) score, a feature of the Nexus Copy Number software. We therefore defined strict NQ-score-based criteria for inclusion of MZ pairs in the analysis (see Table S3 and Figure S1), which resulted in the selection of 87 pairs that were processed further. We based the final analysis on these 87 twin pairs by identifying candidate CNV loci in which BAF values differed between co-twins at multiple thresholds. As expected, the number of putative CNV calls between MZ co-twins was highly dependent on the settings of the ΔBAF filtering (Figures S1–S4). When the settings were too generous in this step, an age-related signal was hidden in large background variation (Figure S2). By using stricter filtering criteria, we found an age-related correlation (Figures 2A and S4C). We trimmed the list of putative CNVs generated by ΔBAF by using a ΔLRR filter of >0.35 so that only loci with differences in both BAF and LRR remained in the final list (Figures 2B and S4D).
Hence, the ΔLRR filter removed all loci with copy-number-neutral variation from the list. In the course of tuning the ΔBAF (or both ΔBAF and ΔLRR) filtering parameters, we took advantage of three already-known large-scale aberrations that are described in our dataset (Figures 1A–1F, 3, and S5). These worked as ideal internal controls for the validity of our approach, as shown in Figures S2–S4. By plotting the number of calls both including the probes located within the three known aberrations (Figures S2A–S2B, S3A–S3B, and S4A–S4B) and after excluding those probes (Figures S2C–S2D, S3C–S3D, and S4C–S4D), we could compare and evaluate the observed and expected results. For example, in Figure S4B, the twin pair TP25-1/TP25-2 stands out because the probes positioned within the large de novo aberration of chromosome 5 (Figure 1) are included in the list of calls. When plotting the same data after excluding probes within this region, we found that the twin pair falls into the cluster of variation similar to that of the other MZ twin pairs (Figure S4D). On the basis of such evaluations, we observed that probes within the three large-scale CNVs were detected (or not, depending on the input file used in the analysis) as predicted by our ΔBAF and ΔLRR algorithm. Therefore, these evaluations provided an internal validation of our approach to detecting de novo small-scale CNVs.
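The pairwise filtering described in this subsection can be illustrated with a minimal Python sketch. It is not the authors' R script (that algorithm is given in Figure S1); it only assumes that ΔBAF and ΔLRR are absolute per-probe differences between co-twins, and it borrows the thresholds quoted in the Figure 2 legend (0.2 < ΔBAF < 0.45, ΔLRR > 0.35). The function name and the toy values are hypothetical.

```python
from typing import List, Sequence


def candidate_probes(
    baf_twin1: Sequence[float],
    baf_twin2: Sequence[float],
    lrr_twin1: Sequence[float],
    lrr_twin2: Sequence[float],
    dbaf_min: float = 0.2,
    dbaf_max: float = 0.45,
    dlrr_min: float = 0.35,
    require_lrr: bool = True,
) -> List[int]:
    """Return indices of probes that differ between two co-twins.

    Assumption: dBAF and dLRR are absolute per-probe differences.
    """
    calls = []
    for i, (b1, b2, l1, l2) in enumerate(
        zip(baf_twin1, baf_twin2, lrr_twin1, lrr_twin2)
    ):
        dbaf = abs(b1 - b2)
        dlrr = abs(l1 - l2)
        # Keep only probes whose BAF difference lies inside the window.
        if not (dbaf_min < dbaf < dbaf_max):
            continue
        # Optionally require a copy-number (LRR) difference as well.
        if require_lrr and not dlrr > dlrr_min:
            continue
        calls.append(i)
    return calls


# Toy data for four probes (hypothetical values, not from the study).
baf1 = [0.50, 0.50, 0.50, 1.00]
baf2 = [0.52, 0.80, 0.75, 0.00]
lrr1 = [0.00, 0.00, 0.00, 0.00]
lrr2 = [0.00, -0.40, -0.10, 0.00]

baf_only = candidate_probes(baf1, baf2, lrr1, lrr2, require_lrr=False)
baf_and_lrr = candidate_probes(baf1, baf2, lrr1, lrr2)
print(baf_only, baf_and_lrr)
```

In this toy case the BAF-only filter keeps two probes and the combined filter keeps one, which mirrors how the ΔLRR step drops copy-number-neutral differences from the list.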
  • 103. Figure 1. Two Examples of Megabase-Range De Novo Somatic Aberrations
(A) A normal profile of MZ twin TP25-1.
(B) A 32.5 Mb deletion on 5q is shown in nucleated blood cells of co-twin TP25-2. This deletion was uncovered with LRR data from the Illumina SNP array.
(C and D) The BAF profiles of twins TP25-1 (C) and TP25-2 (D). The qPCR experiments showed that 66.2% of nucleated blood cells in TP25-2 had the 5q deletion (i.e., 33.1% fewer copies of the DNA segment, Figure 5). The R-package-MAD (Mosaic Alteration Detection) analysis of the Illumina data suggested that 50.5% of the cells had the 5q deletion when the subjects were 77 years old.
(E) The deviation of BAF values from 0.5 (the allelic fraction of intensity at each heterozygous SNP) was plotted, and the percentage of cells with the 5q deletion was higher when the subjects were 77 years old than when they were 70 years old (t test: p < 0.001). This slow increase in aberrant clones was also supported by the MAD estimate of 48.3% of cells detected when the subjects were 70 years old. The size and position of this deletion are typical of patients with myelodysplastic syndrome (MDS).
(F) A confirmatory array-CGH experiment.
(G–K) Another large somatic event: a terminal CNNLOH encompassing 103 Mb of 4q in ULSAM-697. The LRR and BAF data from Illumina SNP genotyping of samples collected when the subject was 71, 82, 88, and 90 years old are plotted in (G), (H), (I), and (J), respectively. Percentages of cells with the aberration were calculated with the MAD package and are given for each panel.
(K) The proportion of cells with the 4q aberration changes with time, and the changes are significantly different between all samplings at different ages (ANOVA: F(3,25935) = 39087, p < 0.001; Tukey's test for multiple comparisons). Figure S8 shows other analysis details of the samples collected from ULSAM-697 when he was 90 years old. These analyses include those of fibroblasts and three types of sorted blood cells.
The analysis of samples obtained when the subject was 90 years old was performed in duplicate experiments on Illumina 1M-Duo and Omni-Express arrays.
  • 104. Design of the Nimblegen 135K Custom-Made Tiling-Path Oligonucleotide Array

This array was designed according to the instructions from Roche-Nimblegen (Madison, WI, USA) and encompassed 137,545 probes used for validation of the 138 putative CNVs detected by the Illumina SNP array (Figures 2B, S4C, and S4D). In total, the design consisted of 98,894 experimental probes and an additional 38,651 backbone control probes distributed across the genome. The median overlap of probes (i.e., probe spacing) was 30 bp. This array was applied in cohybridizations of 34 MZ twin pairs (Figures 2G, 2H, and S6 and Table S4).

Array-Comparative Genomic Hybridization with Nimblegen 720K and 135K Arrays

We performed DNA labeling for both platforms (3 × 720K and 12 × 135K) by random priming with the Nimblegen Dual-Color DNA Labeling kit (Roche-Nimblegen) according to Nimblegen's protocol. In brief, test and reference DNA (500 ng each) samples were labeled with Cy3 and Cy5, respectively. The combined test and reference DNA was cohybridized (for 48 hr at 42 °C) onto a human comparative genomic hybridization (CGH) 3 × 720K whole-genome tiling array (100718_HG18_WH_CGH_v3.1_HX3, OID:30853; Roche-Nimblegen) or a 12 × 135K custom-designed array (110131_HG18_LF_CGH_HX18, OID: 33469; Roche-Nimblegen). The arrays were washed with the Nimblegen Wash Kit. We performed image acquisition with an MS 200 Scanner at 2 µm resolution by using high-sensitivity and autogain settings. We extracted data with NimbleScan v2.6 segMNT, including spatial correction (LOESS) and qspline fit normalization, in order to compensate for differences in signal between the two dyes.27 We generated an experimental metrics report with NimbleScan v2.6 to verify hybridization quality. We performed CNV analysis with Nexus Copy Number software version 5.1 by using default settings (see above). All plots shown in Figures 2G, 2H, and S6 are derived from unaveraged, normalized raw data.
Validation Experiments Involving Quantitative Real-Time Polymerase Chain Reaction

We measured the relative amount of DNA molecules by using quantitative real-time polymerase chain reaction (qPCR) with SYBR green to validate the CNV findings from the arrays. qPCRs

[Figure 2 plots.
(A) Number of calls (0.2 < ΔBAF < 0.45) versus age of twin pairs; corr. coef. = 0.62, p < 0.001.
(B) Number of calls (0.2 < ΔBAF < 0.45, ΔLRR > 0.35) versus age of twin pairs; corr. coef. = 0.54, p < 0.001.
(C) Number of calls (0.2 < ΔBAF < 0.45) by age group; ANOVA: F(8,78) = 7.58, p < 0.001.
(D) Age group in panel C: N (MZ pairs), median age. 1: 10, 8. 2: 10, 19. 3: 9, 29. 4: 10, 65. 5: 10, 68. 6: 10, 72. 7: 10, 76. 8: 10, 78. 9: 8, 82.
(E) Longitudinal changes between twins (10 years): number of calls (0.2 < ΔBAF < 0.45) versus age at sampling.
(F) Longitudinal changes within individuals: number of calls (0.2 < ΔBAF < 0.45) versus age at second sampling.
(G) Pair TP63-1/2, position of rs4635020: BAF at ages 70 and 76 (Illumina) and log2 ratio (Nimblegen).
(H) Pair TP31-1/2, position of rs6928830: BAF (Illumina) and log2 ratio (Nimblegen).
p values printed in panels G and H: 5.85E-08 and 1.82E-10.]

Figure 2. Age-Related Accumulation of Small Somatic Structural Rearrangements in 87 Pairs of MZ Twins
(A and B) Linear regression analyses showing that the number of calls increases with age in MZ twin pairs when ΔBAF values are between 0.2 and 0.45 (A) as well as when ΔBAF values are between 0.2 and 0.45 and the LRR deviation is >0.35 (B). Each dot represents data from one MZ twin pair. Details regarding the filtering algorithms used are shown in Figure S1.
(C and D) An analysis of statistical significance for nine age groups of MZ twin pairs when ΔBAF values are between 0.2 and 0.45.
(E and F) Longitudinal data analyses comparing the number of ΔBAF reports (between 0.2 and 0.45) of 18 twin pairs that were sampled twice, 10 years apart. Each point in the plot represents the number of differences within one MZ pair (E). Each line (plotted between the two time points for the same MZ pair) thus represents the change over time of the number of differences within a pair (blue line, increase; red line, decrease; green line, no change). The intraindividual changes for each twin over a period of 10 years are shown in (F). The x axis shows individual ages at the later sampling. On the y axis, the number of differences found between the two samples from the same person at the two time points is shown, and vertical lines connect co-twins.
(G and H) Validation of copy-number imbalance between MZ twins in two pairs (chromosomes 10 and 6, respectively), which were detected by the ΔBAF analysis. The small boxes at the top of both (G) and (H) display original data from Illumina arrays for pairs TP63-1/TP63-2 and TP31-1/TP31-2, respectively. The larger boxes at the bottom of (G) and (H) display raw data from the Nimblegen tiling-path 135K array for these two twin pairs. Each line is drawn to scale and represents data from one oligonucleotide probe. Statistical significance for the results of the Nimblegen array was calculated with the Mann-Whitney U test; values were analyzed for the region of interest (shaded) and for both areas on either side of the control regions. Twenty additional examples of validation experiments are shown in Figure S6. There was no difference between the rates of validation success for the young (n = 8) and old (n = 26) MZ pairs used in these experiments (t test: t = 0.7062, p value = 0.4819), supporting the results from linear-regression analyses.
The detailed description of the Nimblegen array is provided in Figure S6 and Table S4.
  • 105. were performed in 20 µl reactions containing 5 ng genomic DNA, 0.3 µM of each primer, and 1× Maxima SYBR Green/ROX qPCR Master Mix (Fermentas, Vilnius, Lithuania) (for primer sequences, see Table S5). The reactions were incubated at 95 °C for 10 min, after which they underwent 40 cycles of 95 °C for 15 s and 60 °C for 60 s in a Stratagene Mx3000P (Agilent Technologies) machine. The reactions for evaluation of primer efficiencies were performed in duplicate with control DNA (normal human female genomic DNA, Promega Corporation, Madison, WI, USA), whereas all other reactions with test and reference DNA were performed in triplicate; in both instances, the averages were used in analyses. Each primer pair's efficiency and standard curve are described in Figure S7.

Figure 3. An Example of a Somatic Megabase-Range Aberration
(A, E, and F) A deletion encompassing 12.9 Mb of 20q in MZ twin TP30-1, sampled when she was 69 years old.
(B, G, and H) The normal profile of co-twin TP30-2, as detected by LRR and BAF after Illumina SNP array genotyping. R-package-MAD analysis of the Illumina data suggested that 41.5% of the blood cells had the 20q deletion. qPCR validation experiments confirmed this result by showing 39.6% aberrant cells (i.e., 19.8% fewer copies of the DNA segment, Figure 5).
(C and D) Array-CGH validation experiments also confirmed the copy-number variation. The genetic change in MZ twin TP30-1 is another example of an MDS-like aberration, which was uncovered in a subject without a clinical diagnosis of MDS.

Melting-curve analysis was performed in all the experiments, and the results were analyzed with MxPro v4.10 software. We used ultraconserved elements on human chromosomes 3 and 6 (UCE3 and UCE6) as control loci as previously described.28,29 We used the average cycle threshold (Ct) value of UCE6 to normalize the average Ct values of UCE3 and test loci. We used these normalized Ct values to calculate copy-number ratios of test regions.
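The text does not spell out how normalized Ct values become copy-number ratios, so the following Python sketch is only one plausible reading: the standard comparative-Ct calculation, with UCE6 as the normalizer and an assumed amplification efficiency of 2 (perfect doubling per cycle; the measured efficiencies are in Figure S7). The function name and all Ct values are hypothetical.

```python
from statistics import mean
from typing import Sequence


def copy_number_ratio(
    ct_locus_sample: Sequence[float],
    ct_uce6_sample: Sequence[float],
    ct_locus_reference: Sequence[float],
    ct_uce6_reference: Sequence[float],
    efficiency: float = 2.0,
) -> float:
    """Relative copy number of a locus in a sample versus a reference DNA.

    Assumption: comparative-Ct method, i.e. efficiency ** -(ddCt).
    """
    # Normalize each DNA's locus Ct to its own UCE6 Ct (averages of replicates).
    dct_sample = mean(ct_locus_sample) - mean(ct_uce6_sample)
    dct_reference = mean(ct_locus_reference) - mean(ct_uce6_reference)
    # A higher normalized Ct in the sample means fewer starting copies.
    return efficiency ** -(dct_sample - dct_reference)


# Hypothetical triplicate Ct values: the test locus amplifies half a cycle
# later in the sample than in the reference DNA.
ratio = copy_number_ratio(
    ct_locus_sample=[25.5, 25.4, 25.6],
    ct_uce6_sample=[25.0, 25.0, 25.0],
    ct_locus_reference=[25.0, 25.1, 24.9],
    ct_uce6_reference=[25.0, 25.0, 25.0],
)
print(f"copy-number ratio: {ratio:.3f}")
print(f"fewer copies: {100 * (1 - ratio):.1f}%")
```

The same function applied to the UCE3 control locus should return a ratio close to 1, which is the comparison the t tests in the next sentence rely on.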
Using the estimated copy-number ratios from UCE3 and the test loci from multiple replicate experiments, we performed t tests for statistical testing.

Statistical Methods

The statistical analyses were performed with the R 2.12–2.13 software.26 We used linear regression, t tests, and one-way analyses of variance (ANOVAs) when suitable, as further specified in the text. Prior to testing, we checked the data so that no test assumptions were violated. For multiple comparisons (i.e., Figures 1K and S8G), we used the Tukey honest-significant-difference method as implemented in the TukeyHSD function in R. When appropriate, we performed the nonparametric Fisher's exact test and Mann-Whitney U test, as described in the text.

Boxplots of Longitudinal-Analysis Data

Heterozygous SNPs have a theoretical expected BAF value of 0.5, and deviations from this normal state can be indicative of structural aberrations.24 We can therefore use changes in the magnitude of these deviations in the subjects' longitudinal samples to measure intraindividual changes over time and to estimate the proportion of cells affected by large-scale aberrations. We produced the boxplots in Figures 1E, 1K, 4J, S8G, S9D, and S9G to visualize such changes in BAF variation. In these figures, we plotted the absolute deviation of BAF values from 0.5 for all heterozygous SNPs in the region of interest (i.e., ABS(0.5 − BAF))
  • 106. on the y axes. We only included heterozygous SNPs (i.e., those with a BAF value between 0.2 and 0.8) in these calculations to increase the quality and accuracy of the plots. A larger deviation of the BAF value from 0.5 corresponds to a larger degree of mosaicism, i.e., a higher proportion of cells with a specific aberration. We used t tests (in cases with two factor levels) or one-way ANOVAs (in cases with >2 factor levels) to test for significance of such differences.

Figure 4. Longitudinal Analysis of ULSAM-340, a Single-Born Subject Containing a 13.8 Mb Deletion on 20q, as Detected by LRR and BAF with the Illumina SNP Array
The size and position of this deletion are typical of MDS patients. This subject, however, has not been diagnosed with MDS. When the subject was 71 years old, the deletion was only carried by a small proportion of blood cells and was barely detectable, and neither Nexus Copy Number software nor R-package MAD reported this aberration at this age (A, D, and E). R-package MAD suggested that 50.7% of the nucleated cells had the deletion when ULSAM-340 was 75 years old (B, F, and G) and that when he was 88 years old, the corresponding proportion of cells was 36.1% (C, H, and I). qPCR validation experiments showed that the sample taken when the subject was 88 years old contained 14.5% fewer copies of DNA in the segment as compared to the sample taken when he was 75 years old (Figure 5). The deviations from 0.5 of the BAF values within the deleted region in the three different sampling stages are illustrated in (J).

For the model illustrated in Figures 1K and S8G, we used the Tukey
  • 107. post-hoc test for multiple comparisons to compute differences between factor-level means after adjusting p values for multiple testing.

Quantification of the Number of Cells Affected by Megabase-Range Aberrations

We calculated the approximate percentage of cells affected by aberrations in the megabase range by using data from qPCR experiments (the data are described in Figure 5). The qPCR measurements provided the approximate number of DNA molecules that are affected by an aberration. Assuming that an aberration affects only one chromosome (i.e., that it is a heterozygous event) in a diploid genome, we converted this number to the approximate number of affected cells. Our assumption is reasonable, given that we are studying normal cells and that the size of these large-scale aberrations renders them unlikely to affect both chromosomes (i.e., they are unlikely to be homozygous [biallelic] events). For example, the relative number of DNA copies in nucleated blood cells of twin TP25-2 at the age of 77 years confirmed the array data. To determine these numbers, we used two primer pairs (41.1 and 42.1) designed within the deleted region and took five independent measurements for both primer pairs. These experiments suggested that, at the age of 77, twin TP25-2 had 30.8% (when primer pair 41.1 was used) and 35.4% (when primer pair 42.1 was used), an average of 33.1%, fewer DNA copies of the segment within the 32.5 Mb 5q deletion than did her co-twin at the same age (Figure 5). If one assumes that this deletion affects one chromosome in a diploid cell, our calculations suggest that 66.2% of cells contain this deletion (2 × 33.1%). In order to quantify the level of mosaicism, we also applied an alternative, published method19,30 based on calculations of the deviation of BAF values from the expected value of 0.5 for the heterozygous SNPs in a normal state. This method has been tailored for data derived from the Illumina SNP platform.
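The two quantification routes in this subsection can be sketched in Python. The qPCR route doubles the percentage of missing DNA copies, as in the TP25-2 example. The BAF route evaluates the formula given in this section, proportion = 2·Bdev / (0.5 + Bdev). Taking Bdev as the mean absolute deviation of heterozygous BAF values (0.2 to 0.8 window, as used for the boxplots) is an assumption made here for illustration; the MAD package computes its own estimate. All function names are hypothetical.

```python
from statistics import mean
from typing import Iterable


def cells_from_copy_loss(fewer_copies_pct: float) -> float:
    """Percentage of cells carrying a heterozygous deletion.

    Each affected diploid cell loses one of two copies, so the
    percentage of affected cells is twice the percentage of lost copies.
    """
    return 2.0 * fewer_copies_pct


def b_deviation(baf_values: Iterable[float], low: float = 0.2, high: float = 0.8) -> float:
    """Mean |0.5 - BAF| over heterozygous SNPs (assumed definition of Bdev)."""
    het = [b for b in baf_values if low < b < high]
    return mean(abs(0.5 - b) for b in het)


def cells_from_bdev(bdev: float) -> float:
    """Proportion of cells with the aberration: 2*Bdev / (0.5 + Bdev).

    For a heterozygous deletion in a fraction p of cells, the retained
    allele has expected BAF 1 / (2 - p); solving for p gives this formula.
    """
    return 2.0 * bdev / (0.5 + bdev)


# qPCR route, using the two primer-pair results reported for twin TP25-2.
average_loss = (30.8 + 35.4) / 2
print(f"{cells_from_copy_loss(average_loss):.1f}% of cells")

# BAF route on hypothetical BAF values; 0.0 and 1.0 are homozygous and skipped.
toy_baf = [0.60, 0.40, 0.00, 1.00, 0.65]
print(f"{100 * cells_from_bdev(b_deviation(toy_baf)):.1f}% of cells")
```

The two routes measure different things (DNA copies versus allelic imbalance), which is one reason the qPCR and MAD estimates for the same sample need not agree exactly.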
The R-package MAD (Mosaic Alteration Detection) version 0.5-9 (ref. 30) identifies the aberrant regions, such as deletions, gains, and CNNLOHs, and calculates the B deviation (Bdev, the deviation from the expected BAF value of 0.5 for heterozygous SNPs), which is then used for calculation of the proportion of cells affected by the aberration. We used the following modified version of the published19 formula for deletions, gains, and CNNLOHs:

$$\text{Proportion of cells with aberration} = \frac{2B_{\mathrm{dev}}}{0.5 + B_{\mathrm{dev}}}$$

Results

Age-Related Accumulation of Megabase-Range Structural Variants

Our analysis of 159 MZ pairs involved genotyping with Illumina 600K SNP arrays, confirmation of monozygosity (>99.9% genotype concordance), CNV calling with Nexus Copy Number software (BioDiscovery, CA, USA), and inspection of genomic profiles. Validation was performed with a different Illumina array, a Nimblegen array, and qPCR. Comparison of MZ twin pairs, including 19 previously reported pairs,21 identified five large de novo aberrations of >1 Mb among 81 young or middle-aged (≤55 years) and 78 elderly (≥60 years) pairs studied (Figures 1, 3, 5, and S5). All five large rearrangements occurred in the older twins, suggesting a relationship between age and the presence of changes. Tables S1 and S2 show a description of subjects, cohorts, and statistical support for the use of Illumina data for the detection of variants. We expanded on the results from twins by using two age-stratified groups of single-born subjects. First, we genotyped DNA from 108 men, all 88 years old, from the ULSAM (Uppsala Longitudinal Study of Adult Men) cohort by using the Illumina-1M-Duo array. We found that four subjects had large-scale rearrangements at the age of 88 years, and the somatic nature of such rearrangements was established by examination of samples taken from the same individuals at other time points (Figures 1, 4, 5, and S8–S10 and Table S1).
Second, for the young or middle-aged single-born control cohort (33–55 years), we used existing Illumina 550K data from 180 controls from the ADVANCE (Atherosclerotic Disease, Vascular Function, and Genetic Epidemiology) study.31,32 Analogous analysis of ADVANCE subjects did not reveal any cases of large-scale aberrations. The genotyping quality of 550K experiments is at least as good as the quality of 1M-Duo arrays, and the resolution of the 550K array is sufficient for detection of the ~1 Mb aberrations that have been uncovered in the ULSAM cohort (Figures S11 and S12 and Table S6). In fact, we described a 1.6 Mb deletion by using the 300K array in twin D8,21 and literature comparing arrays suggests that the 250K level is sufficient for uncovering submegabase-range changes.28,33 By studying the twins and the single-born individuals and by analyzing the two groups together, we obtained firm statistical support for age-related accumulation of large structural variants (with Fisher's exact test; p value = 0.00052) (Table S2). Overall, 3.4% of the studied population ≥60 years old carries cells containing megabase-range somatic aberrations that are readily detectable by array-based scanning, whereas none of the younger controls carried aberrations in this size range. The sensitivity of our analysis to detect aberrant clones is about 5% of nucleated blood cells.24,25 A previous estimate of 1.7% for somatic mosaicism was performed in an analysis that was not stratified by age.19 Five subjects harboring large CNVs (twin TP25-2 and ULSAM-102, -298, -340, and -697) were followed in repeated samplings collected up to 19 years apart. They all showed accumulation of aberrant cells with a variation in the rate of this process.
Twin TP25-2 is an example of slow accumulation of a 5q-deletion clone (Figure 1); when this twin was 77 years old, two independent methods (qPCR and MAD-program-based) suggested that 66.2% and 50.5% of cells, respectively, contained a deletion on one copy of chromosome 5. The change in deviation of BAF within the deleted region between the ages of 70 and 77 translates into a 2.2% increase in cells with the 5q deletion. The latter estimation was based on analysis with the MAD program. It is noteworthy that the size and position of this 5q deletion are typical of myelodysplastic syndrome (MDS).34–38 However, twin TP25-2 has not been diagnosed with this disease.
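The Fisher's exact test reported earlier in this section (p = 0.00052) can be approximated with a short, standard-library Python sketch. The 2 × 2 counts below are an assumption inferred from the cohort sizes in the text, counting individuals: 9 carriers (5 twins plus 4 ULSAM men) among 264 subjects aged ≥60 (78 twin pairs plus 108 ULSAM men) and 0 carriers among 342 younger subjects (81 twin pairs plus 180 ADVANCE controls). The authors' actual table is in Table S2, which is not reproduced here.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables (with the same
    margins) that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x: int) -> float:
        # Probability that x of the col1 carriers fall in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )


# Assumed counts (see lead-in): carriers / non-carriers, old versus young.
old_carriers, old_total = 9, 264
young_carriers, young_total = 0, 342

p_value = fisher_exact_two_sided(
    old_carriers,
    old_total - old_carriers,
    young_carriers,
    young_total - young_carriers,
)
carrier_pct = 100 * old_carriers / old_total
print(f"carriers aged >=60: {carrier_pct:.1f}%, Fisher p = {p_value:.5f}")
```

Under these assumed counts the sketch gives a carrier frequency and p value consistent with the 3.4% and 0.00052 quoted in the text, which suggests (but does not prove) that the analysis was done per individual.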
  • 108. [Figure 5 bar charts. y axis: relative amount of DNA molecules (%), for the control region UCE3 and the test loci.
(A) MZ pair TP25-1/2 at the age of 77, chr. 5 locus 41.1, n = 5: ~30.8% fewer DNA copies in test locus in twin TP25-2, p = 0.0149. MZ pair TP25-1/2 at the age of 77, chr. 5 locus 42.1, n = 5: ~35.4% fewer DNA copies in test locus in twin TP25-2, p < 0.001. MZ pair TP30-1/2 at the age of 69, chr. 20 locus 45.1, n = 5: ~19.8% fewer DNA copies in test locus in twin TP30-1, p < 0.001. ULSAM-340 at the age of 75 and 88, chr. 20 locus 45.1, n = 6: ~14.5% fewer DNA copies in test locus at the age of 88, p < 0.001. ULSAM-102 at the age of 88 versus f-gDNA, chr. 1 locus rs540796 (n = 5) and chr. 8 locus rs9298462 (n = 5): ~49.1% and ~34.7% more DNA copies in test locus in ULSAM-102 compared to reference DNA, p < 0.001 and p = 0.0015, in the order printed.
(B) Test loci: MZ pair TP31-1/2 at the age of 69, SNP rs6928830, n = 8; MZ pair TP19-1/2 at the age of 75, SNP rs329312, n = 9; MZ pair TP63-1/2 at the age of 76, SNP rs4635020, n = 6; MZ pair TP16-1/2 at the age of 77, SNP rs4841318, n = 7; MZ pair TP63-1/2 at the age of 76, SNP rs708039, n = 11. Fewer DNA copies in test locus, in the order printed: ~8.9% (p = 0.0449), ~14.2% (p < 0.0001), ~7.8% (p = 0.0057), ~5.9% (p = 0.0458), ~5.7% (p = 0.0101).]

Figure 5. Validation of de novo CNVs by qPCR with SYBR Green
Eleven independent qPCR experiments, each composed of multiple (5–11) independent measurements, are shown. The relative number of DNA copies in both test loci (white bars) and the control region UCE3 (gray bars) were plotted. Before we plotted and performed statistical analyses with t tests, we normalized all Ct values by using the control region UCE6. Figure S7 shows the determination of primer efficiency for each of the primer pairs.
(A and B) Validations for five large-scale (A) and five small-scale (B) aberrations. The dotted line drawn at 100% represents the copy-number state in control DNA (i.e., that from the normal MZ co-twin, or human female control DNA, or DNA from the same subject sampled at another age), and error bars indicate standard error of means.
(A) The 5q deletion in twin TP25-2 (Figure 1) was validated with two primer pairs (41.1 and 42.1) designed within the deleted region. In total, ten independent qPCR experiments showed that ~66.2% of all nucleated blood cells in TP25-2 had the 5q deletion (i.e., an average of 33.1% [30.8% with primer pair 41.1 and 35.4% with primer pair 42.1] fewer copies of the DNA segment). Similarly, the 20q deletion in twin TP30-1 (Figure 3) was validated with primer pair 45.1 in five experiments. The 19.8% fewer DNA copies found in the test locus indicates that 39.6% of the nucleated blood cells had the deletion. For ULSAM-340, the array data indicated a longitudinal somatic change in the number of cells carrying the 20q deletion. Six independent qPCR experiments comparing DNA sampled when ULSAM-340 was 75
  • 109. ULSAM-102 is another example of slow accumulation and contains gains on 1p and 8q (Figure S9). The 1p gain is stable, whereas the 8q gain shows a statistically significant (ANOVA: p value <0.05) increase over a period of 10 years. Consequently, ULSAM-102 probably carries two coexisting clones with different aberrations. In ULSAM-340 and -697, the rate of accumulation was faster, and there was a decrease in the proportion of cells with aberrations at later samplings. ULSAM-340 contains a 20q deletion, which was barely detectable at the age of 71 (Figure 4). The proportion of cells containing the 20q deletion was estimated by analysis with the MAD program to be 50.7% when ULSAM-340 was 75 years old and 36.1% when he was 88 years old. ULSAM-340 is another example of an aberration typical of MDS in a subject without this diagnosis. However, his clinical history includes thrombocytopenia, which is normally a part of MDS clinical features. We therefore speculate that this symptom might be due to clonal expansion of cells with a 20q deletion and suppression of normal thrombocyte production. Finally, ULSAM-697 was analyzed four times and shows the most pronounced increase and decrease in the number of cells with CNNLOH of 4q (Figures 1 and S8). This aberration was not detectable at the age of 71, reached 58.4% at the age of 88, and decreased radically to 29.9% of cells at the age of 90. When ULSAM-697 was 90 years old, we profiled sorted CD4+ cells, CD19+ cells, granulocytes, and fibroblasts, in addition to whole-blood DNA. CD4+ cells, granulocytes, and whole blood showed similar levels of aberrant cells, whereas CD19+ cells and fibroblasts appeared normal. We performed all experiments on samples taken when ULSAM-697 was 90 years old in duplicate with different types of arrays. Thus, in ULSAM-697, both lymphoid and myeloid cells were affected, except for, quite surprisingly, CD19+ B cells.
Overall, the analyses performed on ULSAM-340 and ULSAM-697 suggest that the cells with aberrations have a higher proliferative potential than do other cells in the immune system, but they are not immortalized because they apparently disappear from circulation.

Small-Scale Structural Aberrations Also Display Positive Correlation with Age

Given the above results, we tested whether smaller structural variants would also accumulate with age, and we used deviations in BAF as the main detection tool because they can detect mosaicism in as few as 5%–7% of cells24,25 and allow detection of deletions and duplications as well as CNNLOH. We performed scans for deviations in BAF values alone and BAF together with LRR in twins by using a new R-script (Figure S1) that identifies CNV calls for each MZ pair at various thresholds of ΔBAF and ΔLRR. Early analyses showed that the algorithm was sensitive to the quality of genotyping because calls were preferentially observed in co-twins with lower data quality. We therefore applied strict inclusion criteria by using the NQ score, which is based on genome-wide noise measurements. This resulted in the selection of 87 out of 159 MZ pairs (Table S3). We found that small putative CNVs increased with age (Figure 2A, linear regression F(1,85) = 54.00, p < 0.001, Figures S2–S4). We further narrowed the number of calls by combining the ΔBAF filter with a ΔLRR value of >0.35 in each MZ twin pair, and this process also indicated that these CNVs accumulate with age (Figure 2B; F(1,85) = 34.60, p < 0.001). We also tested whether genotyping quality (the ΔNQ value is the absolute value of the difference in quality score within pairs) might explain the observed pattern. Importantly, there was no relationship between ΔNQ and age (F(1,85) = 1.85, p > 0.05), suggesting that the positive correlation with age reflects true aberrations. Figure 2B displays a total of 827 CNV calls at 378 loci in 87 pairs with an age span of 3–86 years.
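As a quick consistency check, the F statistics quoted above can be related to the correlation coefficients shown in Figure 2 (0.62 and 0.54). The Python sketch below assumes those coefficients are Pearson's r from the same simple linear regressions (one predictor, 87 pairs, 85 residual degrees of freedom), in which case r² = F / (F + df). The function name is hypothetical.

```python
from math import sqrt


def r_from_f(f_stat: float, df_residual: int) -> float:
    """Absolute correlation coefficient implied by a simple-regression F test.

    For one predictor, F = r^2 * df / (1 - r^2), so r^2 = F / (F + df).
    """
    return sqrt(f_stat / (f_stat + df_residual))


# F(1,85) values reported for the dBAF-only and dBAF-plus-dLRR analyses.
r_baf_only = r_from_f(54.00, 85)
r_baf_and_lrr = r_from_f(34.60, 85)
print(round(r_baf_only, 2), round(r_baf_and_lrr, 2))
```

Both values round to the coefficients printed in the figure, so the regression statistics in the text and the figure agree with each other under this assumption.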
Plotting of the 378 calls against the genome shows the nonrandom distribution and recurrent nature of these CNV calls (Figure S13). On the basis of frequency and/or location in the vicinity of known genes, we selected 138 loci for validation by using a tiling-path array (Nimblegen 135K) in 34 twin pairs. With this platform, 15% of putative CNVs were validated in the same twin pairs in which they were first detected by ΔBAF and ΔLRR analysis. There was no bias in the success rate of validation between younger and older groups (t test: t = 0.7062, p value = 0.4819). In total, 52 of the 138 loci (38%) included on the 135K array showed CNVs within 32 of the 34 MZ pairs tested (Figures 2G, 2H, and S6), and the majority of CNVs encompassed <1 kb. The discrepancy (i.e., 15% versus 38%) in the validation success rates mentioned above is probably due, at least in part, to the high stringency of the ΔBAF and ΔLRR analysis that only reported a subset of preferentially strong calls representing structural variants and the recurrent nature of loci that are affected by the small-scale variation. Hence, some true structural variants were validated in (often multiple) MZ pairs on the 135K array, even though the initial ΔBAF and ΔLRR analysis did not pick them up because the filtering parameters were too stringent. We selected 5 of these 52 loci for further validation with qPCR, and all five were confirmed by this alternative approach (Figure 5). We also performed breakpoint-PCR validation in 17 out of the above 52 loci by using PCR across the deleted region in instances that
Figure 5 (legend fragment): and 88 years old showed that the subject had 14.5% fewer copies of the DNA segment when he was 88 years old. In ULSAM-102, the Illumina array identified a duplication event on both chromosomes 1 and 8 (Figure S9).
Given that the proportion of cells with a gained segment in this subject was relatively stable over time, we used human female genomic DNA as control DNA in these experiments. The qPCR experiments validated both somatic CNVs. (B) qPCR validation of five loci with small-scale de novo CNVs within MZ twins. These loci were identified by Illumina array genotyping and were confirmed on the Nimblegen 135K array (see also Figures 2G, 2H, and S6). The layout of this panel is similar to that of (A), described above. For example, the first locus (rs6928830) illustrates de novo CNVs in twin TP31-1 (Figure 2H). The American Journal of Human Genetics 90, 217–228, February 10, 2012 225
  • 110. were presumed to represent the shortest deletions based on the Illumina and Nimblegen 135K array data. However, these attempts were not successful. We obtained correctly sized PCR bands representing wild-type alleles for tested loci. However, we could not detect any shorter, mutated alleles that were mapped to the correct genomic regions. These validation experiments included gel purification of PCR fragments, PCR-fragment analysis, subcloning in plasmids, and Sanger sequencing (details not shown). These results suggest that the vast majority of the uncovered small structural variants are due to more complex rearrangements involving deletions or gains embedded together with other structural changes. These results are in agreement with a recent sequencing-based validation analysis of CNV loci; the analysis showed that as few as 5% of CNVs suspected to represent gains or deletions are in fact “pure blunt-end breakpoints.”39 Details for the 52 validated loci are shown in Table S4, which includes information about genes affected by the variation. The results presented in Table S4 and Figure S13 emphasize the recurrent nature of the 52 validated loci. For example, out of the 52 loci, 13 only occurred once in any of the 34 tested twin pairs, whereas the remaining 39 were recurrent and occurred 2–16 times in the same set of MZ twin pairs. The number of CNVs per pair validated with the 135K Nimblegen array ranged from 1 to 32 (median 6) (Table S7). In summary, the deviation between MZ co-twins ranged from 0 to 51,040 bp (median 4,995 bp), and the latter corresponds to ~0.0000016% genome-wide divergence. By using the small-scale CNV pipeline, we analyzed 18 pairs of MZ twins that were sampled twice, 10 years apart (Figures 2E, 2F, and S1 and Table S8). Analyses were performed in two ways: as an interindividual comparison of one twin to its co-twin at the first and second sampling and as an intraindividual comparison of the two samplings of a single twin.
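The genome-wide divergence figure quoted above can be checked with simple arithmetic. The sketch below assumes a haploid genome length of about 3.1 × 10^9 bp, a value not stated in the text and chosen here only for illustration. Under that assumption the median of 4,995 bp works out to roughly 1.6 × 10^-6 of the genome when expressed as a fraction, so the quoted number appears to match the fraction rather than a percentage.

```python
# Assumed haploid human genome length in base pairs (not given in the text).
ASSUMED_GENOME_BP = 3.1e9


def divergence(bp_differing, genome_bp=ASSUMED_GENOME_BP):
    """Return divergence between co-twins as (fraction, percent) of the genome."""
    fraction = bp_differing / genome_bp
    return fraction, fraction * 100.0


median_fraction, median_percent = divergence(4_995)   # median deviation
max_fraction, max_percent = divergence(51_040)        # largest deviation

print(f"median:  fraction={median_fraction:.2e}, percent={median_percent:.2e}")
print(f"maximum: fraction={max_fraction:.2e}, percent={max_percent:.2e}")
```

A different assumed genome length shifts the result proportionally but does not change its order of magnitude.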
Both types of comparisons suggest variation in the dynamics of changes between co-twins and show both increases and decreases over a period of 10 years in the number of calls in different twin pairs. Interestingly, this evidence for the dynamics of small-scale CNVs over time (Figure 2E) is consistent with the results from longitudinal analyses of large-scale aberrations in ULSAM-697 and ULSAM-340 (Figures 1 and 4), suggesting both increases and decreases over time in the number of cells containing different variants.
Discussion
The phenotypic consequences of accumulating aberrations are an interesting aspect of our results. In two subjects diagnosed with chronic lymphocytic leukemia (CLL), we detected multiple changes consistent with the disease (Figure S10). These findings are not unexpected: Our population-based cohort was not preselected against any diagnoses, and CLL is the most prevalent leukemia among the elderly.40 However, it is surprising that apparently healthy subjects have aberrations characteristic of MDS. A typical 5q deletion (observed in one subject) and a 20q deletion (observed in two subjects) are among the most common aberrations in patients diagnosed with MDS.34–38 Trisomy 8 is also a recurrent aberration in MDS, and ULSAM-102 displays a restricted 8q gain; it remains unclear whether this gain is related to MDS. None of the above-mentioned individuals were diagnosed with MDS, and their cases might represent an indolent, subclinical form of MDS. In two individuals followed in longitudinal sampling (i.e., ULSAM-340 and -697), we observed not only an increase but also a clear subsequent decrease in the proportion of nucleated blood cells with aberrations (Figures 1, 4, and S8). These results suggest an “autocorrection” of the immune system, given that the aberrant clones are apparently disappearing from circulation.
Similar expansions of preleukemic clones containing gene fusions specific to acute leukemia have been described in newborns;41 the gene fusions TEL-AML1 and AML1-ETO were present in cord blood at a frequency 100× greater than the frequency that is associated with the risk of developing the corresponding leukemia. The presented data are probably only part of all the somatic changes that actually occurred in the studied cohorts because balanced inversions and translocations escape our detection and because we interrogated a fraction of all the nucleotides in the genome. Furthermore, we only detected high-frequency aberrations, presumably because these aberrations provided the affected cells with a proliferative advantage, which led to clonal expansion above the detection limit of ~5% of cells. It follows from this reasoning that deleterious aberrations leading to proliferative disadvantage or aberrations that are neutral from the point of view of the proliferative potential go undetected. Nevertheless, the chromosomal regions (e.g., those that contain the 20q deletion) and loci affected in a recurrent fashion (Figure S13 and Table S4) are candidates for containing common and redundant age-related defects in human blood cells. These mutations are presumed to provide the affected cells with a mild proliferative advantage without transforming the affected cells into immortalized cancer clones. However, the proliferative advantage for a limited number of cells will most likely affect the overall complexity of cell clones present in blood and should therefore be discussed in the context of immunosenescence, which, in fact, involves loss of complexity of cell clones in both B and T cell lineages.42,43 Our results might therefore help to explain the cause of age-related reduction in the number of cell clones in the blood.
This reduction could lead to a less diverse immune system caused by the accumulation of genetic changes that induce the expansion of a limited number of clones. We also anticipate that extension of our work will allow determination of the genetic age of different somatic cell lineages and estimation of possible individual differences between genetic and chronological age.
  • 111. Supplemental Data Supplemental Data include 13 figures and eight tables and can be found with this article online at http://www.cell.com/AJHG. Acknowledgments We thank Lars Feuk, Brigitte Schlegelberger, Jacek Witkowski, Greg Cooper, Richard Rosenquist Brandell, Eva Hellström-Lindberg, Chris Gunther, and Eva Tiensuu Janson for critical review of the manuscript and Larry Mansouri and Juan R. Gonzalez for methodological advice. This study was sponsored by grants from the Ellison Medical Foundation (J.P.D. and D.A.) and from the Swedish Cancer Society, the Swedish Research Council, and the Science for Life Laboratory-Uppsala (J.P.D.). A.P. acknowledges FOCUS 4/2008 and FOCUS 4/08/2009 grants from the Foundation for Polish Science. Genotyping was performed in part by the SNP&SEQ Technology Platform, which is supported by Uppsala University, Uppsala University Hospital, the Science for Life Laboratory-Uppsala, and the Swedish Research Council (contracts 80576801 and 70374401). Received: November 10, 2011 Revised: December 6, 2011 Accepted: December 14, 2011 Published online: February 2, 2012 Web Resources The URLs for data presented herein are as follows: GenePipe PrimerZ, http://genepipe.ngc.sinica.edu.tw/primerz/ Illumina Beadchip information, http://www.illumina.com/documents/products/appnotes/appnote_cytogenetics.pdf R 2.12–2.13 software, http://www.r-project.org/ Roche-Nimblegen array CGH Protocols, http://www.nimblegen.com/ R-package MAD version 0.5–9, http://www.creal.cat/jrgonzalez/software.htm Surveillance Epidemiology and End Results (SEER) Program Fast Stats, http://seer.cancer.gov/faststats/ The Gene Ontology, http://www.geneontology.org/ The Genetic Association Database, http://geneticassociationdb.nih.gov/ The HUGO Gene Nomenclature Committee, http://www.genenames.org/
University of California Santa Cruz Human Genome Browser, http://genome.cse.ucsc.edu/cgi-bin/hgGateway Accession Numbers The array data for large-scale CNVs reported in this paper have been submitted to the Database of Genomic Structural Variation (dbVAR) under the accession number nstd58. References 1. Conrad, D.F., Pinto, D., Redon, R., Feuk, L., Gokcumen, O., Zhang, Y., Aerts, J., Andrews, T.D., Barnes, C., Campbell, P., et al; Wellcome Trust Case Control Consortium. (2010). Origins and functional impact of copy number variation in the human genome. Nature 464, 704–712. 2. Itsara, A., Cooper, G.M., Baker, C., Girirajan, S., Li, J., Absher, D., Krauss, R.M., Myers, R.M., Ridker, P.M., Chasman, D.I., et al. (2009). Population analysis of large copy number variants and hotspots of human genetic disease. Am. J. Hum. Genet. 84, 148–161. 3. van Ommen, G.J. (2005). Frequency of new copy number variation in humans. Nat. Genet. 37, 333–334. 4. Lupski, J.R. (2007). Genomic rearrangements and sporadic disease. Nat. Genet. 39 (7 Suppl), S43–S47. 5. Itsara, A., Wu, H., Smith, J.D., Nickerson, D.A., Romieu, I., London, S.J., and Eichler, E.E. (2010). De novo rates and selection of large copy number variation. Genome Res. 20, 1469–1481. 6. Harley, C.B., Futcher, A.B., and Greider, C.W. (1990). Telomeres shorten during ageing of human fibroblasts. Nature 345, 458–460. 7. Vaziri, H., Schächter, F., Uchida, I., Wei, L., Zhu, X., Effros, R., Cohen, D., and Harley, C.B. (1993). Loss of telomeric DNA during aging of normal and trisomy 21 human lymphocytes. Am. J. Hum. Genet. 52, 661–667. 8. Lee, H.C., Pang, C.Y., Hsu, H.S., and Wei, Y.H. (1994). Differential accumulations of 4,977 bp deletion in mitochondrial DNA of various tissues in human ageing. Biochim. Biophys. Acta 1226, 37–43. 9. Fraga, M.F., Ballestar, E., Paz, M.F., Ropero, S., Setien, F., Ballestar, M.L., Heine-Suñer, D., Cigudosa, J.C., Urioste, M., Benitez, J., et al. (2005).
Epigenetic differences arise during the lifetime of monozygotic twins. Proc. Natl. Acad. Sci. USA 102, 10604–10609. 10. Mohamed, S.A., Hanke, T., Erasmi, A.W., Bechtel, M.J., Scharfschwerdt, M., Meissner, C., Sievers, H.H., and Gosslau, A. (2006). Mitochondrial DNA deletions and the aging heart. Exp. Gerontol. 41, 508–517. 11. Flores, M., Morales, L., Gonzaga-Jauregui, C., Domínguez-Vidaña, R., Zepeda, C., Yañez, O., Gutiérrez, M., Lemus, T., Valle, D., Avila, M.C., et al. (2007). Recurrent DNA inversion rearrangements in the human genome. Proc. Natl. Acad. Sci. USA 104, 6099–6106. 12. Sloter, E.D., Marchetti, F., Eskenazi, B., Weldon, R.H., Nath, J., Cabreros, D., and Wyrobek, A.J. (2007). Frequency of human sperm carrying structural aberrations of chromosome 1 increases with advancing age. Fertil. Steril. 87, 1077–1086. 13. Frank, S.A. (2010). Evolution in health and medicine Sackler colloquium: Somatic evolutionary genomics: Mutations during development cause highly variable genetic mosaicism with risk of cancer and neurodegeneration. Proc. Natl. Acad. Sci. USA 107 (Suppl 1), 1725–1730. 14. Lynch, M. (2010). Evolution of the mutation rate. Trends Genet. 26, 345–352. 15. Youssoufian, H., and Pyeritz, R.E. (2002). Mechanisms and consequences of somatic mosaicism in humans. Nat. Rev. Genet. 3, 748–758. 16. Erickson, R.P. (2010). Somatic gene mutation and human disease other than cancer: An update. Mutat. Res. 705, 96–106. 17. De, S. (2011). Somatic mosaicism in healthy human tissues. Trends Genet. 27, 217–223. 18. Dumanski, J.P., and Piotrowski, A. (2012). Structural genetic variation in the context of somatic mosaicism. In Genomic Structural Variation, L. Feuk, ed. (New York: Humana Press). 19. Rodríguez-Santiago, B., Malats, N., Rothman, N., Armengol, L., Garcia-Closas, M., Kogevinas, M., Villa, O., Hutchinson, A., Earl, J., Marenne, G., et al. (2010). Mosaic uniparental
  • 112. disomies and aneuploidies as large structural variants of the human genome. Am. J. Hum. Genet. 87, 129–138. 20. Piotrowski, A., Bruder, C.E., Andersson, R., Diaz de Ståhl, T., Menzel, U., Sandgren, J., Poplawski, A., von Tell, D., Crasto, C., Bogdan, A., et al. (2008). Somatic mosaicism for copy number variation in differentiated human tissues. Hum. Mutat. 29, 1118–1124. 21. Bruder, C.E., Piotrowski, A., Gijsbers, A.A., Andersson, R., Erickson, S., Diaz de Ståhl, T., Menzel, U., Sandgren, J., von Tell, D., Poplawski, A., et al. (2008). Phenotypically concordant and discordant monozygotic twins display different DNA copy-number-variation profiles. Am. J. Hum. Genet. 82, 763–771. 22. Steemers, F.J., Chang, W., Lee, G., Barker, D.L., Shen, R., and Gunderson, K.L. (2006). Whole-genome genotyping with the single-base extension assay. Nat. Methods 3, 31–33. 23. Olshen, A.B., Venkatraman, E.S., Lucito, R., and Wigler, M. (2004). Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5, 557–572. 24. Conlin, L.K., Thiel, B.D., Bonnemann, C.G., Medne, L., Ernst, L.M., Zackai, E.H., Deardorff, M.A., Krantz, I.D., Hakonarson, H., and Spinner, N.B. (2010). Mechanisms of mosaicism, chimerism and uniparental disomy identified by single nucleotide polymorphism array analysis. Hum. Mol. Genet. 19, 1263–1275. 25. Razzaghian, H.R., Shahi, M.H., Forsberg, L.A., de Ståhl, T.D., Absher, D., Dahl, N., Westerman, M.P., and Dumanski, J.P. (2010). Somatic mosaicism for chromosome X and Y aneuploidies in monozygotic twins heterozygous for sickle cell disease mutation. Am. J. Med. Genet. A. 152A, 2595–2598. 26. R_Development_Core_Team. (2010). R: A language and environment for statistical computing. In. (Vienna, Austria). URL: http://www.R-project.org/ 27. Workman, C., Jensen, L.J., Jarmer, H., Berka, R., Gautier, L., Nielser, H.B., Saxild, H.H., Nielsen, C., Brunak, S., and Knudsen, S. (2002).
A new non-linear normalization method for reducing variability in DNA microarray experiments. Genome Biol. 3, research0048. 28. Gunnarsson, R., Staaf, J., Jansson, M., Ottesen, A.M., Göransson, H., Liljedahl, U., Ralfkiaer, U., Mansouri, M., Buhl, A.M., Smedby, K.E., et al. (2008). Screening for copy-number alterations and loss of heterozygosity in chronic lymphocytic leukemia—a comparative study of four differently designed, high resolution microarray platforms. Genes Chromosomes Cancer 47, 697–711. 29. Gunnarsson, R., Isaksson, A., Mansouri, M., Göransson, H., Jansson, M., Cahill, N., Rasmussen, M., Staaf, J., Lundin, J., Norin, S., et al. (2010). Large but not small copy-number alterations correlate to high-risk genomic aberrations and survival in chronic lymphocytic leukemia: A high-resolution genomic screening of newly diagnosed patients. Leukemia 24, 211–215. 30. González, J.R., Rodríguez-Santiago, B., Cáceres, A., Pique-Regi, R., Rothman, N., Chanock, S.J., Armengol, L., and Pérez-Jurado, L.A. (2011). A fast and accurate method to detect allelic genomic imbalances underlying mosaic rearrangements using SNP array data. BMC Bioinformatics 12, 166. 31. Schunkert, H., König, I.R., Kathiresan, S., Reilly, M.P., Assimes, T.L., Holm, H., Preuss, M., Stewart, A.F., Barbalic, M., Gieger, C., et al; Cardiogenics; CARDIoGRAM Consortium. (2011). Large-scale association analysis identifies 13 new susceptibility loci for coronary artery disease. Nat. Genet. 43, 333–338. 32. Assimes, T.L., Knowles, J.W., Basu, A., Iribarren, C., Southwick, A., Tang, H., Absher, D., Li, J., Fair, J.M., Rubin, G.D., et al. (2008). Susceptibility locus for clinical and subclinical coronary artery disease at chromosome 9p21 in the multi-ethnic ADVANCE study. Hum. Mol. Genet. 17, 2320–2328. 33. Hagenkord, J.M., Monzon, F.A., Kash, S.F., Lilleberg, S., Xie, Q., and Kant, J.A. (2010).
Array-based karyotyping for prognostic assessment in chronic lymphocytic leukemia: Performance comparison of Affymetrix 10K2.0, 250K Nsp, and SNP6.0 arrays. J. Mol. Diagn. 12, 184–196. 34. Bernasconi, P., Boni, M., Cavigliano, P.M., Calatroni, S., Giardini, I., Rocca, B., Zappatore, R., Dambruoso, I., and Caresana, M. (2006). Clinical relevance of cytogenetics in myelodysplastic syndromes. Ann. N Y Acad. Sci. 1089, 395–410. 35. Haase, D. (2008). Cytogenetic features in myelodysplastic syndromes. Ann. Hematol. 87, 515–526. 36. Tiu, R.V., Gondek, L.P., O’Keefe, C.L., Elson, P., Huh, J., Mohamedali, A., Kulasekararaj, A., Advani, A.S., Paquette, R., List, A.F., et al. (2011). Prognostic impact of SNP array karyotyping in myelodysplastic syndromes and related myeloid malignancies. Blood 117, 4552–4560. 37. Braun, T., de Botton, S., Taksin, A.L., Park, S., Beyne-Rauzy, O., Coiteux, V., Sapena, R., Lazareth, A., Leroux, G., Guenda, K., et al. (2011). Characteristics and outcome of myelodysplastic syndromes (MDS) with isolated 20q deletion: A report on 62 cases. Leuk. Res. 35, 863–867. 38. Bejar, R., Levine, R., and Ebert, B.L. (2011). Unraveling the molecular pathophysiology of myelodysplastic syndromes. J. Clin. Oncol. 29, 504–515. 39. Conrad, D.F., Bird, C., Blackburne, B., Lindsay, S., Mamanova, L., Lee, C., Turner, D.J., and Hurles, M.E. (2010). Mutation spectrum revealed by breakpoint sequencing of human germline CNVs. Nat. Genet. 42, 385–391. 40. Surveillance Epidemiology and End Results (SEER) Program. Fast stats. Bethesda, MD, National Cancer Institute, NIH, USA (2011) URL: http://seer.cancer.gov/faststats/ 41. Mori, H., Colman, S.M., Xiao, Z., Ford, A.M., Healy, L.E., Donaldson, C., Hows, J.M., Navarrete, C., and Greaves, M. (2002). Chromosome translocations and covert leukemic clones are generated during normal fetal development. Proc. Natl. Acad. Sci. USA 99, 8242–8247. 42.
Naylor, K., Li, G., Vallejo, A.N., Lee, W.W., Koetz, K., Bryl, E., Witkowski, J., Fulbright, J., Weyand, C.M., and Goronzy, J.J. (2005). The influence of age on T cell generation and TCR diversity. J. Immunol. 174, 7446–7452. 43. Gibson, K.L., Wu, Y.C., Barnett, Y., Duggan, O., Vaughan, R., Kondeatis, E., Nilsson, B.O., Wikby, A., Kipling, D., and Dunn-Walters, D.K. (2009). B-cell diversity decreases in old age and is correlated with poor health status. Aging Cell 8, 18–25.
  • 114. REPORT Rare Mutations in XRCC2 Increase the Risk of Breast Cancer D.J. Park,1,20 F. Lesueur,2,20 T. Nguyen-Dumont,1 M. Pertesi,2 F. Odefrey,1 F. Hammet,1 S.L. Neuhausen,3 E.M. John,4,5 I.L. Andrulis,6 M.B. Terry,7 M. Daly,8 S. Buys,9 F. Le Calvez-Kelm,2 A. Lonie,10 B.J. Pope,10 H. Tsimiklis,1 C. Voegele,2 F.M. Hilbers,11 N. Hoogerbrugge,12 A. Barroso,13 A. Osorio,13,14 the Breast Cancer Family Registry, the Kathleen Cuningham Foundation Consortium for Research into Familial Breast Cancer, G.G. Giles,15 P. Devilee,11,16 J. Benitez,13,14 J.L. Hopper,17 S.V. Tavtigian,18 D.E. Goldgar,19 and M.C. Southey1,* An exome-sequencing study of families with multiple breast-cancer-affected individuals identified two families with XRCC2 mutations, one with a protein-truncating mutation and one with a probably deleterious missense mutation. We performed a population-based case-control mutation-screening study that identified six probably pathogenic coding variants in 1,308 cases with early-onset breast cancer and no variants in 1,120 controls (the severity grading was p < 0.02). We also performed additional mutation screening in 689 multiple-case families. We identified ten breast-cancer-affected families with protein-truncating or probably deleterious rare missense variants in XRCC2. Our identification of XRCC2 as a breast cancer susceptibility gene thus increases the proportion of breast cancers that are associated with homologous recombination-DNA-repair dysfunction and Fanconi anemia and could therefore benefit from specific targeted treatments such as PARP (poly ADP ribose polymerase) inhibitors. This study demonstrates the power of massively parallel sequencing for discovering susceptibility genes for common, complex diseases.
Currently, only approximately 30% of the familial risk for breast cancer has been explained, leaving the substantial majority unaccounted for.1 Recently, exome sequencing has been demonstrated to be a powerful tool for identifying the underlying cause of rare Mendelian disorders. However, diseases such as breast cancer present substantially increased complexity in terms of locus, allelic and phenotypic heterogeneity, and relationships between genotype and phenotype. As part of a collaborative (Leiden University Medical Centre, the Spanish National Cancer Center, and The University of Melbourne) project involving the exome capture and massively parallel sequencing of multiple-case breast-cancer-affected families, we applied whole-exome sequencing to DNA from multiple affected relatives from 13 families (family structure and sample availability were considered before the affected relatives were chosen). Bioinformatic analysis of the resulting exome sequences identified a protein-truncating mutation, c.651_652del (p.Cys217*), in X-ray repair cross complementing gene-2 (XRCC2 [MIM 600375; NM_005431.1]) in the peripheral-blood DNA of a man participating in the Australian Breast Cancer Family Registry2 (ABCFR; Figure 1A); this man (III-4 in Figure 1A) had been diagnosed with breast cancer at 29 years of age, and his mother (II-3), sister (III-5), and cousin (III-1) had been diagnosed with breast cancer at 37, 41, and 34 years of age, respectively. The cousin (III-1), who had also been selected for exome sequencing, did not carry this mutation; the sister’s DNA was Sanger sequenced and was found to carry the mutation; and there was no DNA available for testing of the mother.
Exome sequencing of three individuals from a family participating in a Dutch research study of multiple-case breast-cancer-affected families identified a probably deleterious missense mutation (c.271C>T [p.Arg91Trp] in XRCC2) (Figure 2) in two sisters (II-6 and II-8 in Figure 1B) diagnosed with breast cancer at 40 and 48 years of age, respectively, but not in their cousin (II-1), who was diagnosed at 47 years of age. Genotyping of XRCC2 mutations c.651_652del (p.Cys217*) and c.271C>T (p.Arg91Trp) in 1,344 cases 1 Genetic Epidemiology Laboratory, The University of Melbourne, Victoria 3010, Australia; 2 Genetic Cancer Susceptibility Group, International Agency for Research on Cancer, 69372 Lyon, France; 3 Department of Population Sciences, Beckman Research Institute of City of Hope, Duarte, CA 91010, USA; 4 Cancer Prevention Institute of California, Fremont, CA 94538, USA; 5 Department of Health Research and Policy, Stanford Cancer Center Institute, Stanford, CA 94305, USA; 6 Department of Molecular Genetics, Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto, ON M5G 1X5, Canada; 7 Department of Epidemiology, Mailman School of Public Health, Columbia University, New York, NY 10032, USA; 8 Fox Chase Cancer Center, Philadelphia, PA 19111, USA; 9 Huntsman Cancer Institute, University of Utah Health Sciences Center, Salt Lake City, UT 84112, USA; 10 Victorian Life Sciences Computation Initiative, Carlton, Victoria 3010, Australia; 11 Department of Human Genetics, Leiden University Medical Center, Leiden, 2300 RC Leiden, The Netherlands; 12 Department of Human Genetics, Radboud University Nijmegen Medical Center, 6525 GA Nijmegen, The Netherlands; 13 Human Genetics Group, Human Cancer Genetics Program, Spanish National Cancer Center, 28029 Madrid, Spain; 14 Spanish Network on Rare Diseases, 46010 Valencia, Spain; 15 Centre for Cancer Epidemiology, The Cancer Council Victoria, Carlton, Victoria 3052, Australia; 16 Department of Pathology, Leiden
University Medical Center, Leiden, 2300 RC Leiden, The Netherlands; 17 Centre for Molecular, Environmental, Genetic, and Analytical Epidemiology, School of Population Health, The University of Melbourne, Victoria 3010, Australia; 18 Department of Oncological Sciences, Huntsman Cancer Institute, University of Utah School of Medicine, Salt Lake City, UT 84112, USA; 19 Department of Dermatology, University of Utah School of Medicine, Salt Lake City, UT 84132, USA 20 These authors contributed equally to this work *Correspondence: msouthey@unimelb.edu.au DOI 10.1016/j.ajhg.2012.02.027. ©2012 by The American Society of Human Genetics. All rights reserved. 734 The American Journal of Human Genetics 90, 734–739, April 6, 2012
  • 115. and 1,436 controls from the Melbourne Collaborative Cohort Study3 (MCCS) and the ABCFR revealed one control (II-2, Figure 1C) who carried c.651_652del (p.Cys217*). Intriguingly, this control individual’s sister (II-1) was diagnosed with breast cancer at 63 years of age, and her mother (I-2) was diagnosed with melanoma at 69 years of age (Figure 1C, Tables 1 and 2). XRCC2, a RAD51 paralog, was cloned because of its ability to complement the DNA-damage sensitivity of the irs1 hamster cell line.4 Cells derived from Xrcc2-knockout mice exhibit profound genetic instability as a result of homologous recombination (HR) deficiency.5 XRCC2 is highly conserved, and most truncations of the protein destroy its ability to protect cells from the effects of the DNA cross-linking agent mitomycin C.6 The involvement of the HR DNA repair genes BRCA1 (MIM 113705), BRCA2 (MIM 600185), ATM (MIM 607585), CHEK2 (MIM 604373), BRIP1 (MIM 605882), PALB2 (MIM 610355), and RAD51C (MIM 602774) in breast cancer risk emphasizes the importance of this mechanism in the etiology of breast cancer.7–9 Biallelic mutations in three of these genes are associated with Fanconi anemia (FA), and, most interestingly, Shamseldin et al.10 have recently reported a homozygous frameshift mutation in XRCC2 as being associated with a previously unrecognized form of FA. XRCC2 binds directly to the C-terminal portion of the product of the breast cancer susceptibility pathway gene RAD51 (MIM 179617), which is central to HR.6,11 XRCC2 also complexes in vivo with RAD51B (RAD51L1 [MIM 602948]), the product of the breast and ovarian cancer susceptibility gene RAD51C9 and the product of the ovarian cancer risk gene RAD51D (MIM 602954),12,13 and localizes to sites of DNA damage.6 Cells deficient in XRCC2 also show centrosome disruption, a key component of mitotic-apparatus dysfunction, which is often linked to the onset of mitotic catastrophe.
XRCC2 is important in preventing chromosome missegregation leading to aneuploidy.14 Studies of common genetic variation in XRCC2 have reported some evidence of association with breast cancer risk (e.g., rs3218408),15 subtle effects on DNA-repair capacity,16 and poor survival associated with rs3218536 (XRCC2, Arg188His).15 On the basis of the exome-sequencing results, the subsequent genotyping of the two probably pathogenic variants
Figure 1. Pedigrees of Families Found to Carry XRCC2 Mutations (panels A–J). Mutation status is indicated for all family members for whom a DNA sample was available. Cancer diagnosis and age of onset are indicated for affected members. Asterisks indicate that DNA underwent exome sequencing (libraries for 50 bp fragment reads were prepared according to the SOLiD Baylor protocol 2.1 and the Nimblegen exome-capture protocol v.1.2 with some variations). The following abbreviations are used: BC, breast cancer (black filled symbols); PC, pancreatic cancer; BwC, bowel cancer; UC, uterine cancer; MM, malignant melanoma; UK, unknown age; BlC, bladder cancer; OC, ovarian cancer; BCC, basal cell carcinoma; L, lung cancer; (all gray-filled symbols); V, verified cancer (via cancer registry or pathology report); and wt, wild-type. Some symbols represent more than one person as indicated by a numeral.
  • 116. in the MCCS and ABCFR, the rarity of these variants, and the biochemical plausibility of XRCC2, we conducted two further studies in parallel. The first study was case-control mutation screening of XRCC2 (with high-resolution melt [HRM] curve analysis followed by Sanger-sequencing confirmation) in an additional series of 1,308 cases with early-onset breast cancer and 1,120 frequency-matched controls recruited through population-based sampling by the Breast Cancer Family Registry2 (BCFR; Supplemental Data, available online); the BCFR sampling was recently carried out for the characterization of the breast cancer risk associated with variants in ATM and CHEK2.17,18 The second study was mutation screening of XRCC2 in a series of index cases from multiple-case breast-cancer-affected families and a series of male breast cancer cases. The case-control mutation screening identified two cases that carried protein-truncating variants in XRCC2: individual III-2 had c.49C>T (p.Arg17*) (Figure 1F), and individual II-1 had c.651_652del (p.Cys217*) (Figure 1G). Five cases carried singleton missense substitutions ranging from probably deleterious to relatively innocuous (according to in silico prediction). One control carried a relatively innocuous missense substitution (Table 2). In addition, a case diagnosed with breast cancer at 32 years of age carried a G>A substitution located one nucleotide prior to the start codon. We graded the rare missense variants by using three computational tools: SIFT, Polyphen2.1, and Align-GVGD. Differences in grading between these tools were minor.
Depending on which of the three computational tools we used to grade the missense substitutions, the statistical significances of the differences in the frequency and severity distributions of protein-truncating variants and rare missense substitutions between cases and controls from the case-control mutation-screening study fell in the range of p = 0.01–0.02 (adjusted for race, study center, and age). There were six probably deleterious variants (predicted deleterious by at least two prediction algorithms) in the cases and none in the controls, corresponding to a p value by Fisher’s exact test of 0.02. Altogether, the case-control mutation-screening data provide statistical support for the hypothesis that rare, evolutionarily unlikely sequence variation in XRCC2 is associated with increased risk of breast cancer. Mutation screening (by Sanger sequencing) of XRCC2 in the index cases of 689 multiple-case breast-cancer-affected families participating in the BCFR and the Kathleen Cuningham Foundation Consortium for Research into Familial Breast Cancer19 (kConFab) plus 150 male breast cancer cases participating in a US-based study of male breast cancer (Beckman Research Institute of the City of Hope20) and kConFab revealed three rare coding-sequence alterations. We identified a second family (from the kConFab resource) with an index case who carried XRCC2 c.651_652del (p.Cys217*); this individual (II-5, Figure 1D) also carried a truncating mutation in BRCA1 (c.70_80del [p.Cys24Serfs*13]). We identified an ABCFR index case (II-2, Figure 1E and Figure 2) who carried the previously identified missense substitution, XRCC2 c.271C>T (p.Arg91Trp). We also identified a male breast cancer case who carried a relatively innocuous missense substitution, c.283A>C (p.Ile95Leu).
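The Fisher's exact test quoted above (six probably deleterious variants among 1,308 cases versus none among 1,120 controls) can be checked with a short calculation. The sketch below is illustrative only: it assumes a one-sided test and unadjusted counts, neither of which the text specifies, and the function name `fisher_one_sided` is hypothetical. Under those assumptions the result is roughly 0.024, in line with the reported value of 0.02.

```python
from math import comb


def fisher_one_sided(case_carriers, n_cases, control_carriers, n_controls):
    """One-sided Fisher's exact p value (assumed test direction).

    Returns the hypergeometric probability of seeing at least
    `case_carriers` carriers among the cases, given the table margins.
    """
    total_carriers = case_carriers + control_carriers
    n_total = n_cases + n_controls
    denominator = comb(n_total, total_carriers)
    p = 0.0
    # Sum over all tables at least as extreme as the observed one.
    for k in range(case_carriers, min(total_carriers, n_cases) + 1):
        p += comb(n_cases, k) * comb(n_controls, total_carriers - k) / denominator
    return p


# Counts taken from the text: 6 carriers in 1,308 cases, 0 in 1,120 controls.
p_value = fisher_one_sided(6, 1308, 0, 1120)
print(f"one-sided p = {p_value:.3f}")
```

A two-sided version would also add the probability of the opposite extreme (all six carriers among controls) and so give a somewhat larger value; which convention the authors used is not stated here.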
In addition to the protein-truncating mutations and the above-described missense variants, a number of missense, silent, and intronic variants were also observed in XRCC2, and common SNPs that were reported in public databases such as dbSNP, HapMap, or the 1000 Genomes Project were also identified. These included the common coding SNP c.563G>A (p.Arg188His) (rs3218536), one silent substitution, three 5′ UTR variants, five 3′ UTR variants, and six intronic variants in the vicinity of exon-intron boundaries. All these variants were predicted to be neutral according to various in silico prediction tools (Supplemental Data, Tables 1 and 2). For common SNPs (>1% in controls), no difference in allele frequency was observed between cases and controls in the BCFR series. The genetic studies included in this report received approval from The University of Melbourne Human Research Ethics Committee, the International Agency for Research on Cancer institutional review board (IRB), and the local IRBs of every center from which we report findings.
Figure 2. XRCC2 Multiple-Sequence Alignment Centered on Position Arg91. Missense substitutions observed in this interval are given with the missense residue directly above the corresponding human reference sequence residue. The following abbreviations are used: Hsap, Homo sapiens; Mmul, Macaca mulatta; Mmus, Mus musculus; Cfam, Canis familiaris; Lafr, Loxodonta africana; Mdom, Monodelphis domestica; Oana, Ornithorhynchus anatinus; Ggal, Gallus gallus; Acar, Anolis carolinensis; Xtro, Xenopus tropicalis; Drer, Danio rerio; Bflo, Branchiostoma floridae; Spur, Strongylocentrotus purpuratus; Nvec, Nematostella vectensis; and Tadh, Trichoplax adhaerens. The alignment, or updated versions thereof, is available at the Align-GVGD website (see Web Resources).
Of the six distinct rare variants predicted to severely affect protein function and identified in our work, two were truncating mutations, and four were missense changes. Although most recognized pathogenic mutations in the major breast cancer susceptibility genes are protein truncating, there is evidence that missense mutations might be more prominent in some more recently identified
  • 117. breast cancer susceptibility genes. For example, in comprehensive studies of ATM and CHEK2, the proportion of probably deleterious or pathogenic rare sequence variants that are missense changes is often over 50%. More relevantly, estimates of breast cancer risk are higher for missense variants than they are for protein-truncating variants. This has been observed through case-control mutation-screening analyses of ATM and CHEK217,18 and through a pedigree analysis21 of ATM; in these analyses, the breast cancer risk associated with one specific missense mutation approaches the average risk associated with pathogenic BRCA2 mutations. A very recent analysis of PALB2 mutations found no difference in the frequency of missense mutations between two case groups (contralateral and unilateral breast cancer cases),22 suggesting that the contribution of missense mutations to breast cancer risk might vary between susceptibility genes. Our finding of XRCC2 as a breast cancer susceptibility gene expands the proportion of breast cancer that is associated with rare mutations in the HR-DNA-repair pathways and the number of breast cancer susceptibility genes in which biallelic mutations are associated with FA; the precise contribution of mutation in these genes will become clearer as more whole-exome-sequencing (or whole-genome-sequencing) and targeted-pathway-sequencing studies are performed. XRCC2 mutations appear to be very rare, even in the context of multiple-case families; they appear in 1 of 66 (1.5%) early-onset female breast cancer cases with a strong family history of the disease present in the ABCFR, compared to 9 (14%) BRCA1 mutations, 6 (9%) BRCA2 mutations, 3 (5%) TP53 (MIM 191170) mutations, and 2 (3%) PALB2 mutations.
These frequencies are consistent with data from both breast cancer linkage studies that have suggested that no single gene is likely to account for a large fraction of the remaining familial aggregation of breast cancer5 and reports from recent candidate-gene sequencing studies that have associated other members of the HR pathway with breast cancer susceptibility.23,24 Although mutations in HR-DNA-repair genes are rare, it is important to identify people whose breast cancer is associated with HR-DNA-repair dysfunction because they could benefit from specific targeted treatments such as PARP inhibitors. Unaffected relatives of people with a mutation in a HR-DNA-repair gene could also be offered predictive testing and subsequent clinical management and genetic counseling on the basis of their mutation status. The identification of a family with rare mutations in both XRCC2 and BRCA1 illustrates the complexity of the underlying genetic architecture of breast cancer susceptibility for some families and the challenges for personalized risk-prediction models that are incorporating an increasing array of risk factors, which include rare mutations in breast cancer susceptibility genes and more common genetic variation. Currently, estimating the relative importance of the XRCC2 mutation to the breast cancer risk for members of this family is difficult because of the presence of a BRCA1 protein-truncating mutation in the proband in addition to the XRCC2 mutation.
Table 1. Mutation Screening in Multiple-Case Breast Cancer Families
Rare XRCC2 Variants | Effect on Protein | Align-GVGD (a) | SIFT (b) | PolyPhen-2.1 (HumDiv) | Case or Control | Pedigree (Study Source) | Age and Origin of Carrier
Truncating variants
c.651_652del | p.Cys217* | - | - | - | case | Figure 1A (ABCFR) (e) | 29, white
c.651_652del | p.Cys217* | - | - | - | case (c) | Figure 1C (kConFab) | 36, white
c.651_652del | p.Cys217* | - | - | - | control | Figure 1D (MCCS) | 72, white
Missense substitutions
c.271C>T | p.Arg91Trp | C65 | 0.00 | probably damaging | case | Figure 1B (Dutch) (e) | 40, white
c.271C>T | p.Arg91Trp | C65 | 0.00 | probably damaging | case (d) | Figure 1E (ABCFR) | 32, white
c.283A>C | p.Ile95Val | C0 | 0.34 | benign | case | - (kConFab) | 59, white
c.283A>G | p.Ile95Leu | C0 | 0.41 | benign | case | - (kConFab) | 70, white
c.283A>C | p.Ile95Val | C0 | 0.34 | benign | case | - (BRICOH) | 68, white
Silent substitution
c.582G>T | p.Thr194Thr | - | - | - | case | - (kConFab) | 60, white
The following abbreviations are used: ABCFR, Australian Breast Cancer Family Registry; kConFab, Kathleen Cuningham Foundation Consortium for Research into Familial Breast Cancer; MCCS, Melbourne Collaborative Cohort Study; and BRICOH, Beckman Research Institute of City of Hope.
(a) Protein multiple sequence alignment (PMSA) used for obtaining scores for Align-GVGD: from Human to Branchiostoma floridae (Bflo).
(b) PMSA used for obtaining scores for SIFT: from Human to Trichoplax (Tadh).
(c) This woman also carries BRCA1 c.70_80del (p.Cys24Serfs*13).
(d) This carrier of p.Arg91Trp was identified through both the ABCFR multiple-case family screening and the BCFR-IARC (Breast Cancer Family Registry-International Agency for Research on Cancer) case-control screening.
(e) Family included in the exome-sequencing phase.
Many examples have been described of individuals and families carrying deleterious mutations in more than
  • 118. one proven breast cancer susceptibility gene; one such example is the co-observation of BRCA1, BRCA2, ATM, and CHEK2 mutations.21,25 This study demonstrates the power of massively parallel sequencing in the discovery of additional breast cancer susceptibility genes when used with an appropriate study design. Our approach could be applied to other common, complex diseases with components of unexplained heritability.
Supplemental Data
Supplemental Data include six tables and can be found with this article online at http://www.cell.com/AJHG.
Acknowledgments
This work was supported by Cancer Council Victoria (grant 628774), the National Institutes of Health (R01CA155767 and R01CA121245), the Australian National Health and Medical Research Council (grant 466668), The University of Melbourne (infrastructure award to J.L.H.), a Victorian Life Sciences Computation Initiative grant (VR00353) on its Peak Computing Facility at the University of Melbourne, and an initiative of the Victorian Government and Dutch Cancer Society (grant UL 2009-4388). The research resources, including the Melbourne Collaborative Cohort Study, the Australian Breast Cancer Family Study, the Breast Cancer Family Registry, and the Kathleen Cuningham Foundation Consortium for Research into Familial Breast Cancer, are further acknowledged in the supplementary information. We wish to thank Nivonirina Robinot and Geoffroy Durand for their technical help during the case-control mutation screening at the International Agency for Research on Cancer, Georgia Chenevix-Trench for her support of and contribution to the establishment of the case-control mutation-screening study, and Greg Wilhoite for sequencing the male breast cancer cases at the Beckman Research Institute of City of Hope. This work and partial support for S.L.N. was provided by the Morris and Horowitz Families Endowment.
Work at the Spanish National Cancer Center was partially funded by the Spanish Association Against Cancer and Health Ministry (FIS08/1120). M.C.S. is a National Health and Medical Research Council (NHMRC) Senior Research Fellow and a Victorian Breast Cancer Research Consortium (VBCRC) Group Leader. J.L.H. is a NHMRC Australia Fellow and a VBCRC Group Leader. T.N.-D. is a Susan G. Komen for the Cure Postdoctoral Fellow.
Received: November 20, 2011
Revised: January 16, 2012
Accepted: February 29, 2012
Published online: March 29, 2012
Web Resources
The URLs for data presented herein are as follows:
Align-GVGD, http://agvgd.iarc.fr/alignments
GATK v.1.0.4418, http://gatk.sourceforge.net/
Genome Viewer (IGV v.1.5.48), http://www.broadinstitute.org/software/igv/
Online Mendelian Inheritance in Man (OMIM), http://www.omim.org
Picard v.1.29, http://sourceforge.net/projects/picard/
PolyPhen2.1, http://genetics.bwh.harvard.edu/pph2/
SIFT, http://sift.jcvi.org/
SOLiD Baylor protocol 2.1, http://www.hgsc.bcm.tmc.edu/documents/Preparation_of_SOLiD_Capture_Libraries.pdf
UCSC Genome Browser, http://genome.ucsc.edu/cgi-bin/hgGateway
Table 2. Case-Control Mutation Screening Applied to the BCFR Population-Based Study
Rare XRCC2 Variants | Effect on Protein | Align-GVGD (a) | SIFT (b) | PolyPhen-2.1 (HumDiv) | Case (n = 1,308) or Control (n = 1,120) | Pedigree (BCFR) | Age and Origin of Carrier
Truncating variants
c.49C>T | p.Arg17* | - | - | - | case | Figure 1F | 33, white
Missense substitutions
c.46G>T | p.Ala16Ser | C0 | 0.24 | benign | case | - | 44, East Asian
c.181C>A | p.Leu61Ile | C0 | 0.00 | possibly damaging | case | Figure 1H | 30, East Asian
c.271C>T | p.Arg91Trp | C65 | 0.00 | probably damaging | case (c) | Figure 1E | 32, white
c.283A>G | p.Ile95Val | C0 | 0.34 | benign | control | - | 44, white
c.693G>T | p.Trp231Cys | C65 | 0.00 | probably damaging | case (d) | Figure 1I | 44, East Asian
c.808T>G | p.Phe270Val | C45 | 0.00 | probably damaging | case | Figure 1J | 38, African
Silent substitution
c.354G>A | p.Val118Val | - | - | - | case (d) | - | 44, East Asian
5′ UTR variants
c.-1G>A | ? | - | - | - | case (e) | - | 32, white
The following abbreviation is used: BCFR, Breast Cancer Family Registry.
(a) Protein multiple sequence alignment (PMSA) used for obtaining scores for Align-GVGD: from Human to Branchiostoma floridae (Bflo).
(b) PMSA used for obtaining scores for SIFT: from Human to Trichoplax (Tadh).
(c) This carrier of p.Arg91Trp was identified through both the ABCFR multiple-case family screening and the BCFR-IARC (Breast Cancer Family Registry-International Agency for Research on Cancer) case-control screening.
(d) This 44-year-old East Asian case carries p.Trp231Cys and p.Val118Val.
(e) This case is considered a ‘‘noncarrier’’ in the analysis.
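The rule used in the text, that a variant is "probably deleterious" when at least two prediction algorithms call it deleterious, can be illustrated on the missense rows of Table 2. The sketch below is not the authors' procedure: the per-tool cutoffs (any Align-GVGD grade above C0, a SIFT score of 0.05 or lower, and a PolyPhen-2 call of possibly or probably damaging) are assumptions made for illustration, and the names `MISSENSE_ROWS`, `deleterious_votes`, and `probably_deleterious` are hypothetical.

```python
# Missense rows of Table 2:
# (protein change, Align-GVGD grade, SIFT score, PolyPhen-2 call, case/control)
MISSENSE_ROWS = [
    ("p.Ala16Ser", "C0", 0.24, "benign", "case"),
    ("p.Leu61Ile", "C0", 0.00, "possibly damaging", "case"),
    ("p.Arg91Trp", "C65", 0.00, "probably damaging", "case"),
    ("p.Ile95Val", "C0", 0.34, "benign", "control"),
    ("p.Trp231Cys", "C65", 0.00, "probably damaging", "case"),
    ("p.Phe270Val", "C45", 0.00, "probably damaging", "case"),
]


def deleterious_votes(gvgd_grade, sift_score, polyphen_call):
    """Count how many of the three tools call the variant deleterious.

    All three cutoffs are illustrative assumptions, not values from the text.
    """
    votes = 0
    if gvgd_grade != "C0":  # assumed: any grade above C0 counts
        votes += 1
    if sift_score <= 0.05:  # assumed conventional SIFT cutoff
        votes += 1
    if polyphen_call in ("possibly damaging", "probably damaging"):  # assumed
        votes += 1
    return votes


def probably_deleterious(rows, min_votes=2):
    """Return (variant, status) pairs supported by at least `min_votes` tools."""
    return [
        (name, status)
        for name, gvgd, sift, polyphen, status in rows
        if deleterious_votes(gvgd, sift, polyphen) >= min_votes
    ]


called = probably_deleterious(MISSENSE_ROWS)
for name, status in called:
    print(name, status)
```

Under these assumed cutoffs, four missense substitutions are called, all in cases; together with the two protein-truncating variants described in the text, this agrees with the six probably deleterious variants reported in cases and none in controls.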
  • 119. References
1. Turnbull, C., and Rahman, N. (2008). Genetic predisposition to breast cancer: Past, present, and future. Annu. Rev. Genomics Hum. Genet. 9, 321–345.
2. John, E.M., Hopper, J.L., Beck, J.C., Knight, J.A., Neuhausen, S.L., Senie, R.T., Ziogas, A., Andrulis, I.L., Anton-Culver, H., Boyd, N., et al.; Breast Cancer Family Registry. (2004). The Breast Cancer Family Registry: An infrastructure for cooperative multinational, interdisciplinary and translational studies of the genetic epidemiology of breast cancer. Breast Cancer Res. 6, R375–R389.
3. Giles, G.G., and R, E.D. (2002). The Melbourne Collaborative Cohort Study. IARC Sci. Publ. 156, 2.
4. Cartwright, R., Tambini, C.E., Simpson, P.J., and Thacker, J. (1998). The XRCC2 DNA repair gene from human and mouse encodes a novel member of the recA/RAD51 family. Nucleic Acids Res. 26, 3084–3089.
5. Deans, B., Griffin, C.S., O’Regan, P., Jasin, M., and Thacker, J. (2003). Homologous recombination deficiency leads to profound genetic instability in cells derived from Xrcc2-knockout mice. Cancer Res. 63, 8181–8187.
6. Tambini, C.E., Spink, K.G., Ross, C.J., Hill, M.A., and Thacker, J. (2010). The importance of XRCC2 in RAD51-related DNA damage repair. DNA Repair (Amst.) 9, 517–525.
7. Moynahan, M.E., Chiu, J.W., Koller, B.H., and Jasin, M. (1999). Brca1 controls homology-directed DNA repair. Mol. Cell 4, 511–518.
8. Moynahan, M.E., Pierce, A.J., and Jasin, M. (2001). BRCA2 is required for homology-directed repair of chromosomal breaks. Mol. Cell 7, 263–272.
9. Meindl, A., Hellebrand, H., Wiek, C., Erven, V., Wappenschmidt, B., Niederacher, D., Freund, M., Lichtner, P., Hartmann, L., Schaal, H., et al. (2010). Germline mutations in breast and ovarian cancer pedigrees establish RAD51C as a human cancer susceptibility gene. Nat. Genet. 42, 410–414.
10. Shamseldin, H.E., Elfaki, M., and Alkuraya, F.S. (2012). Exome sequencing reveals a novel Fanconi group defined by XRCC2 mutation. J. Med. Genet. 49, 184–186.
11. Gao, L.-B., Pan, X.-M., Li, L.-J., Liang, W.-B., Zhu, Y., Zhang, L.-S., Wei, Y.-G., Tang, M., and Zhang, L. (2011). RAD51 135G/C polymorphism and breast cancer risk: A meta-analysis from 21 studies. Breast Cancer Res. Treat. 125, 827–835.
12. Loveday, C., Turnbull, C., Ramsay, E., Hughes, D., Ruark, E., Frankum, J.R., Bowden, G., Kalmyrzaev, B., Warren-Perry, M., Snape, K., et al.; Breast Cancer Susceptibility Collaboration (UK). (2011). Germline mutations in RAD51D confer susceptibility to ovarian cancer. Nat. Genet. 43, 879–882.
13. Liu, N., Schild, D., Thelen, M.P., and Thompson, L.H. (2002). Involvement of Rad51C in two distinct protein complexes of Rad51 paralogs in human cells. Nucleic Acids Res. 30, 1009–1015.
14. Griffin, C.S., Simpson, P.J., Wilson, C.R., and Thacker, J. (2000). Mammalian recombination-repair genes XRCC2 and XRCC3 promote correct chromosome segregation. Nat. Cell Biol. 2, 757–761.
15. Lin, W.-Y., Camp, N.J., Cannon-Albright, L.A., Allen-Brady, K., Balasubramanian, S., Reed, M.W.R., Hopper, J.L., Apicella, C., Giles, G.G., Southey, M.C., et al. (2011). A role for XRCC2 gene polymorphisms in breast cancer risk and survival. J. Med. Genet. 48, 477–484.
16. Rafii, S., O’Regan, P., Xinarianos, G., Azmy, I., Stephenson, T., Reed, M., Meuth, M., Thacker, J., and Cox, A. (2002). A potential role for the XRCC2 R188H polymorphic site in DNA-damage repair and breast cancer. Hum. Mol. Genet. 11, 1433–1438.
17. Le Calvez-Kelm, F., Lesueur, F., Damiola, F., Vallée, M., Voegele, C., Babikyan, D., Durand, G., Forey, N., McKay-Chopin, S., Robinot, N., et al.; Breast Cancer Family Registry. (2011). Rare, evolutionarily unlikely missense substitutions in CHEK2 contribute to breast cancer susceptibility: results from a breast cancer family registry case-control mutation-screening study. Breast Cancer Res. 13, R6.
18. Tavtigian, S.V., Oefner, P.J., Babikyan, D., Hartmann, A., Healey, S., Le Calvez-Kelm, F., Lesueur, F., Byrnes, G.B., Chuang, S.-C., Forey, N., et al.; Australian Cancer Study; Breast Cancer Family Registries (BCFR); Kathleen Cuningham Foundation Consortium for Research into Familial Aspects of Breast Cancer (kConFab). (2009). Rare, evolutionarily unlikely missense substitutions in ATM confer increased risk of breast cancer. Am. J. Hum. Genet. 85, 427–446.
19. Mann, G.J., Thorne, H., Balleine, R.L., Butow, P.N., Clarke, C.L., Edkins, E., Evans, G.M., Fereday, S., Haan, E., Gattas, M., et al.; Kathleen Cuningham Consortium for Research in Familial Breast Cancer. (2006). Analysis of cancer risk and BRCA1 and BRCA2 mutation prevalence in the kConFab familial breast cancer resource. Breast Cancer Res. 8, R12.
20. Ding, Y.C., Steele, L., Chu, L.-H., Kelley, K., Davis, H., John, E.M., Tomlinson, G.E., and Neuhausen, S.L. (2011). Germline mutations in PALB2 in African-American breast cancer cases. Breast Cancer Res. Treat. 126, 227–230.
21. Goldgar, D.E., Healey, S., Dowty, J.G., Da Silva, L., Chen, X., Spurdle, A.B., Terry, M.B., Daly, M.J., Buys, S.M., Southey, M.C., et al.; BCFR; kConFab. (2011). Rare variants in the ATM gene and risk of breast cancer. Breast Cancer Res. 13, R73.
22. Tischkowitz, M., Capanu, M., Sabbaghian, N., Li, L., Liang, X., Vallée, M.P., Tavtigian, S.V., Concannon, P., Foulkes, W.D., Bernstein, L., et al.; The WECARE Study Collaborative Group. (2012). Rare germline mutations in PALB2 and breast cancer risk: A population-based study. Hum. Mutat. 33, 674–680.
23. Rahman, N., Seal, S., Thompson, D., Kelly, P., Renwick, A., Elliott, A., Reid, S., Spanova, K., Barfoot, R., Chagtai, T., et al.; Breast Cancer Susceptibility Collaboration (UK). (2007). PALB2, which encodes a BRCA2-interacting protein, is a breast cancer susceptibility gene. Nat. Genet. 39, 165–167.
24. Seal, S., Thompson, D., Renwick, A., Elliott, A., Kelly, P., Barfoot, R., Chagtai, T., Jayatilake, H., Ahmed, M., Spanova, K., et al.; Breast Cancer Susceptibility Collaboration (UK). (2006). Truncating mutations in the Fanconi anemia J gene BRIP1 are low-penetrance breast cancer susceptibility alleles. Nat. Genet. 38, 1239–1241.
25. Turnbull, C., Seal, S., Renwick, A., Warren-Perry, M., Hughes, D., Elliott, A., Pernet, D., Peock, S., Adlard, J.W., Barwell, J., et al.; Breast Cancer Susceptibility Collaboration (UK), EMBRACE. (2012). Gene-gene interactions in breast cancer susceptibility. Hum. Mol. Genet. 21, 958–962.
The American Journal of Human Genetics 90, 734–739, April 6, 2012