Trends in genetics_-_october_2013
Published in: Technology, Education
  • 1. Trends in Genetics, October 2013, Volume 29, Number 10, pp. 555–608. Special Issue: Human Genetics
    Editor: Rhiannon Macrae. Portfolio Manager: Milka Kostic. Journal Manager: Basil Nyaku. Journal Administrators: Ria Otten and Patrick Scheffmann.
    Advisory Editorial Board: K.V. Anderson, New York, USA; A. Clark, Ithaca, USA; G. Fink, Cambridge, USA; S. Gasser, Geneva, Switzerland; D. Goldstein, Durham, USA; L. Guarente, Cambridge, USA; Y. Hayashizaki, Yokohama, Japan; S. Henikoff, Seattle, USA; J. Hodgkin, Oxford, UK; H.R. Horvitz, Cambridge, USA; L. Hurst, Bath, UK; E. Koonin, Bethesda, USA; E. Meyerowitz, Pasadena, USA; S. Moreno, Salamanca, Spain; A. Nieto, Alicante, Spain; C. Scazzocchio, Orsay, France and London, UK; D. Tautz, Plön, Germany; O. Voinnet, Strasbourg, France; J. Wysocka, Stanford, California.
    Editorial Enquiries: Trends in Genetics, Cell Press, 600 Technology Square, 5th floor, Cambridge MA 02139, USA. Tel: +1 617 397 2818. Fax: +1 617 397 2810.
    Cover: In this special issue of Trends in Genetics, we turn the lens on ourselves. The articles this month focus on human genetics, with topics ranging from resources and methods to make the most of the explosion of sequencing data to evolutionary questions about mutation rates and how selection acts through pregnancy. Cover image: iStockKameleonMedia.
    Contents
    Editorial: Inherited uncertainty (Rhiannon Macrae)
    Science & Society: Genome sequencing for healthy individuals (Saskia C. Sanderson)
    Spotlight: LongevityMap: a database of human genetic variants associated with longevity (Arie Budovsky, Thomas Craig, Jingwei Wang, Robi Tacutu, Attila Csordas, Joana Lourenço, Vadim E. Fraifeld, and João Pedro de Magalhães)
    Opinions: The role of gene conversion in preserving rearrangement hotspots in the human genome (Jeffrey A. Fawcett and Hideki Innan); Human housekeeping genes, revisited (Eli Eisenberg and Erez Y. Levanon)
    Reviews and Feature Review: Properties and rates of germline mutations in humans (Catarina D. Campbell and Evan E. Eichler); Many ways to die, one way to arrive: how selection acts through pregnancy (Elizabeth A. Brown, Maryellen Ruvolo, and Pardis C. Sabeti); Finding the lost treasures in exome sequencing data (David C. Samuels, Leng Han, Jiang Li, Sheng Quanghu, Travis A. Clark, Yu Shyr, and Yan Guo); The role of AUTS2 in neurodevelopment and human evolution (Nir Oksenberg and Nadav Ahituv)
  • 2. Inherited uncertainty
    Rhiannon Macrae
    My college physics textbook contained an anecdote about a physics professor who used to joke that instead of giving a seminar as part of their thesis defense, students should instead demonstrate their faith in physical principles by walking over a bed of hot coals. The trick is to get your feet wet first (hence, many people walk across dewy grass before stepping on to the coals), and the moisture will create an insulating vapor barrier through a phenomenon called the Leidenfrost effect, protecting your bare skin from the heat of the coals. If walking across hot coals is the ultimate test of a physicist’s faith in the laws of the universe, the equivalent for a geneticist is having a baby (Figure 1).
    Although it was not until Gregor Mendel presented his work in 1865 that inheritance was formally quantitated, humans innately understood the concept of heredity well before then. Perhaps the most pervasive evidence of this comes from breeding programs dating back to prehistoric times, in which animals or plants with desirable traits were selectively bred. Plato wrote about extending these ideas to humans, and history is full of examples of known familial diseases, such as hemophilia. The development of molecular genetics transformed these observations into a mechanistic understanding of the hereditary material, and now with the advent of genomic technologies, a full picture of inheritance is beginning to emerge. Efforts are underway to identify the genetic changes underlying every known Mendelian disorder, and much work has been done to demonstrate associations between genetic variants and human traits (e.g., the GIANT consortium). It is easy to see in these systematic approaches a future of predictable genetic outcomes.
    The reality of the uncertainty in what lies in an individual’s DNA, however, announces itself along with the news of pregnancy. 
    Although prenatal genetic screening is now routinely offered for some diseases, such as cystic fibrosis carrier testing or trisomy screening, thousands of known causal variants go untested, despite the feasibility of noninvasive fetal genome sequencing. Even with this new technology, the unknown variants and the dreaded ‘variants of unknown significance’ continue to pose challenges to our understanding of the genotype–phenotype relation. I suspect most expecting parents do not phrase their fears in those terms, but I would venture that most if not all are hoping not so much for a boy or a girl, but for a healthy baby. Luckily for the parents (and the human race), this wish is often granted, allowing parents to refocus all their energy on raising their healthy baby, arriving at another classic debate in genetics – nature versus nurture.
    For indeed, your DNA is not your fate. Our prehistoric ancestors knew that even crops planted from the hardiest and most productive parents would fail in a drought. A catalog of all the disease-associated variants in the human genome would still only provide probabilities of outcomes in many cases, and it is difficult to imagine an algorithm sophisticated enough to consider all of the gene × environment interactions that could influence those probabilities. Add in epigenetics, and it begins to feel as though we know less about inheritance than Mendel did.
    Nevertheless, we continue to put our faith in the processes that guide evolution and bring new lives into the world. It would be nice if there were a simple trick to ensure success, but for all the advice new parents receive, there is no equivalent to the suggestion to get your feet wet before walking across hot coals. Physicists are currently exploring the limits of the universe, but geneticists are still expanding the limits of what is knowable. 
    In this Special Issue on human genetics, authors tackle this question from a variety of angles, from describing resources and methods for probing the human genome to discussing how evolution has shaped our species.
    As we go to press, my husband and I will be completing the 9-month pilot phase of our own human genetics project. Preliminary data indicate that it’s a healthy girl.
    Figure 1. An ultrasound image at 12 weeks of pregnancy. Courtesy of Wolfgang Moroder.
    Editorial. Corresponding author: Macrae, R. 0168-9525/$ – see front matter © 2013 Elsevier Ltd. All rights reserved. Trends in Genetics, October 2013, Vol. 29, No. 10, p. 555
  • 3. Genome sequencing for healthy individuals
    Saskia C. Sanderson
    Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
    Genome sequencing of healthy individuals has the potential to lead to improved well-being and disease prevention, but numerous challenges remain that must be addressed to realize these benefits and, importantly, these benefits must be equitable across society.
    Sequencing people, not only patients
    Over the past few years, several seemingly healthy individuals have had their genomes sequenced, analyzed, and published in peer-reviewed scientific journals. These include scientist Mike Snyder at Stanford [1], eight other individuals at Stanford [2], and participants in the Personal Genome Project at Harvard [3]. There is considerable hope that whole-genome sequencing (WGS) in healthy individuals will lead to great advances in disease prevention and improved well-being [4]. However, numerous challenges and concerns exist, including the costs of analyzing and interpreting WGS data as well as the potential for adverse outcomes such as confusion, anxiety, inappropriate referrals, and overutilization of health services [5–7]. Although more research is required to evaluate these pros and cons, if implemented fairly there is a great potential for WGS to improve the lives of people regardless of whether or not they currently appear healthy.
    The promise: improved health and well-being
    Sequencing the first human genome took 15 years and $3 billion. Today, a human genome can be sequenced for ~$3000 in a few days, and costs are expected to continue to fall. Although WGS is currently used primarily for clinical diagnostic and research purposes, WGS in seemingly healthy individuals has the promise to empower them to take greater control of their lives, and to take action to prevent diseases earlier and more effectively. 
    In the future, WGS may provide healthy individuals with carrier information relevant to reproductive decision-making and pharmacogenomic information to inform drug prescribing and dosage. It may also identify people who appear healthy but who have rare variants that greatly increase their risk of cancer or a cardiac event [8], or combinations of common variants that modestly increase their risk of common, complex diseases such as type 2 diabetes [2] or psychiatric conditions such as bipolar disorder. This may enable doctors to intervene with medications or procedures, and/or motivate individuals to make risk-reducing changes themselves, such as losing weight, quitting smoking, reducing stress, improving medication adherence, or increasing screening. There is significant commercial as well as academic and public health interest in capitalizing on these potential advantages.
    The challenges along the way
    There are also significant challenges to applying WGS in the context of healthy individuals. WGS for a healthy individual is an open-ended investigation: the sheer volume of data that could potentially be informative is currently overwhelming [9]. The nature of the data challenges current notions of what can be guaranteed regarding confidentiality and privacy [10]. Other policy aspects, such as those related to discrimination and insurance [7], as well as logistical issues including storage of such vast amounts of data [5] and access within electronic healthcare records [4], must also be considered.
    The volume of data produced poses particular challenges regarding analysis and interpretation [7]. Today, it takes many person-hours to curate, analyze, and interpret the thousands of variants arising from WGS that may be significant for a healthy individual. Vast amounts of work are involved in translating the raw data into comprehensive but easy-to-understand results that can confidently be communicated back to the individual. 
    Although the ACMG provides guidelines regarding the return of incidental findings in clinical settings [11], deciding where to draw the line between known pathogenic and suspected pathogenic variants is a major barrier to rapidly interpreting WGS data for healthy individuals. It is likely to be some time before analysis and interpretation pipelines are fully automated and user interfaces enabling individuals to access results in meaningful ways are developed and widely adopted.
    Ethical considerations, including the implications for family members [7], also pose important challenges for WGS for healthy individuals. Crucially, the question of the appropriate age at which to consider introducing WGS needs to be addressed. This was highlighted by the ACMG guidelines, which recommended returning incidental findings about specific, high-penetrance variants regardless of age [11], sparking considerable debate. The notion of children or adolescents having their genomes sequenced, particularly without an immediate clinical need, is ethically challenging and raises important questions around assent and consent. However, the value of waiting until adulthood before implementing WGS is also debatable.
    (Science & Society. Corresponding author: Sanderson, S.C.)
    In addition, healthcare providers are unprepared for the deluge of genomic data that WGS produces: they typically
  • 4. have minimal understanding of genomics and lack confidence in their ability to interpret genomic information for their patients. Some genomics education efforts for healthcare providers are underway, but more are urgently needed.
    New models of consent and return of results are needed
    As Biesecker emphasized, WGS ‘is a resource, not a test’ [12]. This is particularly true for healthy individuals. In the future, WGS results will not be offered at a single moment in time. Instead, the individual or clinician will interrogate the data in different ways over time depending on life-stage, circumstances, and evolving genomics knowledge. This has implications for consent and counseling because it poses a challenge to how informed consent is conceptualized. To make informed decisions about WGS, individuals should be helped to understand the potential risks, benefits, and uncertainties of WGS, and think fully through how potential results would make them think, feel, and act. However, this is virtually impossible when WGS results could pertain to any disease or trait in the world, and the interpretation of the results will continue to evolve with ongoing research. Patient expectations about the potential outcomes of WGS must be realistically set both during informed consent and via public education initiatives.
    In addition to consent, models for the return of results will need to be modified. Traditional genetic counseling models involve hours of in-person education and support from already overstretched genetic counselors [5], which is clearly unsustainable in this new context. Novel multimedia approaches to patient education are needed to help patients make informed decisions about WGS [13], particularly when there is no primary phenotype of immediate concern. In addition, whether individual preferences regarding return of specific WGS results should be taken into account remains an open question. 
    On the one hand, the ACMG suggests that it is impractical to incorporate patient preferences regarding incidental findings into the WGS process [11]. On the other, some investigators are already building novel, dynamic, multimedia tools to assess and incorporate patient preferences into WGS pipelines [13].
    Will WGS affect behaviors and emotions?
    Although early studies found little evidence that genetic risk information influenced individual health behaviors such as quitting smoking [14], these ‘proof-of-principle’ studies tested for single variants of low penetrance. It is therefore not surprising that there was little impact upon individual perceptions of disease threat or subsequent motivation to change behavior, given the small effects on disease risk and the lack of objective clinical benefit that could be achieved from this knowledge. Our understanding of genomic influences on disease is rapidly increasing, however, and current investigations in which complex, multiscale personal information about healthy individuals is generated based on WGS information integrated with multiple other ‘omics’ data [1,2] bear little resemblance to those early studies in which individuals were tested for one single-nucleotide polymorphism (SNP) or variant of similarly low penetrance [14], or selection of SNP-based risk scores.
    Similarly, early studies did not find significant emotional impacts from personal genomic information [15]. However, again, these were not based on WGS, and there is far greater potential for WGS to produce unanticipated results that may be valued by one individual, but completely devastating to another. The potential for emotional harm from WGS should not be underestimated – nor should it be overstated. One trial funded by the US National Institutes of Health (NIH), the MedSeq Project, is beginning to explore these issues. 
    More evidence from randomized trials with larger samples of diverse populations is needed before conclusions about behavioral and emotional effects of WGS on healthy individuals can be drawn.
    Given the limited evidence base today, the loud skepticism regarding the potential for genomic information to succeed in motivating people to make health-protective behavioral changes where other efforts have failed is understandable. Behavior change is unquestionably hard, but this should propel us to continue exploring whether WGS together with other emerging self-monitoring and big data applications will help change behaviors. It is imperative that we do this in an ethically responsible way that minimizes the potential for harms. The jury is still out, and the behavioral and emotional effects of personal WGS information remain to be seen.
    Equitable access for all
    Most healthy individuals who have had their genomes sequenced to date are early adopters, scientists experimenting on themselves, or people with the means and resources to obtain WGS through initiatives such as the Illumina Understand Your Genome conferences. This self-experimentation is valuable while pipelines are still being built and challenges regarding results communication are still being tackled. Simultaneous efforts are needed, however, to ensure that WGS does not contribute to the already wide health disparities across society. The declining costs of WGS will undoubtedly be pivotal, as will efforts already underway to broaden genomics research to include underrepresented populations. 
    Furthermore, explicit efforts are needed to ensure that informed consent procedures are accessible and appropriate for people with lower literacy levels, patient education materials are developed that are accessible and understandable, results are communicated in ways that are easy to understand by people across a spectrum of educational attainment, and WGS is accessible to individuals from all walks of life, not only those with the greatest resources. Only then will the promise of WGS be truly realized.
    Acknowledgments
    I am deeply indebted to Barbara Biesecker, Robert Green, Muin Khoury, Eric Schadt, Jo Waller, and Ron Zimmern for their valuable feedback on an earlier draft of this article.
    References
    1 Chen, R. et al. (2012) Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell 148, 1293–1307
    2 Patel, C.J. et al. (2013) Whole genome sequencing in support of wellness and health maintenance. Genome Med. 5, 58
  • 5. References (continued)
    3 Angrist, M. (2009) Eyes wide open: the personal genome project, citizen science and veracity in informed consent. Pers. Med. 6, 691–699
    4 Burn, J. (2013) Should we sequence everyone’s genome? Yes. BMJ 346, 3133
    5 Brunham, L.R. and Hayden, M.R. (2012) Whole-genome sequencing: the new standard of care? Science 336, 1112–1113
    6 Flinter, F. (2013) Should we sequence everyone’s genome? No. BMJ 346, 3132
    7 Ormond, K.E. et al. (2010) Challenges in the clinical application of whole-genome sequencing. Lancet 375, 1749–1751
    8 Evans, J.P. et al. (2013) We screen newborns, don’t we? Realizing the promise of public health genomics. Genet. Med. 15, 332–334
    9 Cassa, C.A. et al. (2012) Disclosing pathogenic genetic variants to research participants: quantifying an emerging ethical responsibility. Genome Res. 22, 421–428
    10 Schadt, E.E. (2012) The changing privacy landscape in the era of big data. Mol. Syst. Biol. 8, 612
    11 Green, R.C. et al. (2013) ACMG recommendations for reporting of incidental findings in clinical exome and genome sequencing. Genet. Med. 15, 565–574
    12 Biesecker, L.G. (2012) Opportunities and challenges for the integration of massively parallel genomic sequencing into clinical practice: lessons from the ClinSeq project. Genet. Med. 14, 393–398
    13 Yu, J.H. et al. (2013) Self-guided management of exome and whole-genome sequencing results: changing the results return model. Genet. Med.
    14 Marteau, T.M. et al. (2010) Effects of communicating DNA-based disease risk estimates on risk-reducing behaviours. Cochrane Database Syst. Rev. 10, CD007275
    15 Bloss, C.S. et al. (2011) Effect of direct-to-consumer genomewide profiling to assess disease risk. N. Engl. J. Med. 364, 524–534
  • 6. LongevityMap: a database of human genetic variants associated with longevity
    Arie Budovsky(1,2,*), Thomas Craig(3,*), Jingwei Wang(3,*), Robi Tacutu(3), Attila Csordas(4), Joana Lourenço(3), Vadim E. Fraifeld(1), and João Pedro de Magalhães(3,*)
    1 The Shraga Segal Department of Microbiology, Immunology and Genetics, Center for Multidisciplinary Research on Aging, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel
    2 Judea Regional Research and Development Center, Carmel 90404, Israel
    3 Integrative Genomics of Ageing Group, Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, UK
    4 European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
    Understanding the genetic basis of human longevity remains a challenge but could lead to life-extending interventions and better treatments for age-related diseases. Toward this end we developed the LongevityMap, the first database of genes, loci, and variants studied in the context of human longevity and healthy ageing. We describe here its content and interface, and discuss how it can help to unravel the genetics of human longevity.
    Given the worldwide ageing of the population, studying the genetics of human longevity is of widespread importance [1,2]. Longevity is moderately heritable in humans (~25%), with increasing heritability with age [1], and exceptional longevity and healthy ageing in humans is an inherited phenotype [3]. Hundreds of longevity association studies have been performed in recent years and some genes associated with human longevity may be suitable targets for drug development [4]. Nonetheless, the heritability of human longevity remains largely unexplained in part due to the complexity of this phenotypic trait [1]. Thanks to advances in next-generation sequencing and genome-wide approaches, the capacity of longevity association studies is increasing. 
    The growing amounts of data being generated also increase the complexity of the data analysis and the difficulty of placing findings in the context of previous studies. We created the LongevityMap, the first catalogue of human genetic variants associated with longevity, to serve as a reference to help researchers navigate the rising tide of data related to human longevity.
    The LongevityMap is a new addition to our already highly successful collection of online databases and tools on the biology and genetics of ageing, the Human Ageing Genomic Resources [5]. GenAge, our existing database of ageing-related genes, focuses mostly on genes modulating longevity in model organisms plus the few genes associated with human progeroid syndromes [5], and thus there is an unmet need for a database of human genetic variants associated with longevity. As such, we followed the high standards and rigorous procedures of GenAge to develop the LongevityMap. Briefly, all entries in the LongevityMap were manually curated from the literature. Studies were selected following an in-depth literature survey. The LongevityMap is an inclusive database in which both large and small studies are included; different types of study are featured, from cross-sectional studies to studies of extreme longevity (e.g., centenarians). However, studies focused on cohorts of unhealthy individuals at baseline, such as cancer patients, were excluded. Details on study design are provided for each entry, including a brief description of the type of study, population ethnicity, sample size, age of probands and controls, and any gender bias. Negative results are also integrated in the LongevityMap to provide visitors with as much information as possible regarding each gene, variant, and locus previously studied in the context of longevity.
    Each entry refers to a specific observation from a study. 
    This means that studies, and large-scale studies in particular, can have multiple entries in the LongevityMap, reflecting different results and observations. Each entry also includes a brief description of the major conclusions. Entries are flagged regarding whether results were statistically significant or not, though many studies have marginal or indicative results that require a brief explanation of the findings. Our policy concerning controversial and subjective results is to detail the facts concerning the controversy and let users form their own opinions. A link to the primary publication in PubMed is always included in each entry.
    (Spotlight. Corresponding author: de Magalhães, J.P. Keywords: ageing; genetics; GWAS; humans; lifespan; polymorphisms. * These authors contributed equally to this work.)
    We developed an intuitive, user-friendly interface for the LongevityMap that allows users to query genes, variants (including by reference SNP ID number), studies, and cytogenetic locations (Figure 1A). Users can browse/filter the data by association (i.e., significant or non-significant), population, and chromosome. For each single nucleotide polymorphism (SNP) and gene, additional annotation was retrieved from the US National Center for Biotechnology Information (NCBI) databases dbSNP and RefSeq [6] to provide further information on
  • 7. genes associated with SNPs and gene function, respectively. Homologues in model organisms were obtained from the InParanoid database [7]. Links are widely implemented to allow users to identify quickly other entries related to a given study, gene, or variant. In fact, each gene in the LongevityMap has a gene-centric page that aggregates and condenses the information on the database taken from different studies. In addition, the LongevityMap is fully integrated with our other ageing-related databases to provide users with selected, relevant information. In particular, crosslinks to GenAge are included to indicate genes associated with progeroid syndromes and those with homologues in model organisms known to modulate ageing/longevity. If appropriate, links to other major databases, such as Ensembl, Swiss-Prot, dbSNP, HapMap, and NCBI Entrez, are included for each entry.
    At time of writing, the LongevityMap includes data from 246 studies, featuring 751 different genes and 1987 variants (Figure 1B). Similarly to our other ageing-related databases, the LongevityMap is freely available online under a Creative Commons Attribution license. The full dataset is available for download and third-party use. It is our hope that the LongevityMap will serve as a novel database to help researchers decipher the genetics of human longevity.
    Acknowledgements
    The authors wish to thank Joana Costa, Daniel Wuttke, and Alex Freitas for helping to collate data and for comments and suggestions. This work was funded by a Wellcome Trust grant (ME050495MES) to J.P.M. This work was also funded in part by the European Union Framework Program (FP) 7 Health Research Grant number HEALTH-F4-2008-202047 (to V.E.F.) and the Israel Ministry of Science and Technology (to A.B.). J.P.M. is also grateful for support from the Ellison Medical Foundation and R.T. is supported by a Marie Curie Intra-European Fellowship within FP7.
    References
    1 Christensen, K. et al. (2006) The quest for genetic determinants of human longevity: challenges and insights. Nat. Rev. Genet. 7, 436–448
    2 Chung, W.H. et al. (2010) The role of genetic variants in human longevity. Ageing Res. Rev. 9 (Suppl. 1), S67–S78
    3 Atzmon, G. et al. (2005) Biological evidence for inheritance of exceptional longevity. Mech. Ageing Dev. 126, 341–345
    4 de Magalhaes, J.P. et al. (2012) Genome–environment interactions that modulate aging: powerful targets for drug discovery. Pharmacol. Rev. 64, 88–101
    5 Tacutu, R. et al. (2013) Human ageing genomic resources: integrated databases and tools for the biology and genetics of ageing. Nucleic Acids Res. 41, D1027–D1033
    6 NCBI Resource Coordinators (2013) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 41, D8–D20
    7 Ostlund, G. et al. (2010) InParanoid 7: new algorithms and tools for eukaryotic orthology analysis. Nucleic Acids Res. 38, D196–D203
    Figure 1. LongevityMap home page which showcases the design and layout of the website as well as its multiple search options and links (A); old couple picture by Jonel Hanopol. Types and amount of data in the LongevityMap (B).
    Figure 1B data (type of data: number)
    Entries significantly associated with longevity: 249
    Entries not significantly associated with longevity: 255
    Total entries: 504
    Genes: 751
    Variants: 1987 (1832 with a refSNP number)
    Studies: 246
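    The browse/filter behaviour described above (by association, population, and chromosome) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Entry` fields, the `filter_entries` helper, and the toy records are hypothetical and are not taken from the LongevityMap download or its actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    # Hypothetical fields; the real LongevityMap schema may differ.
    gene: str
    variant: str
    population: str
    chromosome: str
    significant: bool


def filter_entries(entries, significant=None, population=None, chromosome=None):
    """Return the entries that match every filter that is not None."""
    result = []
    for entry in entries:
        if significant is not None and entry.significant != significant:
            continue
        if population is not None and entry.population != population:
            continue
        if chromosome is not None and entry.chromosome != chromosome:
            continue
        result.append(entry)
    return result


# Toy records invented for illustration only (not real database content).
ENTRIES = [
    Entry("GENE_A", "variant_1", "Population X", "1", True),
    Entry("GENE_A", "variant_2", "Population Y", "1", False),
    Entry("GENE_B", "variant_3", "Population X", "19", True),
]

if __name__ == "__main__":
    hits = filter_entries(ENTRIES, significant=True, population="Population X")
    print([entry.variant for entry in hits])
```

    Keeping one record per observation, rather than one per study, mirrors the statement that a single study can contribute several entries with different outcomes.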
  • 8. The role of gene conversion in preserving rearrangement hotspots in the human genome
    Jeffrey A. Fawcett and Hideki Innan
    Graduate University for Advanced Studies, Hayama, Kanagawa 240-0193, Japan
    Hotspots of non-allelic homologous recombination (NAHR) have a crucial role in creating genetic diversity and are also associated with dozens of genomic disorders. Recent studies suggest that many human NAHR hotspots have been preserved throughout the evolution of primates. NAHR hotspots are likely to remain active as long as the segmental duplications (SDs) promoting NAHR retain sufficient similarity. Here, we propose an evolutionary model of SDs that incorporates the effect of gene conversion and compare it with a null model that assumes SDs evolve independently without gene conversion. The gene conversion model predicts a much longer lifespan of NAHR hotspots compared with the null model. We show that the literature on copy number variants (CNVs) and genomic disorders, and also the results of additional analysis of CNVs, are all more consistent with the gene conversion model.
    Many rearrangement hotspots are shared across species
    Recombination is a major mutational mechanism that has a crucial role in producing genetic diversity. Because of its potential impact on important phenotypes, including diseases, much attention has been paid to recombination, whether it is allelic or nonallelic [1,2]. To understand the interaction between recombination and phenotypes, it is important to know how different parts of the genome differ in the rate at which recombination occurs. Recent genome-wide surveys demonstrated that the distribution of the recombination rate across the genome is far from uniform. Instead, there are several hotspots where recombination occurs at a much higher rate than in the rest of the genome [3,4]. This applies to both allelic and nonallelic recombination [5]. 
    Given that these hotspots are especially important in producing genetic diversity, a good understanding of their characteristics should be extremely valuable.
    Evolutionary approaches provide a means to investigate how these hotspots arose and have been maintained throughout evolution, which might enable us to better predict regions that affect the phenotype. A recent interesting finding is that most allelic recombination hotspots detected in the human genome do not exist in the chimpanzee genome, indicating a rapid turnover of hotspots [6,7]. This rapid turnover is at least partly because hotspots are largely determined by the fast-evolving PR domain-containing 9 (PRDM9) gene. This gene encodes a protein that contains several zinc finger domains and is able to bind motifs that are overrepresented in recombination hotspots [8]. Single mutations in PRDM9 or its binding motif can be sufficient to alter the recombination activity [9–11]. This means that hotspots are determined by human-specific factors, which ultimately raises the question of whether studying the genomes of other primate species would be useful in understanding the role of recombination in shaping the pattern of genetic diversity in the human genome.
    The situation seems to be different for hotspots of nonallelic recombination, the major cause of genomic rearrangements such as duplications, deletions, and inversions. Recent studies of CNVs in various primate species have shown that CNV hotspots are often shared across species, even between human and macaque [12–15]. This suggests that nonallelic recombination hotspots have a longer lifespan than do hotspots of allelic recombination. This is related to the key mechanism of nonallelic recombination, that is, NAHR. Highly similar homologous sequences, or segmental duplications (SDs), serve as substrates for NAHR, which causes the duplication or deletion of the intervening region (or inversion in the case of inverted SDs) (Figure 1A). 
Although nonallelic recombination pathways other than NAHR also have a large role in generating CNVs [16,17], it is thought that NAHR hotspots remain active for a longer period of time and are largely responsible for generating recurrent rearrangements. For the sake of clarity, here we define NAHR hotspots as SD pairs that are initiating recurrent NAHR. Therefore, each new duplication creates a new potential hotspot even if they occur in neighboring regions that could be considered as the same fragile region, sometimes making a complicated nested structure of multiple duplications. We also assume that a long (e.g., >200 bp) stretch of perfect identity shared between the SD pair is crucial for the maintenance of the hotspot. NAHR can sometimes occur even when the perfect match is short, and the rate may also be influenced by other factors (e.g., distance between the SDs or recombinogenic sequence motifs) [3,18,19]. However, a long identical stretch is known to enhance greatly the efficiency of NAHR [18,20,21], which is predicted to be crucial for repeatedly generating rearrangements over a long period of time.

Opinion. 0168-9525/$ – see front matter © 2013 Elsevier Ltd. All rights reserved. Corresponding author: Innan, H. ( Keywords: gene conversion; non-allelic homologous recombination; rearrangement hotspot; segmental duplication; copy number variant. Trends in Genetics, October 2013, Vol. 29, No. 10 561

  • 9. Thus, whereas allelic recombination hotspots are largely determined by the PRDM9 motif and a small number of mutations are sufficient to cause turnovers, NAHR hotspots will potentially remain active as long as the SD contains a subregion with sufficient similarity and length. Indeed, CNV hotspots are enriched for SDs [12,14], and it has been suggested that the long-term evolution of hotspots is determined by the birth-and-death process of matching pairs of SDs [22].

An important question then is how long is the expected lifespan of an individual NAHR hotspot. We consider two evolutionary models that give different predictions regarding the lifespan of hotspots. The first is the turnover model (Figure 1B), which assumes that SDs accumulate mutations independently. According to this model, the divergence between the SDs increases in proportion to time and the SDs lose their ability to initiate NAHR as they become too divergent. Consequently, the hotspots are subject to a rapid turnover, and new SDs must constantly arise for the genome to maintain a certain number of hotspots. Thus, the turnover model predicts that hotspots would be shared only among closely related species and not between distantly related species, as has been previously suggested [22]. In the case of primates, the model predicts that it would be unlikely for hotspots to remain active for more than 25 million years, or since the divergence of human and macaque (Box 1). Therefore, the turnover model might not be sufficient to explain recent findings where several CNV hotspots are shared between human and macaque [13,15].

A model incorporating gene conversion better explains the evolution of CNV hotspots

An alternative, which we propose here, is the gene conversion model (Figure 1C).
This model predicts the long-term preservation of hotspots and is supported both theoretically and empirically. The model takes into account the effect of gene conversion, a recombinational mechanism that can retard the divergence between SDs. Ongoing gene conversion results in the SDs maintaining high similarity for a long period of time. There is increasing evidence for gene conversion between SDs in various species, including humans [23,24]. It is easy to imagine that gene conversion would provide an ideal substrate for NAHR, as has been previously suggested [25]. The gene conversion model predicts that a larger number of older SDs would be associated with the current hotspots compared with the turnover model (Box 1).

The potential role of gene conversion in preserving hotspots has been suggested by several case studies [25–28]. An extreme case is the polymorphic inversion on the human chromosome Xq28 region containing the filamin A (FLNA) and emerin (EMD) loci that is probably caused by NAHR between inverted duplicates. It was found that this pair of inverted duplicates is shared by various eutherian lineages and that these duplicates have recurrently caused inversions in independent lineages (at least ten times since the origin of eutherians) [27]. The sequence identity between the duplicates was found to be high in each species. Based on these observations, it was suggested that gene conversion has been homogenizing the duplicates, thus preserving the activity as a hotspot, for at least 100 million years. Another study [13], which identified several macaque CNVs, suggests that this model is applicable to some CNV hotspots in primates. Three CNV regions were identified that were shared between human and macaque where the flanking matching SD pairs in both species were clearly orthologous. In all three cases, the paralogous copies were more closely related to each other than to the orthologous copies. This indicates that gene conversion has been maintaining high similarity, thereby preserving the ability to initiate NAHR in both lineages for more than 25 million years [13].

[Figure 1 diagram; panel labels: active hotspot, divergence, no more NAHR, new hotspot, gene conversion, still active]
Figure 1. Diagram of non-allelic homologous recombination (NAHR) hotspots and two models of their evolution. (A) Illustration of NAHR between tandem segmental duplications (SDs; green arrows) that results in the duplication or deletion of the intervening region (the outcome would be an inversion if the SDs are in inverted orientation). Two models could explain the evolution of NAHR hotspots. (B) The turnover model assumes that the two SD copies diverge in proportion to time and, thus, quickly become unable to initiate NAHR. Therefore, new hotspots must constantly arise for a certain number of hotspots to remain in the genome. (C) The gene conversion model considers the effect of paralogous gene conversion, which maintains the similarity between the two copies. Therefore, the SD is able to initiate NAHR for a much longer period of time.

  • 10. Based on further analyses on primate CNV hotspots, we show here that most SD-associated CNV hotspots are more consistent with the gene conversion model than with the turnover model. We examined a previously published data set [15], which contains CNVs identified by previous large-scale population surveys [29,30], and identified 79 cases where both ends of the CNV regions (i.e., breakpoints) lie within matching SD pairs reported in the segmental duplications database [31,32]. We assume that these CNVs were likely formed by NAHR between the flanking SDs. We first looked at the average nucleotide divergence over the entire region. The divergence was higher than the average human–chimpanzee divergence (approximately 1.3%) for almost all SDs and higher than the average human–macaque divergence (approximately 6%) for approximately one-third of the SDs (Figure 2A). The actual ages of the SDs could be even older because gene conversion retards their divergence. Indeed, if we look at the spatial distribution of the divergence, most of the 79 SDs show a nonuniform distribution and contain identical stretches that are significantly longer than expected (70/79 at P <0.05; 43/70 at P <0.0001). Figure 2 clearly shows that the longest identical stretches of the observed data are much longer than those of the null data with the same level of divergence.

Gene conversion is the most likely mechanism responsible for creating these unexpectedly long stretches of perfect identity within the SDs (see Box 2 for a detailed discussion on the divergence process of SDs undergoing gene conversion). The action of gene conversion between the matching SD pairs can be better demonstrated by a comparative genomics approach where SD sequences of multiple species are compared [23,33]. Consider an SD pair in human, Xh and Yh, and their orthologs in chimpanzee, Xc and Yc. Gene conversion will create sites where Xh and
Gene conversion will create sites where Xh and Box 1. The lifespan of NAHR hotspots under the turnover model and the gene conversion model How long are NAHR hotspots expected to remain in the genome? The gene conversion model predicts that hotspots will remain active for a longer period of time compared with the null turnover model. We illustrate this using a simple computation. The time period is measured by the probability that the SD pair retains an identical stretch of !200 bp. Under the turnover model, we consider three different lengths of the SD (1, 10, and 100 kb). Although the requirement of !200-bp perfect identity is a simplified assumption, this computation provides an approximation of how long a hotspot should remain active and how gene conversion affects its longevity. We note that using different length requirements and changing the values of the parameters shown in Figure I do not affect the overall pattern. As shown in Figure I (red, green, and blue lines for 1, 10, and 100 kb, respectively), the probability quickly drops, especially when the length of the SD is short. A hotspot as old as the human–chimpanzee divergence is still likely to be active (unless short), whereas a hotspot as old as the human–macaque divergence (approximately 25-million years old) is highly unlikely to be active (even for an SD as long as 100 kb) (Figure I). Thus, the lifespan of a hotspot in primates is likely to be between 5 and 25 million years under the turnover model with no gene conversion. The situation dramatically changes under the gene conversion model. We added the effect of gene conversion using three different gene conversion rates for the case of a 10-kb SD (shown by green- dashed lines in Figure I). Including the effect of gene conversion increases the probability that NAHR will still occur after a given amount of time, especially when the rate of gene conversion is high. 
The rate of gene conversion should be highly variable because it is determined by several factors [60]. Thus, gene conversion can substantially increase the longevity of an NAHR hotspot. Time (million years) Probability 0 10 20 30 Chimp Orangutan Macaque 1kb c = 0 10kb c = 0 100kb c = 0 10kb c = 5 × 10−8 10kb c = 3 × 10−8 10kb c = 1 × 10−8 TRENDS in Genetics Figure I. The probability that a given segmental duplication (SD) pair of 1 kb, 10 kb, and 100 kb (red, green, and blue lines, respectively) will retain an identical stretch of !200 bp based on 10 000 simulation runs. The expected probability was calculated by a simulation following the model in [61]. The model assumes random accumulation of point mutations at a rate of 10À9 /site/generation and that gene conversion occurs at a given rate c per site (see [61] for details). The red, green, and blue solid lines represent simulation results of SDs of 1 kb, 10 kb, and 100 kb when c = 0, and the green-dashed lines represent results of a 10-kb SD when c = {1,3,5} Â 10À8 with an average tract length of 1 kb (1/Q = 0.1 in [61]) representing low, intermediate, and high gene conversion rates. The vertical gray lines approximately correspond to the divergence between human and chimpanzee, orangutan, and macaque. Opinion Trends in Genetics October 2013, Vol. 29, No. 10 563
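As a rough illustration of the turnover-model computation in Box 1, the sketch below estimates by simulation the probability that an SD pair still contains an identical stretch of at least 200 bp after a given time. It is a minimal sketch under stated assumptions and not the model of [61]: it covers only the case c = 0 (no gene conversion), treats each site as differing between the two copies independently with probability 1 − exp(−2μt), and takes μ as a per-site, per-year rate (the box quotes the rate per generation, so this time scaling is an assumption made here). The function names and default values are hypothetical.

```python
import math
import random


def longest_identical_stretch(length, p_diff, rng):
    """Longest run of identical sites in an SD pair of `length` sites,
    when each site differs independently with probability `p_diff`."""
    if p_diff <= 0.0:
        return length          # no differences at all
    if p_diff >= 1.0:
        return 0               # every site differs
    log_q = math.log1p(-p_diff)  # log of the per-site probability of identity
    longest = 0
    pos = 0
    while pos < length:
        # Number of identical sites before the next differing site
        # (geometric draw by inverse transform; u lies in (0, 1]).
        u = 1.0 - rng.random()
        run = min(int(math.log(u) / log_q), length - pos)
        longest = max(longest, run)
        pos += run + 1         # skip the run and the differing site
    return longest


def prob_hotspot_active(length, t_years, mu=1e-9, min_stretch=200,
                        runs=2000, seed=0):
    """Fraction of simulated SD pairs that keep an identical stretch of
    at least `min_stretch` bp after `t_years` (turnover model, c = 0).

    Assumption: `mu` is a per-site, per-year rate and both copies
    accumulate mutations, so the expected divergence is about 2 * mu * t.
    """
    rng = random.Random(seed)
    p_diff = 1.0 - math.exp(-2.0 * mu * t_years)
    hits = sum(
        longest_identical_stretch(length, p_diff, rng) >= min_stretch
        for _ in range(runs)
    )
    return hits / runs
```

For example, `prob_hotspot_active(10_000, 25e6)` gives the estimate for a 10-kb SD at roughly the human–macaque divergence time. Adding gene conversion would require the additional machinery described in [61], which this sketch does not attempt.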
  • 11. Yh share the same nucleotide and Xc and Yc share another nucleotide. Although strong purifying selection can also create regions of low divergence, significant clustering of such sites cannot be explained by selection and is considered a strong signature of gene conversion [33,34]. Despite the genomic regions containing SDs often being poorly sequenced and/or assembled in nonhuman species, we were able to identify both copies of the SDs in the genome of another primate species for 35 out of the 79 cases. In almost all of those cases (34/35), we found regions that showed strong signatures of gene conversion. These results suggest that gene conversion and the retention of regions of perfect identity are common features of SD pairs in CNV regions, which directly results in the long-term preservation of the CNV hotspots detected by population surveys, that is, common CNVs.

The gene conversion model also applies to regions associated with genomic disorders

Does this typical pattern also apply to CNVs that cause genomic disorders, whose frequencies are often too low to be detected by a population survey? According to the literature, the answer seems to be yes. Dozens of ‘known’ disorders are often caused by NAHR between SDs (also referred to as low copy repeats) [17,35–37]. For 14 of them, we were able to identify unambiguously SDs containing NAHR breakpoints in the current human genome assembly (Table 1). These included two well-studied cases where both copies of the matching SD pair have been identified in other primate genomes and the action of gene conversion has been documented. One is the deletion of the azoospermia factor a (AZFa) locus on chromosome Y that is associated with male infertility (Table 1, #1). This locus is flanked by direct repeats and both copies are present in the orthologous regions of chimpanzee and gorilla [25]. The rearrangement breakpoints map to two specific regions within the duplicates.
One region shows 1285 bp of perfect identity and the other contains one single mismatch over 1609 bp, despite some other regions showing <90% identity. Strong signatures of gene conversion were reported in these two breakpoint regions [25,38]. The other example is the coagulation factor VIII (F8) locus, which contains two pairs of inverted repeats (Table 1, #2). Inversion between either pair causes hemophilia A. Despite originating before the divergence of human and African green monkey (and, thus, macaque), both pairs exhibit >99% identity [26]. It is interesting to note that hemophilia A caused by the inversion of the same region due to NAHR has also been reported in dog, although it is not clear whether the inversion is mediated by repeats ancestral to human and dog [39].

In addition to these two cases, we found five cases in which the orthologous copies of the matching SD pairs could be identified in at least one of the chimpanzee, orangutan, or macaque genomes (Table 1, #3–7). Each SD pair exhibited evidence of gene conversion. One interesting case is the SD pair associated with Incontinentia Pigmenti (Table 1, #7), a severe X-linked disorder that is lethal in males. The main cause of this disease is a genomic deletion that eliminates exons 4–10 of the inhibitor of kappa light polypeptide gene enhancer in B-cells, kinase gamma (NEMO/IKBKG) gene, which is located on Xq28. This deletion is caused by NAHR between two identical MER67B repeated sequences of 878 bp, one

[Figure 2 plots: (A) observed distribution, observed longest identical stretch (bp, 0 to 2500) against nucleotide divergence (0.00 to 0.15); (B) null distribution, expected longest identical stretch (bp, 0 to 2500) against nucleotide divergence (0.00 to 0.15); key: P ≥ 0.05, P < 0.05, P < 0.01, P < 0.0001; vertical markers for chimp, orangutan, and macaque]
Figure 2. The probability for observing the longest identical stretch present in the segmental duplications (SDs) flanking the copy number variants (CNVs). (A) The observed longest identical stretch (bp) within each SD pair flanking a CNV region is plotted against the divergence level. The significance of the observed length for each SD was evaluated by creating 10 000 random patterns of divergence where the diverged nucleotide positions are distributed randomly across the entire SD, and are shown as filled squares, triangles, and circles when significant (P <0.05, <0.01, and <0.0001, respectively), and by open circles when not significant. (B) Typical distribution of the longest identical stretch in the randomized data used for evaluating the significance in (A). Only some of the data are shown to demonstrate the point. The vertical gray lines show the time corresponding to the average genome-wide nucleotide divergence between human and chimpanzee, orangutan, and macaque [62].
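The randomization test described in the Figure 2 legend can be sketched as follows. This is a minimal sketch and not the authors' code: the function names are hypothetical, diverged positions are drawn uniformly without replacement across the SD, and the add-one correction in the P value is a choice made here for illustration.

```python
import random


def longest_identical_stretch(length, diff_positions):
    """Longest run of identical sites in an alignment of `length` sites,
    given the 0-based positions at which the two SD copies differ."""
    longest = 0
    prev = -1
    for pos in sorted(diff_positions):
        longest = max(longest, pos - prev - 1)  # identical sites between differences
        prev = pos
    return max(longest, length - prev - 1)      # run after the last difference


def stretch_p_value(length, n_diff, observed_longest, n_perm=10_000, seed=0):
    """Estimate how often a random placement of `n_diff` differences over
    `length` sites yields a longest identical stretch at least as long as
    the observed one."""
    rng = random.Random(seed)
    at_least = 0
    for _ in range(n_perm):
        positions = rng.sample(range(length), n_diff)
        if longest_identical_stretch(length, positions) >= observed_longest:
            at_least += 1
    # Add-one correction (an assumption of this sketch) avoids reporting P = 0.
    return (at_least + 1) / (n_perm + 1)
```

A call such as `stretch_p_value(10_000, 600, 800)` would ask how surprising an 800-bp identical stretch is in a 10-kb SD pair with 600 differing sites (about 6% divergence); these input values are illustrative only and are not taken from the data set.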
  • 12. located in intron 3 and the other located downstream of the last exon of NEMO [40,41]. Both copies were present in the orthologous regions of the genomes of chimpanzee, orangutan, and macaque. The two copies show >99% similarity in all species and exhibit strong signatures of gene conversion. This indicates that gene conversion has maintained the genomic configuration that predisposes carriers to severe disorders (at least in humans) for more than 25 million years. This Xq28 region contains several other extreme examples of extensive homogenization of ancient duplicates within approximately 1 Mb. The F8 locus associated with hemophilia A [26] and the inverted repeats at the FLNA–EMD locus [27] (both discussed above), as well as the red- and green-opsin gene duplicates undergoing frequent gene conversion [42], are all in this region. Thus, the rate of gene conversion could be elevated in this region.

Several other genomic disorders, such as Williams–Beuren syndrome, Smith–Magenis syndrome, neurofibromatosis type 1 (NF1), and DiGeorge/velocardiofacial syndrome (Table 1, #8–11), are caused by NAHR between SDs that are present in multiple copies in other primate genomes [43–48]. These reports are based on fluorescence in situ hybridization (FISH), and the ages of the exact copies involved in NAHR in humans are not clear. Nevertheless, strong signatures of gene conversion around the breakpoint regions of the SDs have been reported for all four cases [49–52]. For instance, many of the breakpoints of NAHR associated with NF1 map to a region within the 51-kb SD that shows elevated sequence identity, probably due to gene conversion, including a 700-bp identical stretch [50]. Also, several polymorphic sites shared by both SD copies, which are strong signatures of gene conversion, were detected around the breakpoint region of the SDs

Box 2.
Divergence pattern of a segmental duplication undergoing gene conversion

How do SDs evolve when gene conversion frequently occurs? Following a duplication event, the divergence will remain at a low equilibrium as long as gene conversion is ongoing (see [61] for details). The accumulation of mutations or large indels will result in the termination of gene conversion and the increase of divergence in that region, whereas concerted evolution will continue in other regions. Regions undergoing gene conversion within the SD will decrease as time proceeds (Figure I). Future work will be needed to reveal the process that determines which region within the SD retains high similarity.

One possibility is that any region within the SD can potentially retain high similarity because indels and point mutations accumulate randomly across the SD. Therefore, the ongoing or termination of gene conversion will occur randomly across the SD. Under this scenario, if we consider an SD pair that is shared among species, we would also expect that gene conversion would be ongoing in different regions of the SDs in each species (Figure IA). Note that when multiple species are compared, the homogenized regions will not be distributed completely randomly because of their shared evolutionary history.

We can also imagine an alternative scenario where specific regions undergo homogenization for a long period of time. If the same specific region of the two copies is under selective constraint, the divergence will remain low within that region, which will make it more likely for gene conversion to occur. Also, gene conversion might be favored in a specific region if the retention of high similarity of that region has some functional benefit. The rate of gene conversion could also be elevated locally due to, for example, the DNA structure or the presence of certain motifs. Under this nonrandom scenario, gene conversion might continue to occur at the same specific region in different species even long after their divergence (Figure IB).

[Figure I diagram: panels (A) and (B), each showing SD pairs in human, chimp, and orangutan]
Figure I. Illustration of how duplicates diverge in the presence of gene conversion. The green bars represent regions within the segmental duplications (SDs) that are undergoing gene conversion. Regions undergoing gene conversion gradually decrease due to large indels or the accumulation of mutations. (A) Scenario where the termination of gene conversion occurs randomly throughout the SD. Regions undergoing gene conversion in each species differ, although they are not entirely independent due to their shared history. (B) Scenario where selection favors ongoing gene conversion in specific regions (blue bar) due to some functional constraint. The continuation and termination of gene conversion is not random, and the same region likely retains high similarity in each species.
  • 13. associated with DiGeorge/velocardiofacial syndrome [51]. Thus, although we could not confirm the presence of both SD copies in other primate genomes for seven cases, including these four (Table 1, #8–14), possibly because these regions are repetitive and poorly assembled in other species, it is likely that gene conversion is involved in preserving the hotspots. In summary, the examples discussed here clearly show that the gene conversion model applies to SDs associated with genomic disorders, even though the rearrangements are pathological.

Concluding remarks

Here, we have shown that most SD-associated CNV hotspots have been preserved for a long period of time, much longer than hotspots of allelic recombination. Gene conversion appears to have a key role in the preservation by maintaining long stretches (e.g., several hundred bases) of perfect identity within SD pairs that can serve as substrates for NAHR. This has implications in disease, because the preservation often increases the risk of pathological rearrangements. The preservation should be determined by the balance between factors that cause the preservation (e.g., rate of gene conversion or selection favoring the preservation) and the reduction of fitness caused by the preservation (e.g., rate of NAHR or severity of the resulting disorder). Although the maintenance of stretches of high similarity by gene conversion might be promoted by selection due to a functional constraint in some cases, it is unlikely that all the homogenized regions are functional. Rather, given that most of the breakpoints in Table 1 map to repeat regions, functional constraint may not be the major contributor to the preservation. This is consistent with the observation that regions within the SDs being homogenized are different in each primate species.
Thus, it seems most likely that CNV hotspots, in general, are preserved as a byproduct of gene conversion that occurs at a high enough rate to override their negative consequences. Future work involving comparative analysis of sequences from multiple species and careful modeling of the divergence process of the SDs considering the effect of gene conversion and selection should be valuable for better understanding the different factors, including selection, that are responsible for the preservation of CNV hotspots (Box 2).

The preservation of rearrangement hotspots might have had a key role in the adaptive evolution of humans. Recent studies have identified several regions within the human genome that comprise mosaic structures of duplication subunits (duplicons) as a result of recurrent duplications-within-duplications. In particular, several ‘core duplicons’ that have duplicated several times throughout evolution and are shared across multiple duplication blocks are known to contain primate-specific genes undergoing positive selection [37,53,54]. Another recent study showed that CNV regions shared among human, chimpanzee, and macaque (CNV hotspots) were significantly likely to overlap with genic regions [15]. This is in stark contrast with human-specific CNV regions, which are generally depleted of genes. Furthermore, many of the genes that overlap with CNV hotspots are evolving under positive selection, and some are evolving under balancing selection in humans [15]. It has been suggested that the genomic plasticity in these hotspot regions has provided the mutational flexibility for the residing genes to adapt to changing selective pressures [15,37,55]. If so, we further suggest that gene conversion has had an important role in maintaining

Table 1. The presence of duplicates flanking human genomic disorder regions in other species and the occurrence of gene conversion

No. | Locus | Candidate genesᵃ | Associated phenotypes | Evolutionary originᵇ | Gene conversionᶜ | Refs
#1 | Yq11 | AZFa | Male infertility | Gorilla | + | [25,38]
#2 | Xq28 | F8 | Hemophilia Aᵈ | African green monkey | + | [26]
#3 | 5q35 | NSD1 | Sotos syndrome | Orangutan (macaque) | ++ | [63,64]
#4 | 15q24 | MAN2C1, CYP11A1, STRA6 | Growth retardation and microcephaly | Orangutan | ++ | [65]
#5 | 16p11 | MAPK3, MAZ, DOC2A, SEZ6L2, HIRIP3 | Autism | Chimp | ++ | [66,67]
#6 | 17p11 | PMP22 | Charcot-Marie-Tooth type 1A | Chimp | ++ | [68–70]
#7 | Xq28 | NEMO | Incontinentia pigmenti | Macaque | ++ | [40,41]
#8 | 7q11 | GTF2I | Williams–Beuren syndrome | (macaque, gibbon) | + | [43,44,52]
#9 | 17p11 | RAI1 | Smith–Magenis syndrome | (macaque) | + | [45,49]
#10 | 17q11 | NF1 | NF1 | (gorilla) | + | [46,50]
#11 | 22q12 | BCR, USP18, GGT | DiGeorge/velocardiofacial syndrome | (macaque) | + | [47,48,51,71]
#12 | 2q13 | NPHP1 | Familial juvenile nephronophthisis | ND | – | [72]
#13 | 10q22-23 | NRG3, GRID1, BMPR1, SNCG, GLUD1 | Cognitive and behavioral abnormalities | ND | – | [73]
#14 | 17q23 | TBX2, TBX4 | Developmental delay and heart defects | ND | – | [74]

ᵃ Abbreviations: BCR, breakpoint cluster region; BMPR1, bone morphogenetic protein receptor 1; CYP11A1, cytochrome P450, family 11, subfamily A, polypeptide 1; DOC2A, double C2-like domains, alpha; GGT, gamma-glutamyl transferase; GLUD1, glutamate dehydrogenase 1; GRID1, glutamate receptor, ionotropic, delta 1; GTF2I, general transcription factor II i; HIRIP3, HIRA interacting protein 3; MAN2C1, mannosidase, alpha, class 2C, member 1; MAPK3, mitogen-activated protein kinase 3; MAZ, MYC-associated zinc finger protein; NPHP1, nephronophthisis 1; NRG3, neuregulin 3; NSD1, nuclear receptor binding SET domain protein 1; PMP22, peripheral myelin protein 22; RAI1, retinoic acid induced 1; SEZ6L2, seizure related 6 homolog (mouse)-like 2; SNCG, synuclein, gamma; STRA6, stimulated by retinoic acid 6; TBX, T-box; USP18, ubiquitin specific peptidase 18.
ᵇ The most distant species from human in which the duplicates were confirmed to be present based on genomic sequences are listed. Those not based on genomic sequences (e.g. FISH signals) are shown in brackets. Those identified in this study are in bold. ‘ND’ denotes those where the presence of both copies could not be confirmed in the genome of chimpanzee, orangutan, or macaque.
ᶜ + indicates duplicates where gene conversion has likely occurred; ++ indicates those that are based on this study.
ᵈ Caused by inversion due to NAHR between inverted duplicates. The remaining disorders are all caused by deletions due to NAHR between duplicates in direct orientation.
  • 14. genomic plasticity, which most likely contributed to the adaptive evolution of the human lineage.

Almost all the duplicates we examined here showed evidence of gene conversion. This might seem at odds with previous studies that detected gene conversion in only approximately 10–15% of human duplicated gene pairs [56,57]. However, these studies did not focus on duplicates of low divergence (e.g., <5% divergence) that are either young or undergoing extensive gene conversion. We predict the fraction of recently duplicated sequences containing regions still undergoing gene conversion to be substantially higher. Indeed, a study analyzing 30 multiple alignments of human duplicated sequences of <4% nucleotide divergence found evidence of sequence exchange due to gene conversion or unequal crossing over in all 30 alignments [58]. A recent population survey of CNVs in multicopy gene families also reported several cases of gene conversion [59]. Thus, there could be a large number of nearly identical regions undergoing gene conversion within the genome, especially in SDs that are located close to each other. These regions could be acting as rearrangement hotspots that are yet to be identified.

The accumulating genomic data from human populations and other primate species should enable us to identify such regions undergoing gene conversion. This should be a powerful approach to detect potential hotspots of genetic disorders that are difficult to detect due to their low frequencies in the human population. In this respect, we note that many hotspot regions are likely to be missed by low-coverage genomes or resequencing studies because they are often highly repetitive. Thus, more high-quality reference genomes from nonhuman primates and also multiple human individuals in the future should be valuable in understanding perhaps the most important genomic regions in terms of human disease and human evolution.

Acknowledgments

We thank K.
Teshima for technical help. This work is supported by a grant from Japan Society for the Promotion of Science (JSPS) to H.I. J.A.F. is a JSPS postdoctoral fellow.

References

1 Coop, G. and Przeworski, M. (2007) An evolutionary view of human recombination. Nat. Rev. Genet. 8, 23–34
2 Webster, M.T. and Hurst, L.D. (2012) Direct and indirect consequences of meiotic recombination: implications for genome evolution. Trends Genet. 28, 101–109
3 Myers, S. et al. (2005) A fine-scale map of recombination rates and hotspots across the human genome. Science 310, 321–324
4 Ptak, S.E. et al. (2005) Fine-scale recombination patterns differ between chimpanzees and humans. Nat. Genet. 37, 429–434
5 Myers, S. et al. (2008) A common sequence motif associated with recombination hot spots and genome instability in humans. Nat. Genet. 40, 1124–1129
6 Winckler, W. et al. (2005) Comparison of fine-scale recombination rates in humans and chimpanzees. Science 308, 107–111
7 Auton, A. et al. (2012) A fine-scale chimpanzee genetic map from population sequencing. Science 336, 193–198
8 Ponting, C.P. (2011) What are the genomic drivers of the rapid evolution of PRDM9? Trends Genet. 27, 165–171
9 Baudat, F. et al. (2010) PRDM9 is a major determinant of meiotic recombination hotspots in humans and mice. Science 327, 836–840
10 Myers, S. et al. (2010) Drive against hotspot motifs in primates implicates the PRDM9 gene in meiotic recombination. Science 327, 876–879
11 Parvanov, E.D. et al. (2010) Prdm9 controls activation of mammalian recombination hotspots. Science 327, 835
12 Perry, G.H. et al. (2008) Copy number variation and evolution in humans and chimpanzees. Genome Res. 18, 1698–1710
13 Lee, A.S. et al. (2008) Analysis of copy number variation in the rhesus macaque genome identifies candidate loci for evolutionary and human disease studies. Hum. Mol. Genet. 17, 1127–1136
14 Gazave, E. et al. (2011) Copy number variation analysis in the great apes reveals species-specific patterns of structural variation. Genome Res. 21, 1626–1639
15 Gokcumen, O. et al. (2011) Refinement of primate copy number variation hotspots identifies candidate genomic regions evolving under positive selection. Genome Biol. 12, R52
16 Conrad, D.F. et al. (2010) Mutation spectrum revealed by breakpoint sequencing of human germline CNVs. Nat. Genet. 42, 385–391
17 Liu, P. et al. (2012) Mechanisms for recurrent and complex human genomic rearrangements. Curr. Opin. Genet. Dev. 22, 211–220
18 Waldman, A.S. (2008) Ensuring the fidelity of recombination in mammalian chromosomes. Bioessays 30, 1163–1171
19 Liu, P. et al. (2011) Frequency of nonallelic homologous recombination is correlated with length of homology: evidence that ectopic synapsis precedes ectopic crossing-over. Am. J. Hum. Genet. 89, 580–588
20 Jinks-Robertson, S. et al. (1993) Substrate length requirements for efficient mitotic recombination in Saccharomyces cerevisiae. Mol. Cell. Biol. 13, 3937–3950
21 Reiter, L.T. et al. (1998) Human meiotic recombination products revealed by sequencing a hotspot for homologous strand exchange in multiple HNPP deletion patients. Am. J. Hum. Genet. 62, 1023–1033
22 Alekseyev, M.A. and Pevzner, P.A. (2010) Comparative genomics reveals birth and death of fragile regions in mammalian evolution. Genome Biol. 11, R117
23 Gao, L-Z. and Innan, H. (2004) Very low gene duplication rate in the yeast genome. Science 306, 1367–1370
24 Chen, J-M. et al. (2011) Gene conversion in human genetic disease. Genes 1, 550–663
25 Hurles, M.E. et al. (2004) Origins of chromosomal rearrangement hotspots in the human genome: evidence from the AZFa deletion hotspots. Genome Biol. 5, R55
26 Bagnall, R.D. et al. (2005) Gene conversion and evolution of Xq28 duplicons involved in recurring inversions causing severe hemophilia A. Genome Res. 15, 214–223
27 Cáceres, M. et al. (2007) A recurrent inversion on the eutherian X chromosome. Proc. Natl. Acad. Sci. U.S.A. 104, 18571–18576
28 Zody, M.C. et al. (2008) Evolutionary toggling of the MAPT 17q21.31 inversion region. Nat. Genet. 40, 1076–1083
29 Conrad, D.F. et al. (2010) Origins and functional impact of copy number variation in the human genome. Nature 464, 704–712
30 Park, H. et al. (2010) Discovery of common Asian copy number variants using integrated high-resolution array CGH and massively parallel DNA sequencing. Nat. Genet. 42, 400–405
31 Bailey, J.A. et al. (2001) Segmental duplications: organization and impact within the current human genome project assembly. Genome Res. 11, 1005–1017
32 She, X. et al. (2004) Shotgun sequence assembly and recent segmental duplications within the human genome. Nature 431, 927–930
33 Osada, N. and Innan, H. (2008) Duplication and gene conversion in the Drosophila melanogaster genome. PLoS Genet. 4, e1000305
34 Fawcett, J.A. and Innan, H. (2011) Neutral and non-neutral evolution of duplicated genes with gene conversion. Genes 2, 191–209
35 Stankiewicz, P. and Lupski, J.R. (2002) Molecular-evolutionary mechanisms for genomic disorders. Curr. Opin. Genet. Dev. 12, 312–319
36 Mefford, H.C. and Eichler, E.E. (2009) Duplication hotspots, rare genomic disorders, and common disease. Curr. Opin. Genet. Dev. 19, 196–204
37 Marques-Bonet, T. et al. (2009) The origins and impact of primate segmental duplications. Trends Genet. 25, 443–454
38 Bosch, E. et al. (2004) Dynamics of a human interparalog gene conversion hotspot. Genome Res. 14, 835–844
39 Lozier, J.N. et al. (2002) The Chapel Hill hemophilia A dog colony exhibits a factor VIII gene inversion. Proc. Natl. Acad. Sci. U.S.A. 99, 12991–12996
  • 15. 40 Smahi, A. et al. (2000) Genomic rearrangement in NEMO impairs NF- kB activation and is a cause of incontinentia pigmenti. Nature 405, 466–472 41 Aradhya, S. et al. (2001) A recurrent deletion in the ubiquitously expressed NEMO (IKK-U) gene accounts for the vast majority of incontinentia pigmenti mutations. Hum. Mol. Genet. 10, 2171–2179 42 Zhao, Z. et al. (1998) Frequent gene conversion between human red and green opsin genes. J. Mol. Evol. 46, 494–496 43 DeSilva, U. et al. (1999) Comparative mapping of the region of human chromosome 7 deleted in Williams syndrome. Genome Res. 9, 428–436 44 Antonell, A. et al. (2005) Evolutionary mechanisms shaping the genomic structure of the Williams-Beuren syndrome chromosomal region at human 7q11.23. Genome Res. 15, 1179–1188 45 Park, S-S. et al. (2002) Structure and evolution of the Smith-Magenis syndrome repeat gene clusters, SMS-REPs. Genome Res. 12, 729–738 46 De Raedt, T. et al. (2004) Genomic organization and evolution of the NF1 microdeletion region. Genomics 84, 346–360 47 Shaikh, T.H. et al. (2000) Chromosome 22-specific low copy repeats and the 22q11.2 deletion syndrome: genomic organization and deletion endpoint analysis. Hum. Mol. Genet. 9, 489–501 48 Bailey, J.A. et al. (2002) Human-specific duplication and mosaic transcripts: the recent paralogous structure of chromosome 22. Am. J. Hum. Genet. 70, 83–100 49 Bi, W. et al. (2003) Reciprocal crossovers and a positional preference for strand exchange in recombination events resulting in deletion or duplication of chromosome 17p11.2. Am. J. Hum. Genet. 73, 1302–1315 50 Forbes, S.H. et al. (2004) Genomic context of paralogous recombination hotspots mediating recurrent NF1 region microdeletion. Genes Chromosomes Cancer 41, 12–25 51 Pavlicek, A. et al. (2005) Traffic of genetic information between segmental duplications flanking the typical 22q11.2 deletion in velo- cardio-facial syndrome/DiGeorge syndrome. Genome Res. 15, 1487–1495 52 Baye´s, M. et al. 
(2003) Mutational mechanisms of Williams-Beuren syndrome deletions. Am. J. Hum. Genet. 73, 131–151 53 Johnson, M.E. et al. (2006) Recurrent duplication-driven transposition of DNA during hominoid evolution. Proc. Natl. Acad. Sci. U.S.A. 103, 17626–17631 54 Jiang, Z. et al. (2007) Ancestral reconstruction of segmental duplications reveals punctuated cores of human genome evolution. Nat. Genet. 39, 1361–1368 55 Iskow, R.C. et al. (2012) Exploring the role of copy number variants in human adaptation. Trends Genet. 28, 245–257 56 McGrath, C.L. et al. (2009) Minimal effect of ectopic gene conversion among recent duplicates in four mammalian genomes. Genetics 182, 615–622 57 Ezawa, K. et al. (2010) Evolutionary pattern of gene homogenization between primate-specific paralogs after human and macaque speciation using the 4-2-4 method. Mol. Biol. Evol. 27, 2152–2171 58 Jackson, M.S. et al. (2005) Evidence for widespread reticulate evolution within human duplicons. Am. J. Hum. Genet. 77, 824–840 59 Sudmant, P.H. et al. (2010) Diversity of human copy number variation and multicopy genes. Science 330, 641–646 60 Mansai, S.P. et al. (2011) The rate and tract length of gene conversion. Genes 2, 313–331 61 Teshima, K.M. and Innan, H. (2004) The effect of gene conversion on the divergence between duplicated genes. Genetics 166, 1553–1560 62 Scally, A. et al. (2012) Insights into hominid evolution from the gorilla genome sequence. Nature 483, 169–175 63 Visser, R. et al. (2005) Identification of a 3.0-kb major recombination hotspot in patients with Sotos syndrome who carry a common 1.9-Mb microdeletion. Am. J. Hum. Genet. 76, 52–67 64 Kurotaki, N. et al. (2005) Sotos syndrome common deletion is mediated by directly oriented subunits within inverted Sos-REP low-copy repeats. Hum. Mol. Genet. 14, 535–542 65 Sharp, A.J. et al. (2007) Characterization of a recurrent 15q24 microdeletion syndrome. Hum. Mol. Genet. 16, 567–572 66 Kumar, R.A. et al. 
(2008) Recurrent 16p11.2 microdeletions in autism. Hum. Mol. Genet. 17, 628–638 67 Weiss, L.A. et al. (2008) Association between microdeletion and microduplication at 16p11.2 and autism. N. Engl. J. Med. 358, 667–675 68 Kiyosawa, H. and Chance, P.F. (1996) Primate origin of the CMT1A- REP repeat and analysis of a putative transposon-associated recombinational hotspot. Hum. Mol. Genet. 5, 745–753 69 Hurles, M.E. (2001) Gene conversion homogenizes the CMT1A paralogous repeats. BMC Genomics 2, 11 70 Lindsay, S.J. et al. (2006) A chromosomal rearrangement hotspot can be identified from population genetic variation and is coincident with a hotspot for allelic recombination. Am. J. Hum. Genet. 79, 890–902 71 Shaikh, T.H. et al. (2007) Low copy repeats mediate distal chromosome 22q11.2 deletions: sequence analysis predicts breakpoint mechanisms. Genome Res. 17, 482–491 72 Saunier, S. et al. (2000) Characterization of the NPHP1 locus: mutational mechanism involved in deletions in familial juvenile nephronophthisis. Am. J. Hum. Genet. 66, 778–789 73 Balciuniene, J. et al. (2007) Recurrent 10q22-q23 deletions: a genomic disorder on 10q associated with cognitive and behavioral abnormalities. Am. J. Hum. Genet. 80, 938–947 74 Ballif, B.C. et al. (2010) Identification of a recurrent microdeletion at 17q23.1q23.2 flanked by segmental duplications associated with heart defects and limb abnormalities. Am. J. Hum. Genet. 86, 454–461 Opinion Trends in Genetics October 2013, Vol. 29, No. 10 568
Human housekeeping genes, revisited

Eli Eisenberg1 and Erez Y. Levanon2

1 Raymond and Beverly Sackler School of Physics and Astronomy, Tel-Aviv University, Tel Aviv 69978, Israel
2 Mina and Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat Gan 52900, Israel

Housekeeping genes are involved in basic cell maintenance and, therefore, are expected to maintain constant expression levels in all cells and conditions. Identification of these genes facilitates exposure of the underlying cellular infrastructure and increases understanding of various structural genomic features. In addition, housekeeping genes are instrumental for calibration in many biotechnological applications and genomic studies. Advances in our ability to measure RNA expression have resulted in a gradual increase in the number of identified housekeeping genes. Here, we describe housekeeping gene detection in the era of massive parallel sequencing and RNA-seq. We emphasize the importance of expression at a constant level and provide a list of 3804 human genes that are expressed uniformly across a panel of tissues. Several exceptionally uniform genes are singled out for future experimental use, such as RT-PCR control genes. Finally, we discuss both the ways in which current technology can overcome some of the obstacles encountered in the past and several as yet unmet challenges.

The concept of housekeeping genes

Housekeeping genes are genes that are required for the maintenance of basal cellular functions that are essential for the existence of a cell, regardless of its specific role in the tissue or organism. Thus, they are expected to be expressed in all cells of an organism under normal conditions, irrespective of tissue type, developmental stage, cell cycle state, or external signal. From a fundamental point of view, full characterization of the minimal set of genes required to sustain life is of special interest [1,2].
In addition, housekeeping genes are widely used as internal controls for experimental as well as computational studies [3–7]. Furthermore, many studies have highlighted unique genomic and evolutionary features of this special group of genes. For example, housekeeping genes were shown to have shorter introns and exons [8–11], a different repetitive sequence environment [enriched in short interspersed elements (SINEs) and depleted in long interspersed elements (LINEs)] [12,13], more simple sequence repeats in the 5′ untranslated region (UTR) [14], lower conservation of the promoter sequence [15], and lower potential for nucleosome formation in the 5′ region of these genes [16]. Protein products of housekeeping genes are enriched in some domain families [17]. These studies shed light on general aspects of gene structure and evolution.

Early detection schemes for housekeeping genes

The notion of housekeeping genes has been in use in the literature for nearly 40 years. In particular, several mammalian genes have been used widely as internal controls in experimental expression studies, such as glyceraldehyde-3-phosphate dehydrogenase (GAPDH), tubulins, cyclophilin, albumin, actins, 18S rRNA or 28S rRNA. Yet, only at the turn of the 21st century, with the advancement of transcriptome profiling technology, did it become possible to identify, systematically, a set of housekeeping genes. These first attempts used large-scale expression data [18–20] or, more often, microarray profiling to look at the expression levels of many genes across a panel of tissue samples. Typically, they resulted in lists of hundreds to thousands of genes [8,19–25], many more than the dozen or so commonly used control genes.

Generally, the many lists produced show a considerable level of consistency.
Typically, the intersection of any two of them yields approximately 50% coverage [8,24,26], suggesting that the sets are enriched in housekeeping genes but still lacking in specificity and selectivity. This could be partly attributed to the limited number of tissues examined in each separate analysis and the differences between the tissues across analyses. However, it is likely that technological limitations affecting the underlying data have contributed much to the limited quality and reproducibility of the results.

In particular, first-generation microarray technology is known to have had many problematic nonspecific probes [27]. Even the improved versions of microarrays are typically assumed to achieve only an approximately twofold accuracy in expression level measurement, and they are limited in their dynamic range. These inaccuracies could have large effects on deciding whether a gene is expressed (regardless of the rather arbitrary expression cutoff used to determine which probe set is ‘expressed’).

A second, more fundamental, issue relates to the very definition of housekeeping genes. Should one look for genes merely being expressed in all tissues, or should the gene also be expressed at a constant level across tissues? Early studies generally adopted the first definition and, in fact, GAPDH and other popular housekeeping genes for experimental controls have been found to vary considerably across tissues [3,28–30]. This choice was the pragmatic one to make, because it enabled the use of the binary present or absent calls of the microarray and rendered normalization issues unnecessary. However, this approach has two shortcomings. First, measurement errors and stochastic noise make it difficult to distinguish genes absent from the sample from those weakly expressed. Second, and more importantly, it was later appreciated that a large part of the genome is expressed at a low basal level in all tissues [31]. Thus, most genes are expressed at some background level in all tissues. In light of this observation, and to make the concept of housekeeping genes more useful, one should either modify the definition of housekeeping genes to ‘genes that are expressed above some cutoff level’, which necessarily introduces an arbitrary parameter explicitly, or rather adopt the second option above and look for genes that are expressed at a constant level across all normal tissues.

0168-9525/$ – see front matter © 2013 Elsevier Ltd. All rights reserved.
Corresponding author: Eisenberg, E.
Keywords: housekeeping genes; RNA-seq; gene expression patterns; internal control; next generation sequencing.
Trends in Genetics, October 2013, Vol. 29, No. 10 569

Introducing an expression cutoff requires a quantitative comparison of expression levels of different genes in the same sample. This is known to be a complex problem, due to questions of bias in PCR amplification, different probe affinities, and so on. Furthermore, normalizing the values obtained from different experiments is also a nontrivial challenge. Early microarray studies generally used linear normalization, setting the mean expression level, or the trimmed mean, constant. Later, the more sophisticated quantile normalization was introduced [32]. These and other normalization procedures generally assume similar expression-value distributions for all samples studied. This could be justified for samples coming from identical or highly similar biological conditions, perhaps even for healthy and diseased samples of the same tissue. However, it is not yet clear how accurate this assumption is for cross-tissue comparisons, and how much it skews the results [33].

A third issue that was not fully addressed in previous studies of housekeeping genes is alternative splicing. It has been appreciated for more than a decade that most human genes have more than one isoform [34,35]. Thus, one could envision a situation in which one splice variant is constitutively expressed, making it a housekeeping transcript, whereas another transcript from the same gene exhibits a more complex expression profile (Figure 1A).
Moreover, it is possible that a single gene expresses one transcript in one set of tissues and another transcript in other tissues, such that the gene, as such, is always expressed, but each transcript is specific to a subset of tissues. In principle, then, one would like to define the set of housekeeping transcripts. Early microarray technology did rather poorly in distinguishing between transcripts and, thus, some studies deliberately ‘zoomed out’ to the gene level.

Housekeeping genes in the deep-sequencing era

New horizons are opening as deep-sequencing technology takes over from microarrays as the method of choice for transcriptome profiling [36]. RNA-seq was found to be preferable to microarrays as a tool for expression measurement. Unlike microarrays, RNA-seq does not require prior knowledge of the genomic sequence (although it is helpful for analysis), and requires smaller amounts of RNA. It provides information at the single-base level, enabling better assessment of alternative splicing and even allelic variation. Background levels in RNA-seq are lower, due to the better specificity and improved control of in silico sequence alignment compared with probe hybridization. Consequently, a wider dynamic range is accessible. Importantly, RNA-seq is also more accurate in quantifying spike-in RNA controls of known concentration, and produces expression values that correlate better with quantitative PCR (qPCR) results [36] and protein levels [37]. This new and improved platform makes it possible to meet some long-standing challenges, but it also opens up new questions.

In terms of normalization, read coverage generally provides a rather robust measure for comparing different genomic regions within the same sample. Exceptions to this are generally a result of alignment problems in repetitive or duplicative regions (Figure 1B).
For the task of housekeeping gene identification, these can be partly avoided by limiting analysis to the nonrepetitive coding regions of the exons [33] and using long reads. Note, however, that highly expressed coding exons (e.g., GAPDH) are prone to having more duplications [38], resulting in alignment problems. Small-scale PCR biases are expected to be washed out when looking at the averaged expression level over whole exons. By contrast, the issue of cross-tissue normalization is still open. The popular reads per kilobase per million mapped reads (RPKM) measure takes care of normalizing for the two most obvious factors affecting the raw number of reads per gene, transcript, or exon: the total number of reads produced and the length of the gene, transcript, or exon [39]. The RPKM measure is simple and straightforward, but does not fully solve the between-sample normalization issue. More subtle biases were detected, resulting from variations in transcript length distribution in the sample, coverage dependence on local sequence due to GC content, priming and other biases, and variability in mappability of different regions [40–45].

Figure 1. Examples of challenges in housekeeping gene detection. (A) Genes having several splice variants could have different expression levels [indicated by the number of reads (black bars)] for different parts of the gene. (B) Duplicative regions, due to pseudogenes and other duplications, complicate unique read alignments, thus biasing expression-level measurement. (C) Expression measurement has several biases, including the lower expression (on average) of the upstream exons due to imperfect reverse transcription resulting in partial cDNA molecules.
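The RPKM normalization described above can be sketched in a few lines. This is a minimal illustration of the standard definition (read count scaled by feature length in kilobases and by library size in millions of mapped reads); the function name and the example numbers are hypothetical and do not come from the article.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads.

    read_count: reads aligned to the gene, transcript, or exon
    feature_length_bp: length of that feature in base pairs
    total_mapped_reads: number of mapped reads in the sample
    """
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("feature length and library size must be positive")
    # 1e9 = 1e3 (bp -> kb) * 1e6 (reads -> millions of reads)
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Hypothetical exon: 500 reads on a 2000-bp exon in a 20-million-read sample.
example_value = rpkm(500, 2000, 20_000_000)
print(example_value)
```

The sketch corrects only for library size and feature length; the subtler biases listed in the text (GC content, priming, mappability) are not addressed by this formula.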
There is still no consensus as to the best way to account for all of these in a standard and consistent way.

In terms of housekeeping gene identification, RNA-seq data indeed show explicitly that basal (leaky) low expression levels can be found throughout the genome. Therefore, any definition of housekeeping genes should refer to the quantitative expression level. This can be done using a cutoff, or by adding the requirement of low variability in expression across tissues. Here, we promote the latter course of action. Setting a cutoff value as the main criterion for defining the housekeeping genes is undesirable for three reasons. First, there seems to be no natural cutoff value, thus forcing one to make an arbitrary choice. Second, due to the lack of a proper intergene normalization scheme, the same RPKM values for different genes could indicate different expression levels [4,46]. Third, using the expression level as a measure of importance for cell function is also questionable: cells are likely to require different gene products at different concentrations. There is no good reason to exclude genes that are constantly expressed at a mid rather than a high level. Thus, we feel that low variability should be used as the main criterion for selecting housekeeping genes.

Another advantage of RNA-seq data is that they measure the expression along the gene (similar to the older exon arrays) and can thereby provide expression at the exon level. Some software tools try to extract transcript expression levels from RNA-seq data (e.g., [47]). However, there is still much to be desired in terms of reliability within the limits of current technology [43]. This is expected to improve significantly as read length increases. Note that recent findings [48] show significant variability in exon boundaries, making even the comparison of exon expression imperfect.
An interim partial solution, which we adopt below, is to measure expression at the more basic exon level and aim to define a set of housekeeping exons.

Extracting a set of housekeeping genes from Human BodyMap data

Here, we demonstrate the power of the new technology for identifying housekeeping genes by analyzing expression data from the Human BodyMap (HBM) 2.0 Project. This includes publicly available RNA-seq data (GEO accession number GSE30611, HBM), generated on HiSeq 2000 instruments, providing expression profiling in 16 normal human tissue types: adrenal, adipose, brain, breast, colon, heart, kidney, liver, lung, lymph, ovary, prostate, skeletal muscle, testes, thyroid, and white blood cells. Two different read lengths were used for each tissue (2 × 50-bp paired-end and 1 × 75-bp single-read data), each of which was sequenced in a separate HiSeq 2000 lane. We aligned the reads to the genome using the Bowtie2 aligner [49] and measured the read coverage of each of the coding exons of the (uniquely aligned) RefSeq sequences [50], in normalized RPKM units. For exons that were partly coding, only the coding part was considered. Short exons (<50 bp) are prone to alignment problems and were discarded.

We compared the RPKM values obtained from the paired-end data and the single-read data to assess the technical reproducibility of the RPKM measure, and found that the typical fold-ratio between the two was 1.5 (Figure 2A). We observed a bias against the upstream exons of transcripts, which tended to have lower expression levels. This effect might result from imperfect reverse transcription resulting in cDNA missing the upstream part of the transcript (Figure 1C).
[Figure 2, three panels. (A) Distribution of log2(RPKM50_PE/RPKM75). (B) Cutoff value (RPKM) versus fraction of exons passing; key: minimum expression over tissues, geometric mean expression. (C) Fraction of exons below cutoff versus std[log2(RPKM)] cutoff.]

Figure 2. Characterization of the expression profile in Human BodyMap (HBM) data. (A) Reproducibility of the measured reads per kilobase per million mapped reads (RPKM) levels per exon, as assessed by comparing the 50-bp paired-end and the 75-bp single-read data. The continuous line is the best fit for a Gaussian distribution, added to accentuate the fat tails of the actual distribution. The width of the distribution is approximately 0.55 (log2 units), leading to a typical variability of 1.5-fold. (B) Fraction of exons expressed above a cutoff value in all 16 tissues, for different cutoff values. In total, 55% of all exons are expressed to a detectable level in the HBM data set. (C) Cumulative distribution of the exon expression variance. Most of the exons being expressed in all tissues have standard-deviation [log2(RPKM)] values between 0.7 and 1.5.
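The step from a distribution width of about 0.55 in log2 units to a typical variability of about 1.5-fold is a simple exponentiation, which the following sketch makes explicit. The helper names and the toy RPKM values are hypothetical and serve only to illustrate the arithmetic.

```python
import math


def log2_ratios(rpkm_a, rpkm_b):
    """Per-exon log2 ratio between two measurements of the same exons.

    Exons with a non-positive value in either measurement are skipped,
    because their log ratio is undefined.
    """
    return [math.log2(a / b) for a, b in zip(rpkm_a, rpkm_b) if a > 0 and b > 0]


def typical_fold(log2_width):
    """Convert a width in log2 units into a fold ratio."""
    return 2.0 ** log2_width


# A width of 0.55 log2 units corresponds to roughly a 1.46-fold ratio,
# which rounds to the 1.5-fold quoted in the caption.
print(round(typical_fold(0.55), 2))

# Toy values standing in for paired-end and single-read RPKM of two exons.
print(log2_ratios([4.0, 1.0], [1.0, 2.0]))
```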
Figure 2B presents the fraction of exons being expressed above a certain cutoff RPKM value in all tissues. Note that approximately 55% of all exons are expressed at a detectable level in all HBM tissues, demonstrating why the old definition of housekeeping genes is not useful. In addition, it is hard to detect a natural expression cutoff value. The variation in expression level is estimated by the standard deviation of log2(RPKM) over samples. Figure 2C shows the cumulative distribution of these standard deviation values for the different exons.

Table 1. Genes proposed for calibration (a)

Gene symbol | RefSeq accession number | Gene name | Genomic coordinates (hg19) of exons passing the filters
C1orf43 | NM_015449 | Chromosome 1 open reading frame 43 | chr1 154192817 154192883; chr1 154186932 154187050; chr1 154186368 154186422; chr1 154184933 154185100; chr1 154184795 154184854
CHMP2A | NM_014453 | Charged multivesicular body protein 2A | chr19 59065411 59065579; chr19 59063625 59063805; chr19 59063421 59063552
EMC7 | NM_020154 | ER membrane protein complex subunit 7 | chr15 34382517 34382656; chr15 34380253 34380334; chr15 34376537 34376687
GPI | NM_000175 | Glucose-6-phosphate isomerase | chr19 34857687 34857756; chr19 34859487 34859607; chr19 34868639 34868786; chr19 34869838 34869910; chr19 34872370 34872424; chr19 34884152 34884213; chr19 34884818 34884971; chr19 34887205 34887335; chr19 34887485 34887562; chr19 34890111 34890240; chr19 34890460 34890536; chr19 34890623 34890690
PSMB2 | NM_002794 | Proteasome subunit, beta type, 2 | chr1 36101910 36102033; chr1 36096874 36096945; chr1 36070833 36070883
PSMB4 | NM_002796 | Proteasome subunit, beta type, 4 | chr1 151372456 151372663; chr1 151372917 151373064; chr1 151373239 151373321; chr1 151373714 151373831
RAB7A | NM_004637 | Member RAS oncogene family | chr3 128525214 128525433; chr3 128526385 128526514; chr3 128532169 128532262
REEP5 | NM_005669 | Receptor accessory protein 5 | chr5 112256859 112256953; chr5 112238076 112238215; chr5 112222711 112222880
SNRPD3 | NM_004175 | Small nuclear ribonucleoprotein D3 | chr22 24953642 24953768; chr22 24963951 24964144
VCP | NM_007126 | Valosin containing protein | chr9 35067887 35068060; chr9 35066671 35066814; chr9 35064150 35064282; chr9 35062213 35062347; chr9 35061999 35062135; chr9 35061573 35061686; chr9 35061011 35061176; chr9 35060797 35060920; chr9 35060309 35060522; chr9 35059489 35059798; chr9 35059060 35059216; chr9 35057372 35057527; chr9 35057116 35057219
VPS29 | NM_016226 | Vacuolar protein sorting 29 homolog | chr12 110930800 110931036; chr12 110929812 110929927; chr12 110929812 110929927

(a) Genes chosen have most of their exons showing geometrical mean expression exceeding RPKM = 50, standard deviation of log2(RPKM) <0.5, and no single tissue showing an expression level different from the geometrical mean by twofold or more. Genes with pseudogenes were excluded.

To qualify as a housekeeping exon, an exon must be expressed in all tissues at any nonzero level, and must exhibit a uniform expression level across tissues. Thus, we adopted the following criteria: (i) expression observed in all tissues; (ii) low variance over tissues: standard-deviation [log2(RPKM)]<1; and (iii) no exceptional expression in any single tissue; that is, no log-expression value differed from the averaged log2(RPKM) by two (fourfold) or more. These criteria resulted in a list of 37 363 unique exons (20% of studied exons), belonging to 11 648 RefSeq transcripts and 6289 genes. These included most of the stable housekeeping genes reported based on microarray data [30].

We define a housekeeping gene as a gene for which at least one RefSeq transcript has more than half of its exons meeting the previous criteria (thus being housekeeping exons). Altogether, we found 3804 such human housekeeping genes. The lists of housekeeping exons and housekeeping genes are available at$elieis/HKG/. In addition, we propose a short list of highly uniform and strongly expressed genes that may be used for calibration in future experimental settings (Table 1).

As expected, the housekeeping genes are enriched in gene ontology (GO) categories associated with basic cellular activity, such as gene expression and biogenesis of nucleotides and amino acids, catabolic processes, protein localization, and so on [51]. The overlap with previous lists is partial, due to the different definition of housekeeping genes used. In particular, GAPDH and actin beta (ACTB) do not appear in our new list, because these genes vary across tissues [3,28–30]. Nevertheless, some of the most pronounced features previously reported for housekeeping genes, such as the much shorter introns [8–11] and more duplications [52], also characterize the new set.
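The three exon criteria and the gene-level rule stated above can be written down as a short sketch. It is an illustration only, not the authors' pipeline: the function names and data layout are hypothetical, and the use of the population standard deviation is an assumption, because the text does not say which estimator was used.

```python
import math
import statistics


def is_housekeeping_exon(rpkm_by_tissue, max_sd=1.0, max_dev=2.0):
    """Apply criteria (i)-(iii) to one exon's RPKM values across tissues."""
    # (i) expression observed in all tissues (any nonzero level)
    if not rpkm_by_tissue or any(v <= 0 for v in rpkm_by_tissue):
        return False
    logs = [math.log2(v) for v in rpkm_by_tissue]
    mean_log = sum(logs) / len(logs)
    # (ii) standard deviation of log2(RPKM) below 1
    # (population standard deviation is an assumption here)
    if statistics.pstdev(logs) >= max_sd:
        return False
    # (iii) no tissue differs from the mean log2(RPKM) by 2 (fourfold) or more
    return all(abs(x - mean_log) < max_dev for x in logs)


def is_housekeeping_gene(exon_profiles_by_transcript):
    """True if at least one transcript has more than half of its exons passing.

    exon_profiles_by_transcript: dict mapping a transcript identifier to a
    list of per-exon RPKM profiles (one list of tissue values per exon).
    """
    for exons in exon_profiles_by_transcript.values():
        passing = sum(is_housekeeping_exon(e) for e in exons)
        if exons and 2 * passing > len(exons):
            return True
    return False


# Toy profiles over 16 tissues: one uniform exon, one with a tenfold spike.
uniform = [10.0] * 16
spiked = [10.0] * 15 + [100.0]
print(is_housekeeping_exon(uniform), is_housekeeping_exon(spiked))
```

In the toy example the spiked exon still has a standard deviation below 1, so it is criterion (iii) that rejects it, which shows why the single-tissue check is listed separately from the variance check.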
Concluding remarks

Current technology enables global measurement of expression levels with unprecedented accuracy. This advancement has revealed that large parts of the genome are normally expressed at a low level. Accordingly, we found that most human exons are expressed at some level in all the human tissues studied. This new technological era calls on the community to reevaluate the concept of a housekeeping gene. Here, we have presented our own perspective, suggesting the use of low expression variation as the main criterion for defining housekeeping genes. We also provide sets of exons and genes that are ubiquitously and uniformly expressed, as well as a short list of genes suitable for experimental calibration.

More high-quality deep-sequencing transcriptome profiling data are expected to emerge in the near future, enabling improvements of the analysis described here using better statistics for the tissues studied and adding more tissue types. Furthermore, including extreme pathological conditions relevant for various tissues could further purify the housekeeping genes list [53]. A significant advance should come from new experiments currently being done on single-cell transcriptome profiling [54]. This could improve the specificity in detecting housekeeping genes, narrowing the list to genes that are expressed in each and every single cell. In addition, accumulation of tissue-specific epigenetic data, such as histone marks and nucleotide methylations, could be used in the future to better distinguish regulated expression from low-level noise.

As discussed above, normalization (within a sample and across samples) is still an unresolved issue. Advancement in this direction could greatly improve housekeeping gene detection. In addition, usage of longer reads is expected to decrease alignment errors and reduce bias.
Longer reads (and improved analysis tools) are expected to raise consid- erably the sensitivity of expression level measurement at the transcript level, enabling direct evaluation of the housekeeping splice-variants list. In conclusion, the dramatic advancement of sequencing technologies calls for a reassessment of the notion of housekeeping genes, and allows for improving quantita- tively and qualitatively the resolution. We thus provide updated lists of housekeeping exons and genes for public use, available at$elieis/HKG/. It is expected that emerging technologies could very soon facili- tate meeting the yet open challenges, allowing for better and more accurate housekeeping gene profiling. Acknowledgments We thank Ami Haviv and Gilad Finkelstein for help with reads’ alignments, and Lily Bazak for help in gene lengths’ analysis. This work was supported by Israel Science Foundation 379/12 (EE), by the I- CORE Program of the Planning and Budgeting Committee and the Israel Science Foundation (grant No 41/11) and by the Marie Curie Integration Grant 256593(EYL). References 1 Fraser, C.M. et al. (1995) The minimal gene complement of Mycoplasma genitalium. Science 270, 397–403 2 Koonin, E.V. (2000) How many genes can make a cell: the minimal- gene-set concept. Annu. Rev. Genomics Hum. Genet. 1, 99–116 3 Thellin, O. et al. (1999) Housekeeping genes as internal standards: use and limits. J. Biotechnol. 75, 291–295 4 Robinson,M.D.andOshlack,A.(2010)Ascalingnormalizationmethodfor differential expression analysis of RNA-seq data. Genome Biol. 11, R25 5 Dheda, K. et al. (2004) Validation of housekeeping genes for normalizing RNA expression in real-time PCR. Biotechniques 37, 112–114, 116, 118–119 6 Rubie, C. et al. (2005) Housekeeping gene variability in normal and cancerous colorectal, pancreatic, esophageal, gastric and hepatic tissues. Mol. Cell. Probes 19, 101–109 7 Vandesompele, J. et al. 
(2002) Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes. Genome Biol. 3, RESEARCH0034
8 Eisenberg, E. and Levanon, E.Y. (2003) Human housekeeping genes are compact. Trends Genet. 19, 362–365
9 Vinogradov, A.E. (2004) Compactness of human housekeeping genes: selection for economy or genomic design? Trends Genet. 20, 248–253
10 Carmel, L. and Koonin, E.V. (2009) A universal nonmonotonic relationship between gene compactness and expression levels in multicellular eukaryotes. Genome Biol. Evol. 1, 382–390
11 Castillo-Davis, C.I. et al. (2002) Selection for short introns in highly expressed genes. Nat. Genet. 31, 415–418
12 Eller, C.D. et al. (2007) Repetitive sequence environment distinguishes housekeeping genes. Gene 390, 153–165
13 Versteeg, R. et al. (2003) The human transcriptome map reveals extremes in gene density, intron length, GC content, and repeat pattern for domains of highly and weakly expressed genes. Genome Res. 13, 1998–2004
14 Farré, D. et al. (2007) Housekeeping genes tend to show reduced upstream sequence conservation. Genome Biol. 8, R140
15 Lawson, M.J. and Zhang, L. (2008) Housekeeping and tissue-specific genes differ in simple sequence repeats in the 5′-UTR region. Gene 407, 54–62
  • 21. 16 Ganapathi, M. et al. (2005) Comparative analysis of chromatin landscape in regulatory regions of human housekeeping and tissue specific genes. BMC Bioinformatics 6, 126
17 Lehner, B. and Fraser, A.G. (2004) Protein domains enriched in mammalian tissue-specific or widely expressed genes. Trends Genet. 20, 468–472
18 Velculescu, V.E. et al. (1999) Analysis of human transcriptomes. Nat. Genet. 23, 387–388
19 Zhu, J. et al. (2008) How many human genes can be defined as housekeeping with current expression data? BMC Genomics 9, 172
20 Zhu, J. et al. (2008) On the nature of human housekeeping genes. Trends Genet. 24, 481–484
21 Chang, C-W. et al. (2011) Identification of human housekeeping genes and tissue-selective genes by microarray meta-analysis. PLoS ONE 6, e22859
22 Hsiao, L.L. et al. (2001) A compendium of gene expression in normal human tissues. Physiol. Genomics 7, 97–104
23 Lee, S. et al. (2007) Identification of novel universal housekeeping genes by statistical analysis of microarray data. J. Biochem. Mol. Biol. 40, 226–231
24 She, X. et al. (2009) Definition, conservation and epigenetics of housekeeping and tissue-enriched genes. BMC Genomics 10, 269
25 Warrington, J.A. et al. (2000) Comparison of human adult and fetal expression and identification of 535 housekeeping/maintenance genes. Physiol. Genomics 2, 143–147
26 Butte, A.J. et al. (2001) Further defining housekeeping, or ‘maintenance’, genes. Focus on ‘A compendium of gene expression in normal human tissues’. Physiol. Genomics 7, 95–96
27 Irizarry, R.A. et al. (2003) Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 31, e15
28 Barber, R.D. et al. (2005) GAPDH as a housekeeping gene: analysis of GAPDH mRNA expression in a panel of 72 human tissues. Physiol. Genomics 21, 389–395
29 Lee, P.D. et al. (2002) Control genes and variability: absence of ubiquitous reference transcripts in diverse mammalian expression studies. Genome Res. 12, 292–297
30 De Jonge, H.J.M. et al.
(2007) Evidence based selection of housekeeping genes. PLoS ONE 2, e898
31 Kapranov, P. et al. (2007) Genome-wide transcription and the implications for genomic organization. Nat. Rev. Genet. 8, 413–423
32 Bolstad, B.M. et al. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19, 185–193
33 Ramsköld, D. et al. (2009) An abundance of ubiquitously expressed genes revealed by tissue transcriptome sequence data. PLoS Comput. Biol. 5, e1000598
34 Modrek, B. and Lee, C. (2002) A genomic view of alternative splicing. Nat. Genet. 30, 13–19
35 Johnson, J.M. et al. (2003) Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays. Science 302, 2141–2144
36 Wang, Z. et al. (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 10, 57–63
37 Fu, X. et al. (2009) Estimating accuracy of RNA-Seq and microarrays with proteomics. BMC Genomics 10, 161
38 Zhang, Z. et al. (2003) Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome. Genome Res. 13, 2541–2558
39 Mortazavi, A. et al. (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 5, 621–628
40 Wagner, G.P. et al. (2012) Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory Biosci. 131, 281–285
41 Dillies, M-A. et al. (2012) A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief. Bioinform.
42 Dohm, J.C. et al. (2008) Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res. 36, e105
43 Schwartz, S. et al. (2011) Detection and removal of biases in the analysis of next-generation sequencing reads. PLoS ONE 6, e16685
44 Li, J. et al. (2010) Modeling non-uniformity in short-read rates in RNA-Seq data. Genome Biol. 11, R50
45 Jones, D.C. et al.
(2012) Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Res. 40, e171
46 Roberts, A. et al. (2011) Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biol. 12, R22
47 Trapnell, C. et al. (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515
48 Pelechano, V. et al. (2013) Extensive transcriptional heterogeneity revealed by isoform profiling. Nature 497, 127–131
49 Langmead, B. and Salzberg, S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359
50 Pruitt, K.D. et al. (2012) NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Res. 40, D130–D135
51 Huang, D.W. et al. (2009) Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4, 44–57
52 Zhang, Z. et al. (2004) Comparative analysis of processed pseudogenes in the mouse and human genomes. Trends Genet. 20, 62–67
53 Chen, M. et al. (2013) Identification of human HK genes and gene expression regulation study in cancer from transcriptomics data analysis. PLoS ONE 8, e54082
54 Tang, F. et al. (2009) mRNA-Seq whole-transcriptome analysis of a single cell. Nat. Methods 6, 377–382
  • 22. Feature Review
Properties and rates of germline mutations in humans
Catarina D. Campbell (1) and Evan E. Eichler (1,2)
(1) Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
(2) Howard Hughes Medical Institute, Seattle, WA 98195, USA

All genetic variation arises via new mutations; therefore, determining the rate and biases for different classes of mutation is essential for understanding the genetics of human disease and evolution. Decades of mutation rate analyses have focused on a relatively small number of loci because of technical limitations. However, advances in sequencing technology have allowed for empirical assessments of genome-wide rates of mutation. Recent studies have shown that 76% of new mutations originate in the paternal lineage and provide unequivocal evidence for an increase in mutation with paternal age. Although most analyses have focused on single nucleotide variants (SNVs), studies have begun to provide insight into the mutation rate for other classes of variation, including copy number variants (CNVs), microsatellites, and mobile element insertions (MEIs). Here, we review the genome-wide analyses for the mutation rate of several types of variants and suggest areas for future research.

The fundamental process in genetics
The replication of the genome before cell division is a remarkably precise process. Nevertheless, there are some errors during DNA replication that lead to new mutations. If these errors occur in the germ cell lineage (i.e., the sperm and egg), then these mutations can be transmitted to offspring. Some of these new genetic variants will be deleterious to the organism, and a select few will be advantageous and serve as substrates for selection. Therefore, knowledge about the rate at which new mutations appear and the properties of new mutations is critical in the study of human genetics from evolution to disease.
The study of the mutation rate in humans dates back further than the discovery of the structure of DNA or the determination of DNA as the genetic material. In seminal work performed during the 1930s and 1940s, J.B.S. Haldane studied hemophilia with the assumption of a mutation–selection balance to estimate the mutation rate at that locus and determined that most new mutations arose in the paternal germline [1,2]. Until recently, most mutation rate analyses were similar to this initial work in that they extrapolated rates and properties from a handful of loci (often linked to dominant genetic disorders; for example, see [3]). Over the past few years, it has become feasible to generate large amounts of sequence data (including the genomes of parents and their offspring), and it is now possible to calculate empirically a genome-wide mutation rate. In addition, much interest has focused on understanding the role of de novo mutations in human disease. Therefore, in this review, we synthesize the recent analyses of mutation rate for multiple forms of genetic variation and discuss their implications with respect to human disease and evolution.

SNV mutation rate
It is now feasible to perform whole-genome sequencing on all individuals from a nuclear family; from these data, one can identify de novo mutations that ‘disobey’ Mendelian inheritance (Box 1, Figure I). The first two papers to apply this approach were limited in scope to three families [4,5], thus restricting the total number of de novo SNVs observed. Even with this limitation, these two analyses reported similar overall mutation rates of approximately 1 × 10^-8 SNV mutations per base pair per generation, although there was considerable variation among families [4,5]. A more recent study using whole-genome sequence data from 78 Icelandic parent–offspring trios suggested a higher rate of 1.2 × 10^-8 SNVs per base pair per generation from de novo mutations [6].
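As a rough illustration of the trio-based approach just described, the following Python sketch flags sites where a child carries an allele seen in neither parent and converts a count of such sites into a per-base-pair, per-generation rate. The genotype encoding, the function names, and the example numbers (60 de novo SNVs, 2.6 × 10^9 callable base pairs) are assumptions made for illustration and are not taken from the studies cited; real analyses also filter heavily for false positives.

```python
def is_candidate_de_novo(child, mother, father):
    """Return True if the child carries an allele absent from both parents.

    Genotypes are tuples of two alleles, e.g. ("A", "G").
    """
    parental_alleles = set(mother) | set(father)
    return any(allele not in parental_alleles for allele in child)


def mutation_rate(n_de_novo, callable_bp):
    """Per-base-pair, per-generation rate from one trio.

    The child inherits two haploid genomes, so the count is divided by
    twice the number of callable base pairs.
    """
    return n_de_novo / (2 * callable_bp)


# Hypothetical trio genotypes at three sites: (child, mother, father)
sites = [
    (("A", "G"), ("A", "A"), ("A", "A")),  # G is new: candidate de novo
    (("C", "T"), ("C", "C"), ("T", "T")),  # consistent with inheritance
    (("G", "G"), ("G", "G"), ("G", "A")),  # consistent with inheritance
]
candidates = [s for s in sites if is_candidate_de_novo(*s)]
print(len(candidates))

# Hypothetical counts: 60 de novo SNVs over 2.6e9 callable base pairs
rate = mutation_rate(60, 2.6e9)
print(f"{rate:.2e}")
```

With these made-up inputs the rate comes out close to 1 × 10^-8, the order of magnitude reported by the family studies; the check deliberately ignores somatic mutations and genotyping error.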
Glossary
Autozygosity: large regions of homozygous sequence inherited from a recent ancestor; also referred to as homozygosity by recent descent.
De novo mutation: a mutation observed in a child but not in his or her parents. Such mutations are assumed to have occurred in one of the parental germlines.
Haplotype phase: determination of which alleles segregate on the same physical chromosomes. For example, which alleles of nearby variants in a child occur on the chromosome inherited from his or her father.
Microsatellite: a locus comprising a simple repeat of DNA bases. The repeating unit usually comprises two, three, or four bases.
rDNA: the regions of the genome encoding ribosomal RNA. These comprise repeating units of either 2.2 kbp located on chromosome 1 or 43 kbp located on the acrocentric chromosomes.
Retrotransposon: a DNA sequence that copies itself through an mRNA intermediate and reinserts the copied sequence through reverse transcription into a new location in the genome.
Segmental duplication (SD): a segment (>1 kbp) of high sequence identity (>90%) that exists at two or more locations in a genome.

0168-9525/$ – see front matter © 2013 Elsevier Ltd. All rights reserved.
Corresponding author: Eichler, E.E.
Keywords: germline mutation rate; de novo mutation; paternal bias; paternal age; genome wide.
Trends in Genetics, October 2013, Vol. 29, No. 10 575

Another study used autozygous segments (see Glossary) in the genomes of Hutterite trios, who were descended from a 13-generation pedigree with 64 founders, to calculate independently the same SNV mutation rate of
  • 23. 1.2 × 10^-8 [7]. A study of ten additional families of individuals affected with autism reported a rate of 1 × 10^-8 [8].

In addition to the direct approaches in families, earlier studies used more indirect approaches to estimate mutation rate. An analysis of fixed differences between the human and chimpanzee genomes (Box 1) yielded a mutation rate for SNVs of approximately 2.5 × 10^-8 in pseudogenes, where selection is not a confounding factor [9,10]; this is over twofold higher than the rates estimated from direct approaches. However, more recent comparisons of the human, chimpanzee, and gorilla genomes bring the mutation rate estimates in line with what is observed in family-based analyses [11]. Another indirect approach estimated the mutation rate for SNVs to be 1.82 × 10^-8 using inferred ancestry of nearby microsatellites [12] (Box 1, Figure I). The difference between this mutation rate and those calculated with family information may be due to differences in filtering applied for SNVs or in sequencing methodology.

Recent genome-wide studies of the SNV mutation rate in humans have started to converge (Table 1). Studies based on whole-genome sequencing and direct estimates of de novo mutations give an average SNV mutation rate of 1.16 × 10^-8 mutations per base pair per generation [95% confidence interval (CI) of the mean: 1.11–1.22 × 10^-8] in 96 total families [4–8] (Table 1). However, it is important to note that all of these studies involve substantial filtering of de novo variants to remove false positives and often exclude highly repetitive regions of the genome. Given the relevance of variants in protein-coding sequence to disease, it is also important to understand the mutation rate in exonic regions. Studies from targeted sequencing of exomes or other regions have reported higher mutation rates (1.31–2.17 × 10^-8 mutations per base pair per generation) [13–16]; this apparent increase may be due to several factors, as discussed below.
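The average of 1.16 × 10^-8 over 96 families can be reproduced from the whole-genome rows of Table 1. The short Python sketch below assumes that the average is weighted by the number of families in each study; that weighting is an assumption on our part (the text does not say how the mean was formed), but it recovers the quoted value, whereas an unweighted mean of the six rows does not.

```python
# (number of families, rate in units of 1e-8) for the whole-genome rows of Table 1
whole_genome = [
    (1, 1.10),   # [4]
    (1, 1.17),   # [5]
    (1, 0.97),   # [5]
    (78, 1.20),  # [6]
    (5, 0.96),   # [7]
    (10, 1.00),  # [8]
]

total_families = sum(n for n, _ in whole_genome)
weighted_mean = sum(n * mu for n, mu in whole_genome) / total_families
unweighted_mean = sum(mu for _, mu in whole_genome) / len(whole_genome)

print(total_families)             # 96
print(round(weighted_mean, 2))    # 1.16
print(round(unweighted_mean, 2))  # 1.07
```

Because the 78 Icelandic trios dominate the weighting, the combined estimate sits close to the 1.2 × 10^-8 reported by that single study.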
Box 1. Methods for discovering new mutations and estimating mutation rate
Most methods for estimating mutation rate were developed for SNV data, but can be applied more broadly to other forms of variation. The most common approach for estimating mutation rate is to use families to look for mutations carried by a child but not by either of his or her parents (Figure I). This approach has been carried out on selected loci up to whole genomes. However, it is important to note that this method can be confounded by false positives, for which putative de novo variants are enriched [5]. In addition, somatic mutations in offspring of the sequenced families cannot be distinguished from germline de novo variants.

The other classical approach for estimating mutation rates is to look at fixed differences between species [9,10]. The mutation rate can then be calculated based on the estimated divergence time between the species (Figure I). Although this approach is not confounded by false positives or somatic mutations, there is uncertainty in the divergence time between humans and chimpanzees, the average generation time, and effective population sizes.

Recently, other approaches for determining mutation rate have been described. One group constructed a model of microsatellite evolution and applied this model to estimate the time to the most recent common ancestor (MRCA) for microsatellite alleles [12]. Because SNVs near the microsatellite have the same ancestry as the microsatellite, the mutation rate for SNVs could be calculated using the SNV differences between haplotypes and the time to the MRCA [12]. Another approach to estimating mutation rate involves the identification of heterozygous mutations in large regions of homozygosity by recent descent (autozygosity) [7,120] (Figure I). Such regions are particularly abundant among founder populations, providing a means for estimating mutation rate from a recent common ancestor in populations such as the Hutterites, the Amish, and the Icelandic population. Although different in many ways, these two approaches have some important similarities. Both are less susceptible to false positives and somatic mutations than are analyses of de novo mutations in trios. In addition, both approaches estimate the time to the MRCA for segments of the genome in different ways, but benefit by studying haplotypes with a more recent coalescent time than humans and chimpanzees.

Figure I. Methods for discovering new mutations and estimating mutation rate. (A) Sequence data from parent–offspring trios can be used to find mutations present in the child but not observed in either parent (red star). (B) Fixed differences between closely related species can be identified and counted; red or green stars represent mutations occurring in the lineage leading to humans and orange or yellow stars represent mutations in the lineage leading to chimpanzees. This value, in combination with the estimated number of generations between the species, can be used to calculate mutation rate. A modification of this approach can be used within species if the coalescent time of haplotypes can be estimated [12]. (C) Mutations in regions of autozygosity appear as heterozygous variants in long stretches of homozygous DNA [7,120]. With known pedigree information, the most recent common ancestor (MRCA) of the autozygous haplotype can be identified and the mutation rate calculated [7].

CNV mutation rate
In addition to SNVs, there has been considerable effort in estimating the rates of formation of CNVs. Although CNVs are operationally defined as deletions and duplications of 50 bp or more [17], most studies have assessed de novo events only in the multi-kilobase pair range. As with SNVs, initial studies in this area focused on only a few loci. These analyses found that the locus mutation rate was higher for CNVs (2.5 × 10^-6 to 1 × 10^-4 mutations per locus per generation) compared with SNVs and that the rate varied by more than an order of magnitude between loci [18,19]; data from mice suggest that the difference in rates between loci is even larger [20]. A genome-wide analysis of large CNVs (>100 kbp) revealed a mutation rate of 1.2 × 10^-2 CNVs per generation based on approximately 400 parent–offspring trios [21]. A significantly higher mutation rate of 3.6 × 10^-2 mutations per generation was observed for
  • 24. individuals with intellectual disability, probably because some of these de novo CNVs were influencing the development of the disorders observed in these individuals [22]. Using high-density microarrays and population genetic approaches, the rate of CNV formation was estimated to be 3 × 10^-2 for variants >500 bp [23]. However, this rate is likely a lower boundary because selection will remove deleterious mutations from the population and most large CNVs are estimated to be deleterious [21,23].

Figure 1. Comparison of the frequency and scale of different forms of genetic variation. There is an inverse relation between mutation size and frequency. Although single nucleotide variants (SNVs) occur more frequently, each mutation affects only a single base pair. By contrast, large mutations, such as copy number variants (CNVs) or chromosomal aneuploidy, are rare, yet affect thousands to millions of base pairs. In addition, although these mutations are rare, they affect more base pairs per birth on average than do SNVs. (A) Average number of mutations of each type of variant per birth. (B) Average number of mutated bases contributed by each type of variant per birth. Y-axis is log10 scaled in both (A) and (B). Abbreviation: MEI, mobile element insertion.

Table 1. Genome-wide estimates of SNV mutation rate
Type | Number of families | μ (× 10^-8) | 95% CI | % Paternal | Refs
Whole genome | 1 | 1.10 | 0.68–1.70 | | [4]
Whole genome | 1 | 1.17 | 0.88–1.62 | 92% | [5]
Whole genome | 1 | 0.97 | 0.67–1.34 | 36% | [5]
Whole genome | 78 | 1.20 | | 76% | [6]
Whole genome | 5 | 0.96 | 0.82–1.09 | 85% | [7]
Whole genome | 10 (a) | 1.00 | | 74% | [8]
Targeted resequencing of 430 Mbp | 570 (b) | 1.36 | 0.34–2.70 | | [13]
Whole exome | 209 (c) | 2.17 | | 81% | [15]
Whole exome | 238 (d) | 1.31 | | | [16]
Whole exome | 175 (c) | 1.50 | | | [14]
Indirect from microsatellites | 23 (e) | 1.82 | 1.40–2.28 | | [12]
512 Mbp of autozygosity | 5 | 1.20 | 0.89–1.43 | | [7]
(a) Families of monozygotic twins with autism.
(b) Half of these families have probands with autism or schizophrenia. Mutation rate is based on ‘neutral’ sites.
(c) Probands are affected with autism.
(d) Families comprise proband with autism, unaffected sibling, and parents. Mutation rate for unaffected siblings is reported here.
(e) Number of unrelated individuals.

Notably, when considering the total number of mutated base pairs between SNVs and CNVs, CNVs account for the vast majority. New large CNVs (>100 kbp) are relatively rare compared with SNVs: one new large CNV per 42 births (95% Poisson CI: 23–97) [21] compared with an average 61 new SNVs per birth (95% CI of the mean: 58–64) [5–8] (Figure 1). The average number of base pairs affected by large CNVs is 8–25 kbp per gamete (16–50 kbp per birth) [21], which is larger than the average of 30.5 bp per gamete observed for SNVs (61 bp per birth; Figure 1). It is important to note that the estimates for CNVs are based on microarray data that could not be used reliably to detect smaller CNVs (<100 kbp); therefore, the mutational properties and rates of formation of these smaller variants remain unknown. Comparisons between the human and chimpanzee genomes also revealed that insertions and deletions account for close to three times the number of bases that are different compared with SNVs (3% versus 1.23%) [24]. Although caution must be exercised in the estimate of the de novo rate of CNVs, the data suggest a more than 100-fold differential between the number of base pairs affected (on average) per generation, yet only a threefold difference after 12 million years of evolution based on chimpanzee and human genome comparisons. This may reflect significant differences in the action of
  • 25. selection or radical rate changes since divergence for these different classes of mutation [25].

Other classes of genetic variation
In addition to CNVs and SNVs, there are many other forms of genetic variation that arise by completely different mutational processes and, consequently, have distinct biases. The largest, of course, are aneuploidies (the duplication or deletion of an entire chromosome). Due to the severity of these mutations (the most well-studied aneuploidy is Down syndrome), most aneuploidies are lethal in utero. Studies of spontaneous abortions and embryos created with in vitro fertilization suggest that 30–60% of embryos and 0.3% of newborns have a chromosomal aneuploidy (reviewed in [26]; Figure 1). Interestingly, there are substantial differences between chromosomes in the incidence of aneuploidy; trisomies of chromosomes 16, 18, 21, and the sex chromosomes are most prevalent [27]. Chromosomal aneuploidies are thought to primarily arise during meiosis I through several mechanisms. Most simply, homologous chromosomes can fail to pair or stay paired in meiosis, potentially due to lack of recombination events [28]. However, trisomies can also arise if sister chromatids improperly segregate during meiosis I [29] (Figure 1), and it appears as though different chromosomes may be primarily affected by different mechanisms [26].

Other forms of genetic variation have been less well characterized, often due to methodological biases in their discovery leading to reduced sensitivity. The rate of small insertions and deletions or ‘indels’ has been reported as approximately 0.20 × 10^-9 per site per generation for insertions and 0.53 × 10^-9 to 0.58 × 10^-9 per site per generation for deletions; this corresponds to approximately 6% of the SNV mutation rate [3,30] (Figure 1).
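The "approximately 6%" figure can be checked with a few lines of arithmetic. The Python sketch below adds the insertion and deletion rates quoted above and divides by the average SNV rate of 1.16 × 10^-8 given earlier in the review; using that particular SNV rate as the denominator is our assumption, since the text does not state which SNV estimate the comparison is based on.

```python
snv_rate = 1.16e-8                   # average SNV rate, per bp per generation
insertion_rate = 0.20e-9             # per site per generation
deletion_rates = (0.53e-9, 0.58e-9)  # reported range, per site per generation

# Indel rate as a fraction of the SNV rate, at each end of the deletion range
fractions = [(insertion_rate + d) / snv_rate for d in deletion_rates]
for fraction in fractions:
    print(f"{fraction:.1%}")
```

Both ends of the range fall between 6% and 7% of the SNV rate, consistent with the rounded figure in the text.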
Whole-genome sequence data from the 1000 Genomes Project suggested that each individual carries approximately one-tenth the number of indels compared with SNVs [31], but comparison of two Sanger-sequenced human genomes suggested a ratio closer to one-fifth [32]. The estimates from short-read sequencing must be considered conservative because short reads harboring indels are difficult to map, especially in the repetitive and low-complexity regions of the genome where this type of variation is enriched.

In addition to indels, several recent studies have focused on the rate of MEIs. The MEI rate has been estimated to be approximately 2.5 × 10^-2 per genome per generation or 1 in 20 births (for the active retrotransposons: Alu, L1, and SVA) [33] (Figure 1). It should be noted that comparative analyses of great ape genomes have suggested that this rate has varied radically in different lineages over the past 15 million years of human–great ape evolution. Unlike SNVs, the rate of MEIs has been far less clocklike over the course of evolution [34]. Within the human lineage, the insertions of Alus constitute most MEI events with a rate of 2–4.6 × 10^-2 per genome per generation or approximately 1 in 20 births [33,35], whereas L1 and SVA insertions are rarer, occurring at 3–4 × 10^-3 per genome per generation (1 per approximately 100–150 births) [33,36] and 6.5 × 10^-4 per genome per generation (1 per 770 births) [33], respectively. However, these rates were primarily calculated indirectly using assumptions of the SNV mutation rate; therefore, additional studies based on direct estimates from families are warranted. Given the low frequency of such occurrences and biases in terms of their integration into AT-rich and repetitive DNA, such analyses will require very large sample sizes and deeply sequenced genomes, preferably with long reads, to provide a reliable estimate.
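The conversion between a rate "per genome per generation" and "1 in N births" can be made explicit. The Python sketch below assumes that the quoted rates are per haploid genome, so that each birth samples two gametes; this factor of two is inferred from the paired figures in the text (2.5 × 10^-2 with 1 in 20, and 6.5 × 10^-4 with 1 per 770) rather than stated by the authors, and the function name is ours.

```python
def births_per_event(rate_per_haploid_genome):
    """Average number of births per new insertion.

    Assumes the rate is per haploid genome (gamete); each birth
    combines two gametes, hence the factor of two.
    """
    return 1 / (2 * rate_per_haploid_genome)


all_mei = births_per_event(2.5e-2)  # Alu, L1, and SVA combined
sva = births_per_event(6.5e-4)      # SVA alone

print(round(all_mei))  # 20
print(round(sva))      # 769
```

The same convention matches the CNV and SNV figures given earlier (8–25 kbp per gamete versus 16–50 kbp per birth; 30.5 bp per gamete versus 61 bp per birth). The rounded birth figures quoted for Alu and L1 are only approximate under this conversion.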
Several loci in the genome are especially prone to mutation, including microsatellites [37], rDNA gene clusters [38], and segmental duplications (SDs) [39,40]. A recent genome-wide analysis of over 2000 known microsatellites in over 24,000 Icelandic trios revealed a mutation rate of 2.73 × 10^-4 mutations per locus per generation for dinucleotide repeats and approximately 10 × 10^-4 mutations per locus per generation for tetranucleotide repeats [12], which is similar to original projections based on population genotype data and Mendelian inconsistencies in families [37,41]. It is important to note that this rate is several orders of magnitude greater than the rate for SNVs (base for base), underscoring the fact that microsatellites are an extraordinary reservoir of new mutation. In addition, the mutation rate of individual microsatellites increases with average allele length and repeat uniformity, likely because it is easier for DNA polymerase to slip on longer, purer repeats [12,37,42,43] (reviewed in [44]; Figure 2). Interestingly, there are length constraints on di- and tetranucleotide repeats where very long alleles tend to mutate to short ones and vice versa [12]; in contrast, studies of loci associated with trinucleotide repeat disorders indicate a polarity toward increasing length, where mutability depends on the length and purity of the repeat tract (reviewed in [45]). This property, where the increasing repeat length increases the probability of new mutation, has been described as dynamic mutation in contrast to the bulk of static mutations in the human genome [46].

Although generated by a different mechanism involving nonallelic homologous recombination (NAHR; Figure 2), clusters of ribosomal RNA genes (rDNA), centromeric satellites, and SDs also show extraordinary rates of mutation. The mutation rate for rDNA is estimated to be 0.11 per gene cluster per generation, leading to an incredible diversity of rDNA alleles [38].
Centromeric satellites are also large regions of highly duplicated DNA where unequal crossover is rampant [47,48]. The mutability of these regions gives rise to large differences in chromosomal length among individuals [49]; however, the repetitive nature of these regions has made them historically difficult to study other than by Southern blot and pulsed-field gel electrophoresis [50]. There are emerging data that SDs similarly are highly dynamic regions of the genome and prone to recurrent mutation. Copy number polymorphisms (CNPs), for example, are significantly enriched in regions of SDs [51,52]; 90% of CNP genes map to SDs [53,54]. Similar to satellites and rDNA, this bias is due, in large part, to the propensity for these segments to undergo NAHR [55–57]. As a result, CNPs in SDs are less likely to be in linkage disequilibrium with nearby SNPs [58,59]. In addition, significant overlap between CNV loci in humans and nonhuman primates is likely due to recurrent mutation rather than ancestral polymorphism [60,61].
  • 26. Nonrandom distribution of new mutations
Given the tendency for certain types of loci to mutate, it is not surprising that new SNV and CNV mutations are not random. Several reported and predicted properties of new SNVs have been confirmed in recent genome-wide analyses. First, transitions outnumber transversions by twofold for de novo SNVs [4,5,30]. The rate of mutation at CpG dinucleotides has been observed to be 10- to 18-fold the rate of non-CpG dinucleotides [3,6,7,30]. CpG dinucleotides are predicted to be more mutagenic because these are preferential sites of cytosine methylation, and spontaneous deamination of 5-methylcytosine yields thymine and, thus, creates a cytosine to thymine mutation (Figure 2). Considering that most estimates of de novo mutation rate have been based on sequencing technology that biases against particularly GC-rich DNA [31,62], these current estimates probably represent a lower boundary.

Several different properties besides GC content have been associated with variation in mutation rate, including nucleosome occupancy and DNaseI hypersensitivity, replication timing, recombination rate, transcription, and repeat content [8,63–68]. The higher mutation rates reported in or near protein-coding regions may be explained in part by the higher GC content of these regions [13,15,16] in combination with the effects of transcription-associated mutations [67]. Interestingly, a recent study of human RNA-seq data and human–macaque divergence found that an increase of twofold in gene expression leads to a 15% increase in mutation due to transcription-associated mutagenesis (TAM) [67]. In addition, there is a strand asymmetry in mutations in transcribed regions of the genome where mutations induced from DNA damage (C to T, A to G, G to T, and A to T) are increased on the nontranscribed strand, likely due to exposure of single-stranded DNA during transcription [66,67,69].
The transcribed strand, by contrast, is subject to RNA polymerase stalling leading to the recruitment of transcription-coupled repair (TCR) machinery, which corrects some mutations (reviewed in [70]). The opposing forces of TAM and TCR lead to a bias toward G and T bases on the coding strand [67,69].

Figure 2. Common mechanisms leading to biases in mutation. (A) CpG dinucleotides are the sites of cytosine methylation and frequent mutation. 5-methyl-cytosine can be deaminated to thymine (red). This mutation can either be repaired by mismatch repair pathways (reviewed in [121]) or be replicated to yield a cytosine to thymine mutation. (B) Indels can occur by polymerase slippage during replication if these events are not repaired by mismatch repair (reviewed in [121]), especially in regions of low complexity, such as microsatellites. Replication slippage is shown (red) on the newly synthesized strand leading to an insertion. (C) Regions flanked by highly identical segmental duplications (SDs; black boxes) are prone to nonallelic homologous recombination (NAHR). Recombination between homologous chromosomes (blue and magenta) occurs in paralogous regions, leading to duplication of genes ABC in one of the recombined chromosomes and deletion on the other. (D) Replicated homologous chromosomes are shown in black and gray. Premature loss of cohesion between sister chromatids can lead to separation of chromatids in meiosis I (black), leading to cells with only one chromatid or three chromatids. Trisomy results after meiosis II, when one gamete ends up with an extra chromatid (red).

Recent whole-genome sequencing studies have confirmed the nonrandomness of mutations, which has been reported as an enrichment for clustered de novo SNVs. It was recently reported that 2–3% of de novo SNVs are part of multinucleotide mutations, or mutations within 20 bp of another de novo SNV [71]. Similarly, a recent study reported an enrichment of SNVs (2% of de novo variants) within 10 kbp that could not be fully explained by GC content or multinucleotide mutations [7]. Finally, other recent work [8] confirmed previous reports of large deviations in the distribution of de novo SNVs compared with
  • 27. what would be expected under a model of random mutation [66,72]. These studies suggest that a model of random SNV mutation is inaccurate at many different levels. With additional genome-wide mutation rate data, it should also be possible to assign local SNV mutation rates across the genome. Such biases are critical to assessing the significance of new mutations at a locus-specific level with respect to disease [73], especially as the community begins to explore the noncoding landscape. Similar to SNVs, new CNVs are nonrandomly distributed. Long stretches of highly paralogous sequences (SDs or low copy repeats) in direct orientation predispose to NAHR, which leads to deletions and duplications of the intervening sequence [39,40] (Figure 2). NAHR accounts for a greater fraction of large CNVs and contributes little to the formation of smaller (<50 kbp) CNVs [23,74], which are thought to arise as a result of errors in replication or microhomology-mediated mutation [75–78]. Loci flanked by paralogous sequences have significantly higher rates of CNV mutation compared with loci outside of these regions [51,79], and many of the CNVs in these regions have been strongly associated with diseases, including developmental delay, autism, and epilepsy (reviewed in [80]). Within loci flanked by SDs, there are differences in the rates of CNV formation. These differences are largely due to the presence of directly oriented SDs and the size and level of sequence identity of the flanking duplications. Thus, larger and more identical duplications provide better substrates for NAHR, leading to higher rates of CNV formation [81,82] (Figure 2). Moreover, as the size of CNVs increases, so does the probability that the variants occurred de novo, reflecting the effect of strong selection against such large variants [82] (Figure 3). Interestingly, NAHR ‘hotspots’ often show structural variation in the flanking SDs that mediate the NAHR events.
These structural variants lead to haplotypes that are either prone to or protected from recurrent deletion because of differences in the genomic architecture and content of the flanking SDs [79,83–86]. Interestingly, many of these ‘structural’ haplotypes occur at different frequencies among human populations, leading to differences in ethnic predilection to recurrent CNVs and disease [86,87]. Parental bias and paternal age effects It has long been hypothesized and observed that more mutations arise in the paternal germline [2,88], and this difference is thought to be due to the larger number and continuous nature of cell divisions in spermatogenesis. Female eggs arise from a finite number of 22–33 cell divisions, whereas the number of cell divisions leading to male sperm increases monotonically, by one division every 15–16 days, as a result of mitotic maintenance of the spermatogonial pool (reviewed in [89]). The dependence of SNV mutation on replication dictates an increase in mutations with advancing paternal age [88]. Whole-genome and whole-exome sequencing studies have confirmed the paternal bias for SNVs. The combined studies report that 76% (95% binomial CI = 73–80%) of new mutations arise in the paternal germline based on 497 new mutations where the parental origin has been ascertained [6–8,15]. Multiple studies have confirmed that the number of de novo mutations increases with the age of the father [6,8,15]. Yet, the data remain conflicted on the magnitude and model of this effect (Figure 4). In one study of the whole-genome sequences of two parent–offspring trios, for example, a paternal bias was observed in one trio and a maternal bias in the other [5]. If the increase in de novo mutations were solely due to the increased number of cell divisions in sperm production as a man ages, then a linear relation between paternal age and number of mutations would be expected.
The data from these recent publications are not inconsistent with a linear model that estimates that the number of mutations increases by one to two mutations per year of the father’s life [6,8]. However, others have suggested that an exponential increase of approximately 3% per year may be a slightly better fit for these data [6]. Further studies with larger ranges of paternal ages (especially older fathers) are needed to resolve this issue. An important consideration in paternal bias and age effects is the selective potential of de novo mutations on spermatogonial cells. Recent analysis has revealed that mutations in several genes [e.g., encoding fibroblast growth factor receptor 2 and 3 (FGFR2 and FGFR3), v-Ha-ras Harvey rat sarcoma viral oncogene homolog (HRAS), and tyrosine-protein phosphatase nonreceptor type 11 (PTPN11)] likely confer growth advantages to [Figure 3 image. Axes: minimum size of call (Mbp) on the x-axis, number of CNVs on the left-hand y-axis, de novo proportion on the right-hand y-axis.] Figure 3. Larger copy number variants (CNVs) are more likely to be de novo. Size distributions of CNVs from over 15 000 children with developmental delay are plotted. Inherited CNVs are in black and de novo CNVs are in red, with the number of CNVs on the left-hand y-axis. The proportion of CNVs that are de novo is plotted in blue with the de novo proportion on the right-hand y-axis. Reproduced from [82].
  • 28. spermatogonial cells, leading to further proliferation of sperm carrying those mutations, even though mutations in these genes lead to autosomal dominant disorders at the organismal level, including Apert syndrome (FGFR2) and achondroplasia (FGFR3) [90,91]. A strong ‘paternal age effect’ has been observed for these disorders [92,93], with mutations in these genes arising at a rate exceeding linear expectation [94,95]. Mutations associated with these disorders are almost exclusively paternal (95–100%), gain-of-function missense mutations. These observations are consistent with a model of selfish spermatogonial selection, where mutations confer growth advantages to spermatogonial cells, leading to a clonal proliferation in the testis that, in turn, contributes disproportionately to the number of mutant sperm as a man ages [90,91]. These genes are likely the reason that previous studies focused on select autosomal dominant loci estimated a faster than linear increase of mutations with paternal age [94,95]. With the exception of a few loci such as these, the available data are consistent with a linear increase of mutations with advancing paternal age [6,8], primarily as a result of increased cell division and replication errors. In addition to SNVs, other forms of genetic variation have been assessed for parental origin and association with increased parental age. Similar to SNVs, a strong paternal bias has also been reported for mutations at microsatellites with a paternal to maternal ratio of 3.3:1. Once again, the number of microsatellite mutations increases linearly with paternal age [12]. Parental origin has also been assessed for structural variation, albeit limited to children with developmental delay where parental data were available. A paternal bias has been observed for large chromosomal rearrangements visible by microscopy, including deletions, duplications, and translocations [96].
Similarly, CNVs (>150 kbp) have also been reported to have a paternal bias, with 90 out of 118 de novo CNVs arising on the paternal haplotype (76%; binomial 95% CI = 69–84%) [22]. This result is driven primarily by mechanisms other than NAHR; for NAHR-mediated events, no significant difference is found between paternal and maternal origin. Similar to SNVs, the number of non-NAHR CNVs increased with paternal age [22]. So far, the only exception to the rule of paternal origin for new mutations and increase with paternal age is chromosomal aneuploidy, including Down syndrome (trisomy of chromosome 21), where most mutations originate in the maternal germline and the risk of aneuploidy increases exponentially with maternal age (reviewed extensively in [26,27]). New mutations, selection, and human disease There has been much recent interest in identifying de novo mutations that play a role in the development of human disease; knowledge of the patterns of human mutation is critical to the interpretation of these studies. Some broad themes are beginning to emerge. First, it is clear that deleterious de novo mutations contribute significantly to human disease and probably have played a more important role in all diseases than previously anticipated as a result of the super-exponential increase in the human population over the past 5000 years [97–99]. Exome sequencing revealed an increase in the number of de novo loss-of-function SNVs in individuals with autism [15,16,100] and schizophrenia [101]. The story is similar for CNVs, where individuals with neurocognitive diseases show an increase in de novo CNVs [79,102–104]. Interestingly, individuals with autism in families with multiple affected individuals also show an increased number of de novo CNVs compared with their siblings, even though the multiplex nature of these families would suggest a primarily inherited model of disease [21].
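The paternal-bias interval quoted above for de novo CNVs (90 of 118 events) can be checked with a few lines of arithmetic. The following is a minimal Python sketch, assuming a normal-approximation (Wald) interval; the review does not state which interval method was used, and the function name wald_binomial_ci is introduced here only for illustration.

```python
import math


def wald_binomial_ci(successes, trials, z=1.959964):
    """Return (proportion, lower, upper) for a normal-approximation CI.

    Assumption: a Wald interval; the source only says "binomial 95% CI".
    z defaults to the two-sided 95% standard normal quantile.
    """
    p = successes / trials
    half_width = z * math.sqrt(p * (1.0 - p) / trials)
    # Clamp to [0, 1] because the Wald interval can overshoot near the edges.
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# 90 of 118 de novo CNVs were assigned to the paternal haplotype [22].
proportion, lower, upper = wald_binomial_ci(90, 118)
print(f"paternal fraction {proportion:.1%}, 95% CI {lower:.1%} to {upper:.1%}")
```

Under this assumption the bounds round to 69% and 84%, in line with the quoted interval; an exact or Wilson interval would give slightly different bounds.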
Given the data that de novo SNVs contribute to disease in combination with an increase in mutation rate with paternal age, there has been considerable discussion regarding the effect of paternal age on disease [105]. However, it is important to consider the potential magnitude of this effect, which is likely to be modest. Even if there are two new mutations per year of paternal age or a doubling of mutations every 16.5 years [6], most of these new mutations will be neutral and not contribute to disease. These data are consistent with epidemiological data that suggest a modest, albeit significant, increase in prevalence of disease in children from older fathers: there is a twofold increase in the relative risk of autism for a child of a father over 55 years of age when compared with a child of a father less than 29 years of age [106]. The notable exceptions are diseases caused by mutations in spermatogonial selection genes, where the effect of paternal age is more pronounced [91]. Inferring dates of human evolution The increasing number of direct analyses in human families has led to discussion aimed at resolving these new rate [Figure 4 image. Axes: paternal age on the x-axis, number of mutations (genome) on the left-hand y-axis, number of mutations (exome) on the right-hand y-axis; fitted slopes of 2.01 mutations per year, 1.02 mutations per year, and 0.04 exonic mutations per year.] Figure 4. Relation between paternal age and de novo mutations. Current fitted models are shown of the increase in single nucleotide variant (SNV) mutations with paternal age from whole-exome and whole-genome sequencing of parent–offspring trios. There is some difference between the studies with regard to the magnitude of this effect, but sample sizes were relatively low and more studies, especially with older fathers, are needed to achieve a more precise estimate. The paternal age is on the x-axis, the left-hand y-axis shows the number of mutations per genome per birth and the right-hand y-axis shows the number of mutations per exome per birth.
Exome data from 189 trios yielded an increase of 0.04 exonic mutations per year of paternal age (broken green line) [15]; the smaller number of mutations compared with the whole-genome studies is consistent with the smaller target (protein-coding exons). Whole-genome data from 78 trios yielded an increase of 2.01 mutations per year (blue) [6]. Whole-genome data from ten families yielded an increase of 1.02 mutations per year (red) [8].
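The linear slope reported for Figure 4 and the exponential alternative discussed in the text can be compared numerically. The sketch below is illustrative only: the reference point (60 mutations at a paternal age of 30) is a placeholder chosen here, not a value taken from the review, and applying the 16.5-year doubling time to the whole mutation count is a simplifying assumption. All function names are hypothetical.

```python
def annual_rate_from_doubling(doubling_years):
    """Fractional increase per year implied by a given doubling time."""
    return 2.0 ** (1.0 / doubling_years) - 1.0


def linear_model(age, ref_age, ref_count, per_year):
    """Mutation count that grows by a fixed number per year of paternal age."""
    return ref_count + per_year * (age - ref_age)


def exponential_model(age, ref_age, ref_count, doubling_years):
    """Mutation count that doubles every `doubling_years` of paternal age."""
    return ref_count * 2.0 ** ((age - ref_age) / doubling_years)


# Hypothetical reference point, used only to anchor both curves.
REF_AGE, REF_COUNT = 30.0, 60.0

for age in (20, 30, 40, 50):
    lin = linear_model(age, REF_AGE, REF_COUNT, per_year=2.01)
    exp = exponential_model(age, REF_AGE, REF_COUNT, doubling_years=16.5)
    print(f"paternal age {age}: linear {lin:.1f}, exponential {exp:.1f}")

print(f"implied annual increase: {annual_rate_from_doubling(16.5):.1%}")
```

A 16.5-year doubling time corresponds to an increase of roughly 4.3% per year. With the placeholder reference point, the two curves are similar for younger fathers and separate at older ages, which is consistent with the statement that data from older fathers are needed to distinguish the models.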
  • 29. estimates with our knowledge of important dates in human evolution. This stems from the fact that the mutation rates calculated directly in human families are approximately half of those calculated based on sequence divergence and the fossil record [107,108]. As a result of these updated mutation rates, generation times in the great ape lineages may be longer than previously thought [107]. Taken together, this pushes divergence times further back, and these dates are more in line with the fossil record in some cases but seem implausible in others (see [107,108] for a detailed discussion). However, if mutation rates calculated from whole-genome sequencing of human families represent a lower boundary as discussed above, then rates from direct and indirect approaches would be more concordant and the lengthening of divergence times would be overestimated. Moreover, there is also considerable uncertainty in terms of the effect of paternal age with respect to ancestral populations, and this may account for some of the difference between direct and indirect estimates of mutation rate. Adding to the complexity, there is good evidence that mutation rates have not remained constant over evolutionary time, with a slowdown in hominids that is likely a consequence of generation time [9,109]. Outside of humans, there is little genome-wide data on the extent of this slowdown, even among closely related species. Concluding remarks Over the past few years, genomic technologies have made it possible to obtain direct knowledge concerning rates of human mutation. Recent studies are converging on similar SNV mutation rates, quantifying the male mutation bias and its relation with paternal age. The current rate estimate for SNVs likely represents a lower boundary because of biases in next-generation sequencing technology [31,62] and the stringent filtering required to remove false positive calls.
In addition, we have gained new insight into the mutational properties of large CNVs, their regional biases within the genome, and their genomic impact. However, our understanding of the properties of human mutation is far from complete. Many studies have focused on identifying de novo mutations in individuals with disease, and this may introduce biases in our understanding of the natural processes of mutation. Large studies of individuals from relatively healthy families will provide valuable insight into the general patterns of mutation. It also remains unclear how mutation rate increases with paternal age and how many genes are subject to spermatogonial selection. Many of the recent de novo mutations associated with autism have been found in genes potentially important in cell growth and chromatin modification; it is possible that mutations in these also confer growth advantage in the testis. One approach may be to sequence more families with many children or children born from particularly old fathers. It will also be important to sequence DNA from multigeneration families to understand what fraction of new mutations discovered specifically in the blood are transmitted to the next generation. In light of the importance of new mutations in understanding evolution, efforts to sequence genomes from nonhuman primate families should be a high priority to understand how the rate has changed in different lineages. Although these were discussed briefly, we still lack reliable estimates of the mutation rate and the complexity of short indels and smaller CNVs, especially those mapping within SDs. One promising approach would be to use sequencing of large-insert clones to phase long haplotypes fully [110], which would allow parental origin to be determined for all de novo mutations and enable better interpretation of indels.
Understanding the mutation rate of SDs and centromeric satellite sequences will likely require single-molecule sequencing with very long reads (>50 kbp) [111,112] and accurate de novo assembly. Although we are beginning to understand the pattern of germline mutation, somatic mutation processes are largely unknown outside of cancer studies. Somatic mutations, however, have the potential to contribute to diseases other than cancer and may be subjected to different mutational biases as a result of differences in repair and replication between meiotic and mitotic tissues (reviewed in [113,114]). Such mutations can be identified as genetic differences either between tissues from the same donor or between monozygotic twins. Given the proportion of the somatic mutation compared with the germline alleles in a population of cells or a tissue sample and with some assumptions, one can currently estimate approximately where in development the mutation occurred [114,115]. There is compelling evidence that somatic structural variants accumulate with age, likely as a result of an increasing number of replication copy errors [116]. The continued development of single-cell whole-genome sequencing technologies will revolutionize this area of research. It has already enabled analysis of somatic mutations in tumor samples [117] and embryos [118], as well as haplotype phasing of individual cells [119]. Its application to sperm and egg will enable the calculation of the true germline mutation rate and provide data on effects of positive and negative selection of mutations within germ cells. Such technologies coupled with advances in genome sequencing will ultimately allow scientists to generate ontogenic maps of mutation tracking the origin and fate of somatic mutations during the development of organisms. Acknowledgments We thank Santhosh Girirajan and Bradley Coe for sharing data and figures.
We are grateful to Andrew Wilkie, Anne Goriely, and Peter Sudmant for helpful discussions and to Tonia Brown for assistance with manuscript preparation. We would like to thank Jacob Michaelson and Jonathan Sebat for sharing a prepublication version of their manuscript. C.D.C. was supported by a Ruth L. Kirschstein National Research Service Award (NRSA; F32HG006070). E.E.E. is an Investigator of the Howard Hughes Medical Institute. References 1 Haldane, J.B.S. (1935) The rate of spontaneous mutation of a human gene. J. Genet. 31, 317–326 2 Haldane, J.B. (1947) The mutation rate of the gene for haemophilia, and its segregation ratios in males and females. Ann. Eugen. 13, 262–271 3 Kondrashov, A.S. (2003) Direct estimates of human per nucleotide mutation rates at 20 loci causing Mendelian diseases. Hum. Mutat. 21, 12–27 4 Roach, J.C. et al. (2010) Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science 328, 636–639 5 Conrad, D.F. et al. (2011) Variation in genome-wide mutation rates within and between human families. Nat. Genet. 43, 712–714
  • 30. 6 Kong, A. et al. (2012) Rate of de novo mutations and the importance of father’s age to disease risk. Nature 488, 471–475 7 Campbell, C.D. et al. (2012) Estimating the human mutation rate using autozygosity in a founder population. Nat. Genet. 44, 1277–1281 8 Michaelson, Jacob J. et al. (2012) Whole-genome sequencing in autism identifies hot spots for de novo germline mutation. Cell 151, 1431–1442 9 Li, W.H. and Tanimura, M. (1987) The molecular clock runs more slowly in man than in apes and monkeys. Nature 326, 93–96 10 Nachman, M.W. and Crowell, S.L. (2000) Estimate of the mutation rate per nucleotide in humans. Genetics 156, 297–304 11 Scally, A. et al. (2012) Insights into hominid evolution from the gorilla genome sequence. Nature 483, 169–175 12 Sun, J.X. et al. (2012) A direct characterization of human mutation based on microsatellites. Nat. Genet. 44, 1161–1165 13 Awadalla, P. et al. (2010) Direct measure of the de novo mutation rate in autism and schizophrenia cohorts. Am. J. Hum. Genet. 87, 316–324 14 Neale, B.M. et al. (2012) Patterns and rates of exonic de novo mutations in autism spectrum disorders. Nature 485, 242–245 15 O’Roak, B.J. et al. (2012) Sporadic autism exomes reveal a highly interconnected protein network of de novo mutations. Nature 485, 246–250 16 Sanders, S.J. et al. (2012) De novo mutations revealed by whole-exome sequencing are strongly associated with autism. Nature 485, 237–241 17 Scherer, S.W. et al. (2007) Challenges and standards in integrating surveys of structural variation. Nat. Genet. 39, S7–S15 18 Lupski, J.R. (2007) Genomic rearrangements and sporadic disease. Nat. Genet. 39, S43–S47 19 Turner, D.J. et al. (2008) Germline rates of de novo meiotic deletions and duplications causing several genomic disorders. Nat. Genet. 40, 90–95 20 Egan, C.M. et al. (2007) Recurrent DNA copy number variation in the laboratory mouse. Nat. Genet. 39, 1384–1389 21 Itsara, A. et al. 
(2010) De novo rates and selection of large copy number variation. Genome Res. 20, 1469–1481 22 Hehir-Kwa, J.Y. et al. (2011) De novo copy number variants associated with intellectual disability have a paternal origin and age bias. J. Med. Genet. 48, 776–778 23 Conrad, D.F. et al. (2010) Origins and functional impact of copy number variation in the human genome. Nature 464, 704–712 24 Chimpanzee Sequencing and Analysis Consortium (2005) Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437, 69–87 25 Marques-Bonet, T. et al. (2009) A burst of segmental duplications in the genome of the African great ape ancestor. Nature 457, 877–881 26 Nagaoka, S.I. et al. (2012) Human aneuploidy: mechanisms and new insights into an age-old problem. Nat. Rev. Genet. 13, 493–504 27 Hassold, T. and Hunt, P. (2001) To err (meiotically) is human: the genesis of human aneuploidy. Nat. Rev. Genet. 2, 280–291 28 Henderson, S.A. and Edwards, R.G. (1968) Chiasma frequency and maternal age in mammals. Nature 218, 22–28 29 Angell, R.R. (1991) Predivision in human oocytes at meiosis I: a mechanism for trisomy formation in man. Hum. Genet. 86, 383–387 30 Lynch, M. (2010) Rate, molecular spectrum, and consequences of human mutation. Proc. Natl. Acad. Sci. U.S.A. 107, 961–968 31 The 1000 Genomes Project Consortium (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 32 Chen, J.Q. et al. (2009) Variation in the ratio of nucleotide substitution and indel rates across genomes in mammals and bacteria. Mol. Biol. Evol. 26, 1523–1531 33 Stewart, C. et al. (2011) A comprehensive map of mobile element insertion polymorphisms in humans. PLoS Genet. 7, e1002236 34 Locke, D.P. et al. (2011) Comparative and demographic analysis of orang-utan genomes. Nature 469, 529–533 35 Cordaux, R. et al. (2006) Estimating the retrotransposition rate of human Alu elements. Gene 373, 134–137 36 Ray, D.A. and Batzer, M.A. 
(2011) Reading TE leaves: new approaches to the identification of transposable element insertions. Genome Res. 21, 813–820 37 Weber, J.L. and Wong, C. (1993) Mutation of human short tandem repeats. Hum. Mol. Genet. 2, 1123–1128 38 Stults, D.M. et al. (2008) Genomic architecture and inheritance of human ribosomal RNA gene clusters. Genome Res. 18, 13–18 39 Lupski, J.R. (1998) Genomic disorders: structural features of the genome can lead to DNA rearrangements and human disease traits. Trends Genet. 14, 417–422 40 Bailey, J.A. et al. (2002) Recent segmental duplications in the human genome. Science 297, 1003–1007 41 Whittaker, J.C. et al. (2003) Likelihood-based estimation of microsatellite mutation rates. Genetics 164, 781–787 42 Eichler, E.E. et al. (1994) Length of uninterrupted CGG repeats determines instability in the FMR1 gene. Nat. Genet. 8, 88–94 43 Ballantyne, K.N. et al. (2010) Mutability of Y-chromosomal microsatellites: rates, characteristics, molecular bases, and forensic implications. Am. J. Hum. Genet. 87, 341–353 44 Ellegren, H. (2004) Microsatellites: simple sequences with complex evolution. Nat. Rev. Genet. 5, 435–445 45 McMurray, C.T. (2010) Mechanisms of trinucleotide repeat instability during human development. Nat. Rev. Genet. 11, 786–799 46 Richards, R.I. and Sutherland, G.R. (1997) Dynamic mutation: possible mechanisms and significance in human disease. Trends Biochem. Sci. 22, 432–436 47 Waye, J.S. and Willard, H.F. (1986) Structure, organization, and sequence of alpha satellite DNA from human chromosome 17: evidence for evolution by unequal crossing-over and an ancestral pentamer repeat shared with the human X chromosome. Mol. Cell. Biol. 6, 3156–3165 48 Alkan, C. et al. (2004) The role of unequal crossover in alpha-satellite DNA evolution: a computational analysis. J. Comput. Biol. 11, 933–944 49 Mahtani, M.M. and Willard, H.F. 
(1990) Pulsed-field gel analysis of alpha-satellite DNA at the human X chromosome centromere: high-frequency polymorphisms and array size estimate. Genomics 7, 607–613 50 Warburton, P.E. and Willard, H.F. (1990) Genomic analysis of sequence variation in tandemly repeated DNA. Evidence for localized homogeneous sequence domains within arrays of alpha-satellite DNA. J. Mol. Biol. 216, 3–16 51 Sharp, A.J. et al. (2005) Segmental duplications and copy-number variation in the human genome. Am. J. Hum. Genet. 77, 78–88 52 Redon, R. et al. (2006) Global variation in copy number in the human genome. Nature 444, 444–454 53 Bailey, J.A. and Eichler, E.E. (2006) Primate segmental duplications: crucibles of evolution, diversity and disease. Nat. Rev. Genet. 7, 552–564 54 Bailey, J.A. et al. (2008) Human copy number polymorphic genes. Cytogenet. Genome Res. 123, 234–243 55 Conrad, D.F. et al. (2010) Mutation spectrum revealed by breakpoint sequencing of human germline CNVs. Nat. Genet. 42, 385–391 56 Kidd, J.M. et al. (2010) A human genome structural variation sequencing resource reveals insights into mutational mechanisms. Cell 143, 837–847 57 Mills, R.E. et al. (2011) Mapping copy number variation by population-scale genome sequencing. Nature 470, 59–65 58 Locke, D.P. et al. (2006) Linkage disequilibrium and heritability of copy-number polymorphisms within duplicated regions of the human genome. Am. J. Hum. Genet. 79, 275–290 59 Campbell, C.D. et al. (2011) Population-genetic properties of differentiated human copy-number polymorphisms. Am. J. Hum. Genet. 88, 317–332 60 Perry, G.H. et al. (2006) Hotspots for copy number variation in chimpanzees and humans. Proc. Natl. Acad. Sci. U.S.A. 103, 8006–8011 61 Lee, A.S. et al. (2008) Analysis of copy number variation in the rhesus macaque genome identifies candidate loci for evolutionary and human disease studies. Hum. Mol. Genet. 17, 1127–1136 62 Bentley, D.R. et al.
(2008) Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 63 Stamatoyannopoulos, J.A. et al. (2009) Human mutation rate associated with DNA replication timing. Nat. Genet. 41, 393–395 64 Ying, H. et al. (2010) Evidence that localized variation in primate sequence divergence arises from an influence of nucleosome placement on DNA repair. Mol. Biol. Evol. 27, 637–649 65 Chen, C.L. et al. (2010) Impact of replication timing on non-CpG and CpG substitution rates in mammalian genomes. Genome Res. 20, 447–457
  • 31. 66 Hodgkinson, A. and Eyre-Walker, A. (2011) Variation in the mutation rate across mammalian genomes. Nat. Rev. Genet. 12, 756–766 67 Park, C. et al. (2012) Genomic evidence for elevated mutation rates in highly expressed genes. EMBO Rep. 13, 1123–1129 68 Koren, A. et al. (2012) Differential relationship of DNA replication timing to different forms of human mutation and variation. Am. J. Hum. Genet. 91, 1033–1040 69 Green, P. et al. (2003) Transcription-associated mutational asymmetry in mammalian evolution. Nat. Genet. 33, 514–517 70 Hanawalt, P.C. and Spivak, G. (2008) Transcription-coupled DNA repair: two decades of progress and surprises. Nat. Rev. Mol. Cell Biol. 9, 958–970 71 Schrider, D.R. et al. (2011) Pervasive multinucleotide mutational events in eukaryotes. Curr. Biol. 21, 1051–1054 72 Matassi, G. et al. (1999) Chromosomal location effects on gene sequence evolution in mammals. Curr. Biol. 9, 786–791 73 O’Roak, B.J. et al. (2012) Multiplex targeted sequencing identifies recurrently mutated genes in autism spectrum disorders. Science 338, 1619–1622 74 Kidd, J.M. et al. (2010) Characterization of missing human genome sequences and copy-number polymorphic insertions. Nat. Methods 7, 365–371 75 Smith, C.E. et al. (2007) Template switching during break-induced replication. Nature 447, 102–105 76 Lee, J.A. et al. (2007) A DNA replication mechanism for generating nonrecurrent rearrangements associated with genomic disorders. Cell 131, 1235–1247 77 Payen, C. et al. (2008) Segmental duplications arise from Pol32-dependent repair of broken forks through two alternative replication-based mechanisms. PLoS Genet. 4, e1000175 78 Hastings, P.J. et al. (2009) Mechanisms of change in gene copy number. Nat. Rev. Genet. 10, 551–564 79 Sharp, A.J. et al. (2006) Discovery of previously unidentified genomic disorders from the duplication architecture of the human genome. Nat. Genet. 38, 1038–1042 80 Mefford, H.C. and Eichler, E.E.
(2009) Duplication hotspots, rare genomic disorders, and common disease. Curr. Opin. Genet. Dev. 19, 196–204 81 Liu, P. et al. (2011) Frequency of nonallelic homologous recombination is correlated with length of homology: evidence that ectopic synapsis precedes ectopic crossing-over. Am. J. Hum. Genet. 89, 580–588 82 Cooper, G.M. et al. (2011) A copy number variation morbidity map of developmental delay. Nat. Genet. 43, 838–846 83 Osborne, L.R. et al. (2001) A 1.5 million-base pair inversion polymorphism in families with Williams-Beuren syndrome. Nat. Genet. 29, 321–325 84 Koolen, D.A. et al. (2006) A new chromosome 17q21.31 microdeletion syndrome associated with a common inversion polymorphism. Nat. Genet. 38, 999–1001 85 Zody, M.C. et al. (2008) Evolutionary toggling of the MAPT 17q21.31 inversion region. Nat. Genet. 40, 1076–1108 86 Antonacci, F. et al. (2010) A large, complex structural polymorphism at 16p12.1 underlies microdeletion disease risk. Nat. Genet. 42, 745–750 87 Steinberg, K.M. et al. (2012) Structural diversity and African origin of the 17q21.31 inversion polymorphism. Nat. Genet. 44, 872–880 88 Crow, J.F. (2000) The origins, patterns and implications of human spontaneous mutation. Nat. Rev. Genet. 1, 40–47 89 Hurst, L.D. and Ellegren, H. (1998) Sex biases in the mutation rate. Trends Genet. 14, 446–452 90 Goriely, A. et al. (2003) Evidence for selective advantage of pathogenic FGFR2 mutations in the male germ line. Science 301, 643–646 91 Goriely, A. and Wilkie, A.O. (2012) Paternal age effect mutations and selfish spermatogonial selection: causes and consequences for human disease. Am. J. Hum. Genet. 90, 175–200 92 Cohen, M.M., Jr et al. (1992) Birth prevalence study of the Apert syndrome. Am. J. Med. Genet. 42, 655–659 93 Orioli, I.M. et al. (1995) Effect of paternal age in achondroplasia, thanatophoric dysplasia, and osteogenesis imperfecta. Am. J. Med. Genet. 59, 209–217 94 Risch, N. et al. 
(1987) Spontaneous mutation and parental age in humans. Am. J. Hum. Genet. 41, 218–248 95 Crow, J.F. (1997) The high spontaneous mutation rate: is it a health risk? Proc. Natl. Acad. Sci. U.S.A. 94, 8380–8386 96 Thomas, N.S. et al. (2006) Parental and chromosomal origin of unbalanced de novo structural chromosome abnormalities in man. Hum. Genet. 119, 444–450 97 Nelson, M.R. et al. (2012) An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people. Science 337, 100–104 98 Keinan, A. and Clark, A.G. (2012) Recent explosive human population growth has resulted in an excess of rare genetic variants. Science 336, 740–743 99 Tennessen, J.A. et al. (2012) Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337, 64–69 100 Iossifov, I. et al. (2012) De novo gene disruptions in children on the autistic spectrum. Neuron 74, 285–299 101 Xu, B. et al. (2011) Exome sequencing supports a de novo mutational paradigm for schizophrenia. Nat. Genet. 43, 864–868 102 de Vries, B.B. et al. (2005) Diagnostic genome profiling in mental retardation. Am. J. Hum. Genet. 77, 606–616 103 Sebat, J. et al. (2007) Strong association of de novo copy number mutations with autism. Science 316, 445–449 104 Walsh, T. et al. (2008) Rare structural variants disrupt multiple genes in neurodevelopmental pathways in schizophrenia. Science 320, 539–543 105 Kondrashov, A. (2012) Genetics: The rate of human mutation. Nature 488, 467–468 106 Hultman, C.M. et al. (2011) Advancing paternal age and risk of autism: new evidence from a population-based study and a meta-analysis of epidemiological studies. Mol. Psychiatry 16, 1203–1212 107 Langergraber, K.E. et al. (2012) Generation times in wild chimpanzees and gorillas suggest earlier divergence times in great ape and human evolution. Proc. Natl. Acad. Sci. U.S.A. 109, 15716–15721 108 Scally, A. and Durbin, R.
(2012) Revising the human mutation rate: implications for understanding human evolution. Nat. Rev. Genet. 13, 745–753 109 Elango, N. et al. (2006) Variable molecular clocks in hominoids. Proc. Natl. Acad. Sci. U.S.A. 103, 1370–1375 110 Kitzman, J.O. et al. (2011) Haplotype-resolved genome sequencing of a Gujarati Indian individual. Nat. Biotechnol. 29, 59–63 111 Branton, D. et al. (2008) The potential and challenges of nanopore sequencing. Nat. Biotechnol. 26, 1146–1153 112 Eid, J. et al. (2009) Real-time DNA sequencing from single polymerase molecules. Science 323, 133–138 113 Erickson, R.P. (2010) Somatic gene mutation and human disease other than cancer: an update. Mutat. Res. 705, 96–106 114 Frank, S.A. (2010) Evolution in health and medicine Sackler colloquium: Somatic evolutionary genomics: mutations during development cause highly variable genetic mosaicism with risk of cancer and neurodegeneration. Proc. Natl. Acad. Sci. U.S.A. 107 (Suppl. 1), 1725–1730 115 Abyzov, A. et al. (2012) Somatic copy number mosaicism in human skin revealed by induced pluripotent stem cells. Nature http:// 116 Forsberg, L.A. et al. (2012) Age-related somatic structural changes in the nuclear genome of human blood cells. Am. J. Hum. Genet. 90, 217–228 117 Navin, N. et al. (2011) Tumour evolution inferred by single-cell sequencing. Nature 472, 90–94 118 Voet, T. et al. (2011) Breakage-fusion-bridge cycles leading to inv dup del occur in human cleavage stage embryos. Hum. Mutat. 32, 783–793 119 Fan, H.C. et al. (2011) Whole-genome molecular haplotyping of single cells. Nat. Biotechnol. 29, 51–57 120 Alkuraya, F.S. (2010) Autozygome decoded. Genet. Med. 12, 765–771 121 Li, G.M. (2008) Mechanisms and functions of DNA mismatch repair. Cell Res. 18, 85–98
Many ways to die, one way to arrive: how selection acts through pregnancy

Elizabeth A. Brown1, Maryellen Ruvolo1, and Pardis C. Sabeti2,3,4

1 Department of Human Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA
2 Center for Systems Biology, Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA
3 Broad Institute of the Massachusetts Institute of Technology and Harvard, Cambridge, MA 02142, USA
4 Department of Immunology and Infectious Diseases, Harvard School of Public Health, Boston, MA 02115, USA

When considering selective forces shaping human evolution, the importance of pregnancy to fitness should not be underestimated. Although specific mortality factors may only impact upon a fraction of the population, birth is a funnel through which all individuals must pass. Human pregnancy places exceptional energetic, physical, and immunological demands on the mother to accommodate the needs of the fetus, making the woman more vulnerable during this time-period. Here, we examine how metabolic imbalances, infectious diseases, oxygen deficiency, and nutrient levels in pregnancy can exert selective pressures on women and their unborn offspring. Numerous candidate genes under selection are being revealed by next-generation sequencing, providing the opportunity to study further the relationship between selection and pregnancy. This relationship is important to consider to gain insight into recent human adaptations to unique diets and environments worldwide.

Selection and pregnancy

Some of the earliest records of mortality from London in John Graunt’s ‘Bills of Mortality’ for 1632 reveal several distinct causes of death in the population [1]. Any specific cause of death affects only a fraction of the population, lessening the importance of each particular factor for fitness (Figure 1). Managing to be born, however, is a universal requirement for fitness.
Thus, factors that influence fecundity and pregnancy are likely to shape human evolution strongly.

The many physiological compromises of pregnancy make it a tremendous challenge for both mothers and infants, and a potential selective force. To provide for the growing fetus, mothers increase blood sugar [2], blood volume, and hemoglobin count [3]; remodel uterine arteries [4]; and decrease vascular resistance [5]. These changes put the mother at risk of diabetes, high blood pressure, strokes, hemorrhaging, and seizures [2,6–8]. Moreover, properties of the immune system are downregulated to prevent immune response to the ‘foreign’ fetus, potentially contributing to the greater susceptibility of pregnant women to infectious disease [9].

These difficulties for mothers also translate into problems for infants: pre-industrial data show that nearly a quarter of babies died during labor and infancy, whereas maternal mortality was nearly 1.5% per birth due to infectious diseases, diabetes, eclampsia, and jaundice [10]. Similarly, modern foraging populations and sub-Saharan African nations in 1970 also had infant mortality rates of 20–25%, in contrast to Norway, for example, at only ~1.6% [11,12]. Maternal mortality in sub-Saharan Africa was ~1.0% in the year 2000 (comparable to 16th and 17th century England) with hemorrhage, hypertension (preeclampsia/eclampsia), and infectious diseases as the major causes. By contrast, maternal mortality in Northern Europe was only 0.02% in the year 2000 [13]. These data from historic, foraging, and developing country populations only serve as rough proxies for the conditions facing humans during recent evolution, but they give some indication of the difficulty of pregnancy experienced by pre-modern foraging and Neolithic populations.

In addition to the challenges of pregnancy, the number of babies a woman births, compounded across generations, can have huge evolutionary impact.
For example, landless Finnish women living 1760–1849 had an average of 4.27 babies, whereas landowning women had an average of 4.55 babies: a change in absolute fitness of this magnitude would cause a geometric rise in the number of descendants in a few generations [14] (Figure 2A). The nutritional benefits of the Industrial Revolution (ca 1880) boosted average Finnish fertility to 5.3 babies [15]. Any such increase in fertility from either environmental or genetic factors will dramatically increase the fitness of women (Figure 2B). An earlier revolution, the development of agriculture and pastoralism, may have conferred similar fertility benefits, especially to women with genetic mutations allowing them to exploit these new resources maximally – lactase persistence, described below, may be an example of this [16]. Furthermore, changes in female fertility could have played an important role during human population migrations. For example, a large study of Québécois settlers indicated that women on the wavefront of territory expansion had a 15–20% fertility advantage, with a heritable component for fertility, suggesting that genes influencing fertility may be shaped by selection [17].

Review
0168-9525/$ – see front matter © 2013 Elsevier Ltd. All rights reserved.
Corresponding author: Sabeti, P.C.
Keywords: selection; pregnancy; human evolution; gestational diabetes; preeclampsia.
Trends in Genetics, October 2013, Vol. 29, No. 10 585
Considering the impact of female fertility alongside the challenges of pregnancy may be critical for understanding recent human adaptations. This review explores how selection may have acted through pressures on mothers and infants during pregnancy given the changing environment, diet, and behavior of the past 10 000 years. These factors are critical to bear in mind as opportunities for evolutionary geneticists to generate new adaptive hypotheses proliferate, fueled by next-generation sequencing data and new statistical tools for predicting adaptive variants in diverse populations.

Metabolic disorders and selection during pregnancy

Theories of human adaptation surrounding metabolic disorders, such as hypertension and type 2 diabetes, are constrained by the fact that these diseases typically strike at post-reproductive ages. The related disorders of gestational diabetes mellitus (GDM) and preeclampsia (hypertension in pregnancy), however, occur precisely during the critical reproductive period of pregnancy. GDM occurs as the maternal blood glucose level rises to nourish the fetus, increasing the risk of maternal diabetes [18]. Preeclampsia occurs as a mother increases blood volume and remodels vasculature for fetal ventilation, raising the risk of maternal hypertension [6]. Women predisposed for these conditions can be pushed into metabolic dysfunction.

GDM and preeclampsia are common diseases, with grave consequences in pregnancy, and thus may strongly impact upon reproductive fitness. GDM affects 4–20% of pregnancies in different populations worldwide [19]. It can cause macrosomia, in which the fetus grows too large to fit through the maternal pelvis [20–23]. Before the advent of caesarian sections (C-sections), GDM could lead to fetal morbidity and mortality, and maternal hemorrhage and tearing during delivery [7,20]. Preeclampsia is the leading cause of maternal mortality worldwide, accounting for 10–19% of deaths [24–26].
It can cause fetal hypoxia and oxidative stress, low birthweight, and maternal hemorrhage and seizures

[Figure 1 image. Panel (A): Reported causes of death in London, 1632 (infant mortality, tuberculosis, fever, poxviruses, teeth, edema, diarrhea, other infections, violent deaths, aged over 60, convulsion, other, unclear, stillborn, childbed). Panel (B): Ten leading causes of death in USA, 2009 (chronic respiratory diseases, flu and pneumonia, kidney disease, accidents, suicide, heart disease, cancer, Alzheimer’s, diabetes, stroke, other).]

Figure 1. Multiple varied causes of death in modern and historic populations. (A) Many different factors caused death for individuals who died in London in 1632 [1]. ‘Childbed’ referred to mothers who died during or after labor, often due to infections. Over a quarter of deaths occurred in infants and unborn fetuses. (B) By contrast, the leading causes of death in modern, developed countries, such as the USA in 2009, are very different, with heart disease and cancer accounting for fully half of the deaths [93].

[Figure 2 image. Panel (A): Fertility in Finnish women; x axis: generations (0–10); y axis: number of daughters (thousands); key: 1880 fertility boost (5.3 babies), 1760–1849 landowning (4.55 babies), 1760–1849 landless (4.27 babies). Panel (B): Selection corresponding to differences in fertility; x axis: generations (0–800); y axis: allele frequency; key: s = 0.126, s = 0.082, s = 0.033.]

Figure 2. Rapid change in prevalence of fertility-enhancing traits. (A) The increase in number of female descendants (y axis in thousands), compounded across generations, for maternal lineages with an average of 5.3, 4.55, or 4.27 babies over a lifetime, based on pre-industrial data on differences in female fertility in Finland [14,15].
(B) The increase in frequency of new mutations conferring fertility advantages that correspond to the differences in fertility for the three groups of Finnish women (selection coefficient s = 0.126 for 5.3 vs 4.27 babies; s = 0.082 for 5.3 vs 4.55 babies; s = 0.033 for 4.55 vs 4.27 babies). This demonstrates how readily any mutation with a positive impact on female reproduction will sweep through a population over a very short time due to the compounding effect across generations.
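The compounding described in the Figure 2 caption can be illustrated with a short calculation. The sketch below is not the authors' model: it assumes, for illustration only, that half of all babies are daughters who all survive to reproduce, and that a fertility-enhancing allele spreads under a deterministic haploid selection model in which its odds grow by a factor of (1 + s) each generation. The starting frequency `p0 = 0.001` and the target frequency are hypothetical values; only the fertility figures and selection coefficients come from the text.

```python
import math


def daughters_after(babies_per_woman, generations, founders=1.0):
    """Female descendants after compounding over generations.

    Assumption (not from the source): half of all babies are daughters
    and every daughter survives to have the same number of babies.
    """
    daughters_per_woman = babies_per_woman / 2.0
    return founders * daughters_per_woman ** generations


def allele_frequency(s, p0, generations):
    """Allele frequency under deterministic haploid selection.

    Each generation p' = p(1 + s) / (1 + p*s), so the odds p / (1 - p)
    are multiplied by (1 + s) per generation.
    """
    odds = p0 / (1.0 - p0) * (1.0 + s) ** generations
    return odds / (1.0 + odds)


def generations_to_reach(s, p0, target):
    """Generations needed for the allele to rise from p0 to target."""
    odds_ratio = (target / (1.0 - target)) / (p0 / (1.0 - p0))
    return math.log(odds_ratio) / math.log(1.0 + s)


if __name__ == "__main__":
    # Fertility values reported for the three groups of Finnish women.
    for babies in (5.3, 4.55, 4.27):
        print(babies, "babies:", round(daughters_after(babies, 10)),
              "female descendants after 10 generations")
    # Selection coefficients quoted in the Figure 2 caption.
    for s in (0.126, 0.082, 0.033):
        t = generations_to_reach(s, p0=0.001, target=0.99)
        print("s =", s, "->", round(t), "generations to reach 99%")
```

Under these assumptions, small differences in per-generation output diverge quickly because they enter as the base of an exponent, which is the qualitative point the figure makes; the exact curves in the published figure depend on model details that the caption does not state.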
(eclampsia) if not treated by premature delivery [24] (see Box 1 for a discussion of high-altitude adaptation and the risks of preeclampsia).

The rates of GDM and preeclampsia vary significantly in different populations, even when controlling for environmental factors such as obesity [27,28]. This raises the possibility that selective pressures during pregnancy have fine-tuned metabolism to suit different environments and diets around the world, resulting in the current distribution of disease prevalence. By contrast, alternative explanations, discussed in Box 2, may also account for these patterns – distinguishing between these competing hypotheses is an important avenue for future research.

Intriguingly, the incidence of GDM in modern populations is inversely related to traditional consumption of dietary components known to increase risk for diabetes and GDM (Table 1). These include high glycemic carbohydrates, which produce large glucose responses in the blood, and dairy products, which produce large insulin responses due to the effect of whey proteins [29–33]. Europeans have the lowest prevalence of GDM in the world – 3.6% in a study of over a million births in New York City (NYC) [19] – but have the longest history of high glycemic diets. In the past 10 000 years European grain-based agriculture increased carbohydrate consumption to roughly 70% of diet, whereas hunter-gatherers consume only 3–50% [34]. In the past 8000 years Europeans also began consuming dairy products in large quantities [35]. By comparison, South Central Asians had a much higher incidence of GDM in the NYC cohort (14.3%), with Bangladeshis the highest at 21.2% [19]. Traditionally, Bangladeshis have had high consumption of fish, a low glycemic food; rice, of moderate glycemic index due to little processing; and no dairy [36,37]. Finally, among African-Americans, the incidence of GDM was intermediate at 4.3% [19].
This is consistent with their admixed ancestry and the mixed consumption of dairy across populations in West Africa, the origin of most US African-Americans.

Given the inverse correlation between traditional consumption of dietary components increasing GDM risk and current incidence of GDM, high glycemic foods and dairy may have acted as selective agents on metabolism during pregnancy. Because GDM is very likely to have a genetic basis – 67% of the risk of type 2 diabetes for adults younger than 60 is heritable [38], and women with GDM have a 7- to 12-fold elevated risk for type 2 diabetes [39,40] – natural selection can act on its underlying risk factors. Therefore, any population environmentally at risk for GDM without access to C-sections should experience selection against genetic risk factors for GDM. Conversely, any population without access to high glycemic food items should experience selection to make blood sugars more available to the fetus, perhaps through increasing insulin resistance by increasing the frequency of genetic risk factors for GDM.

Supporting these predictions, evidence suggests Europeans may have a blunted glycemic response to food

Box 1. Oxygen and selection during pregnancy

Another environmental pressure detrimental to pregnant women is high-altitude hypoxia. When brought to high altitudes, people from sea-level populations increase hemoglobin levels to carry more oxygen to the tissues. With long-term exposure and old age, increased hemoglobin causes altitude sickness and even death. However, pregnant women experience a special danger: preeclampsia caused by oxygen-restriction for the fetus. As described in the main text, preeclampsia often results in premature labor, small birthweight babies, and hemorrhaging, seizures, and death for the mother [24].

Tibetans, Andeans, and the Ethiopian Amhara have each adapted to hypoxic high-altitude conditions possibly due to its impact on pregnancy.
In these populations, strong signatures of selection surround genetic loci related to hypoxia and hemoglobin concentration, including EGLN1, EPAS1, PPARA, THRB, and ARNT2 [94–97]. However, Andeans are still at risk for altitude sickness in old age because they exhibit the same elevated hemoglobin levels of lowlanders at high altitudes, indicating that selection for post-reproductive survival was not the primary force in this population [98]. Even so, some studies find that Andeans and Tibetans giving birth at high altitudes have fewer instances of low fetal birthweights and preeclampsia than do lowlanders at high altitudes, possibly due to increased uterine capillary density [99–101]. Also, some genes under selection among the Amhara are involved in fetal hemoglobin levels (BCL11A) and angiogenesis (AIMP1 and VAV3), an important feature of pregnancy [94]. These pieces of evidence indicate that pressures during pregnancy may have been significant in adapting to high-altitude hypoxia for Tibetans, Andeans, and the Amhara.

Box 2. Alternative hypotheses and avenues of research

Although the evidence described in the main text supports the importance of pregnancy to recent selection in humans, alternative hypotheses could also explain some phenomena that we argue suggest selection in pregnancy. Take, for example, the differences in GDM prevalence across populations, and the inverse correlation with historical glycemic intake. When mothers born in energy-poor environments emigrate to energy-rich environments, fetal programming may contribute to the pattern because these women have heightened risks of GDM and type 2 diabetes [102]. Maternal epigenetic modifications could be the mechanism underlying this programming to suit the early life environment.
Another contributor could be the differences in patterns of adipose storage across populations – Asian women tend to have more central adiposity than women in other populations, and this is thought to increase insulin resistance [103]. However, this proximate cause of increased GDM among Asians is not at odds with a history of natural selection acting on the trait.

Distinguishing among these competing explanations for the patterns we see could be a fruitful line of research. For example, first, one could conduct association studies in diverse ethnic populations to identify genetic loci linked to GDM risk. Second, these loci associated with GDM could be analyzed for signatures of recent selection to test whether selection has influenced GDM incidence across populations. Finally, one could test whether incidence of GDM among immigrants approaches that of the rest of the population across generations. GDM is reduced for South Asians born in the USA compared to first generation immigrants, but it is still elevated above the level of European-Americans [19], indicating fetal programming may explain a large fraction of differences in GDM risk, but is probably not the only factor.

Similar approaches could be used to test hypotheses of selection for resistance to preeclampsia, infectious disease, hypoxia and other reproductive factors. In a broad sense, this will require a better understanding of the axes of human variation – genetic and phenotypic. Next-generation sequencing data from diverse populations of humans will contribute to this understanding. However, the phenotypic data are equally critical. We need a clearer understanding of the susceptibility of pregnant women to infectious diseases and metabolic diseases across populations, and how this is mediated by nutritional status, UV irradiation, hypoxia, and other external factors.
Testing these hypotheses will be important both for evolutionary genetics and for improving care for human health across diverse ethnicities.
compared to other populations, which could be a result of this selection on maternal metabolism to suit diet [41,42].

Similarly to GDM, preeclampsia has an incidence that varies across populations, and it appears to have an inverse relationship with the dietary risk factor of salt intake (Table 1) [43]. In a study of preeclampsia in NYC, preeclampsia rates were lower among immigrants from East Asia (1.4%), especially Japan (1.2%) and Taiwan (0.9%), and lowest in the world among Iranians (0.6%) [44], compared to an incidence of 3–5% of pregnancies in other developed countries [24]. Although these populations are less obese than Americans, Japanese and Iranians have historically high salt intakes due to consumption of coastal foods (Japan) and high soil salinity (Iran) [45–47].

High salt-consuming populations, such as Japanese and Iranians, may have experienced strong selection to protect them from the deadly threat of preeclampsia. Because the heritability of preeclampsia is 0.55 according to a study on a Swedish cohort [48], this provides variation for selection to act upon. Populations consuming large amounts of salt should experience strong selection against genetic risk factors for preeclampsia in the absence of modern medical support for premature deliveries. Supporting this, insensitivity to salt in the diet is common in Japanese: women consuming the most salt (20.6 g/day) have no more hypertension than those consuming the least (8 g/day) [49]. By comparison, the WHO recommends less than 5 g/day of salt consumption for adults [50].

Adaptation for consuming a high glycemic, high dairy diet may have been the result of selection in Europeans through the pressure of GDM, whereas adaptation for consuming a high salt diet may have evolved in Japanese and Iranians through the selective pressure of preeclampsia. By contrast, alternative hypotheses may also explain the trends described (see Box 2).
In the past several thousand years, populations migrated to new environments and invented new methods of food extraction and processing, such as agriculture, pastoralism, and fishing. The hypotheses presented here focus on how selective pressures during pregnancy may cause strong selection in response to changing diets in recent human evolution.

Nutrients and selection during pregnancy

Access to nutrients has been critical in human evolution, contingent upon dietary resources and the physiological processes that determine the bioavailability of ingested nutrients. Two selective pressures in humans that changed the amount and bioavailability of nutrients in the diet were exposure to solar UV radiation and adult milk-drinking. The ways in which these impacted upon fecundity and pregnancy may explain why UV radiation and milk-drinking exerted such strong fitness effects.

Skin pigmentation closely correlates with UV radiation worldwide [51], perhaps partly because UV radiation exerted strong selection across populations during pregnancy in addition to other stages of life. Lighter or darker pigmentation impacts upon the absorption of UV radiation and thereby on folate and vitamin D3, critical micronutrients during pregnancy [51,52]. Folate – obtained from eating plants – is stored in cutaneous blood vessels and can be destroyed by UV radiation [53]. Folate deficiency causes failure of neural tubes to close during fetal development, resulting in anencephalus and spina bifida, defects lethal to the fetus [54]. Neural tube defects rarely occur in darkly pigmented people because their melanin protects their folate stores in equatorial areas [51]. Therefore, increased melanin production among equatorial populations of Africa, as well as of Asia, Australia, and the Pacific where populations migrated, was potentially selected to protect folate stores in the skin during pregnancy.
By contrast, melanin in the skin also blocks synthesis of vitamin D3 at higher latitudes [55]. Vitamin D3 enables absorption of calcium for skeletal formation in the fetus and maintenance in the mother [56]. Deficiencies cause malformation of the maternal pelvis, maternal osteoporosis, and rickets in fetuses and growing children [57,58]. In addition, vitamin D3 may assist development of the fetal innate immune system and critical organs [59,60]. Therefore, balancing the synthesis of vitamin D3 with protection of folate stores for pregnancy probably played a role in the strong selection for graded melanation with UV-radiation clines worldwide [51,52].

Table 1. Relationship between metabolic diseases of pregnancy and traditional diets

GDM incidence, glycemic index, and dairy consumption
Population | GDM incidence | Diet | Dairy | Agriculture | Glycemic index | Refs
European-Americans | 3.6% (a) | 70% carbohydrate; grain-based | Yes | Yes | High | [19,34,35]
Hunter-gatherers | ? | 3–50% carbohydrate; game, tubers, vegetables, fruits, nuts, etc. | No | No | Moderate | [34]
Bangladeshis | 7–9% (b); 21.2% (a) | Rice, fish | No | Yes | Moderate | [19,36,37]
African-Americans | 4.3% (a) | Agriculture, pastoralism, or hunter-gatherer | Mixed | Mixed | Moderate | [19]

Preeclampsia incidence and traditional salt consumption
Population | Preeclampsia incidence | Salt consumption | Obesity | Refs
European-Americans | 2% (a) | ? | High | [44,45]
Sub-Saharans | 3.3–3.9% (a) | Low, especially in rainforests | Low | [44,45]
African-Americans | 4.6% (a) | Low, mixed ancestry | High | [44,45]
Iranians | 0.6% (a) | High, due to soil salinity | Medium | [44–46]
Japanese | 1.2% (a) | High, due to seafood | Medium | [44,45,47]

(a) Incidence for populations living in New York City.
(b) Incidence for populations living in Bangladesh.

Signatures of strong selection have been found surrounding genes with variants associated with skin pigmentation
in diverse populations – notably SLC24A5, MATP, and TYR in Europeans, DCT, EGFR, and DRD2 in East Asians, and TYRP1, KITLG, ASIP, and OCA2 in both populations [61–64]. In addition, ancestral alleles of these genes that tend to be associated with darker pigmentation, and that occur at a higher frequency in Africans, also tend to be highly frequent in darkly pigmented Melanesian populations. This may indicate convergent selection on the same genetic variants in diverse populations [61], although many populations remain to be tested.

Alternatively, UV radiation may have selected for appropriate skin pigmentation at other life stages such as childhood. Some detrimental effects of UV radiation on skin, such as skin cancer, occur post-reproductively, mitigating their importance to fitness [52,65]. However, sunburn alone causes significant morbidity for lightly pigmented people living in high UV regions because it damages the skin, increasing infection and water loss, and decreasing thermoregulatory control. Furthermore, although vitamin D3 is critical for pregnancy, it is also important for bone density, immune function, and other effects in childhood and throughout life. To address this, one piece of evidence indicating that pregnancy, specifically, may have been important to selection on skin pigmentation is that women exhibit slightly lower levels of skin pigmentation on low-exposure patches of skin than do men, across world populations, indicating that the need for vitamin D3 may have been more critical for women than men [51]. Research clarifying the importance of vitamin D3 status to human health at different life-stages could shed more light on this hypothesis.

Likewise, the ability to drink milk among pastoralists who keep dairy animals may also have been driven by selection on reproductive fitness.
These pastoralists experienced strong selection in the past 10 000 years to continue digesting the lactose found in milk into adulthood, instead of losing this ability shortly after birth as occurs in most mammals [66]. Strong selection has been detected for a number of different genetic polymorphisms in diverse pastoralist populations from Europe, Africa, the Middle East, and Central Asia, each associated with regulation of LCT expression, encoding the enzyme lactase, which is responsible for cleaving lactose, the disaccharide in milk [35,67–70]. Researchers have been surprised by the strength of this selection and have struggled to develop plausible explanations for it. Milk from animals provides an extra source of sugar, protein, fat, calcium, and hydration, beneficial not only for survival but also for reproduction.

Several possible hypotheses could link milk to reproductive fitness. First, milk from animals provided a sterile source of hydration, especially for those living in hot, arid climates such as Africa and the Middle East [66]. Considering the sensitivity of pregnant women to contaminated food and drink [71], pregnant women able to drink sterile fresh milk may have experienced special fitness benefits. Second, the extra calcium in milk could be beneficial due to its role in skeletal development and maintenance and to female reproductive maturation because large pelvises are required for vaginal delivery [72]. Third, because fat is more calorie-dense than proteins and carbohydrates, fat from milk could help the mother nourish her infant during pregnancy and lactation. Fat stores and energy balance have also been linked to age of menarche and length of anovulatory period post-pregnancy [73,74].

A final hypothesis involves the fact that milk and other animal fats contain cholesterols used to synthesize reproductive hormones, critical for fecundity and early fetal development and growth [75].
The grain-based diets of Neolithic farmers were lower in cholesterol than the diets of hunter-gatherer ancestors who consumed more wild game [34]. Less cholesterol in the diet correlates with lower levels of reproductive steroids [76], reducing ovarian function and fecundity, suggesting that milk drinking could have provided a much-needed cholesterol and fertility boost for Neolithic Europeans. Therefore, the increase in fat, cholesterol, and calcium from drinking milk may have accelerated female skeletal maturation, increased caloric resources, and increased fecundity among women who could consume dairy, creating strong fitness benefits.

Infectious disease and selection during pregnancy

Infectious diseases have exerted some of the strongest forces of selection on humans, most notably since the increase in population densities following the transition to agriculture and pastoralism 10 000 years ago. For example, genetic variants conferring resistance to malaria, such as alleles in the regions of HBB, HBA, FY, CD36, and G6PD, were strongly selected among African populations and others where malaria is endemic [77]. Though infectious diseases are threats to survival generally, their differential impact on infants and pregnant women makes them especially powerful selective agents.

During pregnancy the maternal immune system is suppressed so that the mother does not launch an adaptive immune response to the foreign cellular antigens of the fetus [9]. Although details are still being clarified, this response may make pregnant women less able to clear infections requiring strong inflammatory responses [9]. The outcome is that pregnant women experience spontaneous abortion and have higher morbidity and mortality in response to many infections than the general population [9].

Malaria, influenza, and cholera are three infectious diseases that pose severe risks for pregnancy. In particular, African Plasmodium falciparum can infect the placenta [9].
As a result, pregnant women with malaria die two- to threefold more often than the general infected population [78]. In sub-Saharan Africa malaria causes 20% of the cases of low infant birthweight, together with slow growth, spontaneous abortion, maternal anemia, and infant mortality [9,78,79]. Intriguingly, positive selection on a genetic variant of the gene FLT1, which reduces spontaneous abortions in cases of placental malaria, has been found for a malaria-endemic population in Tanzania [80]. This indicates that, in the case of malaria resistance, selection mediated by pregnant women and their fetuses alone is sufficient for adaptive change in allele frequency in a population. Based upon this evidence, although genetic variants conferring general resistance to malaria experienced positive selection that could have been mediated by a broader subset of the population, pregnant women likely comprised an important portion of this selection.
During the 1918 influenza pandemic ~50% of all infected pregnant women contracted pneumonia and ~50% of this subset died (~27% total mortality for infected pregnant women), far more than the ~1% mortality for all individuals of reproductive age with influenza [81,82]. Together with fetal abortion, this caused a 5–15% drop in birth rate the following spring [83]. This pattern is typical of other influenza pandemics [84]. Mortality by influenza is heritable [85], and therefore resistance to influenza may have been strongly selected for in recent human evolution, although this has been understudied.

Cholera causes diarrhea, vomiting, dehydration, and cramping, which can induce spontaneous abortion, preterm small-birthweight babies, and maternal death [86]. Similarly to influenza, smallpox, and dysentery, cholera decreases birthrates significantly during epidemic years [10,87], indicating it has strong potential as a selective agent in humans.

Many other infectious diseases are particularly dangerous for pregnant women. Among female Lassa fever patients of childbearing years admitted to a hospital in Sierra Leone, death was significantly higher for pregnant women (25%) than non-pregnant women (13%) [88]. Tellingly, symptoms improved with delivery [88]. The Ebola virus killed more pregnant patients (95.5%) than the population average (77%) during an outbreak in the Democratic Republic of the Congo [89]. Some infectious agents, for example the parasite Toxoplasma gondii, cause disease only in pregnant women, who are likely to experience abortion [9]. Evidence from mice suggests that another parasite, Leishmania, also exploits immunological changes in pregnant women [90]. Finally, Varicella zoster, the chickenpox virus, causes pregnant women to develop more skin lesions and pneumonia at higher rates than the average adult with chickenpox [91].

Pregnant women are clearly especially vulnerable to infectious disease.
Although many of these diseases also cause significant morbidity in non-pregnant adults, the dramatic impact on pregnant women makes it likely that selective effects would have been strongly mediated by this population, even though the adaptive benefit of genetic resistance to infectious disease is felt across all life-stages for both males and females. As researchers discover functional genetic variants in regions of the human genome under selection, we predict that many will confer resistance to infectious diseases that severely affect pregnant women who lack resistance, in addition to causing high infant mortality.

Concluding remarks
The field of human evolutionary genomics is in a period of transition. Currently, only a few examples of selection in response to environmental pressures felt by particular populations have been elucidated, such as malaria resistance and lactase persistence. These examples were already under study before the development of evolutionary genomics, and the signatures of selection surrounding the genetic variants under selection merely served to substantiate strong adaptive hypotheses already presented. However, next-generation sequencing data from diverse populations now provide the raw material to detect many more strong candidates for selection. Thus, the field of evolutionary genomics now has the potential to provide many new testable hypotheses of selection that were not developed a priori. For example, a catalog of candidate variants for selection was recently published, and one of these variants was experimentally characterized [92]. At this turning point in the field, we seek to underscore that many aspects of human evolution are best understood by investigating the life-history bottleneck of pregnancy and birth from the perspective of both the mother and the infant.
During pregnancy, nutritional, energetic, physical, and immunological requirements are constrained in the mother to support the fetus, concentrating selective forces upon the mother at a sensitive life-stage. The pressures that have been most important in recent human evolution (infectious diseases from high population densities, adult dairy consumption from pastoralism, grain consumption from agriculture, and changes in UV radiation and oxygen levels from moving to extreme latitudes and altitudes) have left genetic signatures of their selective impact. Although these selective factors may be felt across the lifespan, nowhere are they more serious than during infancy and pregnancy. We should thus remain cognizant of these phases of life because next-generation sequencing now provides evolutionary genomicists with the data to generate many new testable hypotheses of why particular loci are under selection in humans.

Acknowledgments
We thank Katie Hinde for comments on the manuscript and helpful discussions. We also thank the Packard Foundation for their support.

References
1 Graunt, J. (1662) Natural and Political Observations Mentioned in a Following Index, and Made Upon the Bills of Mortality, Royal Society of London 2 Butte, N.F. (2000) Carbohydrate and lipid metabolism in pregnancy: normal compared with gestational diabetes mellitus. Am. J. Clin. Nutr. 71, 1256S–1261S 3 Pritchard, J. (1965) Changes in the blood volume during pregnancy and delivery. Anesthesiology 26, 393–399 4 Kaufmann, P. et al. (2004) Aspects of human fetoplacental vasculogenesis and angiogenesis. II. Changes during normal pregnancy. Placenta 25, 114–126 5 Sladek, S.M. et al. (1997) Nitric oxide and pregnancy. Am. J. Physiol. 272, R441–R463 6 Hermida, R.C. et al. (2000) Blood pressure patterns in normal pregnancy, gestational hypertension, and preeclampsia. Hypertension 36, 149–158 7 Jolly, M.C. et al.
(2003) Risk factors for macrosomia and its clinical consequences: a study of 350,311 pregnancies. Eur. J. Obstet. Gynecol. Reprod. Biol. 111, 9–14 8 James, A.H. et al. (2005) Incidence and risk factors for stroke in pregnancy and the puerperium. Obstet. Gynecol. 106, 509–516 9 Robinson, D.P. and Klein, S.L. (2012) Pregnancy and pregnancy-associated hormones alter immune responses and disease pathogenesis. Horm. Behav. 62, 263–271 10 Woods, R. (2009) Death before Birth, Oxford University Press 11 Marlowe, F.W. (2005) Hunter-gatherers and human evolution. Evol. Anthropol. 14, 54–67 12 Rajaratnam, J.K. et al. (2010) Neonatal, postneonatal, childhood, and under-5 mortality for 187 countries, 1970–2010: a systematic analysis of progress towards Millennium Development Goal 4. Lancet 375, 1988–2008 13 Ronsmans, C. and Graham, W.J. (2006) Maternal mortality: who, when, where, and why. Lancet 368, 1189–1200
  • 38. 14 Courtiol, A. et al. (2012) Natural and sexual selection in a monogamous historical human population. Proc. Natl. Acad. Sci. U.S.A. 109, 8044–8049 15 Liu, J. et al. (2012) Maternal risk of breeding failure remained low throughout the demographic transitions in fertility and age at first reproduction in Finland. PLoS ONE 7, e34898 16 Laland, K.N. et al. (2010) How culture shaped the human genome: bringing genetics and the human sciences together. Nat. Rev. Genet. 11, 137–148 17 Moreau, C. et al. (2011) Deep human genealogies reveal a selective advantage to be on an expanding wave front. Science 334, 1148–1150 18 Barbour, L.A. et al. (2007) Cellular mechanisms for insulin resistance in normal pregnancy and gestational diabetes. Diabetes Care 30 (Suppl. 2), S112–S119 19 Savitz, D.A. et al. (2008) Ethnicity and gestational diabetes in New York City, 1995–2003. BJOG 115, 969–978 20 Langer, O. et al. (2005) Gestational diabetes: the consequences of not treating. Am. J. Obstet. Gynecol. 192, 989–997 21 Sermer, M. et al. (1998) The Toronto Tri-Hospital Gestational Diabetes Project. A preliminary review. Diabetes Care 21 (Suppl. 2), B33–B42 22 Rosenberg, K. and Trevathan, W. (2002) Birth, obstetrics and human evolution. BJOG 109, 1199–1206 23 Dunsworth, H.M. et al. (2012) Metabolic hypothesis for human altriciality. Proc. Natl. Acad. Sci. U.S.A. 109, 15212–15216 24 WHO (2005) World Health Report: Make Every Mother and Child Count, World Health Organization 25 Moodley, J. (2008) Maternal deaths due to hypertensive disorders in pregnancy. Best Pract. Res. Clin. Obstet. Gynaecol. 22, 559–567 26 Duley, L. (1992) Maternal mortality associated with hypertensive disorders of pregnancy in Africa, Asia, Latin America and the Caribbean. Br. J. Obstet. Gynaecol. 99, 547–553 27 Hunsberger, M. et al. (2010) Racial/ethnic disparities in gestational diabetes mellitus: findings from a population-based survey. Womens Health Issues 20, 323–328 28 Caughey, A.B. et al.
(2010) Maternal and paternal race/ethnicity are both associated with gestational diabetes. Am. J. Obstet. Gynecol. 202, 616.e1–5 29 Holt, S. et al. (1997) An insulin index of foods: the insulin demand generated by 1000-kJ portions of common foods. Am. J. Clin. Nutr. 66, 1264–1276 30 Hoyt, G. et al. (2007) Dissociation of the glycaemic and insulinaemic responses to whole and skimmed milk. Br. J. Nutr. 93, 175 31 Zhang, C. et al. (2006) Dietary fiber intake, dietary glycemic load, and the risk for gestational diabetes mellitus. Diabetes Care 29, 2223–2230 32 Zhang, C. and Ning, Y. (2011) Effect of dietary and lifestyle factors on the risk of gestational diabetes: review of epidemiologic evidence. Am. J. Clin. Nutr. 94, 1975S–1979S 33 Hoppe, C. et al. (2005) High intakes of milk, but not meat, increase s-insulin and insulin resistance in 8-year-old boys. Eur. J. Clin. Nutr. 59, 393–398 34 Ströhle, A. and Hahn, A. (2011) Diets of modern hunter-gatherers vary substantially in their carbohydrate content depending on ecoenvironments: results from an ethnographic analysis. Nutr. Res. 31, 429–435 35 Myles, S. et al. (2005) Genetic evidence in support of a shared Eurasian–North African dairying origin. Hum. Genet. 117, 34–42 36 Itan, Y. et al. (2010) A worldwide correlation of lactase persistence phenotype and genotypes. BMC Evol. Biol. 10, 36–47 37 Atkinson, F.S. et al. (2008) International tables of glycemic index and glycemic load values: 2008. Diabetes Care 31, 2281–2283 38 Almgren, P. et al. (2011) Heritability and familiality of type 2 diabetes and related quantitative traits in the Botnia Study. Diabetologia 54, 2811–2819 39 Metzger, B.E. et al. (2007) Summary and recommendations of the Fifth International Workshop–Conference on Gestational Diabetes Mellitus. Diabetes Care 30 (Suppl. 2), S251–S260 40 Bellamy, L. et al. (2009) Type 2 diabetes mellitus after gestational diabetes: a systematic review and meta-analysis. Lancet 373, 1773–1779 41 Dickinson, S. et al.
(2002) Postprandial hyperglycemia and insulin sensitivity differ among lean young adults of different ethnicities. J. Nutr. 2574–2579 42 Henry, C.J.K. et al. (2008) Glycaemic index of common foods tested in the UK and India. Br. J. Nutr. 99, 840–845 43 Reyes, L. et al. (2012) Nutritional status among women with pre-eclampsia and healthy pregnant and non-pregnant women in a Latin American country. J. Obstet. Gynaecol. Res. 38, 498–504 44 Gong, J. et al. (2012) Maternal ethnicity and pre-eclampsia in New York City, 1995–2003. Paediatr. Perinat. Epidemiol. 26, 45–52 45 Intersalt Cooperative Research Group (1988) Intersalt: an international study of electrolyte excretion and blood pressure. Results for 24 hour urinary sodium and potassium excretion. BMJ 297, 319–328 46 FAO/IIASA/ISRIC/ISS-CAS/JRC (2012) Harmonized World Soil Database, Food and Agriculture Organization of the United Nations and International Institute for Applied Systems Analysis (Version 1.2) 47 Brown, I.J. et al. (2009) Salt intakes around the world: implications for public health. Int. J. Epidemiol. 38, 791–813 48 Cnattingius, S. et al. (2004) Maternal and fetal genetic factors account for most of familial aggregation of preeclampsia: a population-based Swedish cohort study. Am. J. Med. Genet. 130A, 365–371 49 Miura, K. et al. (2010) Dietary salt intake and blood pressure in a representative Japanese Population: baseline analyses of NIPPON DATA80. J. Epidemiol. 20, S524–S530 50 WHO (2010) Global Status Report on Non-Communicable Diseases 2010, World Health Organization 51 Jablonski, N.G. and Chaplin, G. (2000) The evolution of human skin coloration. J. Hum. Evol. 39, 57–106 52 Jablonski, N.G. and Chaplin, G. (2010) Colloquium paper: human skin pigmentation as an adaptation to UV radiation. Proc. Natl. Acad. Sci. U.S.A. 107 (Suppl. 2), 8962–8968 53 Steindal, A.H. et al. (2008) 5-Methyltetrahydrofolate is photosensitive in the presence of riboflavin. Photochem. Photobiol. Sci. 7, 814 54 Fleming, A.
and Copp, A.J. (1998) Embryonic folate metabolism and mouse neural tube defects. Science 280, 2107–2109 55 Holick, M.F. (1987) Photosynthesis of vitamin D in the skin: effect of environmental and life-style variables. Fed. Proc. 46, 1876–1882 56 Brunvand, L. et al. (1996) Vitamin D deficiency and fetal growth. Early Hum. Dev. 45, 27–33 57 Fogelman, Y. et al. (1995) High prevalence of vitamin D deficiency among Ethiopian women immigrants to Israel: exacerbation during pregnancy and lactation. Isr. J. Med. Sci. 31, 221–224 58 Henderson, J.B. et al. (1987) The importance of limited exposure to ultraviolet radiation and dietary factors in the aetiology of Asian rickets: a risk-factor model. Q. J. Med. 63, 413–425 59 Norman, A.W. (2008) From vitamin D to hormone D: fundamentals of the vitamin D endocrine system essential for good health. Am. J. Clin. Nutr. 88, 491S–499S 60 Holick, M.F. (2004) Vitamin D: importance in the prevention of cancers, type 1 diabetes, heart disease, and osteoporosis. Am. J. Clin. Nutr. 79, 362–371 61 Lao, O. et al. (2007) Signatures of positive selection in genes associated with human skin pigmentation as revealed from analyses of single nucleotide polymorphisms. Ann. Hum. Genet. 71, 354–369 62 Norton, H.L. et al. (2007) Genetic evidence for the convergent evolution of light skin in Europeans and East Asians. Mol. Biol. Evol. 24, 710–722 63 Alonso, S. et al. (2008) Complex signatures of selection for the melanogenic loci TYR, TYRP1 and DCT in humans. BMC Evol. Biol. 8, 74 64 Quillen, E.E. et al. (2012) OPRM1 and EGFR contribute to skin pigmentation differences between Indigenous Americans and Europeans. Hum. Genet. 131, 1073–1080 65 Blum, H. (1961) Does the melanin pigment of human skin have adaptive value? An essay in human ecology and the evolution of race. Q. Rev. Biol. 36, 50–63 66 Ingram, C.J.E. et al. (2009) Lactose digestion and the evolutionary genetics of lactase persistence. Hum. Genet. 124, 579–591 67 Tishkoff, S.A. et al. 
(2007) Convergent adaptation of human lactase persistence in Africa and Europe. Nat. Genet. 39, 31–40 68 Enattah, N.S. et al. (2008) Independent introduction of two lactase-persistence alleles into human populations reflects different history of adaptation to milk culture. J. Hum. Genet. 82, 57–72 69 Peng, M-S. et al. (2012) Lactase persistence may have an independent origin in Tibetan populations from Tibet, China. J. Hum. Genet. 57, 394–397
  • 39. 70 Heyer, E. et al. (2011) Lactase persistence in central Asia: phenotype, genotype, and evolution. Hum. Biol. 83, 379–392 71 Pouillot, R. et al. (2012) Relative risk of listeriosis in Foodborne Diseases Active Surveillance Network (FoodNet) sites according to age, pregnancy, and ethnicity. Clin. Infect. Dis. 54 (Suppl. 5), S405–S410 72 Ellison, P.T. (1990) Human ovarian function and reproductive ecology: new hypotheses. Am. Anthropol. 92, 933–952 73 Frisch, R.E. (1984) Body fat, puberty and fertility. Biol. Rev. Camb. Philos. Soc. 59, 161–188 74 Panter-Brick, C. et al. (1993) Seasonality of reproductive function and weight loss in rural Nepali women. Hum. Reprod. 8, 684–690 75 Herrera, E. (2002) Lipid metabolism in pregnancy and its consequences in the fetus and newborn. Endocrine 19, 43–55 76 Goldin, B.R. et al. (1982) Estrogen excretion patterns and plasma levels in vegetarian and omnivorous women. N. Engl. J. Med. 307, 1542–1547 77 Campino, S. et al. (2006) Mendelian and complex genetics of susceptibility and resistance to parasitic infections. Semin. Immunol. 18, 411–422 78 Shulman, C. (2003) Importance and prevention of malaria in pregnancy. Trans. R. Soc. Trop. Med. Hyg. 97, 30–35 79 Steketee, R.W. et al. (2001) The burden of malaria in pregnancy in malaria-endemic areas. Am. J. Trop. Med. Hyg. 64, 28–35 80 Muehlenbachs, A. et al. (2008) Natural selection of FLT1 alleles and their association with malaria resistance in utero. Proc. Natl. Acad. Sci. U.S.A. 105, 14488–14491 81 Harris, J. (1919) Influenza occurring in pregnant women. A statistical study of thirteen hundred and fifty cases. J. Am. Med. Assoc. 72, 978–980 82 Taubenberger, J.K. and Morens, D.M. (2006) 1918 Influenza: the mother of all pandemics. Emerg. Infect. Dis. 12, 15–22 83 Bloom-Feshbach, K. et al. (2011) Natality decline and miscarriages associated with the 1918 influenza pandemic: the Scandinavian and United States experiences. J. Infect. Dis. 204, 1157–1164 84 Pazos, M. et al.
(2012) The influence of pregnancy on systemic immunity. Immunol. Res. 54, 254–261 85 Horby, P. et al. (2012) The role of host genetics in susceptibility to influenza: a systematic review. PLoS ONE 7, e33180 86 Carrera, J. (ed.) (2007) Recommendations and Guidelines for Perinatal Medicine, Matres Mundi International 87 Hotelling, H. and Hotelling, F. (1931) Causes of birth rate fluctuations. J. Am. Stat. Assoc. 26, 135–149 88 Price, M.E. et al. (1988) A prospective study of maternal and fetal outcome in acute Lassa fever infection during pregnancy. BMJ 297, 584–587 89 Mupapa, K. et al. (1999) Ebola hemorrhagic fever and pregnancy. J. Infect. Dis. 179 (Suppl. 1), S11–S12 90 Roberts, C. et al. (2001) Sex-associated hormones and immunity to protozoan parasites. Clin. Microbiol. Rev. 14, 476–488 91 Harger, J.H. et al. (2002) Risk factors and outcome of varicella–zoster virus pneumonia in pregnant women. J. Infect. Dis. 185, 422–427 92 Grossman, S.R. et al. (2013) Identifying recent adaptations in large-scale genomic data. Cell 152, 703–713 93 Heron, M. (2012) Deaths: leading causes for 2009. Natl. Vital Stat. Rep. 61, 1–95 94 Scheinfeldt, L.B. et al. (2012) Genetic adaptation to high altitude in the Ethiopian highlands. Genome Biol. 13, R1 95 Bigham, A. et al. (2010) Identifying signatures of natural selection in Tibetan and Andean populations using dense genome scan data. PLoS Genet. 6, e1001116 96 Beall, C.M. et al. (2010) Natural selection on EPAS1 (HIF2α) associated with low hemoglobin concentration in Tibetan highlanders. Proc. Natl. Acad. Sci. U.S.A. 107, 11459–11464 97 Simonson, T.S. et al. (2010) Genetic evidence for high-altitude adaptation in Tibet. Science 329, 72–75 98 Mejía, O.M. et al. (2005) Genetic association analysis of chronic mountain sickness in an Andean high-altitude population. Haematologica 90, 13–19 99 Moore, L.G. et al. (2001) Oxygen transport in Tibetan women during pregnancy at 3,658 m. Am. J. Phys. Anthropol. 114, 42–53 100 Wilson, M.J.
et al. (2007) Greater uterine artery blood flow during pregnancy in multigenerational (Andean) than shorter-term (European) high-altitude residents. Am. J. Physiol. Regul. Integr. Comp. Physiol. 293, R1313–R1324 101 Beall, C.M. (2007) Two routes to functional adaptation: Tibetan and Andean high-altitude natives. Proc. Natl. Acad. Sci. U.S.A. 104 (Suppl. 1), 8655–8660 102 Hales, C.N. and Barker, D.J. (2001) The thrifty phenotype hypothesis. Br. Med. Bull. 60, 5–20 103 Raji, A. et al. (2001) Body fat distribution and insulin resistance in healthy Asian Indians and Caucasians. J. Clin. Endocrinol. Metab. 86, 5366–5371
  • 40. Finding the lost treasures in exome sequencing data
David C. Samuels1*, Leng Han2*, Jiang Li3, Sheng Quanghu3, Travis A. Clark4, Yu Shyr3, and Yan Guo3
1 Center for Human Genetics Research, Vanderbilt University, Nashville, TN, 37232, USA
2 Department of Bioinformatics and Computational Biology, MD Anderson Cancer Center, Houston, TX, 77030, USA
3 Center for Quantitative Sciences, Vanderbilt University, Nashville, TN, 37232, USA
4 Vanderbilt Technology for Advanced Genomics, Vanderbilt University, Nashville, TN, 37232, USA

Exome sequencing is one of the most cost-efficient sequencing approaches for conducting genome research on coding regions. However, a significant portion of the reads obtained in exome sequencing comes from outside the designed target regions. These additional reads are generally ignored, potentially wasting an important source of genomic data. Three major types of unintentionally sequenced reads can be found in exome sequencing data: reads in introns and intergenic regions, reads in the mitochondrial genome, and reads originating in viral genomes. All of these can be used for reliable data mining, extending the utility of exome sequencing. Large-scale exome sequencing data repositories, such as The Cancer Genome Atlas (TCGA), the 1000 Genomes Project, the National Heart, Lung, and Blood Institute (NHLBI) Exome Sequencing Project, and the Sequence Read Archive, provide researchers with excellent secondary data-mining opportunities to study genomic data beyond the intended target regions.

The rise of exome sequencing
Next-generation sequencing (see Glossary) has substantially decreased the cost of sequencing and has become the tool of choice for genomic studies. One of the most popular new sequencing approaches is exome sequencing (Figure 1), in which the coding regions of the full genome are targeted, captured, and sequenced.
The exome represents approximately 1–1.5% of the human genome (approximately 50 million bp), but it accounts for over 85% of all mutations that have been identified in Mendelian disorders [1]. As a result, exome sequencing is currently an attractive and practical approach for the investigation of coding variations [2,3]. Targeted resequencing enables the enrichment of specific sequences from a whole-genomic library. Exome sequencing is an example of this approach, whereby the complete coding region of the genome is enriched for sequencing. However, many of the captured DNA fragments still derive from outside the targeted regions (Figure 2). As a result, intronic and intergenic regions may be sequenced, including promoters, conserved noncoding sequences, untranslated regions (UTRs), miRNA target sites, and other potentially functional regions. In a typical exome-sequencing study, approximately 40–60% of the reads are off target [4–6], and all or most of these off-target reads are usually ignored. This practice does not use the full potential of exome-sequencing data, because it overlooks a large amount of potentially useful data. Recent studies [5,7–10] have shown that off-target reads can be of good quality and can provide useful insights.

Reads aligned outside the target regions
There are three major exome-sequencing capture kits currently in broad use: Illumina TruSeq, Agilent SureSelect, and NimbleGen SeqCap EZ. All three platforms start with whole-genomic libraries made from fragmented genomic DNA and use biotinylated oligonucleotide baits

Glossary
Bait: the hybridization probe designed to capture the target region to be sequenced. The bait design differs by manufacturer and method. Some methods use baits that tile the target region, whereas others use non-overlapping baits that differ in the distance between baits.
Exome sequencing: selectively capturing the exome (coding regions) and other content in a whole-genome library before sequencing. This enables deeper coverage of the genomic region that is enriched in disease-causing variants in a megabase-sized DNA library instead of sequencing a lower coverage gigabase-sized whole-genome library.
Next-generation sequencing: high-throughput DNA sequencing using massively parallel reactions generating millions of independent reads. The methodology employs a variety of technologies, including highly parallelized pyrosequencing, sequencing-by-synthesis, sequencing by ligation, and single molecule sequencing methods.
Off-target reads: the sequencing reads that do not align to the target region.
Oncogenic virus: a virus associated with cancer. This association is generally due to the insertion of the viral genome into the host genome in a location that disrupts a crucial host gene, leading to the expansion of that cell into a tumor.
Reads: the DNA sequences generated by the sequencer, each representing data from a unique fragment of the sequencing library. A typical next-generation sequencing run generates millions of reads per sample.
Target regions: the regions of interest defined for enrichment. The genomic coordinates of a target region are used to design the capture baits, probes, or primers for enrichment and vary by exome sequencing kit.
Unmappable reads: the reads that are not aligned to the human genome.

0168-9525/$ – see front matter © 2013 Elsevier Ltd. All rights reserved.
Corresponding author: Guo, Y. (
Keywords: mitochondria; exome capture; virus; virus integration; mtDNA copy number; unmapped read.
* These authors contributed equally to this article.
Trends in Genetics, October 2013, Vol. 29, No. 10 593
  • 41. complementary to the design targets to enrich for exons and other vendor-specific content. The target regions for these three exome capture kits vary and range from 37.6 to 62.1 million bp. The capture kits can enrich just the exome, exons plus 3′ and 5′ UTRs, and other content. The kits also differ in their target regions, bait length, bait density, and the molecule used for capturing.

Other capture techniques, including array-based, multiplex PCR, selector-probe (HaloPlex), and molecular-inversion probe (MIP) methods, are also available. The capture efficiency varies by capture method. For example, one group [11] (using the NimbleGen 2.1M array-based capture kit) reported having 64.5% of sequenced bases outside the target regions and 31.9% of the reads more than 500 bp away from the target regions; another group [1] (using Agilent 244K microarrays for target enrichment) reported over 50% of sequenced bases outside the target regions.

The capture efficiency of the three major exome capture kits has been reported by multiple studies. For Agilent SureSelect, the capture efficiency is between 42% and 58% [4–6]; for Illumina TruSeq, it is between 45% and 46% [5]; and for NimbleGen SeqCap EZ, it is between 50% and 53% [4,6]. Although a capture efficiency of less than 50% can be misinterpreted as failure of the sequencing method, the raw number of reads mapped to the target regions and the median depth of the target regions are more informative measures of the success of the capture method. The unmapped fraction of reads can be anywhere from 5% to 19% [5] and depends on many factors, such as the type of capture kit, DNA quality, aligner settings, and the completeness of the reference sequence used for the alignment. Variability is also introduced during library preparation and sequencing. Even repeat sequencing of a sample can generate different capture efficiency metrics [6].
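The capture metrics discussed above reduce to a few ratios over read counts. The following is a minimal Python sketch, assuming that capture efficiency is taken as the fraction of all sequenced reads that fall on the target regions (studies differ in the denominator they use). The `ReadCounts` class, the `capture_metrics` function, and the counts in the usage note are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadCounts:
    """Read counts for one exome-sequencing sample (hypothetical container)."""
    total: int      # all sequenced reads
    mapped: int     # reads aligned anywhere in the reference
    on_target: int  # mapped reads that overlap the capture target regions


def capture_metrics(counts: ReadCounts) -> dict:
    """Return capture efficiency, off-target fraction, and unmapped fraction.

    Assumption: every fraction uses the total number of sequenced reads as
    the denominator, so the three values sum to 1.
    """
    if counts.total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= counts.on_target <= counts.mapped <= counts.total:
        raise ValueError("expected 0 <= on_target <= mapped <= total")
    return {
        "capture_efficiency": counts.on_target / counts.total,
        "off_target_fraction": (counts.mapped - counts.on_target) / counts.total,
        "unmapped_fraction": (counts.total - counts.mapped) / counts.total,
    }
```

For instance, `capture_metrics(ReadCounts(total=1_000_000, mapped=900_000, on_target=500_000))` gives a capture efficiency of 0.5, with 40% of reads mapped off target and 10% unmapped. As noted above, the raw number of on-target reads and the median target depth remain more informative than this ratio alone.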
SNPs outside the exonic regions
Many functional elements are located outside the exonic regions [12–15]. Although the role of introns was unclear for many years, several studies have now established some functional significance for introns [11,16–19]. For example, a study [20] identified two mutations within the core promoter of the telomerase reverse transcriptase gene in 50 of the 70 melanomas examined.

Figure 1. Results of a PUBMED search for papers using the term ‘exome’, through 1 July 2013, showing the rapid and recent spread of this sequencing method. [Plot: cumulative PUBMED publications with ‘exome’ (0–2000) against year (2009–2014).]

Figure 2. A flow diagram illustrating how off-target reads can be identified from exome-sequencing data. Currently available tools for the analysis of the different types of off-target read are given. Abbreviation: SNP, single nucleotide polymorphism. [Diagram labels: exome-sequencing reads; Mapped? (yes/no); mapped reads; unmapped DNA reads; Targeted? (yes/no); targeted DNA; untargeted DNA; intronic DNA; intergenic DNA; mtDNA; viral DNA; contamination; tools: PathSeq, VirusSeq, VirusFusionSeq, any SNP caller, MitoSeek.]

Intergenic regions comprise
  • 42. approximately 70% of the human genome. A previous study [5] showed that approximately 50% of the single nucleotide polymorphisms (SNPs) identified from exome sequencing were in the intended target regions, that 27% were in the flanking regions (within 200 bp) of the target regions, and that the remaining 24% were in regions >200 bp away from the target regions.

Although exome sequencing is not designed to identify regulatory SNPs in intronic and intergenic regions, off-target reads from this type of experiment should not be discarded a priori. One of the best examples of the usefulness of these off-target data is a study of Tibetans living at high altitude [11], which found a pair of intronic SNPs in endothelial PAS domain protein 1 (EPAS1) with the greatest Tibetan-Han frequency difference. The authors specifically noted that these SNPs were outside the intended target regions of the exome sequencing, drawing attention to the potential value of these reads.

These and other studies demonstrate that reliable SNPs can be identified through off-target reads captured by exome sequencing [5], suggesting that it is worth searching for such SNPs even though the experiment was not designed to find them. However, the SNP false positive rate has been observed to increase as reads align further from the captured regions [5]. Thus, because of the higher error rate associated with off-target reads, more stringent filter criteria, such as depth and genotyping quality score, need to be applied to SNPs outside the captured regions to achieve the same quality as SNPs inside the captured regions. For example, the transition:transversion ratio is commonly used as a quality measurement for SNPs identified through exome sequencing [5,21,22].
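The two quantities used in this passage, the distance of an SNP from the nearest target region and the transition:transversion ratio, are simple to compute. The sketch below is an illustration in Python under stated assumptions: target intervals are inclusive `(start, end)` pairs on a single chromosome, the 200-bp flank follows the definition quoted above, and the function names are hypothetical rather than taken from any SNP caller.

```python
BASES = frozenset("ACGT")
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def titv_ratio(snps):
    """Transition:transversion ratio for an iterable of (ref, alt) SNPs."""
    transitions = transversions = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            raise ValueError(f"not a single-base substitution: {ref}>{alt}")
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        raise ValueError("ratio is undefined without transversions")
    return transitions / transversions


def classify_by_target_distance(pos, targets, flank=200):
    """Label a position 'on_target', 'flanking' (within flank bp), or 'distant'.

    targets: inclusive (start, end) intervals on the same chromosome as pos.
    """
    if not targets:
        raise ValueError("at least one target interval is required")
    distance = min(
        0 if start <= pos <= end else min(abs(pos - start), abs(pos - end))
        for start, end in targets
    )
    if distance == 0:
        return "on_target"
    return "flanking" if distance <= flank else "distant"
```

One way to use this is to group called SNPs by their label and compute `titv_ratio` per group; a lower ratio in the `distant` group than in the `on_target` group would suggest that stricter depth or quality filters are needed for off-target calls.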
To achieve the same transition:transversion ratio for SNPs outside the target regions as for SNPs inside the target regions, stronger filters, such as higher depth, are required [5].

Another artifact of exome sequencing is the pseudogene effect, where some intergenic regions are sequenced to abnormally high depth (>1000×). This anomaly seems to be consistent regardless of the type of capture kit used [5]. It has been speculated that such phenomena are caused by homologies of pseudogenes. The most commonly used SNP detection framework, the Genome Analysis Toolkit (GATK) [22], developed by the Broad Institute, suggests that SNPs in such regions should be ignored.

The mitochondrial genome in exome sequencing
Mitochondria have an important role in cellular energy metabolism, free radical generation, and apoptosis [23,24]. mtDNA is a maternally inherited 16,569-bp circular genome that encodes two rRNAs, 22 tRNAs, and 13 polypeptides. Mitochondrial dysfunction is an important cause of many neurological diseases [25] and drug toxicities [26,27], and may contribute to carcinogenesis and tumor progression [28,29]. Furthermore, the mitochondrial genome is a fundamental tool for human population genetics and has had a critical role in mapping the migration of humanity across the globe [30–33].

Because the mitochondrial genome is almost all coding sequence, it fits every reasonable definition of the exome. However, mtDNA is not targeted in any of the currently used exome-sequencing methods. Instead, mtDNA sequence can be extracted from exome-sequencing data [2,10]. The average coverage of the mitochondrial genome from exome sequencing is approximately 100×, easily surpassing the average coverage of even the targeted genomic regions [10]. The relatively high coverage of mtDNA is due to the high copy number of mtDNA per cell, on the order of hundreds to several hundred thousand copies per cell, depending on the tissue type [34].
This should be contrasted with techniques that specifically target the mitochondrial sequence, which can produce an average depth of tens of thousands of reads across the mitochondrial genome [35–38]. Given that cells typically contain a very large number of copies of mtDNA, mixtures of wild type and mutant mtDNA (heteroplasmy) can range almost continuously from 0 to 100%. Pathogenic mtDNA mutations are typically heteroplasmic in an individual, with asymptomatic carriers having a low heteroplasmy level of the pathogenic mutation [39]. An average read depth of only approximately 100× means that, although polymorphisms can be accurately determined, the identification of heteroplasmic mtDNA variations is limited to those present in >10% of the mtDNA molecules in the sample. However, these are likely to be the most clinically relevant cases, again pointing to the potential utility of analyzing these sequences. Researchers have started to infer mitochondrial mutation information from exome-sequencing data. The best example is The Cancer Genome Atlas (TCGA) project, in which all mtDNA somatic mutations were inferred from exome-sequencing data. For example, the current somatic mutation results for breast cancer in TCGA [40] contain exome-sequencing data from 776 tumors and report 325 mtDNA somatic mutations derived from off-target reads.

An important complication in aligning DNA reads to the mitochondrial genome is the presence of nuclear copies of mitochondrial sequences (nuMTs) [41,42]. nuMTs can cause ambiguity about whether a read should map to the nuclear or the mitochondrial genome. The simplest way to obtain the mitochondrial genome is to align the raw reads against the mitochondrial reference genome directly and then filter out the nonaligned reads, thus ignoring the nuMTs.
The disadvantage of this approach is that the reads that do derive from the nuMTs may introduce false heteroplasmic variability in the mtDNA sequence. A middle approach is to align the reads against both the nuclear and mitochondrial genomes simultaneously. When a read has multiple locations to which it may be mapped, aligners such as BWA [43] will randomly choose among the possible locations. This has the disadvantage of treating the nuMTs and the mitochondrial genome equally, ignoring the very large copy number difference. The effect of this choice is that many of the reads coming from the mtDNA will be falsely aligned to the nuMTs, causing an artificially high coverage of the nuMTs and an artificially low coverage of the mtDNA. A third choice gives precedence to the nuMTs by first aligning reads against the nuclear genome and then aligning only the nonaligned reads to the mitochondrial genome. This approach will have the most extreme misalignment of true mtDNA reads to the nuclear DNA (potentially leading to false SNP calls in the nuclear DNA), which will lower the coverage of the mitochondrial genome and decrease the chance of detecting true variants. The third approach is also the most conservative and time consuming, involving two alignment processes and leaving no chance of misaligning any nuMT reads to the mitochondrial genome. The second approach offers the best balance between run time and misalignment rate and has been implemented in MitoSeek [44], which can be used to extract mitochondrial mutation and heteroplasmy information from exome-sequencing data.

mtDNA copy number is highly variable and has been suggested to be associated with many diseases, including cancer [45–48]. Thus, it is an important mitochondrial statistic that can be derived from exome-sequencing data. Traditional methods for evaluating mtDNA copy number involve quantitative (q)PCR [49]. A more recent sequencing-based assay of mtDNA copy number draws on the unbiased nature of next-generation sequencing and incorporates techniques developed for RNA expression profiling [50]. Although the authors claim that this assay reports absolute mitochondrial copy number, we argue that the amount of library constructed will affect the copy number count. For example, it has been shown that the fraction of captured mitochondrial sequences in exome-sequencing data is proportional to the relative abundance of the corresponding mitochondrial genome in the original total DNA extract [10]. Based on this observation, we conclude that relative, but not absolute, mtDNA copy number is detectable through exome-sequencing data. The mtDNA copy number extracted from exome-sequencing data can be useful when studying tumor samples, for example in association tests with phenotypes such as tumor stage and metastasis stage.
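The distinction between relative and absolute copy number can be made concrete with a small calculation. The sketch below is only an illustration under stated assumptions; it is not the method used by MitoSeek [44] or by the assay in [50]. It takes the ratio of mean mtDNA depth to mean depth over the nuclear target regions as a relative index. The function names, the read counts, and the 50-Mb target size are hypothetical; only the 16,569-bp mtDNA length comes from the text.

```python
MT_GENOME_LENGTH = 16_569  # bp; mtDNA length given in the text


def mean_depth(n_reads: int, read_length: int, region_length: int) -> float:
    """Average depth = total aligned bases / region length."""
    if region_length <= 0:
        raise ValueError("region_length must be positive")
    return n_reads * read_length / region_length


def relative_mt_index(mt_reads: int, target_reads: int,
                      read_length: int, target_length: int) -> float:
    """Hypothetical relative mtDNA copy-number index (an assumption, not a
    published formula): mean mtDNA depth divided by mean on-target depth.

    The value depends on library preparation and capture kit, so it is only
    meaningful for comparing samples that were processed the same way.
    """
    target_depth = mean_depth(target_reads, read_length, target_length)
    if target_depth == 0:
        raise ValueError("no on-target reads; index is undefined")
    return mean_depth(mt_reads, read_length, MT_GENOME_LENGTH) / target_depth


# Made-up read counts for two samples captured with the same (assumed) 50-Mb kit.
sample_a = relative_mt_index(16_569, 15_000_000, 100, 50_000_000)
sample_b = relative_mt_index(33_138, 15_000_000, 100, 50_000_000)
fold_change = sample_b / sample_a
print(f"A={sample_a:.2f}  B={sample_b:.2f}  B/A={fold_change:.2f}")
```

In this toy example sample B has twice the index of sample A, which supports a statement about their relative mtDNA content but says nothing about the absolute number of mtDNA copies per cell in either sample.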
The recently developed software MitoSeek [44] also computes relative mtDNA copy number from exome-sequencing data.

Pathogen DNA and integration sites

Finally, it is important to consider the portion of reads from exome sequencing that does not map to the reference genome. Some of these reads may represent viral DNA, either as free viral DNA or as viral genomes that have been incorporated into the genome of a host. Detecting viral DNA is of particular importance because of the role of viral DNA integration into the host genome in initiating cancer. Many viruses integrate into the genome of their host cells to replicate and, therefore, mutagenesis caused by viral infection may be quite common. Typically, viruses trigger tumor development by altering host genes or by suppressing the immune system of the host, causing inflammation over a long period of time. Most viruses lack clearly identifiable oncogenes capable of cellular transformation and instead mediate oncogenic transformation through a process termed insertional mutagenesis (IM). The molecular mechanisms of viral IM can vary, but most involve viral insertion within tumor suppressor genes or upregulation of cellular oncogenes in close proximity to the site of viral integration via cis and trans effects of promoter and enhancer sequences within the viral long terminal repeats (LTRs). Known oncogenic viruses [such as the hepatitis B virus (HBV) for liver cancer and the human papillomavirus (HPV) for head and neck cancer and ovarian cancer] are estimated to cause 15–20% of all cancers in humans [51,52]. Understanding the viral integration pattern of cancer-associated viruses may uncover novel oncogenes and tumor suppressors that are associated with cellular transformation.

Viral genomes have been detected using high-throughput sequencing technology [53–57]. The idea of using off-target reads to detect viruses was introduced a few years ago.
In general, viruses can be detected through exome sequencing either by detecting viral genome sequences that have been integrated into the host DNA or by inadvertently capturing the viral sequence itself. The presence of HPV [8] and HBV [58,59] has been detected through analysis of exome-sequencing data. Tools for detecting viral sequences in exome-sequencing data have also been developed. For example, PathSeq was developed to identify viruses in sequencing data from human samples [7]. VirusSeq was developed to identify viral sequences using exome-sequencing or RNAseq data [8]. Most recently, ViralFusionSeq was developed [60] to discover viral integration events and to reconstruct fusion transcripts at single-base resolution. Theoretically, bacteria can also be detected in exome-sequencing data, provided they are present in the sample. For example, PathSeq [7] is designed to capture both bacterial and viral sequences.

One of the challenges associated with identifying viral sequences through exome-sequencing data is the rapid mutation rate of some viruses. DNA viruses have a mutation rate of between 10⁻⁶ and 10⁻⁸ mutations per base per generation, and RNA viruses have an even faster mutation rate of 10⁻³–10⁻⁵ per base per generation [61]. There are two possible solutions for identifying viral sequences with a high mutation frequency. First, the number of allowed mismatches per read can be increased. The typical read length in exome sequencing is 75–100 bp. The default number of mismatches allowed per read for most popular aligners, such as BWA [43] and Bowtie [62], is usually two. Allowing more mismatches during the viral genome alignment can alleviate the problem caused by the fast viral mutation rate. Second, a virus reference panel can be created that includes all known variations of a targeted virus. Although this method can increase the alignment time, it is more likely to be accurate than simply allowing more mismatches.
However, it does have the disadvantage of potentially failing to detect viral strains that have evolved significantly from the strains in the reference panel.

Another challenge associated with virus detection in exome-sequencing data is the potential homology between the reference human genome and viral genomes, similar to the problem of nuclear genome copies of the mitochondrial genome described in the previous section. One conservative approach to solving this problem is to use only reads unmapped to the human genome for the viral genome alignment.

The location of virus integration into the host genome may have a role in disease etiology [63–66]. However, identifying the sites of virus integration using exome-sequencing data is challenging. For paired-end read data, a single DNA fragment will have sequence reads on both ends. During alignment, discordant pairs can be detected in which one read is aligned to the viral genome whereas its mate is aligned to the human genome, a good indicator of a possible intervening integration site. To find the exact integration site, read-through reads (in which the breakpoint lies within a read) need to be examined. Existing structural variant detection tools, such as BreakDancer [67], can be used to detect integration sites if the viral genome reference is added to the human genome reference before alignment. VirusSeq [8] detects integration sites by first identifying discordant read pairs and then clustering the discordant read pairs that support the same integration event. By contrast, ViralFusionSeq [60] uses a more sophisticated model to detect breakpoints that support viral fusion. Many viral fusion sites have been identified through exome sequencing. For example, in a study of liver cancer, HBV integration was observed in 70 out of 81 liver cancer samples [58]. Furthermore, HBV integration sites have also been identified through exome sequencing in a separate liver study [59].

Virus detection through exome sequencing has several limitations. First is the obvious limitation that this method can only detect DNA viruses or RNA viruses that are reverse transcribed and have a DNA phase. To detect other RNA viruses, RNAseq technology needs to be used [68–72]. Second, it is highly dependent on the number of reads sequenced. If the depth of exome sequencing is low, the chance of detecting any virus also decreases. Finally, it is impractical for exome sequencing to detect any novel virus, or a virus with variants that have not been previously described. Nevertheless, there are successful examples of detection of viral genomes from exome sequencing, providing another example of the value of reconsidering off-target reads.
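The discordant-pair idea described in this section can be sketched in a few lines. The code below is a simplified illustration, not the implementation used by VirusSeq [8], ViralFusionSeq [60], or BreakDancer [67]. It assumes that reads were aligned to a combined human-plus-virus reference, that both mates of every pair are mapped, and that a pair with exactly one mate on a viral contig marks a candidate integration. The `ReadPair` type, the function names, the 500-bp window, the support threshold, and all coordinates are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple

Cluster = Tuple[str, int, int, int]  # (chrom, first_pos, last_pos, n_pairs)


@dataclass(frozen=True)
class ReadPair:
    """Alignment summary for one paired-end fragment (both mates mapped)."""
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int


def human_anchor(pair: ReadPair, viral_contigs: Set[str]) -> Optional[Tuple[str, int]]:
    """Return the human-side (chrom, pos) of a virus/human discordant pair.

    Pairs with both mates on human sequence, or both on viral sequence,
    say nothing about integration and yield None.
    """
    mate1_viral = pair.chrom1 in viral_contigs
    mate2_viral = pair.chrom2 in viral_contigs
    if mate1_viral == mate2_viral:
        return None
    return (pair.chrom2, pair.pos2) if mate1_viral else (pair.chrom1, pair.pos1)


def _summarize(anchors: List[Tuple[str, int]]) -> Cluster:
    return (anchors[0][0], anchors[0][1], anchors[-1][1], len(anchors))


def cluster_candidates(pairs: Iterable[ReadPair], viral_contigs: Set[str],
                       window: int = 500, min_support: int = 2) -> List[Cluster]:
    """Chain sorted human-side anchors that lie within `window` bp of the
    previous anchor, and keep chains supported by at least `min_support` pairs."""
    anchors = sorted(a for a in (human_anchor(p, viral_contigs) for p in pairs)
                     if a is not None)
    clusters: List[Cluster] = []
    current: List[Tuple[str, int]] = []
    for chrom, pos in anchors:
        # Start a new chain on a chromosome change or a gap larger than window.
        if current and (chrom != current[-1][0] or pos - current[-1][1] > window):
            if len(current) >= min_support:
                clusters.append(_summarize(current))
            current = []
        current.append((chrom, pos))
    if len(current) >= min_support:
        clusters.append(_summarize(current))
    return clusters


# Toy input with made-up coordinates; "HBV" stands for an assumed viral contig name.
pairs = [
    ReadPair("chr5", 1_000_100, "HBV", 1800),
    ReadPair("HBV", 1750, "chr5", 1_000_250),
    ReadPair("chr5", 1_000_400, "HBV", 1820),
    ReadPair("chr1", 5000, "chr1", 5300),   # ordinary human pair, ignored
    ReadPair("chr12", 777, "HBV", 100),     # single supporting pair, filtered out
    ReadPair("HBV", 10, "HBV", 300),        # free viral DNA, ignored
]
candidates = cluster_candidates(pairs, {"HBV"})
print(candidates)
```

A cluster found this way only brackets a candidate region. As noted above, locating the exact breakpoint still requires reads that span the junction.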
Concluding remarks

Exome-sequencing data are now becoming widely available for secondary uses through efforts to encourage data sharing, such as TCGA (currently 15,000 exomes) and the NHLBI Exome Sequencing Project (6500 exomes). As early as 2003, it was widely predicted that the price of whole-genome sequencing for the human genome would drop to under US$1000 [73–75]. However, with currently available technologies, achieving an average of 30× coverage in whole-genome sequencing still costs over US$5000 a sample, whereas exome sequencing at 30× coverage costs under US$500. There is always a possibility that an advance in technology will reduce the cost of whole-genome sequencing to a price comparable to that of exome sequencing. However, the extra cost associated with the analysis of whole-genome sequencing data is likely to remain significantly higher. The storage and processing time required for whole-genome sequencing data can be 10 to 20 times greater than for exome-sequencing data. Until these limitations of whole-genome sequencing cost and data storage are overcome, the growing amount of exome data available can be usefully mined for additional research purposes.

Another future development that could affect the types of secondary analysis we have outlined here is improvement in exome capture technology that eliminates or significantly reduces off-target reads. Exome capture technology has been continuously improving since it was introduced. However, the capture efficiency has increased only slightly over the years. Furthermore, the major reason for the increased capture efficiency has been the increased size of the capture regions rather than improvement of the capture technology itself. For example, the Agilent SureSelect v1 kit captured 37 Mb of the human genome, whereas the latest SureSelect v5 kit captures 50 Mb. Additionally, the output of sequencing instruments has also increased over the years.
The original Illumina GA II platform could output 20 million–25 million reads per lane. The newest Illumina HiSeq 2500 can produce 150 million–200 million reads per lane. Even after multiplexing three to four samples on a HiSeq lane, the number of reads sequenced per sample is still much higher than that achieved using the GA II machine. Thus, even though the percentage of reads not mapped to target regions might decrease, the raw number of reads not mapped to the target regions might increase because of the increase in machine throughput, suggesting that exome-sequencing data will continue to be good candidates for additional data mining despite technological improvements.

Several tools are now available to mine these data for the ‘lost treasure’ buried in off-target reads. We have summarized here the possibilities and challenges in studying variants outside of the targeted exonic regions. These include mitochondrial variants, as well as viral genomes and virus–host integration sites. However, we note that another possibility for some of the unmapped reads is that they may still belong to the human genome, but may come from genome regions not covered by the current human genome reference, GRCh37. With GRCh38 (scheduled to be released during late 2013), it is likely that some of the previously unmapped reads will be mapped to the new human reference. There are also possibilities that have yet to be discovered, making the study of unmapped reads a potentially fruitful opportunity.

Although we are encouraging researchers to conduct additional data mining using existing data, we would also like to promote good study design. If the goal of the study is to survey all SNPs, then a whole-genome study should be used. If the goal of the study is to examine the mtDNA sequence, then mitochondria-targeted sequencing should be used, and if the goal is to detect the presence of viruses, then a virus-specific method should be used.
Exome sequencing is a powerful tool, but it is not designed specifically for the additional targets described in this review. However, to get the fullest use of this low-cost sequencing technology, and of the massive amount of exome sequences currently publicly available, we should not ignore the unexpected DNA reads, which can comprise as much as half of the data produced by exome-sequencing methods. The off-target reads must be subject to stringent quality control and, thus, we recommend an additional validation phase for all important findings observed through off-target reads whenever possible, including the use of targeted resequencing.

References
1 Ng, S.B. et al. (2010) Exome sequencing identifies the cause of a mendelian disorder. Nat. Genet. 42, 30–35
2 Durbin, R.M. et al. (2010) A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073
3 Fu, W. et al. (2013) Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature 493, 216–220
4 Sulonen, A.M. et al. (2011) Comparison of solution-based exome capture methods for next generation sequencing. Genome Biol. 12, R94
5 Guo, Y. et al. (2012) Exome sequencing generates high quality data in non-target regions. BMC Genomics 13, 194
6 Asan et al. (2011) Comprehensive comparison of three commercial human whole-exome capture platforms. Genome Biol. 12, R95
7 Kostic, A.D. et al. (2011) PathSeq: software to identify or discover microbes by deep sequencing of human tissue. Nat. Biotechnol. 29, 393–396
8 Chen, Y. et al. (2013) VirusSeq: software to identify viruses and their integration sites using next-generation sequencing of human cancer tissue. Bioinformatics 29, 266–267
9 Larman, T.C. et al. (2012) Spectrum of somatic mitochondrial mutations in five cancers. Proc. Natl. Acad. Sci. U.S.A. 109, 14087–14091
10 Picardi, E. and Pesole, G. (2012) Mitochondrial genomes gleaned from human whole-exome sequencing. Nat. Methods 9, 523–524
11 Yi, X. et al. (2010) Sequencing of 50 human exomes reveals adaptation to high altitude. Science 329, 75–78
12 Djebali, S. et al. (2012) Landscape of transcription in human cells. Nature 489, 101–108
13 Dunham, I. et al. (2012) An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74
14 Harrow, J. et al. (2012) GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 22, 1760–1774
15 Pei, B. et al. (2012) The GENCODE pseudogene resource. Genome Biol. 13, R51
16 Alberobello, A.T. et al. (2011) An intronic SNP in the thyroid hormone receptor beta gene is associated with pituitary cell-specific overexpression of a mutant thyroid hormone receptor beta2 (R338W) in the index case of pituitary-selective resistance to thyroid hormone. J. Transl. Med. 9, 144
17 Kawase, T. et al. (2007) Alternative splicing due to an intronic SNP in HMSD generates a novel minor histocompatibility antigen. Blood 110, 1055–1063
18 Moyer, R.A. et al. (2011) Intronic polymorphisms affecting alternative splicing of human dopamine D2 receptor are associated with cocaine abuse. Neuropsychopharmacology 36, 753–762
19 Rearick, D. et al. (2011) Critical association of ncRNA with introns. Nucleic Acids Res. 39, 2357–2366
20 Huang, F.W. et al. (2013) Highly recurrent TERT promoter mutations in human melanoma. Science 339, 957–959
21 Guo, Y. et al. (2012) The effect of strand bias in Illumina short-read sequencing data. BMC Genomics 13, 666
22 DePristo, M.A. et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491–498
23 Andrews, R.M. et al. (1999) Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNA. Nat. Genet. 23, 147
24 Verma, M. and Kumar, D. (2007) Application of mitochondrial genome information in cancer epidemiology. Clin. Chim. Acta 383, 41–50
25 Fernandez-Vizarra, E. et al. (2007) Impaired complex III assembly associated with BCS1L gene mutations in isolated mitochondrial encephalopathy. Hum. Mol. Genet. 16, 1241–1252
26 Lemasters, J.J. et al. (1999) Mitochondrial dysfunction in the pathogenesis of necrotic and apoptotic cell death. J. Bioenerg. Biomembr. 31, 305–319
27 Wallace, K.B. and Starkov, A.A. (2000) Mitochondrial targets of drug toxicity. Annu. Rev. Pharmacol. Toxicol. 40, 353–388
28 Modica-Napolitano, J.S. and Singh, K.K. (2004) Mitochondrial dysfunction in cancer. Mitochondrion 4, 755–762
29 Chen, E.I. (2012) Mitochondrial dysfunction and cancer metastasis. J. Bioenerg. Biomembr. 44, 619–622
30 Soares, P. et al. (2012) The Expansion of mtDNA Haplogroup L3 within and out of Africa. Mol. Biol. Evol. 29, 915–927
31 Yao, Y.G. et al. (2002) Phylogeographic differentiation of mitochondrial DNA in Han Chinese. Am. J. Hum. Genet. 70, 635–651
32 Bandelt, H.J. et al. (2003) Identification of Native American founder mtDNAs through the analysis of complete mtDNA sequences: some caveats. Ann. Hum. Genet. 67, 512–524
33 Kong, Q.P. et al. (2003) Phylogeny of east Asian mitochondrial DNA lineages inferred from complete sequences. Am. J. Hum. Genet. 73, 671–676
34 Bogenhagen, D. and Clayton, D.A. (1974) The number of mitochondrial deoxyribonucleic acid genomes in mouse L and human HeLa cells. Quantitative isolation of mitochondrial deoxyribonucleic acid. J. Biol. Chem. 249, 7991–7995
35 Guo, Y. et al. (2012) The use of next generation sequencing technology to study the effect of radiation therapy on mitochondrial DNA mutation. Mutat. Res. 744, 154–160
36 Tang, S. and Huang, T. (2010) Characterization of mitochondrial DNA heteroplasmy using a parallel sequencing system. Biotechniques 48, 287–296
37 He, Y. et al. (2010) Heteroplasmic mitochondrial DNA mutations in normal and tumour cells. Nature 464, 610–614
38 Ameur, A. et al. (2011) Ultra-deep sequencing of mouse mitochondrial DNA: mutational patterns and their origins. PLoS Genet. 7, e1002028
39 Falk, M.J. and Sondheimer, N. (2010) Mitochondrial genetic diseases. Curr. Opin. Pediatr. 22, 711–716
40 Cancer Genome Atlas Network (2012) Comprehensive molecular portraits of human breast tumours. Nature 490, 61–70
41 Hazkani-Covo, E. et al. (2010) Molecular poltergeists: mitochondrial DNA copies (numts) in sequenced nuclear genomes. PLoS Genet. 6, e1000834
42 Li, M. et al. (2012) Fidelity of capture-enrichment for mtDNA genome sequencing: influence of NUMTs. Nucleic Acids Res. 40, e137
43 Li, H. and Durbin, R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760
44 Guo, Y. et al. (2013) MitoSeek: extracting mitochondria information and performing high throughput mitochondria sequencing analysis. Bioinformatics 29, 1210–1211
45 Shen, J. et al. (2010) Mitochondrial copy number and risk of breast cancer: a pilot study. Mitochondrion 10, 62–68
46 Yu, M. et al. (2007) Reduced mitochondrial DNA copy number is correlated with tumor progression and prognosis in Chinese breast cancer patients. IUBMB Life 59, 450–457
47 Tseng, L.M. et al. (2006) Mitochondrial DNA mutations and mitochondrial DNA depletion in breast cancer. Genes Chromosomes Cancer 45, 629–638
48 Bai, R.K. et al. (2011) Mitochondrial DNA content varies with pathological characteristics of breast cancer. J. Oncol. 2011, 496189
49 Bhat, H.K. and Epelboym, I. (2004) Quantitative analysis of total mitochondrial DNA: competitive polymerase chain reaction versus real-time polymerase chain reaction. J. Biochem. Mol. Toxicol. 18, 180–186
50 Castle, J.C. et al. (2010) DNA copy number, including telomeres and mitochondria, assayed using next-generation sequencing. BMC Genomics 11, 244
51 Parkin, D.M. (2006) The global health burden of infection-associated cancers in the year 2002. Int. J. Cancer 118, 3030–3044
52 Morissette, G. and Flamand, L. (2010) Herpesviruses and chromosomal integration. J. Virol. 84, 12100–12109
53 Barzon, L. et al. (2011) Applications of next-generation sequencing technologies to diagnostic virology. Int. J. Mol. Sci. 12, 7861–7884
54 Radford, A.D. et al. (2012) Application of next-generation sequencing technologies in virology. J. Gen. Virol. 93, 1853–1868
55 Chevaliez, S. et al. (2012) New virologic tools for management of chronic hepatitis B and C. Gastroenterology 142, 1303–1313
56 Li, L. and Delwart, E. (2011) From orphan virus to pathogen: the path to the clinical lab. Curr. Opin. Virol. 1, 282–288
57 Capobianchi, M.R. et al. (2013) Next-generation sequencing technology in clinical virology. Clin. Microbiol. Infect. 19, 15–22
58 Sung, W.K. et al. (2012) Genome-wide survey of recurrent HBV integration in hepatocellular carcinoma. Nat. Genet. 44, 765–769
59 Jiang, Z. et al. (2012) The effects of hepatitis B virus integration into the genomes of hepatocellular carcinoma patients. Genome Res. 22, 593–601
60 Li, J.W. et al. (2013) ViralFusionSeq: accurately discover viral integration events and reconstruct fusion transcripts at single-base resolution. Bioinformatics 29, 649–651
61 Drake, J.W. et al. (1998) Rates of spontaneous mutation. Genetics 148, 1667–1686
62 Langmead, B. et al. (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25
63 Gozuacik, D. et al. (2001) Identification of human cancer-related genes by naturally occurring Hepatitis B Virus DNA tagging. Oncogene 20, 6233–6240
64 Mason, W.S. et al. (2010) Clonal expansion of normal-appearing human hepatocytes during chronic hepatitis B virus infection. J. Virol. 84, 8308–8315
65 Murakami, Y. et al. (2005) Large scaled analysis of hepatitis B virus (HBV) DNA integration in HBV related hepatocellular carcinomas. Gut 54, 1162–1168
66 Saigo, K. et al. (2008) Integration of hepatitis B virus DNA into the myeloid/lymphoid or mixed-lineage leukemia (MLL4) gene and rearrangements of MLL4 in human hepatocellular carcinoma. Hum. Mutat. 29, 703–708
67 Chen, K. et al. (2009) BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat. Methods 6, 677–681
68 Palacios, G. et al. (2008) A new arenavirus in a cluster of fatal transplant-associated diseases. N. Engl. J. Med. 358, 991–998
69 Nakamura, S. et al. (2009) Direct metagenomic detection of viral pathogens in nasal and fecal specimens using an unbiased high-throughput sequencing approach. PLoS ONE 4, e4219
70 Quan, P.L. et al. (2010) Astrovirus encephalitis in boy with X-linked agammaglobulinemia. Emerg. Infect. Dis. 16, 918–925
71 Briese, T. et al. (2009) Genetic detection and characterization of Lujo virus, a new hemorrhagic fever-associated arenavirus from southern Africa. PLoS Pathog. 5, e1000455
72 Isakov, O. et al. (2011) Pathogen detection using short-RNA deep sequencing subtraction and assembly. Bioinformatics 27, 2027–2030
73 Robertson, J.A. (2003) The $1000 genome: ethical and legal issues in whole genome sequencing of individuals. Am. J. Bioeth. 3, 35–42
74 Mardis, E.R. (2006) Anticipating the 1,000 dollar genome. Genome Biol. 7, 112
75 Bennett, S.T. et al. (2005) Toward the 1,000 dollars human genome. Pharmacogenomics 6, 373–382
The role of AUTS2 in neurodevelopment and human evolution

Nir Oksenberg and Nadav Ahituv

Department of Bioengineering and Therapeutic Sciences, and Institute for Human Genetics, University of California, San Francisco (UCSF), 1550 4th Street, San Francisco, CA 94158, USA

The autism susceptibility candidate 2 (AUTS2) gene is associated with multiple neurological diseases, including autism, and has been implicated as an important gene in human-specific evolution. Recent functional analysis of this gene has revealed a potential role in neuronal development. Here, we review the literature regarding AUTS2, including its discovery, expression, association with autism and other neurological and non-neurological traits, implication in human evolution, function, regulation, and genetic pathways. Through progress in clinical genomic analysis, the medical importance of this gene is becoming more apparent, as highlighted in this review, but more work needs to be done to discover the precise function and the genetic pathways associated with AUTS2.

Neurodevelopmental disorders

Neurodevelopmental disorders are characterized by motor, speech, cognitive, and behavioral dysfunctions caused by impairment in growth and development of the central nervous system (CNS). Neurodevelopmental disorders encompass, but are not limited to, intellectual disability (ID), developmental delay (DD), and autism spectrum disorders (ASDs) [1]. ASDs are known as pervasive developmental disorders that are common (1/88 in the USA) [2] and highly heritable [3]. ASDs are characterized by variable deficits in social communication, language, and restrictive and repetitive behaviors, and present as a wide spectrum of phenotypes [4].
Other neurological abnormalities, including ID, DD, epilepsy, sensory and motor abnormalities, gastrointestinal phenotypes, developmental regression, sleep disturbance, mood disorders, conduct disorders, aggression, and attention deficit hyperactivity disorder (ADHD), are also frequently associated with ASD [4]. Despite the heritability of these disorders, no single gene has been identified as causative for ASD alone. Rather, several different genes have been implicated in these disorders, containing either common variants with small effects or rare variants with larger consequences [5]. Over the years, studies examining individual patients, together with advances in sequencing technologies that have allowed the examination of a large number of individuals, have produced a myriad of new ASD, ID, and DD candidate genes, including AUTS2.

The discovery of AUTS2

AUTS2 was first identified in 2002 when it was found to be disrupted as a result of a balanced translocation in a pair of monozygotic (MZ) twins with ASD [6]. AUTS2 was mapped to 7q11, spans 1.2 Mb, and is approximately 340 kb upstream from the Williams–Beuren syndrome (WBS) critical region, a region that, when deleted, causes a neurodevelopmental disorder characterized by a distinctive ‘elfin’ facial appearance, a cheerful demeanor, developmental delay, strong language skills, and cardiovascular problems [7]. The AUTS2 protein sequence is highly conserved, with 62% amino acid conservation between humans and zebrafish [8]. It contains regions of homology to other proteins, such as the dwarfin family consensus sequence, human topoisomerase, and fibrosin (FBRS), a fibroblast growth factor [6]. In addition, the Drosophila gene tay has limited similarity to AUTS2. tay mutants have reduced walking speed and activity, thought to be associated with structural defects in the protocerebral bridge [9].
Sequence analysis of AUTS2 identified no membrane-spanning domains, but identified two proline-rich domains and a predicted PY (ProTyr) motif (PPPY) at amino acids 515–519 (Figure 1) [6]. The PY motif is a potential WW-domain-binding region that is involved in protein–protein interactions and is present in the activation domain of various transcription factors, suggesting that AUTS2 may be involved in transcriptional regulation [8]. Other predicted protein motifs include several cAMP and cGMP-dependent protein kinase phosphorylation sites, and putative N-glycosylation sites [6]. In addition, AUTS2 has eight CAC (His) repeats (Figure 1) [6], which have been shown to be associated with localization at nuclear speckles [10], subnuclear structures where components of the RNA splicing machinery are stored and assembled [11]. Evidence of nuclear localization sequences as well as several predicted protein–protein interaction domains (SH2 and SH3) were also observed for this protein (Figure 1). No evidence was found for any signal peptide in AUTS2, indicating that it is not secreted or exposed to the cellular membrane [12]. No DNA-binding domains have been identified. Taken together, sequence analysis has revealed limited insight into the function of this gene.

Corresponding author: Ahituv, N. Keywords: AUTS2; autism; neurodevelopment; human evolution. 0168-9525/$ – see front matter © 2013 Elsevier Ltd. All rights reserved.

AUTS2 is a nuclear protein that is expressed in the CNS

Multiple reports have characterized the expression of AUTS2 in different organisms, concluding that it is primarily expressed in the brain. Northern blot shows strong AUTS2 expression in human fetal brain in the frontal, parietal, and temporal regions, but not in the occipital lobe. Expression was also identified in the skeletal muscle and kidney, with lower expression in the placenta, lung, and leukocytes [6]. In human post-mortem fetal brain, AUTS2 mRNA expression was found in the telencephalon (uniformly), ganglionic eminence, cerebellum anlagen, and, more weakly, in the medulla oblongata at 8 weeks. AUTS2 was also found to be strongly expressed in the cortical plate and ventricular zone. Fetal (23 weeks) human brains showed AUTS2 expression in the dentate gyrus, CA1 and CA3 pyramidal cell subregions, the ganglionic eminence, caudate nucleus, and putamen nuclei [13]. AUTS2 was also shown to be expressed in the neocortex and prefrontal cortex up to the late mid-fetal stage [14]. Gene expression profiles from 10 human ocular tissues found AUTS2 to be the 20th highest expressed gene in the sclera [15]. Sequencing of total RNA from human brain and liver found a large fraction of reads (up to 40%) to be within introns [16]. The authors identified enrichment of intronic RNA in brain tissues, particularly for genes involved in axonal growth and synaptic transmission. AUTS2 was among the 10 genes with the highest intronic RNA score in fetal brain. Three of the top 10 genes, neurexin 1 (NRXN1), protocadherin 9 (PCDH9), and methionine sulfoxide reductase A (MSRA), have also been implicated in autism.
In addition, for long introns, including the first half of AUTS2, there is a 50 to 30 slope in read coverage, with significantly higher levels of RNA at the 50 end. The authors reason that, in the fetal brain, intronic RNAs are subjected to brain-specific regulatory pathways that regulate alternative splicing programs to control neuronal development [16]. A detailed analysis of Auts2 mRNA and protein expres- sion in the developing mouse brain was published in 2010 [12]. The authors found that Auts2 is expressed in the developing cerebral cortex and cerebellum, and is located in the nuclei of neurons and some neuronal progenitors (Table 1). Auts2 expression was identified in numerous neuronal cell types, including glutamatergic neurons AUTS2 Dwarfin homology region (326–453) Fibrosin homology region (645–798) Proline-rich domain (288–471, 545–646) Serine-rich domain (383–410) TrinucleoƟde (H) repeat (1126–1133) PY moƟf (515–519) Human topoisomerase homology region (880–920) Nuclear localizaƟon sequence (11–27, 70–79, 120–141) Predicted cAMP and cGMP-dependent protein kinase phosphorylaƟon site (13–16, 77–80, 116–119, 832–835, 849–852, 975–978, 1235–1238) Y Predicted SH2 interacƟon domain (Y971) N PutaƟve N-glycosylaƟon site (395–398, 785–788, 955–958, 1009–1012) P Predicted SH3 interacƟon domain, (P67, P72, P73, P266, P332, P361, P364, P467, P468, P471, P638, P806, P1234) Dwarfin homology region (326–453) Fibrosin homology region (645–798) Proline-rich domain (288–471, 545–646) Serine-rich domain (383–410) TrinucleoƟde (H) repeat (1126–1133) PY moƟf (515–519) Human topoisomerase homology region (880–920) Nuclear localizaƟon sequence (11–27, 70–79, 120–141) Predicted cAMP and cGMP-dependent protein kinase phosphorylaƟon site (13–16, 77–80, 116–119, 832–835, 849–852, 975–978, 1235–1238) N PutaƟve N-glycosylaƟon site (395–398, 785–788, 955–958, 1009–1012) P Predicted SH3 interacƟon domain N N N NPPP P P PP PPP P P PY TRENDS in Genetics Figure 1. 
Schematic of the AUTS2 protein. AUTS2 (1259 amino acids) is shown as a gray bar (individual amino acids in single-letter code). The locations of predicted domains, motifs, regions of homology, and other characterized sequences are shown below and within the protein. Numbers in parentheses represent the amino acid location. The figure is based on predicted features in [6,12].

Table 1. Auts2 expression in the developing mouse brain (a)
Timepoint (b) | Auts2 expression
E11 | mRNA barely detectable.
E12–13 | Colocalization with Tbr1 in the cortical preplate. Tbr1 is a transcription factor specific for postmitotic projection neurons.
E12–14 | High expression in the developing cortex, thalamus, and cerebellum. There is continued expression in these regions throughout development, but levels fluctuate and are found in gradients. Different markers show Auts2 expression in multiple neuronal subtypes in the developing cortex.
E14 | Expression in the hippocampal primordium. Transient expression in the locus ceruleus and vestibular nuclei.
E16 | Expression in the cerebral cortex is now a gradient of high rostral to low caudal expression.
E19 | Highest expression in inferior and superior colliculi and the pretectum.
P0 | Auts2 expression becomes progressively more superficial in the frontal cortex. Coexpression with Tbr1 becomes rare as Tbr1 becomes more selective to layer 6.
E16–P21 | Auts2 is expressed mostly in the frontal cortex, hippocampus, and the cerebellum. In addition, high expression levels were detected in the developing dorsal thalamus, olfactory bulb, inferior colliculus and the substantia nigra.
P21 | Expression in developing thalamic areas, including the anterior thalamic nuclei and in ventrolateral/ventromedial nuclei. Auts2 is restricted to superficial layers in frontal cortex. Auts2 is expressed throughout the subgranular zone and the granule cell layer of the hippocampus.
(a) Summary based on [12]. (b) E, embryonic day; P, postnatal day.

Review Trends in Genetics October 2013, Vol.
29, No. 10 601
  • 49. (cortex, olfactory bulb, hippocampus), GABAergic neurons (Purkinje cells), and tyrosine hydroxylase (TH)-positive dopaminergic neurons (substantia nigra and ventral tegmental area). Colocalization of Auts2 with only a subset of eomesodermin (Tbr2) and paired box 6 (Pax6)-positive cells was demonstrated in the ventricular and subventricular zones, suggesting that Auts2 might be expressed in the transition between radial glial and intermediate progenitors [12]. It was also suggested that Auts2 and T-box brain 1 (Tbr1) are coexpressed mostly in glutamatergic neuron populations in the forebrain, and other transcription factors likely influence expression of Auts2 in other regions. The report also notes that Auts2 could be expressed in a transient phase of neuronal maturation or differentiation in the cortex [12]. In zebrafish, using wholemount in situ hybridization, auts2 was shown to be expressed in the brain at 24, 48, 72 and 120 hours post-fertilization (hpf). At 48 hpf, auts2 is also expressed in the pectoral fin. From 24–130 hpf, auts2 is also weakly expressed in the eye [17]. In summary, AUTS2 has been shown to be a nuclear protein that is primarily expressed in the brain in various cell types as well as in regions implicated in ASD, such as the neocortex.

AUTS2 and ASD, ID, and DD

AUTS2 has been repeatedly implicated as an ASD candidate gene in recent years. Following the initial finding of an AUTS2 translocation in twins with autism [6], over 50 unrelated individuals with ASD, ID, or DD were identified with distinct structural variants disrupting the AUTS2 region in numerous different reports (Figure 2) [8,18–30]. Some of the structural variants are exclusively non-coding, suggesting that improper regulation and subsequent expression of AUTS2 could be involved in the progression of the disorder [17]. In addition to ASD, ID, and DD, many of these individuals also have other phenotypes, including epilepsy, brain malformations, or dysmorphic features.
One group described an ‘AUTS2 syndrome’ in individuals with varying severity of growth and feeding problems, neurodevelopmental features, neurological disorders, dysmorphic features, skeletal abnormalities, and congenital malformations [26]. The spectrum of phenotypes observed in individuals with AUTS2 mutations is consistent with the wide range of ASD phenotypes. This suggests that AUTS2 is not associated with a specific subtype of ASD. It has also been noted that dysmorphic features were more pronounced in individuals with 3′ AUTS2 deletions, where most of the coding region resides [26]. However, copy-number variations (CNVs) at the AUTS2 locus have also been observed in unaffected individuals, indicating that structural rearrangements are tolerated in some cases [19,31]. This suggests that disruptions in AUTS2 may lead to neurodevelopmental disorders by being one of multiple genomic ‘hits’. The large number of independent publications implicating AUTS2 in ASD, ID, or DD provides strong evidence for its involvement in these disorders. It is worth noting, however, that no publication has shown single base-pair variants in the AUTS2 locus associated with ASD, despite numerous ASD-related exome sequencing studies [32–35]. The observation that AUTS2 variants are mostly CNVs may be due to the susceptibility of this region to

[Figure 2 labels: 250 kb scale bar; Human–Neanderthal sweep [68]; HACNS369 [67]; HACNS174 [67]; HAR31 [66]; alcohol consumption SNP [42]; structural variants grouped as "ASD, ID, and/or DD" and "Other neurological phenotypes", the latter labeled ADHD [49]; dyslexia [23]; LD, motor delay [24]; failure to thrive, macrocephaly [24]; epilepsy [48]; SD, MCA [24]; behavior problems [24]; microcephaly, DF [24]; ataxia [24]]

Figure 2. Schematic of the AUTS2 genomic region.
Numbers to the left of the lines correspond to reference numbers. Human accelerated sequences are shown as blue lines above the gene [66–68]. Structural variants [6,8,18–26,28–30,48,49] are represented as colored lines (red, deletion; orange, inversion; green, duplication; purple, translocation). Single-nucleotide polymorphisms (SNPs) are shown as magenta stars. rs6943555 is associated with alcohol consumption [42]. SNPs in [46,47] are associated with bipolar disorder. SNPs in [46] are reported to be in strong linkage disequilibrium with each other. Arrows in bars signify that the structural variant extends past the gene in that direction. Exons are depicted as light-blue rectangles, as defined by the RefSeq genes track in the University of California, Santa Cruz (UCSC) Genome Browser. DD, developmental delay; DF, dysmorphic features; HACNS, human accelerated conserved non-coding sequence; HAR, human accelerated region; ID, intellectual disability; LD, language disability; MCA, multiple congenital anomalies; SD, seizure disorder. Figure adapted from [17]. Review Trends in Genetics October 2013, Vol. 29, No. 10 602
  • 50. chromosomal breakpoints. A 2011 report showed that the offspring of older male mice have an increased risk of de novo CNVs in specific locations, including the Auts2 locus [36]. Another report found that hydroxyurea, a ribonucleotide reductase inhibitor, as well as aphidicolin, a DNA polymerase inhibitor, induce a high frequency of de novo CNVs in cultured human cells, and found a clustering of CNVs in AUTS2 [37]. Aphidicolin also induced CNV formation in the Auts2 locus in non-homologous end-joining deficient mouse embryonic stem cells [38]. Because the AUTS2 locus is a hotspot for CNVs, and individuals with ASD generally carry more CNVs than their unaffected siblings [39], it remains to be investigated whether the high number of ASD-associated CNVs around AUTS2 is consequential rather than merely a reflection of the susceptibility of this locus to CNVs. There is also the possibility that these CNVs affect regulatory regions of other genes, including the nearby WBS critical region. In 2013, a genome-wide analysis of DNA methylation was published on ASD discordant and concordant monozygotic twins. A region in the AUTS2 promoter (chr7: 68701907; hg18) was the 42nd most differentially methylated CpG site in the genome, suggesting that not only sequence variation but also epigenetic changes to the AUTS2 locus could be involved in the development of ASD-related traits [40]. Significant DNA methylation differences were often observed near other genes that have been previously implicated in ASD, including methyl-CpG binding domain protein 4 (MBD4) and microtubule-associated protein 2 (MAP2). The authors cautioned, however, that it is difficult to draw conclusions about the causality of the differentially methylated sites due to small sample size, lack of corresponding RNA expression data, the use of whole blood rather than brain tissue, and potential epigenetic effects due to medication [40].
Combined, the evidence for a causative role of AUTS2 in DD and ID is convincing. However, for ASD the evidence presented so far suggests that disruptions in AUTS2 can play a causative role, but to demonstrate causality more research needs to be done on cohorts of well-defined ASD patients and on the functional consequence of these disruptions.

AUTS2 and other neurological conditions

In addition to ASD, ID, and DD, AUTS2 has been implicated in other neurological disorders. Some of these disorders, such as epilepsy, have been shown to be linked to ASD. However, other AUTS2-associated phenotypes are ASD-independent. AUTS2 expression was found to be significantly associated with nicotine dependence, cannabis dependence, and antisocial personality disorder, although this study had a small number of cases and would need to be repeated with larger cohorts [41]. The study also suggested, although the result did not reach significance, that AUTS2 expression is implicated in alcohol dependence [41]. In 2011 a genome-wide association meta-analysis found an AUTS2 non-coding single-nucleotide polymorphism (SNP), rs6943555, to be significantly associated with alcohol consumption [42]. The authors also reported increased AUTS2 expression in carriers of the minor A allele of rs6943555 compared with the T allele in 96 human prefrontal cortex samples. In addition, they identified significant differences in expression of Auts2 in whole-brain extracts of mice with differences in voluntary alcohol consumption. The authors also showed that downregulation of tay, which has sequence similarity to AUTS2, caused reduction in alcohol sensitivity in Drosophila [42]. Also implicating AUTS2 in drug dependence was a 2011 study showing that AUTS2 has a 3.01-fold change (downregulation) between 19 male heroin-dependent individuals and 20 controls in lymphoblastoid cell lines [43].
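The fold changes quoted in this section (for example, the 3.01-fold downregulation of AUTS2 reported in [43]) are ratios of average expression between two groups. As a minimal sketch of that arithmetic only, and not of the normalization or statistics used in [43], which are not described here, the Python snippet below computes a fold change and its log2 value from made-up expression values. The function name `fold_change` and all numbers are hypothetical, and reading a "k-fold downregulation" as a control-to-case ratio of k is an assumed convention.

```python
import math
from statistics import mean


def fold_change(case_values, control_values):
    """Return (ratio, log2_ratio) of mean case expression over mean control expression."""
    case_mean = mean(case_values)
    control_mean = mean(control_values)
    if case_mean <= 0 or control_mean <= 0:
        raise ValueError("expression means must be positive")
    ratio = case_mean / control_mean
    return ratio, math.log2(ratio)


# Hypothetical, made-up relative expression values (not data from any cited study).
cases = [2.0, 2.0, 2.0]
controls = [8.0, 8.0, 8.0]

ratio, log2_ratio = fold_change(cases, controls)

# Assumed convention: a ratio below 1 is reported as a k-fold downregulation, k = 1 / ratio.
downregulation_fold = 1 / ratio
```

With these illustrative inputs the case mean is one quarter of the control mean, which under the assumed convention would be described as a 4-fold downregulation. Reporting the log2 ratio alongside the raw ratio makes increases and decreases of equal size symmetric around zero.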
A follow-up study compared AUTS2 transcript levels of lymphoblastoid cell lines between 124 heroin-dependent and 116 control males using quantitative PCR, and found that average transcript levels of AUTS2 in the heroin-dependent group were significantly lower than in controls. They also found that AA homozygotes for rs6943555 were significantly over-represented in the heroin-dependent subjects [44]. Taken together, these reports show strong evidence for AUTS2 involvement in addiction and dependence. In addition, the AUTS2 locus has been shown to be implicated or altered in individuals with schizoaffective disorder [45], bipolar disorder [46,47], epilepsy [48], ADHD [49], differential processing speed [50], suicidal tendencies under the influence of alcohol [51], and dyslexia [23], either through CNV or genome-wide association studies. A 2012 article sequenced balanced chromosomal abnormalities in patients with neurodevelopmental disorders, and found the AUTS2 locus to be perturbed in individuals with microcephaly, macrocephaly, ataxia, visual impairment, language disability, seizure disorder, dysmorphic features, behavioral problems, motor delay, or Rubinstein–Taybi syndrome [24]. It could be that the observation that most cases of AUTS2 structural variants are associated with ASD is attributed to more individuals with ASD being tested in this locus than patients with other neurological disorders, thereby leading to an underestimate in the link

Box 1. AUTS2 and non-neurological disorders and traits

A few reports have implicated AUTS2 in non-neurological disorders and traits. In 2004, 18 cases of childhood hyperdiploid acute lymphoblastic leukemia (ALL) were examined to identify the relationship between extra copies of chromosomes and increased gene expression. The authors identified multiple regions with increased expression that correlated poorly or not at all with the presence of extra copies of chromosomes, including 7q11.2.
AUTS2 showed consistently higher expression levels in the cDNA samples of patients than in normal mononuclear cells, possibly implicating the gene in ALL [69]. In 2008 it was reported that paired box 5 (PAX5) can be rearranged with a variety of partners, including AUTS2 (one case) in pediatric ALL [70]. Two years later a second case of PAX5– AUTS2 fusion was identified in pediatric ALL [71]. In 2012, the third case of PAX5–AUTS2 fusion was identified in a patient with pediatric ALL, providing additional evidence that PAX5–AUTS2 is a recurring gene fusion in ALL [72]. Two of the three PAX5–AUTS2 cases had CNS diseases either at the time of diagnosis or relapse [72]. Individual reports, some of which identify single patients, have also implicated the AUTS2 locus in the aging of human skin [73], lung adenocarcinoma [74], lethal prostate cancer [75], the number of corpora lutea in pigs [76], early-onset androgenetic alopecia [77], and metastatic non-seminomatous testicular cancer [78]. Despite several reports suggesting a role for AUTS2 in non-neurological disorders and traits, disruption of AUTS2 is most often reported to be associated with neurological phenotypes. Review Trends in Genetics October 2013, Vol. 29, No. 10 603
  • 51. between AUTS2 and other neurological phenotypes. Taken together, these observations suggest that AUTS2 dysfunction is not restricted to ASD, DD, or ID, but is instead involved in a wide range of neurological disorders. In addition, a few studies implicate AUTS2 in non-neurological disorders and traits (Box 1).

The function and regulation of AUTS2

Despite the many articles linking AUTS2 to human disease and other traits, few papers have been published describing the function of the gene. In 2013, morpholino knockdowns of auts2 were performed in zebrafish by two different groups [17,26]. The observed phenotypes are summarized in Figure 3 and Table 2. Using HuC (Hu antigen C), a neuronal marker, both groups observed a decrease in neuronal cells in the brain (Figure 3B). Increased apoptosis and cell proliferation in the brain were reported, and it was noted that this observation could be a result of morphant cells failing to differentiate into mature neurons, which matches the HuC results [17]. Although increased cell proliferation was observed in one study [17], another study described decreased cell proliferation [26]. The differences in this phenotype could be due to differences in the stains used (proliferating cell nuclear antigen, PCNA, which marks cells in early G1- and S-phase versus phosphohistone-H3, a marker of cells in G2 and M phase). Both reports, however, found that auts2 knockdown cells show more replicating DNA, but fewer cells dividing into daughter cells. The craniofacial phenotype of the morphant fish was also characterized in one of the studies, finding that they have micrognathia (undersized jaw) and retrognathia (receded jaw) (Figure 3C) [26]. Given that migrating neural crest cells play an important role in craniofacial development [52], it is possible that this phenotype is a result of defects in neuronal cell development.
In addition, less movement was reported in morphant fish, and this could be caused by fewer motor neuron cell bodies in the spinal cord, together with improperly angled and weaker projections, and/or fewer sensory neurons, both of which were observed in morphant fish [17]. Although one group observed overall stunted development [17] (Figure 3A), the other reported a phenotype restricted to the brain and jaw [26]. A potential cause of the difference in this phenotype, alongside the differences in cell proliferation phenotypes, could be the use of different morpholinos for these assays: an auts2 translational morpholino [17] versus splicing morpholinos [26]. Both groups were able to rescue the morphant phenotype by injecting full-length human AUTS2 mRNA together with the morpholino [17,26]. The morphant phenotype was also rescued by injecting the shorter C-terminal isoform of AUTS2, suggesting that the final nine exons of AUTS2 contain the crucial region of the gene, at least for the dysmorphic phenotype observed in knockdown fish. This is in line with the observation that dysmorphic features were more pronounced in individuals with 3′ AUTS2 deletions [26]. The zebrafish knockdown phenotypes appear to be an overall neurodevelopment defect, making it difficult to truly parse out the function

[Figure 3 panels: (A) Wholemount, 48 hpf; (B) HuC–GFP, 48 hpf; (C) Alcian blue, 120 hpf. Columns: Control; auts2 Morpholino. Labels: ot, ce, ret, ch, Mk]

Figure 3. auts2 zebrafish knockdown phenotype. (A) At 48 hours post-fertilization (hpf), fish injected with a 5 bp mismatch auts2 morpholino (MO) control have a similar morphology to wild type fish, whereas fish injected with a corresponding translational MO display a stunted developmental phenotype that includes a smaller head, eyes, body, and fins.
(B) At 48 hpf, HuC–GFP fish injected with a 5 bp mismatch auts2 control MO display normal levels of developing neurons in the brain, whereas translational MO injected fish display fewer developing neurons in the cerebellum (ce), optic tectum (ot), and retina (ret). (C) At 120 hpf, fish injected with an auts2 splicing MO and stained with Alcian blue show a significant reduction in the distance between the Meckel (Mk) and ceratohyal cartilages (ch) (shown as a red line) compared to controls, indicating a reduced lower-jaw size. Panels (A, B) adapted from [17], (C) adapted from [26].
  • 52. of this gene. To understand AUTS2 function better, a conditional knockout mouse should be developed. Given the observation that non-coding regions within AUTS2 have been implicated in human evolution (Box 2) and disease, the regulatory landscape around AUTS2 was investigated [17]. Twenty-three enhancers were identified in zebrafish, 10 of which are active in the brain. Three mouse brain enhancers were found to overlap a purely non-coding ASD-associated deletion, and four different mouse enhancers (two of which were positive in the brain) were found to reside in regions implicated in human evolution, supporting the idea that this gene is tightly regulated, and that enhancers for this gene are important for health and evolution [17]. The enhancers described are potentially only a subset of the AUTS2 regulatory landscape, and it is possible that some of these enhancers regulate other genes, including those in the WBS critical region. Although the precise function of AUTS2 remains to be elucidated, current reports show it to be a crucial and tightly regulated gene involved in neurodevelopment.

AUTS2 gene pathways

A 2010 study used radiation hybrid genotyping data to test for interaction of 99% of all possible gene pairs across the mammalian genome [53]. AUTS2 was the known gene with the greatest number of edges, or connectivity [53]. Despite that finding, little is known about the genetic pathways in which AUTS2 is involved. However, a few articles have provided evidence linking AUTS2 to other proteins and pathways. One potential pathway was revealed by examining genes that can oscillate expression during somitogenesis. Two papers found that the expression of AUTS2 oscillates in phase with other notch pathway genes, suggesting that it is a component of the notch signaling pathway [54,55]. Notch signaling has been shown to be involved in neuronal migration through its interaction with Reelin, a gene implicated in ASD and a target of Tbr1 [56,57].
Although not reaching significance, a group found that Auts2 has a 1.33-fold change in cerebellar gene expression in methyl CpG binding protein 2 (Mecp2)-null mice. Loss of MECP2 function can cause neurodevelopmental disorders including Rett syndrome and autism [58]. The authors also compared their data with data generated from other gene expression studies. They found that Auts2 is consistently altered in both their datasets, as well as in post-mortem Rett syndrome patient brain, and is mutated in fibroblasts and lymphocytes [58]. Starting at mouse embryonic (E) day 12, Auts2 mRNA is expressed in the cortical preplate, where it colocalizes with Tbr1, a transcription factor that exerts positive and negative control of regional and laminar identity in postmitotic neurons [12,59]. Using Tbr1 antibodies for chromatin immunoprecipitation (ChIP) of E14.5 cortex, it was shown that the Auts2 promoter is a direct transcriptional target of Tbr1 in the developing neocortex and is involved in frontal identity [59]. SATB homeobox 2 (Satb2) is one of four genes (including Tbr1) that regulate projection identity within the layers of the mammalian cortex. In 2012 a report showed that, in mice, Tbr1 expression is dually regulated by Satb2 and B cell CLL/lymphoma 11B (Ctip2) in cortical layers 2–5. The authors also demonstrated that Satb2 regulates Auts2. They showed that, similarly to Tbr1, Auts2 is expressed in the deep and upper layers of the cortex. They investigated whether the loss of Tbr1 expression in the upper layer neurons in Satb2 mutants coincides with changes in Auts2 expression. They observed that there was a significant loss of Auts2 expression in the upper layers of Satb2 mutants, similar to the loss of Tbr1 in Satb2 mutants. The authors did not observe any changes in Auts2 expression in layers 5 or 6. Their results suggest that Satb2 regulates the expression of Tbr1, which in turn regulates Auts2 expression in callosal projection neurons [60].
GTF2I repeat domain containing 1 (GTF2IRD1) is one of 26 genes deleted in WBS, and encodes a putative transcription factor expressed throughout the brain during development. Gtf2ird1 knockout mice display reduced innate fear and increased sociability, phenotypes consistent with WBS [61]. Microarray screens were used to find transcriptional targets of Gtf2ird1 in brain tissue from Gtf2ird1 knockout mice at two timepoints – E15.5 and birth [postnatal (P) day 0] – versus wild type littermates. Auts2 was one of only two genes identified in both (E15.5 and P0) microarray experiments to be altered compared to controls. In P0 mouse brains of knockout mice, Auts2 was increased by 1.3-fold, whereas in E15.5 embryos it was decreased by 1.5-fold [62].

Table 2. auts2 morpholino knockdown phenotypes
Assay following morpholino injection (a) | Developmental phenotype | Refs
Wholemount | Overall stunted development, including smaller head and eyes (Figure 3A). Less movement when prodded. | [17]
Wholemount | Microcephaly with no overall developmental delay. | [26]
Alcian blue staining | Micrognathia (undersized jaw) and retrognathia (receded jaw) (Figure 3C). | [26]
HuC–GFP zebrafish line | Fewer developing neurons in the dorsal region of the midbrain, including the optic tectum, the midbrain-hindbrain boundary (including the cerebellum), the hindbrain and the retina [17] (Figure 3B). | [17]
HuC/D staining | Reduction in HuC/D-positive postmitotic neurons as well as a loss of bilateral symmetry. | [26]
TUNEL staining | Increased apoptosis in the midbrain. | [17]
PCNA staining | Increased cell proliferation in the forebrain, midbrain and hindbrain. | [17]
Phosphohistone H3 | Decreased cell proliferation in the brain. | [26]
Tg(mnx1:GFP) zebrafish line | Fewer motor neuron cell bodies in the spinal cord and weaker, improperly angled projections. | [17]
HNK-1 staining | Fewer sensory neurons in the spinal cord. | [17]
(a) HNK-1, neural cell adhesion molecule 1/Ncam1 (CD57); HuC/D, Hu antigen C/D [ELAV (embryonic lethal, abnormal vision, Drosophila)-like 3/4]; mnx1, motor neuron and pancreas homeobox 1; PCNA, proliferating cell nuclear antigen; Tg, transgenic; TUNEL, terminal deoxynucleotidyl transferase dUTP nick end-labeling.

It is unclear if Auts2 is a target of
  • 53. Gtf2ird1 or if this observation reflects the proximity of the two genes. Zinc finger matrin-type 3 (Zmat3, also known as Wig1), a transcription factor regulated by p53, plays an important role in RNA protection and stabilization and, as part of the p53 pathway, is a causal factor in neurodegenerative diseases. Wig1 downregulation by antisense oligonucleotide treatment led to a significant reduction in Auts2 mRNA levels in the brains of BACHD (bacterial artificial chromosome – HD) mice, a mouse model for Huntington’s disease (HD). The authors also reported a trend in reduction of Auts2 mRNA levels in the livers of BALB/c mice but no reduction in Auts2 levels in FVB (background strain of BACHD) mouse brains [63]. These results suggest a role for Wig1 in the regulation of Auts2 expression and further link Auts2 with pathways involved in the CNS. Polycomb repressive complex 1 (PRC1) is a polycomb group (PcG) gene which acts as a developmental regulator through transcriptional repression. It is crucial for many biological processes in mammals, including differentiation. There are six major groups of PRC1 complexes, each containing a distinct polycomb group ring finger 1 (PCGF) subunit (PCGF1–6), a RING1 A/B ubiquitin ligase, and unique associated polypeptides. Using tandem affinity purification of PCGF3 and PCGF5, AUTS2 was recovered, implying a role for AUTS2 in transcriptional repression during development [64]. In 2013, the regulatory pathway for SEMA5A (semaphorin 5A), an autism candidate gene, was mapped in silico using expression quantitative trait locus (eQTL) mapping. The authors found that the SEMA5A regulatory network significantly overlaps with rare CNVs around ASD-associated genes, including AUTS2. Given the extensive trans-regulatory network associated with SEMA5A, the authors also investigated the possibility that there are several upstream master regulators that control this network.
Performing eQTL mapping for expression levels of the eQTL-associated genes within the network (eQTLs of the eQTLs of SEMA5A), the authors identified 12 regions associated with the expression of 10 or more primary SEMA5A eQTL genes, including AUTS2. This study suggests that AUTS2 is involved, and may be a master regulator in ASD-related pathways [65].

Concluding remarks

As we identify the genes involved in ASD, DD, and ID, our ability to genetically diagnose these disorders improves, and future screens should assess AUTS2 for potential causative CNVs. However, before we are able to use AUTS2 as a diagnostic tool we must determine what makes a CNV in or around AUTS2 causative or benign and for what disorders (e.g., ID, DD, ASD, ASD with ID/DD, etc.). This includes a deeper investigation of the regulatory network of this gene. Although not in immediate sight, a major step in developing future ASD and ASD-related phenotype treatments relies on a solid understanding of the pathways involved and how they interact. Multiple reports have implicated AUTS2 in addiction and other neurological phenotypes, but the mechanism and certainty of these involvements remain unclear, highlighting the need for deeper investigations into the function of this gene and its role in development and disease. Future work using an Auts2 mouse knockout should reveal greater detail of the function of this gene. In addition, genomic studies such as RNA-seq following the knockdown of this gene and chromatin immunoprecipitation followed by deep sequencing (ChIP-seq) could identify the various gene pathways and regions of the genome with which this gene interacts. Obtaining a better understanding of the pathways associated with AUTS2 will allow us to comprehend better the biological systems that can be perturbed when the function of this gene is disrupted, as well as how nucleotide changes within the gene might have led to human-specific traits.
In summary, we can presume that this gene is involved in neurodevelopment, and may play a role in ASD and ASD-related phenotypes. There are also significant data suggesting that AUTS2 has human-specific variants that could possibly contribute to human cognition. It is important to differentiate the evolution and phenotypic data surrounding this gene. The data suggest that genes involved in human-specific cognition may also play a role in human-specific disorders of the brain.

Acknowledgments

We would like to thank Christelle Golzio, Nicholas Katsanis, and Erik A. Sistermans for sharing their work on auts2 including their morpholino results used in Figure 3C. We would also like to thank members of the Ahituv lab for helpful comments. N.A. and N.O. received support for this research from the Simons Foundation (SFARI grant 256769 to N.A.), National Human Genome Research Institute (NHGRI) grant number

Box 2. AUTS2 and human evolution

In 2006 a comparative genomics approach was used to search the human genome for regions that have significantly changed in humans in the past 5 million years, since the divergence from chimpanzees, but are highly conserved in other species [66,79]. The authors identified 202 such regions, which they termed human accelerated regions (HARs). These HARs are strong candidates for sequences responsible for the evolution of human-specific traits. An intronic region in AUTS2 (Figure 2) ranked as the 31st most accelerated region in their study. Similarly, in 2006 a different group combed the genome for conserved non-coding sequences in the human lineage that displayed accelerated evolution [67]. The authors identified 902 human accelerated conserved non-coding sequences (HACNSs). HACNSs 174 and 369 both lie within introns of AUTS2 (Figure 2).
With the publication of the draft sequence of the Neanderthal genome in 2011, it was found that the first half of AUTS2 displayed the strongest statistical signal in a genomic screen differentiating modern humans from Neanderthals (Figure 2) [68]. This region contains 293 consecutive SNPs where only ancestral alleles were observed in the Neanderthals, only two of which are coding variants [a G to C non-synonymous substitution at chr7:68,702,743 (hg18) only in the Han Chinese and a C to T synonymous change at chr7:68,702,866 (hg18) within the Yoruba and Melanesian populations]. Other regions that were found to have the most significant human-Neanderthal changes also include genes that are involved in cognition and social interaction, including dual-specificity tyrosine-(Y)-phosphorylation regulated kinase 1A (DYRK1A), neuregulin 3 (NRG3) and Ca2+ -dependent secretion activator 2 (CADPS2) [68]. The authors conclude that multiple genes involved in cognitive development were positively selected during the evolution of modern humans [68]. Taken together, these studies suggest that significant changes in AUTS2 occurred specifically in modern humans and it is conceivable, based on the neurological role that this gene plays, that these changes could lead to cognitive traits specific to humans. Review Trends in Genetics October 2013, Vol. 29, No. 10 606
  • 54. R01HG005058, National Institute of Child Health and Human Development (NICHD) grant number R01HD059862, and National Institute of Neurological Disorders and Stroke (NINDS) grant number R01NS079231. N.O. is also supported in part by a Dennis Weatherstone pre-doctoral fellowship from Autism Speaks.
References
1 Fleischhacker, W.W. and Brooks, D.J. (2006) Neurodevelopmental Disorders, Springer
2 Baio, J. et al. (2012) Prevalence of autism spectrum disorders – autism and developmental disabilities monitoring network, 14 sites, United States, 2008. MMWR Surveill. Summ. 61, 1–19
3 Risch, N. et al. (1999) A genomic screen of autism: evidence for a multilocus etiology. Am. J. Hum. Genet. 65, 493–507
4 Geschwind, D.H. (2009) Advances in autism. Annu. Rev. Med. 60, 367–380
5 Abrahams, B.S. and Geschwind, D.H. (2008) Advances in autism genetics: on the threshold of a new neurobiology. Nat. Rev. Genet. 9, 341–355
6 Sultana, R. et al. (2002) Identification of a novel gene on chromosome 7q11.2 interrupted by a translocation breakpoint in a pair of autistic twins. Genomics 80, 129–134
7 Martens, M.A. et al. (2008) Research review: Williams syndrome: a critical review of the cognitive, behavioral, and neuroanatomical phenotype. J. Child Psychol. Psychiatry 49, 576–608
8 Kalscheuer, V.M. et al. (2007) Mutations in autism susceptibility candidate 2 (AUTS2) in patients with mental retardation. Hum. Genet. 121, 501–509
9 Poeck, B. et al. (2008) Locomotor control by the central complex in Drosophila – an analysis of the tay bridge mutant. Dev. Neurobiol. 68, 1046–1058
10 Salichs, E. et al. (2009) Genome-wide analysis of histidine repeats reveals their role in the localization of human proteins to the nuclear speckles compartment. PLoS Genet. 5, e1000397
11 Lamond, A.I. and Spector, D.L. (2003) Nuclear speckles: a model for nuclear organelles. Nat. Rev. Mol. Cell Biol. 4, 605–612
12 Bedogni, F. et al. (2010) Autism susceptibility candidate 2 (Auts2) encodes a nuclear protein expressed in developing brain regions implicated in autism neuropathology. Gene Expr. Patterns 10, 9–15
13 Lepagnol-Bestel, A-M. et al. (2008) SLC25A12 expression is associated with neurite outgrowth and is upregulated in the prefrontal cortex of autistic subjects. Mol. Psychiatry 13, 385–397
14 Zhang, Y.E. et al. (2011) Accelerated recruitment of new brain development genes into the human genome. PLoS Biol. 9, e1001179
15 Wagner, A.H. et al. (2013) Exon-level expression profiling of ocular tissues. Exp. Eye Res. 111, 105–111
16 Ameur, A. et al. (2011) Total RNA sequencing reveals nascent transcription and widespread co-transcriptional splicing in the human brain. Nat. Struct. Mol. Biol. 18, 1435–1440
17 Oksenberg, N. et al. (2013) Function and regulation of AUTS2, a gene implicated in autism and human evolution. PLoS Genet. 9, e1003221
18 Pinto, D. et al. (2010) Functional impact of global rare copy number variation in autism spectrum disorders. Nature 466, 368–372
19 Bakkaloglu, B. et al. (2008) Molecular cytogenetic analysis and resequencing of contactin associated protein-like 2 in autism spectrum disorders. Am. J. Hum. Genet. 82, 165–173
20 Huang, X-L. et al. (2010) A de novo balanced translocation breakpoint truncating the autism susceptibility candidate 2 (AUTS2) gene in a patient with autism. Am. J. Med. Genet. A 152A, 2112–2114
21 Glessner, J.T. et al. (2009) Autism genome-wide copy number variation reveals ubiquitin and neuronal genes. Nature 459, 569–573
22 Ben-David, E. et al. (2011) Identification of a functional rare variant in autism using genome-wide screen for monoallelic expression. Hum. Mol. Genet. 20, 3632–3641
23 Girirajan, S. et al. (2011) Relative burden of large CNVs on a range of neurodevelopmental phenotypes. PLoS Genet. 7, e1002334
24 Talkowski, M.E. et al. (2012) Sequencing chromosomal abnormalities reveals neurodevelopmental loci that confer risk across diagnostic boundaries. Cell 149, 525–537
25 Nagamani, S.C.S. et al. (2013) Detection of copy-number variation in AUTS2 gene by targeted exonic array CGH in patients with developmental delay and autistic spectrum disorders. Eur. J. Hum. Genet. 21, 1–4
26 Beunders, G. et al. (2013) Exonic deletions in AUTS2 cause a syndromic form of intellectual disability and suggest a critical role for the C Terminus. Am. J. Hum. Genet. 92, 210–220
27 Girirajan, S. et al. (2013) Global increases in both common and rare copy number load associated with autism. Hum. Mol. Genet. 22, 2870–2880
28 Cuscó, I. et al. (2009) Autism-specific copy number variants further implicate the phosphatidylinositol signaling pathway and the glutamatergic synapse in the etiology of the disorder. Hum. Mol. Genet. 18, 1795–1804
29 Tropeano, M. et al. (2013) Male-biased autosomal effect of 16p13.11 copy number variation in neurodevelopmental disorders. PLoS ONE 8, e61365
30 Jolley, A. et al. (2013) De novo intragenic deletion of the autism susceptibility candidate 2 (AUTS2) gene in a patient with developmental delay: a case report and literature review. Am. J. Med. Genet. A 161, 1508–1512
31 Redon, R. et al. (2006) Global variation in copy number in the human genome. Nature 444, 444–454
32 O’Roak, B.J. et al. (2012) Sporadic autism exomes reveal a highly interconnected protein network of de novo mutations. Nature 485, 246–250
33 Sanders, S.J. et al. (2012) De novo mutations revealed by whole-exome sequencing are strongly associated with autism. Nature 485, 237–241
34 O’Roak, B.J. et al. (2011) Exome sequencing in sporadic autism spectrum disorders identifies severe de novo mutations. Nat. Genet. 43, 585–589
35 Chahrour, M.H. et al. (2012) Whole-exome sequencing and homozygosity analysis implicate depolarization-regulated neuronal genes in autism. PLoS Genet. 8, e1002635
36 Flatscher-Bader, T. et al. (2011) Increased de novo copy number variants in the offspring of older males. Transl. Psychiatry 1, e34
37 Arlt, M. and Ozdemir, A. (2011) Hydroxyurea induces de novo copy number variants in human cells. Proc. Natl. Acad. Sci. U.S.A. 108, 17360–17365
38 Arlt, M.F. et al. (2012) De novo CNV formation in mouse embryonic stem cells occurs in the absence of Xrcc4-dependent nonhomologous end joining. PLoS Genet. 8, e1002981
39 Sebat, J. et al. (2007) Strong association of de novo copy number mutations with autism. Science 316, 445–449
40 Wong, C. et al. (2013) Methylomic analysis of monozygotic twins discordant for autism spectrum disorder and related behavioural traits. Mol. Psychiatry
41 Philibert, R.A. et al. (2007) Transcriptional profiling of subjects from the Iowa adoption studies. Am. J. Med. Genet. B: Neuropsychiatr. Genet. 144B, 683–690
42 Schumann, G. et al. (2011) Genome-wide association and genetic functional studies identify autism susceptibility candidate 2 gene (AUTS2) in the regulation of alcohol consumption. Proc. Natl. Acad. Sci. U.S.A. 108, 7119–7124
43 Liao, D. et al. (2011) Comparative gene expression profiling analysis of lymphoblastoid cells reveals neuron-specific enolase gene (ENO2) as a susceptibility gene of heroin dependence. Addict. Biol. http://
44 Chen, Y-H. et al. (2013) Genetic analysis of AUTS2 as a susceptibility gene of heroin dependence. Drug Alcohol Depend. 128, 238–242
45 Hamshere, M.L. et al. (2009) Genetic utility of broadly defined bipolar schizoaffective disorder as a diagnostic concept. Br. J. Psychiatry 195, 23–29
46 Hattori, E. et al. (2009) Preliminary genome-wide association study of bipolar disorder in the Japanese population. Am. J. Med. Genet. B: Neuropsychiatr. Genet. 150B, 1110–1117
47 Lee, H. et al. (2012) A genome-wide association study of seasonal pattern mania identifies NF1A as a possible susceptibility gene for bipolar disorder. J. Affect. Disord. 145, 200–207
48 Mefford, H.C. et al. (2010) Genome-wide copy number variation in epilepsy: novel susceptibility loci in idiopathic generalized and focal epilepsies. PLoS Genet. 6, e1000962
49 Elia, J. et al. (2010) Rare structural variants found in attention-deficit hyperactivity disorder are preferentially associated with neurodevelopmental genes. Mol. Psychiatry 15, 637–646
50 Luciano, M. et al. (2011) Whole genome association scan for genetic polymorphisms influencing information processing speed. Biol. Psychol. 86, 193–202
  • 55. 51 Chojnicka, I. et al. (2013) Possible association between suicide committed under influence of ethanol and a variant in the AUTS2 gene. PLoS ONE 8, e57199
52 Gilbert, S.F. (2000) Developmental Biology (6th edn), Sinauer Associates
53 Lin, A. et al. (2010) A genome-wide map of human genetic interactions inferred from radiation hybrid genotypes. Genome Res. 20, 1122–1132
54 William, D. et al. (2007) Identification of oscillatory genes in somitogenesis from functional genomic analysis of a human mesenchymal stem cell model. Dev. Biol. 305, 172–186
55 Dequéant, M-L. et al. (2006) A complex oscillating network of signaling genes underlies the mouse segmentation clock. Science 314, 1595–1598
56 Hashimoto-Torii, K. et al. (2008) Interaction between Reelin and Notch signaling regulates neuronal migration in the cerebral cortex. Neuron 60, 273–284
57 Wang, G-S. et al. (2004) Transcriptional modification by a CASK-interacting nucleosome assembly protein. Neuron 42, 113–128
58 Ben-Shachar, S. et al. (2009) Mouse models of MeCP2 disorders share gene expression changes in the cerebellum and hypothalamus. Hum. Mol. Genet. 18, 2431–2442
59 Bedogni, F. et al. (2010) Tbr1 regulates regional and laminar identity of postmitotic neurons in developing neocortex. Proc. Natl. Acad. Sci. U.S.A. 107, 13129–13134
60 Srinivasan, K. et al. (2012) A network of genetic repression and derepression specifies projection fates in the developing neocortex. Proc. Natl. Acad. Sci. U.S.A. 109, 19071–19078
61 Young, E.J. et al. (2008) Reduced fear and aggression and altered serotonin metabolism in Gtf2ird1-targeted mice. Genes Brain Behav. 7, 224–234
62 O’Leary, J. and Osborne, L.R. (2011) Global analysis of gene expression in the developing brain of Gtf2ird1 knockout mice. PLoS ONE 6, e23868
63 Sedaghat, Y. et al. (2012) Genomic analysis of wig-1 pathways. PLoS ONE 7, e29429
64 Gao, Z. et al. (2012) PCGF homologs, CBX proteins, and RYBP define functionally distinct PRC1 family complexes. Mol. Cell 45, 344–356
65 Cheng, Y. et al. (2013) An eQTL mapping approach reveals that rare variants in the SEMA5A regulatory network impact autism risk. Hum. Mol. Genet. 22, 2960–2972
66 Pollard, K.S. et al. (2006) Forces shaping the fastest evolving regions in the human genome. PLoS Genet. 2, e168
67 Prabhakar, S. et al. (2006) Accelerated evolution of conserved noncoding sequences in humans. Science 314, 786
68 Green, R.E. et al. (2010) A draft sequence of the Neandertal genome. Science 328, 710–722
69 Gruszka-Westwood, A.M. et al. (2004) Comparative expressed sequence hybridization studies of high-hyperdiploid childhood acute lymphoblastic leukemia. Genes Chromosomes Cancer 41, 191–202
70 Kawamata, N. et al. (2008) Cloning of genes involved in chromosomal translocations by high-resolution single nucleotide polymorphism genomic microarray. Proc. Natl. Acad. Sci. U.S.A. 105, 11921–11926
71 Coyaud, E. et al. (2010) PAX5–AUTS2 fusion resulting from t(7;9)(q11.2;p13.2) can now be classified as recurrent in B cell acute lymphoblastic leukemia. Leuk. Res. 34, e323–e325
72 Denk, D. et al. (2012) PAX5-AUTS2: a recurrent fusion gene in childhood B-cell precursor acute lymphoblastic leukemia. Leuk. Res. 36, e178–e181
73 Lener, T. et al. (2006) Expression profiling of aging in the human skin. Exp. Gerontol. 41, 387–397
74 Weir, B. et al. (2007) Characterizing the cancer genome in lung adenocarcinoma. Nature 450, 893–898
75 Penney, K.L. et al. (2010) Genome-wide association study of prostate cancer mortality. Cancer Epidemiol. Biomarkers Prev. 19, 2869–2876
76 Sato, S. et al. (2011) Characterization of porcine autism susceptibility candidate 2 as a candidate gene for the number of corpora lutea in pigs. Anim. Reprod. Sci. 126, 211–220
77 Li, R. et al. (2012) Six novel susceptibility loci for early-onset androgenetic alopecia and their unexpected association with common diseases. PLoS Genet. 8, e1002746
78 Stadler, Z.K. et al. (2012) Rare de novo germline copy-number variation in testicular cancer. Am. J. Hum. Genet. 91, 379–383
79 Pollard, K.S. et al. (2006) An RNA gene expressed during cortical development evolved rapidly in humans. Nature 443, 167–172