Trends in Genetics, October 2013



  1. Editor: Rhiannon Macrae
Portfolio Manager: Milka Kostic
Journal Manager: Basil Nyaku
Journal Administrators: Ria Otten and Patrick Scheffmann

Advisory Editorial Board
K.V. Anderson, New York, USA
A. Clark, Ithaca, USA
G. Fink, Cambridge, USA
S. Gasser, Geneva, Switzerland
D. Goldstein, Durham, USA
L. Guarente, Cambridge, USA
Y. Hayashizaki, Yokohama, Japan
S. Henikoff, Seattle, USA
J. Hodgkin, Oxford, UK
H.R. Horvitz, Cambridge, USA
L. Hurst, Bath, UK
E. Koonin, Bethesda, USA
E. Meyerowitz, Pasadena, USA
S. Moreno, Salamanca, Spain
A. Nieto, Alicante, Spain
C. Scazzocchio, Orsay, France and London, UK
D. Tautz, Plön, Germany
O. Voinnet, Strasburg, France
J. Wysocka, Stanford, California

Editorial Enquiries
Trends in Genetics, Cell Press, 600 Technology Square, 5th floor, Cambridge MA 02139, USA
Tel: +1 617 397 2818
Fax: +1 617 397 2810

Cover: In this special issue of Trends in Genetics, we turn the lens on ourselves. The articles this month focus on human genetics, with topics ranging from resources and methods to make the most of the explosion of sequencing data to evolutionary questions about mutation rates and how selection acts through pregnancy. Cover image: iStockKameleonMedia.

October 2013, Volume 29, Number 10, pp. 555–608
Special Issue: Human Genetics

Editorial
555 Inherited uncertainty. Rhiannon Macrae

Science & Society
556 Genome sequencing for healthy individuals. Saskia C. Sanderson

Spotlight
559 LongevityMap: a database of human genetic variants associated with longevity. Arie Budovsky, Thomas Craig, Jingwei Wang, Robi Tacutu, Attila Csordas, Joana Lourenço, Vadim E. Fraifeld, and João Pedro de Magalhães

Opinions
561 The role of gene conversion in preserving rearrangement hotspots in the human genome. Jeffrey A. Fawcett and Hideki Innan
569 Human housekeeping genes, revisited. Eli Eisenberg and Erez Y. Levanon

Reviews / Feature Review
575 Properties and rates of germline mutations in humans. Catarina D. Campbell and Evan E. Eichler
585 Many ways to die, one way to arrive: how selection acts through pregnancy. Elizabeth A. Brown, Maryellen Ruvolo, and Pardis C. Sabeti
593 Finding the lost treasures in exome sequencing data. David C. Samuels, Leng Han, Jiang Li, Sheng Quanghu, Travis A. Clark, Yu Shyr, and Yan Guo
600 The role of AUTS2 in neurodevelopment and human evolution. Nir Oksenberg and Nadav Ahituv
  2. Editorial
Inherited uncertainty
Rhiannon Macrae
Corresponding author: Macrae, R.

My college physics textbook contained an anecdote about a physics professor who used to joke that instead of giving a seminar as part of their thesis defense, students should instead demonstrate their faith in physical principles by walking over a bed of hot coals. The trick is to get your feet wet first (hence, many people walk across dewy grass before stepping on to the coals), and the moisture will create an insulating vapor barrier through a phenomenon called the Leidenfrost effect, protecting your bare skin from the heat of the coals. If walking across hot coals is the ultimate test of a physicist’s faith in the laws of the universe, the equivalent for a geneticist is having a baby (Figure 1).

Although it was not until Gregor Mendel presented his work in 1865 that inheritance was formally quantitated, humans innately understood the concept of heredity well before then. Perhaps the most pervasive evidence of this comes from breeding programs dating back to prehistoric times, in which animals or plants with desirable traits were selectively bred. Plato wrote about extending these ideas to humans, and history is full of examples of known familial diseases, such as hemophilia. The development of molecular genetics transformed these observations into a mechanistic understanding of the hereditary material, and now with the advent of genomic technologies, a full picture of inheritance is beginning to emerge. Efforts are underway to identify the genetic changes underlying every known Mendelian disorder, and much work has been done to demonstrate associations between genetic variants and human traits (e.g., the GIANT consortium). It is easy to see in these systematic approaches a future of predictable genetic outcomes.

The reality of the uncertainty in what lies in an individual’s DNA, however, announces itself along with the news of pregnancy. Although prenatal genetic screening is now routinely offered for some diseases, such as cystic fibrosis carrier testing or trisomy screening, thousands of known causal variants go untested, despite the feasibility of noninvasive fetal genome sequencing. Even with this new technology, the unknown variants and the dreaded ‘variants of unknown significance’ continue to pose challenges to our understanding of the genotype–phenotype relation. I suspect most expecting parents do not phrase their fears in those terms, but I would venture that most if not all are hoping not so much for a boy or a girl, but for a healthy baby.

Luckily for the parents (and the human race), this wish is often granted, allowing parents to refocus all their energy on raising their healthy baby, arriving at another classic debate in genetics – nature versus nurture. For indeed, your DNA is not your fate. Our prehistoric ancestors knew that even crops planted from the hardiest and most productive parents would fail in a drought. A catalog of all the disease-associated variants in the human genome would still only provide probabilities of outcomes in many cases, and it is difficult to imagine an algorithm sophisticated enough to consider all of the gene × environment interactions that could influence those probabilities. Add in epigenetics, and it begins to feel as though we know less about inheritance than Mendel did.

Nevertheless, we continue to put our faith in the processes that guide evolution and bring new lives into the world. It would be nice if there was a simple trick to ensure success, but for all the advice new parents receive, there is no equivalent to the suggestion to get your feet wet before walking across hot coals. Physicists are currently exploring the limits of the universe, but geneticists are still expanding the limits of what is knowable. In this Special Issue on human genetics, authors tackle this question from a variety of angles, from describing resources and methods for probing the human genome to discussing how evolution has shaped our species.

As we go to press, my husband and I will be completing the 9-month pilot phase of our own human genetics project. Preliminary data indicate that it’s a healthy girl.

Figure 1. An ultrasound image at 12 weeks of pregnancy. Courtesy of Wolfgang Moroder.

0168-9525/$ – see front matter © 2013 Elsevier Ltd. All rights reserved. Trends in Genetics, October 2013, Vol. 29, No. 10 555
  3. Science & Society
Genome sequencing for healthy individuals
Saskia C. Sanderson
Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
Corresponding author: Sanderson, S.C.

Genome sequencing of healthy individuals has the potential to lead to improved well-being and disease prevention, but numerous challenges remain that must be addressed to realize these benefits and, importantly, these benefits must be equitable across society.

Sequencing people, not only patients

Over the past few years, several seemingly healthy individuals have had their genomes sequenced, analyzed, and published in peer-reviewed scientific journals. These include scientist Mike Snyder at Stanford [1], eight other individuals at Stanford [2], and participants in the Personal Genome Project at Harvard [3]. There is considerable hope that whole-genome sequencing (WGS) in healthy individuals will lead to great advances in disease prevention and improved well-being [4]. However, numerous challenges and concerns exist, including the costs of analyzing and interpreting WGS data as well as the potential for adverse outcomes such as confusion, anxiety, inappropriate referrals, and overutilization of health services [5–7]. Although more research is required to evaluate these pros and cons, if implemented fairly there is a great potential for WGS to improve the lives of people regardless of whether or not they currently appear healthy.

The promise: improved health and well-being

Sequencing the first human genome took 15 years and $3 billion. Today, a human genome can be sequenced for ~$3000 in a few days, and costs are expected to continue to fall. Although WGS is currently used primarily for clinical diagnostic and research purposes, WGS in seemingly healthy individuals has the promise to empower them to take greater control of their lives, and to take action to prevent diseases earlier and more effectively.

In the future, WGS may provide healthy individuals with carrier information relevant to reproductive decision-making and pharmacogenomic information to inform drug prescribing and dosage. It may also identify people who appear healthy – but who have rare variants that greatly increase their risk of cancer or a cardiac event [8], or combinations of common variants that modestly increase their risk of common, complex diseases such as type 2 diabetes [2] or psychiatric conditions such as bipolar disorder. This may enable doctors to intervene with medications or procedures, and/or motivate individuals to make risk-reducing changes themselves, such as losing weight, quitting smoking, reducing stress, improving medication adherence, or increasing screening. There is significant commercial as well as academic and public health interest in capitalizing on these potential advantages.

The challenges along the way

There are also significant challenges to applying WGS in the context of healthy individuals. WGS for a healthy individual is an open-ended investigation: the sheer volume of data that could potentially be informative is currently overwhelming [9]. The nature of the data challenges current notions of what can be guaranteed regarding confidentiality and privacy [10]. Other policy aspects, such as those related to discrimination and insurance [7], as well as logistical issues including storage of such vast amounts of data [5] and access within electronic healthcare records [4], must also be considered.

The volume of data produced poses particular challenges regarding analysis and interpretation [7]. Today, it takes many person-hours to curate, analyze and interpret the thousands of variants arising from WGS that may be significant for a healthy individual. Vast amounts of work are involved in translating the raw data into comprehensive but easy-to-understand results that can confidently be communicated back to the individual. Although the ACMG provides guidelines regarding the return of incidental findings in clinical settings [11], deciding where to draw the line between known pathogenic and suspected pathogenic variants is a major barrier to rapidly interpreting WGS data for healthy individuals. It is likely to be some time before analysis and interpretation pipelines are fully automated and user interfaces enabling individuals to access results in meaningful ways are developed and widely adopted.

Ethical considerations, including the implications for family members [7], also pose important challenges for WGS for healthy individuals. Crucially, the question of the appropriate age at which to consider introducing WGS needs to be addressed. This was highlighted by the ACMG guidelines, which recommended returning incidental findings about specific, high-penetrance variants regardless of age [11], sparking considerable debate. The notion of children or adolescents having their genomes sequenced, particularly without an immediate clinical need, is ethically challenging and raises important questions around assent and consent. However, the value of waiting until adulthood before implementing WGS is also debatable.

In addition, healthcare providers are unprepared for the deluge of genomic data that WGS produces: they typically
  4. have minimal understanding of genomics and lack confidence in their ability to interpret genomic information for their patients. Some genomics education efforts for healthcare providers are underway, but more are urgently needed.

New models of consent and return of results are needed

As Biesecker emphasized, WGS ‘is a resource, not a test’ [12]. This is particularly true for healthy individuals. In the future, WGS results will not be offered at a single moment in time. Instead, the individual or clinician will interrogate the data in different ways over time depending on life-stage, circumstances, and evolving genomics knowledge. This has implications for consent and counseling because it poses a challenge to how informed consent is conceptualized. To make informed decisions about WGS, individuals should be helped to understand the potential risks, benefits, and uncertainties of WGS, and think fully through how potential results would make them think, feel, and act. However, this is virtually impossible when WGS results could pertain to any disease or trait in the world, and the interpretation of the results will continue to evolve with ongoing research. Patient expectations about the potential outcomes of WGS must be realistically set both during informed consent and via public education initiatives.

In addition to consent, models for the return of results will need to be modified. Traditional genetic counseling models involve hours of in-person education and support from already overstretched genetic counselors [5], which is clearly unsustainable in this new context. Novel multi-media approaches to patient education are needed to help patients make informed decisions about WGS [13], particularly when there is no primary phenotype of immediate concern. In addition, whether individual preferences regarding return of specific WGS results should be taken into account remains an open question. On the one hand, the ACMG suggests that it is impractical to incorporate patient preferences regarding incidental findings into the WGS process [11]. On the other, some investigators are already building novel, dynamic, multi-media tools to assess and incorporate patient preferences into WGS pipelines [13].

Will WGS affect behaviors and emotions?

Although early studies found little evidence that genetic risk information influenced individual health behaviors such as quitting smoking [14], these ‘proof-of-principle’ studies tested for single variants of low penetrance, and it is therefore not surprising that there was little impact upon individual perceptions of disease threat or subsequent motivation to change behavior, given the small effects on disease risk and the lack of objective clinical benefit that could be achieved from this knowledge. Our understanding of genomic influences on disease is rapidly increasing, however, and current investigations in which complex, multi-scale personal information about healthy individuals is generated based on WGS information integrated with multiple other ‘omics’ data [1,2] bear little resemblance to those early studies in which individuals were tested for one single-nucleotide polymorphism (SNP) or variant of similarly low penetrance [14], or selection of SNP-based risk scores.

Similarly, early studies did not find significant emotional impacts from personal genomic information [15]. However, again, these were not based on WGS, and there is far greater potential for WGS to produce unanticipated results that may be valued by one individual, but completely devastating to another. The potential for emotional harm from WGS should not be underestimated – nor should it be overstated. One trial funded by the US National Institutes of Health (NIH), the MedSeq Project (http://www.genome-…), is beginning to explore these issues. More evidence from randomized trials with larger samples of diverse populations is needed before conclusions about behavioral and emotional effects of WGS on healthy individuals can be drawn.

Given the limited evidence-base today, the loud skepticism regarding the potential for genomic information to succeed in motivating people to make health-protective behavioral changes where other efforts have failed is understandable. Behavior change is unquestionably hard, but this should propel us to continue exploring whether WGS together with other emerging self-monitoring and big data applications will help change behaviors. It is imperative that we do this in an ethically-responsible way that minimizes the potential for harms. The jury is still out, and the behavioral and emotional effects of personal WGS information remain to be seen.

Equitable access for all

Most healthy individuals who have had their genomes sequenced to date are early adopters, scientists experimenting on themselves, or people with the means and resources to obtain WGS through initiatives such as the Illumina Understand Your Genome conferences (http://…). This self-experimentation is valuable while pipelines are still being built and challenges regarding results communication are still being tackled. Simultaneous efforts are needed, however, to ensure that WGS does not contribute to the already wide health disparities across society. The declining costs of WGS will undoubtedly be pivotal, as will efforts already underway to broaden genomics research to include underrepresented populations. Furthermore, explicit efforts are needed to ensure that informed consent procedures are accessible and appropriate for people with lower literacy levels, patient education materials are developed that are accessible and understandable, results are communicated in ways that are easy to understand by people across a spectrum of educational attainment, and WGS is accessible to individuals from all walks of life, not only those with the greatest resources. Only then will the promise of WGS be truly realized.

Acknowledgments

I am deeply indebted to Barbara Biesecker, Robert Green, Muin Khoury, Eric Schadt, Jo Waller, and Ron Zimmern for their valuable feedback on an earlier draft of this article.

References

[1] Chen, R. et al. (2012) Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell 148, 1293–1307
[2] Patel, C.J. et al. (2013) Whole genome sequencing in support of wellness and health maintenance. Genome Med. 5, 58
  5. [3] Angrist, M. (2009) Eyes wide open: the personal genome project, citizen science and veracity in informed consent. Pers. Med. 6, 691–699
[4] Burn, J. (2013) Should we sequence everyone’s genome? Yes. BMJ 346, 3133
[5] Brunham, L.R. and Hayden, M.R. (2012) Whole-genome sequencing: the new standard of care? Science 336, 1112–1113
[6] Flinter, F. (2013) Should we sequence everyone’s genome? No. BMJ 346, 3132
[7] Ormond, K.E. et al. (2010) Challenges in the clinical application of whole-genome sequencing. Lancet 375, 1749–1751
[8] Evans, J.P. et al. (2013) We screen newborns, don’t we? Realizing the promise of public health genomics. Genet. Med. 15, 332–334
[9] Cassa, C.A. et al. (2012) Disclosing pathogenic genetic variants to research participants: quantifying an emerging ethical responsibility. Genome Res. 22, 421–428
[10] Schadt, E.E. (2012) The changing privacy landscape in the era of big data. Mol. Syst. Biol. 8, 612
[11] Green, R.C. et al. (2013) ACMG recommendations for reporting of incidental findings in clinical exome and genome sequencing. Genet. Med. 15, 565–574
[12] Biesecker, L.G. (2012) Opportunities and challenges for the integration of massively parallel genomic sequencing into clinical practice: lessons from the ClinSeq project. Genet. Med. 14, 393–398
[13] Yu, J.H. et al. (2013) Self-guided management of exome and whole-genome sequencing results: changing the results return model. Genet. Med.
[14] Marteau, T.M. et al. (2010) Effects of communicating DNA-based disease risk estimates on risk-reducing behaviours. Cochrane Database Syst. Rev. 10, CD007275
[15] Bloss, C.S. et al. (2011) Effect of direct-to-consumer genomewide profiling to assess disease risk. N. Engl. J. Med. 364, 524–534
  6. Spotlight
LongevityMap: a database of human genetic variants associated with longevity
Arie Budovsky1,2*, Thomas Craig3*, Jingwei Wang3*, Robi Tacutu3, Attila Csordas4, Joana Lourenço3, Vadim E. Fraifeld1, and João Pedro de Magalhães3*
1 The Shraga Segal Department of Microbiology, Immunology and Genetics, Center for Multidisciplinary Research on Aging, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel
2 Judea Regional Research and Development Center, Carmel 90404, Israel
3 Integrative Genomics of Ageing Group, Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, UK
4 European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
* These authors contributed equally to this work.
Corresponding author: de Magalhães, J.P.
Keywords: ageing; genetics; GWAS; humans; lifespan; polymorphisms.

Understanding the genetic basis of human longevity remains a challenge but could lead to life-extending interventions and better treatments for age-related diseases. Toward this end we developed the LongevityMap, the first database of genes, loci, and variants studied in the context of human longevity and healthy ageing. We describe here its content and interface, and discuss how it can help to unravel the genetics of human longevity.

Given the worldwide ageing of the population, studying the genetics of human longevity is of widespread importance [1,2]. Longevity is moderately heritable in humans (~25%), with increasing heritability with age [1], and exceptional longevity and healthy ageing in humans is an inherited phenotype [3]. Hundreds of longevity association studies have been performed in recent years and some genes associated with human longevity may be suitable targets for drug development [4]. Nonetheless, the heritability of human longevity remains largely unexplained in part due to the complexity of this phenotypic trait [1]. Thanks to advances in next-generation sequencing and genome-wide approaches, the capacity of longevity association studies is increasing. The growing amounts of data being generated also increase the complexity of the data analysis and the difficulty of placing findings in context of previous studies. We created the LongevityMap, the first catalogue of human genetic variants associated with longevity, to serve as a reference to help researchers navigate the rising tide of data related to human longevity.

The LongevityMap is a new addition to our already highly successful collection of online databases and tools on the biology and genetics of ageing, the Human Ageing Genomic Resources [5]. GenAge, our existing database of ageing-related genes, focuses mostly on genes modulating longevity in model organisms plus the few genes associated with human progeroid syndromes [5], and thus there is an unmet need for a database of human genetic variants associated with longevity. As such, we followed the high standards and rigorous procedures of GenAge to develop the LongevityMap. Briefly, all entries in the LongevityMap were manually curated from the literature. Studies were selected following an in-depth literature survey. The LongevityMap is an inclusive database in which both large and small studies are included; different types of study are featured, from cross-sectional studies to studies of extreme longevity (e.g., centenarians). However, studies focused on cohorts of unhealthy individuals at baseline, such as cancer patients, were excluded. Details on study design are provided for each entry, including a brief description of the type of study, population ethnicity, sample size, age of probands and controls, and any gender bias. Negative results are also integrated in the LongevityMap to provide visitors with as much information as possible regarding each gene, variant, and locus previously studied in the context of longevity.

Each entry refers to a specific observation from a study. This means that studies, and large-scale studies in particular, can have multiple entries in the LongevityMap, reflecting different results and observations. Each entry also includes a brief description of the major conclusions. Entries are flagged regarding whether results were statistically significant or not, though many studies have marginal or indicative results that require a brief explanation of the findings. Our policy concerning controversial and subjective results is to detail the facts concerning the controversy and let users form their own opinions. A link to the primary publication in PubMed is always included in each entry.

We developed an intuitive, user-friendly interface for the LongevityMap that allows users to query genes, variants (including by reference SNP ID number), studies, and cytogenetic locations (Figure 1A). Users can browse/filter the data by association (i.e., significant or non-significant), population, and chromosome. For each single nucleotide polymorphism (SNP) and gene, additional annotation was retrieved from the US National Center for Biotechnology Information (NCBI) databases dbSNP and RefSeq [6] to provide further information on
  7. genes associated with SNPs and gene function, respectively. Homologues in model organisms were obtained from the InParanoid database [7].

Links are widely implemented to allow users to identify quickly other entries related to a given study, gene, or variant. In fact, each gene in the LongevityMap has a gene-centric page that aggregates and condenses the information on the database taken from different studies. In addition, the LongevityMap is fully integrated with our other ageing-related databases to provide users with selected, relevant information. In particular, crosslinks to GenAge are included to indicate genes associated with progeroid syndromes and those with homologues in model organisms known to modulate ageing/longevity. If appropriate, links to other major databases, such as Ensembl, Swiss-Prot, dbSNP, HapMap, and NCBI Entrez, are included for each entry.

At time of writing, the LongevityMap includes data from 246 studies, featuring 751 different genes and 1987 variants (Figure 1B). Similarly to our other ageing-related databases, the LongevityMap is freely available online under a Creative Commons Attribution license. The full dataset is available for download and third-party use. It is our hope that the LongevityMap will serve as a novel database to help researchers decipher the genetics of human longevity.

Acknowledgements

The authors wish to thank Joana Costa, Daniel Wuttke, and Alex Freitas for helping to collate data and for comments and suggestions. This work was funded by a Wellcome Trust grant (ME050495MES) to J.P.M. This work was also funded in part by the European Union Framework Program (FP) 7 Health Research Grant number HEALTH-F4-2008-202047 (to V.E.F.) and the Israel Ministry of Science and Technology (to A.B.). J.P.M. is also grateful for support from the Ellison Medical Foundation and R.T. is supported by a Marie Curie Intra-European Fellowship within FP7.

References

[1] Christensen, K. et al. (2006) The quest for genetic determinants of human longevity: challenges and insights. Nat. Rev. Genet. 7, 436–448
[2] Chung, W.H. et al. (2010) The role of genetic variants in human longevity. Ageing Res. Rev. 9 (Suppl. 1), S67–S78
[3] Atzmon, G. et al. (2005) Biological evidence for inheritance of exceptional longevity. Mech. Ageing Dev. 126, 341–345
[4] de Magalhaes, J.P. et al. (2012) Genome–environment interactions that modulate aging: powerful targets for drug discovery. Pharmacol. Rev. 64, 88–101
[5] Tacutu, R. et al. (2013) Human ageing genomic resources: integrated databases and tools for the biology and genetics of ageing. Nucleic Acids Res. 41, D1027–D1033
[6] NCBI Resource Coordinators (2013) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 41, D8–D20
[7] Ostlund, G. et al. (2010) InParanoid 7: new algorithms and tools for eukaryotic orthology analysis. Nucleic Acids Res. 38, D196–D203

Figure 1. LongevityMap home page which showcases the design and layout of the website as well as its multiple search options and links (A); old couple picture by Jonel Hanopol. Types and amount of data in the LongevityMap (B).

(B)
Type of data                                          Number
Entries significantly associated with longevity       249
Entries not significantly associated with longevity   255
Total entries                                         504
Genes                                                 751
Variants                                              1987 (1832 with a refSNP number)
Studies                                               246
  8. 8. The role of gene conversion in preserving rearrangement hotspots in the human genome Jeffrey A. Fawcett and Hideki Innan Graduate University for Advanced Studies, Hayama, Kanagawa 240-0193, Japan Hotspots of non-allelic homologous recombination (NAHR) have a crucial role in creating genetic diversity and are also associated with dozens of genomic disor- ders. Recent studies suggest that many human NAHR hotspots have been preserved throughout the evolution of primates. NAHR hotspots are likely to remain active as long as the segmental duplications (SDs) promoting NAHR retain sufficient similarity. Here, we propose an evolutionary model of SDs that incorporates the effect of gene conversion and compare it with a null model that assumes SDs evolve independently without gene con- version. The gene conversion model predicts a much longer lifespan of NAHR hotspots compared with the null model. We show that the literature on copy number variants (CNVs) and genomic disorders, and also the results of additional analysis of CNVs, are all more consistent with the gene conversion model. Many rearrangement hotspots are shared across species Recombination is a major mutational mechanism that has a crucial role in producing genetic diversity. Because of its potential impact on important phenotypes, includ- ing diseases, much attention has been paid to recombina- tion, whether it is allelic or nonallelic [1,2]. To understand the interaction between recombination and phenotypes, it is important to know how different parts of the genome differ in the rate at which recombination occurs. Recent genome-wide surveys demonstrated that the distribution of the recombination rate across the genome is far from uniform. Instead, there are several hotspots where re- combination occurs at a much higher rate than in the rest of the genome [3,4]. This applies to both allelic and nonallelic recombination [5]. 
Given that these hotspots are especially important in producing genetic diversity, a good understanding of their characteristics should be extremely valuable. Evolutionary approaches provide a means to investigate how these hotspots arose and have been maintained throughout evolution, which might enable us to better pre- dict regions that affect the phenotype. A recent interesting finding is that most allelic recombination hotspots detected in the human genome do not exist in the chimpanzee genome, indicating a rapid turnover of hotspots [6,7]. This rapid turnover is at least partly because hotspots are largely determined by the fast-evolving PR domain-containing 9 (Prdm9) gene. This gene encodes a protein that contains several zinc finger domains and is able to bind motifs that are overrepresented in recombination hotspots [8]. Single mutations in Prdm9 or its binding motif can be sufficient to alter the recombination activity [9–11]. This means that hotspots are determined by human-specific factors, which ultimately raises the question of whether studying the genomes of other primate species would be useful in under- standing the role of recombination in shaping the pattern of genetic diversity in the human genome. The situation seems to be different for hotspots of nonallelic recombination, the major cause of genomic rear- rangements such as duplications, deletions, and inver- sions. Recent studies of CNVs in various primate species have shown that CNV hotspots are often shared across species, even between human and macaque [12–15]. This suggests that nonallelic recombination hotspots have a longer lifespan than do hotspots of allelic recombination. This is related to the key mechanism of nonallelic recom- bination, that is, NAHR. Highly similar homologous sequences, or segmental duplications (SDs), serve as sub- strates for NAHR, which causes the duplication or deletion of the intervening region (or inversion in the case of inverted SDs) (Figure 1A). 
Although nonallelic recombination pathways other than NAHR also have a large role in generating CNVs [16,17], it is thought that NAHR hotspots remain active for a longer period of time and are largely responsible for generating recurrent rearrangements. For the sake of clarity, here we define NAHR hotspots as SD pairs that are initiating recurrent NAHR. Therefore, each new duplication creates a new potential hotspot even if they occur in neighboring regions that could be considered as the same fragile region, sometimes making a complicated nested structure of multiple duplications. We also assume that a long (e.g., >200 bp) stretch of perfect identity shared between the SD pair is crucial for the maintenance of the hotspot. NAHR can sometimes occur even when the perfect match is short, and the rate may also be influenced by other factors (e.g., distance between the SDs or recombinogenic sequence motifs) [3,18,19].
Opinion 0168-9525/$ – see front matter © 2013 Elsevier Ltd. All rights reserved. Corresponding author: Innan, H. ( Keywords: gene conversion; non-allelic homologous recombination; rearrangement hotspot; segmental duplication; copy number variant. Trends in Genetics, October 2013, Vol. 29, No. 10 561
However, a long identical stretch is known to enhance greatly the efficiency of
NAHR [18,20,21], which is predicted to be crucial for repeatedly generating rearrangements over a long period of time. Thus, whereas allelic recombination hotspots are largely determined by the PRDM9 motif and a small number of mutations are sufficient to cause turnovers, NAHR hotspots will potentially remain active as long as the SD contains a subregion with sufficient similarity and length. Indeed, CNV hotspots are enriched for SDs [12,14], and it has been suggested that the long-term evolution of hotspots is determined by the birth-and-death process of matching pairs of SDs [22]. An important question then is how long is the expected lifespan of an individual NAHR hotspot. We consider two evolutionary models that give different predictions regarding the lifespan of hotspots. The first is the turnover model (Figure 1B), which assumes that SDs accumulate mutations independently. According to this model, the divergence between the SDs increases in proportion to time and the SDs lose their ability to initiate NAHR as they become too divergent. Consequently, the hotspots are subject to a rapid turnover, and new SDs must constantly arise for the genome to maintain a certain number of hotspots. Thus, the turnover model predicts that hotspots would be shared only among closely related species and not between distantly related species, as has been previously suggested [22]. In the case of primates, the model predicts that it would be unlikely for hotspots to remain active for more than 25 million years, or since the divergence of human and macaque (Box 1). Therefore, the turnover model might not be sufficient to explain recent findings where several CNV hotspots are shared between human and macaque [13,15].

A model incorporating gene conversion better explains the evolution of CNV hotspots
An alternative, which we propose here, is the gene conversion model (Figure 1C).
This model predicts the long-term preservation of hotspots and is supported both theoretically and empirically. The model takes into account the effect of gene conversion, a recombinational mechanism that can retard the divergence between SDs. Ongoing gene conversion results in the SDs maintaining high similarity for a long period of time. There is increasing evidence for gene conversion between SDs in various species, including humans [23,24]. It is easy to imagine that gene conversion would provide an ideal substrate for NAHR, as has been previously suggested [25]. The gene conversion model predicts that a larger number of older SDs would be associated with the current hotspots compared with the turnover model (Box 1). The potential role of gene conversion in preserving hotspots has been suggested by several case studies [25–28]. An extreme case is the polymorphic inversion on the human chromosome Xq28 region containing the filamin A (FLNA) and emerin (EMD) loci that is probably caused by NAHR between inverted duplicates. It was found that this pair of inverted duplicates is shared by various eutherian lineages and that these duplicates have recurrently caused inversions in independent lineages (at least ten times since the origin of eutherians) [27]. The sequence identity between the duplicates was found to be high in each species. Based on these observations, it was suggested that gene conversion has been homogenizing the duplicates, thus preserving the activity as a hotspot, for at least 100 million years. Another study [13], which identified several macaque CNVs, suggests that this model is applicable to some CNV hotspots in primates. Three CNV regions were identified that were shared between human and macaque where the flanking matching SD pairs in both species were clearly orthologous. In all three cases, the paralogous copies were more closely related to each other than to the orthologous copies.
Figure 1. Diagram of non-allelic homologous recombination (NAHR) hotspots and two models of their evolution. (A) Illustration of NAHR between tandem segmental duplications (SDs; green arrows) that results in the duplication or deletion of the intervening region (the outcome would be an inversion if the SDs are in inverted orientation). Two models could explain the evolution of NAHR hotspots. (B) The turnover model assumes that the two SD copies diverge in proportion to time and, thus, quickly become unable to initiate NAHR. Therefore, new hotspots must constantly arise for a certain number of hotspots to remain in the genome. (C) The gene conversion model considers the effect of paralogous gene conversion, which maintains the similarity between the two copies. Therefore, the SD is able to initiate NAHR for a much longer period of time.
This indicates that gene conversion has been maintaining high
similarity, thereby preserving the ability to initiate NAHR in both lineages for more than 25 million years [13]. Based on further analyses on primate CNV hotspots, we show here that most SD-associated CNV hotspots are more consistent with the gene conversion model than with the turnover model. We examined a previously published data set [15], which contains CNVs identified by previous large-scale population surveys [29,30], and identified 79 cases where both ends of the CNV regions (i.e., breakpoints) lie within matching SD pairs reported in the segmental duplications database [31,32]. We assume that these CNVs were likely formed by NAHR between the flanking SDs. We first looked at the average nucleotide divergence over the entire region. The divergence was higher than the average human–chimpanzee divergence (approximately 1.3%) for almost all SDs and higher than the average human–macaque divergence (approximately 6%) for approximately one-third of the SDs (Figure 2A). The actual ages of the SDs could be even older because gene conversion retards their divergence. Indeed, if we look at the spatial distribution of the divergence, most of the 79 SDs show a nonuniform distribution and contain identical stretches that are significantly longer than expected (70/79 at P <0.05; 43/70 at P <0.0001). Figure 2 clearly shows that the longest identical stretches of the observed data are much longer than those of the null data with the same level of divergence. Gene conversion is the most likely mechanism responsible for creating these unexpectedly long stretches of perfect identity within the SDs (see Box 2 for a detailed discussion on the divergence process of SDs undergoing gene conversion). The action of gene conversion between the matching SD pairs can be better demonstrated by a comparative genomics approach where SD sequences of multiple species are compared [23,33]. Consider an SD pair in human, Xh and Yh and their orthologs in chimpanzee, Xc and Yc. Gene conversion will create sites where Xh and
Gene conversion will create sites where Xh and Box 1. The lifespan of NAHR hotspots under the turnover model and the gene conversion model How long are NAHR hotspots expected to remain in the genome? The gene conversion model predicts that hotspots will remain active for a longer period of time compared with the null turnover model. We illustrate this using a simple computation. The time period is measured by the probability that the SD pair retains an identical stretch of !200 bp. Under the turnover model, we consider three different lengths of the SD (1, 10, and 100 kb). Although the requirement of !200-bp perfect identity is a simplified assumption, this computation provides an approximation of how long a hotspot should remain active and how gene conversion affects its longevity. We note that using different length requirements and changing the values of the parameters shown in Figure I do not affect the overall pattern. As shown in Figure I (red, green, and blue lines for 1, 10, and 100 kb, respectively), the probability quickly drops, especially when the length of the SD is short. A hotspot as old as the human–chimpanzee divergence is still likely to be active (unless short), whereas a hotspot as old as the human–macaque divergence (approximately 25-million years old) is highly unlikely to be active (even for an SD as long as 100 kb) (Figure I). Thus, the lifespan of a hotspot in primates is likely to be between 5 and 25 million years under the turnover model with no gene conversion. The situation dramatically changes under the gene conversion model. We added the effect of gene conversion using three different gene conversion rates for the case of a 10-kb SD (shown by green- dashed lines in Figure I). Including the effect of gene conversion increases the probability that NAHR will still occur after a given amount of time, especially when the rate of gene conversion is high. 
The rate of gene conversion should be highly variable because it is determined by several factors [60]. Thus, gene conversion can substantially increase the longevity of an NAHR hotspot.
[Figure I plot: probability (y axis) against time in million years (x axis, 0 to 30), with vertical markers for chimp, orangutan, and macaque; curves for 1 kb, 10 kb, and 100 kb with c = 0, and for 10 kb with c = 1 × 10⁻⁸, 3 × 10⁻⁸, and 5 × 10⁻⁸.]
Figure I. The probability that a given segmental duplication (SD) pair of 1 kb, 10 kb, and 100 kb (red, green, and blue lines, respectively) will retain an identical stretch of ≥200 bp based on 10 000 simulation runs. The expected probability was calculated by a simulation following the model in [61]. The model assumes random accumulation of point mutations at a rate of 10⁻⁹/site/generation and that gene conversion occurs at a given rate c per site (see [61] for details). The red, green, and blue solid lines represent simulation results of SDs of 1 kb, 10 kb, and 100 kb when c = 0, and the green-dashed lines represent results of a 10-kb SD when c = {1,3,5} × 10⁻⁸ with an average tract length of 1 kb (1/Q = 0.1 in [61]) representing low, intermediate, and high gene conversion rates. The vertical gray lines approximately correspond to the divergence between human and chimpanzee, orangutan, and macaque.
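The turnover-model curves in Box 1 (c = 0) can be approximated with a short Monte Carlo sketch. This is not the simulation of [61]: it assumes that each site of the SD pair differs independently with the same probability, ignores gene conversion entirely, and converts time to divergence with a simple two-lineage formula. The function names, the default number of runs, and the time-to-divergence conversion are illustrative assumptions.

```python
import math
import random


def divergence_from_time(time, mutation_rate):
    """Expected per-site divergence between two copies that mutate independently.

    Assumption: `time` and `mutation_rate` use the same unit (e.g. generations),
    and back mutations are ignored.
    """
    return 1.0 - math.exp(-2.0 * mutation_rate * time)


def simulate_longest_stretch(length, divergence, rng):
    """Longest identical run when every site differs with probability `divergence`."""
    if divergence <= 0.0:
        return length
    if divergence >= 1.0:
        return 0
    log_q = math.log(1.0 - divergence)
    longest = 0
    pos = 0  # first site not yet assigned
    while pos < length:
        # Number of matching sites before the next mismatch (geometric draw).
        gap = int(math.log(1.0 - rng.random()) / log_q)
        longest = max(longest, min(gap, length - pos))
        pos += gap + 1  # skip the identical run and the mismatching site
    return longest


def retention_probability(length, divergence, min_stretch=200, runs=2000, seed=1):
    """Fraction of simulated SD pairs that keep an identical stretch >= min_stretch."""
    rng = random.Random(seed)
    hits = sum(
        simulate_longest_stretch(length, divergence, rng) >= min_stretch
        for _ in range(runs)
    )
    return hits / runs
```

Calling `retention_probability(10_000, divergence_from_time(t, mu))` over a range of `t` gives a curve of the same kind as the solid lines in Figure I; reproducing the dashed lines would require adding a gene conversion step, which this sketch deliberately leaves out.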
Yh share the same nucleotide and Xc and Yc share another nucleotide. Although strong purifying selection can also create regions of low divergence, significant clustering of such sites cannot be explained by selection and is considered a strong signature of gene conversion [33,34]. Despite the genomic regions containing SDs often being poorly sequenced and/or assembled in nonhuman species, we were able to identify both copies of the SDs in the genome of another primate species for 35 out of the 79 cases. In almost all of those cases (34/35), we found regions that showed strong signatures of gene conversion. These results suggest that gene conversion and the retention of regions of perfect identity are common features of SD pairs in CNV regions, which directly results in the long-term preservation of the CNV hotspots detected by population surveys, that is, common CNVs.

The gene conversion model also applies to regions associated with genomic disorders
Does this typical pattern also apply to CNVs that cause genomic disorders, whose frequencies are often too low to be detected by a population survey? According to the literature, the answer seems to be yes. Dozens of ‘known’ disorders are often caused by NAHR between SDs (also referred to as low copy repeats) [17,35–37]. For 14 of them, we were able to identify unambiguously SDs containing NAHR breakpoints in the current human genome assembly (Table 1). These included two well-studied cases where both copies of the matching SD pair have been identified in other primate genomes and the action of gene conversion has been documented. One is the deletion of the azoospermia factor a (AZFa) locus on chromosome Y that is associated with male infertility (Table 1, #1). This locus is flanked by direct repeats and both copies are present in the orthologous regions of chimpanzee and gorilla [25]. The rearrangement breakpoints map to two specific regions within the duplicates.
One region shows 1285 bp of perfect identity and the other contains one single mismatch over 1609 bp, despite some other regions showing <90% identity. Strong signatures of gene conversion were reported in these two breakpoint regions [25,38]. The other example is the coagulation factor VIII (F8) locus, which contains two pairs of inverted repeats (Table 1, #2). Inversion between either pair causes hemophilia A. Despite originating before the divergence of human and African green monkey (and, thus, macaque), both pairs exhibit >99% identity [26]. It is interesting to note that hemophilia A caused by the inversion of the same region due to NAHR has also been reported in dog, although it is not clear whether the inversion is mediated by repeats ancestral to human and dog [39]. In addition to these two cases, we found five cases in which the orthologous copies of the matching SD pairs could be identified in at least one of the chimpanzee, orangutan, or macaque genomes (Table 1, #3–7). Each SD pair exhibited evidence of gene conversion. One interesting case is the SD pair associated with Incontinentia Pigmenti (Table 1, #7), a severe X-linked disorder that is lethal in males. The main cause of this disease is a genomic deletion that eliminates exons 4–10 of the inhibitor of kappa light polypeptide gene enhancer in B-cells, kinase gamma (NEMO/IKBKG) gene, which is located on Xq28. This deletion is caused by NAHR between two identical MER67B repeated sequences of 878 bp, one
[Figure 2 plots: (A) observed distribution and (B) null distribution of the longest identical stretch (bp, 0 to 2500) against nucleotide divergence (0.00 to 0.15), with vertical markers for chimp, orangutan, and macaque. Key: P ≥ 0.05, P < 0.05, P < 0.01, P < 0.0001.]
Figure 2.
The probability for observing the longest identical stretch present in the segmental duplications (SDs) flanking the copy number variants (CNVs). (A) The observed longest identical stretch (bp) within each SD pair flanking a CNV region is plotted against the divergence level. The significance of the observed length for each SD was evaluated by creating 10 000 random patterns of divergence where the diverged nucleotide positions are distributed randomly across the entire SD. Significant SDs are shown as filled squares, triangles, and circles (P <0.05, <0.01, and <0.0001, respectively) and nonsignificant SDs as open circles. (B) Typical distribution of the longest identical stretch in the randomized data used for evaluating the significance in (A). Only some of the data are shown to demonstrate the point. The vertical gray lines show the time corresponding to the average genome-wide nucleotide divergence between human and chimpanzee, orangutan, and macaque [62].
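The randomization described in the Figure 2 caption can be sketched as a small permutation test. The sketch assumes that the mismatches between the two SD copies are given as alignment positions and that indels are ignored. The function names and the add-one correction in the P value are illustrative choices, not details taken from the article.

```python
import random


def longest_identical_stretch(mismatches, length):
    """Longest run of identical sites, given 0-based mismatch positions."""
    longest = 0
    previous = -1
    for pos in sorted(mismatches):
        longest = max(longest, pos - previous - 1)
        previous = pos
    return max(longest, length - previous - 1)


def stretch_p_value(mismatches, length, n_random=10_000, seed=0):
    """Permutation P value for the observed longest identical stretch.

    The same number of mismatches is redistributed uniformly at random across
    the SD; the P value is the share of random patterns whose longest identical
    stretch is at least as long as the observed one (with an add-one correction).
    """
    rng = random.Random(seed)
    observed = longest_identical_stretch(mismatches, length)
    count = len(mismatches)
    at_least = 0
    for _ in range(n_random):
        shuffled = rng.sample(range(length), count)
        if longest_identical_stretch(shuffled, length) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_random + 1)
```

An SD pair whose mismatches are clustered at one end yields a long observed stretch and a small P value, whereas evenly spaced mismatches yield a P value close to 1.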
located in intron 3 and the other located downstream of the last exon of NEMO [40,41]. Both copies were present in the orthologous regions of the genomes of chimpanzee, orangutan, and macaque. The two copies show >99% similarity in all species and exhibit strong signatures of gene conversion. This indicates that gene conversion has maintained the genomic configuration that predisposes carriers to severe disorders (at least in humans) for more than 25 million years. This Xq28 region contains several other extreme examples of extensive homogenization of ancient duplicates within approximately 1 Mb. The F8 locus associated with hemophilia A [26] and the inverted repeats at the FLNA–EMD locus [27] (both discussed above), as well as the red- and green-opsin gene duplicates undergoing frequent gene conversion [42], are all in this region. Thus, the rate of gene conversion could be elevated in this region. Several other genomic disorders, such as Williams–Beuren syndrome, Smith–Magenis syndrome, neurofibromatosis type 1 (NF1), and DiGeorge/velocardiofacial syndrome (Table 1, #8–11), are caused by NAHR between SDs that are present in multiple copies in other primate genomes [43–48]. These reports are based on fluorescent in situ hybridization (FISH), and the ages of the exact copies involved in NAHR in humans are not clear. Nevertheless, strong signatures of gene conversion around the breakpoint regions of the SDs have been reported for all four cases [49–52]. For instance, many of the breakpoints of NAHR associated with NF1 map to a region within the 51-kb SD that shows elevated sequence identity, probably due to gene conversion, including a 700-bp identical stretch [50]. Also, several polymorphic sites shared by both SD copies, which are strong signatures of gene conversion, were detected around the breakpoint region of the SDs
Box 2.
Divergence pattern of a segmental duplication undergoing gene conversion How do SDs evolve when gene conversion frequently occurs? Following a duplication event, the divergence will remain at a low equilibrium as long as gene conversion is ongoing (see [61] for details). The accumulation of mutations or large indels will result in the termination of gene conversion and the increase of divergence in that region, whereas concerted evolution will continue in other regions. Regions undergoing gene conversion within the SD will decrease as time proceeds (Figure I). Future work will be needed to reveal the process that determines which region within the SD retains high similarity. One possibility is that any region within the SD can potentially retain high similarity because indels and point mutations accumulate randomly across the SD. Therefore, the ongoing or termination of gene conversion will occur randomly across the SD. Under this scenario, if we consider an SD pair that is shared among species, we would also expect that gene conversion would be ongoing in different regions of the SDs in each species (Figure IA). Note that when multiple species are compared, the homogenized regions will not be distributed completely randomly because of their shared evolutionary history. We can also imagine an alternative scenario where specific regions undergo homogenization for a long period of time. If the same specific region of the two copies is under selective constraint, the divergence will remain low within that region, which will make it more likely for gene conversion to occur. Also, gene conversion might be favored in a specific region if the retention of high similarity of that region has some functional benefit. The rate of gene conversion could also be elevated locally due to, for example, the DNA structure or the presence of certain motifs. 
Under this nonrandom scenario, gene conversion might continue to occur at the same specific region in different species even long after their divergence (Figure IB).
[Figure I panels (A) and (B): duplicate pairs drawn for human, chimp, and orangutan.]
Figure I. Illustration of how duplicates diverge in the presence of gene conversion. The green bars represent regions within the segmental duplications (SDs) that are undergoing gene conversion. Regions undergoing gene conversion gradually decrease due to large indels or the accumulation of mutations. (A) Scenario where the termination of gene conversion occurs randomly throughout the SD. Regions undergoing gene conversion in each species differ, although they are not entirely independent due to their shared history. (B) Scenario where selection favors ongoing gene conversion in specific regions (blue bar) due to some functional constraint. The continuation and termination of gene conversion is not random, and the same region likely retains high similarity in each species.
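The comparative signature used in the main text, namely sites where both human copies (Xh, Yh) share one nucleotide and both chimpanzee copies (Xc, Yc) share another, can be located with a few lines of code. This is a minimal sketch that assumes the four sequences are already aligned to equal length; the function name is hypothetical, and the test for significant clustering of such sites [33,34] is not implemented here.

```python
def conversion_signature_sites(xh, yh, xc, yc):
    """Return alignment positions where Xh == Yh, Xc == Yc, and Xh != Xc.

    Columns containing gaps or ambiguous characters are skipped.
    """
    if not (len(xh) == len(yh) == len(xc) == len(yc)):
        raise ValueError("all four sequences must be aligned to the same length")
    bases = set("ACGT")
    sites = []
    columns = zip(xh.upper(), yh.upper(), xc.upper(), yc.upper())
    for index, column in enumerate(columns):
        if not bases.issuperset(column):
            continue  # gap or ambiguity code in at least one sequence
        human_x, human_y, chimp_x, chimp_y = column
        if human_x == human_y and chimp_x == chimp_y and human_x != chimp_x:
            sites.append(index)
    return sites
```

Comparing the positions returned for different species pairs is one way to ask whether the homogenized regions coincide, as in scenario (B) of Figure I, or differ, as in scenario (A).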
associated with DiGeorge/velocardiofacial syndrome [51]. Thus, although we could not confirm the presence of both SD copies in other primate genomes for seven cases, including these four (Table 1, #8–14), possibly because these regions are repetitive and poorly assembled in other species, it is likely that gene conversion is involved in preserving the hotspots. In summary, the examples discussed here clearly show that the gene conversion model applies to SDs associated with genomic disorders, even though the rearrangements are pathological.

Concluding remarks
Here, we have shown that most SD-associated CNV hotspots have been preserved for a long period of time, much longer than hotspots of allelic recombination. Gene conversion appears to have a key role in the preservation by maintaining long stretches (e.g., several hundred bases) of perfect identity within SD pairs that can serve as substrates for NAHR. This has implications for disease, because the preservation often increases the risk of pathological rearrangements. The preservation should be determined by the balance between factors that cause the preservation (e.g., rate of gene conversion or selection favoring the preservation) and the reduction of fitness caused by the preservation (e.g., rate of NAHR or severity of the resulting disorder). Although the maintenance of stretches of high similarity by gene conversion might be promoted by selection due to a functional constraint in some cases, it is unlikely that all the homogenized regions are functional. Rather, given that most of the breakpoints in Table 1 map to repeat regions, functional constraint may not be the major contributor to the preservation. This is consistent with the observation that regions within the SDs being homogenized are different in each primate species.
Thus, it seems most likely that CNV hotspots, in general, are preserved as a byproduct of gene conversion that occurs at a high enough rate to override their negative consequences. Future work involving comparative analysis of sequences from multiple species and careful modeling of the divergence process of the SDs considering the effect of gene conversion and selection should be valuable for better understanding the different factors, including selection, that are responsible for the preservation of CNV hotspots (Box 2). The preservation of rearrangement hotspots might have had a key role in the adaptive evolution of humans. Recent studies have identified several regions within the human genome that comprise mosaic structures of duplication subunits (duplicons) as a result of recurrent duplications-within-duplications. In particular, several ‘core duplicons’ that have duplicated several times throughout evolution and are shared across multiple duplication blocks are known to contain primate-specific genes undergoing positive selection [37,53,54]. Another recent study showed that CNV regions shared among human, chimpanzee, and macaque (CNV hotspots) were significantly likely to overlap with genic regions [15]. This is in stark contrast with human-specific CNV regions, which are generally depleted of genes. Furthermore, many of the genes that overlap with CNV hotspots are evolving under positive selection, and some are evolving under balancing selection in humans [15]. It has been suggested that the genomic plasticity in these hotspot regions has provided the mutational flexibility for the residing genes to adapt to changing selective pressures [15,37,55].

Table 1. The presence of duplicates flanking human genomic disorder regions in other species and the occurrence of gene conversion
No. | Locus | Candidate genes (a) | Associated phenotypes | Evolutionary origin (b) | Gene conversion (c) | Refs
#1 | Yq11 | AZFa | Male infertility | Gorilla | + | [25,38]
#2 | Xq28 | F8 | Hemophilia A (d) | African green monkey | + | [26]
#3 | 5q35 | NSD1 | Sotos syndrome | Orangutan (macaque) | ++ | [63,64]
#4 | 15q24 | MAN2C1, CYP11A1, STRA6 | Growth retardation and microcephaly | Orangutan | ++ | [65]
#5 | 16p11 | MAPK3, MAZ, DOC2A, SEZ6L2, HIRIP3 | Autism | Chimp | ++ | [66,67]
#6 | 17p11 | PMP22 | Charcot-Marie-Tooth type 1A | Chimp | ++ | [68–70]
#7 | Xq28 | NEMO | Incontinentia pigmenti | Macaque | ++ | [40,41]
#8 | 7q11 | GTF2I | Williams–Beuren syndrome | (macaque, gibbon) | + | [43,44,52]
#9 | 17p11 | RAI1 | Smith–Magenis syndrome | (macaque) | + | [45,49]
#10 | 17q11 | NF1 | NF1 | (gorilla) | + | [46,50]
#11 | 22q12 | BCR, USP18, GGT | DiGeorge/velocardiofacial syndrome | (macaque) | + | [47,48,51,71]
#12 | 2q13 | NPHP1 | Familial juvenile nephronophthisis | ND | – | [72]
#13 | 10q22-23 | NRG3, GRID1, BMPR1, SNCG, GLUD1 | Cognitive and behavioral abnormalities | ND | – | [73]
#14 | 17q23 | TBX2, TBX4 | Developmental delay and heart defects | ND | – | [74]
(a) Abbreviations: BCR, breakpoint cluster region; BMPR1, bone morphogenetic protein receptor 1; CYP11A1, cytochrome P450, family 11, subfamily A, polypeptide 1; DOC2A, double C2-like domains, alpha; GGT, gamma-glutamyl transferase; GLUD1, glutamate dehydrogenase 1; GRID1, glutamate receptor, ionotropic, delta 1; GTF2I, general transcription factor II i; HIRIP3, HIRA interacting protein 3; MAN2C1, mannosidase, alpha, class 2C, member 1; MAPK3, mitogen-activated protein kinase 3; MAZ, MYC-associated zinc finger protein; NPHP1, nephronophthisis 1; NRG3, neuregulin 3; NSD1, nuclear receptor binding SET domain protein 1; PMP22, peripheral myelin protein 22; RAI1, retinoic acid induced 1; SEZ6L2, seizure related 6 homolog (mouse)-like 2; SNCG, synuclein, gamma; STRA6, stimulated by retinoic acid 6; TBX, T-box; USP18, ubiquitin specific peptidase 18.
(b) The most distant species from human in which the duplicates were confirmed to be present based on genomic sequences are listed. Those not based on genomic sequences (e.g. FISH signals) are shown in brackets. Those identified in this study are in bold. ‘ND’ denotes those where the presence of both copies could not be confirmed in the genome of chimpanzee, orangutan, or macaque.
(c) + indicates duplicates where gene conversion has likely occurred; ++ indicates those that are based on this study.
(d) Caused by inversion due to NAHR between inverted duplicates. The remaining disorders are all caused by deletions due to NAHR between duplicates in direct orientation.

If so, we further suggest that gene conversion has had an important role in maintaining
genomic plasticity, which most likely contributed to the adaptive evolution of the human lineage. Almost all the duplicates we examined here showed evidence of gene conversion. This might seem at odds with previous studies that detected gene conversion in only approximately 10–15% of human duplicated gene pairs [56,57]. However, these studies did not focus on duplicates of low divergence (e.g., <5% divergence) that are either young or undergoing extensive gene conversion. We predict the fraction of recently duplicated sequences containing regions still undergoing gene conversion to be substantially higher. Indeed, a study analyzing 30 multiple alignments of human duplicated sequences of <4% nucleotide divergence found evidence of sequence exchange due to gene conversion or unequal crossing over in all 30 alignments [58]. A recent population survey of CNVs in multicopy gene families also reported several cases of gene conversion [59]. Thus, there could be a large number of nearly identical regions undergoing gene conversion within the genome, especially in SDs that are located close to each other. These regions could be acting as rearrangement hotspots that are yet to be identified. The accumulating genomic data of human population and other primate species should enable us to identify such regions undergoing gene conversion. This should be a powerful approach to detect potential hotspots of genetic disorders that are difficult to detect due to their low frequencies in the human population. In this respect, we note that many hotspot regions are likely to be missed by low-coverage genomes or resequencing studies because they are often highly repetitive. Thus, more high-quality reference genomes from nonhuman primates and also multiple human individuals in the future should be valuable in understanding perhaps the most important genomic regions in terms of human disease and human evolution.

Acknowledgments
We thank K.
Teshima for technical help. This work is supported by a grant from Japan Society for the Promotion of Science (JSPS) to H.I. J.A.F. is a JSPS postdoctoral fellow.

References
1 Coop, G. and Przeworski, M. (2007) An evolutionary view of human recombination. Nat. Rev. Genet. 8, 23–34
2 Webster, M.T. and Hurst, L.D. (2012) Direct and indirect consequences of meiotic recombination: implications for genome evolution. Trends Genet. 28, 101–109
3 Myers, S. et al. (2005) A fine-scale map of recombination rates and hotspots across the human genome. Science 310, 321–324
4 Ptak, S.E. et al. (2005) Fine-scale recombination patterns differ between chimpanzees and humans. Nat. Genet. 37, 429–434
5 Myers, S. et al. (2008) A common sequence motif associated with recombination hot spots and genome instability in humans. Nat. Genet. 40, 1124–1129
6 Winckler, W. et al. (2005) Comparison of fine-scale recombination rates in humans and chimpanzees. Science 308, 107–111
7 Auton, A. et al. (2012) A fine-scale chimpanzee genetic map from population sequencing. Science 336, 193–198
8 Ponting, C.P. (2011) What are the genomic drivers of the rapid evolution of PRDM9? Trends Genet. 27, 165–171
9 Baudat, F. et al. (2010) PRDM9 is a major determinant of meiotic recombination hotspots in humans and mice. Science 327, 836–840
10 Myers, S. et al. (2010) Drive against hotspot motifs in primates implicates the PRDM9 gene in meiotic recombination. Science 327, 876–879
11 Parvanov, E.D. et al. (2010) Prdm9 controls activation of mammalian recombination hotspots. Science 327, 835
12 Perry, G.H. et al. (2008) Copy number variation and evolution in humans and chimpanzees. Genome Res. 18, 1698–1710
13 Lee, A.S. et al. (2008) Analysis of copy number variation in the rhesus macaque genome identifies candidate loci for evolutionary and human disease studies. Hum. Mol. Genet. 17, 1127–1136
14 Gazave, E. et al.
(2011) Copy number variation analysis in the great apes reveals species-specific patterns of structural variation. Genome Res. 21, 1626–1639
15 Gokcumen, O. et al. (2011) Refinement of primate copy number variation hotspots identifies candidate genomic regions evolving under positive selection. Genome Biol. 12, R52
16 Conrad, D.F. et al. (2010) Mutation spectrum revealed by breakpoint sequencing of human germline CNVs. Nat. Genet. 42, 385–391
17 Liu, P. et al. (2012) Mechanisms for recurrent and complex human genomic rearrangements. Curr. Opin. Genet. Dev. 22, 211–220
18 Waldman, A.S. (2008) Ensuring the fidelity of recombination in mammalian chromosomes. Bioessays 30, 1163–1171
19 Liu, P. et al. (2011) Frequency of nonallelic homologous recombination is correlated with length of homology: evidence that ectopic synapsis precedes ectopic crossing-over. Am. J. Hum. Genet. 89, 580–588
20 Jinks-Robertson, S. et al. (1993) Substrate length requirements for efficient mitotic recombination in Saccharomyces cerevisiae. Mol. Cell. Biol. 13, 3937–3950
21 Reiter, L.T. et al. (1998) Human meiotic recombination products revealed by sequencing a hotspot for homologous strand exchange in multiple HNPP deletion patients. Am. J. Hum. Genet. 62, 1023–1033
22 Alekseyev, M.A. and Pevzner, P.A. (2010) Comparative genomics reveals birth and death of fragile regions in mammalian evolution. Genome Biol. 11, R117
23 Gao, L-Z. and Innan, H. (2004) Very low gene duplication rate in the yeast genome. Science 306, 1367–1370
24 Chen, J-M. et al. (2011) Gene conversion in human genetic disease. Genes 1, 550–663
25 Hurles, M.E. et al. (2004) Origins of chromosomal rearrangement hotspots in the human genome: evidence from the AZFa deletion hotspots. Genome Biol. 5, R55
26 Bagnall, R.D. et al. (2005) Gene conversion and evolution of Xq28 duplicons involved in recurring inversions causing severe hemophilia A. Genome Res. 15, 214–223
27 Cáceres, M. et al.
(2007) A recurrent inversion on the eutherian X chromosome. Proc. Natl. Acad. Sci. U.S.A. 104, 18571–18576 28 Zody, M.C. et al. (2008) Evolutionary toggling of the MAPT 17q21.31 inversion region. Nat. Genet. 40, 1076–1083 29 Conrad, D.F. et al. (2010) Origins and functional impact of copy number variation in the human genome. Nature 464, 704–712 30 Park, H. et al. (2010) Discovery of common Asian copy number variants using integrated high-resolution array CGH and massively parallel DNA sequencing. Nat. Genet. 42, 400–405 31 Bailey, J.A. et al. (2001) Segmental duplications: organization and impact within the current human genome project assembly. Genome Res. 11, 1005–1017 32 She, X. et al. (2004) Shotgun sequence assembly and recent segmental duplications within the human genome. Nature 431, 927–930 33 Osada, N. and Innan, H. (2008) Duplication and gene conversion in the Drosophila melanogaster genome. PLoS Genet. 4, e1000305 34 Fawcett, J.A. and Innan, H. (2011) Neutral and non-neutral evolution of duplicated genes with gene conversion. Genes 2, 191–209 35 Stankiewicz, P. and Lupski, J.R. (2002) Molecular-evolutionary mechanisms for genomic disorders. Curr. Opin. Genet. Dev. 12, 312–319 36 Mefford, H.C. and Eichler, E.E. (2009) Duplication hotspots, rare genomic disorders, and common disease. Curr. Opin. Genet. Dev. 19, 196–204 37 Marques-Bonet, T. et al. (2009) The origins and impact of primate segmental duplications. Trends Genet. 25, 443–454 38 Bosch, E. et al. (2004) Dynamics of a human interparalog gene conversion hotspot. Genome Res. 14, 835–844 39 Lozier, J.N. et al. (2002) The Chapel Hill hemophilia A dog colony exhibits a factor VIII gene inversion. Proc. Natl. Acad. Sci. U.S.A. 99, 12991–12996 Opinion Trends in Genetics October 2013, Vol. 29, No. 10 567
  15. 15. 40 Smahi, A. et al. (2000) Genomic rearrangement in NEMO impairs NF- kB activation and is a cause of incontinentia pigmenti. Nature 405, 466–472 41 Aradhya, S. et al. (2001) A recurrent deletion in the ubiquitously expressed NEMO (IKK-U) gene accounts for the vast majority of incontinentia pigmenti mutations. Hum. Mol. Genet. 10, 2171–2179 42 Zhao, Z. et al. (1998) Frequent gene conversion between human red and green opsin genes. J. Mol. Evol. 46, 494–496 43 DeSilva, U. et al. (1999) Comparative mapping of the region of human chromosome 7 deleted in Williams syndrome. Genome Res. 9, 428–436 44 Antonell, A. et al. (2005) Evolutionary mechanisms shaping the genomic structure of the Williams-Beuren syndrome chromosomal region at human 7q11.23. Genome Res. 15, 1179–1188 45 Park, S-S. et al. (2002) Structure and evolution of the Smith-Magenis syndrome repeat gene clusters, SMS-REPs. Genome Res. 12, 729–738 46 De Raedt, T. et al. (2004) Genomic organization and evolution of the NF1 microdeletion region. Genomics 84, 346–360 47 Shaikh, T.H. et al. (2000) Chromosome 22-specific low copy repeats and the 22q11.2 deletion syndrome: genomic organization and deletion endpoint analysis. Hum. Mol. Genet. 9, 489–501 48 Bailey, J.A. et al. (2002) Human-specific duplication and mosaic transcripts: the recent paralogous structure of chromosome 22. Am. J. Hum. Genet. 70, 83–100 49 Bi, W. et al. (2003) Reciprocal crossovers and a positional preference for strand exchange in recombination events resulting in deletion or duplication of chromosome 17p11.2. Am. J. Hum. Genet. 73, 1302–1315 50 Forbes, S.H. et al. (2004) Genomic context of paralogous recombination hotspots mediating recurrent NF1 region microdeletion. Genes Chromosomes Cancer 41, 12–25 51 Pavlicek, A. et al. (2005) Traffic of genetic information between segmental duplications flanking the typical 22q11.2 deletion in velo- cardio-facial syndrome/DiGeorge syndrome. Genome Res. 15, 1487–1495 52 Baye´s, M. et al. 
(2003) Mutational mechanisms of Williams-Beuren syndrome deletions. Am. J. Hum. Genet. 73, 131–151 53 Johnson, M.E. et al. (2006) Recurrent duplication-driven transposition of DNA during hominoid evolution. Proc. Natl. Acad. Sci. U.S.A. 103, 17626–17631 54 Jiang, Z. et al. (2007) Ancestral reconstruction of segmental duplications reveals punctuated cores of human genome evolution. Nat. Genet. 39, 1361–1368 55 Iskow, R.C. et al. (2012) Exploring the role of copy number variants in human adaptation. Trends Genet. 28, 245–257 56 McGrath, C.L. et al. (2009) Minimal effect of ectopic gene conversion among recent duplicates in four mammalian genomes. Genetics 182, 615–622 57 Ezawa, K. et al. (2010) Evolutionary pattern of gene homogenization between primate-specific paralogs after human and macaque speciation using the 4-2-4 method. Mol. Biol. Evol. 27, 2152–2171 58 Jackson, M.S. et al. (2005) Evidence for widespread reticulate evolution within human duplicons. Am. J. Hum. Genet. 77, 824–840 59 Sudmant, P.H. et al. (2010) Diversity of human copy number variation and multicopy genes. Science 330, 641–646 60 Mansai, S.P. et al. (2011) The rate and tract length of gene conversion. Genes 2, 313–331 61 Teshima, K.M. and Innan, H. (2004) The effect of gene conversion on the divergence between duplicated genes. Genetics 166, 1553–1560 62 Scally, A. et al. (2012) Insights into hominid evolution from the gorilla genome sequence. Nature 483, 169–175 63 Visser, R. et al. (2005) Identification of a 3.0-kb major recombination hotspot in patients with Sotos syndrome who carry a common 1.9-Mb microdeletion. Am. J. Hum. Genet. 76, 52–67 64 Kurotaki, N. et al. (2005) Sotos syndrome common deletion is mediated by directly oriented subunits within inverted Sos-REP low-copy repeats. Hum. Mol. Genet. 14, 535–542 65 Sharp, A.J. et al. (2007) Characterization of a recurrent 15q24 microdeletion syndrome. Hum. Mol. Genet. 16, 567–572 66 Kumar, R.A. et al. 
(2008) Recurrent 16p11.2 microdeletions in autism. Hum. Mol. Genet. 17, 628–638 67 Weiss, L.A. et al. (2008) Association between microdeletion and microduplication at 16p11.2 and autism. N. Engl. J. Med. 358, 667–675 68 Kiyosawa, H. and Chance, P.F. (1996) Primate origin of the CMT1A- REP repeat and analysis of a putative transposon-associated recombinational hotspot. Hum. Mol. Genet. 5, 745–753 69 Hurles, M.E. (2001) Gene conversion homogenizes the CMT1A paralogous repeats. BMC Genomics 2, 11 70 Lindsay, S.J. et al. (2006) A chromosomal rearrangement hotspot can be identified from population genetic variation and is coincident with a hotspot for allelic recombination. Am. J. Hum. Genet. 79, 890–902 71 Shaikh, T.H. et al. (2007) Low copy repeats mediate distal chromosome 22q11.2 deletions: sequence analysis predicts breakpoint mechanisms. Genome Res. 17, 482–491 72 Saunier, S. et al. (2000) Characterization of the NPHP1 locus: mutational mechanism involved in deletions in familial juvenile nephronophthisis. Am. J. Hum. Genet. 66, 778–789 73 Balciuniene, J. et al. (2007) Recurrent 10q22-q23 deletions: a genomic disorder on 10q associated with cognitive and behavioral abnormalities. Am. J. Hum. Genet. 80, 938–947 74 Ballif, B.C. et al. (2010) Identification of a recurrent microdeletion at 17q23.1q23.2 flanked by segmental duplications associated with heart defects and limb abnormalities. Am. J. Hum. Genet. 86, 454–461 Opinion Trends in Genetics October 2013, Vol. 29, No. 10 568
Human housekeeping genes, revisited

Eli Eisenberg1 and Erez Y. Levanon2

1 Raymond and Beverly Sackler School of Physics and Astronomy, Tel-Aviv University, Tel Aviv 69978, Israel
2 Mina and Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat Gan 52900, Israel

Housekeeping genes are involved in basic cell maintenance and, therefore, are expected to maintain constant expression levels in all cells and conditions. Identification of these genes facilitates exposure of the underlying cellular infrastructure and increases understanding of various structural genomic features. In addition, housekeeping genes are instrumental for calibration in many biotechnological applications and genomic studies. Advances in our ability to measure RNA expression have resulted in a gradual increase in the number of identified housekeeping genes. Here, we describe housekeeping gene detection in the era of massively parallel sequencing and RNA-seq. We emphasize the importance of expression at a constant level and provide a list of 3804 human genes that are expressed uniformly across a panel of tissues. Several exceptionally uniform genes are singled out for future experimental use, such as RT-PCR control genes. Finally, we discuss both the ways in which current technology can overcome some of the obstacles encountered in the past and several as yet unmet challenges.

The concept of housekeeping genes

Housekeeping genes are genes that are required for the maintenance of basal cellular functions that are essential for the existence of a cell, regardless of its specific role in the tissue or organism. Thus, they are expected to be expressed in all cells of an organism under normal conditions, irrespective of tissue type, developmental stage, cell cycle state, or external signal. From a fundamental point of view, full characterization of the minimal set of genes required to sustain life is of special interest [1,2]. In addition, housekeeping genes are widely used as internal controls for experimental as well as computational studies [3–7]. Furthermore, many studies have highlighted unique genomic and evolutionary features of this special group of genes. For example, housekeeping genes were shown to have shorter introns and exons [8–11], a different repetitive sequence environment [enriched in short interspersed elements (SINEs) and depleted in long interspersed elements (LINEs)] [12,13], more simple sequence repeats in the 5′ untranslated region (UTR) [15], lower conservation of the promoter sequence [14], and lower potential for nucleosome formation in the 5′ region of these genes [16]. Protein products of housekeeping genes are enriched in some domain families [17]. These studies shed light on general aspects of gene structure and evolution.

Early detection schemes for housekeeping genes

The notion of housekeeping genes has been in use in the literature for nearly 40 years. In particular, several mammalian genes have been used widely as internal controls in experimental expression studies, such as glyceraldehyde-3-phosphate dehydrogenase (GAPDH), tubulins, cyclophilin, albumin, actins, 18S rRNA or 28S rRNA. Yet, only at the turn of the 21st century, with the advancement of transcriptome profiling technology, did it become possible to identify, systematically, a set of housekeeping genes. These first attempts used large-scale expression data [18–20] or, more often, microarray profiling to look at the expression levels of many genes across a panel of tissue samples. Typically, they resulted in lists of hundreds to thousands of genes [8,19–25], many more than the dozen or so commonly used control genes. Generally, the many lists produced show a considerable level of consistency.
Typically, the intersection of any two of them yields approximately 50% coverage [8,24,26], suggesting that the sets are enriched in housekeeping genes but still lacking in specificity and selectivity. This could be partly attributed to the limited number of tissues examined in each separate analysis and the differences between the tissues across analyses. However, it is likely that technological limitations affecting the underlying data have also strongly affected the quality and reproducibility of the results.

In particular, first-generation microarray technology is known to have had many problematic nonspecific probes [27]. Even the improved versions of microarrays are typically assumed to achieve only an approximately twofold accuracy in expression level measurement, and they are limited in their dynamic range. These inaccuracies could have large effects on deciding whether a gene is expressed (regardless of the rather arbitrary expression cutoff used to determine which probe set is ‘expressed’).

A second, more fundamental, issue relates to the very definition of housekeeping genes. Should one look for genes merely being expressed in all tissues, or should the gene also be expressed at a constant level across tissues? Early studies generally adopted the first definition and, in fact, GAPDH and other popular housekeeping genes for experimental controls have been found to vary considerably across tissues [3,28–30]. This choice was the pragmatic one to make, because it enabled the use of the binary present or absent calls of the microarray and made normalization unnecessary. However, this approach has two shortcomings. First, measurement errors and stochastic noise make it difficult to distinguish genes absent from the sample from those weakly expressed. Second, and more importantly, it was later appreciated that a large part of the genome is expressed at a low basal level in all tissues [31]. Thus, most genes are expressed at some background level in all tissues. In light of this observation, and to make the concept of housekeeping genes more useful, one should either modify the definition of housekeeping genes to ‘genes that are expressed above some cutoff level’, which necessarily introduces an explicit arbitrary parameter, or adopt the second option above and look for genes that are expressed at a constant level across all normal tissues.

Opinion 0168-9525/$ – see front matter © 2013 Elsevier Ltd. All rights reserved. Corresponding author: Eisenberg, E. Keywords: housekeeping genes; RNA-seq; gene expression patterns; internal control; next generation sequencing. Trends in Genetics, October 2013, Vol. 29, No. 10 569

Introducing an expression cutoff requires a quantitative comparison of expression levels of different genes in the same sample. This is known to be a complex problem, due to questions of bias in PCR amplification, different probe affinities, and so on. Furthermore, normalizing the values obtained from different experiments is also a nontrivial challenge. Early microarray studies generally used linear normalization, setting the mean expression level, or the trimmed mean, constant. Later, the more sophisticated quantile normalization was introduced [32]. These and other normalization procedures generally assume similar expression-value distributions for all samples studied. This could be justified for samples coming from identical or highly similar biological conditions, perhaps even for healthy and diseased samples of the same tissue. However, it is not yet clear how accurate this assumption is for cross-tissue comparisons, and how much it skews the results [33].

A third issue that was not fully addressed in previous studies of housekeeping genes is alternative splicing. It has been appreciated for more than a decade that most human genes have more than one isoform [34,35]. Thus, one could envision a situation in which one splice variant is constitutively expressed, making it a housekeeping transcript, whereas another transcript from the same gene exhibits a more complex expression profile (Figure 1A).
Moreover, it is possible that a single gene expresses one transcript in one set of tissues and another transcript in other tissues, such that the gene, as such, is always expressed, but each transcript is specific to a subset of tissues. In principle, then, one would like to define the set of housekeeping transcripts. Early microarray technology did rather poorly in distinguishing between transcripts and, thus, some studies deliberately ‘zoomed out’ to the gene level.

Housekeeping genes in the deep-sequencing era

New horizons are opening as deep-sequencing technology takes over from microarrays as the method of choice for transcriptome profiling [36]. RNA-seq was found to be preferable to microarrays as a tool for expression measurement. Unlike microarrays, RNA-seq does not require prior knowledge of the genomic sequence (although it is helpful for analysis), and requires smaller amounts of RNA. It provides information at the single-base level, enabling better assessment of alternative splicing and even allelic variation. Background levels in RNA-seq are lower, due to the better specificity and improved control of in silico sequence alignment compared with probe hybridization. Consequently, a wider dynamic range is accessible. Importantly, RNA-seq is also more accurate in quantifying spike-in RNA controls of known concentration, and produces expression values that correlate better with quantitative PCR (qPCR) results [36] and protein levels [37].

This new and improved platform makes it possible to meet some long-standing challenges, but it also opens up new questions.

In terms of normalization, read coverage generally provides a rather robust measure for comparing different genomic regions within the same sample. Exceptions to this are generally a result of alignment problems in repetitive or duplicative regions (Figure 1B).
For the task of housekeeping gene identification, these can be partly avoided by limiting analysis to the nonrepetitive coding regions of the exons [33] and using long reads. Note, however, that highly expressed coding exons (e.g., GAPDH) are prone to having more duplications [38], resulting in alignment problems. Small-scale PCR biases are expected to average out when looking at the expression level over whole exons. By contrast, the issue of cross-tissue normalization is still open. The popular reads per kilobase per million mapped reads (RPKM) measure normalizes for the two most obvious factors affecting the raw number of reads per gene, transcript, or exon: the total number of reads produced and the length of the gene, transcript, or exon [39]. The RPKM measure is simple and straightforward, but does not fully solve the between-sample normalization issue. More subtle biases were detected, resulting from variations in transcript length distribution in the sample, coverage dependence on local sequence due to GC content, priming and other biases, and variability in mappability of different regions [40–45]. There is still no consensus as to the best way to account for all of these in a standard and consistent way.

Figure 1. Examples of challenges in housekeeping gene detection. (A) Genes having several splice variants could have different expression levels [indicated by the number of reads (black bars)] for different parts of the gene. (B) Duplicative regions, due to pseudogenes and other duplications, complicate unique read alignments, thus biasing expression-level measurement. (C) Expression measurement has several biases, including the lower expression (on average) of the upstream exons due to imperfect reverse transcription resulting in partial cDNA molecules.

In terms of housekeeping gene identification, RNA-seq data indeed show explicitly that basal (leaky) low expression levels can be found throughout the genome. Therefore, any definition of housekeeping genes should refer to the quantitative expression level. This can be done using a cutoff, or by adding the requirement of low variability in expression across tissues. Here, we promote the latter course of action. Setting a cutoff value as the main criterion for defining the housekeeping genes is undesirable for three reasons. First, there seems to be no natural cutoff value, thus forcing one to make an arbitrary choice. Second, due to the lack of a proper intergene normalization scheme, the same RPKM values for different genes could indicate different expression levels [4,46]. Third, using the expression level as a measure of importance for cell function is also questionable: cells are likely to require different gene products at different concentrations. There is no good reason to exclude genes that are constantly expressed at a mid rather than a high level. Thus, we feel that low variability should be used as the main criterion for selecting housekeeping genes.

Another advantage of RNA-seq data is that they measure the expression along the gene (similar to the older exon arrays) and can thereby provide expression at the exon level. Some software tools try to extract transcript expression levels from RNA-seq data (e.g., [47]). However, there is still much to be desired in terms of reliability within the limits of current technology [43]. This is expected to improve significantly as read length increases. Note that recent findings [48] show significant variability in exon boundaries, making even the comparison of exon expression imperfect.
An interim partial solution, which we adopt below, is to measure expression at the more basic exon level and aim to define a set of housekeeping exons.

Extracting a set of housekeeping genes from Human BodyMap data

Here, we demonstrate the power of the new technology for identifying housekeeping genes by analyzing expression data from the Human BodyMap (HBM) 2.0 Project. This includes publicly available RNA-seq data (GEO accession number GSE30611, HBM), generated on HiSeq 2000 instruments, providing expression profiling in 16 normal human tissue types: adrenal, adipose, brain, breast, colon, heart, kidney, liver, lung, lymph, ovary, prostate, skeletal muscle, testes, thyroid, and white blood cells. Two different read lengths were used for each tissue (2 × 50-bp paired-end and 1 × 75-bp single-read data), each of which was sequenced in a separate HiSeq 2000 lane. We aligned the reads to the genome using the Bowtie2 aligner [49] and measured the read coverage of each of the coding exons of the (uniquely aligned) RefSeq sequences [50], in normalized RPKM units. For exons that were partly coding, only the coding part was considered. Short exons (<50 bp) are prone to alignment problems and were discarded.

We compared the RPKM values obtained from the paired-end data and the single-read data to assess the technical reproducibility of the RPKM measure, and found that the typical fold-ratio between the two was 1.5 (Figure 2A). We observed a bias against the upstream exons of transcripts, which tended to have lower expression levels. This effect might result from imperfect reverse transcription resulting in cDNA missing the upstream part of the transcript (Figure 1C).
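The per-exon RPKM measure and the reproducibility comparison described above can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function names, the read counts in the usage note, and the use of a population standard deviation as a stand-in for the fitted width of the log-ratio distribution are all assumptions made here.

```python
import math

# The text discards exons shorter than 50 bp because they align poorly.
MIN_EXON_LENGTH_BP = 50


def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and library size must be positive")
    # Dividing by (length / 1e3) and by (library size / 1e6) gives the 1e9 factor.
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


def is_long_enough(exon_length_bp):
    """True if the exon survives the short-exon filter (at least 50 bp)."""
    return exon_length_bp >= MIN_EXON_LENGTH_BP


def log2_ratio(rpkm_a, rpkm_b):
    """log2 ratio between two positive RPKM estimates of the same exon."""
    return math.log2(rpkm_a / rpkm_b)


def typical_fold(log2_ratios):
    """Fold change corresponding to the spread of a set of log2 ratios.

    Assumption: the spread is taken as the population standard deviation,
    whereas the article reports the width of the observed distribution.
    """
    mean = sum(log2_ratios) / len(log2_ratios)
    variance = sum((r - mean) ** 2 for r in log2_ratios) / len(log2_ratios)
    return 2 ** math.sqrt(variance)
```

For example, with hypothetical counts, `rpkm(600, 150, 80_000_000)` and `rpkm(300, 150, 60_000_000)` give two estimates for the same 150-bp exon, and `log2_ratio` of the pair is the quantity histogrammed in Figure 2A. A spread of about 0.55 in log2 units corresponds to `2 ** 0.55`, roughly the 1.5-fold variability quoted in the text.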
Figure 2. Characterization of the expression profile in Human BodyMap (HBM) data. (A) Reproducibility of the measured reads per kilobase per million mapped reads (RPKM) levels per exon, as assessed by comparing the 50-bp paired-end and the 75-bp single-read data [axis: log2(RPKM50_PE/RPKM75)]. The continuous line is the best fit for a Gaussian distribution, added to accentuate the fat tails of the actual distribution. The width of the distribution is approximately 0.55 (log2 units), leading to a typical variability of 1.5-fold. (B) Fraction of exons expressed above a cutoff value in all 16 tissues, for different cutoff values [axes: fraction of exons passing; cutoff value (RPKM); key: minimum expression over tissues, geometric mean expression]. In total, 55% of all exons are expressed to a detectable level in the HBM data set. (C) Cumulative distribution of the exon expression variance [axes: fraction of exons below cutoff; std[log2(RPKM)] cutoff]. Most of the exons being expressed in all tissues have standard-deviation [log2(RPKM)] values between 0.7 and 1.5.
Figure 2B presents the fraction of exons being expressed above a certain cutoff RPKM value in all tissues. Note that approximately 55% of all exons are expressed at a detectable level in all HBM tissues, demonstrating why the old definition of housekeeping genes is not useful. In addition, it is hard to detect a natural expression cutoff value. The variation in expression level is estimated by the standard deviation of log2(RPKM) over samples. Figure 2C shows the cumulative distribution of these standard deviation values for the different exons.

To qualify as a housekeeping exon, an exon must be expressed in all tissues at any nonzero level, and must exhibit a uniform expression level across tissues. Thus, we adopted the following criteria: (i) expression observed in all tissues; (ii) low variance over tissues: standard-deviation [log2(RPKM)] < 1; and (iii) no exceptional expression in any single tissue; that is, no log-expression value differed from the averaged log2(RPKM) by two (fourfold) or more. These criteria resulted in a list of 37 363 unique exons (20% of studied exons), belonging to 11 648 RefSeq transcripts and 6289 genes. These included most of the stable housekeeping genes reported based on microarray data [30].

We define a housekeeping gene as a gene for which at least one RefSeq transcript has more than half of its exons meeting the previous criteria (thus being housekeeping exons). Altogether, we found 3804 such human housekeeping genes. The lists of housekeeping exons and housekeeping genes are available at$elieis/HKG/. In addition, we propose a short list of highly uniform and strongly expressed genes that may be used for calibration in future experimental settings (Table 1).

As expected, the housekeeping genes are enriched in gene ontology (GO) categories associated with basic cellular activity, such as gene expression and biogenesis of nucleotides and amino acids, catabolic processes, protein localization, and so on [51]. The overlap with previous lists is partial, due to the different definition of housekeeping genes used. In particular, GAPDH and actin beta (ACTB) do not appear in our new list, because these genes vary across tissues [3,28–30]. Nevertheless, some of the most pronounced features previously reported for housekeeping genes, such as the much shorter introns [8–11] and more duplications [52], also characterize the new set.

Table 1. Genes proposed for calibration (a)
Gene symbol | RefSeq accession number | Gene name | Genomic coordinates (hg19) of exons passing the filters
C1orf43 | NM_015449 | Chromosome 1 open reading frame 43 | chr1 154192817 154192883; chr1 154186932 154187050; chr1 154186368 154186422; chr1 154184933 154185100; chr1 154184795 154184854
CHMP2A | NM_014453 | Charged multivesicular body protein 2A | chr19 59065411 59065579; chr19 59063625 59063805; chr19 59063421 59063552
EMC7 | NM_020154 | ER membrane protein complex subunit 7 | chr15 34382517 34382656; chr15 34380253 34380334; chr15 34376537 34376687
GPI | NM_000175 | Glucose-6-phosphate isomerase | chr19 34857687 34857756; chr19 34859487 34859607; chr19 34868639 34868786; chr19 34869838 34869910; chr19 34872370 34872424; chr19 34884152 34884213; chr19 34884818 34884971; chr19 34887205 34887335; chr19 34887485 34887562; chr19 34890111 34890240; chr19 34890460 34890536; chr19 34890623 34890690
PSMB2 | NM_002794 | Proteasome subunit, beta type, 2 | chr1 36101910 36102033; chr1 36096874 36096945; chr1 36070833 36070883
PSMB4 | NM_002796 | Proteasome subunit, beta type, 4 | chr1 151372456 151372663; chr1 151372917 151373064; chr1 151373239 151373321; chr1 151373714 151373831
RAB7A | NM_004637 | Member RAS oncogene family | chr3 128525214 128525433; chr3 128526385 128526514; chr3 128532169 128532262
REEP5 | NM_005669 | Receptor accessory protein 5 | chr5 112256859 112256953; chr5 112238076 112238215; chr5 112222711 112222880
SNRPD3 | NM_004175 | Small nuclear ribonucleoprotein D3 | chr22 24953642 24953768; chr22 24963951 24964144
VCP | NM_007126 | Valosin containing protein | chr9 35067887 35068060; chr9 35066671 35066814; chr9 35064150 35064282; chr9 35062213 35062347; chr9 35061999 35062135; chr9 35061573 35061686; chr9 35061011 35061176; chr9 35060797 35060920; chr9 35060309 35060522; chr9 35059489 35059798; chr9 35059060 35059216; chr9 35057372 35057527; chr9 35057116 35057219
VPS29 | NM_016226 | Vacuolar protein sorting 29 homolog | chr12 110930800 110931036; chr12 110929812 110929927; chr12 110929812 110929927
(a) Genes chosen have most of their exons showing geometrical mean expression exceeding RPKM = 50, standard deviation of log2(RPKM) < 0.5, and no single tissue showing an expression level different from the geometrical mean by twofold or more. Genes with pseudogenes were excluded.
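The exon-level criteria (i) to (iii), the gene-level definition, and the stricter thresholds given in the footnote of Table 1 can be expressed compactly in code. The sketch below is illustrative only: the function names and input layout are hypothetical, and the population standard deviation is an assumption, because the text does not say which estimator was used.

```python
import math


def passes_uniformity(rpkm_by_tissue, sd_cutoff, max_log2_deviation,
                      min_geometric_mean=0.0):
    """Apply the uniformity filters to one exon's RPKM values across tissues."""
    values = list(rpkm_by_tissue)
    # (i) expression must be observed (nonzero) in every tissue
    if not values or any(v <= 0 for v in values):
        return False
    logs = [math.log2(v) for v in values]
    mean_log = sum(logs) / len(logs)
    # Assumption: population standard deviation of log2(RPKM)
    sd = math.sqrt(sum((x - mean_log) ** 2 for x in logs) / len(logs))
    # (ii) low variance over tissues
    if sd >= sd_cutoff:
        return False
    # (iii) no single tissue far from the mean log-expression
    if any(abs(x - mean_log) >= max_log2_deviation for x in logs):
        return False
    # Optional floor on the geometric mean, used for the calibration short list
    return 2 ** mean_log > min_geometric_mean


def is_housekeeping_exon(rpkm_by_tissue):
    """Criteria from the text: SD < 1 and no deviation of 2 (fourfold) or more."""
    return passes_uniformity(rpkm_by_tissue, sd_cutoff=1.0, max_log2_deviation=2.0)


def is_calibration_exon(rpkm_by_tissue):
    """Table 1 footnote: geometric mean > 50, SD < 0.5, no twofold deviation."""
    return passes_uniformity(rpkm_by_tissue, sd_cutoff=0.5,
                             max_log2_deviation=1.0, min_geometric_mean=50.0)


def is_housekeeping_gene(exons_by_transcript):
    """True if any transcript has more than half of its exons passing.

    exons_by_transcript maps a transcript ID to a list of per-exon RPKM profiles.
    """
    for exon_profiles in exons_by_transcript.values():
        passing = sum(is_housekeeping_exon(p) for p in exon_profiles)
        if exon_profiles and passing > len(exon_profiles) / 2:
            return True
    return False
```

Criterion (iii) is not redundant with (ii). A profile that is flat in 15 tissues and eightfold higher in the sixteenth has a standard deviation below 1 in log2 units, yet it is rejected because of that single tissue.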
Concluding remarks

Current technology enables global measurement of expression levels with unprecedented accuracy. This advancement has revealed that large parts of the genome are normally expressed at a low level. Accordingly, we found that most human exons are expressed at some level in all the human tissues studied. This new technological era calls on the community to reevaluate the concept of a housekeeping gene. Here, we have presented our own perspective, suggesting the use of low expression variation as the main criterion for defining housekeeping genes. We also provide sets of exons and genes that are ubiquitously and uniformly expressed, as well as a short list of genes suitable for experimental calibration.

More high-quality deep-sequencing transcriptome profiling data are expected to emerge in the near future, enabling improvements of the analysis described here using better statistics for the tissues studied and adding more tissue types. Furthermore, including extreme pathological conditions relevant for various tissues could further purify the housekeeping genes list [53]. A significant advance should come from new experiments currently being done on single-cell transcriptome profiling [54]. This could improve the specificity in detecting housekeeping genes, narrowing the list to genes that are expressed in each and every single cell. In addition, accumulation of tissue-specific epigenetic data, such as histone marks and nucleotide methylations, could be used in the future to better distinguish regulated expression from low-level noise.

As discussed above, normalization (within a sample and across samples) is still an unresolved issue. Advancement in this direction could greatly improve housekeeping gene detection. In addition, usage of longer reads is expected to decrease alignment errors and reduce bias.
Longer reads (and improved analysis tools) are expected to raise considerably the sensitivity of expression level measurement at the transcript level, enabling direct evaluation of the housekeeping splice-variants list.

In conclusion, the dramatic advancement of sequencing technologies calls for a reassessment of the notion of housekeeping genes, and allows its resolution to be improved both quantitatively and qualitatively. We thus provide updated lists of housekeeping exons and genes for public use, available at$elieis/HKG/. It is expected that emerging technologies could very soon help to meet the challenges that remain open, allowing for better and more accurate housekeeping gene profiling.

Acknowledgments

We thank Ami Haviv and Gilad Finkelstein for help with read alignment, and Lily Bazak for help with gene length analysis. This work was supported by Israel Science Foundation 379/12 (EE), by the I-CORE Program of the Planning and Budgeting Committee and the Israel Science Foundation (grant No 41/11) and by the Marie Curie Integration Grant 256593 (EYL).

References

1 Fraser, C.M. et al. (1995) The minimal gene complement of Mycoplasma genitalium. Science 270, 397–403
2 Koonin, E.V. (2000) How many genes can make a cell: the minimal-gene-set concept. Annu. Rev. Genomics Hum. Genet. 1, 99–116
3 Thellin, O. et al. (1999) Housekeeping genes as internal standards: use and limits. J. Biotechnol. 75, 291–295
4 Robinson, M.D. and Oshlack, A. (2010) A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 11, R25
5 Dheda, K. et al. (2004) Validation of housekeeping genes for normalizing RNA expression in real-time PCR. Biotechniques 37, 112–114, 116, 118–119
6 Rubie, C. et al. (2005) Housekeeping gene variability in normal and cancerous colorectal, pancreatic, esophageal, gastric and hepatic tissues. Mol. Cell. Probes 19, 101–109
7 Vandesompele, J. et al. (2002) Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes. Genome Biol. 3, RESEARCH0034
8 Eisenberg, E. and Levanon, E.Y. (2003) Human housekeeping genes are compact. Trends Genet. 19, 362–365
9 Vinogradov, A.E. (2004) Compactness of human housekeeping genes: selection for economy or genomic design? Trends Genet. 20, 248–253
10 Carmel, L. and Koonin, E.V. (2009) A universal nonmonotonic relationship between gene compactness and expression levels in multicellular eukaryotes. Genome Biol. Evol. 1, 382–390
11 Castillo-Davis, C.I. et al. (2002) Selection for short introns in highly expressed genes. Nat. Genet. 31, 415–418
12 Eller, C.D. et al. (2007) Repetitive sequence environment distinguishes housekeeping genes. Gene 390, 153–165
13 Versteeg, R. et al. (2003) The human transcriptome map reveals extremes in gene density, intron length, GC content, and repeat pattern for domains of highly and weakly expressed genes. Genome Res. 13, 1998–2004
14 Farré, D. et al. (2007) Housekeeping genes tend to show reduced upstream sequence conservation. Genome Biol. 8, R140
15 Lawson, M.J. and Zhang, L. (2008) Housekeeping and tissue-specific genes differ in simple sequence repeats in the 5′-UTR region. Gene 407, 54–62