
Quan Nguyen at #ICG12: Mammalian genomic regulatory regions predicted by utilizing human genomics, transcriptomics and epigenetics data


Quan Nguyen in the ICG GigaScience Prize Track: Mammalian genomic regulatory regions predicted by utilizing human genomics, transcriptomics and epigenetics data. #ICG12 in Shenzhen, 26th October 2017

Published in: Science

  1. Mammalian DNA regulatory regions predicted by utilizing human genomics, transcriptomics and epigenetics data
     Quan H. Nguyen, Ross L. Tellam, Marina Naval-Sanchez, Laercio R. Porto-Neto, William Barendse, Antonio Reverter, Benjamin Hayes, James Kijas, and Brian P. Dalrymple
     Commonwealth Scientific and Industrial Research Organisation (CSIRO), Livestock Genomics, Brisbane, Australia
     (Carlson et al., 2016, Nat Biotech)
  2. Why are we searching for DNA regulatory regions?
     • There are genome assemblies for many species
     • We know little about which parts of the genome are functional:
       – We know mostly about protein-coding genes (~2% of the genome)
       – Coding genes are mostly similar (in sequence and numbers) between mammalian species
       – The control of gene expression distinguishes species, individuals, and tissues
       – Regulatory DNA sequences are binding sites of transcriptional regulation proteins (e.g. transcription factors)
     • To utilise the genome information, we need to explore beyond protein-coding sequences (i.e. regulatory sequences)
     [Figure labels: Functional Annotation of Animal Genomes (FAANG); Proteins]
  3. How to identify regulatory regions experimentally?
     • Regulatory regions are identified by a combination of:
       1. Epigenetics data: histone modifications (e.g. ChIP-seq H3K26me3, H4K20me1…), DNA methylation (WGBS)
       2. Genomics data: open-chromatin assays (DNase), chromatin interactions (Hi-C)
       3. Transcriptomics data: RNA-seq, CAGE
     *ROADMAP consortium, Nature, 2015
     [Figure labels: Promoter Inactive Enhancer]
  4. Human Atlases and Encyclopedia of Regulatory Databases
     • Lots of human data available for many: 1) cell types, 2) tissues, and 3) assay types
     • Much less data exist for other species
     • We developed a method, HPRS (Human Projection of Regulatory Sequences), to map data in humans to the genomes of other species
     *ROADMAP consortium, Nature, 2015
  5. HPRS predicts three broad categories of regulatory regions
     • Promoters: more conserved, relatively easy to identify, potentially many novel promoters of non-coding genes and alternative transcription start sites
     • Enhancers: less conserved, more tissue/cell type specific
     • Other regulatory sequences: defined by transcription factor binding sites
     Workflow: 1. Map 2. Filter 3. Use
  6. Map: Selecting datasets from humans
     • Datasets for mapping promoters, enhancers and transcription factors are selected so that they represent:
       - Different tissues and cell lines
       - Different biochemical assay types

     Dataset          | Number of regions | Region types | Tissues/cell lines | Data types
     ENCODE           | 108,000           | TF binding   | 0/12               | ChIP-seq
     ROADMAP          | 5,917,129         | Enhancers    | 48/40              | ChIP-seq, DNase I
     FANTOM Enhancers | 43,011            | Enhancers    | 135/673            | CAGE
     ENSEMBL          | 2,427,934         | Enhancers    | 0/18               | ChIP-seq, DNase I
     FANTOM Promoters | 201,802           | Promoters    | 152/823            | CAGE
  7. Map: Maximizing enhancer coverage prediction
     • Mapping is based on inter-species conservation at two levels: 1) primary sequence and 2) genome organization (relative locations between regions)
     • We optimized mapping parameters and mapping strategies (reciprocal map & multimap) to recover most reference enhancers and promoters
     • For example: we found that a lower similarity threshold resulted in higher coverage of the cattle liver reference enhancer dataset*, but not in higher specificity
     *Villar et al., 2015 Cell;160(3):554-66
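The reciprocal-map strategy named on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the HPRS implementation: `forward` and `backward` are hypothetical stand-ins for whatever coordinate-conversion tool performs the human-to-target and target-to-human mapping, and the toy dictionaries replace real alignments.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

# (chromosome, start, end), half-open; a hypothetical minimal representation.
Interval = Tuple[str, int, int]
Mapper = Callable[[Interval], Optional[Interval]]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two intervals share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def reciprocal_map(
    regions: Iterable[Interval], forward: Mapper, backward: Mapper
) -> Dict[Interval, Interval]:
    """Keep human regions whose target-species hit maps back onto the source region."""
    kept: Dict[Interval, Interval] = {}
    for region in regions:
        target = forward(region)
        if target is None:
            continue  # no hit in the target genome
        back = backward(target)
        if back is not None and overlaps(region, back):
            kept[region] = target
    return kept


# Toy lookups standing in for real human -> cattle and cattle -> human conversions.
human_to_cattle = {
    ("chr1", 100, 200): ("chr3", 1000, 1100),
    ("chr1", 500, 600): ("chr3", 5000, 5100),
}
cattle_to_human = {
    ("chr3", 1000, 1100): ("chr1", 110, 190),  # maps back onto the source region
    ("chr3", 5000, 5100): ("chr7", 40, 140),   # maps back elsewhere, so it is dropped
}

human_regions = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 10, 50)]
confirmed = reciprocal_map(human_regions, human_to_cattle.get, cattle_to_human.get)
print(confirmed)
```

Requiring the back-mapped interval to overlap the source region is one possible acceptance rule; a stricter or looser rule would trade coverage against specificity in the way the slide describes for the similarity threshold.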
  8. Filter: Seven steps for filtering mappable regulatory regions
     • Each filter step recovers a dataset with more promoters/enhancers per Mb than the initial baseline (whole genome without HPRS):
       – Filter 1: H3K27Ac is the histone modification mark for enhancers
       – Filter 2: CAGE measures bidirectional promoters as signature for enhancers
       – Filter 3: Enhancer activity scored by SVM (support vector machine)*
       – Filter 4: RNAseq measures active transcription
       – Filter 5: Number of regulatory features mapped to the region
       – Filter 6: Sequence conservation (across 100 vertebrates)
       – Filter 7: Number of predicted transcription factor binding sites
     *Lee et al., Nat Gen, 2015
  9. Filter: Seven steps for filtering mappable regulatory regions
     • Started with 729,246 non-overlapping regions in the cattle genome (the mapping created a Universal dataset)
     • The number of Villar et al.* cattle liver reference promoters (P) and enhancers (E) per 1 Mb length was used to refine filtering parameters
     • Filtered dataset: ~7-fold enrichment in cattle liver reference enhancers and promoters* and ~4-fold reduction in regions
     • Filtered dataset contains 70% of the enhancers and 79% of the promoters in the cattle liver reference set*
     *Villar et al., 2015 Cell;160(3):554-66
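The "reference elements per Mb" measure used to tune the filters can be illustrated with a short sketch. The names, the score cut-off and the toy coordinates below are all hypothetical, and the choice of the unfiltered set as the baseline is an assumption made for the example; the slides do not give the exact calculation.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open


def overlaps(a: Interval, b: Interval) -> bool:
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def density_per_mb(regions: List[Interval], reference: List[Interval]) -> float:
    """Reference elements overlapped by the region set, per Mb of region length."""
    total_bp = sum(end - start for _, start, end in regions)
    hits = sum(any(overlaps(ref, reg) for reg in regions) for ref in reference)
    return hits * 1_000_000 / total_bp


def fold_enrichment(
    filtered: List[Interval], baseline: List[Interval], reference: List[Interval]
) -> float:
    """Ratio of reference-element density after filtering to the baseline density."""
    return density_per_mb(filtered, reference) / density_per_mb(baseline, reference)


# Toy "universal" set: four 1 kb regions, each with a hypothetical filter score.
universal = [("chr1", 0, 1000), ("chr1", 5000, 6000),
             ("chr1", 9000, 10000), ("chr2", 0, 1000)]
scores: Dict[Interval, float] = {
    universal[0]: 0.9, universal[1]: 0.1, universal[2]: 0.8, universal[3]: 0.2,
}
# Toy reference set standing in for experimentally defined enhancers/promoters.
reference = [("chr1", 200, 400), ("chr1", 9500, 9700)]

# One filter step: keep regions whose score passes a hypothetical threshold.
filtered = [region for region in universal if scores[region] >= 0.5]

print(density_per_mb(universal, reference))   # baseline density
print(density_per_mb(filtered, reference))    # density after the filter
print(fold_enrichment(filtered, universal, reference))
```

In this toy case the filter halves the number of regions while keeping both reference elements, so the density doubles; a filter threshold would be judged useful when it raises this density without discarding many reference elements.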
  10. HPRS prediction for 10 species
      • We mapped 42 ROADMAP tissues to 10 species
      • Data from more biologically related tissues produced higher coverage (highest in liver tissue)
      • Data from more evolutionarily related species produced higher coverage (highest in monkey – macaca and marmoset)
      • Combining multiple tissues increased the enhancer coverage markedly, to 65-87%
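Why combining tissues raises coverage can be seen in a small sketch: coverage is taken here as the fraction of reference enhancers overlapped by at least one predicted region, which is an assumed definition, and the tissue names and coordinates are invented for illustration.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open


def overlaps(a: Interval, b: Interval) -> bool:
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def coverage(reference: List[Interval], predicted: List[Interval]) -> float:
    """Fraction of reference regions overlapped by at least one predicted region."""
    covered = sum(any(overlaps(ref, pred) for pred in predicted) for ref in reference)
    return covered / len(reference)


# Toy reference enhancers in the target species.
reference = [("chr1", 100, 200), ("chr1", 400, 500),
             ("chr1", 700, 800), ("chr1", 1000, 1100)]

# Toy predictions projected from two human tissues.
per_tissue: Dict[str, List[Interval]] = {
    "liver": [("chr1", 150, 250), ("chr1", 380, 420)],
    "heart": [("chr1", 90, 110), ("chr1", 750, 760)],
}

for tissue, regions in per_tissue.items():
    print(tissue, coverage(reference, regions))

# Pooling tissues: a reference enhancer counts if any tissue predicts it.
combined = [region for regions in per_tissue.values() for region in regions]
print("combined", coverage(reference, combined))
```

Each toy tissue covers half of the reference set, but because they only share one enhancer the pooled set covers three of four.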
  11. Enrichment of associated SNPs in regulatory regions
      • There are tens of millions of SNPs (Single Nucleotide Polymorphisms) in a genome:
        - Commercial SNP arrays are small (~50,000 SNPs, i.e. 0.5% of the total SNPs)
        - SNPs affecting protein functions: minority ~5%
        - SNPs affecting gene expression (regulatory SNPs): ~95%
      • We tested significant SNPs for 32 traits (feed intake, growth, body composition and reproduction) in 10,191 beef cattle*
      • Substantial fold enrichment of low p-value SNPs in the regulatory set vs. all other sets, including the set of SNPs 5 kb upstream of protein-coding genes
      *Bolormaa et al., 2014, PLOS Genetics; 10 (3) e1004198
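One simple way to express "fold enrichment of low p-value SNPs" is the share of significant SNPs inside the regulatory set divided by the share outside it. The sketch below assumes that definition and a hypothetical p-value cut-off; the slides do not state the exact statistic used, and the SNPs are toy data.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]   # (chromosome, start, end), half-open
Snp = Tuple[str, int, float]      # (chromosome, position, GWAS p-value)


def in_regions(snp: Snp, regions: List[Interval]) -> bool:
    chrom, pos, _ = snp
    return any(c == chrom and start <= pos < end for c, start, end in regions)


def significant_fraction(snps: List[Snp], threshold: float) -> float:
    """Share of SNPs whose p-value falls below the threshold."""
    return sum(p < threshold for _, _, p in snps) / len(snps)


def snp_fold_enrichment(
    snps: List[Snp], regions: List[Interval], threshold: float
) -> float:
    """Significant-SNP share inside the regions divided by the share outside."""
    inside = [s for s in snps if in_regions(s, regions)]
    outside = [s for s in snps if not in_regions(s, regions)]
    return significant_fraction(inside, threshold) / significant_fraction(outside, threshold)


# Toy predicted regulatory regions and toy GWAS results.
regulatory = [("chr1", 100, 200), ("chr1", 1000, 1100)]
snps: List[Snp] = [
    ("chr1", 110, 1e-8), ("chr1", 150, 1e-7), ("chr1", 1010, 0.4), ("chr1", 1050, 0.9),
    ("chr1", 300, 1e-6), ("chr1", 350, 0.2), ("chr1", 500, 0.5), ("chr1", 600, 0.7),
    ("chr2", 10, 0.3), ("chr2", 20, 0.6), ("chr2", 30, 0.8), ("chr2", 40, 0.1),
]

print(snp_fold_enrichment(snps, regulatory, threshold=1e-5))
```

Here two of four SNPs inside the regions pass the cut-off against one of eight outside, giving a four-fold enrichment. A real analysis would also need a significance test and a guard for groups with no significant SNPs.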
  12. HPRS predicted results guide the selection of causative SNPs
      • 13 SNPs at the PLAG1 (pleomorphic adenoma gene-1) region are significantly associated with the cattle stature (lower height) phenotype*
      • We found 2 SNPs within promoters and 1 SNP within an intergenic enhancer
      • The 2 SNPs at the promoter region were validated by Karim et al.* to change promoter activity
      *Karim et al., 2011, Nat. Genetics
      [Figure label: 13 GWAS SNP]
  13. Using the HPRS dataset to understand mechanism - polled
      • Two possible causal mutations of the polled phenotype*:
        – A 212 base insertion/10 base deletion mutation,
        – A ~80 kb duplication, at ~300 kb away
      • We found the deletion mutation is located within a predicted enhancer and a HAND1 transcription factor binding site
      • The HAND1 deletion may lead to downregulation of OLIG1, OLIG2 and lincRNA2 (via distal enhancer interaction)
      *Allais-Bonnet et al., 2013, PLOS ONE 8 e63512
      [Figure labels: Hand1, Celtic deletion, Enhancer, Enhancer targets, OLIG1, OLIG1, lincRNA2]
  14. Summary
      1. The data in humans are useful to predict regulatory sequences in other species (by HPRS mapping and filtering pipelines)
      2. HPRS is a fast and economical approach, applicable when most data in a target species are not available
      3. SNPs significantly associated with phenotypes are enriched in the predicted regulatory sequences (more enriched than traditional SNP selection based on known coding regions)
      4. HPRS results can contribute to genomics technology development, for instance: to design a new generation causative SNP chip for large-scale genotyping, or to predict regulatory targets as candidates for genome editing
  15. Acknowledgements
      • CSIRO:
        - Brian P. Dalrymple
        - Juca Porto-Neto
        - Ross L. Tellam
        - James Kijas
        - Bill Barendse
        - Marina Naval-Sanchez
        - Antonio Reverter
      • QAAFI: Ben Hayes
      • Funding: CSIRO OCE fellowship
      St Lucia Campus, Brisbane, Australia
  16. A machine learning tool to predict SNP effects in an enhancer
      • gkmSVM (gapped k-mer Support Vector Machine) scores regulatory activity by comparing enhancer with non-enhancer regions
      • deltaSVM scores were calculated for every base across the cattle ALDOB enhancer (projected from human)
      • deltaSVM scores were markedly reduced at locations overlapping transcription factor binding sites (indicating loss of binding if a mutation occurs)
      *Lee et al., Nat Gen, 2015
      [Figure note: red arrows show SNPs]
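The per-base deltaSVM scan on this slide can be sketched in a few lines. In Lee et al., deltaSVM is the summed change in gkm-SVM k-mer weight, alternate minus reference allele, over the k-mers that overlap the variant. Everything concrete below is an assumption for illustration: the weight table is a toy, k is set to 3 for readability, and reverse-complement handling is omitted. Real weights come from a model trained on enhancer versus non-enhancer sequences.

```python
from typing import Dict, List


def kmers_covering(seq: str, pos: int, k: int) -> List[str]:
    """All k-mers of seq that include position pos (0-based)."""
    first = max(0, pos - k + 1)
    last = min(pos, len(seq) - k)
    return [seq[start:start + k] for start in range(first, last + 1)]


def delta_svm(seq: str, pos: int, alt: str, weights: Dict[str, float], k: int) -> float:
    """Summed k-mer weight of the alternate allele minus that of the reference allele."""
    mutated = seq[:pos] + alt + seq[pos + 1:]
    ref_score = sum(weights.get(kmer, 0.0) for kmer in kmers_covering(seq, pos, k))
    alt_score = sum(weights.get(kmer, 0.0) for kmer in kmers_covering(mutated, pos, k))
    return alt_score - ref_score


def scan(seq: str, weights: Dict[str, float], k: int) -> List[float]:
    """Most negative deltaSVM over the three possible substitutions at every base."""
    return [
        min(delta_svm(seq, i, base, weights, k) for base in "ACGT" if base != seq[i])
        for i in range(len(seq))
    ]


# Toy weights: positive values mimic k-mers that support enhancer activity.
toy_weights = {"ACG": 1.0, "CGT": 0.5, "GTA": 0.25, "ACT": -1.0}
sequence = "ACGTAC"

print(delta_svm(sequence, 2, "T", toy_weights, k=3))
print(scan(sequence, toy_weights, k=3))
```

Positions where the scan dips strongly are those where a substitution destroys highly weighted k-mers, which is the pattern the slide reports at transcription factor binding sites.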