Your SlideShare is downloading. ×
Jonathan Keith - Exploring the structure of whole-genome conservation profiles using bayesian segmentation
Upcoming SlideShare
Loading in...5

Thanks for flagging this SlideShare!

Oops! An error has occurred.

Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Jonathan Keith - Exploring the structure of whole-genome conservation profiles using bayesian segmentation


Published on

Conservation is a key indicator of function in genomes, and can potentially be used to discover novel functional non‐protein‐coding RNAs and regulatory sequences. However, recent investigations have …

Conservation is a key indicator of function in genomes, and can potentially be used to discover novel functional non‐protein‐coding RNAs and regulatory sequences. However, recent investigations have demonstrated that a simple dichotomy between conserved and non‐conserved sequence is too naïve a distinction to reflect the full complexity of the numerous types of structural and functional constraints acting on genomes. This presentation will discuss recent investigations into the detailed structure of whole‐genome conservation profiles, using Bayesian segmentation techniques to identify multiple classes of conservation level. By integrating information about conservation with profiles of other properties indicative of function, including GC content and transition/
transversion ratios, a much finer level of structure can be detected. The method has been applied to a range of species including Drosophila, zebrafish, malaria and bacterial genomes, and results from each of these will be presented. One key implication of these results is that the proportion of functionally constrained sequence in
eukaryotic genomes may be very much larger than previously supposed. Another key implication is that genomic sequences may be subject to ephemeral functional constraints that act on too short a time scale to be detected in most comparative genomic studies. The functional content of various classes of conserved sequence will also be discussed.

First presented at the 2014 Winter School in Mathematical and Computational Biology

Published in: Science

  • Be the first to comment

  • Be the first to like this

No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

No notes for slide


  • 1. Exploring the Structure of Whole-Genome Conservation Profiles using Bayesian Segmentation Jonathan M Keith Mathematical & Computational Biology Winter School Brisbane July 8, 2014
  • 2. 2 Outline •  Introduction –  DNA/RNA/protein/ncRNAs –  Genome segmentation –  Incorporating multiple data types into segmentation –  Generalised Gibbs Sampler •  Applications –  The proportion of functional sequence in genomes –  Investigating alternative splicing –  Complexity of Drosophila 3’ utrs –  Non-coding RNAs in zebrafish –  Non-coding RNAs in Wolbachia –  Regions contributing to malaria pathogenicity and host specificity –  Transcription factor binding sites in zebrafish
  • 3. 3 DNA
  • 4. 4 The Central Dogma of Molecular Biology
  • 5. 5 Gene structure
  • 6. Non-coding RNA 6
  • 7. 7 Non-protein-coding RNA
  • 8. Conservation 8
  • 9. 9 Bayesian Genome Segmentation •  Input: sequence of characters from a finite alphabet •  Output: – Segmentations – Classifications
  • 10. 10 Segmenting what? •  GC content – binary 1=GC, 0=AT •  Pairwise alignment – binary 1=match, 0=mismatch •  Multiple sequence alignment –  Column-wise counts of most frequent character –  Column-wise Parsimony score given phylogeny •  Pairwise conservation + GC content •  Pairwise content + GC content + indel frequency
  • 11. Segmenting an alignment 11 •  Alignment encoded by a 32-character alphabet Human: GCCGA-- Mouse: GTC-A-- Zf : ATTAATG S : xZxIaJJ Species 1 A A A A A A A A A A A A A A A A C C C C C C C C C C C C C C C C Species 2 A A A A C C C C G G G G T T T T A A A A C C C C G G G G T T T T Species 3 A C G T A C G T A C G T A C G T A C G T A C G T A C G T A C G T Encoding a b c d e f g h i j k l m n o p q r s t u v w x y z U V W X Y Z •  Example
  • 12. 12 A Bayesian model Binary sequences encoding matches and mismatches are assumed to have been generated by a process involving the following parameters: • S – binary sequence • k – number of change-points • c = (c1,…,ck) – vector of change-points • θ = (θ0,…, θk) – vector of conservation levels for each segment • φ – probability that any given sequence position is a change-point • g = (g0,…, gk) – vector containing, for each segment, the number of the group to which it belongs • α – parameters of the beta distributions for each group • π – proportion of segments in each group The algorithm generates segmentations (k,c) and group parameters (π, α) and uses them to generate profiles for each group.
  • 13. 13 Model parameters •  Bayesian hierarchical model includes the parameters shown •  Integrate over φ and θ, sum over g, sample k, c, π and α
  • 14. 14
  • 15. 15
  • 16. Example: Subset-based sampling 16 € 1 3 € 1 3 € 1 3
  • 17. Generalized Gibbs sampler 17 € 1 3 € 2 3 € 2 3 € 1 3 € 1 3 € 1 3 € 1 3
  • 18. 18 Generalized Gibbs sampler •  X – target set •  I – index set (move types) •  U ⊆ I × X •  Θ(x) – elements of the form (i,x) •  Qx – transition matrix on Θ(x) •  qx – stationary distribution for Qx •  ρ(i,x) – elements that can be reached from x by move type i I X U (i,x) ρ(i,x) Θ(x)
  • 19. 19 Generalized Gibbs Sampler (discrete space) Starting with an arbitrary U0, perform the following steps iteratively: 1.  [Q-step] Given Un = (i,x), generate (j,x) ∈ Θ(x) using Qx((i,x), . ). 2.  [R-step] Given (j,x), generate W ∈ ρ(j,x) using R((j,x), . ). 3.  Let Un+1 = W. € R((i,x),( j,y)) = p(y)qy ( j) p(z)qz (k) (k,z)∈ρ(i,x) ∑
  • 20. 20 For further details…
  • 21. 21 Order of updates
  • 22. Model Selection 22
  • 23. 23 Model selection •  AIC approximation •  BIC approximation •  DICV Lk ln22 − Lngaskn ln2))1((ln −++
  • 24. Genome-wide conservation levels 24 Oldmeadow et al. (2010) Mol. Biol. Evol. 27(4): 942-953
  • 25. Investigating alternative splicing 25 Boyd et al. (2014) PLoS ONE 7(3):e33565
  • 26. Complexity of Drosophila 3’ UTRs 26 Algama et al. (2014) PLoS ONE 9(5):e97336
  • 27.           Objec've:       Method  development  in  finding  puta2ve  func2onal  regions  in  13   muscle  genes             Muscle Development Genes Project EYA1 SIX1 EYA4 SIX4 MYF5 WNT1 MYF6 WNT7a MYOD1 PAX3 MYOG PAX7 SHH
  • 28. Generating Input Sequence Human   A A A A C C C C G G G G T T T T - N Mouse   A C G T A C G T A C G T A C G T N -  code   a b c d e f g h i j k l m n o p skip I 1.  Pairwise  Alignment   2.  Multiple  Alignment   •  In 3-way alignment, columns with complementary bases are encoded using same letters •  Human indels are skipped •  ZF/Mouse indels are encoded with ‘I’
  • 29. Results: EYA4 EYA4:  311,523  bp,  and  20  exons   Model Selection Conservation levels propor2on  of     characters    a,  f,  k,  p     in    segments  
  • 30. Identification of conserved non coding sequences q  Identifying putative functional elements (PFEs) in 13 muscle genes 30 Application of changept on eya1, 3-way alignment (human, mouse, zebrafish) 50% 65% 45% Conservation
  • 31. Results – 3way 31 Gene # PFEs # PFEs matched with EvoFold # PFEs matched with DNAse-footprints # PFEs matched with fRNAdb nc transcripts EYA1 6 5 6 1 (Transcript : 3679 , PFE:169) EYA4 2 1 SHH 4 1 1 PAX3 9 6 4 2 (Transcript : 1521 , PFE:126, 127) PAX7 6 4 3 MYF5 1 1 SIX4 1 1 TOTAL 29 17 16 3
  • 32. PCR Results Expression was determined from pooled 24hpf zebrafish cDNA Muscle Genes Project: Lab Results
  • 33. Bacterial Genome Project: wMel & wPip Model  selec'on   Aligned  using  Mauve  alignment  program  which    takes  genomic  re-­‐arrangements  in  to  account  AIC DICV Conserva'on     Levels  
  • 34. wMel & wPip: Results Reference  species:  wMel:  1.27  mil  bp  &  1309  genes   11  tRNAs  and  19  non-­‐coding  regions  were  iden2fied  with  thresholds:      1.  Conserva2on  >  0.95    2.  Profile  value  >  0.75    3.  Segment  length  >  50bp     WIG  profile  of  the  most  conserved  group  
  • 35. 35 q  Identified 17 tRNAs, 2 rRNAs, 2 ncRNAs, 2 pseudogenes and 19 intergenic regions with no previous annotations q  ‘Discovery of small non-coding RNAs from the obligate intracellular bacterium Wolbachia pipientis’ , (target journal: Parasites and Vectors) q  Identifying putative ncRNAs in two bacteria genomes wMel and wPip tRNAncRNA Identification of conserved non coding sequences
  • 36. wMel & wPip: Results •  “changept”  mainly  used  to  search  for  6  intergenic  regions  in   wMel  genome,  iden2fied  as  being  transcribed  using  RACE  (Rapid   amplifica2on  of  cDNA  ends)  method     •  Currently  conduc2ng  lab  experiments  
  • 37. Genomic regions contributing to malaria pathogenicity and host specificity •  The shared evolutionary history that has honed the ability of malaria parasites to use haemoglobin as a fundamental resource, coupled with complex and divergent vertebrate immune systems that work to preclude access to the resource, suggests that malarial genomes are a mosaic of conserved and divergent regions that reflect this tug-of-war. •  Portions of the genome that are directly affected or recognized by the hosts’ immune systems or are involved in infecting a host cell are likely to be divergent across parasites that specialise on different host species. On the other hand, portions that are fundamental for the parasites ability to use haemoglobin as a primary resource are likely to be conserved. •  We apply a Bayesian segmentation model to a three-way whole-genome alignment of Plasmodium falciparum (human malaria), P. reichnowi (chimpanzee malaria), and P. gallinaciaum (chicken malaria). •  Seeking novel regions relevant to drug targets. 37
  • 38. Identifying TFBS •  Regions upstream of the zebrafish muscle development genes mentioned earlier identified as conserved contain regulatory elements •  Currently attempting to use the technique to identify TFBS genome-wide 38
  • 39. 39 Acknowledgments •  Queensland University of Technology –  Kerrie Mengersen –  Chris Oldmeadow •  University of Queensland –  Peter Adams –  Dirk Kroese –  Darryn Bryant –  Benjamin Goursaud –  Rachel Crehange •  Institute for Molecular Biosciences –  John Mattick –  Stuart Stephen •  Monash University –  Manjula Algama –  Edward Tasker –  Robert Bryson-Richardson –  Caitlin Johnston –  Adam Parslow –  Beth McGraw –  Jean Popovici –  Meg Woolfit –  Anders Goncalves da Silva •  Australian Research Council Research Grants DP0879308, DP1095849 •  Email: