Nadia Davidson - Introduction to RNA-Seq


The central dogma of molecular biology is that the genome, composed of DNA, encodes many thousands of genes that can be transcribed into RNA. The RNA may then be translated into a chain of amino acids, giving a functional protein.
While the genome of an individual is essentially identical in every cell of their body, the number of transcribed copies of each gene, as RNA, differs according to the functional requirements of each tissue type. An important area of research within genetics is therefore to study the genome in action, through RNA: for example, by comparing the quantities of each gene's RNA between different tissue types, through development, in disease or in different environments. This comparison is known as differential gene expression analysis.
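
As a rough illustration of comparing RNA quantities between two tissue types, the sketch below scales hypothetical read counts to counts per million (CPM) and reports a log2 fold change per gene. The gene names, counts and pseudocount are assumptions made for the example. It is not a statistical test, and it ignores the biological variability that a real differential expression analysis must model.

```python
import math


def counts_per_million(counts):
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}


def log2_fold_change(counts_a, counts_b, pseudocount=1.0):
    """Return log2(CPM_b / CPM_a) per gene.

    The pseudocount (an arbitrary choice here) avoids division by zero.
    Both samples are assumed to contain the same set of genes.
    """
    cpm_a = counts_per_million(counts_a)
    cpm_b = counts_per_million(counts_b)
    return {
        gene: math.log2((cpm_b[gene] + pseudocount) / (cpm_a[gene] + pseudocount))
        for gene in counts_a
    }


# Hypothetical read counts for three genes in two tissue types.
tissue_a = {"geneX": 500, "geneY": 300, "geneZ": 200}
tissue_b = {"geneX": 250, "geneY": 300, "geneZ": 450}

lfc = log2_fold_change(tissue_a, tissue_b)
for gene, value in sorted(lfc.items()):
    print(f"{gene}: log2 fold change = {value:+.2f}")
```

A negative value means the gene makes up a smaller share of the RNA in tissue B than in tissue A; a value near zero means no change in relative abundance.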

RNA-Seq, or high-throughput RNA sequencing, has accelerated research in this area. The technology works by reverse transcribing the RNA back into DNA, shearing it into smaller fragments, then reading each fragment's sequence in parallel to give millions of short "reads", each approximately 50-200 bases in length. This data brings a computational and statistical challenge because the biology must be inferred from millions of short sequences. In addition to technical biases, there is true biological variability between samples of the same type, which must be accounted for.
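
To make the fragment-then-read step concrete, here is a toy simulation under simplifying assumptions: a single random sequence stands in for one reverse-transcribed molecule, fragment lengths are drawn from a normal distribution, and a read is simply the first bases of a fragment with no sequencing errors. The function names and parameter values are hypothetical and do not model any particular sequencing platform.

```python
import random


def shear(sequence, mean_fragment_len, rng):
    """Cut a sequence into consecutive fragments of random length."""
    fragments = []
    start = 0
    while start < len(sequence):
        # Lengths vary around the mean (20% standard deviation, an
        # arbitrary choice); every fragment is at least one base long.
        length = max(1, int(rng.gauss(mean_fragment_len, 0.2 * mean_fragment_len)))
        fragments.append(sequence[start:start + length])
        start += length
    return fragments


def sequence_reads(fragments, read_len):
    """Read up to read_len bases from the start of each fragment."""
    return [fragment[:read_len] for fragment in fragments]


rng = random.Random(0)
# Hypothetical cDNA: 2000 random bases.
cdna = "".join(rng.choice("ACGT") for _ in range(2000))

fragments = shear(cdna, mean_fragment_len=300, rng=rng)
reads = sequence_reads(fragments, read_len=100)
print(f"{len(fragments)} fragments, {len(reads)} reads of at most 100 bases")
```

Each read covers only part of its fragment, which is why the original transcript sequence has to be inferred computationally from many overlapping reads across many copies of the molecule.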

In this talk I discuss the applications of RNA‐Seq, its challenges and some of the bioinformatics strategies being
employed to analyse this complex data. In particular, I will focus on the steps involved in differential gene
expression analysis, for both model organisms, like human, and more exotic organisms, without a sequenced

First presented at the 2014 Winter School in Mathematical and Computational Biology

Published in: Science


  1. Nadia Davidson, Murdoch Childrens Research Institute. Introduction to RNA-Seq. Winter School in Mathematical and Computational Biology 2014
  2. The central dogma of molecular biology (image from Wikipedia)
  3. Alternative splicing (diagram labels: DNA, RNA)
  4. Transcriptional abundance (diagram labels: DNA, 2 copies; RNA, multiple copies, different "splice" variants)
  5. Transcriptional abundance (diagram labels: RNA – cell type A; RNA – cell type B). Different quantities, different "splice" variants
  6. Diagram labels: A / G, "Which copy is expressed more?" (DNA); "Base change after transcription" (DNA, RNA); "Structural rearrangement in the genome fuses Gene A to Gene B" (DNA, RNA)
     Benefits and opportunities of RNA-seq
     • Differential expression
       – Comparing the expression between different samples
     • Whole transcriptome sequencing
       – Annotation of new exons, transcribed regions, genes or non-coding RNAs
       – The ability to look at alternative splicing
       – Allele specific expression
       – RNA editing
       – Fusion genes in cancer
       – Etc.
  8. RNA-Seq data analysis
     • Whole transcriptome sequencing:
       – What were the original full length transcript sequences?
       – This talk
     • Differential expression:
       – Do we have more blue transcripts in one cell type than another?
       – Next talk
  9. What were the original full length transcript sequences… if we have a reference genome?
  10. The reference annotation
      • Model organisms have a reference annotation
      • E.g. ENSEMBL, RefSeq, UCSC, GENCODE all provide the position of known genes in the reference genome
      • Often, we assume these are the full set of transcripts of a gene
      • But how do we know which gene a read came from?
      [UCSC screen shot: hg19, chrX (q13.2), 72,800,000-72,900,000, 50 kb scale; track "Ensembl Gene Predictions - Ensembl 75" showing ENST00000602584, ENST00000438453, ENST00000421245, ENST00000373504, ENST00000373502, ENST00000498407, ENST00000498318]
  11. Mapping reads to the genome
      • Some reads can be mapped wholly to the genome (grey)
      • Other reads need to be 'split' across splice sites (blue)
      • Software: TopHat, STAR, Subread
      Figure: Cole Trapnell & Steven L Salzberg, Nature Biotechnology 27, 455-457 (2009)
  12. What were the original full length transcript sequences… if we have a reference genome but want to find something novel?
  13. Genome guided assembly
      • Map reads
      • Graph splicing events
      • Traverse the graph
      • Gene function? e.g. BLAST against the protein database or a related species (Blast2GO)
      • Software: Cufflinks, Scripture
      Figure: Jeffrey A. Martin & Zhong Wang, Nature Reviews Genetics 12, 671-682 (October 2011)
  14. What were the original full length transcript sequences… if we don't have a reference genome?
  15. De novo transcriptome assembly
      • Like genome assembly
      • But also needs to deal with:
        – Splicing
        – Non-uniform coverage
      • Software: Trinity, Oases, Trans-ABySS
      • Challenges:
        – Accuracy
        – Computational requirements
        – Lots of transcripts. Need to filter and cluster transcripts into genes (e.g. with Corset, CD-HIT-EST, assembler information etc.)
      [Figure from Francis et al., BMC Genomics 2013; axes: Reads (Millions), Number of transcripts, Mean transcript length (bp)]
  16. What were the original full length transcript sequences… if we have a reference genome but it's not very good?
  17. More common than you may think
      – Non-model organisms:
        • A badly assembled genome
        • No reference genome, but one of a related species
      – Model organisms:
        • Cancer
        • Poorly assembled regions in an otherwise good reference genome
      – No standard approach
  18. Example - Annotating the chicken W sex chromosome
      Chicken is a model organism, but the sequenced reference W chromosome is poorly assembled, with missing sequence.
      Motivation: The mechanism for sex determination in birds has not been proven. Are there any novel W genes which could be involved?
      Source: http://-ed/mendel-gifs/13-sex-chromosomes.JPG
  19. Experiment and analysis
      • Extracted and sequenced mRNA from the gonads of 4 female and 4 male embryonic chickens
      • 1.4 billion 100bp paired-end reads
      • Re-assembled the reference annotation sequences (Ensembl), with a genome guided assembly (Cufflinks) and a de novo assembly (ABySS)
      • Identified W genes as those with female specific expression
      • Discovered 2 novel W genes and, for 1/3 of the known W gene sequences that were previously incomplete, found the full length sequences
      • Some W candidates were followed up in the lab for sex determination studies
  20. An example of one W gene (Ayers et al, 2013)
      [Figure: tracks for Reference annotation, Genome guided assembly, De novo assembly and Coverage (Blastoderm, Gonads; 0 to 194) against base position in the transcript (0 to 2500). Legend: on the W chromosome in the reference chicken genome; on "Unknown" contigs in the reference chicken genome; on an autosome in the reference chicken genome]
      Take home message: All approaches have their strengths and limitations
  21. Summary
      • RNA-seq is very powerful!
        – It allows both the transcript sequence and the relative quantities to be measured.
        – It has numerous applications:
          • It complements DNA sequencing by telling us how the genome is actually used in a particular cell type.
          • In some cases (e.g. non-model organisms) it can circumvent the need for DNA sequencing.
        – There are standard pipelines for some applications, but many require a problem specific solution. Challenging but fun!
  22. Acknowledgements
      MCRI Bioinformatics: The (Alicia) Oshlack Lab
      Chicken W genes: MCRI Comparative Development, Craig Smith, Katie Ayers
      Feel free to email me with questions:
  23. More information
      • General:
        – Wang et al., RNA-Seq: a revolutionary tool for transcriptomics, Nature Reviews Genetics, 2009
      • Differential Expression Pipelines and Reviews:
        – Oshlack et al., From RNA-seq reads to differential expression results, Genome Biology, 2010
        – Anders et al., Count-based differential expression analysis of RNA sequencing data using R and Bioconductor, Nature Protocols, 2013
        – http://
      • Assembly Pipelines and Reviews:
        – Martin & Wang, Next-generation transcriptome assembly, Nature Reviews Genetics, 2011
        – https://-project/wiki/Example
        – Haas et al., De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis, Nature Protocols, 2013
      • The human transcriptome (ENCODE):
        – Djebali et al., Landscape of transcription in human cells, Nature, 2012
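
The genome guided assembly steps on slide 13 (map reads, graph splicing events, traverse the graph) can be illustrated with a toy splice graph. The sketch below is only an illustration under stated assumptions, not how Cufflinks or Scripture work internally: the exon names and junctions are hypothetical, junctions are assumed to have already been extracted from split reads, and every path through the graph is reported as a candidate transcript, whereas real assemblers also use coverage to choose among paths.

```python
from collections import defaultdict


def build_splice_graph(junctions):
    """Build an exon -> downstream exons map from (donor, acceptor) pairs.

    Each pair stands for a splice junction supported by at least one
    split read. Duplicate junctions are stored once.
    """
    graph = defaultdict(list)
    for donor, acceptor in junctions:
        if acceptor not in graph[donor]:
            graph[donor].append(acceptor)
    return graph


def enumerate_transcripts(graph, first_exon, last_exon):
    """List every exon path from first_exon to last_exon (depth-first).

    Assumes the graph is acyclic, which holds when exons are only
    joined in genomic order.
    """
    paths = []
    stack = [[first_exon]]
    while stack:
        path = stack.pop()
        if path[-1] == last_exon:
            paths.append(path)
            continue
        for next_exon in graph.get(path[-1], []):
            stack.append(path + [next_exon])
    return sorted(paths)


# Hypothetical junctions: exon2 is sometimes skipped (alternative splicing).
junctions = [
    ("exon1", "exon2"),
    ("exon2", "exon3"),
    ("exon1", "exon3"),
    ("exon1", "exon2"),  # a second read supporting the same junction
]

graph = build_splice_graph(junctions)
transcripts = enumerate_transcripts(graph, "exon1", "exon3")
for path in transcripts:
    print(" - ".join(path))
```

With these junctions the traversal yields two candidate transcripts, one including exon2 and one skipping it.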