Exact duplicate reads in RNA-seq
data and the effect of their removal
on differential expression results
Katie Kyle
Women in Math, Science, and Engineering Research Symposium 2015
Big Data
RNA-seq data
 High-Throughput Next Generation
Sequencing
 New this century!
 1973: 24 bp (an entire research paper)
 1986: 12,000 bp/day
 Current: ~2 billion bp/day
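As a rough check on the scale of these jumps, the figures on this slide can be compared directly. The sketch below is only illustrative: the variable names are made up here, and the 1973 figure is a total (24 bp for a whole paper), not a daily rate, so the first ratio is indicative at best.

```python
# Figures taken from the slide; variable names are illustrative only.
bp_1973 = 24                         # total bases in the 1973 paper (not a daily rate)
bp_per_day_1986 = 12_000
bp_per_day_current = 2_000_000_000   # "~2 billion bp/day"

fold_1973_to_1986 = bp_per_day_1986 / bp_1973
fold_1986_to_current = bp_per_day_current / bp_per_day_1986

print(f"1973 -> 1986:    {fold_1973_to_1986:,.0f}x")
print(f"1986 -> current: {fold_1986_to_current:,.0f}x")
```

The second ratio comes out at the same order of magnitude as the "around 150,000x" figure quoted in the speaker's notes.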
These advancements are great! But . . .
 Sample preparation and amplification
processes introduce various forms of
error
 PCR Duplicates
 Artificial duplicates vs. natural
duplicates
 Why is PCR necessary?
Why does this matter?
 Differential gene expression
 Every cell contains the
complete genome in its
nucleus
 How can cells be different?
 RNA produced by a cell is
unique
 DE analysis
 Depends on ACCURATE
fragment counts
 Gene vs. transcript analysis
 Only ~ 1.5% of genome
codes for proteins
My Experiment
 Compare two sets of differential expression analysis results
 1) exact duplicates are removed
 2) exact duplicates are kept
Remove Keep
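The "remove" arm of the experiment can be pictured with a minimal Python sketch. The slides do not say which tool or which definition of "exact duplicate" was used, so this assumes the simplest case: two reads are exact duplicates when their sequences are identical, and the first occurrence is kept. The function name and the toy reads are hypothetical.

```python
def remove_exact_duplicates(reads):
    """Keep the first occurrence of each distinct sequence, preserving order.

    Assumption: "exact duplicate" means an identical sequence string.
    Real pipelines may compare read pairs or mapping positions instead.
    """
    seen = set()
    unique = []
    for seq in reads:
        if seq not in seen:
            seen.add(seq)
            unique.append(seq)
    return unique


# Hypothetical toy reads, not data from the experiment.
reads = ["ACGT", "ACGT", "TTGA", "ACGT", "GGCC", "TTGA"]
kept = remove_exact_duplicates(reads)
removed = len(reads) - len(kept)

print(kept)
print(f"removed {removed} of {len(reads)} reads")
```

The "keep" arm simply skips this step and passes the raw reads on to the same differential expression tests.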
Data
 Two experimental groups: “novel” and “familiar”
 Group 1: female guppies exposed to males with
novel color patterns
 Group 2: female guppies exposed to males with
familiar color patterns
 DE analysis of genes expressed in the female
brain, comparing the two groups
Poecilia reticulata
(guppy)
Analysis pipeline:
Genes Sequenced (Familiar and Novel Groups)
 Data w/o Duplicates (Duplicates Removed) -> Differential Expression Tests
 Data w/ Duplicates (Raw Data) -> Differential Expression Tests
Results:
                       Raw Data       Duplicates Removed
Total fragment count   261,668,588    229,953,141
 # of exact duplicates: 31,715,447
 Percent of total: 12.12%
Sequence Composition (chart): Duplicate vs. Unique
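The duplicate count and percentage on this slide follow from the two fragment totals. A short Python check, with variable names chosen here for illustration:

```python
# Totals as reported on the Results slide.
raw_fragments = 261_668_588
dedup_fragments = 229_953_141

exact_duplicates = raw_fragments - dedup_fragments
percent_of_total = 100 * exact_duplicates / raw_fragments

print(f"exact duplicates: {exact_duplicates:,}")
print(f"percent of total: {percent_of_total:.2f}%")
```

Both values agree with the slide: 31,715,447 exact duplicates, or 12.12% of the raw fragments.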
 Duplicates Removed
 Raw Data
Rank  Gene                                           P-value
1     vitellogenin C                                 5.44E-07
2     vitellogenin A                                 1.46E-05
3     complement C4-like                             4.81E-05
4     uridine phosphorylase 2-like                   7.63E-05
5     Rhodopsin                                      1.83E-04
6     hypothetical protein LOC100537292              2.02E-04
7     endonuclease domain-containing 1 protein-like  2.80E-04

Rank  Gene                                           P-value
1     vitellogenin C                                 9.47E-07
2     vitellogenin A                                 6.63E-05
3     endonuclease domain-containing 1 protein-like  6.88E-05
4     complement C4-like                             7.87E-05
5     uridine phosphorylase 2-like                   9.06E-05
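One way to compare the two gene lists side by side is sketched below in Python. The extracted slide text does not make clear which list belongs to which dataset, so the dictionaries are named only by their order on the slide (first_table, second_table). Those names are placeholders, not the author's labels.

```python
# P-values copied from the two lists on the slide, in slide order.
first_table = {
    "vitellogenin C": 5.44e-07,
    "vitellogenin A": 1.46e-05,
    "complement C4-like": 4.81e-05,
    "uridine phosphorylase 2-like": 7.63e-05,
    "Rhodopsin": 1.83e-04,
    "hypothetical protein LOC100537292": 2.02e-04,
    "endonuclease domain-containing 1 protein-like": 2.80e-04,
}
second_table = {
    "vitellogenin C": 9.47e-07,
    "vitellogenin A": 6.63e-05,
    "endonuclease domain-containing 1 protein-like": 6.88e-05,
    "complement C4-like": 7.87e-05,
    "uridine phosphorylase 2-like": 9.06e-05,
}

# Genes that appear in both lists, and genes unique to the first list.
shared = [gene for gene in first_table if gene in second_table]
only_first = [gene for gene in first_table if gene not in second_table]

# Ratio of the second list's p-value to the first list's, per shared gene.
ratios = {gene: second_table[gene] / first_table[gene] for gene in shared}

for gene, ratio in ratios.items():
    print(f"{gene}: second/first p-value ratio = {ratio:.2f}")
print("only in first list:", only_first)
```

Every gene in the second list also appears in the first. The per-gene ratios show how much each shared p-value differs between the two analyses.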
Conclusions
 Quantitative but not qualitative change in results (at least for
this experiment)
 Results suggest that the exact duplicates ARE mainly PCR
artifacts because the p-values change proportionately for all
genes
 Future directions…
 Development of software platforms that eliminate the PCR step
from sequence preparation
 Single-molecule sequencing
Acknowledgements
 Dr. Kimberly Hughes
 Dr. Susan Blessing
 Ilana Janowitz
 Ashley Fryer
References
 Haddad, Fadia, Anqi X. Qin, Julie M. Giger, Hongyan Guo, and
Kenneth M. Baldwin. "Potential Pitfalls in the Accuracy of
Analysis of Natural Sense-antisense RNA Pairs by Reverse
Transcription-PCR." BMC Biotechnology 7.1 (2007): 21. Web.
 Soon, Wendy W., Manoj Hariharan, and Michael Snyder.
"High-throughput Sequencing for Biology and Medicine."
Molecular Systems Biology 9.640 (2013): n. pag. Print.


Editor's Notes

  • #3 EXCITING! KNOW MORE THAN EVER BEFORE! BUT… SPEED AT WHICH WE CAN GET DATA IS FASTER THAN SPEED AT WHICH WE LEARN TO PROCESS THE DATA. Recently in biology we have been getting into the realm of Big Data. This means that we are starting to deal with massive amounts of data and having to learn how to process, interpret, store, and transfer it. While this new information and its abundance are exciting and important, the problem is that we are gathering this data faster than we are learning how to handle it. Before this era of big data we could keep up: we had a steady flow of data that we could process and learn to interpret without getting behind, like drinking from a water fountain. But now we have huge amounts of data coming at us at high speed… it’s like trying to drink from a fire hydrant. https://www.google.com/url?sa=i&rct=j&q=&esrc=s&source=images&cd=&cad=rja&uact=8&ved=0CAYQjB0&url=https%3A%2F%2Fwww.vichealth.vic.gov.au%2Four-work%2Fpromoting-healthy-eating&ei=PkcjVc7TF4OyggT6moOICA&bvm=bv.89947451,d.b2w&psig=AFQjCNHfuc4X8Hvo146w__pWfRhDymIL0A&ust=1428461598027704 https://www.google.com/url?sa=i&rct=j&q=&esrc=s&source=images&cd=&cad=rja&uact=8&ved=0CAYQjB0&url=http%3A%2F%2Fanimaluntamed.yuku.com%2Ftopic%2F357%2FExcellent-Wild-Photography%3Fpage%3D6&ei=ekgjVdbWMMqXNtrvgKgJ&bvm=bv.89947451,d.b2w&psig=AFQjCNHniuO2F-CFX8wTCn065Ia05AJ35w&ust=1428461991885843
  • #4 We can go into a lab and sequence all of the DNA in your body in just a couple of days… When the Human Genome Project first started, it took 13 years to sequence the entire human genome for the first time. In just a couple of decades we have increased our sequencing speed around 150,000x! So here lies the source of our big data problem and our HUGE amounts of data. When I talk about big data in biology I am mainly referring to the large amounts of sequence data that we are collecting. One type of sequencing, and one of the sources of all this data, is called RNA-seq. Some of you may be more familiar with the term DNA than with RNA, and I’ll get more into this later, but DNA is a type of nucleic acid, deoxyriboNUCLEIC ACID, and RNA is another type of nucleic acid found in cells. It is just ribonucleic acid. DNA and RNA behave very similarly, so you can sequence both of them. So RNA-seq is sequencing the RNA of an organism, or tissue, or cell, whatever you’re interested in. And within RNA-seq and other sequencing technologies we now have what is called high-throughput next-generation sequencing (HT-NGS), a very new technology. The new tech is ‘massively parallel’, meaning you can sequence hundreds to thousands of DNA fragments at the same time, and then use computers to collect all of this data and turn it into meaningful sequences. And this type of technology is evolving very fast, which is the main cause of our big data problem. In the 70s, sequencing 24 bp of the lac operon was an entire research paper, and at best in the 70s we were sequencing a couple hundred bases per day. Later on, in the 80s, with the automation of the Sanger method, we were able to sequence about 12,000 bp per day; compared with that 24 bp, this is a 500x increase in just over a decade. And now, with our NGS, we are able to sequence a couple BILLION bp in just one day. 
http://www.pnas.org/content/70/12/3581.full.pdf - Gilbert and Maxam (1973) = 24 bp as whole paper http://www.nature.com/scitable/topicpage/dna-sequencing-technologies-690 When studying an organism at the molecular level, you’re mainly going to look at the organism’s GENOME or at its TRANSCRIPTOME. There is also the proteome, but that’s less relevant to sequencing technology. So genome sequencing is sequencing an organism’s DNA; transcriptome sequencing is sequencing an organism’s RNA. The central dogma of molecular biology is DNA makes RNA makes protein. So in RNA-seq you are not sequencing the genome directly; you are sequencing all of the RNA in an organism or tissue or cell, whatever you’re looking at… I’ll go more into this later.
  • #5 So while these advancements in sequencing are exciting and improving every day, as of yet they are not completely accurate. We had to trade accuracy for speed. With these new sequencing technologies there are biases and artifacts that arise in the data that we have not completely figured out how to normalize yet. One of these biases arises in the PCR step of the preparation process. Polymerase chain reaction is a method for amplifying DNA samples. It copies the fragments of a given DNA sample over and over again to produce a sufficient amount of DNA. The problem with this process is that these duplicates can muddle certain analyses. But it is not as simple as just removing all duplicates, because not all duplicates come from PCR. So it is not straightforward to determine what is a PCR duplicate and should be removed, and what is a duplicate that occurs naturally in the genome and should remain. However, PCR is necessary for the sequencing tech to work. Many times when you gather a sample of DNA or RNA, especially from a small source like a single cell or a small amount of tissue, you get a very small amount of DNA or RNA, maybe just a few nanograms, which is not enough for the sequencers to recognize. So in order to analyze your data at all, you need to use PCR to amplify what you have, even though it may produce artifacts in your data.
  • #6 Depends on accurate fragment counts. So why does this matter? What can these artificial duplicates do to our experiments? There is one type of analysis that is highly dependent on sequence counts, called differential expression analysis. So first, what is differential gene expression? Differential gene expression is a cell’s ability to differentially express its genes in order to specialize. Every single cell in your body has the exact same DNA code in its nucleus. If that is true, which it is, how is it possible that we have a multitude of different types of cells that do very different things? Cells manage this through differential gene expression: certain cells express some genes and not others, and which genes are expressed and not expressed determines cellular function and specialization. And this works because of the central dogma of molecular biology: DNA makes RNA makes protein. The genes a cell expresses are transcribed into RNA, which is translated into proteins. So the genes a cell is expressing can be measured (1) by what proteins it is producing and (2) by what RNA it is producing. This is where RNA-seq comes into play. RNA-seq is used in differential expression analyses. Sequencing the RNA that a cell or organism is producing gives us an indirect look at which genes are active or inactive in the cell or organism. http://www.ncbi.nlm.nih.gov/books/NBK10061/
  • #7 To test this question of how these duplicates can affect experimental analyses, I came up with a simple experiment to compare….
  • #8 So what data am I using for this experiment? http://www.bio.fsu.edu/kahughes/research.html
  • #9 Analysis pipeline
  • #11 Vitellogenin C = egg yolk protein precursor. Vitellogenin A = egg yolk protein precursor. Complement C proteins: role in both innate and adaptive immunity. Uridine phosphorylase 2: catalyzes the reversible phosphorolytic cleavage of uridine and deoxyuridine to uracil and ribose/deoxyribose-1-phosphate. Rhodopsin: allows vision in low-light conditions; main element of rod cells. Endonuclease domain-containing 1: may act as DNase and RNase. Hypothetical protein LOC…: ? But another one was retrotransposon-like family member
  • #12 Include future prospects?