RNA Sequencing for Full Length Transcript Discovery

Use of second and third generation sequencing technology platforms to create a dataset for the discovery of full length transcripts

Published in: Technology, Health & Medicine
  • Fragments are important and interesting. Sequencing is cheap and should reveal our fragments: as shown, they are expressed at high levels relative to actin. As shown for an annotation experiment, the recommendation is paired-end, stranded sequencing.
  • Figure 5 – Random RNA-Seq vs Strand-Specific Targeted RNA-Sequencing. Panel A shows the typical RNA-seq experiment. It begins with a cDNA library prepared from the tissue of choice, randomly primed, and includes second-strand cDNA synthesis. Panel B shows the steps in a strand-specific targeted RNA-sequencing experiment: the primers are targeted and the second cDNA strand is not synthesized.
  • Figure 2 – Step 1: Assemble known information. Both the novel transcript fragments discovered through phage display experiments and additional transcript data gathered from a random RNA-seq experiment were mapped to the genome. Step 2: Create gene model. Step 3: Primer design. Primers were designed to be unique in the genome, specific to the gene, and antisense to it. Step 4: Perform targeted RNA-seq; this step involves fragmentation (see Figure 5). Step 5: Reassemble the fragmented transcript data into full-length transcripts. Step 6: Confirm the full-length transcript.
  • Figure 3 – Map phage cDNA fragment information together with random RNA-seq reads. In steps 1 and 2 of our workflow we want to map all known information to the genome and create a putative gene model. Mapping short reads is a crucial and not always unambiguous step: read mapping with blat and read mapping with bowtie2 do not give identical results. The gene model in step 3 was created using blat-mapped reads. Abundance and known transcript information were used to select novel and specific transcript data for our initial putative G12 gene model.
  • Figure 4 – Primer design and custom cDNA library creation. Primers were designed specifically for the gene model created. Panel A shows G12.1, G12.2, G12.3, G12.4, G12.6, G12.7, G12.transcript.1, G12.transcript.2 and G12.transcript.3, the primers designed for the G12 gene model. In total, 119 primers were designed for 23 genes discovered in the initial surgical experiments. An average of 6 primers were designed for each gene, including the 3'-most putative exon. To create the custom targeted cDNA library, a pooling strategy was employed that separated chromosomes and the primers for each gene so that the reverse transcriptase reaction could occur as specifically as possible in 24 separate reactions (Panel B). The cDNA library was synthesized in a long reaction (> 12 hours) on a sample freshly harvested from bone marrow with a RIN quality greater than 9.
  • Figure 6 – Results and pre-sequencing fragmentation. Panel A shows the results of the long reverse transcriptase reaction (12-16 hours) used to create our cDNA library. On average, the transcripts are 3671 base pairs in length. Panel B shows the results of the pre-sequencing step. The purpose of this latter step is to fragment the full-length transcripts to an average length of 300 base pairs because of sequencing length limitations. The electropherogram reveals an average length of 333 base pairs for these fragments.
  • Total count: 46. A: 46 (100%, 17+, 29-); C: 0; G: 0; T: 0; N: 0
  • Background – Discovery of the fragments
  • RNA Sequencing for Full Length Transcript Discovery

    1. 1. RNA-Sequencing for Full-length Transcript Discovery Lab Meeting 2/10/14 Anne Deslattes Mays Mentor: Anton Wellstein, MD, PhD Special Recognition: Marcel Schmidt, PhD 4/18/2014 Wellstein/Riegel Laboratory, Lombardi Cancer Center, Washington DC 20007 1
    2. 2. Discovery of homing gene fragments using bone marrow-derived monocytes
        Questions:
        1. Which proteins drive organ homing of hematopoietic cells?
        2. Are there distinct homing proteins for diseased organs (cancer, wound healing, ischemia, infection)?
        Approaches:
        1. Use a human bone marrow (BM) cDNA library that displays large proteins from bone marrow and precursor cells on the phage surface.
        2. In vivo selection of homing proteins from target organs or vessels in animal models (normal or diseased).
        3. This approach selects for gene fragments coding for homing proteins.
        Full-length transcripts from source material.
    3. 3. Experimental Objective: We aim to identify the full-length transcripts using 2nd and 3rd generation sequencing methods for genes whose fragments were discovered through the phage display experiments nearly a decade ago.
    4. 4. MedStar Georgetown University Hospital Cell Processing Unit. Objective: Obtain healthy donor bone marrow bags
    5. 5. Objective: RNA Isolation from Total Bone Marrow. Step 1: Total Bone Marrow Isolation
    6. 6. Four Sequencing Experiments: Second Generation Sequencing
    7. 7. 2nd Generation Sequencing with Illumina HiSeq 2000
    8. 8. Four Sequencing Experiments: Second Generation Sequencing 1. Total.bm.random – total bone marrow sequenced mate paired non-strand specific randomly primed ~ 180 million reads
    10. 10. Experiment 1 Results
        Genome-aligned (tophat (bowtie2)/cufflinks) and de novo (trinity (gsnap & blat)) assemblies were built from the read information.
        Wellstein Genome: created a sub-genome of excised regions around the phage, in the hope of discovering the underlying isoform and gene structure. The short reads were aligned with BLAT/BLAST against these regions, and still:
        • Information on isoforms and gene structure was ambiguous for hits that included phage
        • Structure of the transcript was not clear
        • Strand information for the aligned reads was not clear
        Next Steps
        • Design another experiment on the same cell population, this time targeted (including the original phage primers used often in experiments on both lineage-negative and total bone marrow) and strand specific
        • Create a custom long transcript library primed to include full-length phage transcripts
    12. 12. Random RNA-Sequencing vs Strand-specific Targeted RNA-sequencing
    13. 13. Targeted RNA-Sequencing Workflow
    14. 14. Initial G12 Gene Model from the Total Bone Marrow
    15. 15. Design targeted primers and create custom long reaction cDNA library
    16. 16. Results and pre-sequencing fragmentation
    17. 17. Experiment 2 Results
        Genome-aligned (tophat (bowtie2)/cufflinks) and de novo (trinity (gsnap & blat)) assemblies were built from the read information.
        Wellstein Genome: created a sub-genome of excised regions around the phage, in the hope of discovering the underlying isoform and gene structure. The short reads were aligned with BLAT/BLAST against these regions, and still:
        • Information on isoforms and gene structure was ambiguous for hits that included phage
        • Strand information was now known, yet the structure of the transcript was still not clear
        • Was it the depth? Was it the cell population? Was it mistargeted regions?
        Next Steps
        • Design another experiment, now looking only at the lineage-negative cell population, where the phage are known to be enriched
        • Return to randomly primed reads
        • Sequence at a depth similar to the original total bone marrow experiment (100 million reads)
    18. 18. Four Sequencing Experiments: Second Generation Sequencing
        1. Total.bm.random – total bone marrow, non-strand-specific, randomly primed, ~180 million reads
        2. Total.bm.ss.targeted – total bone marrow, strand-specific, targeted primed, to a depth of ~20 million reads
        3. Lin.neg.ss.random – lineage-negative, strand-specific, randomly primed, ~111 million reads
    20. 20. Negative Selection: Human Progenitor Cell Enrichment Kit with Platelet Depletion to Isolate the Lineage Negative sub population from total bone marrow
    21. 21. Loading and Negative Controls
        class     gene            total.bm.ss   lin.neg.ss
        loading   ACTB                   2933       12,643
        loading   B2M                    1500         8473
        loading   GAPDH                   622       44,413
        negative  CD11B                   231         1193
        negative  CD11C                   132          689
        negative  CD14                     21           49
        negative  CD16a                   418         1312
        negative  CD19                      8           36
        negative  CD2                       7           16
        negative  CD24                    142          177
        negative  CD3EAP                   28          243
        negative  CD56                    197         2039
        negative  CD61                     24          480
        negative  CD66B                   207          208
        negative  glycophorin.A            49           80
        negative  mir155                    2           20

        Phage and Positive Controls
        class     gene            total.bm.ss   lin.neg.ss
        phage     _b9                     203         2298
        phage     a1                        0            0
        phage     A12                       0            0
        phage     A5                      186          553
        phage     a8                       76          789
        phage     b3                      439         4731
        phage     b6                       68          331
        phage     B9                      171         2354
        phage     C1                        9          139
        phage     C12                      42       10,657
        phage     C2                      147         1757
        phage     c3                      163          453
        phage     C7                      170         1419
        phage     d5                      236          744
        phage     E12.1                    34          459
        phage     E7                      106          300
        phage     E9                      236         2723
        phage     F6                      120         2556
        phage     G12                     292          925
        phage     H3                       64         1060
        phage     h4                      179          658
        phage     h6                        0            0
        phage     h7                      126         1302
        positive  BST1                     32         1616
        positive  CD133                     0            0
        positive  CD34                      9          398
        positive  THY1                      2            4

        Totals: 3 loading controls; 13 negative controls; 27 positive controls and phage
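One way to read the table above is to compare each gene's change between the two libraries against a loading control. The sketch below is not part of the original analysis: it assumes the table values are read counts, uses ACTB as the reference (an arbitrary choice, since the three loading controls themselves change by very different factors), and the function name `normalized_fold_change` is hypothetical.

```python
# Counts copied from the slide's table as (total.bm.ss, lin.neg.ss).
# Only a few rows are included to keep the sketch short.
COUNTS = {
    "ACTB": (2933, 12643),   # loading control used as the reference here
    "CD14": (21, 49),        # negative control
    "CD34": (9, 398),        # positive control
    "B9": (171, 2354),       # phage
    "G12": (292, 925),       # phage
}


def normalized_fold_change(gene, reference="ACTB", counts=COUNTS):
    """Return the lin.neg / total.bm ratio of `gene`, each count first divided
    by the reference gene's count in the same library.

    Returns None when a zero count makes the ratio undefined.
    """
    total_bm, lin_neg = counts[gene]
    ref_total_bm, ref_lin_neg = counts[reference]
    if total_bm == 0 or ref_total_bm == 0 or ref_lin_neg == 0:
        return None
    return (lin_neg / ref_lin_neg) / (total_bm / ref_total_bm)


if __name__ == "__main__":
    for gene in ("CD14", "CD34", "B9", "G12"):
        print(gene, round(normalized_fold_change(gene), 2))
```

Under these assumptions a value above 1 suggests relative enrichment in the lineage-negative library and a value below 1 suggests depletion; a different reference gene would shift every ratio.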
    22. 22. ACTB. Peak read counts: 45,701; 52,626; 12,570; 200
    23. 23. Negative Control: CD14 (should be highest in Total Bone Marrow). Peak read counts: 109; 6318; 48; 21
    24. 24. Positive Control: CD34 (should be highest in Lineage Negative). Peak read counts: 169; 43; 386; 10
    25. 25. What’s Wrong With Illumina Reads: Uniformity of Read Coverage*
        • An aligned read can be represented as an integer point in R^2 as follows: the ‘t-coordinate’ corresponding to the read is its left-end point, while the ‘l-coordinate’ is the length of the fragment. In Evans et al. (2010), it is shown that for any choice of fragment length distribution, the collection of points {(t, l)} from a sequencing experiment forms a two-dimensional Poisson process. This principle guides our further analysis of these points {(t, l)}, as we test for uniformity in both the t and l coordinates. The output of ReadSpy is a list of test statistics and P-values for each transcript. A statistically significant (low) P-value means we reject the fact that the dataset is uniform on that transcript. Thus, a higher P-value corresponds to a set of reads sampled uniformly, which is desired. In the next two sections, we describe the statistical test applied to each transcript. The test is formulated in terms of the genomic segment [a, b].
        *Hower, Valerie, Richard Starfield, Adam Roberts, and Lior Pachter. "Quantifying uniformity of mapped reads." Bioinformatics 28, no. 20 (2012): 2680-2682.
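To make the idea of testing read positions for uniformity concrete, here is a minimal sketch of a Pearson chi-square statistic over binned read start positions (the t-coordinate only). It is not ReadSpy's implementation, which also considers the fragment length coordinate; the function name, the choice of 20 equal-width bins, and the half-open segment [a, b) are assumptions made for illustration.

```python
def chi_square_uniformity(start_positions, a, b, n_bins=20):
    """Pearson chi-square statistic for read start positions on [a, b)
    against the expectation that starts are spread evenly over the segment.

    Returns (statistic, degrees_of_freedom). Positions outside [a, b)
    are ignored.
    """
    if b <= a or n_bins < 2:
        raise ValueError("need b > a and at least two bins")
    width = (b - a) / n_bins
    observed = [0] * n_bins
    for t in start_positions:
        if a <= t < b:
            # Clamp guards against floating-point rounding at the upper edge.
            index = min(int((t - a) / width), n_bins - 1)
            observed[index] += 1
    n_reads = sum(observed)
    if n_reads == 0:
        raise ValueError("no reads fall inside the segment")
    expected = n_reads / n_bins
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    return statistic, n_bins - 1


if __name__ == "__main__":
    evenly_spread = list(range(0, 1000))   # one read start per base
    piled_up = [5] * 100                   # every read starts at the same base
    print(chi_square_uniformity(evenly_spread, 0, 1000))
    print(chi_square_uniformity(piled_up, 0, 1000))
```

A large statistic relative to the degrees of freedom corresponds to a low P-value, that is, non-uniform coverage. Converting the statistic to a P-value needs the chi-square survival function, for example `scipy.stats.chi2.sf(statistic, df)`, which is left out here to keep the sketch dependency-free.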
    26. 26. Lior Pachter’s ReadSpy Results
        Total BM Targeted Strand Specific (20 million reads)
        target_id      length     df   pair_counts_0   test_stat_0   p_value_0
        chr19        49129131     19             226       3948.34    0.00E+00
        chr4        191038775     19             227       1760.40    0.00E+00
        chr11       135006716     19             304       2811.79    0.00E+00
        chr2        243199471     19             361       6859.00    0.00E+00
        chr16        90354953     38             402       7638.00    0.00E+00
        chr9        141354337     38             436       2754.92    0.00E+00
        chr12       133851995     57             797      15143.00    0.00E+00
        chr15       102531492     76             841      15979.00    0.00E+00
        chr1        249250866    247            2739      20184.43    0.00E+00
        chr7        159138908    285            3325      54980.68    0.00E+00

        Lineage Negative Strand Specific Random (110 million reads)
        target_id      length     df   pair_counts_0   test_stat_0   p_value_0
        chrY         59373664     19             224       4256.00    0.00E+00
        chr21        48130091     19             284       2951.63    0.00E+00
        chr19        49129131     57             663      10583.74    0.00E+00
        chr8        146364218     57             751       5478.61    0.00E+00
        chr10       135534897     76             902       8655.73    0.00E+00
        chr3        198022577     76             957      12936.24    0.00E+00
        chr16        90354953    133            1439      27341.00    0.00E+00
        chr11       135006716    190            2067      23431.41    0.00E+00
        chr2        243199471    190            2260      42940.00    0.00E+00
        chr4        191038775    285            3236      40639.91    0.00E+00
        chr9        141354337    304            3423      23574.66    0.00E+00
        chr15       102531492    380            5735     108965.00    0.00E+00
        chr1        249250866    912           10322      97596.23    0.00E+00
        chr7        159138908   2394           29726     504209.24    0.00E+00
        chr12       133851995   5605           84272    1601168.00    0.00E+00

        Our reads all have low p-values, indicating the non-uniform nature of their read coverage.
    27. 27. Experiment 3 Results
        Genome-aligned (tophat (bowtie2)/cufflinks) and de novo (trinity (gsnap & blat)) assemblies were built from the read information.
        Wellstein Genome: created a sub-genome of excised regions around the phage, in the hope of discovering the underlying isoform and gene structure. The short reads were aligned with BLAT/BLAST against these regions, and still:
        • Information on isoforms and gene structure was ambiguous for hits that included phage
        • Strand information was known, and enrichment in the population is evident, yet an unambiguous structure of the phage transcripts is still not clear
        • Finding known genes can be done; even de novo assembly of novel transcripts is done on a regular basis
        • But with these phage only a fragment is known: how do we find the full-length structure of the transcript behind it?
        • What if the phage transcripts were present in the targeted full-length library but were lost in the fragmentation? Is there a way to sequence without fragmentation?
        Next Steps
        • Use new 3rd generation technology to do full-length transcript sequencing without fragmentation
    28. 28. Source: Iso-seq webinar by Liz Tseng, Pacific Biosciences https://github.com/PacificBiosciences/cDNA_primer/wiki/Understanding-PacBio-transcriptome-data
    29. 29. Four Sequencing Experiments
        Second Generation Sequencing
        1. Total.bm.random – total bone marrow, non-strand-specific, randomly primed, ~180 million reads
        2. Total.bm.ss.targeted – total bone marrow, strand-specific, targeted primed, to a depth of ~20 million reads
        3. Lin.neg.ss.random – lineage-negative, strand-specific, randomly primed, ~111 million reads
        Third Generation Sequencing
        4. Lin.neg PacBio long reads – 6 million CCS filtered subreads, ~277,000 reads of insert
    30. 30. Source: http://www.pacificbiosciences.com/products/smrt-technology/
    31. 31. Source: https://github.com/PacificBiosciences/cDNA_primer/wiki/Understanding-PacBio-transcriptome-data#wiki-roiexplained
    32. 32. Source: https://github.com/PacificBiosciences/cDNA_primer/wiki/Understanding-PacBio-transcriptome-data#wiki-roiexplained
    33. 33. Source: Bobby Sebra – smrt portal analysis results
    34. 34. ACTB. Peak read counts: 45,701; 52,626; 12,570; 10
    35. 35. Negative Control: CD14 (should be highest in Total Bone Marrow). Peak read counts: 109; 6318; 48; 21
    36. 36. Positive Control: CD34 (should be highest in Lineage Negative). Peak read counts: 169; 43; 386; 10
    37. 37. Phage: B9 – only the phage (953 bp). Peak read counts: 10; 10; 10; 10
    38. 38. Phage: B9 – 10x larger region (~9kb) centered on phage evidence. Peak read counts: 10; 16; 10; 10
    39. 39. Reports for Job readsofinsert: Reads Of Insert
        Plots: Read Length Of Insert; Read Quality Of Insert; Number Of Passes
        Movie                                                            Reads Of Insert   Read Bases Of Insert   Mean Read Length Of Insert   Read Accuracy Of Insert   Mean Number Of Passes
        m131214_160008_42177R_c100597152550000001823102305221422_s1_p0            47,762             61,257,390                        1,282                    97.96%                   11.01
        m131212_234151_42177R_c100597412550000001823102305221473_s1_p0            23,360             33,092,110                        1,416                    98.39%                   11.65
        m131214_092100_42177R_c100597152550000001823102305221420_s1_p0            36,623             59,671,472                        1,629                    98.41%                   10.78
        m131214_124034_42177R_c100597152550000001823102305221421_s1_p0            49,710             63,809,739                        1,283                    98.04%                   11.26
        m131213_232025_42177R_c100597412550000001823102305221475_s1_p0            30,720             37,357,905                        1,216                    97.49%                   10.75
        m131213_030106_42177R_c100597412550000001823102305221474_s1_p0            24,284             34,943,462                        1,438                    98.49%                   11.85
        m131214_060132_42177R_c100597412550000001823102305221477_s1_p0            32,492             39,813,943                        1,225                    97.49%                   10.54
        m131214_023937_42177R_c100597412550000001823102305221476_s1_p0            32,210             39,536,384                        1,227                    97.57%                   10.74
        Generated by SMRT® Portal, Thu Feb 06 13:30:44 UTC 2014. For Research Use Only. Not for use in diagnostic procedures.
        Report URL: http://ec2-54-197-149-12.compute-1.amazonaws.com:8080/smrtportal/View-Data/Report/16437?name=readsofinsert&media=all&reportKey=Reads-Of-Insert-R…
        Source: self-install smrt portal – reads of insert
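As a quick consistency check on the report above, the per-movie rows can be summed to get the total number of reads of insert and the overall mean read length. This sketch only re-adds the numbers printed on the slide; the variable and function names are illustrative.

```python
# (reads of insert, read bases of insert) for the eight movies, in slide order.
MOVIES = [
    (47762, 61257390),
    (23360, 33092110),
    (36623, 59671472),
    (49710, 63809739),
    (30720, 37357905),
    (24284, 34943462),
    (32492, 39813943),
    (32210, 39536384),
]


def summarize(movies):
    """Return (total reads, total bases, overall mean read length in bases)."""
    total_reads = sum(reads for reads, _ in movies)
    total_bases = sum(bases for _, bases in movies)
    return total_reads, total_bases, total_bases / total_reads


if __name__ == "__main__":
    reads, bases, mean_length = summarize(MOVIES)
    print(f"reads of insert: {reads}")
    print(f"bases: {bases}")
    print(f"mean read length: {mean_length:.1f}")
```

The summed read count equals the 277161 denominator used in the primer summaries on the following slides, so the eight movies account for the whole reads-of-insert set.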
    40. 40. Transcript Size Distribution (chart). Categories: 1 to 2k; 2 to 3k; over 3k. Shares shown: 87%; 11%; 2%
    41. 41. Summary of reads.
        ------ 5' primer seen summary ----
        Per subread: 258835/277161 (93.4%)
        Per ZMW: 258835/277161 (93.4%)
        Per ZMW first-pass: 258835/277161 (93.4%)
        ------ 3' primer seen summary ----
        Per subread: 1361/277161 (0.5%)
        Per ZMW: 1361/277161 (0.5%)
        Per ZMW first-pass: 1361/277161 (0.5%)
        ------ 5'&3' primer seen summary ----
        Per subread: 1341/277161 (0.5%)
        Per ZMW: 1341/277161 (0.5%)
        Per ZMW first-pass: 1341/277161 (0.5%)
        ------ 5'&3'&polyA primer seen summary ----
        Per subread: 18/277161 (0.0%)
        Per ZMW: 18/277161 (0.0%)
        Per ZMW first-pass: 18/277161 (0.0%)
        ------ Primer Match breakdown ----
        F0/R0: 258855 (100.0%)
        Source: output of summarize_results.py (Liz Tseng)
    42. 42. But this is not good – it turns out that the primers were incorrectly chosen and the best way to find the primers used is to do as follows: 4/18/2014 Wellstein/Riegel Laboratory, Lombardi Cancer Center, Washington DC 20007 43 >cat reads_of_insert.fasta | grep -A1 "AAAAAAAAAAAAAAAAA" | more GGCTTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGTACTCTGCGTTGATACCACTGCTT -- AACATTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGTAACTCTGCGTTGATACCACTGCTT -- TGTTTTATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGTACTCTGCGTTGATACCACTGCTT -- TTACAATTTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGTACTCTGCGTTGATACCACTGCTT -- GAGCCCTTACCGAAAAAAAAAAAAAAAAAAAAAAAAAAGTACTCTGCGTTGATACCACTGCTT -- GTGGTGATTGTTTACTAAAAAAAAAAAAAAAAAAAAAAGTACTCTGCGTTGATACCACTGCTT -- GACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGTACTCTGCGTTGATACCACTGCTT -- TTTCCCGCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGTACTCTGCGTTGATACCACTGCTT -- CTTACTTACGTAAAAAAAAAAAAAAAAAAAAAAAAAAGTACTCTGCGTTGATACCACTGCTT -- GCCCCATCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGTACTCTGCGTTGATACCACTGCTT >cat reads_of_insert.fasta | grep -A1 "TTTTTTTTTTTT" | more AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTATTTGGCTTGAT -- AAGCAGTTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTGATTTCCAT -- AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTACTTGGGATCTTT -- AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTATTTTTTTTTTTTTT -- AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTACCCATCAGCG -- AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTGGTATTTGTTTGTTTCTG -- AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTATTTTTTTTTTTTTTTTT -- AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGACATAAACAC -- AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTACTAAGCATATT T Now my primers are: >F0 AAGCAGTGGTATCAACGCAGAGTAC >R0 GTAACTCTGCGTTGATACCACTGCTT
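The grep output above suggests a simple check: the sequence that follows the poly-A tail at the 3' end of a read should be the reverse complement of the 5' primer. The sketch below illustrates that check in Python. The example read is constructed for illustration (its poly-A length is arbitrary and not copied from a real read), and the helper names are hypothetical.

```python
import re

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def split_at_poly_a(read, min_run=12):
    """Split a read at its first run of at least `min_run` A's.

    Returns (upstream, poly_a, downstream), or None if no such run exists.
    """
    match = re.search("A{%d,}" % min_run, read)
    if match is None:
        return None
    return read[:match.start()], match.group(0), read[match.end():]


if __name__ == "__main__":
    f0 = "AAGCAGTGGTATCAACGCAGAGTAC"   # 5' primer as given on the slide
    # Illustrative read: short 3' end, 30 A's, then the adapter sequence.
    read = "GGCTT" + "A" * 30 + "GTACTCTGCGTTGATACCACTGCTT"
    upstream, tail, downstream = split_at_poly_a(read)
    print(len(tail), downstream)
    print(downstream == reverse_complement(f0))
```

With these inputs the text after the poly-A run is exactly the reverse complement of F0, which is the pattern visible in most of the listed reads. The R0 sequence as written on the slide contains one additional A (GTAACT rather than GTACT), as in the second listed read; the slide does not say whether that difference is intended.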
    43. 43. Summary of reads with the corrected primers.
        ------ 5' primer seen summary ----
        Per subread: 256672/277161 (92.6%)
        Per ZMW: 256672/277161 (92.6%)
        Per ZMW first-pass: 256672/277161 (92.6%)
        ------ 3' primer seen summary ----
        Per subread: 208877/277161 (75.4%)
        Per ZMW: 208877/277161 (75.4%)
        Per ZMW first-pass: 208877/277161 (75.4%)
        ------ 5'&3' primer seen summary ----
        Per subread: 207111/277161 (74.7%)
        Per ZMW: 207111/277161 (74.7%)
        Per ZMW first-pass: 207111/277161 (74.7%)
        ------ 5'&3'&polyA primer seen summary ----
        Per subread: 100863/277161 (36.4%)
        Per ZMW: 100863/277161 (36.4%)
        Per ZMW first-pass: 100863/277161 (36.4%)
        ------ Primer Match breakdown ----
        F0/R0: 258438 (100.0%)
        Source: output of summarize_results.py (Liz Tseng)
    44. 44. Negative Control: CD14 (should be highest in Total Bone Marrow). Peak read counts: 109; 6318; 48; 21
    45. 45. Positive Control: CD34 (should be highest in Lineage Negative). Peak read counts: 169; 43; 386; 10
    46. 46. Phage: B9 – only the phage (953 bp). Peak read counts: 10; 10; 10; 10
    47. 47. Phage: B9 – 10x larger region (~9kb) centered on phage evidence. Peak read counts: 10; 16; 10; 10
    48. 48. UCSC Genome Browser view (hg19, chr11, approximately 1,774,050 to 1,774,450, scale bar 200 bases) near MOB2 and CTSD, showing the BLAT alignments of phage 14-10 and of the CCS reads of insert. Phage 14-10: 100% identity and alignment to 19 full-length reads of insert
    49. 49. UCSC Genome Browser view (hg19, chr11, approximately 1,775,000 to 1,785,000, scale bar 5 kb) covering MOB2, IFITM10 and CTSD, showing the BLAT alignments of phage 14-10 and of the CCS reads of insert. Phage 14-10: 100% aligned to CTSD, 2 possibly 3 splice variants in the lineage-negative cell population; structure fully resolved
    50. 50. Conclusions:
        • Full-length transcript discovery is achieved with the Pacific Biosciences RS sequencer, using size selection in library preparation prior to sequencing and the Reads Of Insert algorithm.
        • Even before the release of the ReadsOfInsert approach, the subreads produced by the sequencing could already reveal the structure of the complete transcript.
        • An error rate of 15% seems daunting, but the random nature of the errors and the length of the reads provided the complete structure in a way that no short-read second generation sequencing could.
        • When one is searching for the complete structure, perfection in the parts is of no consequence.
        • NO ASSEMBLY is REQUIRED
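The point about random errors can be made concrete with a small calculation. As a deliberately simplified model (an assumption, not the ReadsOfInsert algorithm): treat each pass over a base as independently wrong with probability 0.15, and call the consensus wrong only when more than half of the passes are wrong. Real PacBio errors are dominated by insertions and deletions and consensus calling is more involved, so this is only a rough illustration.

```python
from math import comb


def majority_error_probability(per_pass_error, passes):
    """Probability that more than half of `passes` independent passes are
    wrong at a given base, each being wrong with `per_pass_error`.

    With an even number of passes, an exact tie is counted as correct.
    """
    threshold = passes // 2 + 1
    return sum(
        comb(passes, k)
        * per_pass_error ** k
        * (1 - per_pass_error) ** (passes - k)
        for k in range(threshold, passes + 1)
    )


if __name__ == "__main__":
    for passes in (1, 3, 5, 11):
        print(passes, majority_error_probability(0.15, passes))
```

Under this model a single pass is wrong 15% of the time, while eleven passes (close to the mean number of passes in the reads-of-insert report) bring the majority-wrong probability below 0.3%.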
    51. 51. Next Steps:
        1. Complete the reads of insert approach with 75% accuracy and a minimum of 1 pass
        2. Identify additional full-length structures (if possible with the sample reads)
        3. Write up the results
        4. (next paper) If no additional phage are found, sequence an enriched population with confirmed phage evidence at full length with another PacBio sequencing run
        5. Use Illumina reads to correct errors and recover more reads
        6. Use greater PacBio sequencing depth
    52. 52. Acknowledgements
        Dr. Anton Wellstein, Dr. Anna Riegel, Dr. Elena Tassi, Dr. Marcel Schmidt
        The entire lab: Elena, Virginie, Ghada, Ivana, Eveline, Khalid, Khaled, Eric, Nitya, the entire Wellstein/Riegel laboratory
        My Committee: Dr. Yuri Gusev, Dr. Anatoly Dritschilo, Dr. Michael Johnson, Dr. Christopher Loffredo, Dr. Habtom Ressom, Dr. Terry Ryan (external committee member)
        Robert Sebra, Mt. Sinai PacBio Sequencing
        Liz Tseng, Pacific Biosciences
        Eric Schadt, Mt. Sinai PacBio Sequencing
        Brian Haas, Author, Trinity Suite
    53. 53. CD11: New Evidence of an Exon From all Samples, confirmed by PacBio. Peak read counts: 16; 1925; 639; 121
    54. 54. PASA assembly (Trinity Pipeline), de novo + genome-guided: evidence of a new exon not found in the annotation for CD11
