CNVMiner: Pipeline to Mine CNV & Structural Variation in Hierarchical Fashion

30th Annual Convention of Indian Association for Cancer Research & International Symposium on
"Signalling Network and Cancer"
Indian Institute of Chemical Biology (IICB), Kolkata,
6-9 February, 2011

By,
Asoke K Talukder, Ph.D
Indian Institute of Information Technology & Management, Gwalior, India

Published in: Health & Medicine

  1. IACR, Kolkata: 6-9 February 2011 © Asoke K Talukder
     CNVMiner: Pipeline to Mine CNV & Structural Variation in Hierarchical Fashion
     30th Annual Convention of Indian Association for Cancer Research & International Symposium on "Signalling Network and Cancer", Indian Institute of Chemical Biology (IICB), Kolkata, 6-9 February, 2011
     Asoke K Talukder, Ph.D, Indian Institute of Information Technology & Management, Gwalior, India
  2. Acknowledgement
     • Indian Association for Cancer Research
     • Dr Susanta Roy Choudhury & Dr Chitra Mandal
     • Indian Institute of Chemical Biology
     • Prof Dr Nitaipada Bhattacharyya
     • Open Source Software/Foundation
     • Authors of Open Source & Open Domain software
     • Authors & Publishers making various articles available free on the Web
  3. Hope & Opportunities
     • For Cancer Therapeutics, Time is of the essence
       – Speed is the Mantra
     • We need in-Silico Algorithms to
       – Make Diagnosis Speedy
       – Make it Reliable
       – Make it Repeatable
       – Make it Scalable
       – Make it Economical
  4. Challenges in Computing
     • Biology needs similarity, not identity
     • Computers are efficient at discovering identity but not similarity
     • All Biology problems are different & unique
     • Next Generation Sequencers generate huge volumes of data with many errors
     • Separate Noise from Information
     • Minimize False Positives and False Negatives
  5. Most Biology Problems are NP-Hard
     • If the data volume grows by a factor of x, the time needed for an exact solution grows much faster than x; no polynomial-time algorithm is known for NP-hard problems
     • Exact solutions may not be obtainable for some problems on some inputs without spending a great deal of time
     • If you use a heuristic, you may not know when you have reached an optimal solution
     • Arriving at an exact solution is often impractical; however, for problems in NP, a proposed solution can be verified efficiently once it is found
     • Sometimes exact solutions are not necessary and approximate solutions suffice. But how good does the approximation need to be?
  6. Biology + Computing + Mathematics
     • Better Predictability
     • Higher Accuracy
     • Less time to market
  7. CNVMiner: Pipeline to Mine CNV & Structural Variation
     • Functions
       – Uses Libraries in Hierarchical Order
       – Uses Mate-Pair/Paired-end data
       – Determines Links & Structural Variations
       – Calculates Digital Gene Expression
  8. Structural Variation with NGS (Nature Methods, November 2009)
  9. Paired End Mapping (PEM)
     Paul Medvedev, Monica Stanciu & Michael Brudno, Computational methods for discovering structural variation with next-generation sequencing, Nature Methods Supplement, Vol. 6 No. 11s, November 2009
  10. NGS Data Types
      • Fixed Length short reads (NGS)
        – All sequence reads are short
        – MAQ supports 63 bases (made 100 by us)
        – Bowtie supports 1024 bases
      • Variable Length long reads (NGS & Classic)
        – All sequence reads are of variable size
        – Can exceed 1024 bases
  11. NGS Data Formats
      • Single-end
      • Paired-end
      • Mate-pair
      • FASTA
      • FASTQ
      • …
      NO ORDER OR ORIENTATION
      [Figure labels: Insert Size, Library Size, Sequence]
  12. Method-1 (Simple Variations)
      • Align the Paired-end/Mate-pair reads (donor) as Single-end Sequence-reads to the Reference
        – BLAST for long & variable length sequences
        – BOWTIE for short fixed length sequences
      • Establish the Link by locating the mates
      • Measure the distance between mates
      • Establish agreement with Library Inserts
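The last three steps of Method-1 can be sketched as follows. This is a minimal illustration, not the CNVMiner implementation: the names `Library`, `effective_gap` and `classify_link` are hypothetical, and the labelling rule (gap above the library maximum reported as INSERTION, gap below the minimum as DELETION) is an assumption read off the sample output on the "Mining the Variation (PEM)" slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Library:
    """Hypothetical description of one insert library (sizes in bases)."""
    name: str
    min_insert: int
    max_insert: int


def effective_gap(fw, rv):
    """Distance between the inner ends of two mates aligned to the same contig.

    Each mate is a (start, end) pair in reference coordinates; the pair may be
    given in either order, as in the slide output.
    """
    fw_lo, fw_hi = sorted(fw)
    rv_lo, rv_hi = sorted(rv)
    if fw_lo <= rv_lo:
        return rv_lo - fw_hi
    return fw_lo - rv_hi


def classify_link(fw, rv, lib):
    """Compare the measured gap with the library's expected insert range."""
    gap = effective_gap(fw, rv)
    if gap < lib.min_insert:
        return "DELETION", gap   # assumed label, mirrors the slide output
    if gap > lib.max_insert:
        return "INSERTION", gap  # assumed label, mirrors the slide output
    return "VALID", gap


large = Library("Large", 25000, 35000)
print(classify_link((4009642, 4009447), (4043713, 4042741), large))
```

With the "Large" library limits of 25000 and 35000 bases from the Hierarchical Libraries slide, the coordinates reported for FMC03_B10 fall inside the range and the pair is accepted as a valid link.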
  13. Method-2 (Complex Variations)
      • Take Unmatched Sequence-reads
      • Split them using a Sliding-window and align each piece as a Single-end read
      • Identify the Cluster
      • Measure the distance between Mate Clusters
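The splitting step of Method-2 might look like the sketch below. The window and step sizes are not given on the slide, so `window` and `step` are free parameters here, and `sliding_windows` is a hypothetical helper name.

```python
def sliding_windows(read, window, step):
    """Yield (offset, fragment) pairs cut from an unmatched read.

    Fragments have length `window` and start every `step` bases. A read
    shorter than the window is returned whole. If `step` does not divide
    the remaining length, the last few bases are not covered.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    last_start = max(len(read) - window, 0)
    for offset in range(0, last_start + 1, step):
        yield offset, read[offset:offset + window]


for offset, fragment in sliding_windows("ACGTACGTAC", window=4, step=3):
    print(offset, fragment)
```

Each fragment can then be aligned as an ordinary single-end read; keeping the offset makes it possible to map a fragment hit back to a position within the original read.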
  14. Method-3 (DGE)
      • Identify the Cluster
      • Calculate the number of reads
      • Calculate the Breadth of the Aligned set
      • Calculate the Depth of the Aligned set
      • Calculate the DGE (Digital Gene Expression)
        – FPKM (Fragments Per Kilobase of exon Per Million mapped reads)
        – Coverage (depth)
        – Aligned Reference (breadth)
        – Reads Aligned (total number in a cluster)
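The quantities listed for Method-3 can be sketched as below, using the standard FPKM definition (fragments divided by length in kilobases and by mapped reads in millions). The function names are hypothetical. The total of 29,640 mapped reads and the use of refEnd minus refStart as the length are inferred from the numbers on the FPKM slide, not stated by the author.

```python
def fpkm(fragments, length_bp, total_mapped):
    """Fragments per kilobase per million mapped reads."""
    if length_bp <= 0 or total_mapped <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return fragments * 1e9 / (length_bp * total_mapped)


def breadth(intervals):
    """Number of reference bases covered by at least one alignment.

    Intervals are inclusive (start, end) pairs and may overlap.
    """
    covered = 0
    cur_lo = cur_hi = None
    for lo, hi in sorted((min(a, b), max(a, b)) for a, b in intervals):
        if cur_hi is None or lo > cur_hi + 1:
            if cur_hi is not None:
                covered += cur_hi - cur_lo + 1
            cur_lo, cur_hi = lo, hi
        else:
            cur_hi = max(cur_hi, hi)
    if cur_hi is not None:
        covered += cur_hi - cur_lo + 1
    return covered


def mean_depth(intervals):
    """Total aligned bases divided by the breadth of the cluster."""
    total = sum(abs(b - a) + 1 for a, b in intervals)
    covered = breadth(intervals)
    return total / covered if covered else 0.0


# First row of the FPKM slide: 41 alignments over 1374500..1374654.
# 29640 total mapped reads is an inferred value (assumption).
print(round(fpkm(41, 1374654 - 1374500, 29640), 4))
```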
  15. Hierarchical Libraries
      priority 1 libraryLarge    25000 35000 pair _1FW _1RV
      priority 2 libraryModerate 15000 25000 pair _3FW _3RV
      priority 4 librarySmall     8000 15000 pair _4FW _4RV
      priority 5 libraryTiny      3000  6000 pair _5FW _5RV
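A library table like the one above could be read as sketched here. The record layout (the keyword `priority`, a rank, a name, minimum and maximum insert size, the keyword `pair`, then forward and reverse read-name suffixes) is an assumption based on the four sample records; `LibrarySpec`, `parse_libraries` and `library_for` are hypothetical names.

```python
from typing import NamedTuple, Optional


class LibrarySpec(NamedTuple):
    priority: int
    name: str
    min_insert: int
    max_insert: int
    fw_suffix: str
    rv_suffix: str


def parse_libraries(text):
    """Parse whitespace-separated 8-token records, sorted by priority."""
    tokens = text.split()
    specs = []
    for i in range(0, len(tokens), 8):
        rec = tokens[i:i + 8]
        if len(rec) != 8 or rec[0] != "priority" or rec[5] != "pair":
            raise ValueError(f"malformed library record: {rec}")
        specs.append(LibrarySpec(int(rec[1]), rec[2], int(rec[3]),
                                 int(rec[4]), rec[6], rec[7]))
    return sorted(specs, key=lambda s: s.priority)


def library_for(read_name, specs) -> Optional[LibrarySpec]:
    """Return the highest-priority library whose suffix matches the read name."""
    for spec in specs:
        if read_name.endswith((spec.fw_suffix, spec.rv_suffix)):
            return spec
    return None


EXAMPLE = """
priority 1 libraryLarge 25000 35000 pair _1FW _1RV
priority 2 libraryModerate 15000 25000 pair _3FW _3RV
priority 4 librarySmall 8000 15000 pair _4FW _4RV
priority 5 libraryTiny 3000 6000 pair _5FW _5RV
"""

specs = parse_libraries(EXAMPLE)
print(library_for("FMC01_D10_1FW", specs))
```

Sorting by priority means that the libraries are consulted in hierarchical order, largest insert first in this example.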
  16. Alignment – Blast (for Variable Length Data)
      # BLASTN 2.2.23+
      # Query: FMC01.F_A01_length_948
      # Database: mciceri_29cont_454_illumina
      # Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
      # 1 hits found
      FMC01.F_A01_length_948 chr_14_length_1427689 98.18 933 8 9 17 948 1131593 1130669 0.0 1635
      # BLASTN 2.2.23+
      # Query: FMC01.F_A02_length_992
      # Database: mciceri_29cont_454_illumina
      # Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
      # 1 hits found
      FMC01.F_A02_length_992 chr_19_length_679487 97.42 968 10 15 19 986 381039 381991 0.0 1650
      # BLASTN 2.2.23+
      # Query: FMC01.F_A03_length_1164
      # Database: mciceri_29cont_454_illumina
      # Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
      # 1 hits found
      FMC01.F_A03_length_1164 chr_19_length_679487 97.86 1076 4 19 16 1090 508930 507873 0.0 1847
      # BLASTN 2.2.23+
      # Query: FMC01.R_A04_length_1192
      # Database: mciceri_29cont_454_illumina
      # Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
      # 2 hits found
      FMC01.R_A04_length_1192 chr_22_length_757631 97.98 1141 8 15 7 1142 705343 704213 0.0 1975
      FMC01.R_A04_length_1192 chr_12_length_43706 97.98 1141 8 15 7 1142 29181 28051 0.0 1975
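BLAST output in this commented tabular form can be read with a few lines of Python. The sketch below assumes the twelve columns named in the "# Fields:" line and whitespace-separated values; `BlastHit`, `parse_blast_tabular` and `strand` are hypothetical names, not part of BLAST or CNVMiner.

```python
from typing import Iterable, Iterator, NamedTuple


class BlastHit(NamedTuple):
    query_id: str
    subject_id: str
    identity: float
    length: int
    mismatches: int
    gap_opens: int
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    evalue: float
    bit_score: float


def parse_blast_tabular(lines: Iterable[str]) -> Iterator[BlastHit]:
    """Yield one BlastHit per data row, skipping '#' comment lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split()
        if len(f) != 12:
            raise ValueError(f"expected 12 columns, got {len(f)}: {line}")
        yield BlastHit(f[0], f[1], float(f[2]), int(f[3]), int(f[4]),
                       int(f[5]), int(f[6]), int(f[7]), int(f[8]),
                       int(f[9]), float(f[10]), float(f[11]))


def strand(hit: BlastHit) -> str:
    """A subject start greater than the subject end marks a reverse-strand hit."""
    return "-" if hit.s_start > hit.s_end else "+"


sample = [
    "# BLASTN 2.2.23+",
    "# 1 hits found",
    "FMC01.F_A01_length_948 chr_14_length_1427689 98.18 933 8 9 17 948 "
    "1131593 1130669 0.0 1635",
]
for hit in parse_blast_tabular(sample):
    print(hit.subject_id, hit.s_start, hit.s_end, strand(hit))
```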
  17. Alignment – Bowtie (for Fixed Length Data)
      HWUSI-EAS705_9146:3:24:828:1109/1 0 chr1_length_4160774 1374500 255 100M * 0 0 TCTTCGCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATCGACAGGCGGACCTTGCCGCGCTCGTCGAAGCCCA %%%%%%%%%%%%%%%41213:/=;555440323113=44;;>1=;?=1>>=53>;?A/>8=?;===A;?A5AA9A4?B?AAAB@BA>AAA<ABAB@@A@< XA:i:0 MD:Z:100 NM:i:0
      HWUSI-EAS705_9146:3:98:1103:366/1 0 chr1_length_4160774 1374501 255 100M * 0 0 CTTCGCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATCGACAGGCGGACCTTGCCGCGCTCGTCGAAGCCCAT 444454313355544455544433244445661493/3;;565=;491=;5;54==3=;;>;5;;;95>><:==53=?2>??=>A;A=A?A>?AB>AA>A XA:i:0 MD:Z:100 NM:i:0
      HWUSI-EAS705_9146:3:20:433:1834/1 0 chr1_length_4160774 1374502 255 100M * 0 0 TTCGCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATCGACAGGCGGACCTTGCCGCGCTCGTCGAAGCCCATC BAA<AB=?A30@A?AAA>?9=B<=>5;8;=>?4:=919;3555/554533;35;55555;5554%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% XA:i:0 MD:Z:100 NM:i:0
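The Bowtie records above appear to be in SAM format. Assuming the standard SAM layout (eleven mandatory tab-separated columns followed by optional TAG:TYPE:VALUE fields), a record can be unpacked as in this sketch; `parse_sam_line` and `reference_span` are hypothetical helper names, and the test read is synthetic.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_span(cigar):
    """Number of reference bases consumed by a CIGAR string."""
    return sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in "MDN=X")


def parse_sam_line(line):
    """Split one SAM alignment line into a dictionary of typed fields."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM record needs at least 11 columns")
    tags = {}
    for item in f[11:]:
        key, typ, value = item.split(":", 2)
        tags[key] = int(value) if typ == "i" else value
    flag, pos = int(f[1]), int(f[3])
    return {
        "qname": f[0],
        "flag": flag,
        "reverse": bool(flag & 0x10),   # bit 0x10: read aligned to reverse strand
        "rname": f[2],
        "pos": pos,                      # 1-based leftmost position
        "end": pos + reference_span(f[5]) - 1,
        "mapq": int(f[4]),
        "cigar": f[5],
        "seq": f[9],
        "qual": f[10],
        "tags": tags,
    }


# Synthetic record with the same layout as the slide (sequence shortened to 'A's).
line = "\t".join(["read/1", "0", "chr1_length_4160774", "1374500", "255",
                  "100M", "*", "0", "0", "A" * 100, "I" * 100,
                  "XA:i:0", "MD:Z:100", "NM:i:0"])
rec = parse_sam_line(line)
print(rec["rname"], rec["pos"], rec["end"], rec["tags"]["NM"])
```

The start and end positions obtained this way are what the later grouping and distance steps need; the NM tag gives the edit distance of the alignment.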
  18. Grouping (chr/contig wise) (.contig – TIGR/AMOS Format)
      ##gi_317165637_gb_CP002447.1_length_6353983 100 6353983 bases
      #FMC01_A01_xFW(3763405) [] 948 bases, {392 948} <3763405 3763971>
      #FMC01_A02_xFW(4429399) [] 992 bases, {20 899} <4429399 4430278>
      #FMC01_B08_xFW(2257526) [RC] 1140 bases, {1112 302} <2257526 2256713>
      #FMC01_D03_xFW(3775037) [] 1130 bases, {12 1119} <3775037 3776153>
      #FMC01_F01_xFW(2444650) [RC] 1017 bases, {444 270} <2444650 2444473>
      #FMC01_F03_xFW(438175) [] 1061 bases, {15 990} <438175 439151>
      #FMC01_F12_xFW(196934) [RC] 680 bases, {371 8} <196934 196568>
      #FMC01_G05_xFW(3663438) [] 1159 bases, {13 308} <3663438 3663734>
      #FMC01_H08_xFW(4782980) [] 935 bases, {21 935} <4782980 4783894>
      #FMC01_A08_xRV(3555569) [] 1174 bases, {655 1164} <3555569 3556080>
      #FMC01_A08_xRV(3555463) [] 1174 bases, {89 165} <3555463 3555539>
      #FMC01_C04_xRV(1933307) [] 1134 bases, {5 1083} <1933307 1934392>
      #FMC01_D02_xRV(1039634) [RC] 1163 bases, {1112 8} <1039634 1038528>
      #FMC01_D10_xRV(927106) [] 1203 bases, {84 447} <927106 927469>
      #FMC01_E03_xRV(5326284) [] 1150 bases, {5 1059} <5326284 5327343>
      #FMC01_E10_xRV(622907) [] 1175 bases, {67 1073} <622907 623932>
      #FMC01_E11_xRV(1634970) [] 1176 bases, {520 1121} <1634970 1635571>
      #FMC01_F08_xRV(3554606) [RC] 1168 bases, {756 552} <3554606 3554382>
      #FMC01_F10_xRV(3812335) [] 1207 bases, {9 978} <3812335 3813307>
      #FMC01_F11_xRV(6125371) [RC] 1180 bases, {118 28} <6125371 6125281>
      #FMC01_F12_xRV(152024) [] 1146 bases, {12 1034} <152024 153047>
      #FMC03_C06_xFW(5850746) [] 1154 bases, {12 311} <5850746 5851051>
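Grouping placements by contig from a listing like this could be done as sketched below. The interpretation of the fields is an assumption based only on the sample lines: a "##" line starts a contig, and each "#" line gives a read name, an offset in parentheses, an optional RC flag, the read length, a range on the read in braces and a range on the assembly in angle brackets. `parse_contig` is a hypothetical name.

```python
import re
from collections import defaultdict

PLACEMENT_RE = re.compile(
    r"#(?P<name>[^\s(]+)\((?P<offset>\d+)\)\s+\[(?P<rc>RC)?\]\s+"
    r"(?P<length>\d+) bases,\s+\{(?P<r1>\d+) (?P<r2>\d+)\}\s+"
    r"<(?P<a1>\d+) (?P<a2>\d+)>"
)


def parse_contig(lines):
    """Return {contig_id: [placement, ...]} for a .contig-style listing."""
    groups = defaultdict(list)
    contig = None
    for line in lines:
        line = line.strip()
        if line.startswith("##"):
            contig = line[2:].split()[0]   # first token after '##' is the contig id
            continue
        m = PLACEMENT_RE.match(line)
        if m is None or contig is None:
            continue                        # ignore anything unrecognised
        groups[contig].append({
            "read": m["name"],
            "offset": int(m["offset"]),
            "reverse": m["rc"] is not None,
            "length": int(m["length"]),
            "read_range": (int(m["r1"]), int(m["r2"])),
            "asm_range": (int(m["a1"]), int(m["a2"])),
        })
    return dict(groups)


listing = [
    "##gi_317165637_gb_CP002447.1_length_6353983 100 6353983 bases",
    "#FMC01_A01_xFW(3763405) [] 948 bases, {392 948} <3763405 3763971>",
    "#FMC01_B08_xFW(2257526) [RC] 1140 bases, {1112 302} <2257526 2256713>",
]
for contig, placements in parse_contig(listing).items():
    print(contig, [p["read"] for p in placements])
```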
  19. Intermediate XML File
      <?xml version="1.0" ?>
      <EVIDENCE ID="project_1" DATE="Mon Feb 7 15:25:13 IST 2011" PROJECT="MyProject" PARAMETERS="" >
        <LIBRARY ID="lib_large" NAME="large" MIN="2000" MAX="60000">
          <INSERT ID="ins_200" NAME="PL9_E9">
            <SEQUENCE ID="seq_124" NAME="PL9_E9_FWR"/>
            <SEQUENCE ID="seq_125" NAME="PL9_E9_RVR"/>
          </INSERT>
          <INSERT ID="ins_201" NAME="PL9_H3">
            <SEQUENCE ID="seq_156" NAME="PL9_H3_FWR"/>
            <SEQUENCE ID="seq_157" NAME="PL9_H3_RVR"/>
      . . .
      <SEQUENCE ID="seq_124" ORI="EB" ASM_LEND="853973" ASM_REND="853707"/>
      <DIFF_R_TO_L ID="seq_124" ORI="EB" DIFF="266"/>
      <SEQUENCE ID="seq_125" ORI="BE" ASM_LEND="853707" ASM_REND="853973"/>
      <DIFF_L_TO_R ID="seq_125" ORI="BE" DIFF="266"/>
  20. Unmatched Data (Inversion & Fuse)
      CGCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGAC
      CGCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGAC
      GCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACC
      GCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCT
      GCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTT
      CCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTT
      CCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTC
      CTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTC
      TTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCA
      CGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATC
      GGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATCGAC
      GGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATCGAC
      GCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATCGACT
      CCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATCGACTA
      TCTTCGCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATCGACAGGC
  21. Alignment in Genome Viewer
  22. Mining the Variation (PEM)
      Sequence #FMC01_D10_1FW (2617431 <--> 2616304) <====> #FMC01_D10_1RV (3555790 <--> 3555424) in Library "Large" Priority 1
        Invalid Link: INSERTION : Effective Gap 937993 bases
      Sequence #FMC01_E03_1FW (6000798 <--> 6001079) <====> #FMC01_E03_1RV (6025479 <--> 6024437) in Library "Large" Priority 1
        Invalid Link: DELETION : Effective Gap 23358 bases
      Sequence #FMC03_B08_1FW (3405991 <--> 3406934) <====> #FMC03_B08_1RV (3449204 <--> 3448283) in Library "Large" Priority 1
        Invalid Link: INSERTION : Effective Gap 41349 bases
      Sequence #FMC03_B10_1FW (4009642 <--> 4009447) <====> #FMC03_B10_1RV (4043713 <--> 4042741) in Library "Large" Priority 1
        Valid Link: Insert Size 33099 bases
      Sequence #FMC03_D10_1FW (4002883 <--> 4002973) <====> #FMC03_D10_1RV (4049290 <--> 4048293) in Library "Large" Priority 1
        Invalid Link: INSERTION : Effective Gap 45320 bases
  23. CNVMine (Sequence Data)
      donEnd   donStart  donDiff  refEnd   refStart  refDiff  Chr/Contig  SV_Info
      --------------------------------------------------------------------------------
      1484649  1520201   35552    121547   160092    38545    chr_15      Delete(2993)
      1760942  1763068   2126     485407   486677    1270     chr_15      Insert(856)
      1834755  1946223   111468   556660   574404    17744    chr_15      Insert(93724)
      2296143  2304884   8741     1029365  1037561   8196     chr_15      Insert(545)
      2467331  2494711   27380    1182894  1212071   29177    chr_15      Delete(1797)
      2497348  2505581   8233     1215390  1222895   7505     chr_15      Insert(728)
      2669853  4111343   1441490  1409178  1416675   7497     chr_15      Insert(1433993)
      2898970  2912959   13989    1653470  1670528   17058    chr_15      Delete(3069)
      2918147  2928707   10560    1675614  1686668   11054    chr_15      Delete(494)
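The SV_Info column in the table above is consistent with a simple comparison of the donor span and the reference span between two anchors: every row equals the difference between donDiff and refDiff, labelled Insert when the donor span is longer and Delete when it is shorter. The sketch below reproduces that rule; `sv_info` is a hypothetical name, and the "None" label for equal spans is an assumption since no such row appears on the slide.

```python
def sv_info(don_diff, ref_diff):
    """Label the size difference between donor and reference spans."""
    delta = don_diff - ref_diff
    if delta > 0:
        return f"Insert({delta})"
    if delta < 0:
        return f"Delete({-delta})"
    return "None"  # assumed label; not shown in the source table


rows = [(35552, 38545), (2126, 1270), (111468, 17744)]
for don_diff, ref_diff in rows:
    print(don_diff, ref_diff, sv_info(don_diff, ref_diff))
```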
  24. Digital Gene Expression
      refStart  refEnd   donStart  donEnd   refBsize  donBSize
      ----------------------------------------------------------------
      9448      9457     4643011   4643020  109       109
      12635     12649    4645735   4645749  114       114
      38249     38263    4654405   4654419  114       114
      342068    73135    4687372   4700079  269033    12707
      87020     87029    4700707   4700716  109       109
      91302     91303    4702965   4702966  101       101
      1608380   1608379  4728865   4728866  101       101
      1607063   1607021  4730161   4730203  142       142
      377588    377581   4760359   4760366  107       107
      1578176   377406   4760494   4907831  1200870   147337
      377302    376991   4760645   4760956  411       411
      376767    375860   4761180   4762087  1007      1007
  25. FPKM
      refBStart  refEnd   donStart  donEnd  Alignments  FPKM
      ------------------------------------------------------------------
      1374500    1374654  14041     14095   41          8982.2458243511
      1374752    1374931  14293     14372   29          5465.9640075694
      1375022    1375167  14563     14608   62          14425.9853878729
      1376391    1376524  15932     15965   29          7356.4477996611
      1377079    1377344  16491     16656   138         17569.3224352609
      1381298    1381405  20747     20754   12          3783.7224261228
      1384517    1384622  23975     23980   6           1927.8966647388
      1384875    1385026  24333     24384   25          5585.7933167100
      1417360    1417469  56951     56960   7           2227.9937870802
      1423415    1423524  63402     63411   21          6500.0185714816
      1427353    1427472  65583     65602   15          4252.7132310414
      1462473    1462598  99400     99425   4           1079.6221322537
  26. Next Version
      • Define these Structural Variation Loci as Biomarkers
      • Use Structural Variation Loci along with SNPs in GWAS
      • Make this Cloud Computing Enabled
  27. Human Architecture!
      [Figure: Growth and Performance plotted against Age (5 to 45 and beyond)]
      Source: Rajkumar Buyya
  28. Computational Power Improvement
      [Figure: Computational Power Improvement plotted against Number of Processors (1, 2, …); Uniprocessor and Supercomputers (tall, taller) compared with Multiprocessor (fat, fatter)]
      Source: Rajkumar Buyya
  29. Biology Solutions – Hype or Hope
      • As computers become fatter (multiple cores) and clusters become cheaper, it is slowly becoming possible to attempt solving NP-hard problems in Biology
      • All computing algorithms that solve biology problems must be parallel & distributed
      • HPC (High Performance Computing) and Parallel Programming will play a significant role in this attempt
  30. Cloud Computing
      • If you need milk, you need not buy a Cow
      • Cloud computing is an emerging computing paradigm in which data and applications reside in cyberspace; scientists and clinicians access their data and information through any web-connected device, be it fixed or mobile
      • A biologist need not be constrained by the capability of the tool or the computing resources
  31. Conclusion & Way Forward
      • Combine and leverage the advancements of Computing Technologies like Cheaper Hardware & the Cloud
      • Efficient and Optimized Algorithms
      • Interdisciplinary team of Computer Scientists, Biologists, Mathematicians, Statisticians and HPC experts
      • Pave the way for Affordable & Personalized Medicine
  32. Thank You
      Toolsmith: Asoke K Talukder, Ph.D
      Email: “asoke” dot “talukder” (at) “geschcikten” dot “com”
      Workshop:
