INTRODUCTION

SIGNIFICANT MUTATIONS

VIRAL GENOME DETECTION

REPRODUCIBILITY

Pattern Recognition In Clinical Data
  1. Pattern Recognition In Clinical Data
     Saket Choudhary
     Dual Degree Project
     Guide: Prof. Santosh Noronha
     October 30, 2013
     [Title graphic: DNA base sequences C G C A T C G A G C T and C G C G T C G A G C T]
  2. INTRODUCTION: Objective
     SIGNIFICANT MUTATIONS: Next Generation Sequencing; Motivation; Computational Methods for Driver Detection
     VIRAL GENOME DETECTION: Workflow
     REPRODUCIBILITY: Reproducibility
     CONCLUSIONS: Wrapping up
  3. OBJECTIVE
     [Diagram, built up over slides 3 to 15; node labels as extracted:] BWA v/s BWAPSSM; Galaxy Workflow; Viral Genome Integration; Benchmarking Alignment tools; Tools for Biologists; Reproducible analysis; Driver & Passenger Mutation Detection; Galaxy Tools; Better/New Algorithms; Next Generation Sequencing & Cancer Research; Driver & Passenger Mutation Detection; Lit. Survey; Ensembl Method
  16. NEXT GENERATION SEQUENCING
      [Figure, built up over slides 16 to 19; base sequences as extracted: C G C G T C G A G C T A G C A; G C G T* C G A G* C T A G*; A G C G C G T C G A G C T A G C A C A. Asterisks mark individual bases as in the original.]
  20. NGS: WHY?
      • Molecular approach: study of variations at the 'base' level
      • Low cost: the $1000 genome
      • Faster: quicker than traditional sequencing techniques
  23. NGS: WHERE?
      • Study variations and genotype-phenotype association
      • Look for 'markers of diseases'
      • Prognosis
  26. NGS: MUTATIONS
      • 3 × 10^9 base pairs
      • We are all 99.9% similar at the DNA level
      • More than 2 million SNPs
      • No particular pattern of SNPs
      • A mutation that causes a change in an amino acid is referred to as non-synonymous (nsSNV)
  27. DRIVERS AND PASSENGERS I
      • Cancer is known to arise due to mutations
      • Not all mutations are equally important!
      • Somatic mutations: the set of mutations acquired after zygote formation, over and above the germline mutations
      • Driver mutations: mutations that confer a growth advantage on the cell and are positively selected in the tumor tissue
  28. DRIVERS AND PASSENGERS
      Drivers are NOT simply loss-of-function mutations; they include:
      • Loss of function: inactivates tumor suppressor proteins
      • Gain of function: activates normal genes, transforming them into oncogenes
      • Drug resistance mutations: mutations that have evolved to overcome the inhibitory effect of drugs
  31. DRIVER MUTATIONS: WHY?
      • Identify driver mutations → better therapeutic targets
      • But how does one zero in on the exact set? → Experiments are too costly, probably infeasible for 2 million+ SNPs → leverage computational analysis
      • The low cost of NGS comes with a heavier roadblock: data analysis
      • Searching among 2 million+ SNPs is a non-trivial, computationally intensive problem
      • Software tools have a low consensus ratio among themselves ↔ defining a driver computationally is non-trivial
      • However, there is no tool that lets one visualise the results for a given input across the whole cohort of tools
  35. MACHINE LEARNING I
      Two datasets:
      • Training: a labelled dataset containing a table of features, with mutations labelled as "driver" or "passenger"
      • Test: used to test the prediction model 'learned' from the training dataset

      Table: Training Dataset
      Chromosome   Position   Ref   Alt   Type
      1            27822      A     G     Driver
      1            27832      T     G     Driver
      2            47842      G     C     Passenger
      ...          ...        ...   ...   ...
  36. MACHINE LEARNING II
      Table: Test Dataset
      Chromosome   Position   Ref   Alt   Type
      1            27824      A     G     ?
      1            47832      T     G     ?
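The train-then-predict flow on these two slides can be sketched with the simplest possible model, a majority-class baseline. This is only an illustration of the data flow, not a method from the slides: the rows are copied from the two tables, while the dictionary keys and the function names `fit_majority` and `predict` are assumptions made for this sketch. A real predictor would learn from informative features rather than from label counts alone.

```python
from collections import Counter

# Rows copied from the training and test tables on the slides.
TRAINING = [
    {"chrom": "1", "pos": 27822, "ref": "A", "alt": "G", "type": "Driver"},
    {"chrom": "1", "pos": 27832, "ref": "T", "alt": "G", "type": "Driver"},
    {"chrom": "2", "pos": 47842, "ref": "G", "alt": "C", "type": "Passenger"},
]
TEST = [
    {"chrom": "1", "pos": 27824, "ref": "A", "alt": "G"},
    {"chrom": "1", "pos": 47832, "ref": "T", "alt": "G"},
]


def fit_majority(rows):
    """'Train' by remembering the most frequent label (hypothetical baseline)."""
    return Counter(row["type"] for row in rows).most_common(1)[0][0]


def predict(model, rows):
    """Fill in the unknown 'type' column of each test row with the learned label."""
    return [dict(row, type=model) for row in rows]


model = fit_majority(TRAINING)
predictions = predict(model, TEST)
```

Any model that beats this baseline on held-out labelled data is learning something from the features; one that does not is no better than guessing the commonest class.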
  37. MACHINE LEARNING: FEATURE SELECTION I
      • Machine learning relies on a set of features for training
      • Redundant features should be avoided
      • CHASM [1] makes use of the information-theoretic measures defined below
      $p(X_i)$ represents the probability of occurrence of an event $X_i$. Considering a series of events $X_1, X_2, X_3, \ldots, X_n$, analogous to a 'series of packets' in communication theory, the information received at each step can be quantified on a log scale by:
      $$\log_2 \frac{1}{p(X_i)} = -\log_2 p(X_i) \qquad (1)$$
      The expected value of the information from a series of events is called the Shannon entropy, $H(X)$:
      $$H(X) = -\sum_i p(X_i) \log_2 p(X_i) \qquad (2)$$
  38. MACHINE LEARNING: FEATURE SELECTION II
      Mutual information between two random variables $X$ and $Y$ is defined as the amount of information gained about $X$ from knowledge of the second variable, $Y$:
      $$I(X, Y) = H(X) - H(X \mid Y) \qquad (3)$$
      Here:
      • $X$: class label [Driver/Passenger]
      • $Y$: predictive feature
      Hence $I(X, Y)$ represents how much information is gained about the class label $X$ from knowledge of a feature $Y$. Simplifying:
      $$I(X, Y) = \sum_{x} \sum_{y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\,p(y)} \qquad (4)$$
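Equations (2) and (4) can be evaluated directly from paired observations by replacing probabilities with observed frequencies. The sketch below does exactly that; the toy labels and the two candidate features are hypothetical and chosen so that the results are easy to check by hand.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy H(X) in bits, estimated from observed frequencies (eq. 2)."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(labels, feature):
    """Mutual information I(X, Y) in bits from paired observations (eq. 4)."""
    n = len(labels)
    count_x = Counter(labels)
    count_y = Counter(feature)
    count_xy = Counter(zip(labels, feature))
    return sum(
        (c / n) * math.log2((c / n) / ((count_x[x] / n) * (count_y[y] / n)))
        for (x, y), c in count_xy.items()
    )


# Hypothetical toy data: class labels and two candidate binary features.
labels = ["Driver", "Driver", "Passenger", "Passenger"]
informative = [1, 1, 0, 0]    # tracks the label exactly
uninformative = [1, 0, 1, 0]  # statistically independent of the label

mi_informative = mutual_information(labels, informative)
mi_uninformative = mutual_information(labels, uninformative)
```

A feature that determines the label carries the full entropy of the label (1 bit here), while an independent feature carries none, so ranking features by $I(X, Y)$ gives a simple selection criterion.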
  39. FUNCTIONAL IMPACT I
      • If a mutation confers an advantage on the cell in terms of replication rate, it is likely to be selected, while mutations that reduce fitness have a higher chance of being eliminated from the population.
      • Certain residues in a multiple sequence alignment (MSA) of homologous sequences are more conserved than others.
      • Mutating a highly conserved residue is likely to be costly, since what had 'evolved' is disturbed!
      • Scores can be assigned based on this "conservation" parameter.
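One simple way to turn "conservation" into a number is a per-column entropy score over the alignment. The sketch below is an assumption made for illustration, not the scoring used by SIFT or any other tool named in these slides; the function name `column_conservation`, the normalisation by log2(20), and the toy alignment are all hypothetical.

```python
import math
from collections import Counter


def column_conservation(alignment):
    """Return one score per alignment column in [0, 1]; 1 means fully conserved.

    Score = 1 - H / log2(20), where H is the Shannon entropy of the residues
    in the column and 20 is the size of the amino-acid alphabet.
    Gap characters ('-') are ignored.
    """
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)  # nothing observed, so no evidence of conservation
            continue
        n = len(residues)
        h = -sum((c / n) * math.log2(c / n) for c in Counter(residues).values())
        scores.append(1.0 - h / math.log2(20))
    return scores


# Hypothetical toy alignment of four homologous protein fragments.
msa = ["MKVL", "MKIL", "MRVL", "MKAL"]
scores = column_conservation(msa)
```

Columns 1 and 4 are identical in every sequence and score 1.0; column 3 varies the most and scores lowest, so a mutation there would be penalised least under this scheme.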
  40. FUNCTIONAL IMPACT II
      Figure: SIFT [?] algorithm
  41. Some of the common tools/algorithms used for driver mutation prediction:
      • SIFT
      • Polyphen
      • Mutation Assessor
      • TransFIC
      • Condel
  42. FRAMEWORK FOR COMPARING VARIOUS TOOLS I
      • Different tools use different formats and give different outputs for similar input
      • Running an analysis on multiple tools → keep shifting data formats
      • Concordance?

      Polyphen2 Input
      chr1:888659 T/C
      chr1:1120431 G/A
      chr1:1387764 G/A
      chr1:1421991 G/A
      chr1:1599812 C/T
      chr1:1888193 C/A
      chr1:1900186 T/C
  43. FRAMEWORK FOR COMPARING VARIOUS TOOLS II
      SIFT Input
      1,888659,T,C
      1,1120431,G,A
      1,1387764,G,A
      1,1421991,G,A
      1,1599812,C,T
      1,1888193,C,A
      1,1900186,T,C
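The two input listings describe the same variants in different layouts, so the format shifting can be automated. A minimal sketch, assuming only the two line layouts shown on the slides (the real tools may accept further columns or options not covered here); the function names are hypothetical.

```python
import re

# Matches the Polyphen2-style line shown on the slide, e.g. "chr1:888659 T/C".
_POLYPHEN2_LINE = re.compile(r"^chr(\w+):(\d+)\s+([ACGT])/([ACGT])$")


def polyphen2_to_sift(line):
    """Convert 'chr1:888659 T/C' to the SIFT-style '1,888659,T,C'."""
    match = _POLYPHEN2_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised variant line: {line!r}")
    return ",".join(match.groups())


def sift_to_polyphen2(line):
    """Convert '1,888659,T,C' to the Polyphen2-style 'chr1:888659 T/C'."""
    chrom, pos, ref, alt = line.strip().split(",")
    return f"chr{chrom}:{pos} {ref}/{alt}"


converted = [polyphen2_to_sift(v) for v in ["chr1:888659 T/C", "chr1:1120431 G/A"]]
```

Keeping one canonical representation and converting at the boundary of each tool avoids maintaining several hand-edited copies of the same variant list.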
  44. DRIVER MUTATIONS: TOOLS DON'T AGREE
      [Figure: plot of MA score (Y axis) against Condel score (X axis)]
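Disagreement between two tools can be summarised as the fraction of shared variants on which they make the same call. The sketch below is one possible definition, not one taken from the slides; the variant keys come from the input listing, but the calls assigned to them are hypothetical and for illustration only.

```python
def concordance(calls_a, calls_b):
    """Fraction of variants scored by both tools that receive the same call."""
    shared = calls_a.keys() & calls_b.keys()
    if not shared:
        raise ValueError("the two tools have no variants in common")
    agreeing = sum(calls_a[v] == calls_b[v] for v in shared)
    return agreeing / len(shared)


# Hypothetical calls; the labels do not reflect real tool output.
tool_a = {
    "chr1:888659 T/C": "deleterious",
    "chr1:1120431 G/A": "neutral",
    "chr1:1387764 G/A": "deleterious",
    "chr1:1421991 G/A": "neutral",
}
tool_b = {
    "chr1:888659 T/C": "deleterious",
    "chr1:1120431 G/A": "deleterious",
    "chr1:1387764 G/A": "neutral",
    "chr1:1421991 G/A": "neutral",
}

agreement = concordance(tool_a, tool_b)
```

Comparing categorical calls requires choosing a threshold for each tool's continuous score first, and that choice itself affects the reported agreement.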
  45. Solution?
      Galaxy [?], an open-source web-based platform for bioinformatics, makes it possible to represent the entire data analysis pipeline in an intuitive graphical interface.
      Figure: Galaxy Workflow, polyphen2 algorithm
  46. Run all tools in one go:
      Figure: Run all tools
  47. Compare all tools:
      Figure: Compare all tools
  48. VIRAL GENOME DETECTION
      • Cervical cancers have been proven to be associated with Human Papillomavirus (HPV)
      • Cervical cancer datasets from Indian women were analysed to detect:
        1. Any possible HPV integration
        2. Sites of HPV integration
      Who cares?
      • Whole genome sequencing can be replaced by targeted sequencing at the sites where the virus has been detected in a cohort of samples, thus speeding up the whole process.
  49. [Workflow diagram, built up over slides 49 to 55:]
      RawData → Align data to human genome reference → Extract unmapped regions → Align unmapped regions to virus genome → Extract mapped regions → BLAST
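The "extract unmapped regions" step can be sketched in pure Python, assuming the aligner writes SAM text output (an assumption; the slides do not name the file format). In SAM, bit 0x4 of the FLAG field marks a read as unmapped. The function names and the three toy records are hypothetical, and in practice a dedicated SAM/BAM toolkit would normally do this filtering.

```python
UNMAPPED = 0x4  # SAM FLAG bit meaning "segment unmapped"


def unmapped_reads(sam_lines):
    """Yield (name, sequence, quality) for reads the aligner left unmapped."""
    for line in sam_lines:
        if line.startswith("@"):  # header line, carries no read
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
        if flag & UNMAPPED:
            yield name, seq, qual


def to_fastq(records):
    """Format records as FASTQ so they can be re-aligned to the virus genome."""
    return "".join(f"@{name}\n{seq}\n+\n{qual}\n" for name, seq, qual in records)


# Hypothetical toy SAM content: one header, one mapped read, one unmapped read.
sam = [
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGGCCAA\tIIIIIIII",
]
fastq = to_fastq(unmapped_reads(sam))
```

The resulting FASTQ text is the input to the next step of the workflow, alignment against the virus genome.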
  56. GALAXY WORKFLOW
  57. Figure: Aligned HPV genomes
  58. REPRODUCIBILITY
      • In pursuit of novel 'discovery', standardizing the data analysis pipeline is often ignored, leading to dubious conclusions
      • Analysis should be reproducible and, above all, correct
      • Parameter values can change the results by a large factor; they need to be documented/logged
      • Garbage in, garbage out
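Documenting parameters can be as light as writing one structured record per run. The sketch below is an illustration under stated assumptions: the tool name, version string, parameter names and the `run_record` helper are all hypothetical, and a workflow system may record the same information automatically.

```python
import hashlib
import json


def run_record(tool, version, params, input_bytes):
    """Bundle what is needed to repeat a run into one JSON-serialisable dict."""
    return {
        "tool": tool,
        "version": version,
        "params": dict(sorted(params.items())),
        # A checksum pins down exactly which input the results came from.
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
    }


# Hypothetical tool name, version and parameters, for illustration only.
record = run_record(
    "example-aligner",
    "0.0.0",
    {"seed_length": 32, "max_mismatches": 2},
    b"ACGT\n",
)
log_line = json.dumps(record, sort_keys=True)
```

Appending one such line per run to a log file gives a trail that can be diffed when two analyses of the "same" data disagree.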
  59. CONCLUSIONS
      With the Galaxy toolbox for identifying significant mutations in place, and the science behind the methods studied, the next steps would be to:
      • Open-source the toolbox to the community: a tool makes little sense if it is not in a usable form; community feedback will be used to add more tools and improve the existing ones
      • Develop a new method for driver mutation prediction: all the existing methods have a low level of concordance. A new method that takes into account the available data at all levels (mutations, transcriptome and microarray data) is possible. With the Galaxy toolbox in place, it would be possible to integrate information at these various levels
  60. FUTURE WORK
      • Develop an algorithm that integrates the machine learning approach with the functional approach by zeroing in on only those attributes that are known to have an impact
      • The algorithm would also account for information at other levels: RNA expression and clinical data. Integrating information at all levels would provide deeper insight
      • The developed Galaxy toolbox will be used as the basic framework for integrating information
  61. REFERENCES I
      [1] Hannah Carter, Sining Chen, Leyla Isik, Svitlana Tyekucheva, Victor E. Velculescu, Kenneth W. Kinzler, Bert Vogelstein, and Rachel Karchin. Cancer-specific high-throughput annotation of somatic mutations: computational prediction of driver missense mutations. Cancer Research, 69(16):6660–6667, 2009.