SlideShare a Scribd company logo
Gene Duplication and
Read Mapping
Lecture – 5
Nafis Neehal, Lecturer, Department of CSE, DIU
CONTENTS
1. Mutation
2. Gene Duplication
3. Read Mapping
- Keyword Tree
- Suffix Tree
- Suffix Array
- Burrows Wheeler Transform
4. Maximum Likelihood Estimation
1. DNA Mutation
What and how mutation occurs, common forms
Mutation
DNA Mutation refers to sudden, random changes
in DNA sequences which leads to different
phenotypic expressions.
A T C C G A
A T G C C G A
Insertion
Common Mutation Types
Substitution
AATTCGCA
AATGCGCA
Deletion
AATTCGCA
AATCGCA
Insertion
AATCGCA
AATTCGCA
Duplication
AATCGCA
AATCATCGCAInversion
AATCGCA AACGGCA
AGCATCG ACTATCG
2. Gene Duplication
Duplication of Genes, Homolog, Ortholog, Paralogs
Gene Duplication
Gene duplication (or chromosomal duplication or
gene amplification) is a major mechanism
through which new genetic material is generated
during molecular evolution. It can be defined as
any duplication of a region of DNA that contains a
gene.
Homolog, Ortholog, Paralog and Speciation
• Homolog - A gene related to a second
gene by descent from a common
ancestral DNA sequence
• Ortholog - Orthologs are genes in
different species that evolved from a
common ancestral gene by speciation*
• Paralog - Paralogs are genes related by
duplication within a genome
• Speciation* - Speciation is the origin of a
new species capable of making a living in
a new way from the species from which
it arose
3. Read Mapping
Short Read Mapping, Genome Indexing
Read Mapping
Mapping refers to the process of
aligning short reads to and finding
the starting position in a reference
sequence (typically Genome).
Short read generally are reads with a
length of 30-350 base pairs.
Genome Indexing (Keyword Tree)
▹ Stores a set of keywords in a rooted labeled
tree.
▹ Each edge is labeled with a letter from an
alphabet.
▹ Any two edges coming out of the same
vertex have distinct labels.
▹ Every keyword stored can be spelled on a
path from root to some leaf.
▹ Furthermore, every path from root to leaf
gives a keyword.
Keywords
▹ Apple
▹ Apropos
▹ Banana
▹ Bandana
▹ Orange
Genome Indexing (Suffix Tree)
▹ Similar to Keyword Tree
▹ Suffixes of the text are keywords
▹ Edges that form paths are collapsed
▹ Each edge is labeled with a substring of the
text
▹ All internal edges have at least two outgoing
edges.
▹ Leaves are labeled by the index of the
pattern.
Suffix tree of ATCATG
Genome Indexing (Suffix Array)
▹ More space efficient than suffix tree
▹ Suffix tree index for human genome is
about 47 GB
▹ Lexicographically sort all the suffixes
▹ Store the starting indices of the suffixes
along with the original string
Generate Suffix Array of
ATCATG
1 ATCATG$
2 TCATG$
3 CATG$
4 ATG$
5 TG$
6 G$
7 $
Sort the suffixes
lexicographically
7 $
1 ATCATG$
4 ATG$
3 CATG$
6 G$
2 TCATG$
5 TG$
Genome Indexing (Burrows Wheeler Transform)
▹ Given Sequence – abaaba
▹ Add $ as ending notation –
abaaba$
▹ By Shifting each alphabet to the
right once, generate all the
rotations
▹ Lexicographically Sort all the
rotations
▹ The very last column will be
denoted as BWT (T)
Genome Indexing (Burrows Wheeler Transform)
▹ Given Sequence – abaaba
▹ Add $ as ending notation –
abaaba$
▹ Lexicographically sorted all
rotations will generate BWT
Matrix which will be denoted as
BWM (T)
▹ Suffix Array generated from all
the rotations will be called SA (T)
▹ BWM can be derived from any
given BWT (T)
Genome Indexing (Burrows Wheeler Transform)
LF (Last to First) Mapping
▹ Generate Burrows Wheeler
Matrix for a given sequence
▹ Assign numbers to distinguish
same characters
▹ Assign the numbers in a
ascending manner for each
character
Genome Indexing (Burrows Wheeler Transform)
Find out the row starting with b1 using LF Mapping
1. Start from the row containing $ in the First Column
2. Find out what’s in Last Column of that row (here its a0)
3. Compare it with query (b1)
4. If MATCH, then
- Find b1 in First Column
- Print row number
- Terminate
5. If No MATCH, then
- Find the row with that element in the First column
- Go to Step 2 and Repeat
Start
Genome Indexing (Burrows Wheeler Transform)
Find Original Gene using LF Mapping if
BWT (T) is Given
1. Original Gene = abaaba (Not Given)
2. Given BWT (T) = abba$aa
3. Store it as Last Column
4. Draw the First Column by sorting
the elements of Last Column
Lexicographically
5. Assign numbers to distinguish
characters in an ascending manner
6. Start LF Mapping from Starting
Element ($)
7. For each element found in the
LAST column, write it from right to
left
$ a0
a0 b0
a1 b1
a2 a1
a3 $
b0 a2
b1 a3
F L
$abaaba
FINISH
Start
Whales and DolphinsTheir ancestors had back legs once, they could walk
Humans have tails
While they are inside the womb! It dissolves
eventually
Birds came from Dinosaurs
And they both descended from Reptiles
Bacterium
All livings beings can be traced back to a
bacterium
Mid-Done

More Related Content

Similar to Lecture 5

Cg7 trees
Cg7 treesCg7 trees
Cg7 trees
Anasua Sarkar
 
RNA-Seq_Presentation
RNA-Seq_PresentationRNA-Seq_Presentation
RNA-Seq_PresentationToyin23
 
phylogenetics.pdf
phylogenetics.pdfphylogenetics.pdf
phylogenetics.pdf
SrimathideviJ
 
Bioinformatica 08-12-2011-t8-go-hmm
Bioinformatica 08-12-2011-t8-go-hmmBioinformatica 08-12-2011-t8-go-hmm
Bioinformatica 08-12-2011-t8-go-hmm
Prof. Wim Van Criekinge
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
Nawfal Aldujaily
 
PHYLOGENETIC ANALYSIS_CSS2.pptx
PHYLOGENETIC ANALYSIS_CSS2.pptxPHYLOGENETIC ANALYSIS_CSS2.pptx
PHYLOGENETIC ANALYSIS_CSS2.pptx
Silpa87
 
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...
De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au ...De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au ...
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...
Torsten Seemann
 
Marker devt. workshop 27022012
Marker devt. workshop 27022012Marker devt. workshop 27022012
Marker devt. workshop 27022012
Koppolu Ravi
 
Practicum Pressentation PDF
Practicum Pressentation PDFPracticum Pressentation PDF
Practicum Pressentation PDFGui Chen
 
Rna seq pipeline
Rna seq pipelineRna seq pipeline
Rna seq pipeline
Karan Veer Singh
 
Exploring the Solution Space of Sorting by Reversals: A New Approach
Exploring the Solution Space of Sorting by Reversals: A New ApproachExploring the Solution Space of Sorting by Reversals: A New Approach
Exploring the Solution Space of Sorting by Reversals: A New Approach
IDES Editor
 
Molecular phylogenetics
Molecular phylogeneticsMolecular phylogenetics
Molecular phylogenetics
Ajay Kumar Chandra
 
RNASeq Experiment Design
RNASeq Experiment DesignRNASeq Experiment Design
RNASeq Experiment Design
Yaoyu Wang
 
Maximum parsimony
Maximum parsimonyMaximum parsimony
Maximum parsimony
Shruthi Krishnaswamy
 
SyMAP Master's Thesis Presentation
SyMAP Master's Thesis PresentationSyMAP Master's Thesis Presentation
SyMAP Master's Thesis Presentation
austinps
 
Dna data compression algorithms based on redundancy
Dna data compression algorithms based on redundancyDna data compression algorithms based on redundancy
Dna data compression algorithms based on redundancy
ijfcstjournal
 
Apolo Taller en BIOS
Apolo Taller en BIOS Apolo Taller en BIOS
Apolo Taller en BIOS
Monica Munoz-Torres
 
Apollo Introduction for the Chestnut Research Community
Apollo Introduction for the Chestnut Research CommunityApollo Introduction for the Chestnut Research Community
Apollo Introduction for the Chestnut Research Community
Monica Munoz-Torres
 
Apollo - A webinar for the Phascolarctos cinereus research community
Apollo - A webinar for the Phascolarctos cinereus research communityApollo - A webinar for the Phascolarctos cinereus research community
Apollo - A webinar for the Phascolarctos cinereus research community
Monica Munoz-Torres
 

Similar to Lecture 5 (20)

Cg7 trees
Cg7 treesCg7 trees
Cg7 trees
 
RNA-Seq_Presentation
RNA-Seq_PresentationRNA-Seq_Presentation
RNA-Seq_Presentation
 
phylogenetics.pdf
phylogenetics.pdfphylogenetics.pdf
phylogenetics.pdf
 
Bioinformatica 08-12-2011-t8-go-hmm
Bioinformatica 08-12-2011-t8-go-hmmBioinformatica 08-12-2011-t8-go-hmm
Bioinformatica 08-12-2011-t8-go-hmm
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 
PHYLOGENETIC ANALYSIS_CSS2.pptx
PHYLOGENETIC ANALYSIS_CSS2.pptxPHYLOGENETIC ANALYSIS_CSS2.pptx
PHYLOGENETIC ANALYSIS_CSS2.pptx
 
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...
De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au ...De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au ...
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...
 
Marker devt. workshop 27022012
Marker devt. workshop 27022012Marker devt. workshop 27022012
Marker devt. workshop 27022012
 
Practicum Pressentation PDF
Practicum Pressentation PDFPracticum Pressentation PDF
Practicum Pressentation PDF
 
Rna seq pipeline
Rna seq pipelineRna seq pipeline
Rna seq pipeline
 
Exploring the Solution Space of Sorting by Reversals: A New Approach
Exploring the Solution Space of Sorting by Reversals: A New ApproachExploring the Solution Space of Sorting by Reversals: A New Approach
Exploring the Solution Space of Sorting by Reversals: A New Approach
 
Molecular phylogenetics
Molecular phylogeneticsMolecular phylogenetics
Molecular phylogenetics
 
RNASeq Experiment Design
RNASeq Experiment DesignRNASeq Experiment Design
RNASeq Experiment Design
 
Maximum parsimony
Maximum parsimonyMaximum parsimony
Maximum parsimony
 
SyMAP Master's Thesis Presentation
SyMAP Master's Thesis PresentationSyMAP Master's Thesis Presentation
SyMAP Master's Thesis Presentation
 
Est database
Est databaseEst database
Est database
 
Dna data compression algorithms based on redundancy
Dna data compression algorithms based on redundancyDna data compression algorithms based on redundancy
Dna data compression algorithms based on redundancy
 
Apolo Taller en BIOS
Apolo Taller en BIOS Apolo Taller en BIOS
Apolo Taller en BIOS
 
Apollo Introduction for the Chestnut Research Community
Apollo Introduction for the Chestnut Research CommunityApollo Introduction for the Chestnut Research Community
Apollo Introduction for the Chestnut Research Community
 
Apollo - A webinar for the Phascolarctos cinereus research community
Apollo - A webinar for the Phascolarctos cinereus research communityApollo - A webinar for the Phascolarctos cinereus research community
Apollo - A webinar for the Phascolarctos cinereus research community
 

More from Owali Shawon

Lecture 7
Lecture 7Lecture 7
Lecture 7
Owali Shawon
 
Lecture 8
Lecture 8Lecture 8
Lecture 8
Owali Shawon
 
Lecture 6
Lecture 6Lecture 6
Lecture 6
Owali Shawon
 
Lecture 2
Lecture 2Lecture 2
Lecture 2
Owali Shawon
 
Lecture 4
Lecture 4Lecture 4
Lecture 4
Owali Shawon
 
Lecture 3
Lecture 3Lecture 3
Lecture 3
Owali Shawon
 
Lecture 1
Lecture 1Lecture 1
Lecture 1
Owali Shawon
 
ABOUT ME!
ABOUT ME! ABOUT ME!
ABOUT ME!
Owali Shawon
 
Electrical Circuit
Electrical CircuitElectrical Circuit
Electrical Circuit
Owali Shawon
 
Short Review of Stack
Short Review of StackShort Review of Stack
Short Review of Stack
Owali Shawon
 

More from Owali Shawon (10)

Lecture 7
Lecture 7Lecture 7
Lecture 7
 
Lecture 8
Lecture 8Lecture 8
Lecture 8
 
Lecture 6
Lecture 6Lecture 6
Lecture 6
 
Lecture 2
Lecture 2Lecture 2
Lecture 2
 
Lecture 4
Lecture 4Lecture 4
Lecture 4
 
Lecture 3
Lecture 3Lecture 3
Lecture 3
 
Lecture 1
Lecture 1Lecture 1
Lecture 1
 
ABOUT ME!
ABOUT ME! ABOUT ME!
ABOUT ME!
 
Electrical Circuit
Electrical CircuitElectrical Circuit
Electrical Circuit
 
Short Review of Stack
Short Review of StackShort Review of Stack
Short Review of Stack
 

Recently uploaded

Lateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensiveLateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensive
silvermistyshot
 
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdfDMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
fafyfskhan251kmf
 
general properties of oerganologametal.ppt
general properties of oerganologametal.pptgeneral properties of oerganologametal.ppt
general properties of oerganologametal.ppt
IqrimaNabilatulhusni
 
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptxBody fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
muralinath2
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Erdal Coalmaker
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
Columbia Weather Systems
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
muralinath2
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
Nistarini College, Purulia (W.B) India
 
Hemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptxHemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptx
muralinath2
 
NuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final versionNuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final version
pablovgd
 
GBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram StainingGBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram Staining
Areesha Ahmad
 
In silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxIn silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptx
AlaminAfendy1
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
YOGESH DOGRA
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
sanjana502982
 
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of LipidsGBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
Areesha Ahmad
 
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
Scintica Instrumentation
 
in vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptxin vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptx
yusufzako14
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
moosaasad1975
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
RenuJangid3
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
University of Maribor
 

Recently uploaded (20)

Lateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensiveLateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensive
 
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdfDMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
 
general properties of oerganologametal.ppt
general properties of oerganologametal.pptgeneral properties of oerganologametal.ppt
general properties of oerganologametal.ppt
 
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptxBody fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
 
Hemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptxHemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptx
 
NuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final versionNuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final version
 
GBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram StainingGBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram Staining
 
In silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxIn silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptx
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
 
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of LipidsGBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
 
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
 
in vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptxin vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptx
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
 

Lecture 5

  • 1. Gene Duplication and Read Mapping Lecture – 5 Nafis Neehal, Lecturer, Department of CSE, DIU
  • 2. CONTENTS 1. Mutation 2. Gene Duplication 3. Read Mapping - Keyword Tree - Suffix Tree - Suffix Array - Burrows Wheeler Transform 4. Maximum Likelihood Estimation
  • 3. 1. DNA Mutation What and how mutation occurs, common forms
  • 4. Mutation DNA Mutation refers to sudden, random changes in DNA sequences which leads to different phenotypic expressions. A T C C G A A T G C C G A Insertion
  • 6. 2. Gene Duplication Duplication of Genes, Homolog, Ortholog, Paralogs
  • 7. Gene Duplication Gene duplication (or chromosomal duplication or gene amplification) is a major mechanism through which new genetic material is generated during molecular evolution. It can be defined as any duplication of a region of DNA that contains a gene.
  • 8. Homolog, Ortholog, Paralog and Speciation • Homolog - A gene related to a second gene by descent from a common ancestral DNA sequence • Ortholog - Orthologs are genes in different species that evolved from a common ancestral gene by speciation* • Paralog - Paralogs are genes related by duplication within a genome • Speciation* - Speciation is the origin of a new species capable of making a living in a new way from the species from which it arose
  • 9. 3. Read Mapping Short Read Mapping, Genome Indexing
  • 10. Read Mapping Mapping refers to the process of aligning short reads to and finding the starting position in a reference sequence (typically Genome). Short read generally are reads with a length of 30-350 base pairs.
  • 11. Genome Indexing (Keyword Tree) ▹ Stores a set of keywords in a rooted labeled tree. ▹ Each edge is labeled with a letter from an alphabet. ▹ Any two edges coming out of the same vertex have distinct labels. ▹ Every keyword stored can be spelled on a path from root to some leaf. ▹ Furthermore, every path from root to leaf gives a keyword. Keywords ▹ Apple ▹ Apropos ▹ Banana ▹ Bandana ▹ Orange
  • 12. Genome Indexing (Suffix Tree) ▹ Similar to Keyword Tree ▹ Suffixes of the text are keywords ▹ Edges that form paths are collapsed ▹ Each edge is labeled with a substring of the text ▹ All internal edges have at least two outgoing edges. ▹ Leaves are labeled by the index of the pattern. Suffix tree of ATCATG
  • 13. Genome Indexing (Suffix Array) ▹ More space efficient than suffix tree ▹ Suffix tree index for human genome is about 47 GB ▹ Lexicographically sort all the suffixes ▹ Store the starting indices of the suffixes along with the original string Generate Suffix Array of ATCATG 1 ATCATG$ 2 TCATG$ 3 CATG$ 4 ATG$ 5 TG$ 6 G$ 7 $ Sort the suffixes lexicographically 7 $ 1 ATCATG$ 4 ATG$ 3 CATG$ 6 G$ 2 TCATG$ 5 TG$
  • 14. Genome Indexing (Burrows Wheeler Transform) ▹ Given Sequence – abaaba ▹ Add $ as ending notation – abaaba$ ▹ By Shifting each alphabet to the right once, generate all the rotations ▹ Lexicographically Sort all the rotations ▹ The very last column will be denoted as BWT (T)
  • 15. Genome Indexing (Burrows Wheeler Transform) ▹ Given Sequence – abaaba ▹ Add $ as ending notation – abaaba$ ▹ Lexicographically sorted all rotations will generate BWT Matrix which will be denoted as BWM (T) ▹ Suffix Array generated from all the rotations will be called SA (T) ▹ BWM can be derived from any given BWT (T)
  • 16. Genome Indexing (Burrows Wheeler Transform) LF (Last to First) Mapping ▹ Generate Burrows Wheeler Matrix for a given sequence ▹ Assign numbers to distinguish same characters ▹ Assign the numbers in a ascending manner for each character
  • 17. Genome Indexing (Burrows Wheeler Transform) Find out the row starting with b1 using LF Mapping 1. Start from the row containing $ in the First Column 2. Find out what’s in Last Column of that row (here its a0) 3. Compare it with query (b1) 4. If MATCH, then - Find b1 in First Column - Print row number - Terminate 5. If No MATCH, then - Find the row with that element in the First column - Go to Step 2 and Repeat Start
  • 18. Genome Indexing (Burrows Wheeler Transform) Find Original Gene using LF Mapping if BWT (T) is Given 1. Original Gene = abaaba (Not Given) 2. Given BWT (T) = abba$aa 3. Store it as Last Column 4. Draw the First Column by sorting the elements of Last Column Lexicographically 5. Assign numbers to distinguish characters in an ascending manner 6. Start LF Mapping from Starting Element ($) 7. For each element found in the LAST column, write it from right to left $ a0 a0 b0 a1 b1 a2 a1 a3 $ b0 a2 b1 a3 F L $abaaba FINISH Start
  • 19. Whales and DolphinsTheir ancestors had back legs once, they could walk Humans have tails While they are inside the womb! It dissolves eventually Birds came from Dinosaurs And they both descended from Reptiles Bacterium All livings beings can be traced back to a bacterium