Module 2 Sequence similarity.
Part of bioinformatics training session "Basic Bioinformatics concepts, databases and tools" - http://www.bits.vib.be/training
A scoring system is a set of values that quantifies the likelihood of one residue being substituted by another in an alignment.
It is also known as a substitution matrix.
The scoring matrix for nucleotides is relatively simple.
A positive value or high score is given for a match, and a negative value or low score is given for a mismatch.
Scoring matrices for amino acids are more complicated because scoring has to reflect the physicochemical properties of amino acid residues.
Sequence assembly refers to aligning and merging fragments from a longer DNA sequence in order to reconstruct the original sequence. This is needed as DNA sequencing technology cannot read whole genomes in one go, but rather reads small pieces of between 20 and 30,000 bases, depending on the technology used. Typically the short fragments, called reads, result from shotgun sequencing genomic DNA, or gene transcript (ESTs).
The problem of sequence assembly can be compared to taking many copies of a book, passing each of them through a shredder with a different cutter, and piecing the text of the book back together just by looking at the shredded pieces. Besides the obvious difficulty of this task, there are some extra practical issues: the original may have many repeated paragraphs, and some shreds may be modified during shredding to have typos. Excerpts from another book may also be added in, and some shreds may be completely unrecognizable.
222397 lecture 16 17
1. HBC1011 Biochemistry I
Lecture 16 and 17 – Exploring
Evolution and Bioinformatics
Ng Chong Han, PhD
ITAR1010, 06-2523751
chng@mmu.edu.my
2.
3. Overview
• Homology, paralogs, orthologs, convergent & divergent evolution
• Statistical analysis of sequence alignments
• Evolutionary relationships: protein sequences & tertiary structures
• Evolutionary tree
4. Evolutionary relationships are present in protein sequences.
The human myoglobin sequence (red) differs from the chimpanzee sequence
(blue) in only one amino acid in a protein chain of 153 residues
5. Homologs are molecules derived from a
common ancestor
• Exploration of biochemical evolution attempts to determine how proteins, other molecules, & biochemical pathways have been transformed through time.
• Most fundamental relationship between entities = homology
• 2 molecules are said to be homologous if they have been derived from a common ancestor.
• Homologs are found by searching sequence databases and comparing sequences.
• Gene duplication: any duplication of a region of DNA that contains a gene. Duplications are generated during molecular evolution and can arise as products of the DNA replication and repair machinery.
6. Homologous molecules = Homologs
Paralogs: homologs present within one species (differ in their detailed biochemical functions, with some exceptions)
Orthologs: homologs present in different species (very similar or identical functions, with some exceptions)
7. 2 classes of homologs
Homologs that perform identical or
very similar functions in different
organisms are called orthologs,
whereas homologs that perform
different functions within one
organism are called paralogs.
8. Orthology
• Homologous sequences are orthologous if they are inferred
to be descended from the same ancestral sequence
separated by a speciation event: when a species diverges
into two separate species.
• For instance, the plant Flu regulatory protein is present both
in Arabidopsis (multicellular higher plant) and
Chlamydomonas (single cell green algae). The complex
Chlamydomonas version can fully substitute the much
simpler Arabidopsis protein, if transferred from algae to plant
genome by means of molecular cloning.
• Orthologs often, but not always, have the same function.
9. Orthology
• Orthologous sequences provide useful information in taxonomic classification and phylogenetic studies of organisms.
• Two organisms that are very closely related are likely to display very similar DNA sequences between two orthologs. Conversely, an organism that is further removed evolutionarily from another organism is likely to display a greater divergence in the sequence of the orthologs being studied.
10. Paralogy
• Homologous sequences are paralogous if they were created by a duplication event within the genome.
• For gene duplication events, if a gene in an organism is duplicated to occupy two different positions in the same genome, then the two copies are paralogous.
• Paralogous genes often belong to the same species, but this is not necessary: e.g., the hemoglobin gene of humans and the myoglobin gene of chimpanzees are paralogs.
11. Paralogy
• Paralogous sequences provide useful and dramatic insight into some of the ways genomes evolve.
• Function is not always conserved, however.
• Human angiogenin diverged from ribonuclease, for example, and while the two paralogs remain similar in tertiary structure, their functions within the cell are now quite different.
12. Paralogy regions
• Sometimes, large chromosomal regions share gene content similar to other chromosomal regions within the same genome.
• Examples of paralogy regions include regions of human chromosomes 2, 7, 12 and 17 containing Hox gene clusters, collagen genes and keratin genes.
13. (common ancestor)
Two segments of DNA can have shared ancestry because of either
a speciation event (orthologs) or a duplication event (paralogs).
14. The importance of the study of homology
• Reveals the evolutionary history of molecules
• Gives information about their function
• E.g., if a newly sequenced protein is homologous to an already characterized protein, this is a strong indication of the new protein's biochemical function.
15. Statistical analysis of sequence alignments
can detect homology
• How can we know whether 2 human proteins are paralogs or whether a yeast protein is the ortholog of a human protein?
• Significant sequence similarity between 2 molecules = likely to have the same evolutionary origin & therefore the same 3-D structure, function & mechanism.
• Since protein sequences are better conserved evolutionarily than nucleotide sequences, protein sequence comparison produces more reliable and accurate results when dealing with coding DNA.
16. Sequence comparison methods
• The sequences of two proteins that have an ancestor in common
will have diverged in a variety of ways.
• Insertions and deletions may have occurred at the ends of the
proteins or within the functional domains themselves.
• Individual amino acids may have been mutated to other residues
of varying degrees of similarity.
[Figure: human hemoglobin (α chain), 141 a.a., & human myoglobin, 153 a.a.]
17. Sequence comparison methods
• Globins
– Myoglobin: binds oxygen in muscle
– Hemoglobin: oxygen-carrying protein in blood,
composed of 2 identical α chains & 2 identical β chains
• Both cradle a heme group: an iron-containing organic molecule that binds the oxygen.
To detect sequence similarity, we perform sequence alignment.
18. How can we tell where to align the 2
sequences?
• Approach:
– Compare all possible juxtapositions of one protein sequence with another, in each case recording the number of identical residues that are aligned with one another.
– Comparison can be accomplished by simply sliding one sequence past the other, one a.a at a time, & counting the number of matched residues.
19. (A) A comparison is made by sliding the sequences of the 2 proteins past each other, 1 amino acid at a time, and counting the number of amino acid identities between the proteins.
(B) The 2 alignments with the largest number of matches are shown above the graph, which plots the matches as a function of alignment.
[Figure label: largest no. of matches]
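The sliding comparison described on these two slides can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the short test sequences are made up for the example, and no gaps are considered at this stage.

```python
def sliding_matches(seq_a, seq_b):
    """Count identities for every ungapped juxtaposition of two sequences.

    Returns a dict mapping offset -> number of identical aligned residues,
    where offset is how far seq_b is shifted to the right relative to seq_a.
    """
    counts = {}
    for offset in range(-(len(seq_b) - 1), len(seq_a)):
        matches = 0
        for j, residue_b in enumerate(seq_b):
            i = j + offset  # position in seq_a facing residue j of seq_b
            if 0 <= i < len(seq_a) and seq_a[i] == residue_b:
                matches += 1
        counts[offset] = matches
    return counts


def best_offset(seq_a, seq_b):
    """Return (offset, identities) for the juxtaposition with most matches."""
    counts = sliding_matches(seq_a, seq_b)
    offset = max(counts, key=counts.get)
    return offset, counts[offset]


if __name__ == "__main__":
    # Hypothetical toy sequences, not real protein data.
    print(best_offset("GLSDGEWQLV", "DGEWQ"))
```

Plotting the values of `sliding_matches` against the offset gives the kind of graph described in (B): most offsets give a few chance identities, and the best juxtapositions stand out as peaks.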
20. Alignment with gap insertion
• The sequences can be aligned to capture most of the identities by introducing a gap into one of the sequences.
• Gaps are inserted to compensate for the insertions/deletions of nucleotides that may have taken place in the gene.
• Gaps increase the complexity of sequence alignment, because a gap can be of arbitrary size.
• Method: use a scoring system to compare different alignments & include gap penalties (to prevent the insertion of an unreasonable number of gaps)
21. Alignment with gap insertion:
Scoring system
• The alignment of α hemoglobin & myoglobin after a gap has been inserted into the hemoglobin α sequence
Identity between aligned residues = +10 points; gap (regardless of size) = -25 points.
38 identities & 1 gap: score = (38 × 10) + (1 × -25) = 355
38 matched amino acids over an average length of 147 residues ((153 + 141)/2), so the sequences are 25.9% (38/147 × 100) identical.
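The arithmetic on this slide can be checked with a short Python sketch. The +10 and -25 values and the sequence lengths come from the slide; the function names and the toy aligned strings are assumptions made for illustration, with "-" standing for a gap position.

```python
def count_identities_and_gaps(aligned_a, aligned_b):
    """Count identical columns and gaps in two aligned, equal-length strings.

    A run of consecutive gap columns counts as one gap, regardless of size.
    """
    identities = 0
    gaps = 0
    in_gap = False
    for a, b in zip(aligned_a, aligned_b):
        is_gap = a == "-" or b == "-"
        if is_gap and not in_gap:
            gaps += 1  # a new gap starts here
        elif not is_gap and a == b:
            identities += 1
        in_gap = is_gap
    return identities, gaps


def identity_score(identities, gaps, match_points=10, gap_penalty=25):
    """Score = +10 per identity, -25 per gap (values from the slide)."""
    return identities * match_points - gaps * gap_penalty


def percent_identity(identities, len_a, len_b):
    """Identities as a percentage of the mean length of the two sequences."""
    return 100 * identities / ((len_a + len_b) / 2)


if __name__ == "__main__":
    # Numbers from the hemoglobin/myoglobin slide: 38 identities, 1 gap.
    print(identity_score(38, 1))                     # 355
    print(round(percent_identity(38, 153, 141), 1))  # 25.9
```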
22. The statistical significance of alignments can
be estimated by shuffling
• Because proteins are composed of the same set of 20 amino acids, the alignment of any two unrelated proteins will yield some identities, especially if gaps are allowed.
• Even if two proteins have identical amino acid composition, they may not be linked by evolution. It is the order of the residues that implies a relationship.
How can we estimate the probability that a specific series of identities is a chance occurrence?
23. The statistical significance of alignments can
be estimated by shuffling
• One of the sequences is shuffled and aligned again, and this process is repeated many times to yield a histogram of scores. If the score from the original alignment is far higher than the scores from random shuffling, the high alignment score is unlikely to have occurred by chance.
[Figure labels: original alignment score; random alignment scores]
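The shuffling test can be sketched as follows. This is a simplified illustration, not the procedure used to produce the histogram on the slide: it scores alignments by the best ungapped identity count rather than by a gapped alignment score, and the function names, the number of trials, and the random seed are all assumptions.

```python
import random


def max_ungapped_identities(seq_a, seq_b):
    """Best identity count over all ungapped juxtapositions (stand-in score)."""
    best = 0
    for offset in range(-(len(seq_b) - 1), len(seq_a)):
        matches = sum(
            1
            for j, residue in enumerate(seq_b)
            if 0 <= j + offset < len(seq_a) and seq_a[j + offset] == residue
        )
        best = max(best, matches)
    return best


def shuffle_test(seq_a, seq_b, trials=1000, seed=0):
    """Compare the observed score with scores obtained after shuffling seq_b.

    Shuffling keeps the amino acid composition but destroys the residue order.
    Returns (observed score, list of shuffled scores, empirical p-value).
    """
    rng = random.Random(seed)
    observed = max_ungapped_identities(seq_a, seq_b)
    residues = list(seq_b)
    shuffled_scores = []
    for _ in range(trials):
        rng.shuffle(residues)
        shuffled_scores.append(max_ungapped_identities(seq_a, "".join(residues)))
    at_least_as_good = sum(score >= observed for score in shuffled_scores)
    # Add-one correction so the estimate is never exactly zero.
    p_value = (at_least_as_good + 1) / (trials + 1)
    return observed, shuffled_scores, p_value
```

A histogram of `shuffled_scores` with the observed score marked on it corresponds to the figure on this slide. A small p-value means that few shuffled sequences score as well as the original.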
24. Distant evolutionary relationships can be
detected through the use of substitution matrices
• The scoring scheme discussed previously assigned points only to positions occupied by identical amino acids
• No credit for non-identical amino acids
• How about substitutions?
• A scoring system based solely on amino acid identity cannot account for these changes.
25. Types of substitution
Substitutions are either conservative or nonconservative.
• Conservative: replacing one a.a with another that is similar in size and chemical properties. May have minor effects on protein structure and can thus be tolerated without compromising function.
• Nonconservative: an amino acid replaces one that is dissimilar.
Conservative and single-nucleotide substitutions are likely to be more common than are substitutions with more radical effects.
26. Substitution matrix
• Substitution matrix: a scoring system for the replacement of any amino acid with each of the other 19 amino acids.
• A large positive score corresponds to a substitution that occurs relatively frequently
• A large negative score corresponds to a substitution that occurs only rarely
• When 2 sequences are compared, each substitution is assigned a score based on the matrix.
Blosum-62: Blocks of amino acid substitution matrix
27. Blosum-62 substitution matrix.
Arginine/lysine substitution: conservative
Valine/lysine substitution: nonconservative
Amino acid order on the matrix axes: D E H K R N Q S T A C G P F I L M V W Y
Colour key: red, charged; green, polar; blue, large and hydrophobic; black, other
28. Blosum-62 score
• A single-residue gap: -12 points
• Each additional residue in the gap: -2 points
[Figure legend: identities; conservative substitutions; gap]
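A minimal Python sketch of this scoring scheme is given below. It uses only a small excerpt of the Blosum-62 matrix (a real analysis needs the full 20 × 20 matrix), and the function names and toy aligned strings are assumptions for illustration. The gap penalties are the ones stated on the slide: -12 for the first residue of a gap and -2 for each additional residue.

```python
# Excerpt of Blosum-62 scores; the matrix is symmetric, so each pair is
# stored once. This excerpt is far from complete.
BLOSUM62_EXCERPT = {
    ("K", "K"): 5,
    ("R", "R"): 5,
    ("V", "V"): 4,
    ("K", "R"): 2,   # lysine/arginine: conservative, positive score
    ("K", "V"): -2,  # lysine/valine: nonconservative, negative score
}


def pair_score(a, b, matrix=BLOSUM62_EXCERPT):
    """Look up a substitution score, trying both orders of the pair."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]  # raises KeyError if the pair is not in the excerpt


def alignment_score(aligned_a, aligned_b, matrix=BLOSUM62_EXCERPT,
                    gap_open=-12, gap_extend=-2):
    """Score two aligned, equal-length strings in which '-' marks a gap."""
    total = 0
    in_gap = False
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            # First residue of a gap costs gap_open, later ones gap_extend.
            total += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            total += pair_score(a, b, matrix)
            in_gap = False
    return total
```

Unlike the identity-only scheme, a conservative substitution such as lysine/arginine now adds to the score, while a nonconservative one such as lysine/valine subtracts from it.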
29. Blosum-62 score
• The alignment of hemoglobin & myoglobin with conservative
substitutions indicated by yellow shading and identities by
orange. Score = 115
30. Blosum-62
• Blosum-62: detects homology between less obviously related sequences (it scores conservative substitutions, not only identities)
• Alignment of human myoglobin & lupine (plant) leghemoglobin. Identities are shown in orange boxes; conservative substitutions are also marked. These sequences are 23% identical.
31. Alignment of identities versus Blosum-62
• Alignment of identities: the probability that the alignment occurs by chance alone is high (1 in 20).
• Blosum-62: the probability that the alignment occurs by chance alone is much lower (1 in 300), which supports a firmer conclusion.
32. Sequence analysis – rule of thumb
• For sequences longer than 100 amino acids, sequence identity > 25% = statistically significant similarity = sequences are probably homologous.
• If 2 sequences are less than 15% identical, pairwise comparison alone is unlikely to indicate statistically significant similarity.
• If identity is between 15% and 25%, further analysis is needed.
The lack of a statistically significant degree of sequence similarity does not rule out homology.
Why??
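The rule of thumb can be written as a small decision function. This is only a sketch of the thresholds stated on the slide; the function name and the wording of the returned labels are assumptions, and the result is a guide for further work, not a proof of homology.

```python
def rule_of_thumb(percent_identity, length):
    """Classify a pairwise comparison using the thresholds from the slide.

    percent_identity: percentage of identical residues in the alignment.
    length: sequence length in amino acids (the rule assumes > 100).
    """
    if length <= 100:
        return "rule not applicable (needs more than 100 residues)"
    if percent_identity > 25:
        return "probably homologous"
    if percent_identity < 15:
        return "pairwise comparison alone unlikely to show significance"
    return "further analysis needed"


if __name__ == "__main__":
    # Hemoglobin α vs myoglobin from the earlier slide: 25.9% identical.
    print(rule_of_thumb(25.9, 147))
    # Myoglobin vs leghemoglobin: 23% identical; the length is an assumption.
    print(rule_of_thumb(23, 150))
```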
33. Homology VS Similarity
• Similarity refers to the likeness or % identity between 2 sequences
• Similarity means sharing a statistically significant number of amino acids
• Similarity does not imply homology
• Homology refers to shared ancestry
• Two sequences are homologous if they are derived from a common ancestral sequence
• Homology usually implies similarity
Homology among proteins is often incorrectly concluded on the basis of sequence similarity. High sequence similarity might occur because of convergent evolution, or, as with shorter sequences, because of chance. Such sequences are similar but not homologous.
34. Databases can be searched to identify
homologous sequences
• Database search for homologous sequences: using online resources at NCBI (National Center for Biotechnology Information)
• Procedure: BLAST (Basic Local Alignment Search Tool) search.
• Result: a list of sequence alignments.
• Open reading frame (ORF): protein-coding region
• Hypothetical protein: ORF with no assigned function
35. E value (highlighted in red): the number of sequences with this level of similarity expected to be in the DB by chance is 2 × 10^-25
36. Examination of 3-D structure enhances our
understanding of evolutionary relationship
• To gain a deeper understanding of evolutionary relationships between proteins, we must examine 3-D structures because
– The sequences of many proteins that have descended from a common ancestor have diverged to such an extent that the relationship between the proteins can no longer be detected from their sequences alone.
– Biomolecules generally function as intricate 3-D structures rather than as linear polymers.
– Sequence mutations affect function, & function is directly related to tertiary structure
37. Tertiary structure is more conserved than
primary structure
• Because 3-D structure is much more closely associated with function than sequence is, tertiary structure is more evolutionarily conserved than primary structure.
• E.g., the tertiary structures of the globins are extremely similar even though the similarity between human myoglobin & lupine leghemoglobin is just barely detectable at the sequence level & that between human hemoglobin and lupine leghemoglobin is not statistically significant.
38. Conservation of 3-D structure. The tertiary structures of human hemoglobin,
human myoglobin, & lupine leghemoglobin are conserved. This structural
similarity firmly establishes that the framework that binds the heme group &
facilitates the reversible binding of oxygen has been conserved over a long
evolutionary period.
39. Tertiary structure is more conserved than
primary structure
• Comparison of 3-D structures has revealed striking
similarities between proteins that were not expected
to be related.
• E.g., the protein actin (major component of the cytoskeleton) & heat shock protein 70 (assists protein folding inside the cell)
– Similar in structure, only 15.6% sequence identity
– Paralogs
– Different biological roles, descended from a common ancestor
40. Structures of Actin & Hsp-70. A comparison of the identically colored
elements of secondary structure reveals the overall similarity in structure
despite the difference in biochemical activities.
41. Conserved function sequence
• Regions & residues critical for protein function are more strongly conserved than are other residues.
• E.g., each type of globin contains a bound heme group with an iron atom at its center. A histidine residue that interacts directly with this iron is conserved in all globins.
Key residues or highly conserved sequences within a family of proteins can be used to identify other family members even when the overall level of sequence similarity is below statistical significance.
42. Divergent and Convergent evolution
• Divergent evolution: process by which 2 or more biological characteristics have a common origin, but have diverged over evolutionary time.
• Convergent evolution: process by which very different evolutionary pathways lead to the same solution (different origin points).
How might two unrelated proteins come to resemble each other structurally? Two proteins evolving independently may have converged on a similar structure in order to perform a similar biochemical activity.
43. One example of convergent evolution is found among the serine proteases, which cleave peptide bonds by hydrolysis. The structures of the active sites at which the hydrolysis reaction takes place are remarkably similar.
44. The similarity might suggest that these proteins are homologous.
However, striking differences in the overall structures of these
proteins make an evolutionary relationship extremely unlikely.
45. Evolutionary tree can be constructed on the
basis of sequence information
• Aligned sequences can be used to construct an evolutionary tree in which the length of the branch connecting each pair of proteins is proportional to the number of amino acid differences between the sequences. Branch lengths indicate genetic change, i.e. the longer the branch, the more genetic change has occurred.
• To estimate the approximate dates of gene duplications & other evolutionary events, an evolutionary tree can be calibrated by comparing the deduced branch points with divergence times determined from the fossil record.
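One simple way to turn pairwise amino acid differences into a tree is UPGMA clustering, sketched below in Python. The slides do not say which tree-building method was used, so UPGMA is an assumption chosen for its simplicity. The function names and the toy sequences in the usage note are hypothetical.

```python
from itertools import combinations


def count_differences(seq_a, seq_b):
    """Number of positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def upgma(aligned):
    """Build a tree from a dict {name: aligned sequence} by UPGMA.

    Returns nested tuples (left, right, height), where height is half the
    average distance between the two merged clusters.
    """
    names = list(aligned)
    trees = {i: name for i, name in enumerate(names)}  # cluster id -> subtree
    sizes = {i: 1 for i in trees}                      # cluster id -> leaves
    dist = {
        (i, j): float(count_differences(aligned[names[i]], aligned[names[j]]))
        for i, j in combinations(range(len(names)), 2)
    }
    next_id = len(names)
    while len(trees) > 1:
        i, j = min(dist, key=dist.get)  # closest pair of clusters
        height = dist[(i, j)] / 2
        new_dist = {}
        for k in trees:
            if k in (i, j):
                continue
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            # Size-weighted average distance to the merged cluster.
            new_dist[(k, next_id)] = (
                sizes[i] * d_ik + sizes[j] * d_jk
            ) / (sizes[i] + sizes[j])
        dist = {p: d for p, d in dist.items() if i not in p and j not in p}
        dist.update(new_dist)
        trees[next_id] = (trees.pop(i), trees.pop(j), height)
        sizes[next_id] = sizes.pop(i) + sizes.pop(j)
        next_id += 1
    return next(iter(trees.values()))
```

For example, `upgma({"A": "ACDEFGHIK", "B": "ACDEFGHIR", "C": "AYDQFWHLR"})` first joins A and B, which differ at one position, and then joins C to that pair. Converting heights into dates would additionally require a calibration point, such as a divergence time taken from the fossil record.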
46. An evolutionary tree for globins. The branching structure was deduced by
sequence comparison, whereas the results of fossil studies provided the
overall time scale showing when divergence occurred.
47. Evolutionary trees can be constructed on the
basis of sequence information
How can we estimate the approximate dates of gene
duplications and other evolutionary events?
• Duplication leading to the 2 chains of hemoglobin appears to
have occurred 350 million years ago.
– This estimation is supported by the observation that
jawless fish such as the lamprey, which diverged from bony
fish ~400 million years ago, contain hemoglobin built from a
single type of polypeptide
chain.
The lamprey
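The calibration idea above can be illustrated with a small Python sketch. It assumes a strict molecular clock (a constant substitution rate along both lineages) and ignores multiple substitutions at the same site, so it is only a rough approximation. All numbers below are hypothetical and chosen to keep the arithmetic simple; they are not measurements for globins or any other protein.

```python
def substitution_rate(diff_fraction, divergence_time_myr):
    """Rate per site per million years per lineage, from one calibrated pair.

    The factor 2 accounts for changes accumulating along both lineages
    since they split from their common ancestor.
    """
    return diff_fraction / (2.0 * divergence_time_myr)


def estimate_divergence_time(diff_fraction, rate):
    """Estimate a divergence time (million years) for another pair, given a rate."""
    return diff_fraction / (2.0 * rate)


# Hypothetical calibration pair: 20% of sites differ, and the fossil
# record dates the split at 100 million years ago.
rate = substitution_rate(0.20, 100.0)

# Hypothetical pair to be dated: 10% of sites differ.
estimated_time = estimate_divergence_time(0.10, rate)
print(round(estimated_time, 1), "million years")
```

Under these assumptions, a pair with half as many differences is dated to half the calibration time.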
48. Modern techniques make the experimental
exploration of evolution possible
• Ancient DNA can sometimes be amplified and sequenced using
polymerase chain reaction (PCR) and DNA sequencing.
• This approach has been applied to mitochondrial DNA from a
Neanderthal fossil estimated at between 30,000 and 100,000 years
of age found near Düsseldorf, Germany, in 1856. Comparison with
the sequences from Homo sapiens revealed between 22 and 36
substitutions, considerably fewer than the average of 55 differences
between human beings and chimpanzees over the common bases in
this region.
49. Modern techniques make the experimental
exploration of evolution possible
• Further analysis suggested that the common ancestor of modern
human beings and Neanderthals lived approximately 600,000
years ago.
• An evolutionary tree constructed by using these and other data
revealed that the Neanderthal was not an intermediate between
chimpanzees and human beings but, instead, was an evolutionary
"dead end" that became extinct.
Successful sequencing of
ancient DNA requires
sufficient DNA for reliable
amplification and the
rigorous exclusion of all
sources of contamination.
50. Archeological sites in Indonesia
• Homo floresiensis ("Flores Man"; nicknamed "hobbit") is an
extinct species thought to be in the genus Homo. The remains of
an individual (1.1 m in height) were discovered in 2003 at Liang
Bua on the island of Flores in Indonesia.
• This hominin was originally considered remarkable for its
survival until only 12,000 years ago. However, by 2016, further
work had pushed the most recent evidence of its existence back
to 50,000 years ago.
51. Glossary
• BLOSUM
– Blocks Substitution Matrix. A substitution matrix in which scores for
each position are derived from observations of the frequencies of
substitutions in blocks of local alignments in related proteins. Each
matrix is tailored to a particular evolutionary distance. In the
BLOSUM62 matrix, for example, the alignment from which scores
were derived was created using sequences sharing no more than
62% identity.
• Alignment
– The process of lining up two or more sequences to achieve
maximal levels of identity (and conservation, in the case of amino
acid sequences) for the purpose of assessing the degree of
similarity and the possibility of homology.
52. • Juxtaposition
– the act of placing two or more things side by side or the state of
being so placed.
• E value
– Expectation value. The number of different alignments with
scores equivalent to or better than the raw score that are expected
to occur in a database search by chance. The lower the E value, the
more significant the score.
• Substitution
– The presence of a non-identical amino acid at a given position in
an alignment. If the aligned residues have similar physico-
chemical properties the substitution is said to be "conservative".
• Conservation
– Changes at a specific position of an amino acid (or, less
commonly, DNA) sequence that preserve the physico-chemical
properties of the original residue.
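To make the E value entry above more concrete, here is a minimal Python sketch. It assumes the commonly used relation E = m × n × 2^(−S′), where S′ is the bit score (a normalized form of the raw score), m is the query length and n is the total database length. The function name and the example numbers are hypothetical, and real search tools apply further corrections (for example, effective lengths) that are not modeled here.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search: a 250-residue query against 10 million database residues.
for bit_score in (20, 40, 60):
    e_value = expect_value(bit_score, 250, 10_000_000)
    print(bit_score, e_value)
```

Each additional bit of score halves the E value, which is why a higher score corresponds to a lower, more significant E value.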
53. • Identity
– The extent to which two (nucleotide or amino acid) sequences
are invariant.
• Gap
– A space introduced into an alignment or position at which a letter
is paired with a null.
• Similarity
– The extent to which nucleotide or protein sequences are related.
The extent of similarity between two sequences can be based on
percent sequence identity and/or conservation. In BLAST,
similarity refers to a positive matrix score.
• Query
– The input sequence (or other type of search term) with which all
of the entries in a database are to be compared.
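The identity and similarity entries above can be illustrated with a short Python sketch that reports percent identity and percent "positives" (aligned pairs with a positive score) for one pairwise alignment. The scoring function is a toy stand-in, not a real substitution matrix such as BLOSUM62, and the two sequences are invented. Percentages are taken over the full alignment length including gap columns, which is one possible convention.

```python
# Toy set of residue pairs treated as conservative (assumption, for illustration).
TOY_POSITIVE_PAIRS = {
    frozenset("IL"),
    frozenset("KR"),
    frozenset("DE"),
    frozenset("ST"),
}


def toy_score(a, b):
    """Return +1 for identical or 'conservative' pairs, -1 otherwise."""
    if a == b or frozenset((a, b)) in TOY_POSITIVE_PAIRS:
        return 1
    return -1


def identity_and_positives(seq_a, seq_b, score):
    """Percent identical columns and percent positively scoring columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    identical = 0
    positive = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # a gap column is neither identical nor positive
        if a == b:
            identical += 1
        if score(a, b) > 0:
            positive += 1
    columns = len(seq_a)
    return 100.0 * identical / columns, 100.0 * positive / columns


# Hypothetical aligned pair, invented for illustration.
identity, positives = identity_and_positives("MKILDS-A", "MRLLEAGA", toy_score)
print(identity, positives)
```

Identical residues always count as positives here, so percent positives is never lower than percent identity.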
54. Summary
1. Homologs are descended from a common ancestor.
2. Statistical analysis of sequence alignments can detect
homology.
3. Examination of three-dimensional structure enhances our
understanding of evolutionary relationships.
4. Evolutionary trees can be constructed on the basis of
sequence information.
55. Study questions
1. What are the differences between a paralog and an ortholog?
2. How can we study the function of a novel gene using
sequence alignment?
3. Why is it possible for two similar sequences not to be
homologous?
4. Why does protein sequence comparison produce more
accurate results than nucleotide sequence comparison?
5. Why is the tertiary structure of a protein more evolutionarily
conserved than its primary structure?
6. What is a conservative substitution?
7. What is a sequence alignment?
8. What online tool can be used to search for homologous
sequences?
56. How confident can we be that orthologs are
similar, but paralogs differ?
• The idea that orthologs share similar functions, whereas
paralogs have different functions, has thus become accepted
by many and is the standard textbook model, as exemplified
by the ‘Phylogenetics Factsheet’ of the National Centre for
Biotechnology Information (NCBI)
(http://www.ncbi.nlm.nih.gov/About/primer/phylo.html).
• However, more recent evidence shows that orthologs and
paralogs are not so different in either their evolutionary rates
or their mechanisms of divergence.
• Thus, functional change between orthologs might be as
common as between paralogs, and future studies should be
designed to test the impact of duplication against this
alternative model.
Studer and Robinson-Rechavi (2009)