This document provides an introduction and overview of the field of bioinformatics. It discusses how bioinformatics combines computer science and biology to analyze large amounts of biological data. Specifically, it mentions that bioinformatics uses algorithms and techniques from computer science to solve complex biological problems related to areas like molecular biology, genomics, drug discovery, and more. It also outlines some of the key applications of bioinformatics like sequence analysis, protein structure prediction, genome annotation, and comparative genomics. Finally, it provides brief descriptions of important biological databases and resources that bioinformaticians use to store and analyze genomic and protein sequence data.
2. Introduction to Bioinformatics
Biology, chemistry, statistics and computer science are combined in bioinformatics to solve complex biological problems.
Algorithms and techniques of computer science are used to solve the problems faced by molecular biologists.
‘Information technology applied to the management and analysis of biological data’
Storage and analysis are the two important functions; bioinformaticians build tools for each.
The bio-IT market has grown significantly in the genomic era.
3. Fields of Bioinformatics
The need for bioinformatics has arisen from the recent explosion of publicly available genomic information, such as that resulting from the Human Genome Project.
Gain a better understanding of gene analysis, taxonomy and evolution.
Work efficiently on rational drug design and reduce the time taken to develop drugs manually.
Unravel the wealth of biological information hidden in the mass of sequence, structure, literature and other biological data.
Has environmental clean-up benefits.
In agriculture, it can be used to produce high-productivity crops.
Gene therapy
Forensic analysis
Understanding biological pathways and networks in systems biology
5. Applications of Bioinformatics
Provides central, globally accessible databases that enable scientists to submit, search and analyze information, and offers software for data studies, modelling and interpretation.
Sequence Analysis:-
Sequence analysis uses sequencing data to determine which sequences encode peptides and which are regulatory. These computational tools also detect DNA mutations in an organism and identify related sequences. Special software is used to find overlapping fragments and assemble them.
Prediction of Protein Structure:-
The primary structure of a protein (its amino acid sequence) is easy to determine from the DNA sequence that encodes it, but the secondary, tertiary and quaternary structures are difficult to determine. Bioinformatics tools can be used to predict these complex protein structures.
Genome Annotation:-
In genome annotation, genomes are marked up to identify regulatory sequences and protein-coding regions. It is a very important part of the Human Genome Project, as it determines the regulatory sequences.
6. Comparative Genomics:-
Comparative genomics is the branch of bioinformatics that determines the relationship between genomic structure and function across different biological species, which enables scientists to trace the processes of evolution that occur in the genomes of different species.
Pharmaceutical Research:-
Tools of bioinformatics are also helpful in drug discovery, diagnosis and disease management. Complete sequencing of human genes has enabled scientists to make medicines and drugs which can target more than 500 genes. It also supports accurate prediction in screening.
7.
S. No | Unix | Windows | Linux
1. | Proprietary and open-source variants | Closed source | Open source
2. | Very high security system | Low security system | High security system
3. | Command-line | GUI | Hybrid
4. | File system is arranged in a hierarchical manner | File system is hierarchical, with a separate tree per drive | File system is arranged in a hierarchical manner
5. | Not user friendly | User friendly | User friendly
6. | Multi tasking | Multi tasking | Multi tasking
8.
9. Biological databanks and databases
Very fast growth of biological data
Diversity of biological data:
o Primary sequences
o 3D structures
o Functional data
Database entry usually required for publication:
o Sequences
o Structures
Major primary databases
Nucleic Acid | Protein
EMBL (Europe) | PIR (Protein Information Resource)
GenBank (USA) | MIPS
DDBJ (Japan) | SWISS-PROT (University of Geneva, now with EBI)
 | TrEMBL (a supplement to SWISS-PROT)
 | NRL-3D
10. Sequence Databases
The three databanks exchange data on a daily basis
Data can be submitted and accessed at any of the three locations
Nucleotide db:
GenBank - https://www.ncbi.nlm.nih.gov/
EMBL - https://www.ebi.ac.uk/
DDBJ - https://www.ddbj.nig.ac.jp/index-e.html
Bibliographic db:
PubMed , Medline
Specialized db:
RDP, IMGT, TRANSFAC, MitBase
Genetic db:
SGD – https://www.yeastgenome.org/
ACeDB, OMIM
11. Composite Databases and Secondary Databases
Composite databases combine several primary sources:
Swiss-Prot
PIR
GenBank
NRL-3D
Secondary databases store structure info or results of searches of the primary databases
Secondary database | Primary source
PROSITE (https://prosite.expasy.org/) | SWISS-PROT
PRINTS (http://130.88.97.239/PRINTS/index.php) | OWL
13. SCOP
Structural Classification of Proteins
http://scop.mrc-lmb.cam.ac.uk/
The SCOP database aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins
Levels of hierarchy
Family : Pairwise residue identities of 30% or greater
Superfamily : Even though sequence identities are low, members should have a common evolutionary origin
Eg: ATPase domain of HSP and HK
Fold : Major structural similarity
Class : all α, all β, α/β, α+β, multidomain
14. CATH
https://www.cathdb.info/
Class : 2º structure
Architecture : Gross orientation of 2º structure, independent of connectivities
Topology (fold family) : topological connection of super families
S level : Sequence and structural identities
15. Basis of Sequence Alignment
1. Aligning sequences.
2. Finding the relatedness of proteins or genes, i.e. whether or not they have a common ancestor.
3. Mutations bring changes or divergence into the sequences.
4. Alignment can also reveal the part of the sequence which is crucial for the functioning of the gene or protein.
Similarity indicates conserved function
Human and mouse genes are more than 80% similar
Comparing sequences helps us understand function
16. Sequence Alignment
After obtaining nucleotide or amino acid sequences, the first step is to compare them with known sequences. The comparison is done at the level of constituents (residues); conserved residues are then identified to predict the nature and function of the protein. This process of mapping is called Sequence Alignment.
1. Local alignment – Smith & Waterman Algorithm
2. Global alignment – Needleman & Wunsch Algorithm
Gapped Alignment
Ungapped Alignment
Terms to Know - Homolog, Ortholog, Paralog, Xenolog, Similar and Identical
Alignment scoring and substitution matrices
Dot plots
Dynamic programming algorithm
Heuristic methods (In order to reduce time)
FASTA
BLAST
Pairwise sequence alignment
Multiple sequence alignment
17. Scoring a sequence alignment
Match score: +1
Mismatch score: +0
Gap penalty: –1
Maximum number of matches gives high similarity – Optimum Alignment
ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT
Matches: 18 × (+1)
Mismatches: 2 × 0
Gaps: 7 × (–1)
Score = +11
Under this scheme, one long gap and several scattered gaps of the same total length receive the same score:
ACGTCTGAT-------ATAGTCTATCT
ACGTCTGATACGCCGTATAGTCTATCT
AC-T-TGA--CG-CGT-TA-TCTATCT
A single contiguous gap is usually preferred. We can achieve this by penalizing more for a new gap than for extending an existing gap.
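The scoring scheme on this slide (+1 match, 0 mismatch, –1 gap) can be reproduced with a few lines of Python. This is a minimal sketch, not a standard tool: the function names and the affine penalties (gap_open = –2, gap_extend = –1) are assumptions chosen only to illustrate why a new gap is penalized more than an extended one.

```python
def score_alignment(a, b, match=1, mismatch=0, gap=-1):
    """Score two already-aligned, equal-length strings with a linear gap penalty."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


def score_affine(a, b, match=1, mismatch=0, gap_open=-2, gap_extend=-1):
    """Same idea, but opening a gap costs more than extending it (assumed values)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    prev_gap_row = None  # which sequence carried the gap in the previous column
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            score += gap_extend if prev_gap_row == row else gap_open
            prev_gap_row = row
        else:
            score += match if x == y else mismatch
            prev_gap_row = None
    return score


REF = "ACGTCTGATACGCCGTATAGTCTATCT"
PARTIAL = "----CTGATTCGC---ATCGTCTATCT"
ONE_GAP = "ACGTCTGAT-------ATAGTCTATCT"
SCATTERED = "AC-T-TGA--CG-CGT-TA-TCTATCT"

print(score_alignment(REF, PARTIAL))  # 18 matches, 2 mismatches, 7 gaps -> 11
print(score_alignment(REF, ONE_GAP), score_alignment(REF, SCATTERED))
print(score_affine(REF, ONE_GAP), score_affine(REF, SCATTERED))
```

With the linear penalty the single long gap and the scattered gaps score the same; with the affine variant the single long gap scores higher.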
18. Scores:
positive for identical or similar
negative for different
negative for insertion in one of the two sequences
Substitution matrices – weights replacement of one residue by another
assumption of evolution by point mutations
amino acid replacement (by base replacement)
amino acid insertion
amino acid deletion
Significance of alignment
Depends critically on gap penalty
Need to adjust to given sequence
19. Derivation of substitution matrices
PAM matrices
First substitution matrix; developed by Dayhoff (1978) based on the Point Accepted Mutation (PAM) model of evolution
1 PAM is a unit of evolutionary divergence in which 1% of the amino acids have been changed
Derived from alignments of very similar sequences
PAM1 = mutation events that change 1% of amino acids
PAM2, PAM3, ... are extrapolated by matrix multiplication, e.g. PAM2 = PAM1 * PAM1; PAM3 = PAM2 * PAM1, etc.
Lower-distance PAM matrices for closely related proteins, e.g. PAM30
Higher-distance PAM matrices for highly diverged sequences, e.g. PAM250
Problems with PAM matrices:
Incorrect modelling of long-time substitutions, since the observed conservative mutations are dominated by single-nucleotide changes
e.g.: L <–> I, L <–> V, Y <–> F
Over a long time, any amino acid change can occur
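The extrapolation PAM2 = PAM1 * PAM1 is ordinary matrix multiplication of mutation-probability matrices. The sketch below illustrates it on a made-up 3-letter alphabet; TOY_PAM1 is an invented matrix, not Dayhoff's data, and the function names are hypothetical. Published PAM scoring matrices are additionally converted to log-odds scores, which is not shown here.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(pam1, n):
    """Extrapolate PAM-n by multiplying PAM1 with itself n times in total."""
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result


# Invented 3-letter example: each row sums to 1 and about 1% of residues change.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

toy_pam2 = pam_n(TOY_PAM1, 2)
toy_pam250 = pam_n(TOY_PAM1, 250)

# The probability of staying unchanged drops as the evolutionary distance grows.
print(TOY_PAM1[0][0], toy_pam2[0][0], toy_pam250[0][0])
```

Each row remains a probability distribution after every multiplication, which is why the same procedure can be repeated up to PAM250.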
20. Positive and negative values; identity score depends on residue
21. BLOSUM matrices
BLOCKS Amino acid Substitution Matrices
Similar to PAM; however, the data were derived from local alignments of distantly related proteins deposited in the BLOCKS db
Unlike PAM, there is no explicit evolutionary model behind them
BLOSUM series (BLOSUM50, BLOSUM62, ...)
BLOCKS database:
ungapped multiple alignments of protein families at a given identity
E.g.,
BLOSUM 30 better for gapped alignments – for comparing highly diverged seq
BLOSUM 90 better for ungapped alignments – for very close seq
BLOSUM 62 was derived from a set of sequences which are 62% or less identical
22. DOT Plot
Simple comparison without alignment
2D graphical representation method primarily used for finding regions of
local matches between two sequences
DOTTER, PALIGN, DOTLET (https://dotlet.vital-it.ch/)
Distinguish by alignment score
Similarities increase score (positive)
Mismatches decrease score (Negative)
Gaps decrease score
Expected number of random dots = (probability of a matching pair) x (length of seq A) x (length of seq B)
Disadvantages – no direct measure of sequence homology; statistically weak
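A dot plot can be sketched in a few lines of Python. This is an illustrative toy, not how DOTTER or DOTLET are implemented: it marks every identical pair of residues without any window or threshold filtering, and the function names and the default match probability of 0.25 (equal nucleotide frequencies) are assumptions.

```python
def dot_matrix(a, b):
    """Boolean matrix: cell [i][j] is True when a[i] equals b[j]."""
    return [[x == y for y in b] for x in a]


def render(a, b):
    """Plain-text dot plot with seq b along the top and seq a down the side."""
    lines = [" " + b]
    for x, row in zip(a, dot_matrix(a, b)):
        lines.append(x + "".join("*" if cell else "." for cell in row))
    return "\n".join(lines)


def expected_random_dots(len_a, len_b, p_match=0.25):
    """Dots expected by chance: (probability of pair) x len(A) x len(B)."""
    return p_match * len_a * len_b


print(render("ACGTAC", "ACGGTAC"))
print(expected_random_dots(6, 7))
```

Runs of dots along a diagonal correspond to regions of local match; a shifted diagonal indicates an insertion or deletion in one of the sequences.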
23. Dynamic programming algorithm
To build up optimal alignment which maximizes the similarity we need some scoring
methods
The dynamic programming relies on a principle of optimality.
PROCEDURE
Construct a two-dimensional matrix whose axes are the two sequences to be compared.
The scores are calculated one row at a time. This starts with the first row of one
sequence, which is used to scan through the entire length of the other sequence,
followed by scanning of the second row.
The scanning of the second row takes into account the scores already obtained in the
first round. The best score is put into the bottom right corner of an intermediate
matrix.
This process is iterated until values for all the cells are filled.
24. Depicting the results:
Back tracing
The best matching path is the one that has the maximum total score.
If two or more paths reach the same highest score, one is chosen
arbitrarily to represent the best alignment.
The path can also move horizontally or vertically at a certain
point, which corresponds to introduction of a gap or an insertion
or deletion for one of the two sequences.
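The matrix-filling and back-tracing procedure described on these two slides can be sketched as a small global (Needleman & Wunsch style) alignment in Python. The function name and the default scores (+1 match, 0 mismatch, –1 gap) are assumptions for illustration; when several paths tie, this sketch prefers the diagonal move, which is one arbitrary choice among equally good alignments.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    # F[i][j] holds the best score for aligning a[:i] with b[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    # Fill the matrix one row at a time, reusing scores already computed.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align the two residues
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a
    # Back-trace from the bottom-right corner to recover one best path.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                out_a.append(a[i - 1]); out_b.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("GATTACA", "GCATGCU", mismatch=-1)
print(score)
print(top)
print(bottom)
```

A local (Smith & Waterman style) variant would additionally clamp every cell at zero and start the back-trace from the highest-scoring cell rather than the corner.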
25. BLAST
Basic Local Alignment Search Tool
https://blast.ncbi.nlm.nih.gov/Blast.cgi
Multi-step approach to find high-scoring local alignments between two sequences
List words of fixed length (3 amino acids for proteins, 11 nucleotides for DNA) expected to give a score larger than a threshold (seed alignment)
For every word, search the database and extend the ungapped alignment in both directions up to a certain length to get high-scoring segment pairs (HSPs)
New versions of BLAST allow gaps
Blastn: nucleotide query - nucleotide database
Blastp: protein query - protein database
tBlastn: protein query - translated nucleotide database
Blastx: translated nucleotide query - protein database
tBlastx: translated nucleotide query - translated nucleotide database
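The word-listing and seed-extension idea can be illustrated with a simplified Python sketch. It is not the BLAST implementation: it only finds exact word matches (real BLAST also considers similar "neighbourhood" words scoring above a threshold) and it extends a seed only while residues are identical (real BLAST extends by score with a drop-off cutoff). All function names are hypothetical.

```python
def words(seq, k):
    """List every word of length k in seq together with its start position."""
    return [(seq[i:i + k], i) for i in range(len(seq) - k + 1)]


def seed_hits(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a k-letter word matches exactly."""
    index = {}
    for word, pos in words(subject, k):
        index.setdefault(word, []).append(pos)
    hits = []
    for word, qpos in words(query, k):
        for spos in index.get(word, []):
            hits.append((qpos, spos))
    return hits


def extend_exact(query, subject, qpos, spos, k):
    """Extend a seed without gaps in both directions while residues are identical.

    Returns (query_start, subject_start, length) of the extended segment.
    """
    start_q, start_s = qpos, spos
    while start_q > 0 and start_s > 0 and query[start_q - 1] == subject[start_s - 1]:
        start_q -= 1
        start_s -= 1
    end_q, end_s = qpos + k, spos + k
    while end_q < len(query) and end_s < len(subject) and query[end_q] == subject[end_s]:
        end_q += 1
        end_s += 1
    return start_q, start_s, end_q - start_q


query, subject = "ACGTAC", "TTACGTTT"
for qpos, spos in seed_hits(query, subject, k=3):
    print((qpos, spos), extend_exact(query, subject, qpos, spos, 3))
```

Indexing the subject words once and looking up each query word is what makes the seeding step fast compared with filling a full dynamic programming matrix.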
26.
27. Interpretation
Rapid and easy way to find homologs by scanning a huge db
Can search against specialized db
BLAST programs employ the SEG program to filter low-complexity regions before executing the db search
Quality of the alignment is represented by the score (to identify hits)
Significance of the alignment is represented by the E-value (expected value)
E-value decreases exponentially as the score increases
The E-value provides information about the likelihood that a given sequence match is purely by chance. The lower the E-value, the less likely the database match is a result of random chance, and therefore the more significant the match is.
If E is between 0.01 and 10, the match is considered not significant.
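The exponential relationship between score and E-value can be made concrete with the Karlin-Altschul relation for bit scores, E = m × n × 2^(–S'), where m is the query length, n the database length and S' the bit score. The sketch below is a simplification: the function name is hypothetical and BLAST itself uses corrected "effective" lengths rather than the raw lengths assumed here.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate E-value from a bit score: E = m * n * 2 ** (-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Example values (assumed): a 100-residue query against a 1,000,000-residue database.
for bits in (20, 30, 40, 50):
    print(bits, expect_value(bits, 100, 1_000_000))
```

Each additional bit of score halves the E-value, and the same score becomes less significant as the database grows.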
28. FASTA
Often more sensitive than BLAST, but slower
Uses a lookup table to locate all identically matching words of length ktup between two sequences
BLAST – hit extension step
FASTA – exact word match
A larger ktup makes the search faster but less sensitive; a smaller ktup makes it slower but more sensitive
FASTA also uses E-values and bit scores. The FASTA output provides one more statistical parameter, the Z-score.
If Z is in the range of 5 to 15, the sequence pair can be described as highly probable homologs. If Z < 5, their relationship is described as less certain.
29. Phylogenetics
Phylogenetics is the study of evolutionary relatedness among various groups of
organisms (e.g., species, populations).
Methods of Phylogenetic Analysis:
Phenetic: NJ, UPGMA
Cladistic: MP, ML
Monophyletic group – all taxa share one common ancestor
Paraphyletic grouping – taxa share a common ancestor, but not all of its descendants are included
Errors in alignment mislead the tree
30. A phylogenetic tree is a tree showing the
evolutionary interrelationships among various
species or other entities that are believed to
have a common ancestor. A phylogenetic tree
is a form of a cladogram. In a phylogenetic
tree, each node with descendants represents
the most recent common ancestor of the
descendants, and edge lengths correspond to
time estimates.
Each node in a phylogenetic tree is called a
taxonomic unit. Internal nodes are generally
referred to as Hypothetical Taxonomic Units
(HTUs) as they cannot be directly observed
Distances – no of changes
Parts of a phylogenetic tree
Node
Root
Outgroup
Ingroup
Branch
31. Phenetic Method of analysis:
Also known as numerical taxonomy
Involves various measures of overall similarity for ranking species
All the data are first converted to numerical values without any character weighting. Then the number of similarities / differences is calculated.
Then the closest taxa are clustered or grouped together
Lack of evolutionary significance in phenetics
Cladistic method of analysis:
Alternative approach
Diagramming relationship between taxa
Basic assumption – members of the group share a common evolutionary
history
Typically based on morphological data
32. Distance and Character
A tree can be based on
1. quantitative measures like the distance or similarity between species, or
2. based on qualitative aspects like common characters.
Molecular clock assumption – substitutions in the nucleotide / amino acid sequences being compared occur at a constant rate
33. Maximum Parsimony:
Finds the optimum tree by minimizing the number of evolutionary changes
No assumptions on the evolutionary pattern
MSA first, then scoring
Rather time consuming; works well if the sequences have strong similarity
May oversimplify evolution
May produce several equally good trees
PAUP, MacClade
Maximum Likelihood:
The best tree is found based on assumptions on evolution model
Nucleotide models are more advanced at the moment than amino acid models
Programs require a lot of computing capacity from the system
34. Neighbour Joining:
The sequences to be joined are chosen to give the best least-squares estimates of the branch lengths, i.e. those that most closely reflect the actual distances between the sequences
The NJ method begins by creating a star topology in which no neighbours are connected
The tree is then modified by joining pairs of sequences. The pair to be joined is the one that gives the smallest sum of branch lengths
Distance table
No molecular clock assumed
UPGMA
Unweighted Pair Group method with Arithmetic Mean
Works by clustering, starting with the most similar sequences and moving towards the more distant ones
Dot representation
Molecular clock assumed
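The UPGMA clustering step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name, the nested-tuple tree representation and the example distance matrix are all invented here, branch lengths are not computed, and ties are broken by taking the first closest pair found.

```python
from itertools import combinations


def upgma(labels, matrix):
    """Cluster taxa by UPGMA and return the tree topology as nested tuples.

    labels: list of taxon names; matrix: symmetric list-of-lists of distances.
    """
    # Each cluster id maps to (subtree, number of leaves in the subtree).
    clusters = {i: (labels[i], 1) for i in range(len(labels))}
    dist = {frozenset((i, j)): matrix[i][j]
            for i, j in combinations(range(len(labels)), 2)}
    next_id = len(labels)
    while len(clusters) > 1:
        # Join the two closest clusters (most similar first).
        i, j = min(combinations(sorted(clusters), 2),
                   key=lambda pair: dist[frozenset(pair)])
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        # Distance to every other cluster is the size-weighted arithmetic mean.
        for k in clusters:
            d_ik = dist[frozenset((i, k))]
            d_jk = dist[frozenset((j, k))]
            dist[frozenset((next_id, k))] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


# Invented example distances for four taxa.
names = ["A", "B", "C", "D"]
distances = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
print(upgma(names, distances))
```

Because every join averages distances and places the new node midway between clusters, the method implicitly assumes a molecular clock; neighbour joining drops that assumption by correcting each pairwise distance for the average divergence of the two taxa from all others.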
35. PHYLIP (Phylogeny Inference Package)
Available free in Windows/MacOS/Linux systems
Parsimony, distance matrix and likelihood methods (bootstrapping and
consensus trees)
Data can be molecular sequences, gene frequencies, restriction sites and
fragments, distance matrices and discrete characters