Bioinformatics is an interdisciplinary field involving biology, computer science, mathematics and statistics. It addresses large-scale biological problems from a computational perspective. Common problems include modeling biological processes at the molecular level and making inferences from collected data. A bioinformatics solution typically involves collecting statistics from biological data, building a computational model, solving a computational problem, and testing the algorithm. Bioinformatics plays a role in areas like structural genomics, functional genomics and nutritional genomics. It is used for applications such as transcriptome analysis, drug discovery, cheminformatics analysis, and more. It is an important tool in fields like molecular medicine, gene therapy, microbial genome applications, antibiotic resistance, and evolutionary studies. Biological databases are important for organizing data into structured records for easy retrieval.
2. Introduction
Bioinformatics is an interdisciplinary field mainly involving molecular
biology and genetics, computer science, mathematics, and statistics.
Data intensive, large-scale biological problems are addressed from a
computational point of view.
The most common problems are modeling biological processes at the
molecular level and making inferences from collected data.
3. A bioinformatics solution usually involves the following steps:
● Collect statistics from biological data.
● Build a computational model.
● Solve a computational modeling problem.
● Test and evaluate a computational algorithm.
7. Applications of bioinformatics
Bioinformatics plays a vital role in the areas of structural genomics, functional
genomics, and nutritional genomics.
It covers emerging scientific research and the exploration of proteomes from the
overall level of intracellular protein composition (protein profiles), protein structure,
protein-protein interaction, and unique activity patterns (e.g. post-translational
modifications).
8. Applications of Bioinformatics
Bioinformatics is used for transcriptome analysis where mRNA expression levels
can be determined.
Bioinformatics is used to identify and structurally modify a natural product, to
design a compound with the desired properties and to assess its therapeutic
effects, theoretically.
Cheminformatics analysis includes analyses such as similarity searching,
clustering, QSAR modeling, virtual screening, etc.
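Similarity searching is often done by comparing molecular fingerprints. The following is a minimal sketch, assuming each fingerprint is represented as a set of "on" bit positions and using the Tanimoto coefficient as the similarity measure. The compound names and fingerprints are made up for illustration and do not describe real molecules.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 1.0  # convention chosen here: two empty fingerprints count as identical
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def similarity_search(query, library, threshold=0.5):
    """Return (name, score) pairs scoring at or above threshold, best first."""
    hits = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    return sorted((h for h in hits if h[1] >= threshold), key=lambda h: -h[1])


# Hypothetical fingerprints, for illustration only.
library = {"cmpd_1": {1, 2, 3, 4}, "cmpd_2": {1, 2, 9}, "cmpd_3": {7, 8}}
query = {1, 2, 3}
hits = similarity_search(query, library)
```

Here `hits` keeps `cmpd_1` and `cmpd_2` and drops `cmpd_3`, which shares no bits with the query.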
9. Bioinformatics is playing an increasingly important role in almost all aspects of
drug discovery and drug development.
Bioinformatics tools are very effective in prediction, analysis and interpretation of
clinical and preclinical findings.
10. Molecular Medicine
The human genome sequence will have profound effects on the fields of biomedical
research and clinical medicine.
The completion of the human genome and the use of bioinformatics tools mean
that we can search for the genes directly associated with different diseases and
begin to understand the molecular basis of these diseases more clearly.
This new knowledge of the molecular mechanisms of disease will enable better
treatments, cures and even preventative tests to be developed.
11. Gene therapy
In the not too distant future, with the use of bioinformatics tools, the
potential for using genes themselves to treat disease may become a
reality.
Gene therapy is the approach used to treat, cure or even prevent disease
by changing the expression of a person’s genes.
12. Homology modelling and protein drug discovery
At present all drugs on the market target only about 500 proteins.
With an improved understanding of disease mechanisms and using
computational tools to identify and validate new drug targets, more
specific medicines that act on the cause, not merely the symptoms, of the
disease can be developed.
These highly specific drugs promise to have fewer side effects than many
of today’s medicines.
13. Microbial genome applications
The arrival of the complete genome sequences and their potential to
provide a greater insight into the microbial world and its capacities could
have broad and far reaching implications for environment, health, energy
and industrial applications.
By studying the genetic material of these organisms, scientists can begin
to understand these microbes at a very fundamental level and isolate the
genes that give them their unique abilities to survive under extreme
conditions.
14. Antibiotic resistance
Scientists have been examining the genome of Enterococcus faecalis, a leading
cause of bacterial infection among hospital patients.
They have discovered a virulence region made up of a number of antibiotic-resistance
genes that may contribute to the bacterium’s transformation from a
harmless gut bacterium to a menacing invader.
The discovery of the region, known as a pathogenicity island, could provide useful
markers for detecting pathogenic strains and help to establish controls to prevent
the spread of infection in wards.
15. Evolutionary studies
The sequencing of genomes from all three domains of life (Eukaryota, Bacteria and
Archaea) means that evolutionary studies can be performed in a quest to
determine the tree of life and the last universal common ancestor.
16. Crop improvement
Comparative genetics of the plant genomes has shown that the organisation of
their genes has remained more conserved over evolutionary time than was
previously believed.
These findings suggest that information obtained from the model crop systems
can be used to suggest improvements to other food crops.
At present the complete genomes of Arabidopsis thaliana (thale cress) and Oryza
sativa (rice) are available.
18. Biological Databases: Types and Importance
As the volume of genomic data grows, sophisticated computational methodologies are
required to manage the data deluge.
A biological database is a large, organized body of persistent data, usually associated with
computerized software designed to update, query, and retrieve components of the data stored
within the system.
A simple database might be a single file containing many records, each of which includes the
same set of information.
The chief objective of the development of a database is to organize data in a set of structured records
to enable easy retrieval of information.
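The idea of a simple database as a single file of uniform records can be sketched as follows. This is only an illustration: the field names and the three records are hypothetical, and a real biological database would use a proper database system rather than an in-memory dictionary.

```python
import csv
import io

# Hypothetical flat file: every record carries the same set of fields.
FLAT_FILE = """record_id,organism,length
REC001,Homo sapiens,1200
REC002,Mus musculus,950
REC003,Homo sapiens,430
"""


def load_records(text):
    """Parse the flat file into a list of dict records."""
    return list(csv.DictReader(io.StringIO(text)))


def build_index(records, field):
    """Map each value of `field` to the records carrying it, for quick retrieval."""
    index = {}
    for record in records:
        index.setdefault(record[field], []).append(record)
    return index


records = load_records(FLAT_FILE)
by_organism = build_index(records, "organism")
human_ids = [r["record_id"] for r in by_organism["Homo sapiens"]]
```

Building the index once means a query by organism does not have to scan every record again.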
19. Types of Biological Databases
Based on their contents, biological databases can be roughly divided into
two categories:
● Primary databases
● Secondary databases
20. 1. Primary databases
Primary databases are also called archival databases.
They are populated with experimentally derived data such as
nucleotide sequence, protein sequence or macromolecular structure.
Experimental results are submitted directly into the database by
researchers, and the data are essentially archival in nature.
Once given a database accession number, the data in primary
databases are never changed: they form part of the scientific record.
21. Examples for primary databases
GenBank from NCBI (National Center for Biotechnology Information)
ENA from EMBL
DDBJ
Protein Data Bank (PDB; coordinates of three-dimensional
macromolecular structures)
22. 2. Secondary databases
Secondary databases comprise data derived from the results of
analysing primary data.
Secondary databases often draw upon information from numerous
sources, including other databases (primary and secondary),
controlled vocabularies and the scientific literature.
They are highly curated, often using a complex combination of
computational algorithms and manual analysis and interpretation to
derive new knowledge from the public record of science.
23. Examples for secondary databases
● RefSeq from NCBI
● Ensembl (variation, function, regulation and more layered onto whole
genome sequences)
● TrEMBL and Swiss-Prot from UniProt
24. Specialized Databases
Specialized databases cater to a particular research interest, such as a
specific organism or disease. Examples:
● FlyBase
● HIV Sequence Database
● Ribosomal Database Project
25. Importance of Databases
Databases allow knowledge discovery: the identification of connections between pieces of
information that were not known when the information was first entered.
This facilitates the discovery of new biological insights from raw data.
Secondary databases have become the molecular biologist’s reference library over the past decade,
providing a wealth of information on just about any gene or gene product that has been investigated
by the research community.
Databases also:
● let many users access the same data entries,
● allow the data to be indexed, and
● help remove redundancy from the data.
26. GenBank
The GenBank sequence database is an open access, annotated collection of all
publicly available nucleotide sequences and their protein translations.
It is produced and maintained by the National Center for
Biotechnology Information (NCBI; a part of the National Institutes of
Health in the United States) as part of the International Nucleotide
Sequence Database Collaboration (INSDC).
https://www.ncbi.nlm.nih.gov/genbank/
27. GenBank introduction
GenBank and its collaborators receive sequences produced in laboratories throughout
the world from more than 100,000 distinct organisms. The database was started in 1982 by
Walter Goad and Los Alamos National Laboratory.
GenBank has become an important database for research in biological fields
and has grown in recent years at an exponential rate by doubling roughly
every 18 months.
As of 15 June 2019, GenBank release 232.0 has 213,383,758 loci and
329,835,282,370 bases from 213,383,758 reported sequences.
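A doubling time of roughly 18 months corresponds to the growth law N(t) = N0 · 2^(t/18), with t in months. The sketch below only applies that formula; it is an extrapolation under the stated doubling time, not a forecast of actual GenBank release sizes.

```python
def projected_size(current_size, months_ahead, doubling_months=18.0):
    """Size after `months_ahead` months, assuming a constant doubling time."""
    return current_size * 2 ** (months_ahead / doubling_months)


# Growth factor over five years under an 18-month doubling time.
five_year_factor = projected_size(1.0, 60)
```

Over five years the factor is about ten, which is why storage and retrieval strategies have to scale along with the data.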
28. GenBank introduction
In recent years, divisions have been added to support specific
sequencing strategies.
These include divisions for expressed sequence tag (EST), genome
survey (GSS), high throughput genomic (HTG), high throughput cDNA
(HTC), and environmental sample (ENV) sequences, making a total of
18 divisions.
30. Submissions overview
Only original sequences can be submitted to GenBank.
Direct submissions are made to GenBank using BankIt, which is a
Web-based form, or the stand-alone submission program, Sequin.
Upon receipt of a sequence submission, the GenBank staff
checks the originality of the data, assigns an accession
number to the sequence and performs quality assurance checks.
The submissions are then released to the public database, where
the entries are retrievable by Entrez or downloadable by FTP.
31. Submission using BankIt
About one-third of author submissions are received through NCBI's web-based
data submission tool, BankIt. Using BankIt,
authors enter sequence information directly into a form and add biological
annotations such as coding regions or mRNA features.
BankIt validates submissions, flagging many common errors, and checks
for vector contamination using a variant of BLAST called VecScreen,
before creating a draft record in GenBank flat file format for the submitter
to review. BankIt is the tool of choice for simple submissions, especially
when only one or a small number of records is to be submitted.
BankIt can also be used by submitters to update their existing GenBank
records.
32. Submission using Sequin
NCBI also offers a standalone multi-platform submission program called
Sequin that can be used interactively with other NCBI
sequence retrieval and analysis tools.
Sequin handles simple sequences such as a cDNA, as well as segmented
entries, phylogenetic studies, population studies, mutation studies,
environmental samples, and alignments for which BankIt and other web-
based submission tools are not well suited.
Sequin has convenient editing and complex annotation capabilities and
contains a number of built-in validation functions for quality assurance.
33. Submission via tbl2asn
Submitters of large, heavily annotated genomes may find it convenient
to use ‘tbl2asn’.
It converts a table of annotations generated by an annotation pipeline
into an ASN.1 record suitable for submission to GenBank.
37. Sequence identifiers and accession numbers
Each GenBank record, consisting of both a sequence and its
annotations, is assigned a unique identifier, the accession number,
that is shared across the three collaborating databases (GenBank,
DDBJ, EMBL) and remains constant over the lifetime of the record.
Each version of the DNA sequence within a GenBank record is also
assigned a unique NCBI identifier, called a ‘gi’, that appears on the
VERSION line of GenBank flatfile records following the accession
number:
ACCESSION AF000001
VERSION AF000001.1 GI: 987654321
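The VERSION line shown above can be split into its parts with a few lines of code. This is a minimal sketch that assumes the layout of the example line (keyword, `accession.version`, then an optional `GI:` field); it is not a full GenBank flat-file parser.

```python
def parse_version_line(line):
    """Split a flat-file VERSION line into accession, version and optional GI."""
    fields = line.split()
    if len(fields) < 2 or fields[0] != "VERSION":
        raise ValueError("not a VERSION line")
    accession, _, version = fields[1].partition(".")
    # Join the remaining fields so both 'GI:123' and 'GI: 123' are accepted.
    rest = "".join(fields[2:])
    gi = int(rest[3:]) if rest.upper().startswith("GI:") else None
    return {
        "accession": accession,
        "version": int(version) if version else None,
        "gi": gi,
    }


record = parse_version_line("VERSION AF000001.1 GI: 987654321")
```

The accession stays constant across updates, while the version suffix and the GI change when the sequence itself changes.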
38. RETRIEVING GenBank DATA
The Entrez system
The sequence records in GenBank are accessible via Entrez, a flexible
database retrieval system that covers over 30 biological databases.
These include DNA and protein sequences derived from GenBank and
other sources, genome maps, population, phylogenetic and environmental
sequence sets, gene expression data, the NCBI taxonomy, protein
domain information, and protein structures from the Molecular Modeling
Database (MMDB). Each database is linked to the scientific literature via
PubMed and PubMed Central.
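Entrez can also be queried programmatically through NCBI's E-utilities. The sketch below only builds an EFetch request URL and makes no network call. The accession is the placeholder from the VERSION example, and the parameter choices (`nuccore` database, GenBank flat-file output as text) are assumptions about what a reader might want; NCBI's usage guidelines should be checked before sending real requests.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db="nuccore", rettype="gb", retmode="text"):
    """Build an EFetch URL asking for one record in the given format."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": retmode}
    return EFETCH + "?" + urlencode(params)


url = efetch_url("AF000001.1")
```

Keeping URL construction separate from the download step makes the request easy to inspect and test offline.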
39. Obtaining GenBank by FTP
NCBI distributes GenBank releases in the traditional flat-file format as
well as in the Abstract Syntax Notation (ASN.1) format used for
internal maintenance.
The complete bimonthly GenBank release and the daily updates,
which also incorporate sequence data from EMBL and DDBJ, are
available by anonymous FTP from NCBI as well as from a mirror
site at Indiana University.
40. European Nucleotide Archive
The European Nucleotide Archive (ENA) is a repository providing free and
unrestricted access to annotated DNA and RNA sequences. It also stores
complementary information such as experimental procedures, details of
sequence assembly and other metadata related to sequencing projects.
http://www.ebi.ac.uk/ena/
41. Database Structure
The archive is composed of three main databases:
● The Sequence Read Archive,
● The Trace Archive and
● EMBL Nucleotide Sequence Database (also known as EMBL-bank).
The ENA is produced and maintained by the European Bioinformatics
Institute and is a member of the International Nucleotide Sequence
Database Collaboration (INSDC) along with the DNA Data Bank of Japan
and GenBank.
42. Data access and management
The data contained in the ENA can be accessed manually or
programmatically via REST URLs through the ENA browser.
Initially limited to the Sequence Read Archive, the ENA browser now also
provides access to the Trace Archive and EMBL-Bank, allowing file
retrieval in a range of formats including XML, HTML, FASTA and FASTQ.
Individual records can be accessed using their accession numbers, and
other text queries are enabled through the EB-eye search engine.
43. SRA
The ENA operates an instance of the Sequence Read Archive (SRA), an
archival repository of sequence reads and analyses which are intended
for public release. Originally called the Short Read Archive, the name was
changed in anticipation of future sequencing technologies being able to
produce longer sequence reads.
The preferred data format for files submitted to the SRA is the BAM
format, which is capable of storing both aligned and unaligned reads.
44. Storage
As of 2012, the ENA's storage requirements continue to grow
exponentially, with a doubling time of approximately 10 months.
To manage this increase, the ENA selectively discards less-valuable
sequencing platform data and implements advanced compression
strategies.
The CRAM reference-based compression toolkit was developed to help
reduce ENA storage requirements.
45. DNA Data Bank of Japan
http://www.ddbj.nig.ac.jp/
DDBJ began data bank activities in 1986 at NIG and remains the only
nucleotide sequence data bank in Asia.
Currently, the DDBJ Center operates at the Research Organization of
Information and Systems, National Institute of Genetics (NIG), in
Mishima, Japan, with the endorsement of MEXT, the Japanese Ministry of
Education, Culture, Sports, Science and Technology.
46. DDBJ, expanding its DNA databank activities, was restructured as one of the
Intellectual Infrastructure Project Centers of NIG, being separated from CIB.
In collaboration with the National Bioscience Database Center (NBDC), the DDBJ Center
started to operate the Japanese Genotype-phenotype Archive (JGA), an archive for all
types of individual-level genetic and de-identified phenotypic data from human subjects.
ARSA is a high-speed retrieval system for sequence and annotation data maintained
by the DNA Data Bank of Japan (DDBJ).
50. RefSeq
Reference Sequence (RefSeq) collection provides a comprehensive,
integrated, non-redundant, well-annotated set of sequences, including
genomic DNA, transcripts, and proteins. RefSeq sequences form a
foundation for medical, functional, and diversity studies.
They provide a stable reference for genome annotation, gene
identification and characterization, mutation and polymorphism analysis
(especially RefSeqGene records), expression studies, and comparative
analyses.
RefSeq genomes are copies of selected assembled genomes available in
GenBank.
51. Main features of the RefSeq collection include:
● non-redundancy
● explicitly linked nucleotide and protein sequences
● updates to reflect current knowledge of sequence data and biology
● data validation and format consistency
● distinct accession series (all accessions include an underscore '_'
character)
● ongoing curation by NCBI staff and collaborators, with reviewed
records indicated
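Because every RefSeq accession contains an underscore, the accession series can be recognised from the text of the identifier alone. The sketch below assumes the common two-letter-prefix form; the prefix table lists only a few well-known prefixes and is illustrative, not exhaustive.

```python
import re

# A few well-known RefSeq prefixes; illustrative, not a complete list.
REFSEQ_PREFIXES = {
    "NC": "complete genomic molecule",
    "NG": "genomic region",
    "NM": "mRNA",
    "NR": "non-coding RNA",
    "NP": "protein",
    "XM": "predicted mRNA",
    "XR": "predicted non-coding RNA",
    "XP": "predicted protein",
}

# Two capital letters, an underscore, digits, and an optional version suffix.
REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def describe_accession(accession):
    """Return the RefSeq molecule type, or None if this is not a RefSeq-style accession."""
    match = REFSEQ_PATTERN.match(accession)
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1), "other RefSeq record")
```

For example, `describe_accession("NM_000207.3")` returns `"mRNA"`, while a GenBank-style accession such as `AF000001.1` returns `None` because it has no underscore.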
52. RefSeq transcript and protein records are generated by
several processes including:
● Computation
Eukaryotic Genome Annotation Pipeline
Prokaryotic Genome Annotation Pipeline
● Manual curation
● Propagation from annotated genomes that are submitted to members
of the International Nucleotide Sequence Database Collaboration
(INSDC)
54. Scope
NCBI provides RefSeqs for taxonomically diverse organisms including
archaea, bacteria, eukaryotes, and viruses.
Reference sequences are provided for genomes, transcripts, and
proteins. Some targeted loci projects are included in RefSeq, including
RefSeqGene, fungal ITS, and rRNA loci. New or updated records are
added to the collection as data become publicly available.
55. Ensembl www.ensembl.org
Ensembl is a joint project between EMBL-EBI and the Sanger Centre
to develop a software system which produces and maintains automatic
annotation of eukaryotic genomes.
57. Ensembl
In the Ensembl project, sequence data are fed into the gene
annotation system (a collection of software "pipelines" written
in Perl) which creates a set of predicted gene locations and
saves them in a MySQL database for subsequent analysis
and display.
Ensembl makes these data freely accessible to the
world research community.
60. Protein Information Resource (PIR) – Protein Sequence
Database (PIR-PSD):
PIR was established in 1984 by the National Biomedical Research
Foundation (NBRF) as a resource to assist researchers in the identification
and interpretation of protein sequence information.
Prior to that, the NBRF compiled the first comprehensive collection of
macromolecular sequences in the Atlas of Protein Sequence and Structure,
published from 1965 to 1978 under the editorship of Margaret O. Dayhoff.
61. PIR
For over four decades, beginning with the Atlas of Protein
Sequence and Structure, PIR has provided protein databases and
analysis tools freely accessible to the scientific community
including the Protein Sequence Database (PSD).
In 2002, PIR and its international partners, EBI (European
Bioinformatics Institute) and SIB (Swiss Institute of Bioinformatics),
were awarded a grant from NIH to create UniProt, a single
worldwide database of protein sequence and function, by unifying
the PIR-PSD, Swiss-Prot, and TrEMBL databases.
62. UniProt is produced by the UniProt Consortium, a collaboration
between the European Bioinformatics Institute (EBI), the Swiss
Institute of Bioinformatics (SIB) and the Protein Information
Resource (PIR)
64. UniProt Knowledgebase (UniProtKB)
The UniProt Knowledgebase, the centrepiece of the UniProt
Consortium’s activities, is an expertly and richly curated protein
database, consisting of two sections called UniProtKB/Swiss-Prot and
UniProtKB/TrEMBL.
65. Swiss-Prot
UniProtKB/Swiss-Prot contains high-quality manually annotated and
non-redundant protein sequence records.
Manual annotation consists of analysis, comparison and merging of all
available sequences for a given protein, as well as a critical review of
associated experimental and predicted data.
UniProt curators extract biological information from the literature and
perform numerous computational analyses.
66. Swiss-Prot
UniProtKB/Swiss-Prot aims to provide all known relevant information
about a particular protein. It describes, in a single record, the different
protein products derived from a certain gene from a given species,
including each protein derived by alternative splicing, polymorphisms
and/or post-translational modifications.
Protein families and groups are regularly reviewed to keep up with
current scientific findings.
67. UniProtKB/Swiss-Prot entry name
The entry name has the form X_Y, where
X is a mnemonic code for the protein name and Y is a mnemonic code for the species,
for example INS_HUMAN, INS1_MOUSE and INS2_MOUSE:
INS = insulin
HUMAN = species (human)
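The X_Y convention means an entry name can be split at the underscore. A minimal sketch, using the entry names from the slide:

```python
def split_entry_name(entry_name):
    """Split a Swiss-Prot style entry name into (protein code, species code)."""
    protein, sep, species = entry_name.partition("_")
    if not sep or not protein or not species:
        raise ValueError(f"not in X_Y form: {entry_name!r}")
    return protein, species


examples = ["INS_HUMAN", "INS1_MOUSE", "INS2_MOUSE"]
parsed = [split_entry_name(name) for name in examples]
```

Here `parsed` shows that the two mouse insulin entries differ in the protein code while sharing the species code.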
69. UniProtKB/TrEMBL
UniProtKB/TrEMBL contains high-quality computationally analysed
records enriched with automatic annotation and classification.
Records are selected for full manual annotation and integration into
UniProtKB/Swiss-Prot according to defined annotation priorities.
The default raw sequence data for UniProtKB are:
DDBJ/ENA/GenBank coding sequence (CDS) translations, the
sequences of PDB structures, sequences from Ensembl and
RefSeq, data derived from amino acid sequences that are directly
submitted to UniProtKB or scanned from the literature.
70. UniProt Reference Clusters (UniRef)
Three UniRef databases – UniRef100, UniRef90 and UniRef50 –
merge sequences automatically across species. UniRef100 is based
on all UniProtKB records.
UniRef100 is produced by clustering all these records by sequence
identity. Identical sequences and sub-fragments are presented as a
single UniRef100 entry with accession numbers of all the merged
entries, the protein sequence, links to the corresponding UniProtKB
and archive records. UniRef90 and UniRef50 are built from UniRef100
to provide records with mutual sequence identity of 90% or more, or
50% or more, respectively
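The UniRef100 idea of merging identical sequences and sub-fragments into one entry can be illustrated with a toy example. This is only a sketch under simplifying assumptions: sub-fragments are detected by plain substring matching, the sequences and accessions are invented, and the real UniRef pipeline (including how UniRef90 and UniRef50 are clustered) is considerably more involved.

```python
def cluster_identical(records):
    """Merge identical sequences and sub-fragments.

    records: list of (accession, sequence) pairs.
    Returns a list of clusters, each with a representative sequence and member accessions.
    """
    clusters = []
    # Longest first, so a fragment always meets its containing sequence
    # before it could found a cluster of its own.
    for accession, seq in sorted(records, key=lambda r: -len(r[1])):
        for cluster in clusters:
            if seq in cluster["sequence"]:
                cluster["members"].append(accession)
                break
        else:
            clusters.append({"sequence": seq, "members": [accession]})
    return clusters


# Hypothetical records, for illustration only.
records = [("P1", "MKTAYIAK"), ("P2", "MKTAYIAK"), ("P3", "TAYI"), ("P4", "GGGG")]
clusters = cluster_identical(records)
```

The identical pair P1/P2 and the fragment P3 end up in one cluster, while P4 stays on its own.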
71. UniProt Archive (UniParc)
UniParc is designed to capture all publicly available protein sequence
data and contains all the protein sequences from the main publicly
available protein sequence databases.
UniParc handles all sequences simply as text strings – sequences that
are 100% identical over their entire length are merged regardless of
whether they are from the same or different species.
UniParc also provides sequence versions, which are incremented
every time the underlying sequence changes.
73. UniProt Metagenomic and Environmental
Sequences (UniMES)
The availability of metagenomic data has necessitated the creation of
a separate database, UniMES, to store sequences which are
recovered directly from environmental samples.
The predicted proteins from this dataset are combined with automatic
classification by InterPro.
74. Since 1971, the Protein Data Bank archive (PDB) has served as the single
repository of information about the 3D structures of proteins, nucleic acids, and
complex assemblies.
75. Protein Data Bank
RCSB PDB (Research Collaboratory for Structural Bioinformatics PDB)
operates the US data center for the global PDB archive.
77. Protein Data Bank Japan
Supports browsing in multiple languages such as Japanese, Chinese,
and Korean; offers SeSAW, which identifies functionally or evolutionarily conserved
motifs by locating and annotating sequence and structural similarities;
and provides tools for bioinformaticians, and more.
78. Research Collaboratory for
Structural Bioinformatics Protein
Data Bank
Simple and advanced searching for macromolecules and ligands,
tabular reports, specialized visualization tools, sequence-structure
comparisons, RCSB PDB Mobile, Molecule of the Month and other
educational resources at PDB-101, and more.
79. Biological Magnetic Resonance
Data Bank
Collects NMR data from any experiment and captures assigned chemical
shifts, coupling constants, and peak lists for a variety of macromolecules;
contains derived annotations such as hydrogen exchange rates, pKa
values, and relaxation parameters.
80. Protein Data Bank in Europe
Rich information about all PDB entries, multiple search and browse
facilities, advanced services including PDBePISA, PDBeFold and
PDBeMotif, advanced visualisation and validation of NMR and EM
structures, tools for bioinformaticians.
86. PDB file format
HEADER, TITLE and AUTHOR records
provide information about the researchers who defined the structure;
numerous other types of records are available to provide other types of
information
REMARK records
can contain free-form annotation, but they also accommodate standardized
information
SEQRES records
give the sequences of the three peptide chains (named A, B and C), which
87. ATOM records
describe the coordinates of the atoms that are part of the protein. For
example, the first ATOM line above describes the alpha-N atom of the first
residue of peptide chain A, which is a proline residue; the first three
floating point numbers are its x, y and z coordinates and are in units of
Ångströms. The next three columns are the occupancy, temperature
factor, and the element name, respectively.
HETATM records
describe coordinates of hetero-atoms, that is those atoms which are not
part of the protein molecule.
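ATOM and HETATM records use fixed columns, so they are usually read by slicing rather than by splitting on whitespace. The sketch below assumes the standard PDB column layout (serial in columns 7-11, coordinates in columns 31-54, occupancy and temperature factor in columns 55-66, element in columns 77-78). The example line is built with explicit padding so the columns are guaranteed to line up, and its values are illustrative.

```python
def parse_atom_record(line):
    """Parse one fixed-column ATOM or HETATM line into a dict."""
    record = line[0:6].strip()
    if record not in ("ATOM", "HETATM"):
        raise ValueError("not an ATOM/HETATM record")
    return {
        "record": record,
        "serial": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),  # coordinates in Angstroms
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "temp_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }


# Illustrative line: nitrogen atom of a proline, residue 1 of chain A.
example = (
    "ATOM".ljust(6) + "1".rjust(5) + " " + " N  " + " " + "PRO" + " " + "A"
    + "1".rjust(4) + " " * 4
    + "8.316".rjust(8) + "21.206".rjust(8) + "21.530".rjust(8)
    + "1.00".rjust(6) + "17.44".rjust(6) + " " * 10 + "N".rjust(2)
)
atom = parse_atom_record(example)
```

Slicing by column matters because adjacent fields can touch without a separating space, for example when coordinates are large or negative.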
90. Human genome project
The Human Genome Project, a large, federally funded collaborative
project, completed the sequencing of the entire human genome in 2003.
The project was initially funded by the DOE and NIH.
The Human Genome Project originally aimed to map the nucleotides
contained in a human haploid reference genome (more than three billion).
The "genome" of any given individual is unique; mapping the "human
genome" involved sequencing a small number of individuals and then
assembling these together to get a complete sequence for each
chromosome. Therefore, the finished human genome is a mosaic, not
representing any one individual.
91. Project goals were to
● identify all the approximately 20,500 genes in human DNA,
● determine the sequences of the 3 billion chemical base pairs that
make up human DNA,
● store this information in databases,
● improve tools for data analysis,
● transfer related technologies to the private sector, and
● address the ethical, legal, and social issues (ELSI) that may arise
from the project.
93. BAC-end sequencing
The widely agreed-upon strategy for sequencing the human genome is
based on the use of BACs that carry fragments of human DNA from
known locations in the genome
GRAIL
GRAIL (Gene Recognition and Assembly Internet Link) is one of the most
widely used computer programs for identifying potential genes in DNA
sequence and for general DNA sequence analysis.
94. Race between HGP and Celera
The entry of Celera Genomics into the human genome sequencing arena
in 1998 galvanised the public effort, leading to a race to sequence the
human genome.
Celera used the skills of computer scientist Eugene W. Myers to apply a whole-genome
shotgun cloning approach and intensive computer processing of
data to complete the Drosophila sequence and then the human genome
sequence.
Craig Venter aimed to sequence and assemble the entire human genome
by 2001, and to make the information available only to paying customers.
96. Impacts Of The HGP
Molecular medicine.
Energy sources and environmental applications.
Risk assessment.
Bioarchaeology, anthropology, evolution, and human migration.
DNA forensics (identification)
Agriculture, livestock breeding, and bioprocessing.
98. Molecular Modeling Database (MMDB)
The Molecular Modeling DataBase (MMDB) is a database of
experimentally determined three-dimensional biomolecular structures,
and is also referred to as the Entrez Structure database.
It is a subset of three-dimensional structures obtained from the RCSB
Protein Data Bank (PDB), excluding theoretical models.
99. Functional insights
Experimentally resolved structures of proteins, RNA, and DNA, derived
from the Protein Data Bank (PDB), with value-added features such as
explicit chemical graphs, computationally identified 3D domains (compact
substructures) that are used to identify similar 3D structures, as well as
links to literature, similar sequences, information about chemicals bound
to the structures, and more.
These connections make it possible, for example, to find 3D structures for
homologs of a protein sequence of interest, then interactively view the
sequence-structure relationships, active sites, bound chemicals, journal
articles, and more.
100. CBLAST
A tool that compares a query protein sequence against all protein
sequences from experimentally resolved 3D structures, by using
protein BLAST against the PDB data set
101. IBIS (Inferred Biomolecular Interaction Server)
For a given protein sequence or structure query, IBIS reports protein-protein,
protein-small molecule, protein-nucleic acid and protein-ion
interactions observed in experimentally determined structural
biological assemblies. IBIS also infers/predicts interacting partners and
binding sites by homology, by inspecting the protein complexes
formed by close homologs of a given query.
102. VAST (Vector Alignment Search Tool)
The original VAST finds structures that are similar in 3D to individual protein
molecules or individual 3D domains.
104. Database description
The SCOP database aims to provide a detailed and comprehensive
description of the structural and evolutionary relationships between
proteins whose three-dimensional structure is known and deposited in
the Protein Data Bank. The main levels of the classification are:
● Family
● Superfamily
● Fold
● Class
105. Family
Family groups closely related proteins with clear evidence for their
evolutionary origin. In most cases, their relationship is detectable with
current sequence comparison methods, e.g. BLAST, PSI-BLAST,
HMMER.
106. Superfamily
Superfamily brings together more distantly related protein domains.
Their similarity is frequently limited to common structural features that,
along with a conserved architecture of active or binding sites or
similar modes of oligomerization, suggest a probable evolutionary
ancestry.
107. Fold
Fold groups superfamilies on the basis of the global structural
features shared by the majority of their members. These features
are the composition of the secondary structures in the domain
core, their architecture and topology. Fold is an attribute of a
superfamily but the constituent families of some superfamilies that
have evolved distinct structural features can belong to a different
fold.
108. Class
Classes bring together folds and IUPRs with different secondary
structural content.
These include all-alpha and all-beta proteins, containing
predominantly alpha-helices and beta-strands, respectively, and
‘mixed’ alpha and beta classes (a/b) and (a+b) with respectively
112. CATH database
The CATH Protein Structure Classification database is a free,
publicly available online resource that provides information on the
evolutionary relationships of protein domains. It was created in the
mid-1990s.
CATH shares many broad features with the SCOP resource;
however, there are also many areas in which the detailed
classification differs greatly.
113. Classification
The domains are classified within the CATH structural hierarchy:
● Class (C) level: domains are assigned according to their secondary structure content, i.e. all
alpha, all beta, a mixture of alpha and beta, or little secondary structure.
● Architecture (A) level: information on the secondary structure arrangement in
three-dimensional space is used for assignment.
● Topology/fold (T) level: information on how the secondary structure elements
are connected and arranged is used.
● Homologous superfamily (H) level: assignments are made if there is good evidence that
the domains are related by evolution, i.e. they are homologous.
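The four levels can be read directly off a dotted CATH node identifier. The sketch below is illustrative only: it assumes identifiers of the form C.A.T.H (for example 3.40.50.300), and the numeric class labels in CLASS_NAMES are an assumption of this example rather than something stated on the slide. The function names are hypothetical.

```python
from typing import List, NamedTuple

# Assumed numbering of the four classes named in the text.
CLASS_NAMES = {
    1: "mainly alpha",
    2: "mainly beta",
    3: "mixed alpha and beta",
    4: "little secondary structure",
}


class CathNode(NamedTuple):
    class_number: int            # C level
    architecture: int            # A level
    topology: int                # T level
    homologous_superfamily: int  # H level


def parse_cath_id(cath_id: str) -> CathNode:
    """Split a dotted C.A.T.H identifier into its four integer levels."""
    parts = cath_id.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated numbers, got {cath_id!r}")
    return CathNode(*(int(p) for p in parts))


def lineage(node: CathNode) -> List[str]:
    """Return the identifiers of the C, C.A, C.A.T and C.A.T.H levels."""
    values = [str(v) for v in node]
    return [".".join(values[: i + 1]) for i in range(4)]


if __name__ == "__main__":
    node = parse_cath_id("3.40.50.300")
    print(CLASS_NAMES.get(node.class_number, "unknown class"))
    print(lineage(node))
```

Two domains share a fold when their identifiers agree on the first three numbers, and they belong to the same homologous superfamily only when all four agree.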
116. OMIM
OMIM is a comprehensive, authoritative compendium of human genes
and genetic phenotypes that is freely available and updated daily. The
full-text, referenced overviews in OMIM contain information on all known
Mendelian disorders and over 15,000 genes. OMIM focuses on the
relationship between phenotype and genotype, and its entries contain
copious links to other genetics resources.
117. OMIM
Each OMIM entry has a full-text summary of a genetically determined
phenotype and/or gene and has numerous links to other genetic
databases such as DNA and protein sequence, PubMed references,
general and locus-specific mutation databases, HUGO nomenclature,
MapViewer, GeneTests, patient support groups and many others.
OMIM is an easy and straightforward portal to the burgeoning
information in human genetics.
118. OMIM
OMIM has been available online since 1987, first from Johns Hopkins
University and since 1995 from the NCBI.
OMIM can be searched from its homepage or from any page in the NCBI
Entrez suite of databases. Information in OMIM can be retrieved by
queries on MIM number, disorder, gene name and/or symbol, or plain
English.
Each OMIM entry is assigned a unique six-digit number whose first digit
indicates whether its inheritance is autosomal, X-linked, Y-linked or
mitochondrial.
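As a rough illustration of the numbering scheme, the sketch below maps the first digit of a MIM number to an inheritance category. The digit assignments (1, 2 and 6 for autosomal, 3 for X-linked, 4 for Y-linked, 5 for mitochondrial) follow OMIM's published numbering convention as understood here and should be checked against the OMIM documentation. The function name and the sample numbers are hypothetical.

```python
# Assumed first-digit convention; verify against the OMIM documentation.
MIM_FIRST_DIGIT = {
    "1": "autosomal",
    "2": "autosomal",
    "3": "X-linked",
    "4": "Y-linked",
    "5": "mitochondrial",
    "6": "autosomal",
}


def mim_inheritance(mim_number) -> str:
    """Return the inheritance category implied by a six-digit MIM number."""
    text = str(mim_number).strip()
    if len(text) != 6 or not text.isdigit():
        raise ValueError(f"MIM numbers have six digits, got {mim_number!r}")
    try:
        return MIM_FIRST_DIGIT[text[0]]
    except KeyError:
        raise ValueError(f"no category known for first digit {text[0]!r}") from None


if __name__ == "__main__":
    for number in (100000, 300000, 400000, 500000):
        print(number, mim_inheritance(number))
```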
121. MalaCards: The human disease database
MalaCards is an integrated database of human maladies and their
annotations, modeled on the architecture and richness of the popular
GeneCards database of human genes.
The MalaCards disease and disorders database is organized into "disease
cards", each integrating prioritized information, and listing numerous known
aliases for each disease, along with a variety of annotations, as well as
inter-disease connections, empowered by the GeneCards relational
database, searches, and GeneAnalytics set-analyses.
122. Cont…
Annotations include: symptoms, drugs, articles, genes, clinical trials,
related diseases/disorders and more.
An automatic computational information retrieval engine populates the
disease cards, using remote data, as well as information gleaned using
the GeneCards platform to compile the disease database.
The MalaCards disease database integrates both specialized and general
disease lists, including rare diseases, genetic diseases, complex
disorders and more.
124. TIGR
TIGR provides a collection of curated databases containing DNA and protein
sequence, gene expression, cellular role, protein family, and
taxonomic data for microbes, plants and humans.
The CMR (Comprehensive Microbial Resource) contains analyses of
completed microbial genome sequences.
127. KEGG
This concept is realized in the following databases of KEGG, which
are categorized into systems, genomic, chemical, and health
information.
Systems information
● PATHWAY — pathway maps for cellular and organismal functions
● MODULE — modules or functional units of genes
● BRITE — hierarchical classifications of biological entities
128. Genomic information
● GENOME — complete genomes
● GENES — genes and proteins in the complete genomes
● ORTHOLOGY — ortholog groups of genes in the complete
genomes
129. Chemical information
● COMPOUND, GLYCAN — chemical compounds and glycans
● REACTION, RPAIR, RCLASS — chemical reactions
● ENZYME — enzyme nomenclature
130. Health information
● DISEASE — human diseases
● DRUG — approved drugs
● ENVIRON — crude drugs and health-related substances
131. Microbial Genome Database
● MBGD is a workbench system for comparative analysis of completely
sequenced microbial genomes.
● The central function of MBGD is to create an orthologous gene classification
table using precomputed all-against-all similarity relationships among genes
in multiple genomes.
● The number of completed microbial genome sequences has grown rapidly in
recent years, and nearly a hundred genomes at various levels of
relatedness are already available.
132. Cont…
● Especially interesting are the recently available multiple genomes of some
particular taxonomic groups such as proteobacteria gamma subdivision and
Bacillus/Clostridium group in gram-positive bacteria.
● Since the first release in 1997, MBGD has been developed under a different
concept: it provides a classification system rather than a classification result
itself.
● The key components of MBGD include:
● (i) an algorithm that can classify genes into orthologous groups using
precomputed all-against-all homology search results
● (ii) a user interface that is designed for users to explore the resulting
classification in detail, and
133. Cont…
● MBGD uses the MySQL database management system to store most of the data,
including similarity relationships as well as cluster tables created on demand.
Figure 1
Tree splitting procedure for ortholog grouping
in MBGD. In this figure, nine genes (A1, B1 etc.)
in five organisms (A–E) are classified into two
clusters. In this example, the root node is split
because three out of four organisms are
duplicated in both of the subtrees. The cutoff
ratio of duplicated organisms in each root node
is a parameter of our algorithm.
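The split decision described in the caption can be sketched as follows. This is a simplified illustration, not MBGD's implementation: it assumes the ratio is the number of organisms present in both subtrees divided by the number of organisms in the smaller subtree, it assumes gene labels such as "A1" whose first letter names the organism, and the default cutoff of 0.5 is arbitrary. The gene layout is hypothetical, chosen only so that three of four organisms are duplicated.

```python
from typing import Iterable, Set


def organisms(genes: Iterable[str]) -> Set[str]:
    """Collect organism codes, assuming labels like 'A1' (organism A, gene 1)."""
    return {gene[0] for gene in genes}


def duplicated_ratio(left: Iterable[str], right: Iterable[str]) -> float:
    """Fraction of organisms duplicated in both subtrees.

    Assumption of this sketch: the denominator is the organism count of
    the smaller subtree.
    """
    left_orgs, right_orgs = organisms(left), organisms(right)
    smaller = min(len(left_orgs), len(right_orgs))
    if smaller == 0:
        return 0.0
    return len(left_orgs & right_orgs) / smaller


def should_split(left, right, cutoff: float = 0.5) -> bool:
    """Split the node when the duplicated-organism ratio reaches the cutoff."""
    return duplicated_ratio(left, right) >= cutoff


if __name__ == "__main__":
    # Hypothetical layout: nine genes from five organisms (A to E).
    left_subtree = ["A1", "B1", "C1", "D1"]
    right_subtree = ["A2", "B2", "C2", "E1", "E2"]
    print(duplicated_ratio(left_subtree, right_subtree))
    print(should_split(left_subtree, right_subtree))
```

A high ratio suggests the two subtrees arose from a gene duplication that predates the divergence of the organisms, so they are treated as separate ortholog clusters.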
135. FASTA and BLAST
The number of DNA and protein sequences in public databases is very
large. Searching a database involves aligning the query sequence to each
sequence in the database to find significant local alignments.
BLAST and FASTA are two similarity searching programs that identify
homologous DNA and protein sequences based on excess sequence
similarity.
They provide facilities for comparing DNA and protein sequences with
existing DNA and protein databases.
They are the two major heuristic algorithms for performing database searches.
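To make "local alignment" concrete, here is a minimal sketch of the exact dynamic-programming score (Smith-Waterman style) that heuristic tools such as BLAST and FASTA approximate for speed. It is not the algorithm either program uses. The scoring values (match 2, mismatch -1, linear gap -2) and the function names are assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple


def local_alignment_score(query: str, subject: str,
                          match: int = 2, mismatch: int = -1,
                          gap: int = -2) -> int:
    """Best local alignment score between two sequences (linear gap cost)."""
    previous = [0] * (len(subject) + 1)
    best = 0
    for q in query:
        current = [0]
        for j, s in enumerate(subject, start=1):
            diagonal = previous[j - 1] + (match if q == s else mismatch)
            # A local alignment may restart anywhere, hence the floor of 0.
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


def rank_database(query: str, database: Dict[str, str]) -> List[Tuple[str, int]]:
    """Score the query against every database sequence, best hit first."""
    scored = [(name, local_alignment_score(query, seq))
              for name, seq in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    toy_database = {"seq1": "TTTTTTTT", "seq2": "GGACGTGG"}  # made-up sequences
    print(rank_database("ACGT", toy_database))
```

The cost of filling this table for every database sequence grows with the product of the query and database lengths, which is why exhaustive searching is impractical at database scale and heuristics are used instead.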
136. FASTA
FASTA stands for “FAST-All” or “Fast Alignment”.
FASTA is a DNA and protein sequence alignment software package
first described by David J. Lipman and William R. Pearson in 1985.
It was the first database similarity search tool developed, preceding
the development of BLAST.
It is used to search for similarities between DNA sequences and between
protein sequences.
137. Variants of FASTA
fasta - scan a protein or DNA sequence library for similar sequences.
fastx - compare a DNA sequence to a protein sequence database, comparing
the translated DNA sequence in forward and reverse frames; frameshifts are
allowed only between codons.
tfastx - compare a protein sequence to a DNA sequence database, calculating
similarities with frameshifts in the forward and reverse orientations.
fasty - like fastx, but frameshifts are also allowed within codons, which is
slower but copes better with low-quality sequences.
tfasty - like tfastx, but frameshifts are also allowed within codons.
138. Variants of FASTA
fasts - compare unordered peptides to a protein sequence database
tfasts - compare unordered peptides to a translated DNA sequence
database
fastm - compare ordered peptides (or short DNA sequences) to a protein
(DNA) sequence database
tfastm - compare ordered peptides (or short DNA sequences) to a
translated DNA sequence database
fastf - compare mixed peptides to a protein sequence database
141. FASTA sequence format
A FASTA record begins with a description line that starts with the ‘>’ symbol,
followed by the gi and accession numbers and then the description, all on a
single line. The sequence starts on the next line and is wrapped at a fixed
width, commonly 60 to 80 nucleotides/amino acids per row (the example below
uses 70). DNA and protein sequences are written in the one-letter IUPAC
nucleotide and amino acid codes.
Example
>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
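A minimal sketch of reading and writing this format is shown below. It assumes plain-text input in which every record starts with a '>' line, and it treats the part of the header before the first space as a '|'-separated identifier. The function names are hypothetical, and real files can contain variations that this does not handle.

```python
from typing import List, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs; wrapped sequence rows are joined."""
    records = []
    header = None
    chunks: List[str] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_header(header: str) -> Tuple[List[str], str]:
    """Split a header into its '|'-separated identifier fields and description."""
    identifier, _, description = header.partition(" ")
    return identifier.split("|"), description


def format_fasta(header: str, sequence: str, width: int = 60) -> str:
    """Write one record, wrapping the sequence at `width` characters per row."""
    rows = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + header] + rows) + "\n"


if __name__ == "__main__":
    sample = ">id1 toy record\nACGTAC\nGT\n"  # made-up record
    for record_header, record_sequence in parse_fasta(sample):
        print(split_header(record_header), len(record_sequence))
```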
142. BLAST (Basic Local Alignment Search Tool)
The BLAST program was developed by Stephen Altschul and colleagues at the
NCBI in 1990 and has since become one of the most popular programs for
sequence analysis.
BLAST uses heuristics to align a query sequence with all sequences in a
database.
143. BLAST
BLAST is more time-efficient than FASTA because it searches only for the more
significant patterns in the sequences, yet it retains comparable sensitivity.
BLAST is also often used as part of other algorithms that require
approximate sequence matching.
145. Variants of BLAST
BLAST-N: compares a nucleotide sequence against nucleotide sequences
BLAST-P: compares a protein sequence against protein sequences
BLAST-X: compares the six-frame translations of a nucleotide sequence against protein sequences
tBLAST-N: compares a protein sequence against the six-frame translations of
nucleotide sequences
tBLAST-X: compares the six-frame translations of a nucleotide sequence against
the six-frame translations of nucleotide sequences
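The "six-frame translation" used by the translated variants can be sketched as follows: three reading frames on the given strand and three on its reverse complement. The sketch assumes the standard genetic code and an unambiguous A/C/G/T sequence; codons it cannot look up are written as X. The function names are hypothetical.

```python
from typing import Dict

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from the first base; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna: str) -> Dict[str, str]:
    """Translate frames +1, +2, +3 and -1, -2, -3 of a DNA sequence."""
    frames = {}
    for sign, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = translate(strand[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGGCCATTGTAATGGGCCGC").items():
        print(frame, protein)
```

Translating in all six frames lets a nucleotide sequence be compared at the protein level without knowing in advance which strand or frame encodes the protein.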
147. Megablast
Large numbers of query sequences (megablast)
When comparing large numbers of input sequences via the command-line
BLAST, "megablast" is much faster than running BLAST multiple times.
148. Uses of BLAST
Identifying species: BLAST can help you identify a species or find
homologous species. This can be useful, for example,
when you are working with a DNA sequence from an unknown species.
Locating domains: When working with a protein sequence, you can input it
into BLAST to locate known domains within the sequence of interest.
Establishing phylogeny: Using the results received through BLAST, you can
create a phylogenetic tree using the BLAST web page.
DNA mapping: When working with a known species and looking to sequence
a gene at an unknown location, BLAST can compare the chromosomal position
of the sequence of interest to relevant sequences in the database(s).
150. E-value
The BLAST E-value is the number of hits of similar quality
(score) that would be expected just by chance.
An E-value of 10 means that up to 10 hits can be expected to be found just by
chance in a random database of the same size.
The E-value can be used as a first quality filter for BLAST search results, to
obtain only results equal to or better than the number given by the -evalue
option. BLAST results are sorted by E-value by default (best hit in the first line).
151. Cont…
-evalue 1e-50
small E-value: few hits, but of high quality
BLAST hits with an E-value smaller than 1e-50 are database matches of very high quality.
-evalue 0.01
BLAST hits with an E-value smaller than 0.01 can still be considered good hits for homology
matches.
-evalue 10 (default)
large E-value: many hits, partly of low quality
Hits with an E-value smaller than 10 include hits that cannot be considered significant, but may give
an idea of potential relations.
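The effect of score and database size on the E-value, and the use of a threshold as a filter, can be sketched as follows. The formula E = m × n × 2^(−S′), with m the query length, n the database length and S′ the bit score, is the standard Karlin-Altschul relation and is not taken from the slides. Actual BLAST output uses length-corrected values, so reported numbers will differ. The hit tuples and function names are hypothetical.

```python
from typing import List, Tuple

Hit = Tuple[str, float]  # (subject identifier, E-value)


def expected_hits(bit_score: float, query_length: int, database_length: int) -> float:
    """Approximate E-value: E = m * n * 2**(-S') for bit score S'."""
    return query_length * database_length * 2.0 ** (-bit_score)


def filter_hits(hits: List[Hit], max_evalue: float = 10.0) -> List[Hit]:
    """Keep hits at or below the threshold, best (smallest E-value) first."""
    kept = [hit for hit in hits if hit[1] <= max_evalue]
    return sorted(kept, key=lambda hit: hit[1])


if __name__ == "__main__":
    # The same bit score is less significant in a larger database.
    print(expected_hits(40.0, 300, 1_000_000))
    print(expected_hits(40.0, 300, 2_000_000))

    made_up_hits = [("subjA", 3.0), ("subjB", 1e-60), ("subjC", 0.004)]
    print(filter_hits(made_up_hits, max_evalue=0.01))
```

Because the E-value scales with the size of the search space, a threshold chosen for one database is not directly comparable to the same threshold on a much larger or smaller one.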