Bioinformatics
Introduction
Bioinformatics is an interdisciplinary field mainly involving molecular
biology and genetics, computer science, mathematics, and statistics.
Data intensive, large-scale biological problems are addressed from a
computational point of view.
The most common problems are modeling biological processes at the
molecular level and making inferences from collected data.
A bioinformatics solution usually involves the following steps:
● Collect statistics from biological data.
● Build a computational model.
● Solve a computational modeling problem.
● Test and evaluate a computational algorithm.
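As a small illustration of the first step (collecting statistics from biological data), the sketch below counts the bases in a DNA string and computes its GC fraction. The function name and the sample sequence are made up for this example.

```python
from collections import Counter


def base_composition(sequence):
    """Return per-base counts and the GC fraction of a DNA sequence."""
    seq = sequence.upper()
    counts = Counter(seq)
    # Only the four standard bases contribute to the total.
    total = sum(counts[base] for base in "ACGT")
    gc = counts["G"] + counts["C"]
    gc_fraction = gc / total if total else 0.0
    return counts, gc_fraction


# Hypothetical sequence, used only to exercise the function.
counts, gc_fraction = base_composition("ATGCGCGATTAA")
print(dict(counts), round(gc_fraction, 3))
```

Statistics like these are typically the input to the later modeling steps.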
Applications of bioinformatics
Bioinformatics plays a vital role in the areas of structural genomics, functional
genomics, and nutritional genomics.
It covers emerging scientific research and the exploration of proteomes from the
overall level of intracellular protein composition (protein profiles), protein structure,
protein-protein interaction, and unique activity patterns (e.g. post-translational
modifications).
Bioinformatics is used for transcriptome analysis where mRNA expression levels
can be determined.
Bioinformatics is used to identify and structurally modify a natural product, to
design a compound with the desired properties and to assess its therapeutic
effects, theoretically.
Cheminformatics analysis includes analyses such as similarity searching,
clustering, QSAR modeling, virtual screening, etc.
Bioinformatics is playing an increasingly important role in almost all aspects of
drug discovery and drug development.
Bioinformatics tools are very effective in prediction, analysis and interpretation of
clinical and preclinical findings.
Molecular Medicine
The human genome will have profound effects on the fields of biomedical
research and clinical medicine.
The completion of the human genome and the use of bioinformatic tools means
that we can search for the genes directly associated with different diseases and
begin to understand the molecular basis of these diseases more clearly.
This new knowledge of the molecular mechanisms of disease will enable better
treatments, cures and even preventative tests to be developed.
Gene therapy
In the not too distant future, with the use of bioinformatics tools, the
potential for using genes themselves to treat disease may become a
reality.
Gene therapy is the approach used to treat, cure or even prevent disease
by changing the expression of a person’s genes.
Homology modelling and protein drug discovery
At present all drugs on the market target only about 500 proteins.
With an improved understanding of disease mechanisms and using
computational tools to identify and validate new drug targets, more
specific medicines that act on the cause, not merely the symptoms, of the
disease can be developed.
These highly specific drugs promise to have fewer side effects than many
of today’s medicines.
Microbial genome applications
The arrival of the complete genome sequences and their potential to
provide a greater insight into the microbial world and its capacities could
have broad and far reaching implications for environment, health, energy
and industrial applications.
By studying the genetic material of these organisms, scientists can begin
to understand these microbes at a very fundamental level and isolate the
genes that give them their unique abilities to survive under extreme
conditions.
Antibiotic resistance
Scientists have been examining the genome of Enterococcus faecalis, a leading
cause of bacterial infection among hospital patients.
They have discovered a virulence region made up of a number of antibiotic-resistance
genes that may contribute to the bacterium’s transformation from a
harmless gut bacterium to a menacing invader.
The discovery of the region, known as a pathogenicity island, could provide useful
markers for detecting pathogenic strains and help to establish controls to prevent
the spread of infection in wards.
Evolutionary studies
The sequencing of genomes from all three domains of life, eukaryota, bacteria and
archaea means that evolutionary studies can be performed in a quest to
determine the tree of life and the last universal common ancestor.
Crop improvement
Comparative genetics of the plant genomes has shown that the organisation of
their genes has remained more conserved over evolutionary time than was
previously believed.
These findings suggest that information obtained from the model crop systems
can be used to suggest improvements to other food crops.
At present the complete genomes of Arabidopsis thaliana (thale cress) and Oryza
sativa (rice) are available.
Biological Databases- Types and Importance
As the volume of genomic data grows, sophisticated computational methodologies are
required to manage the data deluge.
A biological database is a large, organized body of persistent data, usually associated with
computerized software designed to update, query, and retrieve components of the data stored
within the system.
A simple database might be a single file containing many records, each of which includes the
same set of information.
The chief objective of the development of a database is to organize data in a set of structured records
to enable easy retrieval of information.
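The idea of a simple database as a single file of uniform records can be sketched in a few lines of Python. The tab-separated layout, the field names and the accession values below are hypothetical and chosen only for illustration.

```python
import csv
import io

# Hypothetical flat file: every record carries the same set of fields.
RAW = (
    "accession\torganism\tlength\n"
    "X00001\tEscherichia coli\t1200\n"
    "X00002\tHomo sapiens\t850\n"
)


def load_records(text):
    """Parse the flat file and index the records by accession for fast lookup."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["accession"]: row for row in reader}


db = load_records(RAW)
print(db["X00002"]["organism"])
```

Indexing the records by a unique key is what makes retrieval easy, which is the objective stated above.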
Types of Biological Databases
Based on their contents, biological databases can be roughly divided into
two categories:
● Primary databases
● Secondary databases
1. Primary databases
Primary databases are also called archival databases.
They are populated with experimentally derived data such as
nucleotide sequence, protein sequence or macromolecular structure.
Experimental results are submitted directly into the database by
researchers, and the data are essentially archival in nature.
Once given a database accession number, the data in primary
databases are never changed: they form part of the scientific record.
Examples of primary databases
● GenBank from NCBI (National Center for Biotechnology Information)
● ENA from EMBL
● DDBJ
● Protein Data Bank (PDB; coordinates of three-dimensional
macromolecular structures)
2. Secondary databases
Secondary databases comprise data derived from the results of
analysing primary data.
Secondary databases often draw upon information from numerous
sources, including other databases (primary and secondary),
controlled vocabularies and the scientific literature.
They are highly curated, often using a complex combination of
computational algorithms and manual analysis and interpretation to
derive new knowledge from the public record of science.
Examples of secondary databases
● RefSeq from NCBI
● Ensembl (variation, function, regulation and more layered onto whole
genome sequences)
● TrEMBL and Swiss-Prot from UniProt
Specialized Databases
There are also specialized databases that cater to a particular research
interest, such as a specific organism or disease. Examples include:
● FlyBase
● HIV Sequence Database
● Ribosomal Database Project
Importance of Databases
Databases allow knowledge discovery, which refers to the identification of connections between pieces of
information that were not known when the information was first entered.
This facilitates the discovery of new biological insights from raw data.
Secondary databases have become the molecular biologist’s reference library over the past decade.
They provide a wealth of information on just about any gene or gene product that has been investigated
by the research community.
Databases handle cases where many users want to access the same data entries.
They allow the data to be indexed.
They help to remove redundancy in the data.
GenBank
The GenBank sequence database is an open access, annotated collection of all
publicly available nucleotide sequences and their protein translations.
It is produced and maintained by the National Center for
Biotechnology Information (NCBI; a part of the National Institutes of
Health in the United States) as part of the International Nucleotide
Sequence Database Collaboration (INSDC).
https://www.ncbi.nlm.nih.gov/genbank/
GenBank introduction
GenBank and its collaborators receive sequences produced in laboratories throughout
the world from more than 100,000 distinct organisms. The database was started in 1982 by
Walter Goad and Los Alamos National Laboratory.
GenBank has become an important database for research in biological fields
and has grown in recent years at an exponential rate, doubling roughly
every 18 months.
As of 15 June 2019, GenBank release 232.0 has 213,383,758 loci and
329,835,282,370 bases, from 213,383,758 reported sequences.
In recent years, divisions have been added to support specific
sequencing strategies.
These include divisions for expressed sequence tag (EST), genome
survey (GSS), high throughput genomic (HTG), high throughput cDNA
(HTC), and environmental sample (ENV) sequences, making a total of
18 divisions.
Submissions overview
Only original sequences can be submitted to GenBank.
Direct submissions are made to GenBank using BankIt, which is a
Web-based form, or the stand-alone submission program, Sequin.
Upon receipt of a sequence submission, the GenBank staff
examines the originality of the data, assigns an accession
number to the sequence, and performs quality assurance checks.
The submissions are then released to the public database, where
the entries are retrievable by Entrez or downloadable by FTP.
Submission using BankIt
About one-third of author submissions are received through NCBI's web-based
data submission tool, BankIt. Using BankIt,
authors enter sequence information directly into a form and add biological
annotations such as coding regions or mRNA features.
BankIt validates submissions, flagging many common errors, and checks
for vector contamination using a variant of BLAST called Vecscreen,
before creating a draft record in GenBank flat file format for the submitter
to review. BankIt is the tool of choice for simple submissions, especially
when only one or a small number of records is to be submitted.
BankIt can also be used by submitters to update their existing GenBank
records.
Submission using Sequin
NCBI also offers a standalone multi-platform submission program called
Sequin that can be used interactively with other NCBI
sequence retrieval and analysis tools.
Sequin handles simple sequences such as a cDNA, as well as segmented
entries, phylogenetic studies, population studies, mutation studies,
environmental samples, and alignments for which BankIt and other web-
based submission tools are not well suited.
Sequin has convenient editing and complex annotation capabilities and
contains a number of built-in validation functions for quality assurance.
Submission via tbl2asn
Submitters of large, heavily annotated genomes may find it convenient
to use ‘tbl2asn’.
This program converts a table of annotations generated via an annotation pipeline
into an ASN.1 record suitable for submission to GenBank.
Sequence identifiers and accession numbers
Each GenBank record, consisting of both a sequence and its
annotations, is assigned a unique identifier, the accession number,
that is shared across the three collaborating databases (GenBank,
DDBJ, EMBL) and remains constant over the lifetime of the record.
Each version of the DNA sequence within a GenBank record is also
assigned a unique NCBI identifier, called a ‘gi’, that appears on the
VERSION line of GenBank flatfile records following the accession
number:
ACCESSION AF000001
VERSION AF000001.1 GI: 987654321
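The VERSION line shown above can be split into its parts with a short regular expression. This is a minimal sketch that assumes the line has the shape of the slide's example (accession, dot, version number, optional GI); the function name and the pattern are illustrative, not an NCBI-provided parser.

```python
import re

# accession.version, optionally followed by "GI: <number>"
VERSION_RE = re.compile(
    r"^VERSION\s+(?P<accession>[A-Z_]+\d+)\.(?P<version>\d+)"
    r"(?:\s+GI:\s*(?P<gi>\d+))?"
)


def parse_version_line(line):
    """Return the accession, sequence version and GI (or None) from a VERSION line."""
    match = VERSION_RE.match(line)
    if match is None:
        raise ValueError(f"not a VERSION line: {line!r}")
    gi = match.group("gi")
    return {
        "accession": match.group("accession"),
        "version": int(match.group("version")),
        "gi": int(gi) if gi else None,
    }


print(parse_version_line("VERSION AF000001.1 GI: 987654321"))
```

The accession stays constant for the record, while the version suffix changes whenever the sequence itself changes.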
RETRIEVING GenBank DATA
The Entrez system
The sequence records in GenBank are accessible via Entrez, a flexible
database retrieval system that covers over 30
biological databases.
These include DNA and protein sequences derived from GenBank and
other sources, genome maps, population, phylogenetic and environmental
sequence sets, gene expression data, the NCBI taxonomy, protein
domain information, and protein structures from the Molecular Modeling
Database (MMDB). Each database is linked to the scientific literature via
PubMed and PubMed Central.
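Entrez can also be queried programmatically through the NCBI E-utilities. The sketch below only builds an EFetch URL for a nucleotide record and makes no network request. The helper name is made up, and the accession is the placeholder from the VERSION example, which may not correspond to a real record.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(accession, db="nuccore", rettype="gb", retmode="text"):
    """Build an EFetch URL that asks for one record in GenBank flat-file format."""
    query = urlencode(
        {"db": db, "id": accession, "rettype": rettype, "retmode": retmode}
    )
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


# Placeholder accession taken from the slide's example.
print(efetch_url("AF000001.1"))
```

The resulting URL can be passed to any HTTP client. NCBI's usage guidelines (request rate limits, API keys) should be checked before running queries in bulk.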
Obtaining GenBank by FTP
NCBI distributes GenBank releases in the traditional flat-file format as
well as in the Abstract Syntax Notation (ASN.1) format used for
internal maintenance.
The complete bimonthly GenBank release and the daily updates,
which also incorporate sequence data from EMBL and DDBJ, are
available by anonymous FTP from NCBI as well
as from a mirror site at Indiana University.
European Nucleotide Archive
The European Nucleotide Archive (ENA) is a repository providing free and
unrestricted access to annotated DNA and RNA sequences. It also stores
complementary information such as experimental procedures, details of
sequence assembly and other metadata related to sequencing projects.
http://www.ebi.ac.uk/ena/
Database Structure
The archive is composed of three main databases:
● The Sequence Read Archive,
● The Trace Archive and
● EMBL Nucleotide Sequence Database (also known as EMBL-bank).
The ENA is produced and maintained by the European Bioinformatics
Institute and is a member of the International Nucleotide Sequence
Database Collaboration (INSDC) along with the DNA Data Bank of Japan
and GenBank.
Data access and management
The data contained in the ENA can be accessed manually or
programmatically via REST URL through the ENA browser.
Initially limited to the Sequence Read Archive, the ENA browser now also
provides access to the Trace Archive and EMBL-Bank, allowing file
retrieval in a range of formats including XML, HTML, FASTA and FASTQ.
Individual records can be accessed using their accession numbers, and
other text queries are enabled through the EB-eye search engine.
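Programmatic access by REST URL can be sketched as below. The URL pattern (base address, then format, then accession) is an assumption based on the ENA Browser API and should be checked against the current ENA documentation. The helper name and the accession are placeholders.

```python
# Assumed base address of the ENA Browser API; verify before relying on it.
ENA_BROWSER_API = "https://www.ebi.ac.uk/ena/browser/api"

# Formats assumed for this sketch; the real service may offer others.
SUPPORTED_FORMATS = {"fasta", "embl", "xml"}


def ena_record_url(accession, fmt="fasta"):
    """Build a retrieval URL for one ENA record in the requested format."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    return f"{ENA_BROWSER_API}/{fmt}/{accession}"


# Placeholder accession, not necessarily a real record.
print(ena_record_url("AF000001"))
```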
SRA
The ENA operates an instance of the Sequence Read Archive (SRA), an
archival repository of sequence reads and analyses which are intended
for public release. Originally called the Short Read Archive, the name was
changed in anticipation of future sequencing technologies being able to
produce longer sequence reads.
The preferred data format for files submitted to the SRA is the BAM
format, which is capable of storing both aligned and unaligned reads.
Storage
As of 2012, the ENA's storage requirements continue to grow
exponentially, with a doubling time of approximately 10 months.
To manage this increase, the ENA selectively discards less-valuable
sequencing platform data and implements advanced compression
strategies.
The CRAM reference-based compression toolkit was developed to help
reduce ENA storage requirements.
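To get a feel for what a 10-month doubling time means, the growth factor over any period follows from 2 raised to (elapsed time / doubling time). The sketch below applies this to the ENA figure and, for comparison, to the 18-month doubling time quoted earlier for GenBank. The function name is made up for this example.

```python
def growth_factor(elapsed_months, doubling_months):
    """Factor by which an exponentially growing archive expands over a period."""
    return 2 ** (elapsed_months / doubling_months)


# One year of growth under each doubling time.
ena_per_year = growth_factor(12, 10)      # roughly 2.3-fold
genbank_per_year = growth_factor(12, 18)  # roughly 1.6-fold
print(round(ena_per_year, 2), round(genbank_per_year, 2))
```

Under these assumptions the ENA more than doubles every year, which is why compression and selective retention become necessary.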
DNA Data Bank of Japan
http://www.ddbj.nig.ac.jp/
DDBJ began data bank activities in 1986 at NIG and remains the only
nucleotide sequence data bank in Asia.
Currently, the DDBJ Center is in operation at the Research Organization of
Information and Systems, National Institute of Genetics (NIG) in
Mishima, Japan, with the endorsement of MEXT (the Japanese Ministry of
Education, Culture, Sports, Science and Technology).
DDBJ, expanding its DNA databank activities, was restructured as one of the
Intellectual Infrastructure Project Centers of NIG, being separated from CIB.
In collaboration with the National Bioscience Database Center (NBDC), the DDBJ Center
started to operate the Japanese Genotype-phenotype Archive (JGA), an archive for all types of
individual-level genetic and de-identified phenotypic data from human subjects.
ARSA is a high-speed retrieval system for sequence and annotation data maintained
by the DNA Data Bank of Japan (DDBJ).
Sequence Data
Transition
Tools
RefSeq
The Reference Sequence (RefSeq) collection provides a comprehensive,
integrated, non-redundant, well-annotated set of sequences, including
genomic DNA, transcripts, and proteins. RefSeq sequences form a
foundation for medical, functional, and diversity studies.
They provide a stable reference for genome annotation, gene
identification and characterization, mutation and polymorphism analysis
(especially RefSeqGene records), expression studies, and comparative
analyses.
RefSeq genomes are copies of selected assembled genomes available in
GenBank.
Main features of the RefSeq collection include:
● non-redundancy
● explicitly linked nucleotide and protein sequences
● updates to reflect current knowledge of sequence data and biology
● data validation and format consistency
● distinct accession series (all accessions include an underscore '_'
character)
● ongoing curation by NCBI staff and collaborators, with reviewed
records indicated
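The distinct accession series can be recognised mechanically by the underscore. A minimal sketch follows. The prefix table lists only a few common RefSeq prefixes and is not complete, and the accessions in the usage lines are illustrative.

```python
import re

# Partial table of RefSeq accession prefixes (not exhaustive).
REFSEQ_PREFIXES = {
    "NC": "complete genomic molecule",
    "NG": "genomic region",
    "NM": "mRNA",
    "NR": "non-coding RNA",
    "NP": "protein",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
}

# Two letters, an underscore, an identifier, and an optional version suffix.
REFSEQ_RE = re.compile(r"^([A-Z]{2})_([A-Z0-9]+)(?:\.(\d+))?$")


def classify_accession(accession):
    """Tell RefSeq accessions (with underscore) apart from other accessions."""
    match = REFSEQ_RE.match(accession)
    if match is None:
        return ("non-RefSeq", None)
    return ("RefSeq", REFSEQ_PREFIXES.get(match.group(1), "other RefSeq type"))


print(classify_accession("NM_000207.3"))
print(classify_accession("AF000001.1"))
```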
RefSeq transcript and protein records are generated by
several processes including:
● Computation
Eukaryotic Genome Annotation Pipeline
Prokaryotic Genome Annotation Pipeline
● Manual curation
● Propagation from annotated genomes that are submitted to members
of the International Nucleotide Sequence Database Collaboration
(INSDC)
Scope
NCBI provides RefSeqs for taxonomically diverse organisms including
archaea, bacteria, eukaryotes, and viruses.
Reference sequences are provided for genomes, transcripts, and
proteins. Some targeted loci projects are included in RefSeq, including
RefSeqGene, fungal ITS, and rRNA loci. New or updated records are
added to the collection as data become publicly available.
Ensembl www.ensembl.org
Ensembl is a joint project between EMBL-EBI and the Sanger Centre
to develop a software system which produces and maintains automatic
annotation of eukaryotic genomes.
In the Ensembl project, sequence data are fed into the gene
annotation system (a collection of software "pipelines" written
in Perl) which creates a set of predicted gene locations and
saves them in a MySQL database for subsequent analysis
and display.
Ensembl makes these data freely accessible to the
world research community.
Protein Databases
Protein Information Resource (PIR) – Protein Sequence Database (PIR-PSD):
UniProt
Protein Data Bank
PROSITE
PRINTS
Pfam
Protein Information Resource (PIR) – Protein Sequence
Database (PIR-PSD):
PIR was established in 1984 by the National Biomedical Research
Foundation (NBRF) as a resource to assist researchers in the identification
and interpretation of protein sequence information.
Prior to that, the NBRF compiled the first comprehensive collection of
macromolecular sequences in the Atlas of Protein Sequence and Structure,
published from 1965-1978 under the editorship of Margaret O. Dayhoff
PIR
For over four decades, beginning with the Atlas of Protein
Sequence and Structure, PIR has provided protein databases and
analysis tools freely accessible to the scientific community
including the Protein Sequence Database (PSD).
In 2002 PIR, along with its international partners, EBI (European
Bioinformatics Institute) and SIB (Swiss Institute of Bioinformatics),
was awarded a grant from NIH to create UniProt, a single
worldwide database of protein sequence and function, by unifying
the PIR-PSD, Swiss-Prot, and TrEMBL databases.
UniProt is produced by the UniProt Consortium, a collaboration
between the European Bioinformatics Institute (EBI), the Swiss
Institute of Bioinformatics (SIB) and the Protein Information
Resource (PIR).
UniProt comprises four components
UniProt Knowledgebase (UniProtKB)
The UniProt Knowledgebase, the centrepiece of the UniProt
Consortium’s activities, is an expertly and richly curated protein
database, consisting of two sections called UniProtKB/Swiss-Prot and
UniProtKB/TrEMBL.
Swiss-Prot
UniProtKB/Swiss-Prot contains high-quality manually annotated and
non-redundant protein sequence records.
Manual annotation consists of analysis, comparison and merging of all
available sequences for a given protein, as well as a critical review of
associated experimental and predicted data.
UniProt curators extract biological information from the literature and
perform numerous computational analyses.
Swiss-Prot
UniProtKB/Swiss-Prot aims to provide all known relevant information
about a particular protein. It describes, in a single record, the different
protein products derived from a certain gene from a given species,
including each protein derived by alternative splicing, polymorphisms
and/or post-translational modifications.
Protein families and groups are regularly reviewed to keep up with
current scientific findings.
UniProtKB/Swiss-Prot entry name
An entry name has the form X_Y, where
X is a mnemonic code for the protein name and Y is a mnemonic code for the species name;
see for example INS_HUMAN, INS1_MOUSE and INS2_MOUSE.
INS = insulin
HUMAN = species (human)
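The X_Y convention makes entry names easy to take apart. A minimal sketch, with a made-up function name, that splits an entry name at the underscore:

```python
def parse_entry_name(entry_name):
    """Split a Swiss-Prot style entry name into (protein code, species code)."""
    protein, separator, species = entry_name.partition("_")
    if not separator or not protein or not species:
        raise ValueError(f"not of the form X_Y: {entry_name!r}")
    return protein, species


for name in ("INS_HUMAN", "INS1_MOUSE", "INS2_MOUSE"):
    print(name, "->", parse_entry_name(name))
```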
UniProtKB/TrEMBL
UniProtKB/TrEMBL contains high-quality computationally analysed
records enriched with automatic annotation and classification.
Records are selected for full manual annotation and integration into
UniProtKB/Swiss-Prot according to defined annotation priorities.
The default raw sequence data for UniProtKB are:
DDBJ/ENA/GenBank coding sequence (CDS) translations, the
sequences of PDB structures, sequences from Ensembl and
RefSeq, data derived from amino acid sequences that are directly
submitted to UniProtKB or scanned from the literature.
UniProt Reference Clusters (UniRef)
Three UniRef databases – UniRef100, UniRef90 and UniRef50 –
merge sequences automatically across species. UniRef100 is based
on all UniProtKB records.
UniRef100 is produced by clustering all these records by sequence
identity. Identical sequences and sub-fragments are presented as a
single UniRef100 entry with accession numbers of all the merged
entries, the protein sequence, links to the corresponding UniProtKB
and archive records. UniRef90 and UniRef50 are built from UniRef100
to provide records with mutual sequence identity of 90% or more, or
50% or more, respectively.
UniProt Archive (UniParc)
UniParc is designed to capture all publicly available protein sequence
data and contains all the protein sequences from the main publicly
available protein sequence databases.
UniParc handles all sequences simply as text strings – sequences that
are 100% identical over their entire length are merged regardless of
whether they are from the same or different species.
UniParc also provides sequence versions, which are incremented
every time the underlying sequence changes.
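Treating sequences as plain text strings means that merging identical entries reduces to grouping by the string itself. The sketch below illustrates that idea only. The source identifiers and sequences are hypothetical, and this is not UniParc's actual implementation.

```python
from collections import defaultdict


def merge_identical(records):
    """Group source identifiers by sequence; identical strings share one entry."""
    merged = defaultdict(list)
    for source_id, sequence in records:
        # Case is normalised so the comparison is on residues, not formatting.
        merged[sequence.upper()].append(source_id)
    return dict(merged)


# Hypothetical records from different source databases and species.
records = [
    ("dbA:P1", "MKTAYIAK"),
    ("dbB:Q9", "mktayiak"),
    ("dbC:Z3", "MKTAYIAR"),
]
print(merge_identical(records))
```

Because species plays no part in the key, two organisms with the same protein sequence end up in one merged entry, as described above.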
UniProt Metagenomic and Environmental
Sequences (UniMES)
The availability of metagenomic data has necessitated the creation of
a separate database, UniMES, to store sequences which are
recovered directly from environmental samples.
The predicted proteins from this dataset are combined with automatic
classification by InterPro.
Protein Data Bank
Since 1971, the Protein Data Bank (PDB) archive has served as the single
repository of information about the 3D structures of proteins, nucleic acids, and
complex assemblies.
RCSB PDB (Research Collaboratory for Structural Bioinformatics PDB)
operates the US data center for the global PDB archive.
Protein Data Bank Japan
Supports browsing in multiple languages such as Japanese, Chinese,
and Korean; SeSAW identifies functionally or evolutionarily conserved
motifs by locating and annotating sequence and structural similarities,
tools for bioinformaticians, and more.
Research Collaboratory for Structural Bioinformatics Protein Data Bank
Simple and advanced searching for macromolecules and ligands,
tabular reports, specialized visualization tools, sequence-structure
comparisons, RCSB PDB Mobile, Molecule of the Month and other
educational resources at PDB-101, and more.
Biological Magnetic Resonance Data Bank
Collects NMR data from any experiment and captures assigned chemical
shifts, coupling constants, and peak lists for a variety of macromolecules;
contains derived annotations such as hydrogen exchange rates, pKa
values, and relaxation parameters.
Protein Data Bank in Europe
Rich information about all PDB entries, multiple search and browse
facilities, advanced services including PDBePISA, PDBeFold and
PDBeMotif, advanced visualisation and validation of NMR and EM
structures, tools for bioinformaticians.
PDB file formats
mmCIF
PDB file
PDBML
PDB file format
HEADER, TITLE and AUTHOR records
provide information about the researchers who defined the structure;
numerous other types of records are available to provide other types of
information
REMARK records
can contain free-form annotation, but they also accommodate standardized
information
SEQRES records
give the sequences of the peptide chains (named A, B and C in the example structure).
ATOM records
describe the coordinates of the atoms that are part of the protein. For
example, the first ATOM line of the example structure describes the alpha-N atom of the first
residue of peptide chain A, which is a proline residue; the first three
floating point numbers are its x, y and z coordinates and are in units of
Ångströms. The next three columns are the occupancy, temperature
factor, and the element name, respectively.
HETATM records
describe coordinates of hetero-atoms, that is those atoms which are not
part of the protein molecule.
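ATOM and HETATM records are fixed-column text, so they can be read by slicing. The sketch below assumes the standard column layout of the PDB format (x, y, z in columns 31-54, occupancy in 55-60, temperature factor in 61-66, element in 77-78). The sample line is built in code with illustrative values rather than copied from a real entry, and both function names are made up.

```python
def make_atom_line(serial, name, res_name, chain, res_seq,
                   x, y, z, occupancy, temp_factor, element):
    """Build a fixed-column ATOM line so that the column positions are explicit."""
    return (
        f"{'ATOM':<6}{serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}"
        f"    {x:8.3f}{y:8.3f}{z:8.3f}{occupancy:6.2f}{temp_factor:6.2f}"
        f"{'':10}{element:>2}"
    )


def parse_atom_line(line):
    """Slice an ATOM/HETATM line by column (0-based slices of 1-based columns)."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "temp_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }


# Illustrative backbone nitrogen of a proline in chain A.
sample = make_atom_line(1, " N", "PRO", "A", 1, 8.316, 21.206, 21.530, 1.00, 17.44, "N")
print(sample)
print(parse_atom_line(sample))
```

For real work, an established parser (for example in Biopython) handles the many edge cases of the format. The slicing above only shows how the columns map to fields.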
Molecular visualization software
Cn3D
PyMOL
RasMol
Human Genome Project
The Human Genome Project, a large, federally funded collaborative
project, completed the sequencing of the entire human genome in 2003.
The project was initially funded by the DOE and NIH.
The Human Genome Project originally aimed to map the nucleotides
contained in a human haploid reference genome (more than three billion).
The "genome" of any given individual is unique; mapping the "human
genome" involved sequencing a small number of individuals and then
assembling these together to get a complete sequence for each
chromosome. Therefore, the finished human genome is a mosaic, not
representing any one individual.
Project goals were to
● identify all the approximately 20,500 genes in human DNA,
● determine the sequences of the 3 billion chemical base pairs that
make up human DNA,
● store this information in databases,
● improve tools for data analysis,
● transfer related technologies to the private sector, and
● address the ethical, legal, and social issues (ELSI) that may arise
from the project.
Approach
BAC-end sequencing
The widely agreed-upon strategy for sequencing the human genome is
based on the use of BACs that carry fragments of human DNA from
known locations in the genome
GRAIL
GRAIL (Gene Recognition and Assembly Internet Link) is one of the most
widely used computer programs for identifying potential genes in DNA
sequence and for general DNA sequence analysis.
Race between HGP and Celera
The entry of Celera Genomics into the human genome sequencing arena
in 1998 galvanised the public effort, leading to a race to sequence the
human genome.
Celera utilized the skills of computer scientist Eugene W. Myers to apply a whole-genome
shotgun cloning approach and intensive computer processing of
data to complete the Drosophila sequence and then the human genome
sequence.
Craig Venter aimed to sequence and assemble the entire human genome
by 2001, and only make the information available to paying customers.
Impacts Of The HGP
Molecular medicine.
Energy sources and environmental applications.
Risk assessment.
Bioarchaeology, anthropology, evolution, and human migration.
DNA forensics (identification)
Agriculture, livestock breeding, and bioprocessing.
Molecular Modeling Database
The Molecular Modeling DataBase (MMDB) is a database of
experimentally determined three-dimensional biomolecular structures,
and is also referred to as the Entrez Structure database.
It is a subset of three-dimensional structures obtained from the RCSB
Protein Data Bank (PDB), excluding theoretical models.
Functional insights
MMDB contains experimentally resolved structures of proteins, RNA, and DNA, derived
from the Protein Data Bank (PDB), with value-added features such as
explicit chemical graphs, computationally identified 3D domains (compact
substructures) that are used to identify similar 3D structures, as well as
links to literature, similar sequences, information about chemicals bound
to the structures, and more.
These connections make it possible, for example, to find 3D structures for
homologs of a protein sequence of interest, then interactively view the
sequence-structure relationships, active sites, bound chemicals, journal
articles, and more.
CBLAST
A tool that compares a query protein sequence against all protein
sequences from experimentally resolved 3D structures, by using
protein BLAST against the PDB data set
IBIS (Inferred Biomolecular Interaction Server)
For a given protein sequence or structure query, IBIS reports protein-protein,
protein-small molecule, protein-nucleic acid and protein-ion
interactions observed in experimentally determined structural
biological assemblies. IBIS also infers interacting partners and
binding sites by homology, by inspecting the protein complexes
formed by close homologs of a given query.
VAST (Vector Alignment Search Tool)
The original VAST finds structures that are similar in 3D to individual protein
molecules or individual 3D domains.
SCOP (Structural Classification of Proteins)
Database description
The SCOP database aims to provide a detailed and comprehensive
description of the structural and evolutionary relationships between
proteins whose three-dimensional structure is known and deposited in
the Protein Data Bank. The main levels of the classification are:
● Family
● Superfamily
● Fold
● Class
Family
Family groups closely related proteins with clear evidence for their
evolutionary origin. In most cases, their relationship is detectable with
current sequence comparison methods, e.g. BLAST, PSI-BLAST,
HMMER.
Superfamily
Superfamily brings together more distantly related protein domains.
Their similarity is frequently limited to common structural features that
along with a conserved architecture of active or binding sites, or
similar modes of oligomerization suggest a probable evolutionary
ancestry.
Fold
Fold groups superfamilies on the basis of the global structural
features shared by the majority of their members. These features
are the composition of the secondary structures in the domain
core, their architecture and topology. Fold is an attribute of a
superfamily but the constituent families of some superfamilies that
have evolved distinct structural features can belong to a different
fold.
Class
Classes bring together folds and IUPRs with different secondary
structural content.
These include all-alpha and all-beta proteins, containing
predominantly alpha-helices and beta-strands, respectively, and
‘mixed’ alpha and beta classes (a/b) and (a+b) with respectively
Example
● Human trypsinogen lineage
● Root: scop
● Class: All beta proteins [48724]
● Fold: Trypsin-like serine proteases [50493]
● Superfamily: Trypsin-like serine proteases [50494]
● Family: Eukaryotic proteases [50514]
● Protein: Trypsin(ogen) [50515]
● Species: Human (Homo sapiens) [TaxId: 9606] [50519]
CATH database
The CATH Protein Structure Classification database is a free,
publicly available online resource that provides information on the
evolutionary relationships of protein domains. It was created in the
mid-1990s.
CATH shares many broad features with the SCOP resource;
however, there are also many areas in which the detailed
classification differs greatly.
Classification
The domains are classified within the CATH structural hierarchy:
● Class (C) level: domains are assigned according to their secondary structure content, i.e. all
alpha, all beta, a mixture of alpha and beta, or little secondary structure.
● Architecture (A) level: information on the secondary structure arrangement in
three-dimensional space is used for assignment.
● Topology/fold (T) level: information on how the secondary structure elements
are connected and arranged is used.
● Homologous superfamily (H) level: assignments are made if there is good evidence
that the domains are related by evolution, i.e. they are homologous.
OMIM - Online Mendelian Inheritance in Man
OMIM is a comprehensive, authoritative compendium of human genes
and genetic phenotypes that is freely available and updated daily. The
full-text, referenced overviews in OMIM contain information on all known
Mendelian disorders and over 15,000 genes. OMIM focuses on the
relationship between phenotype and genotype.
The entries contain copious links to other
genetics resources.
OMIM
Each OMIM entry has a full-text summary of a genetically determined
phenotype and/or gene and has numerous links to other genetic
databases such as DNA and protein sequence, PubMed references,
general and locus-specific mutation databases, HUGO nomenclature,
MapViewer, GeneTests, patient support groups and many others.
OMIM is an easy and straightforward portal to the burgeoning
information in human genetics.
OMIM
OMIM has been available online since 1987, first from Johns Hopkins
University and since 1995 from the NCBI.
OMIM can be searched from its homepage or from any page in the NCBI
Entrez suite of databases. Information in OMIM can be retrieved by
queries on MIM number, disorder, gene name and/or symbol, or plain
English.
Each OMIM entry is assigned a unique six-digit number whose first digit
indicates whether its inheritance is autosomal, X-linked, Y-linked or
mitochondrial.
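The first-digit rule can be sketched as a small lookup. The mapping below follows the OMIM numbering convention as commonly described (1 and 2 for older autosomal entries, 6 for autosomal entries created after May 1994); it is not stated on the slides, so treat the table as an assumption to check against the OMIM FAQ. The function name and the sample numbers are illustrative only.

```python
# Assumed meaning of the first digit of a six-digit MIM number.
MIM_PREFIX_MEANING = {
    "1": "autosomal dominant (older entries)",
    "2": "autosomal recessive (older entries)",
    "3": "X-linked",
    "4": "Y-linked",
    "5": "mitochondrial",
    "6": "autosomal (newer entries)",
}


def inheritance_class(mim_number: str) -> str:
    """Map a six-digit MIM number to the category implied by its first digit."""
    if len(mim_number) != 6 or not mim_number.isdigit():
        raise ValueError(f"not a six-digit MIM number: {mim_number!r}")
    try:
        return MIM_PREFIX_MEANING[mim_number[0]]
    except KeyError:
        raise ValueError(f"unrecognised leading digit in {mim_number!r}") from None


for number in ("143100", "310200", "600000"):
    print(number, "->", inheritance_class(number))
```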
MalaCards: The human disease database
MalaCards is an integrated database of human maladies and their
annotations, modeled on the architecture and richness of the popular
GeneCards database of human genes.
The MalaCards disease and disorders database is organized into "disease
cards", each integrating prioritized information, and listing numerous known
aliases for each disease, along with a variety of annotations, as well as
inter-disease connections, empowered by the GeneCards relational
database, searches, and GeneAnalytics set-analyses.
Cont…
Annotations include: symptoms, drugs, articles, genes, clinical trials,
related diseases/disorders and more.
An automatic computational information retrieval engine populates the
disease cards, using remote data, as well as information gleaned using
the GeneCards platform to compile the disease database.
The MalaCards disease database integrates both specialized and general
disease lists, including rare diseases, genetic diseases, complex
disorders and more.
TIGR Database
The Institute for Genomic Research
TIGR
TIGR provides a collection of curated databases containing DNA and
protein sequence, gene expression, cellular role, protein family, and
taxonomic data for microbes, plants and humans.
The CMR (Comprehensive Microbial Resource) contains analyses of
completed microbial genome sequences.
Metabolic pathway Database - KEGG
KEGG
KEGG is organized into the following databases, which
are categorized into systems, genomic, chemical, and health
information.
Systems information
● PATHWAY — pathway maps for cellular and organismal functions
● MODULE — modules or functional units of genes
● BRITE — hierarchical classifications of biological entities
Genomic information
● GENOME — complete genomes
● GENES — genes and proteins in the complete genomes
● ORTHOLOGY — ortholog groups of genes in the complete
genomes
Chemical information
● COMPOUND, GLYCAN — chemical compounds and glycans
● REACTION, RPAIR, RCLASS — chemical reactions
● ENZYME — enzyme nomenclature
Health information
● DISEASE — human diseases
● DRUG — approved drugs
● ENVIRON — crude drugs and health-related substances
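The four categories above can be kept as a simple lookup table, for example when tagging database names in a pipeline. This is only a sketch of the listing on the slides; the function name `category_of` is hypothetical and is not part of any KEGG client.

```python
# Database names grouped exactly as in the listing above.
KEGG_DATABASES = {
    "systems": ["PATHWAY", "MODULE", "BRITE"],
    "genomic": ["GENOME", "GENES", "ORTHOLOGY"],
    "chemical": ["COMPOUND", "GLYCAN", "REACTION", "RPAIR", "RCLASS", "ENZYME"],
    "health": ["DISEASE", "DRUG", "ENVIRON"],
}


def category_of(database: str) -> str:
    """Return the information category that a KEGG database belongs to."""
    name = database.upper()
    for category, members in KEGG_DATABASES.items():
        if name in members:
            return category
    raise KeyError(f"unknown KEGG database: {database!r}")


print(category_of("pathway"))
print(category_of("GLYCAN"))
```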
Microbial Genome Database
● MBGD is a workbench system for comparative analysis of completely
sequenced microbial genomes.
● The central function of MBGD is to create an orthologous gene classification
table using precomputed all-against-all similarity relationships among genes
in multiple genomes.
● The number of completed microbial genome sequences has grown rapidly in
recent years, and nearly a hundred genomes at various levels of
relatedness are already available.
● Especially interesting are the recently available multiple genomes of some
particular taxonomic groups, such as the proteobacteria gamma subdivision and
the Bacillus/Clostridium group in gram-positive bacteria.
● Since the first release in 1997, MBGD has been developed under a different
concept: it provides a classification system rather than a classification result
itself.
● The key components of MBGD include –
● (i) an algorithm that can classify genes into orthologous groups using
precomputed all-against-all homology search results
● (ii) a user interface that is designed for users to explore the resulting
classification in detail, and
● MBGD uses MySQL database management system to store most of the data
including similarity relationships as well as cluster tables created on demand.
Figure 1
Tree splitting procedure for ortholog grouping
in MBGD. In this figure, nine genes (A1, B1 etc.)
in five organisms (A–E) are classified into two
clusters. In this example, the root node is split
because three out of four organisms are
duplicated in both of the subtrees. The cutoff
ratio of duplicated organisms in each root node
is a parameter of our algorithm.
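The splitting rule in the caption can be sketched as follows. Several details are assumptions made for illustration: the assignment of the nine genes to the two subtrees, the choice of the smaller subtree's organism count as the denominator of the ratio, and the cutoff of 0.5. None of these are given on the slide.

```python
def organisms(genes):
    """Gene names such as 'A1' are assumed to start with a one-letter organism code."""
    return {gene[0] for gene in genes}


def duplicated_ratio(left, right):
    """Share of organisms present in both subtrees, relative to the smaller subtree."""
    left_orgs, right_orgs = organisms(left), organisms(right)
    shared = left_orgs & right_orgs
    return len(shared) / min(len(left_orgs), len(right_orgs))


def should_split(left, right, cutoff=0.5):
    """Split the root when the duplicated-organism ratio exceeds the cutoff."""
    return duplicated_ratio(left, right) > cutoff


# Hypothetical layout: nine genes from five organisms (A to E) in two subtrees.
left_subtree = ["A1", "B1", "C1", "D1"]
right_subtree = ["A2", "B2", "C2", "E1", "E2"]

print(duplicated_ratio(left_subtree, right_subtree))
print(should_split(left_subtree, right_subtree))
```

With this layout, organisms A, B and C occur in both subtrees while each subtree covers four organisms, which gives the "three out of four" ratio mentioned in the caption.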
FASTA and BLAST
FASTA and BLAST
The number of DNA and protein sequences in public databases is very
large. Searching a database involves aligning the query sequence to each
sequence in the database, to find significant local alignments.
BLAST and FASTA are two similarity searching programs that identify
homologous DNA sequences and proteins on the basis of excess sequence
similarity.
They provide facilities for comparing DNA and protein sequences with the
existing DNA and protein databases.
They are the two major heuristic algorithms for performing database searches.
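Both programs avoid a full alignment against every database sequence by first looking for short exact word matches (k-tuples) and only then examining the regions around them. The sketch below shows just that word-lookup step on toy sequences. It is not the FASTA or BLAST implementation: scoring, neighbourhood words and seed extension are left out, and the sequences and function names are made up for illustration.

```python
from collections import Counter, defaultdict


def index_words(sequence, k):
    """Map every length-k word in the sequence to the positions where it starts."""
    index = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        index[sequence[start:start + k]].append(start)
    return index


def seed_hits(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a length-k word matches exactly."""
    index = index_words(subject, k)
    hits = []
    for q_pos in range(len(query) - k + 1):
        for s_pos in index.get(query[q_pos:q_pos + k], []):
            hits.append((q_pos, s_pos))
    return hits


def best_diagonal(hits):
    """Diagonal (subject_pos - query_pos) carrying the most word matches, with its count."""
    if not hits:
        return None
    counts = Counter(s_pos - q_pos for q_pos, s_pos in hits)
    return counts.most_common(1)[0]


hits = seed_hits("GATTACA", "TTGATTACAGG", k=3)
print(hits)
print(best_diagonal(hits))
```

Many word matches on the same diagonal suggest a gap-free local similarity, which is the kind of region a heuristic search would then align in detail.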
FASTA
FASTA stands for “FAST-All” or “Fast Alignment”.
FASTA is a DNA and protein sequence alignment software package
first described by David J. Lipman and William R. Pearson in 1985.
It was the first database similarity search tool developed, preceding
the development of BLAST.
It is used to search for similarities between sequences of DNA and
proteins.
Variants of FASTA
fasta - scan a protein or DNA sequence library for similar sequences.
fastx - compare a DNA sequence to a protein sequence database, comparing
the translated DNA sequence in forward and reverse frames (frameshifts are
allowed only between codons).
tfastx - compare a protein sequence to a DNA sequence database, calculating
similarities with frameshifts to the forward and reverse orientations.
fasty - same comparison as fastx, but frameshifts are also allowed within
codons.
tfasty - same comparison as tfastx, using the fasty frameshift model.
Variants of FASTA
fasts - compare unordered peptides to a protein sequence database
tfasts - compare unordered peptides to a translated DNA sequence
database
fastm - compare ordered peptides (or short DNA sequences) to a protein
(DNA) sequence database
tfastm - compare ordered peptides (or short DNA sequences) to a
translated DNA sequence database
fastf - compare mixed peptides to a protein sequence database
FASTA sequence format
A FASTA record starts with a description line that begins with the ‘>’ symbol,
followed by the identifiers (here the gi and accession numbers) and then the
description, all on a single line. The sequence starts on the next line and is
wrapped at a fixed width (commonly 60 or 70 residues per row; the example below
uses 70). DNA and protein sequences are written in the one-letter IUPAC
nucleotide and amino acid codes.
Example
>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
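Reading this format takes only a few lines. The following is a minimal sketch of a FASTA reader applied to the record above; the function name `read_fasta` is hypothetical, and established libraries offer more robust parsers.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


EXAMPLE = """\
>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
"""

records = list(read_fasta(io.StringIO(EXAMPLE)))
header, sequence = records[0]

# The identifier block and the free-text description are separated by the first space.
identifier, description = header.split(" ", 1)
fields = identifier.split("|")

print("gi number:", fields[1])
print("accession:", fields[3])
print("description:", description)
print("residues read:", len(sequence))
```

The wrapped sequence lines are joined back into a single string, so the line width used in the file does not affect the result.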
BLAST (Basic Local Alignment Search
Tool)
The BLAST program was developed by Stephen Altschul and colleagues at the
NCBI in 1990 and has since become one of the most popular programs for
sequence analysis.
BLAST uses heuristics to align a query sequence with all sequences in a
database.
BLAST
BLAST is more time-efficient than FASTA because it searches only for the more
significant patterns in the sequences, yet with comparable sensitivity.
BLAST is also often used as part of other algorithms that require
approximate sequence matching.
Variants of BLAST
BLAST-N: compares a nucleotide query sequence with nucleotide sequences
BLAST-P: compares a protein query sequence with protein sequences
BLAST-X: compares the six-frame translations of a nucleotide query sequence
against protein sequences
tBLAST-N: compares a protein query sequence against the six-frame translations of
nucleotide sequences
tBLAST-X: compares the six-frame translations of a nucleotide query sequence
against the six-frame translations of nucleotide sequences
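Which variant to use follows from the type of the query and the type of the database. The sketch below encodes that choice as a lookup; the function name and the `protein_level` flag are illustrative assumptions, while the returned names are the usual lowercase program names.

```python
def choose_blast_program(query_type, database_type, protein_level=False):
    """Pick a BLAST variant from the query and database sequence types.

    query_type and database_type are 'nucleotide' or 'protein'.
    protein_level only matters for nucleotide against nucleotide searches:
    when True, both sides are compared as six-frame translations.
    """
    table = {
        ("nucleotide", "nucleotide"): "tblastx" if protein_level else "blastn",
        ("protein", "protein"): "blastp",
        ("nucleotide", "protein"): "blastx",
        ("protein", "nucleotide"): "tblastn",
    }
    try:
        return table[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            f"unsupported combination: {query_type!r} vs {database_type!r}"
        ) from None


print(choose_blast_program("nucleotide", "protein"))
print(choose_blast_program("protein", "nucleotide"))
```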
Megablast
Large numbers of query sequences (megablast)
When comparing large numbers of input sequences via the command-line
BLAST, "megablast" is much faster than running BLAST multiple times.
Uses of BLAST
Identifying species: With BLAST, you may be able to identify the species a
sequence comes from or find homologous species. This can be useful, for example,
when you are working with a DNA sequence from an unknown species.
Locating domains: When working with a protein sequence you can input it
into BLAST, to locate known domains within the sequence of interest.
Establishing phylogeny: Using the results received through BLAST you can
create a phylogenetic tree using the BLAST web-page.
DNA mapping: When working with a known species, and looking to sequence
a gene at an unknown location, BLAST can compare the chromosomal position
of the sequence of interest, to relevant sequences in the database(s).
E value
The BLAST E-value is the number of expected hits of similar quality
(score) that could be found just by chance.
An E-value of 10 means that up to 10 hits can be expected to be found just by
chance in a random database of the same size.
The E-value can be used as a first quality filter for the BLAST search result, to
obtain only results equal to or better than the number given by the -evalue
option. BLAST results are sorted by E-value by default (best hit in first line).
-evalue 1e-50
small E-value: low number of hits, but of high quality
BLAST hits with an E-value smaller than 1e-50 include database matches of very high quality.
-evalue 0.01
BLAST hits with an E-value smaller than 0.01 can still be considered good hits for homology
matches.
-evalue 10 (default)
large E-value: many hits, partly of low quality
Hits with an E-value smaller than 10 will include hits that cannot be considered significant,
but may give an idea of potential relations.
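The thresholds above amount to a filter on the E-value column of the result list. The sketch below applies such a filter to a few made-up hits and also shows the standard relation between a bit score S' and the E-value, E = m · n · 2^(−S'), where m and n are the query and database lengths. The hit names and values are hypothetical, and a real BLAST run uses effective lengths rather than the raw lengths used here.

```python
def expected_hits(bit_score, query_length, database_length):
    """E-value implied by a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


def filter_hits(hits, max_evalue):
    """Keep hits with an E-value equal to or better than max_evalue, best first."""
    kept = [hit for hit in hits if hit["evalue"] <= max_evalue]
    return sorted(kept, key=lambda hit: hit["evalue"])


# Hypothetical hits, for illustration only.
hits = [
    {"subject": "hitC", "evalue": 4.2},
    {"subject": "hitA", "evalue": 1e-60},
    {"subject": "hitB", "evalue": 0.005},
]

for threshold in (1e-50, 0.01, 10):
    names = [hit["subject"] for hit in filter_hits(hits, threshold)]
    print(f"-evalue {threshold}: {names}")

# Doubling the database length doubles the number of hits expected by chance.
print(expected_hits(20, 1024, 1024), expected_hits(20, 1024, 2048))
```

The last line illustrates why an E-value always has to be read together with the size of the database that was searched.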
Bioinformatics Introduction and Databases

  • 2. Introduction Bioinformatics is an interdisciplinary field mainly involving molecular biology and genetics, computer science, mathematics, and statistics. Data intensive, large-scale biological problems are addressed from a computational point of view. The most common problems are modeling biological processes at the molecular level and making inferences from collected data.
  • 3. A bioinformatics solution usually involves the following steps: ● Collect statistics from biological data. ● Build a computational model. ● Solve a computational modeling problem. ● Test and evaluate a computational algorithm.
  • 4.
  • 5.
  • 6.
  • 7. Applications of bioinformatics Bioinformatics plays a vital role in the areas of structural genomics, functional genomics, and nutritional genomics. It covers emerging scientific research and the exploration of proteomes from the overall level of intracellular protein composition (protein profiles), protein structure, protein-protein interaction, and unique activity patterns (e.g. post-translational modifications).
  • 8. Applications of Bioinformatics Bioinformatics is used for transcriptome analysis where mRNA expression levels can be determined. Bioinformatics is used to identify and structurally modify a natural product, to design a compound with the desired properties and to assess its therapeutic effects, theoretically. Cheminformatics analysis includes analyses such as similarity searching, clustering, QSAR modeling, virtual screening, etc.
  • 9. Bioinformatics is playing an increasingly important role in almost all aspects of drug discovery and drug development. Bioinformatics tools are very effective in prediction, analysis and interpretation of clinical and preclinical findings.
  • 10. Molecular Medicine The human genome will have profound effects on the fields of biomedical research and clinical medicine. The completion of the human genome and the use of bioinformatic tools means that we can search for the genes directly associated with different diseases and begin to understand the molecular basis of these diseases more clearly. This new knowledge of the molecular mechanisms of disease will enable better treatments, cures and even preventative tests to be developed
  • 11. Gene therapy In the not too distant future with the use of bioinformatics tool, the potential for using genes themselves to treat disease may become a reality. Gene therapy is the approach used to treat, cure or even prevent disease by changing the expression of a person’s genes.
  • 12. Homology modelling and protein drug discovery At present all drugs on the market target only about 500 proteins. With an improved understanding of disease mechanisms and using computational tools to identify and validate new drug targets, more specific medicines that act on the cause, not merely the symptoms, of the disease can be developed. These highly specific drugs promise to have fewer side effects than many of today’s medicines.
  • 13. Microbial genome applications The arrival of the complete genome sequences and their potential to provide a greater insight into the microbial world and its capacities could have broad and far reaching implications for environment, health, energy and industrial applications. By studying the genetic material of these organisms, scientists can begin to understand these microbes at a very fundamental level and isolate the genes that give them their unique abilities to survive under extreme conditions.
  • 14. Antibiotic resistance Scientists have been examining the genome of Enterococcus faecalis-a leading cause of bacterial infection among hospital patients. They have discovered a virulence region made up of a number of antibiotic- resistant genes that may contribute to the bacterium’s transformation from a harmless gut bacteria to a menacing invader. The discovery of the region, known as a pathogenicity island, could provide useful markers for detecting pathogenic strains and help to establish controls to prevent the spread of infection in wards.
  • 15. Evolutionary studies The sequencing of genomes from all three domains of life, eukaryota, bacteria and archaea means that evolutionary studies can be performed in a quest to determine the tree of life and the last universal common ancestor.
  • 16. Crop improvement Comparative genetics of the plant genomes has shown that the organisation of their genes has remained more conserved over evolutionary time than was previously believed. These findings suggest that information obtained from the model crop systems can be used to suggest improvements to other food crops. At present the complete genomes of Arabidopsis thaliana (water cress) and Oryza sativa (rice) are available
  • 17.
  • 18. Biological Databases- Types and Importance As the volume of genomic data grows, sophisticated computational methodologies are required to manage the data deluge. A biological database is a large, organized body of persistent data, usually associated with computerized software designed to update, query, and retrieve components of the data stored within the system. A simple database might be a single file containing many records, each of which includes the same set of information. The chief objective of the development of a database is to organize data in a set of structured records to enable easy retrieval of information.
  • 19. Types of Biological Databases Based on their contents, biological databases can be roughly divided into two categories: ● Primary databases ● Secondary databases
  • 20. 1. Primary databases Primary databases are also called archival databases. They are populated with experimentally derived data such as nucleotide sequences, protein sequences or macromolecular structures. Experimental results are submitted directly into the database by researchers, and the data are essentially archival in nature. Once given a database accession number, the data in primary databases are never changed: they form part of the scientific record.
  • 21. Examples for primary databases ● GenBank from NCBI (National Center for Biotechnology Information) ● ENA from EMBL ● DDBJ ● Protein Data Bank (PDB; coordinates of three-dimensional macromolecular structures)
  • 22. 2. Secondary databases Secondary databases comprise data derived from the results of analysing primary data. Secondary databases often draw upon information from numerous sources, including other databases (primary and secondary), controlled vocabularies and the scientific literature. They are highly curated, often using a complex combination of computational algorithms and manual analysis and interpretation to derive new knowledge from the public record of science.
  • 23. Examples for secondary databases ● RefSeq from NCBI ● Ensembl (variation, function, regulation and more layered onto whole genome sequences) ● TrEMBL and Swiss-Prot from UniProt
  • 24. Specialized Databases Specialized databases are those that cater to a particular research interest, such as a particular organism or disease. Examples: ● FlyBase ● HIV sequence database ● Ribosomal Database Project
  • 25. Importance of Databases Databases allow knowledge discovery, which refers to the identification of connections between pieces of information that were not known when the information was first entered. This facilitates the discovery of new biological insights from raw data. Secondary databases have become the molecular biologist’s reference library over the past decade, providing a wealth of information on just about any gene or gene product that has been investigated by the research community. Databases also help when many users want to access the same entries, allow the data to be indexed, and help to remove redundancy.
  • 26. GenBank The GenBank sequence database is an open access, annotated collection of all publicly available nucleotide sequences and their protein translations. It is produced and maintained by the National Center for Biotechnology Information (NCBI; a part of the National Institutes of Health in the United States) as part of the International Nucleotide Sequence Database Collaboration (INSDC). https://www.ncbi.nlm.nih.gov/genbank/
  • 27. GenBank introduction GenBank and its collaborators receive sequences produced in laboratories throughout the world from more than 100,000 distinct organisms. The database was started in 1982 by Walter Goad at Los Alamos National Laboratory. GenBank has become an important database for research in biological fields and has grown in recent years at an exponential rate, doubling roughly every 18 months. As of 15 June 2019, GenBank release 232.0 has 213,383,758 loci and 329,835,282,370 bases from 213,383,758 reported sequences.
  • 28. GenBank introduction In recent years, divisions have been added to support specific sequencing strategies. These include divisions for expressed sequence tag (EST), genome survey (GSS), high throughput genomic (HTG), high throughput cDNA (HTC), and environmental sample (ENV) sequences, making a total of 18 divisions.
  • 29.
  • 30. Submissions overview Only original sequences can be submitted to GenBank. Direct submissions are made to GenBank using BankIt, which is a Web-based form, or the stand-alone submission program, Sequin. Upon receipt of a sequence submission, the GenBank staff checks the originality of the data, assigns an accession number to the sequence, and performs quality assurance checks. The submissions are then released to the public database, where the entries are retrievable by Entrez or downloadable by FTP.
  • 31. Submission using BankIt About one-third of author submissions are received through NCBI's web-based data submission tool, BankIt. Using BankIt, authors enter sequence information directly into a form and add biological annotations such as coding regions or mRNA features. BankIt validates submissions, flagging many common errors, and checks for vector contamination using a variant of BLAST called VecScreen, before creating a draft record in GenBank flat file format for the submitter to review. BankIt is the tool of choice for simple submissions, especially when only one or a small number of records is to be submitted. BankIt can also be used by submitters to update their existing GenBank records.
  • 32. Submission using Sequin NCBI also offers a standalone multi-platform submission program called Sequin that can be used interactively with other NCBI sequence retrieval and analysis tools. Sequin handles simple sequences such as a cDNA, as well as segmented entries, phylogenetic studies, population studies, mutation studies, environmental samples, and alignments for which BankIt and other web-based submission tools are not well suited. Sequin has convenient editing and complex annotation capabilities and contains a number of built-in validation functions for quality assurance.
  • 33. Submission via tbl2asn Submitters of large, heavily annotated genomes may find it convenient to use ‘tbl2asn’, which converts a table of annotations generated via an annotation pipeline into an ASN.1 record suitable for submission to GenBank.
  • 34.
  • 35.
  • 36.
  • 37. Sequence identifiers and accession numbers Each GenBank record, consisting of both a sequence and its annotations, is assigned a unique identifier, the accession number, that is shared across the three collaborating databases (GenBank, DDBJ, EMBL) and remains constant over the lifetime of the record. Each version of the DNA sequence within a GenBank record is also assigned a unique NCBI identifier, called a ‘gi’, that appears on the VERSION line of GenBank flat-file records following the accession number, for example: ACCESSION AF000001 VERSION AF000001.1 GI: 987654321
  • 38. RETRIEVING GenBank DATA The Entrez system The sequence records in GenBank are accessible via Entrez, a flexible database retrieval system that covers over 30 biological databases. These include DNA and protein sequences derived from GenBank and other sources, genome maps, population, phylogenetic and environmental sequence sets, gene expression data, the NCBI taxonomy, protein domain information, and protein structures from the Molecular Modeling Database (MMDB). Each database is linked to the scientific literature via PubMed and PubMed Central.
  • 39. Obtaining GenBank by FTP NCBI distributes GenBank releases in the traditional flat-file format as well as in the Abstract Syntax Notation (ASN.1) format used for internal maintenance. The complete bimonthly GenBank release and the daily updates, which also incorporate sequence data from EMBL and DDBJ, are available by anonymous FTP from NCBI as well as from a mirror site at Indiana University.
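The VERSION line described above can be picked apart with a small parser. This is a minimal sketch, not NCBI code: the function name `parse_version_line` and the regular expression are assumptions made for illustration, and the GI field is treated as optional.

```python
import re

# Matches lines such as "VERSION AF000001.1 GI: 987654321".
# Assumption: the GI part may be absent, so it is optional in the pattern.
VERSION_RE = re.compile(r"^VERSION\s+([A-Za-z0-9_]+)\.(\d+)(?:\s+GI:\s*(\d+))?\s*$")

def parse_version_line(line):
    """Split a flat-file VERSION line into accession, version number and gi."""
    match = VERSION_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a VERSION line: {line!r}")
    accession, version, gi = match.groups()
    return {
        "accession": accession,           # constant over the lifetime of the record
        "version": int(version),          # incremented when the sequence changes
        "gi": int(gi) if gi else None,    # per-version NCBI identifier, if present
    }

print(parse_version_line("VERSION AF000001.1 GI: 987654321"))
```

The accession and gi values are the ones used on the slide and serve only as an example.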
  • 40. European Nucleotide Archive The European Nucleotide Archive (ENA) is a repository providing free and unrestricted access to annotated DNA and RNA sequences. It also stores complementary information such as experimental procedures, details of sequence assembly and other metadata related to sequencing projects http://www.ebi.ac.uk/ena/
  • 41. Database Structure The archive is composed of three main databases: ● The Sequence Read Archive, ● The Trace Archive and ● EMBL Nucleotide Sequence Database (also known as EMBL-bank). The ENA is produced and maintained by the European Bioinformatics Institute and is a member of the International Nucleotide Sequence Database Collaboration (INSDC) along with the DNA Data Bank of Japan and GenBank.
  • 42. Data access and management The data contained in the ENA can be accessed manually or programmatically via REST URL through the ENA browser. Initially limited to the Sequence Read Archive, the ENA browser now also provides access to the Trace Archive and EMBL-Bank, allowing file retrieval in a range of formats including XML, HTML, FASTA and FASTQ. Individual records can be accessed using their accession numbers, and other text queries are enabled through the EB-eye search engine.
  • 43. SRA The ENA operates an instance of the Sequence Read Archive (SRA), an archival repository of sequence reads and analyses which are intended for public release. Originally called the Short Read Archive, the name was changed in anticipation of future sequencing technologies being able to produce longer sequence reads. The preferred data format for files submitted to the SRA is the BAM format, which is capable of storing both aligned and unaligned reads.
  • 44. Storage As of 2012, the ENA's storage requirements continue to grow exponentially, with a doubling time of approximately 10 months. To manage this increase, the ENA selectively discards less-valuable sequencing platform data and implements advanced compression strategies. The CRAM reference-based compression toolkit was developed to help reduce ENA storage requirements.
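The idea behind reference-based compression can be sketched in a few lines: instead of storing every base of an aligned read, store where it aligns and only the bases that differ from the reference. The sketch below is a toy illustration under strong assumptions (substitutions only, no insertions, deletions or quality scores); it is not the CRAM format, and the function names and sequences are made up.

```python
def encode_read(reference, pos, read):
    """Describe a read as (start, length, substitutions) relative to the reference."""
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    # Keep only the positions where the read disagrees with the reference.
    diffs = [(i, base) for i, (ref_base, base) in enumerate(zip(window, read))
             if ref_base != base]
    return pos, len(read), diffs

def decode_read(reference, record):
    """Rebuild the original read from the reference and the stored differences."""
    pos, length, diffs = record
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)

reference = "ACGTACGTACGTACGT"   # hypothetical reference sequence
read = "TACGAACG"                # aligns at position 3 with one substitution
record = encode_read(reference, 3, read)
print(record)
print(decode_read(reference, record) == read)
```

A read that matches the reference exactly needs only its position and length, which is where the storage saving comes from.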
  • 45. DNA Data Bank of Japan http://www.ddbj.nig.ac.jp/ DDBJ began data bank activities in 1986 at NIG and remains the only nucleotide sequence data bank in Asia. Currently, the DDBJ Center is in operation at the Research Organization of Information and Systems, National Institute of Genetics (NIG), in Mishima, Japan, with the endorsement of MEXT (the Japanese Ministry of Education, Culture, Sports, Science and Technology).
  • 46. DDBJ, expanding its DNA databank activities, was restructured as one of the Intellectual Infrastructure Project Centers of NIG, being separated from CIB. Collaborating with NBDC (National Bioscience Database Center), the DDBJ Center started to operate JGA (Japanese Genotype-phenotype Archive), the archive for all types of individual-level genetic and de-identified phenotypic data from human subjects. ARSA is a high-speed retrieval system for sequence and annotation data maintained by the DNA Data Bank of Japan (DDBJ).
  • 47.
  • 49. Tools
  • 50. RefSeq Reference Sequence (RefSeq) collection provides a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins. RefSeq sequences form a foundation for medical, functional, and diversity studies. They provide a stable reference for genome annotation, gene identification and characterization, mutation and polymorphism analysis (especially RefSeqGene records), expression studies, and comparative analyses. RefSeq genomes are copies of selected assembled genomes available in GenBank.
  • 51. Main features of the RefSeq collection include: ● non-redundancy ● explicitly linked nucleotide and protein sequences ● updates to reflect current knowledge of sequence data and biology ● data validation and format consistency ● distinct accession series (all accessions include an underscore '_' character) ● ongoing curation by NCBI staff and collaborators, with reviewed records indicated
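The distinct accession series gives a quick way to tell RefSeq records from archival GenBank records. The following is a minimal sketch: the slide states only that RefSeq accessions contain an underscore; the additional check for a two-letter uppercase prefix (as in NM_, NP_, NC_) and the function name `looks_like_refseq` are assumptions of this example, and the accessions are illustrative.

```python
def looks_like_refseq(accession):
    """Heuristic: RefSeq accessions carry a two-letter prefix and an underscore."""
    base = accession.split(".", 1)[0]        # drop a version suffix such as ".3"
    prefix, sep, rest = base.partition("_")
    return (
        sep == "_"
        and len(prefix) == 2
        and prefix.isalpha()
        and prefix.isupper()
        and rest != ""
    )

for acc in ["NM_000207.3", "NP_000198", "AF000001.1"]:
    print(acc, looks_like_refseq(acc))
```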
  • 52. RefSeq transcript and protein records are generated by several processes including: ● Computation (Eukaryotic Genome Annotation Pipeline, Prokaryotic Genome Annotation Pipeline) ● Manual curation ● Propagation from annotated genomes that are submitted to members of the International Nucleotide Sequence Database Collaboration (INSDC)
  • 53.
  • 54. Scope NCBI provides RefSeqs for taxonomically diverse organisms including archaea, bacteria, eukaryotes, and viruses. Reference sequences are provided for genomes, transcripts, and proteins. Some targeted loci projects are included in RefSeq, including RefSeqGene, fungal ITS, and rRNA loci. New or updated records are added to the collection as data become publicly available.
  • 55. Ensembl www.ensembl.org Ensembl is a joint project between EMBL-EBI and the Sanger Centre to develop a software system which produces and maintains automatic annotation of eukaryotic genomes.
  • 56.
  • 57. Ensembl In the Ensembl project, sequence data are fed into the gene annotation system (a collection of software "pipelines" written in Perl) which creates a set of predicted gene locations and saves them in a MySQL database for subsequent analysis and display. Ensembl makes these data freely accessible to the world research community.
  • 59. Protein Databases ● Protein Information Resource (PIR) – Protein Sequence Database (PIR-PSD) ● UniProt ● Protein Data Bank ● PROSITE ● PRINTS ● Pfam
  • 60. Protein Information Resource (PIR) – Protein Sequence Database (PIR-PSD): PIR was established in 1984 by the National Biomedical Research Foundation (NBRF) as a resource to assist researchers in the identification and interpretation of protein sequence information. Prior to that, the NBRF compiled the first comprehensive collection of macromolecular sequences in the Atlas of Protein Sequence and Structure, published from 1965-1978 under the editorship of Margaret O. Dayhoff
  • 61. PIR For over four decades, beginning with the Atlas of Protein Sequence and Structure, PIR has provided protein databases and analysis tools freely accessible to the scientific community including the Protein Sequence Database (PSD). In 2002 PIR, along with its international partners, EBI (European Bioinformatics Institute) and SIB (Swiss Institute of Bioinformatics), were awarded a grant from NIH to create UniProt, a single worldwide database of protein sequence and function, by unifying the PIR-PSD, Swiss-Prot, and TrEMBL databases.
  • 62. UniProt is produced by the UniProt Consortium, a collaboration between the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR)
  • 64. UniProt Knowledgebase (UniProtKB) The UniProt Knowledgebase, the centrepiece of the UniProt Consortium’s activities, is an expertly and richly curated protein database, consisting of two sections called UniProtKB/Swiss-Prot and UniProtKB/TrEMBL.
  • 65. Swiss-Prot UniProtKB/Swiss-Prot contains high-quality manually annotated and non-redundant protein sequence records. Manual annotation consists of analysis, comparison and merging of all available sequences for a given protein, as well as a critical review of associated experimental and predicted data. UniProt curators extract biological information from the literature and perform numerous computational analyses.
  • 66. Swiss-Prot UniProtKB/Swiss-Prot aims to provide all known relevant information about a particular protein. It describes, in a single record, the different protein products derived from a certain gene from a given species, including each protein derived by alternative splicing, polymorphisms and/or post-translational modifications. Protein families and groups are regularly reviewed to keep up with current scientific findings.
  • 67. UniProtKB/Swiss-Prot entry name The entry name has the form X_Y, where X is the protein name and Y is the species name; see for example INS_HUMAN, INS1_MOUSE and INS2_MOUSE (INS = insulin, HUMAN = species).
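The X_Y convention is easy to handle in code. A minimal sketch, assuming only the form described on the slide; the helper name `split_entry_name` is hypothetical:

```python
def split_entry_name(entry_name):
    """Split a Swiss-Prot style entry name X_Y into (protein, species) codes."""
    protein, sep, species = entry_name.partition("_")
    if not sep or not protein or not species:
        raise ValueError(f"expected the form X_Y, got {entry_name!r}")
    return protein, species

for name in ["INS_HUMAN", "INS1_MOUSE", "INS2_MOUSE"]:
    print(name, "->", split_entry_name(name))
```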
  • 68.
  • 69. UniProtKB/TrEMBL UniProtKB/TrEMBL contains high-quality computationally analysed records enriched with automatic annotation and classification. Records are selected for full manual annotation and integration into UniProtKB/Swiss-Prot according to defined annotation priorities. The default raw sequence data for UniProtKB are: DDBJ/ENA/GenBank coding sequence (CDS) translations, the sequences of PDB structures, sequences from Ensembl and RefSeq, data derived from amino acid sequences that are directly submitted to UniProtKB or scanned from the literature.
  • 70. UniProt Reference Clusters (UniRef) Three UniRef databases – UniRef100, UniRef90 and UniRef50 – merge sequences automatically across species. UniRef100 is based on all UniProtKB records. UniRef100 is produced by clustering all these records by sequence identity. Identical sequences and sub-fragments are presented as a single UniRef100 entry with accession numbers of all the merged entries, the protein sequence, links to the corresponding UniProtKB and archive records. UniRef90 and UniRef50 are built from UniRef100 to provide records with mutual sequence identity of 90% or more, or 50% or more, respectively
  • 71. UniProt Archive (UniParc) UniParc is designed to capture all publicly available protein sequence data and contains all the protein sequences from the main publicly available protein sequence databases. UniParc handles all sequences simply as text strings – sequences that are 100% identical over their entire length are merged regardless of whether they are from the same or different species. UniParc also provides sequence versions, which are incremented every time the underlying sequence changes.
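Treating sequences purely as text strings means the merge step can be pictured as grouping records by their exact sequence. The sketch below illustrates that idea only; it is not UniParc's implementation, and the source identifiers and sequences are invented.

```python
from collections import defaultdict

def merge_identical(records):
    """Group (source_id, sequence) pairs so that identical sequences share one entry."""
    merged = defaultdict(list)
    for source_id, sequence in records:
        # The sequence string itself is the key; species plays no role.
        merged[sequence.upper()].append(source_id)
    return dict(merged)

records = [
    ("dbA:0001", "MKTAYIAKQR"),   # hypothetical entries from different sources
    ("dbB:0457", "MKTAYIAKQR"),   # same string, so merged with the entry above
    ("dbA:0002", "MKTAYIAKQK"),   # one residue differs, so kept separate
]
for sequence, sources in merge_identical(records).items():
    print(sequence, sources)
```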
  • 72.
  • 73. UniProt Metagenomic and Environmental Sequences (UniMES) The availability of metagenomic data has necessitated the creation of a separate database, UniMES, to store sequences which are recovered directly from environmental samples. The predicted proteins from this dataset are combined with automatic classification by InterPro.
  • 74. Since 1971, the Protein Data Bank archive (PDB) has served as the single repository of information about the 3D structures of proteins, nucleic acids, and complex assemblies.
  • 75. Protein Data Bank RCSB PDB (Research Collaboratory for Structural Bioinformatics PDB) operates the US data center for the global PDB archive
  • 76.
  • 77. Protein Data Bank Japan Supports browsing in multiple languages such as Japanese, Chinese, and Korean; SeSAW identifies functionally or evolutionarily conserved motifs by locating and annotating sequence and structural similarities, tools for bioinformaticians, and more.
  • 78. Research Collaboratory for Structural Bioinformatics Protein Data Bank Simple and advanced searching for macromolecules and ligands, tabular reports, specialized visualization tools, sequence-structure comparisons, RCSB PDB Mobile, Molecule of the Month and other educational resources at PDB-101, and more.
  • 79. Biological Magnetic Resonance Data Bank Collects NMR data from any experiment and captures assigned chemical shifts, coupling constants, and peak lists for a variety of macromolecules; contains derived annotations such as hydrogen exchange rates, pKa values, and relaxation parameters.
  • 80. Protein Data Bank in Europe Rich information about all PDB entries, multiple search and browse facilities, advanced services including PDBePISA, PDBeFold and PDBeMotif, advanced visualisation and validation of NMR and EM structures, tools for bioinformaticians.
  • 81.
  • 83.
  • 84.
  • 85.
  • 86. PDB file format ● HEADER, TITLE and AUTHOR records provide information about the researchers who defined the structure; numerous other types of records are available to provide other types of information. ● REMARK records can contain free-form annotation, but they also accommodate standardized information. ● SEQRES records give the sequences of the three peptide chains (named A, B and C).
  • 87. ATOM records describe the coordinates of the atoms that are part of the protein. For example, the first ATOM line above describes the alpha-N atom of the first residue of peptide chain A, which is a proline residue; the first three floating point numbers are its x, y and z coordinates and are in units of Ångströms. The next three columns are the occupancy, temperature factor, and the element name, respectively. HETATM records describe coordinates of hetero-atoms, that is those atoms which are not part of the protein molecule.
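ATOM records are fixed-width, so the fields can be read by column position. The sketch below follows the standard PDB column layout (serial in columns 7-11, atom name 13-16, residue name 18-20, chain 22, residue number 23-26, x/y/z in 31-54, occupancy 55-60, temperature factor 61-66, element 77-78). The sample line is built in code with made-up coordinate values for the alpha-N atom of a proline in chain A; it is not the line from the slide's figure, and both function names are hypothetical.

```python
def format_atom_line(serial, name, res_name, chain, res_seq, x, y, z, occ, temp, element):
    """Build a fixed-width ATOM line (helper used here only to make a sample)."""
    return (
        "ATOM  " + f"{serial:>5}" + " " + f"{name:<4}" + " " + f"{res_name:>3}"
        + " " + chain + f"{res_seq:>4}" + " " + "   "
        + f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{temp:6.2f}"
        + " " * 10 + f"{element:>2}"
    )

def parse_atom_line(line):
    """Read the main fields of an ATOM/HETATM record by column position."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),           # coordinates in Ångströms
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "temp_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }

sample = format_atom_line(1, " N", "PRO", "A", 1, 8.316, 21.206, 21.530, 1.00, 17.44, "N")
print(sample)
print(parse_atom_line(sample))
```

Slicing by column rather than splitting on whitespace matters because adjacent fields in real files can touch without a separating space.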
  • 90. Human genome project The Human Genome Project, a large, federally funded collaborative project, completed the sequencing of the entire human genome in 2003. The project was initially funded by DOE and NIH. It originally aimed to map the nucleotides contained in a human haploid reference genome (more than three billion). The "genome" of any given individual is unique; mapping the "human genome" involved sequencing a small number of individuals and then assembling these together to get a complete sequence for each chromosome. Therefore, the finished human genome is a mosaic, not representing any one individual.
  • 91. Project goals were to ● identify all the approximately 20,500 genes in human DNA, ● determine the sequences of the 3 billion chemical base pairs that make up human DNA, ● store this information in databases, ● improve tools for data analysis, ● transfer related technologies to the private sector, and ● address the ethical, legal, and social issues (ELSI) that may arise from the project.
  • 93. BAC-end sequencing The widely agreed-upon strategy for sequencing the human genome is based on the use of bacterial artificial chromosomes (BACs) that carry fragments of human DNA from known locations in the genome. GRAIL (Gene Recognition and Assembly Internet Link) is one of the most widely used computer programs for identifying potential genes in DNA sequence and for general DNA sequence analysis.
  • 94. Race between HGP and Celera The entry of Celera Genomics into the human genome sequencing arena in 1998 galvanised the public effort, leading to a race to sequence the human genome. Celera utilized the skills of computer scientist E. W. Myers to perform a whole-genome shotgun cloning approach with intensive computer processing of data, completing the Drosophila sequence and then the human genome sequence. Craig Venter aimed to sequence and assemble the entire human genome by 2001 and to make the information available only to paying customers.
  • 95.
  • 96. Impacts Of The HGP ● Molecular medicine ● Energy sources and environmental applications ● Risk assessment ● Bioarchaeology, anthropology, evolution, and human migration ● DNA forensics (identification) ● Agriculture, livestock breeding, and bioprocessing
  • 98. Molecular Modeling Database (MMDB) The Molecular Modeling DataBase (MMDB) is a database of experimentally determined three-dimensional biomolecular structures, and is also referred to as the Entrez Structure database. It is a subset of three-dimensional structures obtained from the RCSB Protein Data Bank (PDB), excluding theoretical models.
  • 99. Functional insights Experimentally resolved structures of proteins, RNA, and DNA, derived from the Protein Data Bank (PDB), with value-added features such as explicit chemical graphs, computationally identified 3D domains (compact substructures) that are used to identify similar 3D structures, as well as links to literature, similar sequences, information about chemicals bound to the structures, and more. These connections make it possible, for example, to find 3D structures for homologs of a protein sequence of interest, then interactively view the sequence-structure relationships, active sites, bound chemicals, journal articles, and more.
  • 100. CBLAST A tool that compares a query protein sequence against all protein sequences from experimentally resolved 3D structures, by using protein BLAST against the PDB data set
  • 101. IBIS (Inferred Biomolecular Interaction Server) For a given protein sequence or structure query, IBIS reports protein-protein, protein-small molecule, protein-nucleic acid and protein-ion interactions observed in experimentally determined structural biological assemblies. IBIS also infers/predicts interacting partners and binding sites by homology, by inspecting the protein complexes formed by close homologs of a given query.
  • 102. VAST (Vector Alignment Search Tool) The original VAST finds structures that are similar in 3D to individual protein molecules or individual 3D domains.
  • 104. Database description The SCOP database aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between proteins whose three-dimensional structure is known and deposited in the Protein Data Bank. The main levels of the classification are: ● Family ● Superfamily ● Fold ● Class
  • 105. Family Family groups closely related proteins with clear evidence for their evolutionary origin. In most cases, their relationship is detectable with current sequence comparison methods, e.g. BLAST, PSI-BLAST, HMMER.
  • 106. Superfamily Superfamily brings together more distantly related protein domains. Their similarity is frequently limited to common structural features that, along with a conserved architecture of active or binding sites or similar modes of oligomerization, suggest a probable evolutionary ancestry.
  • 107. Fold Fold groups superfamilies on the basis of the global structural features shared by the majority of their members. These features are the composition of the secondary structures in the domain core, their architecture and topology. Fold is an attribute of a superfamily but the constituent families of some superfamilies that have evolved distinct structural features can belong to a different fold.
  • 108. Class Classes bring together folds and IUPRs with different secondary structural content. These include all-alpha and all-beta proteins, containing predominantly alpha-helices and beta-strands, respectively, and ‘mixed’ alpha and beta classes (a/b) and (a+b).
  • 109.
  • 110. Example ● Human trypsinogen lineage ● Root: scop ● Class: All beta proteins [48724] ● Fold: Trypsin-like serine proteases [50493] ● Superfamily: Trypsin-like serine proteases [50494] ● Family: Eukaryotic proteases [50514] ● Protein: Trypsin(ogen) [50515] ● Species: Human (Homo sapiens) [TaxId: 9606] [50519]
  • 111.
  • 112. CATH database The CATH Protein Structure Classification database is a free, publicly available online resource that provides information on the evolutionary relationships of protein domains. It was created in the mid-1990s. CATH shares many broad features with the SCOP resource; however, there are also many areas in which the detailed classification differs greatly.
  • 113. Classification The domains are classified within the CATH structural hierarchy: ● Class (C): domains are assigned according to their secondary structure content, i.e. all alpha, all beta, a mixture of alpha and beta, or little secondary structure. ● Architecture (A): information on the secondary structure arrangement in three-dimensional space is used for assignment. ● Topology/fold (T): information on how the secondary structure elements are connected and arranged is used. ● Homologous superfamily (H): assignments are made if there is good evidence that the domains are related by evolution, i.e. they are homologous.
  • 114.
  • 115. OMIM - Online Mendelian Inheritance in Man
  • 116. OMIM OMIM is a comprehensive, authoritative compendium of human genes and genetic phenotypes that is freely available and updated daily. The full-text, referenced overviews in OMIM contain information on all known mendelian disorders and over 15,000 genes. OMIM focuses on the relationship between phenotype and genotype. It is updated daily, and the entries contain copious links to other genetics resources.
  • 117. OMIM Each OMIM entry has a full-text summary of a genetically determined phenotype and/or gene and has numerous links to other genetic databases such as DNA and protein sequence, PubMed references, general and locus-specific mutation databases, HUGO nomenclature, MapViewer, GeneTests, patient support groups and many others. OMIM is an easy and straightforward portal to the burgeoning information in human genetics.
  • 118. OMIM OMIM has been available online since 1987, first from Johns Hopkins University and since 1995 from the NCBI. OMIM can be searched from its homepage or from any page in the NCBI Entrez suite of databases. Information in OMIM can be retrieved by queries on MIM number, disorder, gene name and/or symbol, or plain English. Each OMIM entry is assigned a unique six-digit number whose first digit indicates whether its inheritance is autosomal, X-linked, Y-linked or mitochondrial.
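The first-digit rule can be written as a simple lookup. The digit assignments below (1, 2 and 6 autosomal; 3 X-linked; 4 Y-linked; 5 mitochondrial) follow OMIM's published numbering scheme rather than anything stated on the slide, so treat them as an assumption of this sketch; the function name and the sample numbers are made up and are used only for their first digit.

```python
# First digit of a MIM number -> inheritance category (assumed from OMIM's
# numbering scheme; 1, 2 and 6 are all autosomal ranges).
FIRST_DIGIT_CATEGORY = {
    "1": "autosomal",
    "2": "autosomal",
    "3": "X-linked",
    "4": "Y-linked",
    "5": "mitochondrial",
    "6": "autosomal",
}

def inheritance_category(mim_number):
    """Return the inheritance category implied by a six-digit MIM number."""
    text = str(mim_number)
    if len(text) != 6 or not text.isdigit():
        raise ValueError(f"expected a six-digit MIM number, got {mim_number!r}")
    try:
        return FIRST_DIGIT_CATEGORY[text[0]]
    except KeyError:
        raise ValueError(f"no category known for first digit {text[0]!r}") from None

for number in (123456, 345678, 456789, 567890):   # made-up numbers
    print(number, inheritance_category(number))
```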
  • 119.
  • 120.
  • 121. MalaCards: The human disease database MalaCards is an integrated database of human maladies and their annotations, modeled on the architecture and richness of the popular GeneCards database of human genes. The MalaCards disease and disorders database is organized into "disease cards", each integrating prioritized information, and listing numerous known aliases for each disease, along with a variety of annotations, as well as inter-disease connections, empowered by the GeneCards relational database, searches, and GeneAnalytics set-analyses
  • 122. Cont… Annotations include: symptoms, drugs, articles, genes, clinical trials, related diseases/disorders and more. An automatic computational information retrieval engine populates the disease cards, using remote data, as well as information gleaned using the GeneCards platform to compile the disease database. The MalaCards disease database integrates both specialized and general disease lists, including rare diseases, genetic diseases, complex disorders and more.
  • 123. TIGR Database The Institute for Genomic Research
  • 124. TIGR Provides a collection of curated databases containing DNA and protein sequence, gene expression, cellular role, protein family, and taxonomic data for microbes, plants and humans. The CMR (Comprehensive Microbial Resource) contains analysis on completed microbial genome sequencing.
  • 125.
  • 127. KEGG This concept is realized in the following databases of KEGG, which are categorized into systems, genomic, chemical, and health information Systems information ● PATHWAY — pathway maps for cellular and organismal functions ● MODULE — modules or functional units of genes ● BRITE — hierarchical classifications of biological entities
  • 128. Genomic information ● GENOME — complete genomes ● GENES — genes and proteins in the complete genomes ● ORTHOLOGY — ortholog groups of genes in the complete genomes
  • 129. Chemical information ● COMPOUND, GLYCAN — chemical compounds and glycans ● REACTION, RPAIR, RCLASS — chemical reactions ● ENZYME — enzyme nomenclature
  • 130. Health information ● DISEASE — human diseases ● DRUG — approved drugs ● ENVIRON — crude drugs and health-related substances
  • 131. Microbial Genome Database ● MBGD is a workbench system for comparative analysis of completely sequenced microbial genomes. ● The central function of MBGD is to create an orthologous gene classification table using precomputed all-against-all similarity relationships among genes in multiple genomes. ● The number of completed microbial genome sequences has been growing at an accelerating rate, and nearly a hundred genomes at various levels of relatedness are already available.
  • 132. ● Especially interesting are the recently available multiple genomes of some particular taxonomic groups such as proteobacteria gamma subdivision and Bacillus/Clostridium group in gram-positive bacteria. ● Since the first release in 1997, MBGD has been developed under a different concept: it provides a classification system rather than a classification result itself. ● The key components of MBGD include – ● (i) an algorithm that can classify genes into orthologous groups using precomputed all-against-all homology search results ● (ii) a user interface that is designed for users to explore the resulting classification in detail, and
  • 133. ● MBGD uses the MySQL database management system to store most of the data, including similarity relationships as well as cluster tables created on demand. Figure 1: Tree splitting procedure for ortholog grouping in MBGD. In this figure, nine genes (A1, B1 etc.) in five organisms (A–E) are classified into two clusters. In this example, the root node is split because three out of four organisms are duplicated in both of the subtrees. The cutoff ratio of duplicated organisms in each root node is a parameter of the algorithm.
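The split decision in the figure caption can be sketched as a ratio test on the organisms found in the two subtrees. This is a rough illustration, not MBGD's code: the caption does not say what the duplicated organisms are counted against, so dividing by the organism count of the smaller subtree is an assumption, as are the cutoff value of 0.5, the function names, and the gene assignment, which was chosen only to reproduce "three out of four".

```python
def organism_of(gene):
    """Assume the slide's naming: the first letter of a gene name is its organism."""
    return gene[0]

def duplicated_ratio(left_genes, right_genes):
    """Fraction of organisms that appear in both subtrees of a node."""
    left_orgs = {organism_of(g) for g in left_genes}
    right_orgs = {organism_of(g) for g in right_genes}
    duplicated = left_orgs & right_orgs
    # Assumption: normalise by the subtree with fewer organisms.
    return len(duplicated) / min(len(left_orgs), len(right_orgs))

def should_split(left_genes, right_genes, cutoff=0.5):
    """Split the node when enough organisms are duplicated across the subtrees."""
    return duplicated_ratio(left_genes, right_genes) >= cutoff

left = ["A1", "B1", "C1", "D1"]           # hypothetical subtree, organisms A-D
right = ["A2", "B2", "C2", "E1", "E2"]    # hypothetical subtree, organisms A-C and E
print(duplicated_ratio(left, right))      # organisms A, B, C occur on both sides
print(should_split(left, right))
```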
  • 135. FASTA and BLAST The number of DNA and protein sequences in public databases is very large. Searching a database involves aligning the query sequence to each sequence in the database to find significant local alignments. BLAST and FASTA are two similarity searching programs that identify homologous DNA and protein sequences based on excess sequence similarity. They provide facilities for comparing DNA and protein sequences with the existing DNA and protein databases. They are the two major heuristic algorithms for performing database searches.
  • 136. FASTA FASTA stands for “fast-all” or “Fast Alignment”. FASTA is a DNA and protein sequence alignment software package first described by David J. Lipman and William R. Pearson in 1985. It was the first database similarity search tool developed, preceding the development of BLAST. It is used to search for similarities between DNA and protein sequences.
  • 137. Variants of FASTA ● fasta - scan a protein or DNA sequence library for similar sequences. ● fastx - compare a DNA sequence to a protein sequence database, comparing the translated DNA sequence in forward and reverse frames. ● tfastx - compare a protein sequence to a DNA sequence database, calculating similarities with frameshifts to the forward and reverse orientations. ● fasty - compare a DNA sequence to a protein sequence database, comparing the translated DNA sequence in forward and reverse frames. ● tfasty - compare a protein sequence to a DNA sequence database, calculating similarities with frameshifts to the forward and reverse orientations.
  • 138. Variants of FASTA ● fasts - compare unordered peptides to a protein sequence database. ● tfasts - compare unordered peptides to a translated DNA sequence database. ● fastm - compare ordered peptides (or short DNA sequences) to a protein (DNA) sequence database. ● tfastm - compare ordered peptides (or short DNA sequences) to a translated DNA sequence database. ● fastf - compare mixed peptides to a protein sequence database.
  • 141. FASTA sequence format A FASTA record starts with a description line that begins with the ‘>’ symbol, followed by the gi and accession numbers and then the description, all on a single line. The sequence starts on the next line and is wrapped at a fixed width per row (60 residues is a common choice; the example here uses 70). DNA and protein sequences are written in the one-letter IUPAC nucleotide and amino acid codes. Example >gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED) QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
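To make the format concrete, the following minimal Python sketch reads FASTA text into (header, sequence) pairs. The function name `parse_fasta` is chosen here for illustration and is not part of any particular library; the record is the example from the slide.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence rows are wrapped; join them back into one string.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
"""

records = parse_fasta(example)
header, sequence = records[0]
accession = header.split("|")[3]  # fourth "|"-separated field of this header
print(accession, len(sequence))
```

Because the parser joins the wrapped rows, the row width used in the file does not affect the recovered sequence.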
  • 142. BLAST (Basic Local Alignment Search Tool) The BLAST program was developed by Stephen Altschul and colleagues at NCBI in 1990 and has since become one of the most popular programs for sequence analysis. BLAST uses heuristics to align a query sequence with all sequences in a database.
  • 143. BLAST BLAST is more time-efficient than FASTA because it searches only for the more significant patterns in the sequences, yet it retains comparable sensitivity. BLAST is also often used as part of other algorithms that require approximate sequence matching.
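One common way to picture the heuristic is "seed and extend": short words shared by the query and a database sequence are located first, and only those regions are examined further. The Python sketch below covers the seeding step only, using exact word matches. It is a simplified illustration under that assumption, not BLAST's actual implementation; the function names and the toy sequences are made up for this example.

```python
from collections import defaultdict


def index_words(sequence, k):
    """Map every length-k word in the sequence to its start positions."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a length-k word matches exactly."""
    index = index_words(subject, k)
    seeds = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            seeds.append((q, s))
    return seeds


seeds = find_seeds("GATTACA", "TTGATTAC", k=4)
print(seeds)
```

Seeds that share the same difference between subject and query position lie on one diagonal, which is where an extension step would look for a longer local alignment.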
  • 145. Variants of BLAST ● BLAST-N: compares a nucleotide sequence with nucleotide sequences. ● BLAST-P: compares a protein sequence with protein sequences. ● BLAST-X: compares the six-frame translations of a nucleotide sequence against protein sequences. ● tBLAST-N: compares a protein sequence against the six-frame translations of nucleotide sequences. ● tBLAST-X: compares the six-frame translations of a nucleotide sequence against the six-frame translations of nucleotide sequences.
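The "six-frame translation" used by the translated variants can be illustrated with a short Python sketch: three reading frames on the forward strand and three on the reverse complement. The sketch assumes the standard genetic code and an input containing only A, C, G and T; the function names and the toy sequence are illustrative, and real tools handle ambiguity codes and alternative genetic codes as well.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return translations for frames +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein)
```

Here `*` marks a stop codon. A translated search compares each of these six protein strings, rather than the nucleotide sequence itself, against the other side of the comparison.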
  • 147. Megablast Megablast is intended for large numbers of query sequences. When comparing large numbers of input sequences via the command-line BLAST, "megablast" is much faster than running BLAST multiple times.
  • 148. Uses of BLAST ● Identifying species: BLAST can help you identify a species or find homologous species. This is useful, for example, when you are working with a DNA sequence from an unknown species. ● Locating domains: When working with a protein sequence, you can submit it to BLAST to locate known domains within the sequence of interest. ● Establishing phylogeny: Using the results received through BLAST, you can create a phylogenetic tree on the BLAST web page. ● DNA mapping: When working with a known species and looking to sequence a gene at an unknown location, BLAST can compare the chromosomal position of the sequence of interest to relevant sequences in the database(s).
  • 150. E value The BLAST E-value is the number of hits of similar quality (score) that would be expected to be found just by chance. An E-value of 10 means that up to 10 hits can be expected to be found just by chance in a random database of the same size. The E-value can be used as a first quality filter for the BLAST search result, to obtain only results equal to or better than the number given by the -evalue option. BLAST results are sorted by E-value by default (best hit on the first line).
  • 151. ● -evalue 1e-50 (small E-value): low number of hits, but of high quality. BLAST hits with an E-value smaller than 1e-50 are database matches of very high quality. ● -evalue 0.01: BLAST hits with an E-value smaller than 0.01 can still be considered good hits for homology matches. ● -evalue 10 (default, large E-value): many hits, partly of low quality. Hits with an E-value smaller than 10 include hits that cannot be considered significant, but may give an idea of potential relations.
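The effect of different E-value thresholds can be sketched by filtering hits after a search. The Python example below assumes the hits are in the 12-column tabular layout produced by BLAST+ with `-outfmt 6` (E-value in the 11th column); the three rows are invented for illustration and are not real search results, and the function names are hypothetical.

```python
# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_hits(text):
    """Parse tab-separated hit lines into dictionaries."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        hit = dict(zip(COLUMNS, line.split("\t")))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        hits.append(hit)
    return hits


def filter_by_evalue(hits, max_evalue):
    """Keep hits with E-value <= max_evalue, best (smallest) first."""
    kept = [h for h in hits if h["evalue"] <= max_evalue]
    return sorted(kept, key=lambda h: h["evalue"])


# Made-up rows covering the three thresholds discussed above.
rows = [
    ("query1", "subjA", "98.5", "200", "3", "0", "1", "200", "11", "210", "1e-60", "380"),
    ("query1", "subjB", "45.0", "120", "60", "4", "5", "124", "30", "149", "0.005", "48.1"),
    ("query1", "subjC", "30.0", "40", "26", "2", "10", "49", "70", "109", "7.2", "25.0"),
]
text = "\n".join("\t".join(row) for row in rows)
hits = parse_hits(text)

for threshold in (1e-50, 0.01, 10):
    names = [h["sseqid"] for h in filter_by_evalue(hits, threshold)]
    print(threshold, names)
```

With these rows the strictest threshold keeps one hit, the intermediate one keeps two, and the default of 10 keeps all three, mirroring the trade-off between number and quality of hits.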