Introductiontobioinformaticsand
BiologicalDatabase
By – Dr. Rajesh Kumar
Department of Botany
MGGAC, Mahe
Whatis bioinformatics?
• In biology, bioinformatics is defined as, “
t
h
e use of
computer to store, retrieve, analyse or predict the
composition or structure of bio-molecules” . Bioinformatics
is the application of computational techniques and
informationtechnology totheorganisationandmanagement
of biological data. Classical bioinformatics deals primarily
withsequenceanalysis.
Aimsofbioinformatics
Developmentofdatabasecontainingall biological
information.
Developmentofbettertoolsfordatadesigning,annotation
andmining.
Designanddevelopmentofdrugsbyusingsimulation
software.
Designanddevelopmentof softwaretools for protein
structurepredictionfunction,annotationanddockinganalysis.
Creationanddevelopmentofsoftwaretoimprovetoolsfor
analysingsequencesfortheirfunctionandsimilaritywith
othersequences
Applications of bioinformatics
Applications of
bioinformatis
Gene
therapy
Drug
designing
Antibiotic
resistance
Crop
development
Medicine
biotechnology
Drought
resistance
Evolutionary
studies
Forensic
analysis
Veterinary
science
Wheather
analysisy
Waste
ceanup
Biological databases
• Biological data are complex, exception-ridden, vast and
incomplete. Therefore several databases has been created
and interpreted to ensure unambiguous results. A
collection of biological data arranged in computer
readable form that enhances the speed of search and
retrieval and convenient to use is called biological
database. A good database must have updated
information.
Importanceofbiological database
• A range of information like biological sequences,
structures, binding sites, metabolic interactions, molecular
action, functional relationships, protein families, motifs
and homologous can be retrieved by using biological
databases.The mainpurposeof abiological databaseis to
store and manage biological data and information in
computerreadableforms.
Typesofbiologicaldatabase
Nucleotidesequence
database
Primarydatabase
Deriveddatabase
secondarydatabase
Specialised
database
Structure
database
Protein structure
database
Genebank
DDBJ
EMBL
Proteinsequence
database
Swis-port
PIR
GenePept
PDB
EBI-MSD
MMDB
Domainand
motifdatabase
Prosite
Blocks
COG
Gene
expression
database
Metabolic
pathway
database
SCOP
CATH
GEO
GXD
MGED
KEGG
EMP
PathDB
TGI
GSOB
GPCRD
Primary databasevs.secondarydatabase
• A primarydatabasecontainsonlysequenceorstructural
information.
• Thedatabasederivedfromtheanalysisortreatmentof
primarydataaresecondarydatabase.Itisveryimportant
forinterferingproteinfunction.
Examplesofsome primary
biological database
GeneBank
 Oneof thefastest growing repositories of knownnucleotide sequences,GeneBank
(Genetic Sequence Databank), has a flat file structure. It is an ASCII text file,
readable by both humans and computers. Besides sequence data, GeneBank files
contain information such as accession numbers and gene names, phylogenetic
classificationandreferencestopublishedliterature.
 This database has been developed and maintained at the NCBI, Bethesda, MD,
USA, asapartofInternationalSequenceDatabaseCollaboration(INSDC).
 Itis anopenaccesssequencedatabase.
 It coordinateswithindividual laboratories
EMBL and DDBJ.
and other sequence databases like
Continue………………..
It is anannotatedcollection of all nucleotide sequencesthatareavailable to
thepublic.
The nucleotide database was divided into three databases at NCBI:
CoreNucleotide database, Expressed Sequence Tag (EST) and Genome
SurveySequence(GSS).
CoreNucleotide databasehasmostof thenucleotide sequencesused. It also
enclosesall nucleotiderecordsthatarenotintheEST andGSS databases.
Submission of sequences to GeneBank can be done using BankIt, Sequin
andtbl2asntools.
EMBL(EuropeanMolecular
Biology Laboratory)
• A comprehensive database of DNA and RNA sequences, EMBL
nucleotide sequencedatabaseis collected from scientific literature,
patientoffices andis directly submittedby researchers. EMBL has
been prepared in collaboration with GeneBank (USA) and the
DNA DatabaseofJapan(DDBJ).
• Itis establishedin1980.
• Itis maintainedbyEBI (EuropeanBioinformatics Institute)
Swiss-Port
 This is a curatedprotein sequencedatabasethatoffers ahigh level
of integrationwithotherdatabasesandalso hasavery lowlevel of
redundancy. Swiss-Port strives toprovide protein sequenceswith a
high level of annotation (for instance, the description of protein
function, domain structure and post translational modifications,
etc.).
 It is established in 1986 and maintained collaboratively , since
1987, by the department of Medical Biochemistry of the
UniversityofGenevaandtheEMBL dataLibrary.
 TrEMBL is a computer–annotated supplement of Swiss-Port that
contains all translations of EMBL nucleotide sequence entries,
whichis notyetintegratedinSwiss-Port.
 Currently Swiss-Port have 0.5 and TrEMBL have 7.6 milliom
sequences.
ProteinInformationResource(PIR)
• PIR is an integrated public bioinformatics resource to
support genomic and proteomic research and scientific
studies. Nowadays, PIR offers a wide variety of resources
mainly oriented to assisting the propagation and
consistency of protein annotations like PIRSF, ProClass
andProLINK.
ExamplesofSome Secondary
Biological Database
MotifDatabases
• Protein sequence motif is a set of conserved amino acid
residuesthatareimportantfor proteinfunctionandarelocated
within a certain distance from one another
. These motifs
usually provide clues to the functions of otherwise
uncharacterisedproteins.
• The PROSITE database consists of documentation entries
describing protein domains, families and functional sites as
wellasassociatedpatternsandprofilestoidentifythem.
• PRINT is adatabasefor proteinfingerprints. A fingerprint is a
group of conserved motifs used to characterise a protein
family.
DomainDatabase
• A protein domain is an independently folded, structurally
compact unit that forms a steady three- dimensional structure
and shows a certain level of evolutionary conservation.
Typically ,aconserveddomaincontainsoneormoremotifs.
• ProDomisaproteindomaindatabaseautomaticallygenerated
fromtheSwiss-PortandTrEMBL sequencedatabase.
• SMART is a highly reliable and sensitive tool for
domain identification.
• COG is adatabaseandaconvenienttoolformotifanddomain
identification.
3DStructuredatabases
• PDB (Protein Data bank) is themainprimary databasefor 3D
structures of biological macromolecules determined by X-ray,
crystallography and NMR. It also accepts experimental data
usedtodeterminethestructuresandhomologymodels.
• SCOP (Structural Classification of Protein database) classifies
protein 3D structures in a hierarchical scheme of structure
classes. All the protein structures in PDB are classified her,
andtheupdatednewstructuresaredepositedinPDB.
• The CATH database (Class, Architecture, Topology,
Homologous) contains a hierarchical classification of protein
domainstructure.
Protein data bank
• PDB (Protein data bank) is a repository for 3D structural
data obtained by x-ray crystallography or NMR
spectroscopyofproteinsandnucleic acids.
• Research Collaboratory for Structural Bioinformatics
(RCSB) PDB provides avariety of tools andresourcesfor
studying the structures of biological macromolecules and
their relationship with other sequences, its function and
diseasescausedif any.
Geneexpressiondatabases
• GEO or Gene Expression Omnibus is a curated online
resource and a gene expression molecular abundance
repository for gene expression data browsing, query and
retrieval.
• GXD or GeneExpressionDatabase is acommunity resource
forgeneexpressioninformation.
• MGED or Microarray Gene Expression data contains
microarray data, generated by functional genomics and
proteomicsexperiments.
• ArrayExpress from European Bioinformatics Institute is a
repositoryfortranscriptomicsdata.
Metabolicpathwaydatabases
• KEGG PATHWAY Databasecontains graphical pathwaymapsfor all
knownmetabolicpathwaysfromvariousorganisms.
• EcoCyc is an E. coli database , stores information regarding the
genomeandbiochemical machineryofE.coli.
• LIGAND is achemical databasefor enzymereactions attheInstitute
for Chemical Research, Kyoto. It is composite database currently
consisting of the COMPOUND, DRUG, GLYCAN, REACTION,
RPAIR andENZYMEdatabases.
• MetaCyc is a non-redundant, experimentally elucidated metabolic
pathwaydatabase.
• BRENDA is an enzyme database tat contains information on all
aspectsofenzymesandenzymaticreactions.
Genomedatabases
• Genome databases give absolute information on the heritable
properties of an organism. These databases help to identify
genes and predict their functions. A few genome databases
havelinkswithspecificorganismdatabases.
• GOLD (Genomes Online Database at the University of
Illinois, USA) contains a list of all thecomplete and ongoing
genomeprojectsworldwide.
• Genomes at NCBI (National Centre for Biotechnology
Information,USA).
• TIGR database (TDB), at the institute for Genomic Research
atRockeville MD,USA.
Virological databases
• A virological database contains all the sequences and
related information of viruses of animals, plants, bacteria,
fungi and archea; for example, the HIV protease
database. A committee called The International
Committee on Taxonomy of Viruses(ICTV) authorises
and organises the taxonomic classification of viruses.
ICTVdB contains taxonomic information for over
thousandsofvirusspecies.
Worldbiodiversitydatabases
• Taxonomic databases are built to document all known
species and them available and
worldwide.
accessibl
e databases contain taxonomic
hierarchies,
make
These
species names, synonyms, descriptions,
illustrations and references. For example: CCINFO,
STRAIN andALGAE.
Databaseforvariousmodelorganisms
• Escherichia coli- E. coli Genome Centre(Wisconsin university,
USA), TheE.coli index(UniversityofBirmingham, UK)
• Arabidopsisthaliana-TAIR (TheArabidopsisInformationResource)
• Homosapiens-HumanGenomeResourcesatNCBI, USA
• Oryza sativa (rice) -RGP (Rice Genome Research
Programme,Japan)
• Drosaphilamelanogaster-FlyBase (DrosophilaGenomeDatabase)
• Musmusculus(mouce)-MouceGenomeInformatics
• Daniorerio(zebrafish)- ZFIN (Zebrafish InformationNetwork atthe
UniversityofOregon,USA)
• S. cerevisiae (Bakers yeast)- SGD ()Yeast Genome Database
at Stanford,USA
AnnotationofGene?????
In molecular biology, genomes make the basic genetic material and
typically consistofDNA. Whereby,genomeincludethegenes(coding)
andnoncoding regions, of interest tous, arethecoding regions asthey
actively influence basic life processes. The genes contain useful
biological information that is required in building up and maintaining
an organism. Gene annotation can be defined merely as the process of
makingnucleotidesequencemeaningful.
Gene annotation involves the process of taking the raw DNA sequence
produced by genomesequencing projects andadding layers of analysis
and interpretation necessary to extracting biologically significant
information andplacing such derived details into context. Annotation is
the process by which pertinent information about these raw DNA
sequencesisaddedtothedatabases.
Accessionnumber
Accession numbers are unique identifiers which
permanently identify sequences in the database.
Accession numbers are assigned and
communicated to authors within twoworking days
ofthereceiptofsubmission.
Introduction to Biological database ppt(1).pptx

Introduction to Biological database ppt(1).pptx

  • 1.
    Introductiontobioinformaticsand BiologicalDatabase By – Dr.Rajesh Kumar Department of Botany MGGAC, Mahe
  • 2.
    Whatis bioinformatics? • Inbiology, bioinformatics is defined as, “ t h e use of computer to store, retrieve, analyse or predict the composition or structure of bio-molecules” . Bioinformatics is the application of computational techniques and informationtechnology totheorganisationandmanagement of biological data. Classical bioinformatics deals primarily withsequenceanalysis.
  • 3.
    Aimsofbioinformatics Developmentofdatabasecontainingall biological information. Developmentofbettertoolsfordatadesigning,annotation andmining. Designanddevelopmentofdrugsbyusingsimulation software. Designanddevelopmentof softwaretoolsfor protein structurepredictionfunction,annotationanddockinganalysis. Creationanddevelopmentofsoftwaretoimprovetoolsfor analysingsequencesfortheirfunctionandsimilaritywith othersequences
  • 4.
    Applications of bioinformatics Applicationsof bioinformatis Gene therapy Drug designing Antibiotic resistance Crop development Medicine biotechnology Drought resistance Evolutionary studies Forensic analysis Veterinary science Wheather analysisy Waste ceanup
  • 5.
    Biological databases • Biologicaldata are complex, exception-ridden, vast and incomplete. Therefore several databases has been created and interpreted to ensure unambiguous results. A collection of biological data arranged in computer readable form that enhances the speed of search and retrieval and convenient to use is called biological database. A good database must have updated information.
  • 6.
    Importanceofbiological database • Arange of information like biological sequences, structures, binding sites, metabolic interactions, molecular action, functional relationships, protein families, motifs and homologous can be retrieved by using biological databases.The mainpurposeof abiological databaseis to store and manage biological data and information in computerreadableforms.
  • 7.
  • 8.
    Primary databasevs.secondarydatabase • Aprimarydatabasecontainsonlysequenceorstructural information. • Thedatabasederivedfromtheanalysisortreatmentof primarydataaresecondarydatabase.Itisveryimportant forinterferingproteinfunction.
  • 9.
  • 10.
    GeneBank  Oneof thefastestgrowing repositories of knownnucleotide sequences,GeneBank (Genetic Sequence Databank), has a flat file structure. It is an ASCII text file, readable by both humans and computers. Besides sequence data, GeneBank files contain information such as accession numbers and gene names, phylogenetic classificationandreferencestopublishedliterature.  This database has been developed and maintained at the NCBI, Bethesda, MD, USA, asapartofInternationalSequenceDatabaseCollaboration(INSDC).  Itis anopenaccesssequencedatabase.  It coordinateswithindividual laboratories EMBL and DDBJ. and other sequence databases like Continue………………..
  • 11.
    It is anannotatedcollectionof all nucleotide sequencesthatareavailable to thepublic. The nucleotide database was divided into three databases at NCBI: CoreNucleotide database, Expressed Sequence Tag (EST) and Genome SurveySequence(GSS). CoreNucleotide databasehasmostof thenucleotide sequencesused. It also enclosesall nucleotiderecordsthatarenotintheEST andGSS databases. Submission of sequences to GeneBank can be done using BankIt, Sequin andtbl2asntools.
  • 12.
    EMBL(EuropeanMolecular Biology Laboratory) • Acomprehensive database of DNA and RNA sequences, EMBL nucleotide sequencedatabaseis collected from scientific literature, patientoffices andis directly submittedby researchers. EMBL has been prepared in collaboration with GeneBank (USA) and the DNA DatabaseofJapan(DDBJ). • Itis establishedin1980. • Itis maintainedbyEBI (EuropeanBioinformatics Institute)
  • 13.
    Swiss-Port  This isa curatedprotein sequencedatabasethatoffers ahigh level of integrationwithotherdatabasesandalso hasavery lowlevel of redundancy. Swiss-Port strives toprovide protein sequenceswith a high level of annotation (for instance, the description of protein function, domain structure and post translational modifications, etc.).  It is established in 1986 and maintained collaboratively , since 1987, by the department of Medical Biochemistry of the UniversityofGenevaandtheEMBL dataLibrary.  TrEMBL is a computer–annotated supplement of Swiss-Port that contains all translations of EMBL nucleotide sequence entries, whichis notyetintegratedinSwiss-Port.  Currently Swiss-Port have 0.5 and TrEMBL have 7.6 milliom sequences.
  • 14.
    ProteinInformationResource(PIR) • PIR isan integrated public bioinformatics resource to support genomic and proteomic research and scientific studies. Nowadays, PIR offers a wide variety of resources mainly oriented to assisting the propagation and consistency of protein annotations like PIRSF, ProClass andProLINK.
  • 15.
  • 16.
    MotifDatabases • Protein sequencemotif is a set of conserved amino acid residuesthatareimportantfor proteinfunctionandarelocated within a certain distance from one another . These motifs usually provide clues to the functions of otherwise uncharacterisedproteins. • The PROSITE database consists of documentation entries describing protein domains, families and functional sites as wellasassociatedpatternsandprofilestoidentifythem. • PRINT is adatabasefor proteinfingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family.
  • 17.
    DomainDatabase • A proteindomain is an independently folded, structurally compact unit that forms a steady three- dimensional structure and shows a certain level of evolutionary conservation. Typically ,aconserveddomaincontainsoneormoremotifs. • ProDomisaproteindomaindatabaseautomaticallygenerated fromtheSwiss-PortandTrEMBL sequencedatabase. • SMART is a highly reliable and sensitive tool for domain identification. • COG is adatabaseandaconvenienttoolformotifanddomain identification.
  • 18.
    3DStructuredatabases • PDB (ProteinData bank) is themainprimary databasefor 3D structures of biological macromolecules determined by X-ray, crystallography and NMR. It also accepts experimental data usedtodeterminethestructuresandhomologymodels. • SCOP (Structural Classification of Protein database) classifies protein 3D structures in a hierarchical scheme of structure classes. All the protein structures in PDB are classified her, andtheupdatednewstructuresaredepositedinPDB. • The CATH database (Class, Architecture, Topology, Homologous) contains a hierarchical classification of protein domainstructure.
  • 19.
    Protein data bank •PDB (Protein data bank) is a repository for 3D structural data obtained by x-ray crystallography or NMR spectroscopyofproteinsandnucleic acids. • Research Collaboratory for Structural Bioinformatics (RCSB) PDB provides avariety of tools andresourcesfor studying the structures of biological macromolecules and their relationship with other sequences, its function and diseasescausedif any.
  • 20.
    Geneexpressiondatabases • GEO orGene Expression Omnibus is a curated online resource and a gene expression molecular abundance repository for gene expression data browsing, query and retrieval. • GXD or GeneExpressionDatabase is acommunity resource forgeneexpressioninformation. • MGED or Microarray Gene Expression data contains microarray data, generated by functional genomics and proteomicsexperiments. • ArrayExpress from European Bioinformatics Institute is a repositoryfortranscriptomicsdata.
  • 21.
    Metabolicpathwaydatabases • KEGG PATHWAYDatabasecontains graphical pathwaymapsfor all knownmetabolicpathwaysfromvariousorganisms. • EcoCyc is an E. coli database , stores information regarding the genomeandbiochemical machineryofE.coli. • LIGAND is achemical databasefor enzymereactions attheInstitute for Chemical Research, Kyoto. It is composite database currently consisting of the COMPOUND, DRUG, GLYCAN, REACTION, RPAIR andENZYMEdatabases. • MetaCyc is a non-redundant, experimentally elucidated metabolic pathwaydatabase. • BRENDA is an enzyme database tat contains information on all aspectsofenzymesandenzymaticreactions.
  • 22.
    Genomedatabases • Genome databasesgive absolute information on the heritable properties of an organism. These databases help to identify genes and predict their functions. A few genome databases havelinkswithspecificorganismdatabases. • GOLD (Genomes Online Database at the University of Illinois, USA) contains a list of all thecomplete and ongoing genomeprojectsworldwide. • Genomes at NCBI (National Centre for Biotechnology Information,USA). • TIGR database (TDB), at the institute for Genomic Research atRockeville MD,USA.
  • 23.
    Virological databases • Avirological database contains all the sequences and related information of viruses of animals, plants, bacteria, fungi and archea; for example, the HIV protease database. A committee called The International Committee on Taxonomy of Viruses(ICTV) authorises and organises the taxonomic classification of viruses. ICTVdB contains taxonomic information for over thousandsofvirusspecies.
  • 24.
    Worldbiodiversitydatabases • Taxonomic databasesare built to document all known species and them available and worldwide. accessibl e databases contain taxonomic hierarchies, make These species names, synonyms, descriptions, illustrations and references. For example: CCINFO, STRAIN andALGAE.
  • 25.
    Databaseforvariousmodelorganisms • Escherichia coli-E. coli Genome Centre(Wisconsin university, USA), TheE.coli index(UniversityofBirmingham, UK) • Arabidopsisthaliana-TAIR (TheArabidopsisInformationResource) • Homosapiens-HumanGenomeResourcesatNCBI, USA • Oryza sativa (rice) -RGP (Rice Genome Research Programme,Japan) • Drosaphilamelanogaster-FlyBase (DrosophilaGenomeDatabase) • Musmusculus(mouce)-MouceGenomeInformatics • Daniorerio(zebrafish)- ZFIN (Zebrafish InformationNetwork atthe UniversityofOregon,USA) • S. cerevisiae (Bakers yeast)- SGD ()Yeast Genome Database at Stanford,USA
  • 26.
    AnnotationofGene????? In molecular biology,genomes make the basic genetic material and typically consistofDNA. Whereby,genomeincludethegenes(coding) andnoncoding regions, of interest tous, arethecoding regions asthey actively influence basic life processes. The genes contain useful biological information that is required in building up and maintaining an organism. Gene annotation can be defined merely as the process of makingnucleotidesequencemeaningful. Gene annotation involves the process of taking the raw DNA sequence produced by genomesequencing projects andadding layers of analysis and interpretation necessary to extracting biologically significant information andplacing such derived details into context. Annotation is the process by which pertinent information about these raw DNA sequencesisaddedtothedatabases.
  • 27.
    Accessionnumber Accession numbers areunique identifiers which permanently identify sequences in the database. Accession numbers are assigned and communicated to authors within twoworking days ofthereceiptofsubmission.