DATABASES
DATA & INFORMATION
Data is raw, unorganized facts that need to be processed. Eg: Each
student’s exam score is one piece of data.
 When data is processed, organized, structured or presented in
a given context so as to make it useful, it is called
information. Eg: the average performance of a class or of the
entire school is information that can be derived from the given
data.
 Data >> Information >> Knowledge
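The step from data to information can be illustrated with a small sketch. The class names and scores below are invented for illustration; the point is only that raw scores become information once they are summarized in context.

```python
from statistics import mean

# Hypothetical raw data: one exam score per student, grouped by class.
scores = {
    "class_a": [72, 85, 90],
    "class_b": [60, 75],
}

def summarize(scores_by_class):
    """Turn raw scores (data) into class and school averages (information)."""
    per_class = {name: mean(vals) for name, vals in scores_by_class.items()}
    all_scores = [s for vals in scores_by_class.values() for s in vals]
    return per_class, mean(all_scores)

class_avg, school_avg = summarize(scores)
print(class_avg, school_avg)
```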
Database
 Databases are composed of computer hardware and software for
data management.
 The chief objective of the development of a database is to organize
data in a set of structured records to enable easy retrieval of
information.
 Each record, also called an entry, should contain a number of fields
that hold the actual data items, for example, fields for names, phone
numbers, addresses, dates.
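As a rough sketch of the record-and-field idea, a record can be modelled as a small typed structure and retrieved by one of its fields. The `ContactRecord` class, the `find_by_name` helper and the sample entries are hypothetical, chosen only to mirror the fields named above.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ContactRecord:
    """One record (entry); each attribute is a field holding a data item."""
    name: str
    phone: str
    address: str
    added: date

# Hypothetical entries, invented for illustration.
records = [
    ContactRecord("Asha Rao", "555-0101", "12 Hill Road", date(2023, 1, 15)),
    ContactRecord("Ben Ortiz", "555-0102", "4 Lake Street", date(2023, 3, 2)),
]

def find_by_name(records, name):
    """Retrieve every record whose name field matches exactly."""
    return [r for r in records if r.name == name]

matches = find_by_name(records, "Ben Ortiz")
```

Because every record has the same fields, the same lookup could be written for any other field, which is what makes structured records easy to search.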
Database
 Biological databases serve a critical purpose in the collection and
organization of data related to biological systems.
 They provide computational support and a user-friendly interface
to a researcher for a meaningful analysis of biological data.
 A biological database is a computerized archive used to store and
organize data in such a way that information can be retrieved easily
via a variety of search criteria.
NEED FOR
Biological
databases
 The need for storing and communicating large datasets has grown.
 To make biological data available to scientists.
 To make biological data available in computer-readable form.
Features of
biological
databases
 Heterogeneity: presence of diverse data types, structures, formats,
or sources within a database system
 High volume data: datasets that are extremely large and may
contain millions or even billions of records
 Data curation: process aimed at managing, organizing, and
enhancing the quality of biological data to ensure its accuracy,
usability, and long-term value. Biological databases contain vast
amounts of information related to genomics, proteomics,
metabolomics, taxonomy, and other aspects of biology
Features of
biological
databases
 Data integration: process of combining data from different sources and
formats into a unified and cohesive view within a single database or data
repository. Cross-reference and link related data to enhance its utility for
researchers.
 Data sharing: critical for advancing scientific research, promoting
collaboration, and facilitating discoveries in the field of biology.
 Dynamics: Dynamics in biological databases refer to the continuous
changes and updates that occur in the content and structure of these
databases over time.
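Data integration by cross-referencing can be sketched as merging records from several sources on a shared identifier. The two sources, the gene identifiers and the field names below are all made up for illustration; real databases rely on curated cross-reference tables rather than a plain dictionary merge.

```python
# Hypothetical records from two sources, keyed by a shared identifier.
sequence_source = {"GENE1": {"length": 1200}, "GENE2": {"length": 850}}
expression_source = {"GENE1": {"tissue": "leaf"}, "GENE3": {"tissue": "root"}}

def integrate(*sources):
    """Combine per-source fields into one unified record per identifier."""
    merged = {}
    for source in sources:
        for key, fields in source.items():
            # Create the unified record on first sight, then add fields to it.
            merged.setdefault(key, {}).update(fields)
    return merged

unified = integrate(sequence_source, expression_source)
```

An identifier present in only one source still appears in the unified view, just with fewer fields.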
Different
classifications of
databases
 Types of Data
 Nucleotide sequences
 Protein sequences
 Protein sequence patterns or motifs
 Macromolecular 3D structure
 Gene expression data
 Metabolic pathways
Different
classifications of
databases
 Availability
 Publicly available, no restrictions
 Available, but with copyright
 Accessible, but not downloadable
 Academic, but not freely available
 Proprietary, commercial; possibly free for academics
Different
classifications of
databases
 Primary databases: experimental results submitted
directly into the database
 Secondary databases: results of analysis of
primary databases
 Composite databases: collections of sequences from
various primary databases
Primary Databases
 Contains bio-molecular data in its original form.
 Experimental results are submitted directly into the database
by researchers, and the data are essentially archival in nature.
 Eg: GenBank, EMBL and DDBJ for DNA/RNA sequences,
SWISS-PROT and PIR for protein sequences and PDB for
molecular structures.
Nucleotide
sequence
databases
 EMBL, GenBank, and DDBJ are the three primary nucleotide
sequence databases which are part of INSDC (International
Nucleotide Sequence Database Collaboration).
 EMBL (European Molecular Biology Laboratory), hosted at EMBL-EBI
(European Bioinformatics Institute): www.ebi.ac.uk/embl/
 GenBank: www.ncbi.nlm.nih.gov/Genbank/
 DDBJ (DNA DataBank of Japan): www.ddbj.nig.ac.jp
GenBank
 An annotated collection of all publicly available nucleotide
sequences and their protein translations
 Maintained since 1992 by NCBI.
 http://www.ncbi.nlm.nih.gov
Browsing GenBank
GenPept
GENBANK
SUBMISSION
 Researchers and institutions can submit their DNA and RNA sequences to GenBank.
 Preparation of data for submission
 Organize metadata, including information about the source organism,
experimental methods, and relevant publications.
 Format your data according to GenBank's submission guidelines. GenBank accepts
submissions in various formats, including FASTA, GenBank flat file format, and
Sequin format.
 table2asn is a command-line program that creates sequence records for submission
to GenBank. It is used primarily for submission of annotated genomes and large
batches of sequences.
 Genome Workbench is a set of integrated tools for studying and analyzing genetic
data. Its Submission Wizard option allows you to prepare submissions of single
eukaryotic and prokaryotic genomes. You can also use Genome Workbench to edit
and visualize files created by table2asn.
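Of the formats listed above, FASTA is the simplest: a header line starting with `>` followed by the sequence wrapped over fixed-width lines. The sketch below only shows that layout. The `to_fasta` helper, the identifier and the sequence are hypothetical, and an actual GenBank submission needs additional metadata described in the submission guidelines.

```python
def to_fasta(seq_id, description, sequence, width=60):
    """Format one sequence as a FASTA record with wrapped sequence lines."""
    header = f">{seq_id} {description}".rstrip()
    # Split the sequence into chunks of at most `width` characters.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines)

# Hypothetical identifier and sequence, for illustration only.
fasta = to_fasta("seq1", "example", "ACGT" * 20)
print(fasta)
```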
GENBANK
SUBMISSION
 Select a submission tool
 BankIt: An online submission tool on the NCBI website.
 Submission Portal: a unified system for multiple submission types.
 Sequin: A standalone program for sequence submission.
 After submission, these sequences are curated, reviewed, and
integrated into the database.
 Once submission is approved, GenBank will assign a unique accession
number to your sequence.
 Submitted data may be held confidentially until the associated
research paper is published, or may be released immediately, depending
on the choice made during submission.
GENBANK-
BankIt
Genbank-
Submission
Portal
EMBL (European Molecular Biology Laboratory)
 Nucleic acid database from EBI (European Bioinformatics
Institute)
 Produced in collaboration with DDBJ and GenBank
 Search engine – SRS (Sequence Retrieval System)
 https://www.ebi.ac.uk/
EMBL
EMBL -data
resources
EMBL-tools
 1) Data Retrieval
 2) Protein functional analysis
 3) Pairwise sequence alignment
 4) Phylogeny
 5) RNA analysis
 6) Sequence translation
 7) Literature and ontologies
EMBL
Browsing
DDBJ (DNA Databank of Japan)
 Started in 1986 in collaboration with GenBank
 Produced and maintained at NIG (National Institute of Genetics)
 http://www.ddbj.nig.ac.jp/
DDBJ features
DDBJ browsing
The NIG
Supercomputer
 NIG provides state-of-the-art supercomputer system services
equipped with large-scale clustered computers, large-scale
memory-sharing computers, and large-capacity high-speed disk
drives as a computational infrastructure for life and medical
research.
Primary protein databases
UniProtKB
 The UniProt Knowledgebase (UniProtKB) is the central hub for the
collection of functional information on proteins, with accurate,
consistent and rich annotation.
 Produced by the UniProt Consortium: EMBL-EBI, SIB (Swiss Institute of
Bioinformatics) and PIR (Protein Information Resource)
 The UniProt Knowledgebase consists of two sections: a section
containing manually-annotated records with information extracted from
literature and curator-evaluated computational analysis
(UniProtKB/Swiss-Prot), and a section with computationally analyzed
records that await full manual annotation (UniProtKB/TrEMBL).
UniProt
browsing
PDB (Protein
Data Bank)
 Established in 1971 at Brookhaven National
Laboratory (BNL)
 Sole international repository of macromolecular
structure data
 Moved to the Research Collaboratory for Structural
Bioinformatics (RCSB) in 1998
PDB browsing
SECONDARY
DATABASES
 Contains data derived from the results of analysing primary data
 Manually created or automatically generated
 Contains more relevant and useful information structured to
specific requirements
 Eg: PROSITE, PRINTS, PDBbind
PROSITE
 Families of proteins
 Can search using regular expressions
 Families exhibit these patterns so we can efficiently search over
families
C-G-G-x(4,7)-G-x(3)-C-x(5)-C-x(3,5)-[NHG]-x-[FYWM]-x(2)-Q-C
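A PROSITE pattern such as the one above can be translated into an ordinary regular expression. The sketch below is a simplified converter, not PROSITE's own tooling: it handles `x`, repeat counts in parentheses, allowed sets in `[]`, excluded sets in `{}`, and the `<` and `>` anchors outside brackets. The function name `prosite_to_regex` and the test sequence are hypothetical.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # {ABC} means "any residue except A, B or C".
        element = element.replace("{", "[^").replace("}", "]")
        # x(2,4) means "between 2 and 4 repeats"; must follow the step above.
        element = element.replace("(", "{").replace(")", "}")
        # x stands for any residue.
        element = element.replace("x", ".")
        # < and > anchor the pattern to the N- and C-terminus.
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)

pattern = "C-G-G-x(4,7)-G-x(3)-C-x(5)-C-x(3,5)-[NHG]-x-[FYWM]-x(2)-Q-C"
regex = prosite_to_regex(pattern)

# Made-up sequence constructed to satisfy the pattern.
sequence = "CGG" + "AAAA" + "G" + "AAA" + "C" + "AAAAA" + "C" + "AAA" + "NAFAAQC"
match = re.search(regex, sequence)
```

Since one pattern describes a whole family, a single regular-expression search can test a sequence for family membership without aligning it against every member.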
prosite
PRIMARY VS
SECONDARY
DATABASES
 Primary Database
 Contains original data submitted by researchers
 Public or mostly open access
 GenBank (NCBI), EMBL, SWISS-PROT
 Secondary Database
 Derived from entries in primary databases
 Manually created or automatically generated
 PROSITE, Pfam, PRINTS
COMPOSITE
DATABASES
 Collection of various primary database sequences
 Renders sequence searching highly efficient as it searches
multiple resources
 Eg: NRDB (Non Redundant Database), OWL, SWISS-PROT +
TrEMBL
NRDB
 Non Redundant Database built by NCBI
 Composite of GenBank (GenBank CDS translations), PDB sequences, SWISS-PROT,
and PIR.
 It is non-identical and non-redundant.
 Default database of NCBI BLAST service.
 Regularly updated.
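The idea of a non-redundant composite can be sketched as merging entries from several sources and collapsing exactly identical sequences into one record that remembers where each copy came from. This is an illustration under simple assumptions, not NCBI's actual build procedure; the source contents and identifiers are invented.

```python
def merge_non_redundant(*sources):
    """Merge (source_name, entries) pairs, keeping one record per sequence."""
    merged = {}
    for source_name, entries in sources:
        for entry_id, sequence in entries:
            # Identical sequences share one record that lists every origin.
            merged.setdefault(sequence, []).append(f"{source_name}:{entry_id}")
    return merged

# Hypothetical entries, invented for illustration.
genbank = ("GenBank", [("g1", "MKT"), ("g2", "MAA")])
swissprot = ("SWISS-PROT", [("s1", "MKT")])

composite = merge_non_redundant(genbank, swissprot)
```

Searching the composite touches each distinct sequence once, which is why a non-redundant set makes searches across multiple resources more efficient.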
OWL
Database
 Non redundant protein database derived from SWISS-PROT, PIR,
GenBank (protein) and NRL_3D.
 279,796 entries; small because redundancy is removed strictly.
 All identical and SNP-containing entries are removed.
Organism-
specific
databases
 Organism-specific databases are databases dedicated to collecting and organizing
information about a particular species or group of closely related species.
 Genomic Data: Organism-specific databases typically contain detailed genomic
information for the target species, including DNA sequences, gene annotations, and
regulatory elements.
 Phenotypic Data: They may include data related to the phenotype of the organism,
which could encompass physical characteristics, behavior, and developmental
processes.
 Metabolomic and Proteomic Data: Some databases provide information about the
metabolic pathways and proteomes of the organism, helping researchers understand
its biochemical processes.
 Taxonomy and Evolutionary Information: These databases often include data
about the species' taxonomic classification and evolutionary relationships with other
organisms.
 Environmental and Ecological Data: For ecologists and environmental scientists,
some databases contain information about the species' habitat, distribution, and
interactions with other organisms.
The Arabidopsis Information Resource