NCBI
National Center for Biotechnology
Information
Site: www.ncbi.nlm.nih.gov
By Richa Sharma
M.Sc. Biomedical Sciences
Dr. BR Ambedkar Center for Biomedical
Research (ACBR)
INTRODUCTION
NCBI was established in 1988 as part of the
National Library of Medicine at the National Institutes of
Health, Bethesda, Maryland, USA.
NCBI HOME PAGE
DIFFERENCES BETWEEN
DATABASE AND TOOL
DATABASE
 It is a collection of data
that is structured,
searchable, updated
periodically and cross-
referenced.
 Different databases are:
 Genome Database
 Sequence Database
 Protein Database
 Literature Database
 Disease Database
TOOL
 A program that is used to
extract or retrieve the
desired information from
the database.
 Different types of tools are:
 Database Retrieval Tool i.e.
Entrez
 BLAST
 ORF Finder
 e-PCR
 Spidey
DATABASES AND TOOLS OF NCBI
TOOLS OF NCBI
DATABASE RETRIEVAL TOOL-
ENTREZ
Entrez is an integrated database search and retrieval
system that draws on DNA and protein sequence data,
population sets, whole genomes, macromolecular
structures, and the biomedical literature via PubMed.
Entrez provides extensive links within and between
database records.
http://www.ncbi.nlm.nih.gov/gquery/
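Entrez can also be queried from a program. The sketch below only builds a search URL; it assumes the NCBI E-utilities `esearch.fcgi` endpoint and its `db`, `term`, `retmax` and `retmode` parameters, none of which appear on these slides. The helper name `build_esearch_url` and the search terms are illustrative.

```python
from urllib.parse import urlencode

# Assumed base address of the NCBI E-utilities (programmatic access to Entrez).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL asking one Entrez database for matching record IDs."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    # urlencode escapes spaces and other special characters in the search term.
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Hypothetical query: PubMed records that mention remorin.
print(build_esearch_url("pubmed", "remorin"))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the IDs of matching records; changing `db` switches the Entrez database being searched.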
ARCHITECTURE OF THE ENTREZ SYSTEM
BLAST-BASIC LOCAL ALIGNMENT
SEARCH TOOL
The BLAST programs perform sequence-similarity searches
against a variety of sequence databases, returning a set of
gapped alignments with links to full database records, to
UniGene, Gene, the MMDB, or GEO.
The BLAST tools available at NCBI are classified into
different categories.
Two important ones are:
 Standard BLAST
 MegaBLAST
STANDARD BLAST
Standard BLAST includes:
 blastn : Comparing the nucleotide sequence query
against a nucleotide sequence database.
 blastp : Comparing the amino acid query against a
protein sequence database.
 blastx : Comparing the nucleotide query sequence
translated in all reading frames against a protein
database.
 tblastn : Comparing the protein query sequence against a
nucleotide database translated in all reading frames.
 tblastx : Comparing the six-frame translations of the
nucleotide query against the six-frame translations of
the nucleotide sequence database.
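The "all reading frames" used by blastx, tblastn and tblastx are the three frames of the given strand plus the three frames of its reverse complement. A minimal sketch of that six-frame translation, assuming the standard genetic code and a sequence containing only A, C, G and T (function names are illustrative, and this is not BLAST's own implementation):

```python
from itertools import product

# Standard genetic code, listed in TCAG order (first, second, third base).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str) -> str:
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq: str) -> dict:
    """Map frame labels '+1'..'+3' and '-1'..'-3' to protein strings."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


for label, protein in six_frame_translations("ATGGCCTAA").items():
    print(label, protein)
```

A `*` in the output marks a stop codon. blastx compares each of these six translations of the query with a protein database; tblastx does the same on both the query and the database side.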
MegaBLAST
MegaBLAST is a program optimized for aligning long,
highly similar sequences.
It works only with DNA sequences, so it is available only
as a nucleotide ("blastn") search.
It is faster than standard blastn but less sensitive.
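One reason for the speed and sensitivity trade-off is the length of the exact word match needed to seed an alignment. The sketch below assumes the BLAST+ default word sizes (28 for megablast, 11 for blastn), which are not stated on the slides, and uses two made-up 40-base sequences that differ at a single position. It only counts shared exact words and is not the BLAST algorithm itself.

```python
def words(seq: str, k: int) -> set:
    """Return the set of all length-k substrings (words) of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_seed_words(query: str, subject: str, word_size: int) -> int:
    """Count distinct exact words of the given size present in both sequences."""
    return len(words(query, word_size) & words(subject, word_size))


# Illustrative sequences: identical except for one substitution at index 20.
query = "ACGTTGCAAGCTTAGGCTAACCGTATCGGATCCAGTTCGA"
subject = "ACGTTGCAAGCTTAGGCTAATCGTATCGGATCCAGTTCGA"

print("word size 11:", shared_seed_words(query, subject, 11))
print("word size 28:", shared_seed_words(query, subject, 28))
```

With the longer word, a single mismatch in a short region is enough to leave no seed at all, so fewer candidate alignments are examined (faster) but more divergent matches are missed (less sensitive).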
SEQUENCE SUBMISSION TO NCBI
The databases are constantly updated through new
submissions of sequences, made using the
following sequence submission tools:
1. BankIt
2. Sequin
BankIt
BankIt is a web-based GenBank sequence submission tool.
It is the tool of choice for simple submissions, especially
when only one or a small number of records are to be
submitted. Submitters can also use it to update
their existing GenBank records. Sequence analysis tools are
not required for submission through this process.
SEQUIN
Sequin is a stand-alone software tool developed by NCBI
that aids in submitting and updating entries in the
sequence databases. It helps in handling multiple
sequence submissions, provides increased capacity for
complex submissions containing long sequences, multiple
annotations, segmented sets of DNA or phylogenetic and
population studies.
It also provides graphical viewing and editing options.
SPECIALISED TOOLS
Some of the specialized tools for sequence analysis are:
1. ORF Finder
2. e-PCR
3. Spidey
Open Reading Frame (ORF)
Finder
ORF Finder is an essential graphical analysis tool, which
finds all open reading frames of a selectable minimum size
in a user’s sequence or in a sequence already in the
database.
It uses the standard or alternative genetic codes to identify
all open reading frames.
This is helpful in preparing complete and accurate
sequence submissions. It is also packaged with the Sequin
sequence submission software.
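The idea behind ORF Finder can be sketched in a few lines. This is a simplified illustration, not NCBI's implementation: it assumes the standard genetic code only (ATG as the start codon; TAA, TAG and TGA as stop codons), takes the minimum size in nucleotides including the stop codon, and reports minus-strand coordinates relative to the reverse-complemented sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq: str, min_len: int = 30) -> list:
    """Return (strand, frame, start, end, orf_sequence) for each ORF found.

    An ORF here runs from the first ATG in a frame to the next in-frame stop
    codon. start/end are 0-based, end-exclusive positions on the scanned strand.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # open a candidate ORF at the first start codon
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start >= min_len:
                        orfs.append((strand, frame + 1, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("AAATGGCCTGATT", min_len=9))
```

Raising `min_len` plays the role of the selectable minimum size mentioned above: short, probably spurious ORFs are filtered out.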
e-PCR (Electronic Polymerase
Chain Reaction)
e-PCR is a computational procedure used to identify
sequence-tagged sites (STSs) within DNA sequences. When
looking for potential STSs, e-PCR searches a DNA sequence
for sub-sequences that closely match the PCR primers and
have the correct order, orientation, and spacing to
represent the primers used to generate known
STSs. The newer version of e-PCR also provides a search mode
that runs a query sequence against a sequence database.
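The primer-matching idea can be sketched as follows. This is a simplified illustration, not the e-PCR program: the function name, the mismatch limit and the size margin are assumptions, and the template and primers are made-up sequences.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def mismatches(a: str, b: str) -> int:
    """Count differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_sts(template, fwd, rev, expected_size, margin=10, max_mismatch=1):
    """Return (start, product_size) for each place both primers fit.

    The forward primer must match the template strand; the reverse primer must
    match the opposite strand downstream, so its reverse complement is searched
    for on the template. The product size must be within `margin` of the
    expected size.
    """
    rev_rc = reverse_complement(rev)
    hits = []
    for i in range(len(template) - len(fwd) + 1):
        if mismatches(template[i:i + len(fwd)], fwd) > max_mismatch:
            continue
        for size in range(expected_size - margin, expected_size + margin + 1):
            j = i + size - len(rev_rc)  # where the reverse primer site would begin
            if j < i + len(fwd) or j + len(rev_rc) > len(template):
                continue
            if mismatches(template[j:j + len(rev_rc)], rev_rc) <= max_mismatch:
                hits.append((i, size))
    return hits


# Made-up template: forward primer site, 8-base spacer, reverse primer site.
template = "TTTT" + "ACGTAC" + "GGGGGGGG" + "CCATGA" + "AAAA"
print(find_sts(template, "ACGTAC", "TCATGG", expected_size=20, margin=2, max_mismatch=0))
```

Allowing `max_mismatch` above zero corresponds to accepting sub-sequences that closely, rather than exactly, match the primers.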
SPIDEY
Spidey is an mRNA-to-genomic alignment program that
uses local alignment tools such as BLAST to find its
alignments. Spidey takes as input a single genomic
sequence and a set of mRNA sequences in FASTA format. It
first defines windows on the genomic sequence and then
performs the mRNA-to-genomic alignment separately within
each window to avoid including exons from paralogs and
pseudogenes. It has no maximum intron size and does not
favour shorter or longer introns.
Databases
 Structured collection of information.
 Consists of basic units called records or entries.
 The perfect database:
 Comprehensive but easy to search
 Cross referenced
 Minimum redundancy
NCBI Databases
 Nucleotide database
 Literature database
 Protein database
 Gene expression database
 Structural database
 Chemical database
 Other databases
Kinds of databases
Primary database
 Original submissions by
experimentalists.
 Database staff organise
but don’t add additional
information.
 Example - GenBank
Derivative databases
 Derived from primary
data
 Content controlled by
third party.
 Examples – RefSeq,
SWISS-PROT, UniGene
Nucleotide database
 GENBANK
 NCBI’s primary sequence data
 It is a comprehensive public database of nucleotide
sequences.
 GenBank, together with EMBL and DDBJ, makes up the
International Nucleotide Sequence Database (INSD) collaboration.
 The three databases exchange data daily
to ensure a uniform and comprehensive collection of
sequence information.
Accession numbers are labels for
sequences
 DNA sequences and other molecular data are tagged with
accession numbers that are used to identify a sequence or
other record relevant to molecular data.
 An accession number is a string of letters and/or numbers that
corresponds to a molecular sequence.
 It is shared among the three collaborating databases and
remains constant over the lifetime of the record.
 The DNA sequence within a GenBank record is also assigned
a unique NCBI identifier called a ‘gi’ that appears on the
VERSION line of flat file records, following the accession
number.
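A VERSION line of this kind can be taken apart with a few string operations. The sketch below assumes the layout `VERSION  <accession>.<version>  GI:<number>`; the line used is only an illustrative input, the function name is hypothetical, and the GI field is treated as optional because newer flat files may not include it.

```python
def parse_version_line(line: str) -> dict:
    """Split a flat-file VERSION line into accession, version and gi."""
    fields = line.split()
    if len(fields) < 2 or fields[0] != "VERSION":
        raise ValueError("not a VERSION line")
    # The accession stays constant; the number after the dot counts revisions.
    accession, _, version = fields[1].partition(".")
    gi = None
    for field in fields[2:]:
        if field.startswith("GI:"):
            gi = int(field[3:])
    return {
        "accession": accession,
        "version": int(version) if version else None,
        "gi": gi,
    }


# Illustrative input line, assumed to follow the layout described above.
print(parse_version_line("VERSION     U49845.1  GI:1293613"))
```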
Retrieval of nucleotide sequence of
beta-globin gene from Xenopus laevis
NCBI’s Derivative Sequence
Database
 RefSeq
 It is a non-redundant collection of nucleotide and
protein sequences.
 It is derived from the primary submissions available in
GenBank.
 RefSeq records can be distinguished from GenBank records
by the format of the accession series.
 RefSeq accession numbers are formatted as two alphabetic
characters followed by an underscore ‘_’.
 GenBank accession numbers never include an underscore.
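The accession-format rule above translates directly into a pattern check. A minimal sketch, with a hypothetical helper name and illustrative accession strings:

```python
import re

# Two letters followed by an underscore at the start marks a RefSeq accession.
REFSEQ_PREFIX = re.compile(r"^[A-Z]{2}_")


def is_refseq(accession: str) -> bool:
    """Return True if the accession follows the RefSeq two-letter-underscore rule."""
    return bool(REFSEQ_PREFIX.match(accession.strip().upper()))


for acc in ("NM_000518.5", "U49845.1"):
    print(acc, "RefSeq" if is_refseq(acc) else "GenBank-style")
```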
Literature database
 PMC – PubMed Central
 It is a digital archive of peer-reviewed journals in the
life sciences providing access to full-text articles.
 All PMC free articles are identified in PubMed search
results and PMC itself can be searched using Entrez.
Retrieval of the complete entry on the role of
remorin protein from the PubMed
database
Protein database
 Entrez Protein is the protein sequence database of NCBI.
 The protein sequences in this database come from several
different sources, such as Swiss-Prot and PDB.
 There are GenPept translations for each of the coding
sequences within the GenBank nucleotide database.
 The Entrez Protein database is cross-linked to the Entrez
Taxonomy database.
 It is also linked to the CDD.
 Clicking on an individual search result in Entrez
Protein displays the protein sequence in a
format known as GenPept.
Expression database
 GEO: Gene Expression Omnibus
 Covers the distribution and regulation of the transcriptional
products of normal and abnormal cell types.
 SAGEmap: Serial Analysis of Gene Expression map.
Structural database
 MMDB: Molecular Modeling Database.
 3D macromolecular structures.
 X-ray diffraction (XRD) and NMR are used for experimental structure
determination.
 These structures provide a wealth of information about biological
function, the mechanisms linked to that function, the evolutionary history of the
function, and the relationships between macromolecules.
Chemical database
 PubChem is a database of chemical molecules
maintained by NCBI.
 It focuses on the chemical, structural and biological
properties of small molecules.
 Molecular mass below 2000 u.
Other databases
 OMIM: Online Mendelian Inheritance in Man.
 It is a comprehensive, authoritative and timely
knowledge base of human genes and genetic disorders.
 OMIA: Online Mendelian Inheritance in Animals.
 It is a database of genes, inherited disorders and traits
in animal species other than human and mouse.
THANK
YOU!
