National Center for Biotechnology Information
By Kavisa Ghosh, V M.Sc. Biotechnology (Int.)
The National Center for Biotechnology Information
  • Created in 1988 as part of the National Library of Medicine at NIH
  • Establish public databases
  • Research in computational biology
  • Develop software tools for sequence analysis
  • Disseminate biomedical information
  • Bethesda, MD
NCBI HOME PAGE
 
 
Entrez: An Integrated Database Search and Retrieval System
 
Entrez: Database Integration
  • Linked databases: Genomes, Taxonomy, PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structure
  • Link types: Word weight, VAST, BLAST, Phylogeny
The (ever expanding) Entrez System
  • PopSet, Structure, PubMed, Books, 3D Domains, Taxonomy, GEO/GDS, UniGene, Nucleotide, Protein, Genome, OMIM, CDD/CDART, Journals, SNP, UniSTS, PubMed Central
Entrez Databases
  • All molecular database entries are organized by organism (Taxonomy Database).
  • Each record is assigned a UID: a “unique integer identifier” for internal tracking.
  • Each record is indexed by data fields: [author], [title], [organism], and many others.
  • Each record is given a Document Summary (DocSum): a summary of the record’s content.
  • Each record is manually or computationally assigned links to biologically related UIDs within and across databases.
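The field indexing described on this slide can be sketched as a small query builder. This is a minimal illustration, not part of the original deck: the helper names `build_term` and `build_esearch_url` are hypothetical, the search values are made-up examples, and the endpoint is NCBI's public E-utilities ESearch URL, which the slides do not mention. The code only builds the request URL and makes no network call.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (an assumption; the slides only show the web interface).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(fields):
    """Join (value, field) pairs into a fielded Entrez query such as 'x[organism] AND y[title]'."""
    return " AND ".join(f"{value}[{field}]" for value, field in fields)


def build_esearch_url(db, fields, retmax=20):
    """Build an ESearch URL that would return UIDs for records matching the fielded query."""
    params = {"db": db, "term": build_term(fields), "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Example values chosen only for illustration.
example_fields = [("Homo sapiens", "organism"), ("MLH1", "title")]
term = build_term(example_fields)
url = build_esearch_url("nucleotide", example_fields)
print(term)
print(url)
```

The search returns UIDs, which are the handles Entrez uses to fetch DocSums and follow links between databases.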
Literature Databases
  • PubMed
  • Books
  • PubMed Central
  • Journals
  • Online Mendelian Inheritance in Man (OMIM)
Molecular Sequence Databases
  • Sequence Databases
      • Nucleotide (GenBank)
      • Taxonomy
      • PopSet
      • Protein
  • Marker Databases
      • Single Nucleotide Polymorphisms (SNPs, dbSNP)
      • Sequence Tagged Sites (STSs, dbSTS)
      • Expressed Sequence Tags (ESTs, dbEST)
      • UniGene
Molecular Databases
  • Primary Databases
      • Original submissions by experimentalists
      • Database staff organize but don’t add additional information
      • Example: GenBank
  • Derivative Databases
      • Human curated: compilation and correction of data. Example: SWISS-PROT, NCBI RefSeq mRNA
      • Computationally derived. Example: UniGene
      • Combinations. Example: NCBI Genome Assembly
[Diagram: example sequence fragments from Labs are collected in GenBank; Algorithms and Curators derive UniGene, RefSeq, and Genome Assembly from them.]
Derivative Databases
  • GenBank (from labs and sequencing centers): updated ONLY by submitters
      • Divisions shown: EST, STS, HTG, GSS, PRI, ROD, PLN, MAM, BCT, INV, VRT, PHG, VRL
  • UniGene, UniSTS, and RefSeq (Entrez Gene and annotation pipelines): updated by NCBI
What is GenBank?
  • Nucleotide-only sequence database
  • Archival in nature
      • Historical
      • Reflective of submitter point of view (subjective)
      • Redundant
  • GenBank data
      • Direct submissions (traditional records)
      • Batch submissions (EST, GSS, STS)
      • ftp accounts (genome data)
  • Three collaborating databases
      • GenBank
      • DNA Database of Japan (DDBJ)
      • European Molecular Biology Laboratory (EMBL) Database
The Old Way (from Fran Lewitter, Whitehead Institute)
GenBank: NCBI’s Primary Sequence Database
  • Full release every two months
  • Incremental and cumulative updates daily
  • Available only through the internet
      • ftp://ftp.ncbi.nih.gov/genbank/
      • ftp://genbank.sdsc.edu/pub
      • ftp://bio-mirror.net/biomirror/genbank/
  • Release 136, June 2003: 121 gigabytes of data
      • 25,592,865 records (18,197,119 in June 2002)
      • 32,528,249,295 nucleotides (22,616,937,182 in June 2002)
      • 110,000+ species
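The June 2002 and June 2003 figures on this slide imply year-over-year growth on the order of 40 percent in both records and nucleotides. The short calculation below is only a worked example using the numbers quoted on the slide; the function name `percent_growth` is introduced here for illustration.

```python
# Figures quoted on the slide (GenBank Release 136, June 2003, versus June 2002).
records_2002 = 18_197_119
records_2003 = 25_592_865
bases_2002 = 22_616_937_182
bases_2003 = 32_528_249_295


def percent_growth(old, new):
    """Relative increase from old to new, expressed as a percentage."""
    return (new - old) / old * 100.0


record_growth = percent_growth(records_2002, records_2003)
base_growth = percent_growth(bases_2002, bases_2003)
mean_record_length = bases_2003 / records_2003  # average bases per record in June 2003

print(f"Records grew by {record_growth:.1f}%")
print(f"Nucleotides grew by {base_growth:.1f}%")
print(f"Mean record length: {mean_record_length:.0f} bases")
```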
GenBank Continued…
  • As of August 2009:
      • 106,533,156,756 bases in 108,431,692 sequence records in the traditional GenBank divisions
      • 148,165,117,763 bases in 48,443,067 sequence records in the WGS division
GenBank Divisions
  • “Organismal” (Traditional)
      • PRI (28): Primate
      • ROD (15): Rodent
      • PLN (20): Plant and Fungal
      • BCT (18): Bacterial/Archaeal
      • INV (7): Invertebrate
      • VRT (7): Other Vertebrate
      • VRL (4): Viral
      • MAM (2): Mammalian
      • PHG (1): Phage
      • SYN (1): Synthetic
      • ENV (4): Environmental samples
      • UNA (1): Unannotated
      • Characteristics: organized by taxonomy (sort of); direct submissions (Sequin/BankIt); accurate (~1 error per 10,000 bp); well characterized
  • “Functional” (Bulk)
      • EST (570): Expressed Sequence Tag
      • GSS (197): Genome Survey Sequence
      • HTG (88): High Throughput Genomic
      • PAT (27): Patent
      • STS (9): Sequence Tagged Site
      • CON (1): Contigs, virtual
      • Characteristics: organized by sequence type; batch submissions (ftp/email); less accurate; poorly characterized
GenBank Functional (Bulk) Divisions
  • Expressed Sequence Tag (EST): 1st-pass single-read cDNA
  • Genome Survey Sequence (GSS): 1st-pass single-read gDNA
  • High Throughput Genomic (HTG): incomplete sequences of genomic clones
  • Sequence Tagged Site (STS): PCR-based mapping reagents
  • Whole Genome Shotgun (WGS)
GSS, HTG, WGS
  1. Shred a whole BAC insert (or genome).
  2. Isolate clones and sequence them: GSS division or trace archive.
  3. Assembly: draft sequence (HTG division) or whole genome shotgun assemblies (WGS projects).
Whole Genome Shotgun Projects
  • 685 projects
      • Bacteria (320)
      • Environmental sequences (14)
      • Archaea (8)
      • Eukaryotes (140), including:
          • Chicken, Rat, Mouse, Dog (2), Chimpanzee, Human
          • Pufferfish (2)
          • Honeybee, Anopheles, Fruit Flies (3), Silkworm
          • Nematode (2)
          • Yeasts (8), Aspergillus (2)
          • Rice (2)
Whole Genome Shotgun (WGS) Projects
  • Entrez query: wgs master[properties]
  • FTP: ftp://ftp.ncbi.nih.gov/genbank/wgs/
 
What is UniGene?
  • A gene-oriented view of sequence entries
  • MegaBLAST-based automated sequence clustering
  • Now informed by genome hits (new)
  • Nonredundant set of gene-oriented clusters
      • Each cluster a unique gene
      • Information on tissue types and map locations
      • Includes well-characterized genes and novel ESTs
  • Useful for gene discovery and selection of mapping reagents
 
UniGene
RefSeq Benefits
  • Genomes, transcripts, proteins
  • Non-redundant; best representative
  • Updates to reflect current sequence data and biology
  • Distinct, stable accession series
RefSeq: NCBI’s Derivative Sequence Database
  • Curated transcripts and proteins
      • Reviewed; human, mouse, rat, fruit fly, zebrafish, Arabidopsis
  • Model transcripts and proteins
  • Assembled genomic regions (contigs)
      • Human genome
      • Mouse genome
  • Chromosome records
      • Human genome
      • Microbial
      • Organelle
  • FTP: ftp://ftp.ncbi.nih.gov/refseq/release/
  • Entrez query: srcdb_refseq[Properties]
RefSeq Benefits
  • Non-redundancy
  • Explicitly linked nucleotide and protein sequences
  • Updates to reflect current sequence data and biology
  • Data validation
  • Format consistency
  • Distinct accession series
  • Stewardship by NCBI staff and collaborators
RefSeq Accession Numbers
  • mRNAs and Proteins
      • NM_123456: Curated mRNA
      • NP_123456: Curated protein
      • NR_123456: Curated non-coding RNA
      • XM_123456: Predicted mRNA
      • XP_123456: Predicted protein
      • XR_123456: Predicted non-coding RNA
  • Gene Records
      • NG_123456: Reference genomic sequence
  • Chromosome
      • NC_123456: Microbial replicons, organelle genomes, human chromosomes
  • Assemblies
      • NT_123456: Contig
      • NW_123456: WGS supercontig
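The prefix scheme on this slide lends itself to a simple lookup. The sketch below is illustrative only: the function name `classify_refseq` is hypothetical, the table contains just the prefixes listed on the slide, and accepting an optional `.version` suffix is an assumption not shown in the deck.

```python
import re

# Prefix meanings exactly as listed on the slide.
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated protein",
    "NR": "Curated non-coding RNA",
    "XM": "Predicted mRNA",
    "XP": "Predicted protein",
    "XR": "Predicted non-coding RNA",
    "NG": "Reference genomic sequence",
    "NC": "Microbial replicons, organelle genomes, human chromosomes",
    "NT": "Contig",
    "NW": "WGS supercontig",
}

# Two-letter prefix, underscore, digits, and an optional version suffix (assumed).
_ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def classify_refseq(accession):
    """Return the slide's description for a RefSeq-style accession, or None if unrecognized."""
    match = _ACCESSION_RE.match(accession.strip())
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


for acc in ["NM_000249", "XP_123456.2", "AB123456"]:
    print(acc, "->", classify_refseq(acc))
```

Because the underscore is part of the pattern, an accession without one, such as the made-up `AB123456`, is not treated as RefSeq.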
Third Party Annotation (TPA) Database
  • Annotations of existing GenBank sequences
  • Allows for community annotation of genomes
  • Direct submissions
      • BankIt
      • Sequin
  • Entrez query: tpa[Properties]
Other NCBI Databases
  • dbSNP: nucleotide polymorphism
  • GEO: Gene Expression Omnibus; microarray and other expression data
  • Gene: gene records; unifies LocusLink and Microbial Genomes
  • Structure: imported structures (PDB); Cn3D viewer, NCBI curation
  • CDD: Conserved Domain Database
      • Protein families (COGs and KOGs)
      • Single domains (Pfam, SMART, CD)
NCBI Protein Databases
  • GenPept: GenBank, EMBL, DDBJ CDS translations
  • RefSeq: mRNA based (NP_) and genome based (XP_)
  • Swiss-Prot: curated, high-quality protein reviews
  • PIR: Protein Information Resource, Georgetown University
  • PRF: Protein Research Foundation
  • PDB: Protein Data Bank; sequences from structures
NCBI Structures and Domains
The International Nucleotide Sequence Database Collaboration
  • GenBank: NCBI (NIH); retrieval through Entrez
  • DDBJ: CIB (NIG); retrieval through Get Entry
  • EMBL: EBI (EMBL); retrieval through SRS
Sequence formats
  • ASN.1, DNAStrider, EMBL, Fitch, GCG, GenBank/GB, IG/Stanford, MSF, NBRF, Olsen, PAUP/NEXUS, Pearson/FASTA, Phylip, PIR/CODATA, Plain/Raw, Pretty, Zuker
  • Note: FASTA is a popular sequence format.
GenBank format
Fasta format
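Since the slide's screenshot is not reproduced here, a FASTA record is, in short, a header line starting with `>` followed by one or more lines of sequence. The parser below is a minimal sketch of reading that layout; the function name `parse_fasta` and the example records are made up for illustration, and real files may need extra handling that is omitted here.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical two-record example.
example = """>seq1 hypothetical example record
ATTGACTATATAGCCG
ACGTGC
>seq2
TTGACACGTGA
"""

records = parse_fasta(example)
for name, seq in records:
    print(name, len(seq))
```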
 
Data Analysis Tools
 
 
How BLAST works (pictorial)
  • The query sequence is broken into “words” (subsequences of the query sequence).
  • Query words are compared to the database (target sequences) and exact matches are identified.
  • For each word match, the alignment is extended in both directions to find alignments that score greater than some threshold (maximal segment pairs, or MSPs).
  • (Schneider and La Rota 2000)
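The seed-and-extend idea on this slide can be sketched in a few lines. This is a toy illustration under stated assumptions, not the BLAST implementation: the function names are hypothetical, the scoring values (`match=1`, `mismatch=-2`), the drop-off limit `x_drop`, and the `threshold` are arbitrary choices, and only ungapped extension of exact word matches is shown.

```python
def find_seeds(query, target, word_size):
    """Return (query_pos, target_pos) pairs where a query word matches the target exactly."""
    index = {}
    for j in range(len(target) - word_size + 1):
        index.setdefault(target[j:j + word_size], []).append(j)
    seeds = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            seeds.append((i, j))
    return seeds


def _extend(query, target, qi, tj, step, score, match, mismatch, x_drop):
    """Walk in one direction, tracking the best running score and how far it reached."""
    best, run = score, score
    best_len, length = 0, 0
    while 0 <= qi < len(query) and 0 <= tj < len(target):
        run += match if query[qi] == target[tj] else mismatch
        length += 1
        if run > best:
            best, best_len = run, length
        elif best - run > x_drop:
            break  # score fell too far below the best; stop extending
        qi += step
        tj += step
    return best, best_len


def extend_seed(query, target, i, j, word_size, match=1, mismatch=-2, x_drop=5):
    """Extend a word match in both directions; return (score, query_start, query_end)."""
    score = word_size * match
    score, right = _extend(query, target, i + word_size, j + word_size, 1,
                           score, match, mismatch, x_drop)
    score, left = _extend(query, target, i - 1, j - 1, -1,
                          score, match, mismatch, x_drop)
    return score, i - left, i + word_size + right


def best_hit(query, target, word_size=4, threshold=6):
    """Return the best extended segment scoring above the threshold, or None."""
    hits = [extend_seed(query, target, i, j, word_size)
            for i, j in find_seeds(query, target, word_size)]
    hits = [hit for hit in hits if hit[0] > threshold]
    return max(hits, default=None)


hit = best_hit("ACGTGCTTGACA", "GGGACGTGCTTGACATTT")
print(hit)
```

Indexing the target's words first keeps the lookup for each query word cheap, which is why the word-matching step can be run against a large database before any extension is attempted.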
Result Page 1 of BLASTn
Result Page of BLASTx
Result Page of BLASTx
Literature Links PubMed OMIM
NM_000249: PubMed Books
Books Link
BOOKS Database
OMIM: Human Disease Genes
Taxonomy Link
Taxonomy Link
For More Information…
Thank You!


Editor's Notes

  • #10 Based on keyword searching (MeSH terms, author names, gene names, accession or GI numbers, or just recognized patterns in the records). 15 databases are included….
  • #22 What is the origin of these sequences?
  • #39 DNA/RNA overview
  • #40 DNA/RNA overview
  • #41 DNA/RNA overview