This document provides an introduction to biological databases. It discusses primary databases like GenBank which contain original sequence submissions and secondary databases derived from primary data, maintained by third parties like NCBI. Some key databases mentioned include GenBank, PDB, Swiss-Prot. The document also provides an overview of the NCBI and Entrez retrieval system, which allows integrated searches across literature and sequences.
INTRODUCTION.
NCBI.
EMBL.
DDBJ.
CONCLUSION.
REFERENSE.
The National Center for Biotechnology Information (NCBI) is part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health.
The NCBI is located in Bethesda, Maryland and was founded in 1988 through legislation sponsored by Senator Claude Pepper.
The NCBI houses a series of databases relevant to biotechnology and biomedicine. Major databases include GenBank for DNA sequences and PubMed, a bibliographic database for the biomedical literature.
All these databases are available online through the Entrez search engine.
There are many characteristics of biological data. All these characteristics make the management of biological information a particularly challenging problem. Here mainly we will focus on characteristics of biological information and multidisciplinary field called bioinformatics. Bioinformatics, now a days has emerged with graduate degree programs in several universities.
INTRODUCTION OF BIOINFORMATICS
HISTORY
WHAT IS DATABASE
NEED FOR DATABASE
TYPES OF DATABASE
PRIMARY DATABASE
NUCLEIC ACID SEQUENCE DATABASE
GENE BANK
INTRODUCTION
GENE BANK SUBMISSION TOOL
GENE BANK SUBMISSION TYPE
HOW TO RETRIEVE DATA FROM GENEBANK
APPLICATION
CONCLUSION
REFERENCE
INTRODUCTION.
NCBI.
EMBL.
DDBJ.
CONCLUSION.
REFERENSE.
The National Center for Biotechnology Information (NCBI) is part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health.
The NCBI is located in Bethesda, Maryland and was founded in 1988 through legislation sponsored by Senator Claude Pepper.
The NCBI houses a series of databases relevant to biotechnology and biomedicine. Major databases include GenBank for DNA sequences and PubMed, a bibliographic database for the biomedical literature.
All these databases are available online through the Entrez search engine.
There are many characteristics of biological data. All these characteristics make the management of biological information a particularly challenging problem. Here mainly we will focus on characteristics of biological information and multidisciplinary field called bioinformatics. Bioinformatics, now a days has emerged with graduate degree programs in several universities.
INTRODUCTION OF BIOINFORMATICS
HISTORY
WHAT IS DATABASE
NEED FOR DATABASE
TYPES OF DATABASE
PRIMARY DATABASE
NUCLEIC ACID SEQUENCE DATABASE
GENE BANK
INTRODUCTION
GENE BANK SUBMISSION TOOL
GENE BANK SUBMISSION TYPE
HOW TO RETRIEVE DATA FROM GENEBANK
APPLICATION
CONCLUSION
REFERENCE
The DNA Data Bank of Japan (DDBJ) is a biological database that collects DNA sequences. It is located at the National Institute of Genetics (NIG) in the Shizuoka prefecture of Japan. It is also a member of the International Nucleotide Sequence Database Collaboration or INSDC.
The Protein Data Bank (PDB) is a database for the three-dimensional structural data of large biological molecules, such as proteins and nucleic acids. This presentation deals with what, why, how, where and who of PDB. In this presentation we have also included briefing about various file formats available in PDB with emphasis on PDB file format
This presentation gives you a detailed information about the swiss prot database that comes under UniProtKB. It also covers TrEMBL: a computer annotated supplement to Swiss-Prot.
The DNA Data Bank of Japan (DDBJ) is a biological database that collects DNA sequences. It is located at the National Institute of Genetics (NIG) in the Shizuoka prefecture of Japan. It is also a member of the International Nucleotide Sequence Database Collaboration or INSDC.
The Protein Data Bank (PDB) is a database for the three-dimensional structural data of large biological molecules, such as proteins and nucleic acids. This presentation deals with what, why, how, where and who of PDB. In this presentation we have also included briefing about various file formats available in PDB with emphasis on PDB file format
This presentation gives you a detailed information about the swiss prot database that comes under UniProtKB. It also covers TrEMBL: a computer annotated supplement to Swiss-Prot.
BITS: Overview of important biological databases beyond sequencesBITS
Module 4 Other relevant biological data sources beyond sequences
Part of training session "Basic Bioinformatics concepts, databases and tools" - http://www.bits.vib.be/training
INTRODUCTION
DEFINITION OF BIOINFORMATICS
HISTORY
OBJECTIVE OF BIOINFORMATIC
TOOLS OF BIOINFORMATICS
PROCEDURE AND TOOLS OF BIOINFORMATIC
BIOLOGICAL DATABASES
HOMOLOGY AND SIMILARITY TOOLS (SEQUENCE ALIGNMENT)
PROTEIN FUNCTION ANALYSIS TOOLS
STRUCTURAL ANALYSIS TOOLS
SEQUENCE MANIPULATION TOOLS
SEQUENCE ANALYSIS TOOLS
APPLICATION
CONCLUSION
REFERENCES
Palestine last event orientationfvgnh .pptxRaedMohamed3
An EFL lesson about the current events in Palestine. It is intended to be for intermediate students who wish to increase their listening skills through a short lesson in power point.
The Roman Empire A Historical Colossus.pdfkaushalkr1407
The Roman Empire, a vast and enduring power, stands as one of history's most remarkable civilizations, leaving an indelible imprint on the world. It emerged from the Roman Republic, transitioning into an imperial powerhouse under the leadership of Augustus Caesar in 27 BCE. This transformation marked the beginning of an era defined by unprecedented territorial expansion, architectural marvels, and profound cultural influence.
The empire's roots lie in the city of Rome, founded, according to legend, by Romulus in 753 BCE. Over centuries, Rome evolved from a small settlement to a formidable republic, characterized by a complex political system with elected officials and checks on power. However, internal strife, class conflicts, and military ambitions paved the way for the end of the Republic. Julius Caesar’s dictatorship and subsequent assassination in 44 BCE created a power vacuum, leading to a civil war. Octavian, later Augustus, emerged victorious, heralding the Roman Empire’s birth.
Under Augustus, the empire experienced the Pax Romana, a 200-year period of relative peace and stability. Augustus reformed the military, established efficient administrative systems, and initiated grand construction projects. The empire's borders expanded, encompassing territories from Britain to Egypt and from Spain to the Euphrates. Roman legions, renowned for their discipline and engineering prowess, secured and maintained these vast territories, building roads, fortifications, and cities that facilitated control and integration.
The Roman Empire’s society was hierarchical, with a rigid class system. At the top were the patricians, wealthy elites who held significant political power. Below them were the plebeians, free citizens with limited political influence, and the vast numbers of slaves who formed the backbone of the economy. The family unit was central, governed by the paterfamilias, the male head who held absolute authority.
Culturally, the Romans were eclectic, absorbing and adapting elements from the civilizations they encountered, particularly the Greeks. Roman art, literature, and philosophy reflected this synthesis, creating a rich cultural tapestry. Latin, the Roman language, became the lingua franca of the Western world, influencing numerous modern languages.
Roman architecture and engineering achievements were monumental. They perfected the arch, vault, and dome, constructing enduring structures like the Colosseum, Pantheon, and aqueducts. These engineering marvels not only showcased Roman ingenuity but also served practical purposes, from public entertainment to water supply.
The Art Pastor's Guide to Sabbath | Steve ThomasonSteve Thomason
What is the purpose of the Sabbath Law in the Torah. It is interesting to compare how the context of the law shifts from Exodus to Deuteronomy. Who gets to rest, and why?
How to Create Map Views in the Odoo 17 ERPCeline George
The map views are useful for providing a geographical representation of data. They allow users to visualize and analyze the data in a more intuitive manner.
How to Make a Field invisible in Odoo 17Celine George
It is possible to hide or invisible some fields in odoo. Commonly using “invisible” attribute in the field definition to invisible the fields. This slide will show how to make a field invisible in odoo 17.
Operation “Blue Star” is the only event in the history of Independent India where the state went into war with its own people. Even after about 40 years it is not clear if it was culmination of states anger over people of the region, a political game of power or start of dictatorial chapter in the democratic setup.
The people of Punjab felt alienated from main stream due to denial of their just demands during a long democratic struggle since independence. As it happen all over the word, it led to militant struggle with great loss of lives of military, police and civilian personnel. Killing of Indira Gandhi and massacre of innocent Sikhs in Delhi and other India cities was also associated with this movement.
2. Able to recognize various data formats, and their primary
use
Understand and utilize all types of sequence identifiers
Understand various feature present in the GenBank flat
files and other Databases
2
Objectives
3. 1. Database type: Primary vs. Secondary
2. Sequence Databases: GenBank, GCG and ACDEB
3. Structure Databases: PDB, MMDB
3. Data Formats
4. Other “databases” and “datasets”
3Outline
4. Structured collection of information of related data elements
Consists of basic units called records or entries.
Each record consists of fields, which hold pre-defined
data related to the record.
What are databases ??
5. Tables (entities)
•basic elements of information to track, e.g., gene, organism,
sequence, citation
Columns (fields)
•attributes of tables, e.g. for citation table, title, journal, volume,
author
Rows (records)
•actual data
•whereas fields describe what data is stored, the rows of a table are
where the actual data is stored
Databases
7. Characteristics of perfect database
Comprehensive, but easy to search.
Annotated, but not “too annotated”.
A simple, easy to understand structure.
Cross-referenced.
Minimum redundancy.
Easy retrieval of data.
8. Primary Databases
Original submissions by experimentalists
Content controlled by the submitter
Examples: GenBank, EMBL, DDBJ etc
Derivative Databases
Derived from primary data
Content controlled by third party (NCBI)
Examples: NCBI Protein, Refseq, TPA, RefSNP, GEO datasets,
UniGene, Homologene, Structure, Conserved Domain
Divisions of biological databases
10. “Ten Important Bioinformatics Databases”
GenBank www.ncbi.nlm.nih.gov nucleotide sequences
Ensembl www.ensembl.org human/mouse genome
(and others)
PubMed www.ncbi.nlm.nih.gov literature references
NR www.ncbi.nlm.nih.gov protein sequences
SWISS-PROT www.expasy.ch protein sequences
InterPro www.ebi.ac.uk protein domains
OMIM www.ncbi.nlm.nih.gov genetic diseases
Enzymes www.chem.qmul.ac.uk enzymes
PDB www.rcsb.org/pdb/ protein structures
KEGG www.genome.ad.jp metabolic pathways
Source: Bioinformatics for Dummies
11. The National Center for
Biotechnology Information
Created in 1988 as a part of the
National Library of Medicine at NIH
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Collection and distribution of biomedical information
Bethesda,MD
12. NCBI (National Center for Biotechnology
Information)
• over 30 databases including
GenBank, PubMed, OMIM, and
GEO
• Access all NCBI resources via
Entrez
(www.ncbi.nlm.nih.gov/Entrez/)
13. Entrez integrated molecular and literature databases
Free public access to biomedical literature
PubMed free Medline (3 million searches per day)
PubMed Central full text online access
Sequence database
Services provided by NCBI
14. ENTREZ: A DISCOVERY SYSTEM
Gene
Taxonomy
PubMed
abstracts
Nucleotide
sequences
Protein
sequences
3-D
Structure
3 -D
Structure
Word weight
VAST
BLASTBLAST
Phylogeny
Hard Link
Neighbors
Related Sequences
Neighbors
Related Sequences
BLink
Domains
Neighbors
Related Structures
Pre-computed and pre-compiled data.
•A potential “gold mine” of undiscovered
relationships.
•Used less than expected.
Pre-computed and pre-compiled data.
•A potential “gold mine” of undiscovered
relationships.
•Used less than expected.
15.
16.
17.
18.
19.
20.
21.
22. Primary
GenBank: NCBI’s primary sequence database
Trace Archive: reads from capillary sequencers
Sequence Read Archive: next generation data
Derivative
GenPept (GenBank translations)
Outside Protein (UniProt—Swiss-Prot, PDB)
NCBI Reference Sequences (RefSeq)
Sequence Database at NCBI
23. Nucleotide only sequence database
Archival in nature
Historical
Reflective of submitter point of view (subjective)
Redundant
Data
Direct submissions (traditional records)
Batch submissions
FTP accounts (genome data)
Genbank: Primary Sequence Database
24. Three collaborating databases for Nucleotides
1. GenBank
2. DNA Database of Japan (DDBJ)
3. European Molecular Biology Laboratory (EMBL) Database
Genbank: Primary Sequence Database
26. In GenBank, records are grouped for various reasons: understand
this is key to using and fully taking advantage of this database.
26
Guiding principle of GenBank
You need identifiers which are stable through time
Need identifiers which will always refer to specific sequences
Need these identifiers to track history of sequence updates
Also need feature and annotation identifiers
Identifier
27. LOCUS, Accession, NID and protein_id27
LOCUS: Unique string of 10 letters and numbers in
the database. Not maintained amongst databases,
and is therefore a poor sequence identifier.
ACCESSION: A unique identifier to that record, citable
entity; does not change when record is updated. A good
record identifier, ideal for citation in publication.
VERSION: : New system where the accession and version play the
same function as the accession and gi number.
Nucleotide gi: Geninfo identifier (gi), a unique integer
which will change every time the sequence changes.
PID: Protein Identifier: g, e or d prefix to gi number.
Can have one or two on one CDS.
Protein gi: Geninfo identifier (gi), a unique integer which
will change every time the sequence changes.
protein_id: Identifier which has the same
structure and function as the nucleotide Accession.version
numbers, but slightlt different format.
28. LOCUS, Accession, gi and PID28
Accession.version
LOCUS HSU40282 1789 bp mRNA PRI 21-MAY-1998
DEFINITION Homo sapiens integrin-linked kinase (ILK) mRNA, complete cds.
ACCESSION U40282
VERSION U40282.1 GI:3150001
CDS 157..1515
/gene="ILK"
/note="protein serine/threonine kinase"
/codon_start=1
/product="integrin-linked kinase"
/protein_id="AAC16892.1"
/db_xref="PID:g3150002"
/db_xref="GI:3150002"
LOCUS: HSU40282
ACCESSION: U40282
VERSION: U40282.1
GI: 3150001
PID: g3150002
Protein gi: 3150002
protein_id: AAC16892.1 Protein_idprotein gi
ACCESSION
LOCUS
PIDgi
29. Traditional GenBank Record
ACCESSION U07418
VERSION U07418.1 GI:466461
ACCESSION U07418
VERSION U07418.1 GI:466461
Accession
•Stable
•Reportable
•Universal
Accession
•Stable
•Reportable
•Universal
Version
Tracks changes in sequence
Version
Tracks changes in sequence
GI number
NCBI internal use
GI number
NCBI internal use
well annotatedwell annotated
the sequence is the datathe sequence is the data
31. What is GCG (Genetics
Computer Group)
An integrated package of over 130 programs (the GCG
Wisconsin Package).
For extensive analyses of nucleic acid and protein
sequences.
Associated with most major public nucleic acid and
protein databases.
Works on UNIX OS and Windows.
32. Why use GCG
Removes the need for the constant
collection of new software by end users.
Removes the need to learn new interface as
new software is released.
Provides a flow of analyses within a single
interface.
Unix environment allows users to automate
complex, repetitive tasks.
Allows users to use multiple processors to
accelerate their jobs.
Supports almost all public databases that
can be updated daily. Fast local search.
33. Interfaces
Command Line: Running programs from UNIX
system prompt.
SeqLab: Graphic User’s Interface, requiring an X
windows display.
SeqWeb: to a core set of sequence analysis
program.
34. Limitations with GCG
The GUI interface does not give the users the full
access to the power of the command line, nor to
the complete set of programs.
Many programs place a limit of the maximum size
of the sequences that they can handle (350 Kb).
This limitation will be removed in version 11.
42. PDB
HEADER
COMPND
SOURCE
AUTHOR
DATE
JRNL
REMARK
SECRES
ATOM COORDINATES
42
HEADER LEUCINE ZIPPER 15-JUL-93 1DGC 1DGC 2
COMPND GCN4 LEUCINE ZIPPER COMPLEXED WITH SPECIFIC 1DGC 3
COMPND 2 ATF/CREB SITE DNA 1DGC 4
SOURCE GCN4: YEAST (SACCHAROMYCES CEREVISIAE); DNA: SYNTHETIC 1DGC 5
AUTHOR T.J.RICHMOND 1DGC 6
REVDAT 1 22-JUN-94 1DGC 0 1DGC 7
JRNL AUTH P.KONIG,T.J.RICHMOND 1DGC 8
JRNL TITL THE X-RAY STRUCTURE OF THE GCN4-BZIP BOUND TO 1DGC 9
JRNL TITL 2 ATF/CREB SITE DNA SHOWS THE COMPLEX DEPENDS ON DNA 1DGC 10
JRNL TITL 3 FLEXIBILITY 1DGC 11
JRNL REF J.MOL.BIOL. V. 233 139 1993 1DGC 12
JRNL REFN ASTM JMOBAK UK ISSN 0022-2836 0070 1DGC 13
REMARK 1 1DGC 14
REMARK 2 1DGC 15
REMARK 2 RESOLUTION. 3.0 ANGSTROMS. 1DGC 16
REMARK 3 1DGC 17
REMARK 3 REFINEMENT. 1DGC 18
REMARK 3 PROGRAM X-PLOR 1DGC 19
REMARK 3 AUTHORS BRUNGER 1DGC 20
REMARK 3 R VALUE 0.216 1DGC 21
REMARK 3 RMSD BOND DISTANCES 0.020 ANGSTROMS 1DGC 22
REMARK 3 RMSD BOND ANGLES 3.86 DEGREES 1DGC 23
REMARK 3 1DGC 24
REMARK 3 NUMBER OF REFLECTIONS 3296 1DGC 25
REMARK 3 RESOLUTION RANGE 10.0 - 3.0 ANGSTROMS 1DGC 26
REMARK 3 DATA CUTOFF 3.0 SIGMA(F) 1DGC 27
REMARK 3 PERCENT COMPLETION 98.2 1DGC 28
REMARK 3 1DGC 29
REMARK 3 NUMBER OF PROTEIN ATOMS 456 1DGC 30
REMARK 3 NUMBER OF NUCLEIC ACID ATOMS 386 1DGC 31
REMARK 4 1DGC 32
REMARK 4 GCN4: TRANSCRIPTIONAL ACTIVATOR OF GENES ENCODING FOR AMINO 1DGC 33
REMARK 4 ACID BIOSYNTHETIC ENZYMES. 1DGC 34
REMARK 5 1DGC 35
REMARK 5 AMINO ACIDS NUMBERING (RESIDUE NUMBER) CORRESPONDS TO THE 1DGC 36
REMARK 5 281 AMINO ACIDS OF INTACT GCN4. 1DGC 37
REMARK 6 1DGC 38
REMARK 6 BZIP SEQUENCE 220 - 281 USED FOR CRYSTALLIZATION. 1DGC 39
REMARK 7 1DGC 40
REMARK 7 MODEL FROM AMINO ACIDS 227 - 281 SINCE AMINO ACIDS 220 - 1DGC 41
REMARK 7 226 ARE NOT WELL ORDERED. 1DGC 42
REMARK 8 1DGC 43
REMARK 8 RESIDUE NUMBERING OF NUCLEOTIDES: 1DGC 44
REMARK 8 5' T G G A G A T G A C G T C A T C T C C 1DGC 45
REMARK 8 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 1 2 3 4 5 6 7 8 9 1DGC 46
REMARK 9 1DGC 47
REMARK 9 THE ASYMMETRIC UNIT CONTAINS ONE HALF OF PROTEIN/DNA 1DGC 48
REMARK 9 COMPLEX PER ASYMMETRIC UNIT. 1DGC 49
REMARK 10 1DGC 50
REMARK 10 MOLECULAR DYAD AXIS OF PROTEIN DIMER AND PALINDROMIC HALF 1DGC 51
REMARK 10 SITES OF THE DNA COINCIDES WITH CRYSTALLOGRAPHIC TWO-FOLD 1DGC 52
REMARK 10 AXIS. THE FULL PROTEIN/DNA COMPLEX CAN BE OBTAINED BY 1DGC 53
REMARK 10 APPLYING THE FOLLOWING TRANSFORMATION MATRIX AND 1DGC 54
REMARK 10 TRANSLATION VECTOR TO THE COORDINATES X Y Z: 1DGC 55
REMARK 10 1DGC 56
REMARK 10 0 -1 0 X 117.32 X SYMM 1DGC 57
REMARK 10 -1 0 0 Y + 117.32 = Y SYMM 1DGC 58
REMARK 10 0 0 -1 Z 43.33 Z SYMM 1DGC 59
SEQRES 1 A 62 ILE VAL PRO GLU SER SER ASP PRO ALA ALA LEU LYS ARG 1DGC 60
SEQRES 2 A 62 ALA ARG ASN THR GLU ALA ALA ARG ARG SER ARG ALA ARG 1DGC 61
SEQRES 3 A 62 LYS LEU GLN ARG MET LYS GLN LEU GLU ASP LYS VAL GLU 1DGC 62
SEQRES 4 A 62 GLU LEU LEU SER LYS ASN TYR HIS LEU GLU ASN GLU VAL 1DGC 63
SEQRES 5 A 62 ALA ARG LEU LYS LYS LEU VAL GLY GLU ARG 1DGC 64
SEQRES 1 B 19 T G G A G A T G A C G T C 1DGC 65
SEQRES 2 B 19 A T C T C C 1DGC 66
HELIX 1 A ALA A 228 LYS A 276 1 1DGC 67
CRYST1 58.660 58.660 86.660 90.00 90.00 90.00 P 41 21 2 8 1DGC 68
ORIGX1 1.000000 0.000000 0.000000 0.00000 1DGC 69
ORIGX2 0.000000 1.000000 0.000000 0.00000 1DGC 70
ORIGX3 0.000000 0.000000 1.000000 0.00000 1DGC 71
SCALE1 0.017047 0.000000 0.000000 0.00000 1DGC 72
SCALE2 0.000000 0.017047 0.000000 0.00000 1DGC 73
SCALE3 0.000000 0.000000 0.011539 0.00000 1DGC 74
ATOM 1 N PRO A 227 35.313 108.011 15.140 1.00 38.94 1DGC 75
ATOM 2 CA PRO A 227 34.172 107.658 15.972 1.00 39.82 1DGC 76
ATOM 842 C5 C B 9 57.692 100.286 22.744 1.00 29.82 1DGC 916
ATOM 843 C6 C B 9 58.128 100.193 21.465 1.00 30.63 1DGC 917
TER 844 C B 9 1DGC 918
MASTER 46 0 0 1 0 0 0 6 842 2 0 7 1DGC 919
END 1DGC 920
Editor's Notes
Francis Ouellette August 3rd, 1999 Lecture 2.0
Primary databases serve as a repository of experimentalist sequences (GenBank). Derivative databases are sources of edited/curated sequences (RefSeq…reference sequences, UniGene...genes compared to genetic loci on genomes)