
Biological Databases
Course: B.Sc Biochemistry
Subject: Basic of Bioinformatics
Unit: II

What can be discovered about a gene by a database search?
• A little or a lot, depending on the gene
• Evolutionary information: homologous genes, taxonomic distributions, allele frequencies, synteny, etc.
• Genomic information: chromosomal location, introns, UTRs, regulatory regions, shared domains, etc.
• Structural information: associated protein structures, fold types, structural domains
• Expression information: expression specific to particular tissues, developmental stages, phenotypes, diseases, etc.
• Functional information: enzymatic/molecular function, pathway/cellular role, localization, role in diseases

Using a database
• How to get information out of a database:
- Browsing: no targeted information to retrieve
- Search: looking for particular information
• Searching a database:
- Must have a key that identifies the element(s) of the database that are of interest:
  - Name of gene
  - Sequence of gene
  - Other information
- Helps to have particular informational goals

Searching for information about genes and their products
• Gene and gene product databases are often organized by sequence
- Genomic sequence encodes all traits of an organism.
- Gene products are uniquely described by their sequences.
- Similar sequences among biomolecules suggest both similar function and an evolutionary relationship.
• Macromolecular sequences provide biologically meaningful keys for searching databases

Searching sequence databases
• Start from sequence, find information about it
• Many kinds of input sequences
- Could be amino acid or nucleotide sequence
- Genomic or mRNA/cDNA or protein sequence
- Complete or fragmentary sequences
• Exact matches are rare (even uninteresting in many cases), so often the goal is to retrieve a set of similar sequences.
- Both small (mutations) and large (required for function) differences within “similar” can be interesting.
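
To make "retrieve a set of similar sequences" concrete, here is a minimal Python sketch that ranks database entries by percent identity to a query. It assumes equal-length, ungapped sequences, and the function names are hypothetical. Real search tools use alignment and statistical scoring, which this sketch does not attempt.

```python
from typing import Dict, List, Tuple


def percent_identity(a: str, b: str) -> float:
    """Percentage of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length (gaps are not handled)")
    if not a:
        return 0.0
    matches = sum(1 for x, y in zip(a.upper(), b.upper()) if x == y)
    return 100.0 * matches / len(a)


def rank_by_similarity(query: str, database: Dict[str, str]) -> List[Tuple[str, float]]:
    """Return (name, percent identity) pairs, best match first."""
    scores = [(name, percent_identity(query, seq)) for name, seq in database.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

An exact match scores 100, while a sequence carrying a single point mutation scores slightly lower. That is the kind of "small difference" the slide calls interesting.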

What might we want to know about a sequence?
• Is this sequence similar to any known genes? How close is the best match? Significance?
• What do we know about that gene?
- Genomic (chromosomal location, allelic information, regulatory regions, etc.)
- Structural (known structure? structural domains? etc.)
- Functional (molecular, cellular & disease)
• Evolutionary information:
- Is this gene found in other organisms?
- What is its taxonomic tree?

A historical perspective
• The 1960s: the birth of bioinformatics
- High-level computer languages
- Protein sequence data
- Academic access to computers
• Margaret Oakley Dayhoff
- First protein database
- First program for sequence assembly
[Figure: IBM 7090 computer]
Benfey and Protopapas, "Genomics" © 2005 Prentice Hall Inc. / A Pearson Education Company / Upper Saddle River, New Jersey 07458

By way of comparison…
IBM 7090 computer
• 32 Kbytes RAM
• 2.18 µs memory cycle time (roughly 0.46 MHz)
• $2,900,000 in 1960
20” Apple iMac
• 1 GB RAM
• 2.4 GHz
• $1199 in 2008

Solving problems in computer science
• Necessary parameters for assessing the difficulty of a computer science problem
- Algorithmic complexity
  - Is the problem theoretically solvable?
  - If so, what is the most efficient solution?
- Current state of computer technology
  - Memory
  - CPU speed
  - Cost

Algorithms
• An algorithm is a sequence of instructions that one must perform in order to solve a well-formulated problem
• First you must identify exactly what the problem is!
• A problem describes a class of computational tasks. A problem instance is one particular input from that class
• In general, you should design your algorithms to work for any instance of a problem (although there are cases in which this is not possible)

Computer technology: memory, CPU speed, cost
• Dramatic improvements on yearly basis
• We do a lot of our work using desktop Macs out of the box
- 2 quad core 2.8 GHz processors, 500 GB disk space, 4 GB RAM for ~$3000
- 2 quad core 3.0 GHz processors, 2.5 TB disk space, 8 GB RAM for ~$6000
• CPU speed vs. memory: which is more important?
- for protein structure, might need many calculations but limited memory
- for genome searches, might have few calculations but huge amounts to store
in memory
• Reading from memory is several orders of magnitude faster than reading from disk

Databases
• What is a database?
- A collection of related data elements
  - tables
  - columns (fields)
  - rows (records)
- Records retrieved using a query language
- Database technology is well established

• Databases are a fundamental part of the bioinformatics revolution. Much of the conceptual framework for databases had already been developed by the 1960s.
• By the 1970s, database technology had already permeated much of the government and corporate sectors.
• Modern databases can be described as well-organized collections of data that can be accessed through the use of a query language.
• Two databases of particular importance to biologists are GenBank®, which encompasses all publicly available protein and nucleotide sequences, and the Protein Data Bank, which contains high quality 3-D structures of proteins, nucleic acids, and carbohydrates.
• The entire sequence of a single human could fit on one or two CD-ROMs. As we shall see shortly, it is the comparison of sequences that presents algorithmic challenges.

Tables (entities)
• basic elements of information to track, e.g., gene, organism, sequence, citation
Columns (fields)
• attributes of tables, e.g., for a citation table: title, journal, volume, author
Rows (records)
• actual data
• whereas fields describe what data is stored, the rows of a table are where the actual data is stored
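
As an illustration of tables, fields, records and a query language, the following sketch uses Python's built-in sqlite3 module. The table name gene, its three columns and the helper names are assumptions made for this example. The two rows reuse accessions that appear later in these slides.

```python
import sqlite3
from typing import Optional


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with one table (entity) called 'gene'."""
    conn = sqlite3.connect(":memory:")
    # Columns (fields) describe what is stored for each record.
    conn.execute(
        "CREATE TABLE gene (symbol TEXT PRIMARY KEY, organism TEXT, accession TEXT)"
    )
    # Rows (records) hold the actual data.
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?, ?)",
        [
            ("ILK", "Homo sapiens", "U40282"),
            ("ITS1", "Aspergillus awamori", "U03518"),
        ],
    )
    return conn


def find_accession(conn: sqlite3.Connection, symbol: str) -> Optional[str]:
    """Query: give a value for one field, get back data from the matching record."""
    row = conn.execute(
        "SELECT accession FROM gene WHERE symbol = ?", (symbol,)
    ).fetchone()
    return row[0] if row else None
```

The SELECT statement is the query. The gene symbol plays the role of the key that identifies the record of interest.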

What is a database?
• A database is a computerized set of records used to store and organize data in such a way that information can be retrieved easily via a variety of search criteria. Databases are composed of computer hardware and software for data management.

• Each record, also called an entry, should contain a number of fields that hold the actual data items, for example, fields for names, phone numbers, addresses, dates.
• To retrieve a particular record from the database, a user can specify a particular piece of information, called a value, to be found in a particular field and expect the computer to retrieve the whole data record.
• This process is called making a query.

• A biological database is a collection of both experimental and theoretical data that is organized so that its contents can be easily
- accessed
- managed
- updated
- retrieved
• The activity of preparing a database can be divided into:
- Collection of data in a form which can be easily accessed
- Making it available to a multi-user system

Types of database

Flat file database
• A flat file database describes any of various means to encode a database model (most commonly a table) as a single file. A flat file can be a plain text file or a binary file. There are usually no structural relationships between the records.

• "Flat file database" may be defined very narrowly, or more broadly.
• Strictly, a flat file database should consist of nothing but data and, if records vary in length, delimiters.
• More broadly, the term refers to any database which exists in a single file in the form of rows and columns, with no relationships or links between records and fields except the table structure.
• Terms used to describe different aspects of a database and its tools differ from one implementation to the next, but the concepts remain the same.
• FileMaker uses the term "Find", while MySQL uses the term "Query"; but the concept is the same. FileMaker "files", in version 7 and above, are equivalent to MySQL "databases", and so forth.
• However, the basic terms "record" and "field" are used in nearly every flat file database implementation.
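
A flat file in the strict sense (data plus delimiters) can be read with a few lines of Python. The sketch below assumes a tab-delimited file whose first line names the fields. The field names and helper functions are hypothetical, and the two records reuse accessions from these slides.

```python
import csv
import io
from typing import Dict, List

# A tiny tab-delimited flat file: one record per line, one field per column.
FLAT_FILE = """symbol\torganism\taccession
ILK\tHomo sapiens\tU40282
ITS1\tAspergillus awamori\tU03518
"""


def load_records(text: str) -> List[Dict[str, str]]:
    """Parse the flat file into a list of records keyed by field name."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


def find(records: List[Dict[str, str]], field: str, value: str) -> List[Dict[str, str]]:
    """Scan every record and keep those whose field equals the given value."""
    return [record for record in records if record[field] == value]
```

Because nothing links the records beyond the table layout, a search here is a linear scan of the whole file.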

Relational database
• Relational databases are both created and queried by database management systems (DBMSs).
• Relational databases displaced hierarchical databases because the ability to add new relations made it possible to add new information that was valuable but "broke" a database's original hierarchical conception.
• The trend continues as a networked planet and social media create the world of "big data", which is larger and less structured than the datasets and tasks that relational databases handle well (it is instructive to compare Hadoop).
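
The point about adding new relations can be sketched with sqlite3. The table names sequence and citation are assumptions for illustration. The accession and PubMed ID are taken from the EMBL example record later in these slides.

```python
import sqlite3
from typing import List, Tuple


def demo_add_relation() -> List[Tuple[str, int]]:
    """Add a second table to an existing schema, then join the two tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sequence (accession TEXT PRIMARY KEY, organism TEXT)")
    conn.execute("INSERT INTO sequence VALUES ('U03518', 'Aspergillus awamori')")

    # Later, new information arrives. It goes into a new relation,
    # and the existing 'sequence' table is left untouched.
    conn.execute(
        "CREATE TABLE citation ("
        "accession TEXT REFERENCES sequence(accession), pubmed_id INTEGER)"
    )
    conn.execute("INSERT INTO citation VALUES ('U03518', 8030378)")

    # A join links records from both tables through the shared accession.
    rows = conn.execute(
        "SELECT s.organism, c.pubmed_id "
        "FROM sequence AS s JOIN citation AS c ON c.accession = s.accession"
    ).fetchall()
    conn.close()
    return rows
```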


Object oriented database
• An object database (also object-oriented database management system) is a database management system in which information is represented in the form of objects as used in object-oriented programming.
• Object databases are different from relational databases, which are table-oriented.

Biological database

Online Databases
When you query an online database, your query is translated into SQL, the database is interrogated, and the answer is displayed in your web browser.
[Figure: the three components involved]
• Your computer and browser (the “client”)
• Software to receive and translate the instructions you enter into your browser (on the “server”)
• The database itself
Image source: David Lane and Hugh E. Williams. Web Database Applications with PHP & MySQL. O’Reilly (2002).

Biological Databases
• Over 1000 biological databases
• Vary in size, quality, coverage, level of interest
• Many of the major ones covered in the annual Database Issue of Nucleic Acids Research
• What makes a good database?
- comprehensiveness
- accuracy
- is up-to-date
- good interface
- batch search/download
- API (web services, DAS, etc.)

“Ten Important Bioinformatics Databases”
Database     URL                    Content
GenBank      www.ncbi.nlm.nih.gov   nucleotide sequences
Ensembl      www.ensembl.org        human/mouse genome (and others)
PubMed       www.ncbi.nlm.nih.gov   literature references
NR           www.ncbi.nlm.nih.gov   protein sequences
SWISS-PROT   www.expasy.ch          protein sequences
InterPro     www.ebi.ac.uk          protein domains
OMIM         www.ncbi.nlm.nih.gov   genetic diseases
Enzymes      www.chem.qmul.ac.uk    enzymes
PDB          www.rcsb.org/pdb/      protein structures
KEGG         www.genome.ad.jp       metabolic pathways
Source: Bioinformatics for Dummies

NCBI (National Center for Biotechnology
Information)
• over 30 databases including GenBank,
PubMed, OMIM, and GEO
• Access all NCBI resources via Entrez
(www.ncbi.nlm.nih.gov/Entrez/)

INFORMATION RETRIEVAL FROM BIOLOGICAL DATABASES
• NCBI-Entrez
• SRS (Sequence Retrieval System)

NCBI and Entrez

The Central Dogma & Biological Data
[Figure: kinds of biological data along the central dogma]
• Original DNA sequences (genomes)
• Expressed DNA sequences (= mRNA sequences = cDNA sequences), including Expressed Sequence Tags (ESTs)
• Protein sequences: inferred, or from direct sequencing
• Protein structures: from experiments, or from models (homologues)
• Literature information

NCBI Databases and Services
• GenBank: primary sequence database
• Free public access to biomedical literature
- PubMed: free Medline (3 million searches per day)
- PubMed Central: full text online access
• Entrez: integrated molecular and literature databases

PRIMARY VS. DERIVATIVE SEQUENCE DATABASES
[Figure: sequencing centers and labs submit sequences to GenBank, the primary database, which is updated ONLY by submitters. Derivative databases are built from it: UniGene (by algorithms) and RefSeq, including genome assemblies (by curators). These are updated continually by NCBI.]

Sequence Databases at NCBI
• Primary
- GenBank: NCBI’s primary sequence database
- Trace Archive: reads from capillary sequencers
- Sequence Read Archive: next generation data
• Derivative
- GenPept (GenBank translations)
- Outside Protein (UniProt/Swiss-Prot, PDB)
- NCBI Reference Sequences (RefSeq)

GENBANK - PRIMARY SEQUENCE DB
• Nucleotide-only sequence database
• Archival (records) in nature
- Historical
- Reflective of submitter point of view (subjective)
- Redundant
• Data
- Direct submissions (traditional records)
- Batch submissions
- FTP accounts (genome data)

GENBANK - PRIMARY SEQUENCE DB (2)
• Three collaborating databases
1. GenBank
2. DNA Database of Japan (DDBJ)
3. European Molecular Biology Laboratory (EMBL) Database

Traditional GenBank Record
ACCESSION U07418
VERSION U07418.1 GI:466461
Accession
• Stable
• Reportable
• Universal
Version
• Tracks changes in sequence
GI number
• NCBI internal use
[Figure annotations: well annotated; the sequence is the data]

NCBI and Entrez
• One of the most useful and comprehensive sources of databases is the NCBI, part of the National Library of Medicine.
• NCBI provides interesting summaries, browsers for genome data, and search tools
• Entrez is their database search interface
http://www.ncbi.nlm.nih.gov/Entrez
• Can search on gene names, sequences, chromosomal location, diseases, keywords, ...
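
The slides describe Entrez as a web interface. NCBI also exposes Entrez searches programmatically through its E-utilities service, which goes beyond what the slides cover and is included here as an assumption about how a student might script a keyword search. The sketch only builds the request URL and does not contact the server. The function name and the default retmax value are hypothetical choices.

```python
from urllib.parse import urlencode

# Base address of the E-utilities "esearch" endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build a search URL for one Entrez database and one search term."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return EUTILS_ESEARCH + "?" + urlencode(params)
```

For example, build_esearch_url("gene", "alcohol dehydrogenase") yields a URL that can be opened in a browser or fetched with any HTTP client to list matching gene records.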

What did we just do?
• Identify loci (genes) associated with the sequence. Input was Alcohol Dehydrogenase
• For each particular “hit”, we can look at that sequence and its alignment in more detail.
• See similar sequences, and the organisms in which they are found.
• But there’s much more that can be found on these genes, even just inside NCBI…

More from Entrez Gene

Sequence Retrieval System
• The Sequence Retrieval System is a database system that works with flat files. In addition, many bioinformatics tools are incorporated and can be combined with the database searches.

NCBI is not all there is...
• Links to non-NCBI databases
- Reactome & KEGG for pathways
- HGNC for nomenclature
- UCSC Human Genome Browser
• Other important gene/protein resources not linked to:
- UniProt (most carefully annotated)
- PDB (main macromolecular structure repository)
• Other key biological data sources
- Gene Ontology/Open Biological Ontologies
- Enzyme
• Scientific society: iscb.org
• Journals, Conferences…

Take home messages
• There are a lot of molecular biology databases, containing a lot of valuable information
• Not even the best databases have everything (or the best of everything)
• These databases are moderately well cross-linked, and there are “linker” databases
• Sequence is a good identifier, maybe even better than gene name!

FILE FORMATS
IG/Stanford    Fitch           Plain/Raw
GenBank/GB     Fasta/Pearson   PIR/CODATA
NBRF           Zuker           MSF
EMBL           Olsen           ASN.1
GCG            Phylip 3.2      PAUP/NEXUS
DNAStrider     Phylip          Pretty

LOCUS, Accession, NID and protein_id
LOCUS: Unique string of 10 letters and numbers in the database. Not maintained amongst databases, and is therefore a poor sequence identifier.
ACCESSION: A unique identifier for the record and a citable entity; it does not change when the record is updated. A good record identifier, ideal for citation in publications.
VERSION: New system where the accession and version play the same function as the accession and gi number.
Nucleotide gi: GenInfo identifier (gi), a unique integer which will change every time the sequence changes.
PID: Protein identifier: g, e or d prefix to gi number. Can have one or two on one CDS.
Protein gi: GenInfo identifier (gi), a unique integer which will change every time the sequence changes.
protein_id: Identifier which has the same structure and function as the nucleotide Accession.version numbers, but a slightly different format.

LOCUS, Accession, gi and PID
LOCUS       HSU40282     1789 bp    mRNA            PRI       21-MAY-1998
DEFINITION  Homo sapiens integrin-linked kinase (ILK) mRNA, complete cds.
ACCESSION   U40282
VERSION     U40282.1  GI:3150001
     CDS             157..1515
                     /gene="ILK"
                     /note="protein serine/threonine kinase"
                     /codon_start=1
                     /product="integrin-linked kinase"
                     /protein_id="AAC16892.1"
                     /db_xref="PID:g3150002"
                     /db_xref="GI:3150002"
Identifiers in this record:
LOCUS: HSU40282
ACCESSION: U40282
VERSION (Accession.version): U40282.1
GI: 3150001
PID: g3150002
Protein gi: 3150002
protein_id: AAC16892.1
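
The Accession.version convention is easy to handle in code. The following sketch splits an identifier such as U40282.1 into its stable accession and its version number. The class and function names are hypothetical, and the parser only checks the general "text, dot, integer" shape rather than any official accession grammar.

```python
from typing import NamedTuple


class VersionedAccession(NamedTuple):
    accession: str  # stable part, does not change when the record is updated
    version: int    # incremented when the sequence itself changes


def parse_accession_version(identifier: str) -> VersionedAccession:
    """Split 'ACCESSION.VERSION' at the last dot."""
    accession, dot, version = identifier.strip().rpartition(".")
    if not dot or not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return VersionedAccession(accession, int(version))


def same_record(a: str, b: str) -> bool:
    """True if two identifiers point to the same record, whatever the version."""
    return parse_accession_version(a).accession == parse_accession_version(b).accession
```

Under this sketch, U40282.1 and a later U40282.2 would count as the same record with different sequence versions.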

PLAIN SEQUENCE FORMAT
A sequence in plain format may contain only IUPAC characters and
spaces (no numbers!).
Note: A file in plain sequence format may only contain one sequence,
while most other formats accept several sequences in one file.
An example sequence in plain format is:
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAA
CCTCCCATCCGTGTCTATTGTACCCTGTTGCTTCGGCGGGCCCGC
CGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTG
CCCGCCGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTC
TGAGTTGATTGAATGCAATCAGTTAAAACTTTCAACAATGGATCT
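
The rule that a plain-format sequence may contain only IUPAC characters and spaces can be checked mechanically. This sketch assumes a nucleotide sequence and treats line breaks like spaces. The function name is hypothetical and protein sequences are not handled.

```python
# IUPAC nucleotide codes: the four bases, U, and the ambiguity symbols.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def read_plain_sequence(text: str) -> str:
    """Join a plain-format sequence into one string, rejecting other characters."""
    sequence = "".join(text.split()).upper()  # drop spaces and line breaks
    invalid = sorted(set(sequence) - IUPAC_NUCLEOTIDES)
    if invalid:
        raise ValueError(f"non-IUPAC characters found: {invalid}")
    return sequence
```

Position numbers, which other formats allow, make this check fail, in line with the "no numbers!" rule.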

FASTA FORMAT
A sequence in FASTA format begins with a single-line description, followed by lines of sequence data.
The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column.
It is recommended that all lines of text be shorter than 80 characters in length.
An example sequence in FASTA format is:
>U03518 Aspergillus awamori internal transcribed spacer 1 (ITS1)
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCC
TGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGC
CGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACT
TTCAACAATGGATCTCTTGGTTCCGGC
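
The FASTA rules above translate directly into a small parser. This is a minimal sketch with a hypothetical function name. It accepts several sequences in one file and joins the sequence lines of each entry.

```python
from typing import Iterator, List, Optional, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from FASTA-formatted text."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous entry, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
        else:
            raise ValueError("sequence data found before the first '>' description line")
    if header is not None:
        yield header, "".join(chunks)
```

Applied to the example above, the description would be the text after ">" (starting with U03518) and the sequence would be the four data lines joined together.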

EMBL/Swiss-Prot
(http://www.ebi.ac.uk/help/formats_frame.html)
• The first line of each sequence entry is the ID line, which contains the entry name, data class, molecule, division and sequence length.
• An XX line contains no data; it is just a separator.
• The AC line lists the accession number.
• The DE line gives a description of the sequence.
• FT lines give precise annotation for the sequence.
• The SQ line, with SQ in the first two spaces, introduces the sequence information.
• The sequence data begin on the line after the SQ line.
• The last line of each sequence entry in the file is a terminator line, which has the two characters // in the first two spaces.
ID   AA03518    standard; DNA; FUN; 237 BP.
XX
AC   U03518;
XX
DE   Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE   rRNA and 5.8S rRNA genes, partial sequence.
RX   MEDLINE; 94303342.
RX   PUBMED; 8030378.
XX
FT   rRNA            <1..20
FT                   /product="18S ribosomal RNA"
FT   misc_RNA        21..205
FT                   /standard_name="Internal transcribed spacer 1 (ITS1)"
FT   rRNA            206..>237
FT                   /product="5.8S ribosomal RNA"
SQ   Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
     aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc        60
     tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg       120
     ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc       180
     tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc          237
//

EMBL FORMAT
A sequence file in EMBL format can contain several sequences.
One sequence entry starts with an identifier line ("ID "), followed by further
annotation lines. The start of the sequence is marked by a line starting with "SQ"
and the end of the sequence is marked by two slashes ("//").
An example sequence in EMBL format is:
ID AA03518 standard; DNA; FUN; 237 BP.
XX
AC U03518;
XX
DE Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
XX
SQ Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc 60
tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg 120
ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc 180
tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc 237
//
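
The EMBL layout (ID line first, sequence between the SQ line and //) can be read with a short line-by-line parser. The sketch below handles a single entry and extracts only the entry name, accession and sequence. The function name and the returned dictionary keys are assumptions. The test data is the example entry shown above.

```python
from typing import Dict, List, Optional

EXAMPLE_EMBL = """ID   AA03518    standard; DNA; FUN; 237 BP.
XX
AC   U03518;
XX
DE   Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
XX
SQ   Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
     aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc        60
     tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg       120
     ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc       180
     tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc          237
//
"""


def parse_embl_entry(text: str) -> Dict[str, Optional[str]]:
    """Extract entry name, accession and sequence from one EMBL entry."""
    entry_id: Optional[str] = None
    accession: Optional[str] = None
    residues: List[str] = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # terminator line ends the entry
        if in_sequence:
            # Keep letters only: this drops the spaces and the position counts.
            residues.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ID"):
            entry_id = line[2:].split()[0].rstrip(";")
        elif line.startswith("AC"):
            accession = line[2:].split()[0].rstrip(";")
        elif line.startswith("SQ"):
            in_sequence = True
    return {"id": entry_id, "accession": accession, "sequence": "".join(residues)}
```

A useful sanity check is to compare the length of the parsed sequence with the length stated on the ID and SQ lines (237 BP in the example).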

GENBANK FORMAT
A sequence file in GenBank format can contain several sequences.
One sequence in GenBank format starts with a line containing the word LOCUS
and a number of annotation lines. The start of the sequence is marked by a line
containing "ORIGIN" and the end of the sequence is marked by two slashes ("//").
• Can contain several sequences
• One sequence starts with: "LOCUS"
• The sequence starts with: "ORIGIN"
• The sequence ends with: "//"
An example sequence in GenBank format is:
LOCUS AAU03518 237 bp DNA PLN 04-FEB-1995
DEFINITION Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
rRNA and 5.8S rRNA genes, partial sequence.
ACCESSION U03518
BASE COUNT 41 a 77 c 67 g 52 t
ORIGIN
1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
181 tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
//
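
Since a GenBank file can hold several sequences, a reader has to split on LOCUS and // as well as collect the lines after ORIGIN. The following sketch does that for the three fields discussed here. The function name and dictionary keys are hypothetical, and all other annotation lines are ignored.

```python
from typing import Dict, Iterator, Optional

EXAMPLE_GENBANK = """LOCUS       AAU03518      237 bp    DNA             PLN       04-FEB-1995
DEFINITION  Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
            rRNA and 5.8S rRNA genes, partial sequence.
ACCESSION   U03518
BASE COUNT       41 a     77 c     67 g     52 t
ORIGIN
        1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
       61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
      121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
      181 tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
//
"""


def iter_genbank_records(text: str) -> Iterator[Dict[str, Optional[str]]]:
    """Yield one dict (locus, accession, sequence) per record in the text."""
    record: Optional[Dict[str, Optional[str]]] = None
    in_origin = False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            # LOCUS opens a new record.
            record = {"locus": line.split()[1], "accession": None, "sequence": ""}
            in_origin = False
        elif record is None:
            continue  # skip anything outside a record
        elif line.startswith("//"):
            yield record  # terminator closes the record
            record = None
        elif in_origin:
            # Keep letters only: this drops position numbers and spaces.
            record["sequence"] += "".join(ch for ch in line if ch.isalpha())
        elif line.startswith("ACCESSION"):
            record["accession"] = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_origin = True
```

Writing it as a generator means a large multi-record file can be processed one record at a time.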

PIR - PROTEIN SEQUENCE DB
• PIR was established in 1984 by the National Biomedical Research Foundation (NBRF) as a resource to assist researchers in the identification and interpretation of protein sequence information.
• Prior to that, the NBRF compiled the first comprehensive collection of macromolecular sequences in the Atlas of Protein Sequence and Structure, published from 1965-1978 under the editorship of Margaret O. Dayhoff. Dr. Dayhoff and her research group pioneered the development of computer methods for the comparison of protein sequences, for the detection of distantly related sequences and duplications within sequences, and for the inference of evolutionary histories from alignments of protein sequences.

STRUCTURAL DB - PDB

• The Protein Data Bank (PDB) is a repository for the
three-dimensional structural data of large biological
molecules, such as proteins and nucleic acids.
• The data, typically obtained by X-ray
crystallography or NMR spectroscopy and submitted
by biologists and biochemists from around the world,
are freely accessible on the Internet via the websites of
its member organisations.
• The PDB is overseen by an organization called
the Worldwide Protein Data Bank (wwPDB).

• The PDB is a key resource in areas of structural
biology, such as structural genomics.
• Most major scientific journals, and some funding
agencies, now require scientists to submit their
structure data to the PDB.
• If the contents of the PDB are thought of as primary
data, then there are hundreds of derived (i.e.,
secondary) databases that categorize the data
differently.
• For example, both SCOP and CATH categorize
structures according to type of structure and assumed
evolutionary relations.

• HEADER, TITLE and AUTHOR records provide information about the
researchers who defined the structure; numerous other types of records are
available to provide other types of information.
• REMARK records can contain free-form annotation, but they also
accommodate standardized information; for example, the REMARK 350
BIOMT records describe how to compute the coordinates of the
experimentally observed multimer from those of the explicitly specified ones
of a single repeating unit.
• SEQRES records give the sequences of the three peptide chains (named A, B
and C), which are very short in this example but usually span multiple lines.
• ATOM records describe the coordinates of the atoms that are part of the
protein. For example, the first ATOM line of the example describes the alpha-N atom
of the first residue of peptide chain A, which is a proline residue; the first
three floating point numbers are its x, y and z coordinates, in units
of Ångströms.
• HETATM records describe the coordinates of hetero-atoms, that is, those atoms
which are not part of the protein molecule.
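
ATOM and HETATM records use fixed column positions, so the fields described above can be read by slicing each line. The following is a minimal Python sketch, not taken from the slides; the function name `parse_coordinate_record` is hypothetical, the column ranges follow the published PDB file format, and the example line is an illustrative alpha-N atom of a proline in chain A with assumed coordinate values.

```python
def parse_coordinate_record(line):
    """Parse one ATOM or HETATM line using the fixed PDB column layout."""
    record = line[0:6].strip()
    if record not in ("ATOM", "HETATM"):
        raise ValueError("not a coordinate record: " + record)
    return {
        "record": record,
        "serial": int(line[6:11]),           # columns 7-11
        "atom_name": line[12:16].strip(),    # columns 13-16
        "residue": line[17:20].strip(),      # columns 18-20
        "chain": line[21],                   # column 22
        "residue_number": int(line[22:26]),  # columns 23-26
        "x": float(line[30:38]),             # columns 31-38, in Angstroms
        "y": float(line[38:46]),             # columns 39-46
        "z": float(line[46:54]),             # columns 47-54
    }


# Illustrative line, built field by field so the columns are explicit.
example = (
    "ATOM  "    # columns 1-6: record type
    "    1"     # columns 7-11: atom serial number
    " "         # column 12: unused
    " N  "      # columns 13-16: atom name
    " "         # column 17: alternate location indicator
    "PRO"       # columns 18-20: residue name
    " "         # column 21: unused
    "A"         # column 22: chain identifier
    "   1"      # columns 23-26: residue sequence number
    "    "      # columns 27-30: insertion code and padding
    "   8.316"  # columns 31-38: x (assumed value)
    "  21.206"  # columns 39-46: y (assumed value)
    "  21.530"  # columns 47-54: z (assumed value)
)

atom = parse_coordinate_record(example)
print(atom["residue"], atom["chain"], (atom["x"], atom["y"], atom["z"]))
```

Slicing by column rather than splitting on whitespace matters because adjacent fields in real PDB files can run together without a separating space.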

PUBCHEM
• PubChem is a database of chemical molecules and their activities
against biological assays. The system is maintained by
the National Center for Biotechnology Information (NCBI), a
component of the National Library of Medicine, which is part of
the United States National Institutes of Health (NIH). PubChem
can be accessed for free through a web user interface. Millions of
compound structures and descriptive datasets can be freely
downloaded via FTP. PubChem contains substance descriptions
and small molecules with fewer than 1000 atoms and 1000
bonds. More than 80 database vendors contribute to the growing
PubChem database.


