Biological Databases
Course: B.Sc Biochemistry
Subject: Basic of Bioinformatics
Unit: II
What can be discovered about a gene by a database search?
• A little or a lot, depending on the gene
• Evolutionary information: homologous genes, taxonomic distributions, allele frequencies, synteny, etc.
• Genomic information: chromosomal location, introns, UTRs, regulatory regions, shared domains, etc.
• Structural information: associated protein structures, fold types, structural domains
• Expression information: expression specific to particular tissues, developmental stages, phenotypes, diseases, etc.
• Functional information: enzymatic/molecular function, pathway/cellular role, localization, role in diseases
Using a database
• How to get information out of a database:
  - Browsing: no targeted information to retrieve
  - Search: looking for particular information
• Searching a database:
  - Must have a key that identifies the element(s) of the database that are of interest.
    - Name of gene
    - Sequence of gene
    - Other information
  - Helps to have particular informational goals
Searching for information about genes and their products
• Gene and gene product databases are often organized by sequence
  - Genomic sequence encodes all traits of an organism.
  - Gene products are uniquely described by their sequences.
  - Similar sequences among biomolecules often indicate both similar function and an evolutionary relationship.
• Macromolecular sequences provide biologically meaningful keys for searching databases
Searching sequence databases
• Start from sequence, find information about it
• Many kinds of input sequences
  - Could be amino acid or nucleotide sequence
  - Genomic or mRNA/cDNA or protein sequence
  - Complete or fragmentary sequences
• Exact matches are rare (even uninteresting in many cases), so the goal is often to retrieve a set of similar sequences.
  - Both small (mutations) and large (required for function) differences within “similar” can be interesting.
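The idea of retrieving "similar" rather than exact matches can be sketched in a few lines. This is a minimal illustration only: it assumes sequences of equal length and uses simple percent identity, whereas real search tools align sequences first. The function names and the 80% threshold are hypothetical choices made for this example.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("this sketch only compares sequences of equal length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a.upper(), seq_b.upper()))
    return 100.0 * matches / len(seq_a)


def similar_sequences(query, database, threshold=80.0):
    """Return (name, identity) pairs at or above the threshold, best first."""
    hits = []
    for name, seq in database.items():
        if len(seq) != len(query):
            continue  # no alignment step in this sketch, so skip other lengths
        identity = percent_identity(query, seq)
        if identity >= threshold:
            hits.append((name, identity))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy "database" with made-up entries
toy_db = {
    "exact": "ACGTACGTAC",
    "one_off": "ACGTACGTAA",
    "far": "TTTTTTTTTT",
}
hits = similar_sequences("ACGTACGTAC", toy_db)
```

Here the exact match and the single-mutation sequence are both returned, while the unrelated sequence falls below the threshold.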
What might we want to know about a sequence?
• Is this sequence similar to any known genes? How close is the best match? Significance?
• What do we know about that gene?
  - Genomic (chromosomal location, allelic information, regulatory regions, etc.)
  - Structural (known structure? structural domains? etc.)
  - Functional (molecular, cellular & disease)
• Evolutionary information:
  - Is this gene found in other organisms?
  - What is its taxonomic tree?
A historical perspective
• The 1960s: the birth of bioinformatics
  - High-level computer languages
  - Protein sequence data
  - Academic access to computers
• Margaret Oakley Dayhoff
  - First protein database
  - First program for sequence assembly
[Figure: IBM 7090 computer]
Benfey and Protopapas, "Genomics" © 2005 Prentice Hall Inc. / A Pearson Education Company / Upper Saddle River, New Jersey 07458
By way of comparison…
IBM 7090 computer (1960)
• 32 Kbytes RAM
• 2.18 µs memory cycle time (roughly 0.46 MHz)
• $2,900,000
20” Apple iMac (2008)
• 1 GB RAM
• 2.4 GHz
• $1199
Solving problems in computer science
• Necessary parameters for assessing the difficulty of a computer science problem
  - Algorithmic complexity
    - Is the problem theoretically solvable?
    - If so, what is the most efficient solution?
  - Current state of computer technology
    - Memory
    - CPU speed
    - Cost
Algorithms
• An algorithm is a sequence of instructions that one must perform in order to solve a well-formulated problem
  - First you must identify exactly what the problem is!
• A problem describes a class of computational tasks. A problem instance is one particular input from that class
• In general, you should design your algorithms to work for any instance of a problem (although there are cases in which this is not possible)
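The distinction between a problem, an instance, and an algorithm can be made concrete with a small sketch. The problem chosen here (computing the GC fraction of a nucleotide sequence) is an illustrative assumption, not taken from the slides; each string in `instances` is one instance, and the function is an algorithm meant to work for any of them.

```python
def gc_content(sequence: str) -> float:
    """Algorithm: fraction of G and C bases in a nucleotide sequence."""
    if not sequence:
        # One of the cases where the algorithm cannot give an answer:
        # the GC fraction of an empty sequence is undefined.
        raise ValueError("empty sequence: GC content is undefined")
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


# Three instances of the same problem, solved by the same algorithm
instances = ["ATGC", "GGCC", "ATAT"]
results = [gc_content(s) for s in instances]
```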
Computer technology: memory, CPU speed, cost
• Dramatic improvements on a yearly basis
• We do a lot of our work using desktop Macs out of the box
- 2 quad core 2.8 GHz processors, 500 GB disk space, 4 GB RAM for ~$3000
- 2 quad core 3.0 GHz processors, 2.5 TB disk space, 8 GB RAM for ~$6000
• CPU speed vs. memory: which is more important?
- for protein structure, might need many calculations but limited memory
- for genome searches, might have few calculations but huge amounts to store
in memory
• Reading from memory is several orders of magnitude faster than reading from disk
Databases
• What is a database?
  - A collection of related data elements
    - tables
    - columns (fields)
    - rows (records)
  - Records retrieved using a query language
• Database technology is well established
• Databases are a fundamental part of the bioinformatics revolution. Much of the conceptual framework for databases had already been developed by the 1960s.
• By the 1970s, database technology had already permeated much of the government and corporate sectors.
• Modern databases can be described as well-organized collections of data that can be accessed through the use of a query language.
• Two databases of particular importance to biologists are GenBank®, which encompasses all publicly available nucleotide sequences and their protein translations, and the Protein Data Bank, which contains high quality 3-D structures of proteins, nucleic acids, and carbohydrates.
• The entire sequence of a single human could fit on one or two CD-ROMs. As we shall see shortly, it is the comparison of sequences that presents algorithmic challenges.
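The CD-ROM estimate can be checked with back-of-the-envelope arithmetic. The figures below are assumptions made for this sketch and do not come from the slides: a haploid genome of roughly 3.2 billion bases, 2 bits per base with no compression, and a nominal CD-ROM capacity of 700 MB.

```python
GENOME_BASES = 3_200_000_000   # assumed approximate haploid human genome size
BITS_PER_BASE = 2              # four letters (A, C, G, T) need log2(4) = 2 bits
CD_ROM_BYTES = 700 * 10**6     # assumed nominal CD-ROM capacity

genome_bytes = GENOME_BASES * BITS_PER_BASE // 8
cds_needed = -(-genome_bytes // CD_ROM_BYTES)  # ceiling division

print(f"{genome_bytes / 10**6:.0f} MB, {cds_needed} CD-ROM(s)")
```

Under these assumptions the raw sequence takes about 800 MB, which is consistent with "one or two CD-ROMs".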
Tables (entities)
•basic elements of information to track, e.g., gene, organism, sequence, citation
Columns (fields)
•attributes of tables, e.g. for citation table, title, journal, volume, author
Rows (records)
•actual data
•whereas fields describe what data is stored, the rows of a table are where the actual data
is stored
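As a minimal sketch of tables, columns and rows, the citation example above can be built with Python's built-in `sqlite3` module. The table layout follows the fields named on the slide; the three records are made-up placeholders, not real citations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Table (entity) with its columns (fields)
conn.execute(
    "CREATE TABLE citation ("
    "id INTEGER PRIMARY KEY, title TEXT, journal TEXT, volume INTEGER, author TEXT)"
)

# Rows (records): the actual data, here hypothetical placeholders
conn.executemany(
    "INSERT INTO citation (title, journal, volume, author) VALUES (?, ?, ?, ?)",
    [
        ("Example title A", "Journal A", 12, "Author One"),
        ("Example title B", "Journal B", 7, "Author Two"),
        ("Example title C", "Journal A", 13, "Author Three"),
    ],
)

# The fields describe what is stored ...
columns = [row[1] for row in conn.execute("PRAGMA table_info(citation)")]

# ... and a query in the query language (SQL) retrieves matching records
rows = conn.execute(
    "SELECT title, author FROM citation WHERE journal = ? ORDER BY id",
    ("Journal A",),
).fetchall()
```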
Databases
What is a database?
• A database is a computerized archive used to store and organize data in such a way that information can be retrieved easily via a variety of search criteria. Databases are composed of computer hardware and software for data management.
What is a database?
• Each record, also called an entry, should contain a number of fields that hold the actual data items, for example, fields for names, phone numbers, addresses, dates.
• To retrieve a particular record from the database, a user can specify a particular piece of information, called a value, to be found in a particular field and expect the computer to retrieve the whole data record.
• This process is called making a query.
What is a database?
• A biological database is a collection of both experimental and theoretical data that is organized so that its contents can be easily
  - accessed
  - managed
  - updated
  - retrieved
• The activity of preparing a database can be divided into:
  - Collection of data in a form which can be easily accessed
  - Making it available to a multi-user system
Types of database
Flat file database
• A flat file database describes any of various means to encode a database model (most commonly a table) as a single file. A flat file can be a plain text file or a binary file. There are usually no structural relationships between the records.
• "Flat file database" may be defined very narrowly, or more broadly.
  - Strictly, a flat file database should consist of nothing but data and, if records vary in length, delimiters.
  - More broadly, the term refers to any database which exists in a single file in the form of rows and columns, with no relationships or links between records and fields except the table structure.
• Terms used to describe different aspects of a database and its tools differ from one implementation to the next, but the concepts remain the same.
  - FileMaker uses the term "Find", while MySQL uses the term "Query"; but the concept is the same. FileMaker "files", in version 7 and above, are equivalent to MySQL "databases", and so forth.
  - However, the basic terms "record" and "field" are used in nearly every flat file database implementation.
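A flat file in the strict sense (data plus delimiters) can be read with very little code. The sketch below assumes a tab-delimited text file whose first line names the fields; the file contents and the `read_flat_file` helper are hypothetical and only illustrate the record/field idea.

```python
import csv
import io

# Hypothetical flat file: one record per line, fields separated by tabs
FLAT_FILE = (
    "name\tphone\tcity\n"
    "Ada\t555-0100\tLondon\n"
    "Grace\t555-0199\tArlington\n"
)


def read_flat_file(text, delimiter="\t"):
    """Return the records of a delimited flat file as a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)


records = read_flat_file(FLAT_FILE)

# A "find" or "query" is just a scan over the records: there are no links
# between records beyond the table structure itself.
matches = [r for r in records if r["city"] == "London"]
```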
Relational database
• Relational databases are both created and queried by DataBase Management Systems (DBMSs).
• Relational databases displaced hierarchical databases because the ability to add new relations made it possible to add new information that was valuable but "broke" a database's original hierarchical conception.
• The trend continues as a networked planet and social media create the world of "big data", which is larger and less structured than the datasets and tasks that relational databases handle well (it is instructive to compare Hadoop).
Object oriented database
• An object database (also object-oriented database management system) is a database management system in which information is represented in the form of objects as used in object-oriented programming.
• Object databases are different from relational databases, which are table-oriented.
Biological database
Online Databases
When you query an online database, your query is translated into SQL, the database is
interrogated, and the answer displayed on your web browser.
[Figure: three tiers. (1) Your computer and browser (the “client”); (2) software on the “server” to receive and translate the instructions you enter into your browser; (3) the database itself.]
Image source: David Lane and Hugh E. Williams. Web Database Applications with PHP & MySQL. O’Reilly (2002).
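The "translate the query into SQL" step can be sketched as follows. This is not how any particular biological database implements its server; the `gene` table, the allowed field names and the helper functions are assumptions made for this example, and an in-memory SQLite database stands in for "the database itself".

```python
import sqlite3

ALLOWED_FIELDS = {"name", "organism"}  # hypothetical searchable fields


def translate_query(form):
    """Server-side step: turn submitted form fields into SQL plus parameters."""
    clauses, params = [], []
    for field, value in sorted(form.items()):
        if field not in ALLOWED_FIELDS:
            raise ValueError(f"unsupported search field: {field}")
        clauses.append(f"{field} = ?")  # placeholder keeps user input out of the SQL text
        params.append(value)
    sql = "SELECT name, organism FROM gene"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql + " ORDER BY name", params


def run_search(conn, form):
    """Interrogate the database and return rows for display in the browser."""
    sql, params = translate_query(form)
    return conn.execute(sql, params).fetchall()


# Stand-in database with two made-up rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (name TEXT, organism TEXT)")
conn.executemany(
    "INSERT INTO gene VALUES (?, ?)",
    [("ILK", "Homo sapiens"), ("ADH1", "Saccharomyces cerevisiae")],
)

answer = run_search(conn, {"organism": "Homo sapiens"})
```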
Biological Databases
•Over 1000 biological databases
•Vary in size, quality, coverage, level of interest
•Many of the major ones covered in the annual Database Issue
of Nucleic Acids Research
•What makes a good database?
•comprehensiveness
•accuracy
•is up-to-date
•good interface
•batch search/download
•API (web services, DAS, etc.)
“Ten Important Bioinformatics Databases”
Database     URL                     Content
GenBank      www.ncbi.nlm.nih.gov    nucleotide sequences
Ensembl      www.ensembl.org         human/mouse genome (and others)
PubMed       www.ncbi.nlm.nih.gov    literature references
NR           www.ncbi.nlm.nih.gov    protein sequences
SWISS-PROT   www.expasy.ch           protein sequences
InterPro     www.ebi.ac.uk           protein domains
OMIM         www.ncbi.nlm.nih.gov    genetic diseases
Enzymes      www.chem.qmul.ac.uk     enzymes
PDB          www.rcsb.org/pdb/       protein structures
KEGG         www.genome.ad.jp        metabolic pathways
Source: Bioinformatics for Dummies
NCBI (National Center for Biotechnology
Information)
• over 30 databases including GenBank,
PubMed, OMIM, and GEO
• Access all NCBI resources via Entrez
(www.ncbi.nlm.nih.gov/Entrez/)
PubMed
INFORMATION RETRIEVAL FROM BIOLOGICAL DATABASES
• NCBI-Entrez
• SRS (Sequence Retrieval System)
NCBI and Entrez
The Central Dogma & Biological Data
[Figure: the data types along the central dogma. Original DNA sequences (genomes); expressed DNA sequences (= mRNA sequences = cDNA sequences) and Expressed Sequence Tags (ESTs); protein sequences (inferred, or from direct sequencing); protein structures (from experiments, or models based on homologues); literature information.]
NCBI Databases and Services
• GenBank: primary sequence database
• Free public access to biomedical literature
  - PubMed: free Medline (3 million searches per day)
  - PubMed Central: full text online access
• Entrez: integrated molecular and literature databases
PRIMARY VS. DERIVATIVE SEQUENCE DATABASES
[Figure: sequencing centers and labs submit sequences to GenBank, the primary database, which is updated ONLY by submitters. The derivative resources shown are UniGene (algorithms), RefSeq (curators) and Genome Assembly, which are updated continually by NCBI.]
Sequence Databases at NCBI
• Primary
  - GenBank: NCBI’s primary sequence database
  - Trace Archive: reads from capillary sequencers
  - Sequence Read Archive: next generation data
• Derivative
  - GenPept (GenBank translations)
  - Outside Protein (UniProt/Swiss-Prot, PDB)
  - NCBI Reference Sequences (RefSeq)
GENBANK - PRIMARY SEQUENCE DB
• Nucleotide-only sequence database
• Archival (records) in nature
  - Historical
  - Reflective of submitter point of view (subjective)
  - Redundant
• Data
  - Direct submissions (traditional records)
  - Batch submissions
  - FTP accounts (genome data)
GENBANK - PRIMARY SEQUENCE DB (2)
• Three collaborating databases
  1. GenBank
  2. DNA Database of Japan (DDBJ)
  3. European Molecular Biology Laboratory (EMBL) Database
Traditional GenBank Record
ACCESSION U07418
VERSION U07418.1 GI:466461
Accession
•Stable
•Reportable
•Universal
Version
•Tracks changes in sequence
GI number
•NCBI internal use
[Figure annotations: "well annotated"; "the sequence is the data"]
NCBI and Entrez
• One of the most useful and comprehensive sources of databases is the NCBI, part of the National Library of Medicine.
• NCBI provides interesting summaries, browsers for genome data, and search tools
• Entrez is their database search interface: http://www.ncbi.nlm.nih.gov/Entrez
• Can search on gene names, sequences, chromosomal location, diseases, keywords, ...
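Entrez searches can also be expressed as URLs for programs to use. The sketch below is an assumption layered on top of the slides, which only show the web interface: it builds a request URL for NCBI's E-utilities `esearch.fcgi` endpoint with the `db`, `term` and `retmax` parameters. It only constructs the URL and does not contact NCBI; the helper name is hypothetical.

```python
from urllib.parse import urlencode

# Base address of NCBI E-utilities (not mentioned in the slides)
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(database, term, retmax=20):
    """Build an E-utilities search URL for one Entrez database."""
    params = {"db": database, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


url = esearch_url("gene", "alcohol dehydrogenase", retmax=5)
print(url)
```

The same keyword typed into the Entrez search box is passed here as the `term` parameter, and `db` selects which Entrez database is searched.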
What did we just do?
• Identify loci (genes) associated with the sequence. Input was Alcohol Dehydrogenase
• For each particular “hit”, we can look at that sequence and its alignment in more detail.
• See similar sequences, and the organisms in which they are found.
• But there’s much more that can be found on these genes, even just inside NCBI…
More from Entrez Gene
And more…
Sequence Retrieval System
• The Sequence Retrieval System is a database system that works with flat files. In addition, many bioinformatics tools are incorporated and can be combined with the database searches.
NCBI is not all there is...
• Links to non-NCBI databases
  - Reactome & KEGG for pathways
  - HGNC for nomenclature
  - UCSC Human Genome Browser
• Other important gene/protein resources not linked to:
  - UniProt (most carefully annotated)
  - PDB (main macromolecular structure repository)
• Other key biological data sources
  - Gene Ontology/Open Biological Ontologies
  - Enzyme
• Scientific society: iscb.org
• Journals, Conferences…
Take home messages
• There are a lot of molecular biology databases, containing a lot of valuable information
• Not even the best databases have everything (or the best of everything)
• These databases are moderately well cross-linked, and there are “linker” databases
• Sequence is a good identifier, maybe even better than gene name!
FILE FORMATS
IG/Stanford    Fitch           Plain/Raw
GenBank/GB     Fasta/Pearson   PIR/CODATA
NBRF           Zuker           MSF
EMBL           Olsen           ASN.1
GCG            Phylip 3.2      PAUP/NEXUS
DNAStrider     Phylip          Pretty
LOCUS, Accession, NID and protein_id
LOCUS: Unique string of 10 letters and numbers in the database. Not maintained amongst databases, and is therefore a poor sequence identifier.
ACCESSION: A unique identifier for the record and a citable entity; it does not change when the record is updated. A good record identifier, ideal for citation in publications.
VERSION: New system in which the accession and version together play the same role as the accession and gi number.
Nucleotide gi: GenInfo identifier (gi), a unique integer which will change every time the sequence changes.
PID: Protein identifier: g, e or d prefix to the gi number. Can have one or two on one CDS.
Protein gi: GenInfo identifier (gi), a unique integer which will change every time the sequence changes.
protein_id: Identifier which has the same structure and function as the nucleotide Accession.version numbers, but a slightly different format.
LOCUS, Accession, gi and PID
LOCUS HSU40282 1789 bp mRNA PRI 21-MAY-1998
DEFINITION Homo sapiens integrin-linked kinase (ILK) mRNA, complete cds.
ACCESSION U40282
VERSION U40282.1 GI:3150001
CDS 157..1515
/gene="ILK"
/note="protein serine/threonine kinase"
/codon_start=1
/product="integrin-linked kinase"
/protein_id="AAC16892.1"
/db_xref="PID:g3150002"
/db_xref="GI:3150002"
LOCUS: HSU40282
ACCESSION: U40282
VERSION: U40282.1
GI: 3150001
PID: g3150002
Protein gi: 3150002
protein_id: AAC16892.1
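The Accession.version convention in the record above lends itself to a small parsing sketch. The regular expression is an assumption that covers identifiers shaped like the ones shown here (letters, an optional underscore, digits, a dot, and a version number); it is not an official grammar for all accession formats.

```python
import re

# Assumed shape: letters, optional underscore, digits, ".", version digits
ACCESSION_VERSION = re.compile(r"^([A-Z]+_?\d+)\.(\d+)$")


def split_accession_version(identifier):
    """Split 'U40282.1' into the stable accession and the integer version."""
    match = ACCESSION_VERSION.match(identifier.strip())
    if match is None:
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return match.group(1), int(match.group(2))


nucleotide = split_accession_version("U40282.1")    # from the VERSION line
protein = split_accession_version("AAC16892.1")     # from /protein_id
```

The accession part stays fixed when a record is updated, while the version number is what changes when the sequence itself changes.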
PLAIN SEQUENCE FORMAT
A sequence in plain format may contain only IUPAC characters and
spaces (no numbers!).
Note: A file in plain sequence format may only contain one sequence,
while most other formats accept several sequences in one file.
An example sequence in plain format is:
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAA
CCTCCCATCCGTGTCTATTGTACCCTGTTGCTTCGGCGGGCCCGC
CGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTG
CCCGCCGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTC
TGAGTTGATTGAATGCAATCAGTTAAAACTTTCAACAATGGATCT
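The rules for plain format can be checked mechanically. The sketch below assumes a nucleotide sequence and the standard IUPAC nucleotide codes; it also tolerates line breaks, as in the wrapped example above. The function name is hypothetical.

```python
# IUPAC nucleotide codes (bases plus ambiguity symbols)
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def read_plain_sequence(text):
    """Return the single sequence in a plain-format file, or raise ValueError."""
    sequence = "".join(text.split()).upper()  # drop spaces and line breaks
    invalid = sorted(set(sequence) - IUPAC_NUCLEOTIDES)
    if invalid:
        raise ValueError(f"characters not allowed in plain format: {invalid}")
    return sequence


sequence = read_plain_sequence("AACCTGCGGA AGGATCATTA\nCCGAGTGCGG")
```

A line that carries position numbers, as in EMBL or GenBank sequence blocks, is rejected because digits are not IUPAC characters.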
FASTA FORMAT
A sequence in Fasta format begins with a single-line description,
followed by lines of sequence data.
The description line is distinguished from the sequence data by a greater-than (">") symbol in
the first column.
It is recommended that all lines of text be shorter than 80 characters in length
An example sequence in FASTA format is:
>U03518 Aspergillus awamori internal transcribed spacer 1 (ITS1)
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCC
TGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGC
CGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACT
TTCAACAATGGATCTCTTGGTTCCGGC
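The FASTA layout described above (a ">" description line followed by lines of sequence data) can be read with a short parser. This is a minimal sketch that handles several sequences in one file; the function name is hypothetical and no existing library is implied.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []  # start a new record
        elif header is None:
            raise ValueError("sequence data found before any '>' description line")
        else:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>U03518 Aspergillus awamori internal transcribed spacer 1 (ITS1)
AACCTGCGGAAGGATCATTACCGAGTGCGG
GTCCTTTGGGCCCAACC
>second_entry made-up record for illustration
ACGT
"""
records = parse_fasta(example)
```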
Line types in an EMBL entry:
• The first line of each sequence entry is the ID line, which contains the entry name, data class, molecule, division and sequence length.
• XX lines contain no data; they are just separators.
• The AC line lists the accession number.
• DE lines give a description of the sequence.
• FT lines give precise annotation (features) for the sequence.
• The SQ line, with SQ in the first two columns, introduces the sequence information.
• The sequence data begin on the line after the SQ line.
• The last line of each sequence entry in the file is a terminator line which has the two characters // in the first two columns.
ID AA03518 standard; DNA; FUN; 237 BP.
XX
AC U03518;
XX
DE Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE rRNA and 5.8S rRNA genes, partial sequence.
RX MEDLINE; 94303342.
RX PUBMED; 8030378.
XX
FT rRNA <1..20
FT /product="18S ribosomal RNA"
FT misc_RNA 21..205
FT /standard_name="Internal transcribed spacer 1 (ITS1)"
FT rRNA 206..>237
FT /product="5.8S ribosomal RNA"
SQ Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc 60
tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg 120
ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc 180
tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc 237
//
EMBL/Swiss Prot
(http://www.ebi.ac.uk/help/formats_frame.html)
EMBL FORMAT
A sequence file in EMBL format can contain several sequences.
One sequence entry starts with an identifier line ("ID "), followed by further
annotation lines. The start of the sequence is marked by a line starting with "SQ"
and the end of the sequence is marked by two slashes ("//").
An example sequence in EMBL format is:
ID AA03518 standard; DNA; FUN; 237 BP.
XX
AC U03518;
XX
DE Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc 60
tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg 120
ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc 180
tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc 237
//
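The EMBL layout can be read with a small state machine keyed on the two-letter line codes. The sketch below keeps only the ID, AC, DE and sequence lines and ignores everything else (XX, RX, FT and so on); the function name and the returned dictionary keys are choices made for this example.

```python
def parse_embl(text):
    """Parse EMBL-style text into a list of dicts, one per entry."""
    entries = []
    entry = None
    in_sequence = False
    for line in text.splitlines():
        code = line[:2]
        if code == "//":  # terminator line closes the current entry
            if entry is not None:
                entry["description"] = " ".join(entry["description"])
                entry["sequence"] = "".join(entry["sequence"])
                entries.append(entry)
            entry, in_sequence = None, False
        elif in_sequence:
            # keep the blocks of letters, drop the trailing position number
            entry["sequence"].extend(p for p in line.split() if p.isalpha())
        elif code == "ID":
            fields = line[2:].split()
            entry = {
                "id": fields[0] if fields else None,
                "accession": None,
                "description": [],
                "sequence": [],
            }
        elif entry is None:
            continue  # text outside any entry
        elif code == "AC":
            entry["accession"] = line[2:].split(";")[0].strip()
        elif code == "DE":
            entry["description"].append(line[2:].strip())
        elif code == "SQ":
            in_sequence = True  # sequence data start on the next line
    return entries


example = """ID   AA03518    standard; DNA; FUN; 12 BP.
XX
AC   U03518;
XX
DE   Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE   rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ   Sequence 12 BP;
     aacctg cggaag        12
//
"""
entries = parse_embl(example)
```

The example entry is shortened to 12 bases so that it stays readable; the header lines are taken from the record above.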
GENBANK FORMAT
A sequence file in GenBank format can contain several sequences.
One sequence in GenBank format starts with a line containing the word LOCUS
and a number of annotation lines. The start of the sequence is marked by a line
containing "ORIGIN" and the end of the sequence is marked by two slashes ("//").
•Can contain several sequences
•One sequence starts with: "LOCUS"
•The sequence starts with: "ORIGIN"
•The sequence ends with: "//"
An example sequence in GenBank format is:
LOCUS AAU03518 237 bp DNA PLN 04-FEB-1995
DEFINITION Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
rRNA and 5.8S rRNA genes, partial sequence.
ACCESSION U03518
BASE COUNT 41 a 77 c 67 g 52 t
ORIGIN
1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
181 tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
//
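The three markers listed for GenBank format (LOCUS, ORIGIN and //) are enough to pull records apart. The following is a minimal sketch that keeps only the locus name, the accession and the sequence; it is not a full GenBank parser, and the function name is hypothetical.

```python
def parse_genbank(text):
    """Return a list of dicts with locus, accession and sequence per record."""
    records = []
    record = None
    in_origin = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("LOCUS"):
            parts = stripped.split()
            record = {
                "locus": parts[1] if len(parts) > 1 else None,
                "accession": None,
                "sequence": [],
            }
            in_origin = False
        elif record is None:
            continue  # text outside any record
        elif stripped == "//":  # end of the sequence and of the record
            record["sequence"] = "".join(record["sequence"])
            records.append(record)
            record, in_origin = None, False
        elif in_origin:
            # skip the leading position number, keep the blocks of letters
            record["sequence"].extend(p for p in stripped.split() if p.isalpha())
        elif stripped.startswith("ACCESSION"):
            parts = stripped.split()
            record["accession"] = parts[1] if len(parts) > 1 else None
        elif stripped.startswith("ORIGIN"):
            in_origin = True  # sequence lines follow
    return records


example = """LOCUS       AAU03518  12 bp  DNA  PLN  04-FEB-1995
DEFINITION  Aspergillus awamori internal transcribed spacer 1 (ITS1).
ACCESSION   U03518
ORIGIN
        1 aacctgcgga ag
//
"""
records = parse_genbank(example)
```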
PIR - PROTEIN SEQUENCE DB
• PIR was established in 1984 by the National Biomedical Research Foundation (NBRF) as a resource to assist researchers in the identification and interpretation of protein sequence information.
• Prior to that, the NBRF compiled the first comprehensive collection of macromolecular sequences in the Atlas of Protein Sequence and Structure, published from 1965-1978 under the editorship of Margaret O. Dayhoff. Dr. Dayhoff and her research group pioneered the development of computer methods for the comparison of protein sequences, for the detection of distantly related sequences and duplications within sequences, and for the inference of evolutionary histories from alignments of protein sequences.
STRUCTURAL DB-PDB
Protein Data Bank (PDB)
• The Protein Data Bank (PDB) is a repository for the
three-dimensional structural data of large biological
molecules, such as proteins and nucleic acids.
• The data, typically obtained by X-ray
crystallography or NMR spectroscopy and submitted
by biologists and biochemists from around the world,
are freely accessible on the Internet via the websites of
its member organisations.
• The PDB is overseen by an organization called
the Worldwide Protein Data Bank (wwPDB).
• The PDB is a key resource in areas of structural
biology, such as structural genomics.
• Most major scientific journals, and some funding
agencies, now require scientists to submit their
structure data to the PDB.
• If the contents of the PDB are thought of as primary
data, then there are hundreds of derived (i.e.,
secondary) databases that categorize the data
differently.
• For example, both SCOP and CATH categorize
structures according to type of structure and assumed
evolutionary relations.
• HEADER, TITLE and AUTHOR records provide information about the
researchers who defined the structure; numerous other types of records are
available to provide other types of information.
• REMARK records can contain free-form annotation, but they also
accommodate standardized information; for example, the REMARK 350
BIOMT records describe how to compute the coordinates of the
experimentally observed multimer from those of the explicitly specified ones
of a single repeating unit.
• SEQRES records give the sequences of the three peptide chains (named A, B
and C), which are very short in this example but usually span multiple lines.
• ATOM records describe the coordinates of the atoms that are part of the
protein. For example, the first ATOM line above describes the alpha-N atom
of the first residue of peptide chain A, which is a proline residue; the first
three floating point numbers are its x, y and z coordinates and are in units
of Ångströms.
• HETATM records describe coordinates of hetero-atoms, that is, those atoms
which are not part of the protein molecule.
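ATOM and HETATM records in the legacy PDB format are fixed-width lines, so each field can be read by column position. The sketch below shows one way to read such a line, under assumptions: `parse_atom_line` and `SAMPLE_LINE` are hypothetical names, and the sample line is built here for illustration (an N atom of a proline in chain A, with invented coordinates) rather than taken from a real entry.

```python
# Illustrative ATOM line assembled field by field so the columns are explicit.
# The coordinate, occupancy and B-factor values are made up for this example.
SAMPLE_LINE = (
    f"{'ATOM':<6}"     # columns 1-6: record name
    f"{1:>5}"          # columns 7-11: atom serial number
    " "                # column 12: blank
    " N  "             # columns 13-16: atom name
    " "                # column 17: alternate location indicator
    "PRO"              # columns 18-20: residue name
    " "                # column 21: blank
    "A"                # column 22: chain identifier
    f"{1:>4}"          # columns 23-26: residue sequence number
    " "                # column 27: insertion code
    "   "              # columns 28-30: blank
    f"{8.316:8.3f}"    # columns 31-38: x coordinate (Ångströms)
    f"{21.206:8.3f}"   # columns 39-46: y coordinate (Ångströms)
    f"{21.530:8.3f}"   # columns 47-54: z coordinate (Ångströms)
    f"{1.00:6.2f}"     # columns 55-60: occupancy
    f"{17.44:6.2f}"    # columns 61-66: temperature factor
    f"{'':10}"         # columns 67-76: blank
    f"{'N':>2}"        # columns 77-78: element symbol
)


def parse_atom_line(line):
    """Read the main fields of one ATOM or HETATM line by column position."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain_id": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


atom = parse_atom_line(SAMPLE_LINE)
print(atom["atom_name"], atom["res_name"], atom["chain_id"],
      atom["x"], atom["y"], atom["z"])
```

Slicing by column rather than splitting on whitespace matters because adjacent fields in a PDB line may have no space between them, for example when a coordinate fills its whole eight-character field.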
PUBCHEM
• PubChem is a database of chemical molecules and their activities
against biological assays. The system is maintained by
the National Center for Biotechnology Information (NCBI), a
component of the National Library of Medicine, which is part of
the United States National Institutes of Health (NIH). PubChem
can be accessed for free through a web user interface. Millions of
compound structures and descriptive datasets can be freely
downloaded via FTP. PubChem contains substance descriptions
and small molecules with fewer than 1000 atoms and 1000
bonds. More than 80 database vendors contribute to the growing
PubChem database.
Books and Web References
• Book names:
1. Introduction to Bioinformatics by T. K. Attwood
2. BioInformatics by Sangita
3. Basic Bioinformatics by S. Ignacimuthu, s.j.
• http://en.wikipedia.org/wiki/Biological_database
• http://bioinformaticsweb.net/data.html
• http://www.apbionet.org/s-star/downloads/tutorial/t1b.pdf

B.sc biochem i bobi u 2 database

  • 1. Biological Databases. Course: B.Sc Biochemistry. Subject: Basic of Bioinformatics. Unit: II
  • 2. What can be discovered about a gene by a database search?
      - A little or a lot, depending on the gene
      - Evolutionary information: homologous genes, taxonomic distributions, allele frequencies, synteny, etc.
      - Genomic information: chromosomal location, introns, UTRs, regulatory regions, shared domains, etc.
      - Structural information: associated protein structures, fold types, structural domains
      - Expression information: expression specific to particular tissues, developmental stages, phenotypes, diseases, etc.
      - Functional information: enzymatic/molecular function, pathway/cellular role, localization, role in diseases
  • 3. Using a database
      - How to get information out of a database:
          - Browsing: no targeted information to retrieve
          - Search: looking for particular information
      - Searching a database:
          - Must have a key that identifies the element(s) of the database that are of interest:
              - Name of gene
              - Sequence of gene
              - Other information
          - Helps to have particular informational goals
  • 4. Searching for information about genes and their products
      - Gene and gene product databases are often organized by sequence
          - Genomic sequence encodes all traits of an organism.
          - Gene products are uniquely described by their sequences.
          - Similar sequences among biomolecules usually indicate both similar function and an evolutionary relationship.
      - Macromolecular sequences provide biologically meaningful keys for searching databases
  • 5. Searching sequence databases
      - Start from sequence, find information about it
      - Many kinds of input sequences
          - Could be amino acid or nucleotide sequence
          - Genomic or mRNA/cDNA or protein sequence
          - Complete or fragmentary sequences
      - Exact matches are rare (even uninteresting in many cases), so often the goal is to retrieve a set of similar sequences.
          - Both small (mutations) and large (required for function) differences within "similar" can be interesting.
  • 6. What might we want to know about a sequence?
      - Is this sequence similar to any known genes? How close is the best match? Significance?
      - What do we know about that gene?
          - Genomic (chromosomal location, allelic information, regulatory regions, etc.)
          - Structural (known structure? structural domains? etc.)
          - Functional (molecular, cellular & disease)
      - Evolutionary information:
          - Is this gene found in other organisms?
          - What is its taxonomic tree?
  • 7. A historical perspective
      - The 1960s: the birth of bioinformatics
          - High-level computer languages
          - Protein sequence data
          - Academic access to computers
      - Margaret Oakley Dayhoff
          - First protein database
          - First program for sequence assembly
      [Image: IBM 7090 computer]
      Benfey and Protopapas, "Genomics" © 2005 Prentice Hall Inc. / A Pearson Education Company / Upper Saddle River, New Jersey 07458
  • 8. By way of comparison…
      - IBM 7090 computer: 32 Kbytes RAM, 2.18 µs memory cycle time (roughly 0.46 MHz), $2,900,000 in 1960
      - 20" Apple iMac: 1 GB RAM, 2.4 GHz, $1199 in 2008
  • 9. Solving problems in computer science
      - Necessary parameters for assessing the difficulty of a computer science problem
          - Algorithmic complexity
              - Is the problem theoretically solvable?
              - If so, what is the most efficient solution?
          - Current state of computer technology
              - Memory
              - CPU speed
              - Cost
  • 10. Algorithms
      - An algorithm is a sequence of instructions that one must perform in order to solve a well-formulated problem
      - First you must identify exactly what the problem is!
      - A problem describes a class of computational tasks. A problem instance is one particular input from that class
      - In general, you should design your algorithms to work for any instance of a problem (although there are cases in which this is not possible)
  • 11. Computer technology: memory, CPU speed, cost
      - Dramatic improvements on a yearly basis
      - We do a lot of our work using desktop Macs out of the box
          - 2 quad core 2.8 GHz processors, 500 GB disk space, 4 GB RAM for ~$3000
          - 2 quad core 3.0 GHz processors, 2.5 TB disk space, 8 GB RAM for ~$6000
      - CPU speed vs. memory: which is more important?
          - for protein structure, might need many calculations but limited memory
          - for genome searches, might have few calculations but huge amounts to store in memory
      - Reading from memory is several orders of magnitude faster than reading from disk
  • 12. Databases
      - What is a database?
          - A collection of related data elements
              - tables
              - columns (fields)
              - rows (records)
          - Records retrieved using a query language
      - Database technology is well established
  • 13. Databases and the bioinformatics revolution
      - Databases are a fundamental part of the bioinformatics revolution. Much of the conceptual framework for databases had already been developed by the 1960s.
      - By the 1970s, database technology had already permeated much of the government and corporate sectors.
      - Modern databases can be described as well-organized collections of data that can be accessed through the use of a query language.
      - Two databases of particular importance to biologists are GenBank®, which encompasses all publicly available nucleotide sequences and their protein translations, and the Protein Data Bank, which contains high quality 3-D structures of proteins, nucleic acids, and carbohydrates.
      - The entire sequence of a single human could fit on one or two CD-ROMs. As we shall see shortly, it is the comparison of sequences that presents algorithmic challenges.
  • 14. Databases
      - Tables (entities): basic elements of information to track, e.g., gene, organism, sequence, citation
      - Columns (fields): attributes of tables, e.g., for a citation table: title, journal, volume, author
      - Rows (records): the actual data; whereas fields describe what data is stored, the rows of a table are where the actual data is stored
  • 15. What is a database?
      - A database is a computerized archive used to store and organize data in such a way that information can be retrieved easily via a variety of search criteria. Databases are composed of computer hardware and software for data management.
  • 16. What is a database?
      - Each record, also called an entry, should contain a number of fields that hold the actual data items, for example, fields for names, phone numbers, addresses, dates.
      - To retrieve a particular record from the database, a user can specify a particular piece of information, called a value, to be found in a particular field and expect the computer to retrieve the whole data record.
      - This process is called making a query.
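The record/field/value/query vocabulary above can be sketched with Python's built-in `sqlite3` module. This is only an illustration: the `contacts` table, its rows, and the `find_by_field` helper are hypothetical names invented here, not part of the slides.

```python
import sqlite3

# Hypothetical table and rows, used only to illustrate records and fields.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (name TEXT, phone TEXT, address TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [("Ada", "555-0100", "1 Main St"), ("Grace", "555-0101", "2 Oak Ave")],
)

ALLOWED_FIELDS = {"name", "phone", "address"}


def find_by_field(connection, field, value):
    """Query: return every whole record whose `field` holds `value`."""
    if field not in ALLOWED_FIELDS:
        raise ValueError(f"unknown field: {field}")
    # The field name is checked against a whitelist; the value is bound safely.
    cursor = connection.execute(
        f"SELECT name, phone, address FROM contacts WHERE {field} = ?",
        (value,),
    )
    return cursor.fetchall()


records = find_by_field(conn, "name", "Grace")
```

Specifying one value in one field returns the complete record, which is the behaviour the slide calls "making a query".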
  • 17. What is a database?
      - A biological database is a collection of both experimental and theoretical data that is organized so that its contents can be easily
          - accessed
          - managed
          - updated
          - retrieved
      - The activity of preparing a database can be divided into:
          - Collection of data in a form which can be easily accessed
          - Making it available to a multi-user system
  • 18. Types of database
  • 19. Flat file database
      - A flat file database describes any of various means to encode a database model (most commonly a table) as a single file. A flat file can be a plain text file or a binary file. There are usually no structural relationships between the records.
  • 20. Flat file database (continued)
      - "Flat file database" may be defined very narrowly, or more broadly.
      - Strictly, a flat file database should consist of nothing but data and, if records vary in length, delimiters.
      - More broadly, the term refers to any database which exists in a single file in the form of rows and columns, with no relationships or links between records and fields except the table structure.
      - Terms used to describe different aspects of a database and its tools differ from one implementation to the next, but the concepts remain the same.
      - FileMaker uses the term "Find", while MySQL uses the term "Query"; but the concept is the same. FileMaker "files", in version 7 and above, are equivalent to MySQL "databases", and so forth. To avoid confusing the reader, one consistent set of terms is used throughout this article.
      - However, the basic terms "record" and "field" are used in nearly every flat file database implementation.
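A minimal sketch of the broad definition above: one text file, rows as records, columns as fields, and delimiters as the only structure. The column names and the comma delimiter are assumptions made for this example; the two accessions and lengths are the ones that appear later in this deck.

```python
import csv
import io

# A flat file held in a string: a header row, then one record per line.
FLAT_FILE = (
    "accession,organism,length\n"
    "U03518,Aspergillus awamori,237\n"
    "U40282,Homo sapiens,1789\n"
)


def read_flat_file(text):
    """Return each record as a dict mapping field name to value."""
    return list(csv.DictReader(io.StringIO(text)))


records = read_flat_file(FLAT_FILE)
# Nothing links one record to another; any lookup is a scan over the rows.
lengths = {row["accession"]: int(row["length"]) for row in records}
```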
  • 21. Relational database
      - Relational databases are both created and queried by DataBase Management Systems (DBMSs).
      - Relational databases displaced hierarchical databases because the ability to add new relations made it possible to add new information that was valuable but "broke" a database's original hierarchical conception.
      - The trend continues as a networked planet and social media create the world of "big data", which is larger and less structured than the datasets and tasks that relational databases handle well (it is instructive to compare Hadoop).
  • 23. Object oriented database
      - An object database (also object-oriented database management system) is a database management system in which information is represented in the form of objects as used in object-oriented programming.
      - Object databases are different from relational databases, which are table-oriented.
  • 28. Online Databases
      - When you query an online database, your query is translated into SQL, the database is interrogated, and the answer is displayed in your web browser. Three components are involved:
          - Your computer and browser (the "client")
          - Software to receive and translate the instructions you enter into your browser (on the "server")
          - The database itself
      Image source: David Lane and Hugh E. Williams. Web Database Applications with PHP & MySQL. O'Reilly (2002).
  • 29. Biological Databases
      - Over 1000 biological databases
      - Vary in size, quality, coverage, level of interest
      - Many of the major ones covered in the annual Database Issue of Nucleic Acids Research
      - What makes a good database?
          - comprehensiveness
          - accuracy
          - is up-to-date
          - good interface
          - batch search/download
          - API (web services, DAS, etc.)
  • 30. "Ten Important Bioinformatics Databases" (source: Bioinformatics for Dummies)
      Database      URL                    Content
      GenBank       www.ncbi.nlm.nih.gov   nucleotide sequences
      Ensembl       www.ensembl.org        human/mouse genome (and others)
      PubMed        www.ncbi.nlm.nih.gov   literature references
      NR            www.ncbi.nlm.nih.gov   protein sequences
      SWISS-PROT    www.expasy.ch          protein sequences
      InterPro      www.ebi.ac.uk          protein domains
      OMIM          www.ncbi.nlm.nih.gov   genetic diseases
      Enzymes       www.chem.qmul.ac.uk    enzymes
      PDB           www.rcsb.org/pdb/      protein structures
      KEGG          www.genome.ad.jp       metabolic pathways
  • 31. NCBI (National Center for Biotechnology Information)
      - Over 30 databases including GenBank, PubMed, OMIM, and GEO
      - Access all NCBI resources via Entrez (www.ncbi.nlm.nih.gov/Entrez/)
  • 47. Information retrieval from biological databases
      - NCBI-Entrez
      - SRS (Sequence Retrieval System)
  • 48. NCBI and Entrez
  • 49. The Central Dogma & Biological Data
      - Original DNA sequences (genomes)
      - Expressed DNA sequences (= mRNA sequences = cDNA sequences)
          - Expressed Sequence Tags (ESTs)
      - Protein sequences
          - Inferred
          - Direct sequencing
      - Protein structures
          - Experiments
          - Models (homologues)
      - Literature information
  • 50. NCBI Databases and Services
      - GenBank primary sequence database
      - Free public access to biomedical literature
          - PubMed free Medline (3 million searches per day)
          - PubMed Central full text online access
      - Entrez integrated molecular and literature databases
  • 51. Primary vs. derivative sequence databases
      [Figure: sequencing centers and labs submit sequences to GenBank (updated ONLY by submitters); algorithms derive UniGene, and curators derive RefSeq and Genome Assembly (updated continually by NCBI)]
  • 52. Sequence Databases at NCBI
      - Primary
          - GenBank: NCBI's primary sequence database
          - Trace Archive: reads from capillary sequencers
          - Sequence Read Archive: next generation data
      - Derivative
          - GenPept (GenBank translations)
          - Outside Protein (UniProt/Swiss-Prot, PDB)
          - NCBI Reference Sequences (RefSeq)
  • 53. GenBank - primary sequence DB
      - Nucleotide-only sequence database
      - Archival (records) in nature
          - Historical
          - Reflective of submitter point of view (subjective)
          - Redundant
      - Data
          - Direct submissions (traditional records)
          - Batch submissions
          - FTP accounts (genome data)
  • 54. GenBank - primary sequence DB (2)
      - Three collaborating databases
          1. GenBank
          2. DNA Database of Japan (DDBJ)
          3. European Molecular Biology Laboratory (EMBL) Database
  • 55. Traditional GenBank Record
      ACCESSION   U07418
      VERSION     U07418.1  GI:466461
      - Accession: stable, reportable, universal
      - Version: tracks changes in sequence
      - GI number: NCBI internal use
      - Well annotated; the sequence is the data
  • 56. NCBI and Entrez
      - One of the most useful and comprehensive sources of databases is the NCBI, part of the National Library of Medicine.
      - NCBI provides interesting summaries, browsers for genome data, and search tools
      - Entrez is their database search interface: http://www.ncbi.nlm.nih.gov/Entrez
      - Can search on gene names, sequences, chromosomal location, diseases, keywords, ...
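The slides describe Entrez as a web search interface. NCBI also exposes Entrez programmatically through its E-utilities service, which the slides do not mention; the sketch below assumes that service and only builds a search URL with the standard library, without sending a request. `build_esearch_url` is a hypothetical helper name, and the `gene` database and search term are chosen to echo the Alcohol Dehydrogenase example in this deck.

```python
from urllib.parse import urlencode

# Base address of NCBI E-utilities (an assumption; not named in the slides).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(database, term, retmax=20):
    """Build an ESearch URL for a keyword search in one Entrez database."""
    params = {"db": database, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


url = build_esearch_url("gene", "alcohol dehydrogenase")
```

Fetching the URL (for example with `urllib.request.urlopen`) would return a list of matching record identifiers; that step is left out so the sketch runs offline.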
  • 58. What did we just do?
      - Identify loci (genes) associated with the sequence. Input was Alcohol Dehydrogenase
      - For each particular "hit", we can look at that sequence and its alignment in more detail.
      - See similar sequences, and the organisms in which they are found.
      - But there's much more that can be found on these genes, even just inside NCBI…
  • 60. More from Entrez Gene
  • 62. Sequence Retrieval System
      - The Sequence Retrieval System is a database system that works with flat files. In addition, many bioinformatics tools are incorporated and can be combined with the database searches.
  • 64. NCBI is not all there is...
      - Links to non-NCBI databases
          - Reactome & KEGG for pathways
          - HGNC for nomenclature
          - UCSC Human Genome Browser
      - Other important gene/protein resources not linked to:
          - UniProt (most carefully annotated)
          - PDB (main macromolecular structure repository)
      - Other key biological data sources
          - Gene Ontology/Open Biological Ontologies
          - Enzyme
      - Scientific society: iscb.org
      - Journals, Conferences…
  • 65. Take home messages
      - There are a lot of molecular biology databases, containing a lot of valuable information
      - Not even the best databases have everything (or the best of everything)
      - These databases are moderately well cross-linked, and there are "linker" databases
      - Sequence is a good identifier, maybe even better than gene name!
  • 66. File formats
      IG/Stanford, Fitch, Plain/Raw, GenBank/GB, Fasta/Pearson, PIR/CODATA, NBRF, Zuker, MSF, EMBL, Olsen, ASN.1, GCG, Phylip 3.2, PAUP/NEXUS, DNAStrider, Phylip, Pretty
  • 67. LOCUS, Accession, NID and protein_id
      - LOCUS: Unique string of 10 letters and numbers in the database. Not maintained amongst databases, and is therefore a poor sequence identifier.
      - ACCESSION: A unique identifier for that record and a citable entity; does not change when the record is updated. A good record identifier, ideal for citation in publications.
      - VERSION: New system where the accession and version play the same function as the accession and gi number.
      - Nucleotide gi: GenInfo identifier (gi), a unique integer which will change every time the sequence changes.
      - PID: Protein identifier: g, e or d prefix to gi number. Can have one or two on one CDS.
      - Protein gi: GenInfo identifier (gi), a unique integer which will change every time the sequence changes.
      - protein_id: Identifier which has the same structure and function as the nucleotide Accession.version numbers, but a slightly different format.
  • 68. LOCUS, Accession, gi and PID
      LOCUS       HSU40282 1789 bp mRNA PRI 21-MAY-1998
      DEFINITION  Homo sapiens integrin-linked kinase (ILK) mRNA, complete cds.
      ACCESSION   U40282
      VERSION     U40282.1 GI:3150001
      CDS         157..1515
                  /gene="ILK"
                  /note="protein serine/threonine kinase"
                  /codon_start=1
                  /product="integrin-linked kinase"
                  /protein_id="AAC16892.1"
                  /db_xref="PID:g3150002"
                  /db_xref="GI:3150002"
      Identifiers in this record:
      - LOCUS: HSU40282
      - ACCESSION: U40282
      - VERSION (Accession.version): U40282.1
      - GI: 3150001
      - PID: g3150002
      - Protein gi: 3150002
      - protein_id: AAC16892.1
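How the VERSION line splits into accession, version number, and gi can be sketched with a regular expression. The pattern is an assumption that covers the example line on this slide; it is not a full specification of GenBank identifiers, and `parse_version_line` is a hypothetical helper name.

```python
import re

# VERSION line as shown on the slide.
VERSION_LINE = "VERSION U40282.1 GI:3150001"

# Assumed pattern: accession, a dot, an integer version, then an optional gi.
VERSION_PATTERN = re.compile(r"VERSION\s+([A-Z0-9_]+)\.(\d+)(?:\s+GI:(\d+))?")


def parse_version_line(line):
    """Split a VERSION line into accession, version number, and gi."""
    match = VERSION_PATTERN.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not a VERSION line: {line!r}")
    accession, version, gi = match.groups()
    return {
        "accession": accession,
        "version": int(version),
        "gi": int(gi) if gi is not None else None,
    }


parsed = parse_version_line(VERSION_LINE)
```

The accession stays fixed for the life of the record, while the number after the dot is the part that increases when the sequence changes.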
  • 69. Plain sequence format
      A sequence in plain format may contain only IUPAC characters and spaces (no numbers!).
      Note: A file in plain sequence format may only contain one sequence, while most other formats accept several sequences in one file.
      An example sequence in plain format is:
      AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAA
      CCTCCCATCCGTGTCTATTGTACCCTGTTGCTTCGGCGGGCCCGC
      CGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTG
      CCCGCCGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTC
      TGAGTTGATTGAATGCAATCAGTTAAAACTTTCAACAATGGATCT
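The "IUPAC characters and spaces only" rule can be checked mechanically. The sketch below assumes a nucleotide sequence and uses the IUPAC nucleotide codes; a protein sequence would need the amino acid alphabet instead. `clean_plain_sequence` is a hypothetical helper name.

```python
# IUPAC nucleotide codes: the four bases, U, and the ambiguity symbols.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def clean_plain_sequence(text):
    """Remove whitespace and reject anything that is not an IUPAC code."""
    sequence = "".join(text.split()).upper()
    invalid = sorted(set(sequence) - IUPAC_NUCLEOTIDES)
    if invalid:
        raise ValueError(f"invalid characters in plain sequence: {invalid}")
    return sequence


cleaned = clean_plain_sequence("AACCTGCGG AAGGATC\nATTACC")
```

Digits, such as the position counters used in EMBL or GenBank records, are rejected, which matches the "no numbers" rule.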
  • 70. FASTA format
      A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length.
      An example sequence in FASTA format is:
      >U03518 Aspergillus awamori internal transcribed spacer 1 (ITS1)
      AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCC
      TGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGC
      CGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACT
      TTCAACAATGGATCTCTTGGTTCCGGC
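The rule that a ">" in the first column starts a new record is enough to write a small reader. This is a minimal sketch rather than a replacement for an established parser. The first record is a truncated copy of the U03518 example; the second record is invented to show that one file may hold several sequences.

```python
FASTA_TEXT = """>U03518 Aspergillus awamori internal transcribed spacer 1 (ITS1)
AACCTGCGGAAGGATCATTA
CCGAGTGCGG
>example2 hypothetical second record
ACGT
"""


def parse_fasta(text):
    """Return a list of (description, sequence) pairs."""
    records = []
    header, chunks = None, []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


fasta_records = parse_fasta(FASTA_TEXT)
```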
  • 71. EMBL/Swiss-Prot entry structure (http://www.ebi.ac.uk/help/formats_frame.html)
      - The first line of each sequence entry is the ID definition line, which contains entry name, dataclass, molecule, division and sequence length.
      - An XX line contains no data; it is just a separator.
      - The AC line lists the accession number.
      - The DE line gives a description of the sequence.
      - FT lines give precise annotation for the sequence.
      - The sequence information line has SQ in the first two spaces.
      - The sequence information begins on the fifth line of the sequence entry.
      - The last line of each sequence entry in the file is a terminator line which has the two characters // in the first two spaces.
      ID   AA03518 standard; DNA; FUN; 237 BP.
      XX
      AC   U03518;
      XX
      DE   Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
      DE   rRNA and 5.8S rRNA genes, partial sequence.
      RX   MEDLINE; 94303342.
      RX   PUBMED; 8030378.
      XX
      FT   rRNA            <1..20
      FT                   /product="18S ribosomal RNA"
      FT   misc_RNA        21..205
      FT                   /standard_name="Internal transcribed spacer 1 (ITS1)"
      FT   rRNA            206..>237
      FT                   /product="5.8S ribosomal RNA"
      SQ   Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
           aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc        60
           tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg       120
           ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc       180
           tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc          237
      //
  • 72. EMBL format
      A sequence file in EMBL format can contain several sequences. One sequence entry starts with an identifier line ("ID"), followed by further annotation lines. The start of the sequence is marked by a line starting with "SQ" and the end of the sequence is marked by two slashes ("//").
      An example sequence in EMBL format is:
      ID   AA03518 standard; DNA; FUN; 237 BP.
      XX
      AC   U03518;
      XX
      DE   Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
      DE   rRNA and 5.8S rRNA genes, partial sequence.
      XX
      SQ   Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
           aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc        60
           tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg       120
           ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc       180
           tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc          237
      //
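The three markers named above (an "ID" line, an "SQ" line, and a closing "//") are enough to pull the sequence out of an entry. The sketch below assumes a single entry and the two-letter line codes shown in the example; only the first sequence line is included, so the extracted sequence is 60 bases rather than the full 237. `parse_embl_entry` is a hypothetical helper name.

```python
EMBL_TEXT = """ID   AA03518 standard; DNA; FUN; 237 BP.
XX
AC   U03518;
XX
SQ   Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
     aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc        60
//
"""


def parse_embl_entry(text):
    """Extract entry name, accession, and sequence from one EMBL entry."""
    entry = {"id": None, "accession": None, "sequence": ""}
    chunks = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # terminator line: end of the entry
        if in_sequence:
            # Keep letters only, dropping spaces and position counters.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ID"):
            entry["id"] = line[2:].split()[0]
        elif line.startswith("AC"):
            entry["accession"] = line[2:].strip().split(";")[0]
        elif line.startswith("SQ"):
            in_sequence = True
    entry["sequence"] = "".join(chunks)
    return entry


embl_entry = parse_embl_entry(EMBL_TEXT)
```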
  • 73. GenBank format
      A sequence file in GenBank format can contain several sequences. One sequence in GenBank format starts with a line containing the word LOCUS and a number of annotation lines. The start of the sequence is marked by a line containing "ORIGIN" and the end of the sequence is marked by two slashes ("//").
      - Can contain several sequences
      - One sequence starts with: "LOCUS"
      - The sequence starts with: "ORIGIN"
      - The sequence ends with: "//"
      An example sequence in GenBank format is:
      LOCUS       AAU03518 237 bp DNA PLN 04-FEB-1995
      DEFINITION  Aspergillus awamori internal transcribed spacer 1 (ITS1) and
                  18S rRNA and 5.8S rRNA genes, partial sequence.
      ACCESSION   U03518
      BASE COUNT  41 a 77 c 67 g 52 t
      ORIGIN
              1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
             61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
            121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
            181 tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
      //
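The LOCUS / ORIGIN / "//" markers can drive a small reader that handles several records in one file. This is a minimal sketch assuming those three markers start their lines; annotation lines are skipped, and the sample keeps only the first 20 bases of the example. `parse_genbank` is a hypothetical helper name.

```python
GENBANK_TEXT = """LOCUS       AAU03518 237 bp DNA PLN 04-FEB-1995
ACCESSION   U03518
ORIGIN
        1 aacctgcgga aggatcatta
//
"""


def parse_genbank(text):
    """Return one dict per record with its locus name and sequence."""
    records = []
    current = None
    in_origin = False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            current = {"locus": line.split()[1], "sequence": ""}
            in_origin = False
        elif current is None:
            continue  # ignore anything before the first LOCUS line
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            records.append(current)
            current, in_origin = None, False
        elif in_origin:
            # Drop the leading position number and the spaces between blocks.
            current["sequence"] += "".join(ch for ch in line if ch.isalpha())
    return records


genbank_records = parse_genbank(GENBANK_TEXT)
```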
  • 81. PIR - protein sequence DB
      - PIR was established in 1984 by the National Biomedical Research Foundation (NBRF) as a resource to assist researchers in the identification and interpretation of protein sequence information.
      - Prior to that, the NBRF compiled the first comprehensive collection of macromolecular sequences in the Atlas of Protein Sequence and Structure, published from 1965-1978 under the editorship of Margaret O. Dayhoff. Dr. Dayhoff and her research group pioneered the development of computer methods for the comparison of protein sequences, for the detection of distantly related sequences and duplications within sequences, and for the inference of evolutionary histories from alignments of protein sequences.
  • 83. Protein Data Bank (PDB)
  • 84.  The Protein Data Bank (PDB) is a repository for the three-dimensional structural data of large biological molecules, such as proteins and nucleic acids.  The data, typically obtained by X-ray crystallography or NMR spectroscopy and submitted by biologists and biochemists from around the world, are freely accessible on the Internet via the websites of its member organisations.  The PDB is overseen by an organization called the Worldwide Protein Data Bank (wwPDB).
  • 85.  The PDB is a key resource in areas of structural biology, such as structural genomics.  Most major scientific journals, and some funding agencies, now require scientists to submit their structure data to the PDB.  If the contents of the PDB are thought of as primary data, then there are hundreds of derived (i.e., secondary) databases that categorize the data differently.  For example, both SCOP and CATH categorize structures according to type of structure and assumed evolutionary relations.
  • 87.  HEADER, TITLE and AUTHOR records provide information about the researchers who defined the structure; numerous other types of records are available to provide other types of information.  REMARK records can contain free-form annotation, but they also accommodate standardized information; for example, the REMARK 350 BIOMT records describe how to compute the coordinates of the experimentally observed multimer from those of the explicitly specified ones of a single repeating unit.  SEQRES records give the sequences of the three peptide chains (named A, B and C), which are very short in this example but usually span multiple lines.  ATOM records describe the coordinates of the atoms that are part of the protein. For example, the first ATOM line above describes the alpha-N atom of the first residue of peptide chain A, which is a proline residue; the first three floating point numbers are its x, y and z coordinates and are in units of Ångströms.  HETATM records describe coordinates of hetero-atoms, that is, those atoms which are not part of the protein molecule.
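ATOM and HETATM records are fixed-width lines, so their fields can be read by column position rather than by splitting on spaces. The following is a minimal sketch in Python, assuming the standard PDB column layout (serial number in columns 7-11, atom name in 13-16, residue name in 18-20, chain identifier in 22, residue number in 23-26, and x, y, z in columns 31-38, 39-46 and 47-54). The names `AtomRecord` and `parse_atom_line` are illustrative, not part of any library.

```python
from typing import NamedTuple, Optional


class AtomRecord(NamedTuple):
    record: str     # "ATOM" or "HETATM"
    serial: int     # atom serial number
    name: str       # atom name, e.g. "N" or "CA"
    res_name: str   # residue name, e.g. "PRO"
    chain_id: str   # chain identifier, e.g. "A"
    res_seq: int    # residue sequence number
    x: float        # coordinates in Angstroms
    y: float
    z: float


def parse_atom_line(line: str) -> Optional[AtomRecord]:
    """Parse one ATOM/HETATM line; return None for any other record type."""
    record = line[0:6].strip()
    if record not in ("ATOM", "HETATM"):
        return None
    # Slices are zero-based, so PDB column N corresponds to index N - 1.
    return AtomRecord(
        record=record,
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain_id=line[21],
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
    )
```

For the kind of line described on this slide (the alpha-N atom of a proline at position 1 of chain A), the parsed record would have `name == "N"`, `res_name == "PRO"` and `chain_id == "A"`. Column-based slicing is used because neighbouring fields in real PDB files can run together without a separating space.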
  • 88. PUBCHEM  PubChem is a database of chemical molecules and their activities against biological assays. The system is maintained by the National Center for Biotechnology Information (NCBI), a component of the National Library of Medicine, which is part of the United States National Institutes of Health (NIH). PubChem can be accessed for free through a web user interface. Millions of compound structures and descriptive datasets can be freely downloaded via FTP. PubChem contains substance descriptions and small molecules with fewer than 1000 atoms and 1000 bonds. More than 80 database vendors contribute to the growing PubChem database.
  • 89. Books and Web References  Book names: 1. Introduction to Bioinformatics by T. K. Attwood 2. BioInformatics by Sangita 3. Basic Bioinformatics by S. Ignacimuthu, S.J.  http://en.wikipedia.org/wiki/Biological_database  http://bioinformaticsweb.net/data.html  http://www.apbionet.org/s-star/downloads/tutorial/t1b.pdf
  • 90. Image References  1. & 2. https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR39w90rTM4wRcS2WE4I0zbjV7R6KE8JMVZz4QF0qY6A8W1qti_QQaeDx5X  3. & 4. Book: Basic Bioinformatics by S. Ignacimuthu, S.J.  5. to 18. http://www.ncbi.nlm.nih.gov/  19. https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRGBmCsguJs4geE45YMjE_O80bbqD9dFtCE9fgZYySwzYSIDbIp  21. to 29. http://www.ncbi.nlm.nih.gov/  30. & 31. http://www.rcsb.org/pdb/home/home.do

Editor's Notes

  1. The 1960s marked the beginning of bioinformatics. Prior to the advent of high-level computer languages in 1957, programmers needed a detailed knowledge of a computer’s design and were forced to use languages that were unintuitive to humans. High-level computer languages allowed computer scientists to spend more time designing complex algorithms and less time worrying about the technical details of the particular computer model they were using. By the 1960s, mainframe computers like the one pictured in the slide were becoming common at universities and research institutions, giving academics unprecedented access to computers. (As useful as these computers were, they filled entire rooms and had processing power far below that of consumer-grade personal computers today!) Margaret Oakley Dayhoff and colleagues took advantage of these developments and the accumulation of protein sequence data to create some of the first bioinformatics applications. For example, Dayhoff wrote the first computer program to automate sequence assembly, enabling a task that previously took human workers months to be accomplished in minutes. She and her colleagues also published (in paper form) the first protein sequence database and performed many groundbreaking studies regarding phylogeny and scoring sequence comparisons. For these reasons, she is considered one of the great pioneers of computational biology and bioinformatics.
  3. Though computers are capable of doing a wide variety of tasks at extraordinary speed, many important problems are still unsolvable by computers because the tasks require too much computation. The limits of a computer are dependent on the algorithmic complexity of the problem and the hardware specifications of the machine being used. Some problems are so algorithmically complex that they will never be solved on any computer now or in the future, and some are simply unsolvable even in theory. Other problems are limited only by the current state of computer technology. For example, sequencing entire genomes via the shotgun approach was not possible until the mid-1990s because the computational power needed was unavailable until that time.
  4. Databases are a fundamental part of the bioinformatics revolution. Much of the conceptual framework for databases had already been developed by the 1960s. By the 1970s, database technology had already permeated much of the government and corporate sectors. Modern databases can be described as well-organized collections of data that can be accessed through the use of a query language. Two databases of particular importance to biologists are GenBank®, which encompasses all publicly available protein and nucleotide sequences, and the Protein Data Bank, which contains high quality 3-D structures of proteins, nucleic acids, and carbohydrates. Despite media hype about the enormity of the human genome sequence, from the perspective of digital computers, the entire sequence of a single human could fit on one or two CD-ROMs. As we shall see shortly, it is the comparison of sequences that presents algorithmic challenges.
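The CD-ROM estimate in this note can be checked with a short back-of-the-envelope calculation. The figures below are assumptions that do not come from the notes: roughly 3.2 billion bases for one copy of the human genome, 2 bits per base (enough to distinguish A, C, G and T), and a nominal CD-ROM capacity of 700 MB.

```python
# Assumed figures (not from the notes): one copy of the human genome,
# a 2-bit encoding per base, and a nominal 700 MB CD-ROM.
GENOME_BASES = 3_200_000_000
BITS_PER_BASE = 2
CD_ROM_BYTES = 700_000_000


def genome_size_bytes(bases: int = GENOME_BASES,
                      bits_per_base: int = BITS_PER_BASE) -> int:
    """Bytes needed to store the sequence, rounded up to a whole byte."""
    total_bits = bases * bits_per_base
    return -(-total_bits // 8)  # integer ceiling division


def cd_roms_needed(bases: int = GENOME_BASES,
                   bits_per_base: int = BITS_PER_BASE,
                   cd_bytes: int = CD_ROM_BYTES) -> int:
    """Number of discs needed to hold the sequence, rounded up."""
    return -(-genome_size_bytes(bases, bits_per_base) // cd_bytes)
```

Under these assumptions the sequence takes about 800 MB, which is slightly more than one disc, so two are needed. Storing each base as a one-byte text character instead (`bits_per_base=8`) would take about 3.2 GB, which shows that the estimate depends on a compact encoding.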
  5. ~11,000 sequences are submitted per day.