Introduction to bioinformatics 
Sylvia B. Nagl
What is bioinformatics? 
• an emerging interdisciplinary research area 
• deals with the computational management 
and analysis of biological information: genes, 
genomes, proteins, cells, ecological systems, 
medical information, robots, artificial 
intelligence...
The Core of Bioinformatics to date 
•Relationships between sequence, 3D structure, and protein functions 
Example protein sequence: 
TDQAAFDTNIVTLTRFVM 
EQGRKARGTGEMTQLLNS 
LCTAVKAISTAVRKAGIA 
HLYGIAGSTNVTGDQVKK 
LDVLSNDLVINVLKSSFA 
TCVLVTEEDKNAIIVEPE 
KRGKYVVCFDPLDGSSNI 
DCLVSIGTIFGIYRKNST 
DEPSEKDALQPGRNLVAA 
GYALYGSATMLV 
•Properties and evolution of genes, genomes, 
proteins, metabolic pathways in cells 
•Use of this knowledge for prediction, modelling, and 
design
“The holy grail of bioinformatics” 
GCTCCTCACTGTCTGTGTTTATTC 
TTTTAGCTTCTTCAGATCTTTTAG 
TCTGAGGAAGCCTGGCATGTGCA 
AATGAAGTTAACCTAA... 
> 500,000 genes 
sequenced to date 
Expected number of 
unique protein 
structures: 
~700-1,000
Basic concepts 
• conceptual foundations of bioinformatics: 
evolution 
protein folding 
protein function 
• bioinformatics builds mathematical models 
of these processes - 
to infer relationships between components 
of complex biological systems
Information processing in cells 
coding regions 
regulatory 
sites 
nucleic acids 
transcripts 
proteins 
One-to-many mappings! 
Context-dependence!
Global approaches: Toward a new Systems Biology 
Global cell state 
Genome 
Genome activation 
patterns: transcriptomics 
Protein population: 
proteomics 
Organisation: tissue, cells, molecular complexes 
(imaging, EM, X-ray, NMR) 
•How does the spatial and 
temporal organisation of 
living matter give rise to 
biological processes?
Global approaches: Toward a new Systems Biology 
Perturbation Living cell 
Dynamic response 
“Virtual cell” 
Biological knowledge 
(computerised) 
Sequence information 
Structural information 
•Basic principles 
•Practical 
applications 
Bioinformatics 
Mathematical 
modelling 
Simulation
We do not yet know whether the information in the genome is sufficient 
to reconstruct an entire biological system. Information on the building 
blocks is not enough; information on their interactions is essential. 
External environment 
Internal environment 
Metabolic net 
Genetic networks 
DNA hRNA mRNAs proteins
Bioinformatics in context 
Genomics 
Molecular biology 
Biophysics 
Molecular evolution 
Ethical, legal, 
and social 
implications 
Bioinformatics 
Mathematics/ 
computer 
science
Current challenges to users 
• Potential hurdles: 
methods are in flux and not fully developed 
resources are scattered and heterogeneous 
• Remedies: 
Web resources 
navigation guides 
integration of tools and databanks 
http://www.biochem.ucl.ac.uk/~nagl/bioinformatics.html
Example 1 
Sequence homology search of the 
genome of Plasmodium falciparum 
Target identification for antimalarial 
drugs
The search for new antimalarial 
drugs 
• Malaria is one of the leading causes of morbidity 
and mortality in the tropics. 
• 300 to 500 million estimated clinical cases and 1.5 
million to 2.7 million deaths per year. 
• Nearly all fatal cases are caused by Plasmodium 
falciparum. 
• The parasite's resistance to conventional 
antimalarial drugs such as chloroquine is growing 
at an alarming rate.
•P. falciparum has a plastid-like organelle, called the 
apicoplast, acquired by endosymbiosis of an alga. 
Jomaa et al. (1999) 
•Self-replicating, maternally inherited (35 kb, circular DNA). 
•Comparative genome analysis: search for orthologs. 
The apicoplast contains enzymes found in plant and bacterial, 
but not animal, metabolic pathways. 
•Potential target for antimalarial drugs: 
DOXP reductoisomerase
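
The homology search in this example can be illustrated with a minimal sketch of local alignment scoring. This is not the method used by Jomaa et al.; it is a plain Smith-Waterman score with a linear gap penalty, and the scoring values, the function names `local_alignment_score` and `rank_hits`, and the toy sequences are all assumptions made for illustration. Real genome searches typically rely on tools such as BLAST with amino acid substitution matrices.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score (linear gap penalty)."""
    prev = [0] * (len(b) + 1)   # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def rank_hits(query: str, database: dict) -> list:
    """Score the query against every database sequence, best hit first."""
    scored = [(local_alignment_score(query, seq), name)
              for name, seq in database.items()]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    # Hypothetical toy sequences, not real P. falciparum data.
    toy_db = {"candidate_1": "GGMKVLAGG", "candidate_2": "PPPPPP"}
    for score, name in rank_hits("MKVLA", toy_db):
        print(name, score)
```

A high-scoring hit only suggests a possible ortholog; whether it is a useful drug target still depends on its absence from the host, as the slide notes.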
Jomaa et al. (1999) Science 285: 1573-1576
Biological databases
(Boguski, 1999) 
The challenge 
In 1995, the number of genes in the database started to exceed 
the number of papers on molecular biology and genetics in the 
literature!
Data types 
• primary data: sequence (DNA, amino acid) 
e. g., AATGCGTATAGGC, DMPVERILEALAVE 
stored in a primary database 
• secondary data: “motifs” (regular expressions, blocks, 
profiles, fingerprints) and secondary protein structure 
(e. g., alpha-helices, beta-strands) 
stored in a secondary db 
• tertiary data: tertiary protein structure 
(domains, folding units), atomic co-ordinates 
stored in a tertiary db
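
To make the idea of a motif stored as a regular expression concrete, here is a minimal sketch that converts a PROSITE-style pattern into a Python regular expression and scans a sequence with it. The converter handles only the basic syntax (`x`, `[...]`, `{...}`, repeat counts, and `<`/`>` anchors outside brackets), and the function names and test sequence are assumptions for illustration. The pattern `N-{P}-[ST]-{P}` is the PROSITE N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE-style pattern to a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # {P} (any residue except P) becomes the character class [^P].
        element = element.replace("{", "[^").replace("}", "]")
        # x(2,4) (repeat count) becomes .{2,4}; done after the step above.
        element = element.replace("(", "{").replace(")", "}")
        element = element.replace("x", ".")          # x = any residue
        element = element.replace("<", "^").replace(">", "$")  # termini
        parts.append(element)
    return "".join(parts)


def find_motif(pattern: str, sequence: str) -> list:
    """Return (start, matched_text) for every match, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    # Hypothetical short sequence used only to exercise the pattern.
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(find_motif("N-{P}-[ST]-{P}", "AANGTAA"))
```

The lookahead wrapper is a design choice so that overlapping occurrences are all reported; a plain `re.finditer` on the pattern would skip matches that overlap an earlier one.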
Primary biological databases 
• Nucleic acid 
EMBL 
GenBank 
DDBJ (DNA Data 
Bank of Japan) 
• Protein 
PIR 
MIPS 
SWISS-PROT 
TrEMBL 
NRL-3D
International nucleotide data banks 
• EMBL (Europe): EMBL, EBI 
• GenBank (USA): NLM, NCBI 
• DDBJ (Japan): NIG, CIB 
• International Advisory Meeting, Collaborative Meeting 
• TrEMBL, NRDB
GenBank file format
Swiss-Prot
SWISS-PROT file format
Other primary protein databases 
• TrEMBL (translated EMBL) in SWISS-PROT format 
rapid access to sequence data from genome projects 
computer-annotated supplement to SWISS-PROT 
translations of all coding sequences (CDS) in EMBL 
• SP-TrEMBL 
• REM-TrEMBL: immunoglobulins, T-cell receptors, short 
fragments, synthetic and patented sequences
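
The translation of coding sequences (CDS) into protein sequences can be sketched as follows. This is not the TrEMBL production pipeline; it is a minimal illustration that uses the standard genetic code, reads a single frame from the first base, and stops at the first stop codon. The names `CODON_TABLE` and `translate_cds` and the example input are assumptions.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDDDEEEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_cds(dna: str) -> str:
    """Translate a coding sequence in frame 0 up to the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # Codons with ambiguous or unknown bases are reported as X.
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*":          # stop codon ends the protein
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    # Hypothetical three-codon CDS: Met, Ala, stop.
    print(translate_cds("ATGGCCTAA"))
```

Organisms and organelles that use other genetic codes would need a different amino acid string; the standard code is assumed here only to keep the sketch short.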
Other primary protein databases 
The Protein Information Resource (PIR) 
• integrated system of protein sequence databases 
and derived related databases, e. g., alignment 
databases 
• rapid searching, comparison, and pattern matching of 
protein sequences 
• retrieval of descriptive, bibliographic, feature, and 
concurrent cross-reference information 
• aims to be comprehensive and consistently 
annotated
PIR: related databases 
NRL-3D Sequence-Structure Database 
• produced by PIR from sequence and annotation 
information extracted from three-dimensional 
structures in the Protein Databank (PDB) 
• allows keyword and similarity searches
PIR: related databases 
PATCHX integrated with PIR 
• a non-redundant database of protein sequences 
produced by MIPS, the European branch of PIR-International 
The PIR Protein Sequence Database and PATCHX 
together provide the most complete collection of 
protein sequence data currently available in the 
public domain.
Composite protein sequence dbs 
• NRDB: PIR, SP, PDB, GenPept 
• OWL: PIR, SP, GenBank, NRL-3D 
• MIPSX (PIR+PATCHX): PIR, SP, MIPSOwn, NRL-3D, MIPSH, 
PIRMOD, MIPSTrn, EMTrans, GBTrans, Kabat, PseqIP 
• SP+TrEMBL: TrEMBL, SP
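
How a composite, non-redundant database might be assembled from several sources can be sketched as below. This is an assumption-laden illustration, not the actual procedure used by NRDB, OWL, or MIPSX: it removes only exact duplicate sequences and keeps the entry from whichever source is listed first. The function name `build_composite`, the source order, and the entry identifiers are hypothetical.

```python
def build_composite(sources: list) -> dict:
    """Merge sources, keeping one entry per distinct sequence.

    sources is a list of (source_name, {entry_id: sequence}) pairs in
    priority order; the first source to contribute a sequence wins.
    """
    composite = {}
    for source_name, entries in sources:
        for entry_id, sequence in entries.items():
            key = sequence.upper()       # compare case-insensitively
            if key not in composite:     # exact duplicates are skipped
                composite[key] = (source_name, entry_id)
    return composite


if __name__ == "__main__":
    # Hypothetical entries; identifiers and sequences are made up.
    merged = build_composite([
        ("SP", {"sp_001": "MKVLA", "sp_002": "MTTQR"}),
        ("PIR", {"pir_001": "mkvla", "pir_002": "MGGSW"}),
    ])
    for sequence, origin in merged.items():
        print(sequence, origin)
```

Real composite databases may apply further rules, for example for sequences that differ only trivially, which this sketch does not attempt.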
OWL composite database 
OWL is only released every 6-8 
weeks 
• By accession number 
• By database code 
• By text 
• By sequence 
• By title 
• By author 
• By query language 
• By regular expression 
Direct OWL access: 
OWL Blast server
Two other useful sites 
INFOBIOGEN: The Public Catalog of Databases 
http://www.infobiogen.fr/services/dbcat/ 
KEGG: Kyoto Encyclopedia of Genes and Genomes 
http://www.genome.ad.jp/kegg/ 
Kyoto Encyclopedia of Genes and Genomes (KEGG) is an effort to 
computerize current knowledge of molecular and cellular biology in 
terms of the information pathways that consist of interacting molecules 
or genes and to provide links from the gene catalogs produced by 
genome sequencing projects.
Sequence Retrieval System (SRS) 
Database browser that allows 
users to 
•retrieve 
•link 
•access 
entries from all interconnected 
resources. 
Users can formulate queries 
across a range of different 
database types.
Guide to Protein Databases: 
http://www.biochem.ucl.ac.uk/~robert/bioinf/lecture1/index.html 
http://www.biochem.ucl.ac.uk/~robert/bioinf/lecture2/index.html 
With thanks to Dr Roman Laskowski.
