1
Introduction to
Bioinformatics
2
GENERAL INFORMATION
Course Methodology
The course consists of the following components:
i. a series of 10 lectures and 10 mini-exams,
ii. 7 skills classes, each with one programming task,
iii. one final written exam.
• In the lectures the main theoretical aspects will be presented.
• Each lecture starts with a "mini-exam" of three short questions on
the previous lecture.
• In the skills classes (SCs) several programming tasks are performed, one of
which has to be submitted before the next SC.
• Finally, the course ends with an open-book exam.
3
GENERAL INFORMATION
10 lectures and 10 mini-exams
Prologue (In praise of cells)
Chapter 1. The first look at a genome (sequence statistics)
Chapter 2. All the sequence's men (gene finding)
Chapter 3. All in the family (sequence alignment)
Chapter 4. The boulevard of broken genes (hidden Markov models)
Chapter 5. Are Neanderthals among us? (variation within and between
species)
Chapter 6. Fighting HIV (natural selection at the molecular level)
Chapter 7. SARS: a post-genomic epidemic (phylogenetic analysis)
Chapter 8. Welcome to the hotel Chlamydia (whole genome comparisons)
Chapter 9. The genomics of wine-making (analysis of gene expression)
Chapter 10. A bed-time story (identification of regulatory sequences)
4
GENERAL INFORMATION
mini-exams
* First 15 minutes of the lecture
* Closed Book
* Three short questions on the previous lecture
* Counts as bonus points for the final mark …
* There is a resit, where you can redo individual mini-exams
that you missed with a legitimate leave of absence
5
GENERAL INFORMATION
Skills Class:
* Each Friday one hour hands-on with real data
* Hand in one task a week for a bonus point
6
7
8
GENERAL INFORMATION
Final Exam:
* 10 short questions regarding the course material
* Open book
9
GENERAL INFORMATION
Grading:
The relative weights of the components are:
i. 10 mini-exams: B1 bonus points (max 1)
ii. 7 skills class programming tasks: B2 bonus points (max 1)
iii. final written exam (open-book, three hours): E points (max 10)
Final grade = min(E + (B1 + B2), 10)
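The grading rule can be written out as a small function. This is only an illustrative sketch: the function and argument names are assumptions, and clamping each bonus to the range 0 to 1 is an interpretation of "max 1" on the slide.

```python
def final_grade(exam: float, bonus_mini: float, bonus_skills: float) -> float:
    """Final grade = min(E + (B1 + B2), 10), with each bonus limited to [0, 1]."""
    b1 = min(max(bonus_mini, 0.0), 1.0)    # B1: mini-exam bonus (assumed clamp)
    b2 = min(max(bonus_skills, 0.0), 1.0)  # B2: skills-class bonus (assumed clamp)
    return min(exam + b1 + b2, 10.0)


if __name__ == "__main__":
    print(final_grade(7.5, 0.8, 1.0))  # exam mark plus both bonuses
    print(final_grade(9.5, 1.0, 1.0))  # capped at 10
```

The cap means bonus points can lift a grade but never push it above 10.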
Study Points:
6 ECTS/ 4 NSP
10
GENERAL INFORMATION
Course Book:
Introduction to
Computational
Genomics
A Case Studies
Approach
Nello Cristianini,
Matthew W. Hahn
11
GENERAL INFORMATION
Additional recommended texts:
• Bioinformatics: the machine learning approach, Baldi & Brunak.
• Introduction to Bioinformatics, Lesk, and: Introduction to Bioinformatics,
Attwood & Parry-Smith.
12
Introduction to Bioinformatics.
LECTURES
13
Introduction to Bioinformatics.
LECTURE 1:
* Prologue (In praise of cells)
* Chapter 1. The first look at a genome (sequence statistics)
14
Introduction to Bioinformatics.
Prologue :
In praise of cells
* Nothing in Biology Makes Sense Except in the Light of Evolution
(Theodosius Dobzhansky)
15
GENOMICS and PROTEOMICS
Genomics is the study of an organism's genome and the use of the genes.
It deals with the systematic use of genome information, associated with other
data, to provide answers in biology, medicine, and industry.
Proteomics is the large-scale study of proteins, particularly their structures
and functions.
Proteomics is much more complicated than genomics. Most importantly,
while the genome is a rather constant entity, the proteome differs from cell to
cell and is constantly changing through its biochemical interactions with the
genome and the environment. One organism will have radically different
protein expression in different parts of its body, in different stages of its life
cycle and in different environmental conditions.
16
Development
of
Genomics/
Proteomics
Databases
17
modern map-makers
have mapped the entire
human genome
Hurrah – we know the
entire 3.3 billion bps of the
human genome !!!
… but what does it mean
???
18
Metabolic activity in GENETIC PATHWAYS
19
20
How can we
measure
metabolic
processes
and
gene activity ???
21
EXAMPLE:
Caenorhabditis elegans
22
Some fine day
in 1982 …
23
Boy, do I
want to
map the
activity of
these
genes !!!
24
Until recently we lacked tools to measure
gene activity
1989 saw the introduction of the
microarray technique by Stephen Fodor
But only in 1992 did this technique become
generally available, although it was still very costly
25
Stephen Fodor
Microarray
Microarray developer
Developed the microarray
26
27
28
Some fine day many,
many, many years
later …
29
Now I’m
almost
there …
30
Using the microarray technology we can
now make time series of the activity of
our 22,000 genes: so-called
genome wide expression profiles
31
The identification of genetic pathways
from Microarray Timeseries
Sequences of genome-wide expression profiles
at consecutive time points
become more realistic
with decreasing costs …
32
Genomewide expression profiles: 25,000 genes
33
Now the problem is to map these
microarray-series of genome-wide
expression profiles into something that
tells us what the genes are actually doing
… for instance a network representing
their interaction
34
35
GENOMICS: structure and coding
36
DNA
Deoxyribonucleic acid (DNA) is a nucleic acid that
contains the genetic instructions specifying the biological
development of all cellular forms of life (and most viruses).
DNA is a long polymer of nucleotides and encodes the
sequence of the amino acid residues in proteins using the
genetic code, a triplet code of nucleotides.
37
38
DNA under electron microscope
39
3D model of a section of the DNA molecule
40
James Watson and Francis Crick
41
42
Genetic code
The genetic code is a set of rules that maps DNA sequences
to proteins in the living cell, and is employed in the process of
protein synthesis.
Nearly all living things use the same genetic code, called the
standard genetic code, although a few organisms use minor
variations of the standard code.
Fundamental code in DNA: {x(i)|i=1..N,x(i) in {C,A,T,G}}
Human: N = 3.3 billion
43
Genetic code
44
Replication
of
DNA
45
Genetic code: TRANSCRIPTION
DNA → RNA
Transcription is the process through which a DNA sequence is enzymatically
copied by an RNA polymerase to produce a complementary RNA. Or, in other
words, the transfer of genetic information from DNA into RNA. In the case of
protein-encoding DNA, transcription is the beginning of the process that
ultimately leads to the translation of the genetic code (via the mRNA
intermediate) into a functional peptide or protein. Transcription has some
proofreading mechanisms, but they are fewer and less effective than the
controls for DNA; therefore, transcription has a lower copying fidelity than
DNA replication.
Like DNA replication, transcription proceeds in the 5' → 3' direction (i.e. the old
polymer is read in the 3' → 5' direction and the new, complementary
fragments are generated in the 5' → 3' direction).
In RNA, thymine (T) is replaced by uracil (U).
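As a minimal sketch of the base-pairing rule behind transcription, the function below builds the RNA that is complementary to a DNA template strand. The function name and the example sequence are assumptions made for illustration; the template is given as a string written in 3' → 5' order, so the returned RNA reads 5' → 3'.

```python
# Base pairing from DNA template to RNA: A pairs with U (not T) in RNA.
DNA_TO_RNA_COMPLEMENT = {"A": "U", "C": "G", "G": "C", "T": "A"}


def transcribe(template_3_to_5: str) -> str:
    """Return the RNA (5'->3') complementary to a DNA template written 3'->5'."""
    return "".join(DNA_TO_RNA_COMPLEMENT[base] for base in template_3_to_5.upper())


if __name__ == "__main__":
    print(transcribe("TACGGT"))  # AUGCCA
```

Equivalently, the RNA equals the DNA coding (non-template) strand with every T replaced by U.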
46
Genetic code: TRANSLATION
DNA-triplet → RNA-triplet = codon → amino acid
RNA codon table
There are 20 standard amino acids used in proteins;
here are some of the RNA codons that code for each amino acid.
Ala A GCU, GCC, GCA, GCG
Leu L UUA, UUG, CUU, CUC, CUA, CUG
Arg R CGU, CGC, CGA, CGG, AGA, AGG
Lys K AAA, AAG
Asn N AAU, AAC
Met M AUG
Asp D GAU, GAC
Phe F UUU, UUC
Cys C UGU, UGC
Pro P CCU, CCC, CCA, CCG
...
Start AUG, GUG
Stop UAG, UGA, UAA
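The codon table can be turned into a small lookup for translating an mRNA. The sketch below is illustrative only: it contains just the codons listed on this slide (not the full table), the function name is an assumption, and any codon missing from the partial table is shown as "X".

```python
# Partial codon table, restricted to the entries listed on the slide.
_SLIDE_CODONS = {
    "A": "GCU GCC GCA GCG",
    "L": "UUA UUG CUU CUC CUA CUG",
    "R": "CGU CGC CGA CGG AGA AGG",
    "K": "AAA AAG",
    "N": "AAU AAC",
    "M": "AUG",
    "D": "GAU GAC",
    "F": "UUU UUC",
    "C": "UGU UGC",
    "P": "CCU CCC CCA CCG",
}
CODON_TABLE = {c: aa for aa, codons in _SLIDE_CODONS.items() for c in codons.split()}
STOP_CODONS = {"UAG", "UGA", "UAA"}


def translate(mrna: str) -> str:
    """Translate from the first AUG up to the first in-frame stop codon."""
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:
            break
        protein.append(CODON_TABLE.get(codon, "X"))  # "X": codon not in partial table
    return "".join(protein)


if __name__ == "__main__":
    print(translate("GGAUGGCUAAAUUUUGA"))  # MAKF
```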
47
PROTEOMICS: structure and function
48
Protein Structure:
primary structure
49
Protein
Structure:
secondary
Structure
a: Alpha-helix,
b: Beta-sheet
50
Protein Structure:
super-secondary Structure
51
Protein Structure = protein function:
52
EVOLUTION and the origin of SPECIES
53
Tree of Life
54
55
56
Phylogenetic relations between
Cetaceans and artiodactyls
57
Unsolved problems in biology
Life. How did it start? Is life a cosmic phenomenon? Are the conditions necessary for the origin of
life narrow or broad? How did life originate and diversify in hundreds of millions of years? Why, after
rapid diversification, do microorganisms remain unchanged for millions of years? Did life start on
this planet or was there an extraterrestrial intervention (for example a meteor from another planet)?
Why have so many biological systems developed sexual reproduction? How do organisms
recognize like species? How are the sizes of cells, organs, and bodies controlled? Is immortality
possible?
DNA / Genome. Do all organisms link together to a primary source? Given a DNA sequence, what
shape will the protein fold into? Given a particular desired shape, what DNA sequence will produce
it? What are all the functions of the DNA? Other than the structural genes, which is the simpler part
of the system? What is the complete structure and function of the proteome, i.e. the proteins expressed by
a cell or organ at a particular time and under specific conditions? What is the complete function of
the regulator genes? The building block of life may be a precursor to a generation of electronic
devices and computers, but what are the electronic properties of DNA? Does Junk DNA function as
molecular garbage?
Viruses / Immune system. What causes immune system deficiencies? What are the signs of
current or past infection to discover where Ebola hides between human outbreaks? What is the
origin of antibody diversity? What leads to the complexity of the immune system? What is the
relationship between the immune system and the brain?
Humanity: Why are there drastic changes in hominid morphology? Why are there giant hominid
skeletons and very small hominid skeletons? Is hominid evolution static? Is hominid devolution
possible? Are there Human-Neanderthal hybrids? What explains the differences between Human
and Neanderthal Fossils?
58
Introduction to Bioinformatics.
LECTURE 1:
CHAPTER 1:
The first look at a genome (sequence statistics)
* A mathematical model should be as simple as possible, but not
too simple!
(A. Einstein)
* All models are wrong, but some are useful. (G. Box)
59
Introduction to Bioinformatics.
The first look at a genome (sequence statistics)
• Genome and genomic sequences
• Probabilistic models and sequences
• Statistical properties of sequences
• Standard data formats and databases
60
Introduction to Bioinformatics
LECTURE 1: The first look at a genome (sequence statistics)
1.1 Genomic era, year zero
• 1958: Fred Sanger (Cambridge, UK): Nobel prize for
developing protein sequencing techniques
• 1978: Fred Sanger: First complete viral genome
• 1980: Fred Sanger: First mitochondrial genome
• 1980: Fred Sanger: Nobel prize for developing DNA
sequencing techniques
• 1995: Craig Venter (TIGR): complete genome of
Haemophilus influenzae
• 2001: entire genome of Homo sapiens sapiens
• Start of post-genomic era (?!)
61
Introduction to Bioinformatics
LECTURE 1: The first look at a genome (sequence statistics)
1.1 Genomic era, year zero
ORGANISM        DATE   SIZE        DESCRIPTION
Phage phiX174   1978   5,368 bp    1st viral genome
Human mtDNA     1980   16,571 bp   1st organelle genome
HIV             1985   9,193 bp    AIDS retrovirus
H. influenzae   1995   1,830 kb    1st bacterial genome
H. sapiens      2001   3,500 Mb    complete human genome
62
Introduction to Bioinformatics
LECTURE 1: The first look at a genome (sequence statistics)
1.2 The anatomy of a genome
• Definition of genome
• Prokaryotic genomes
• Eukaryotic genomes
• Viral genomes
• Organellar genomes
63
Introduction to Bioinformatics
LECTURE 1: The first look at a genome (sequence statistics)
1.3 Probabilistic models of
genome sequences
• Alphabets, sequences, and sequence space
• Multinomial sequence model
• Markov sequence model
64
Introduction to Bioinformatics
LECTURE 1: The first look at a genome (sequence statistics)
1.3 Probabilistic models of
genome sequences
Alphabets, sequences, and sequence space
4-letter alphabet N = {A,C,G,T} (= nucleotides)
* sequence: s = s1s2…sn e.g.: s = ATATGCCTGACTG
* sequence space: the space of all sequences (up to a certain
length)
65
Introduction to Bioinformatics
LECTURE 1: The first look at a genome (sequence statistics)
1.3 Probabilistic models of
genome sequences
Multinomial sequence model
* Nucleotides are independent and identically distributed
(i.i.d),
* p = {pA,pC,pG,pT}, pA + pC + pG + pT = 1
* $P(\mathbf{s}) = \prod_{i=1}^{n} p(s(i))$
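Under the multinomial model, the probability of a sequence is simply the product of the probabilities of its nucleotides. The sketch below computes it in log space to avoid numerical underflow on long sequences. The function name and the uniform probabilities in the example are assumptions for illustration; the example sequence is the one from the alphabet slide.

```python
import math


def multinomial_log_prob(seq: str, p: dict) -> float:
    """log P(s) = sum over positions i of log p(s(i)), assuming i.i.d. nucleotides."""
    return sum(math.log(p[base]) for base in seq.upper())


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed parameters
    s = "ATATGCCTGACTG"
    log_p = multinomial_log_prob(s, uniform)
    print(log_p, math.exp(log_p))  # equals 0.25 ** len(s) for the uniform model
```

Because the model ignores order, any permutation of the same nucleotides receives the same probability.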
66
Introduction to Bioinformatics
LECTURE 1: The first look at a genome (sequence statistics)
1.3 Probabilistic models of genome sequences
Markov
sequence model
67
Introduction to Bioinformatics
LECTURE 1: The first look at a genome (sequence statistics)
1.3 Probabilistic models of genome sequences
Markov sequence model
* Probability start state π
* State transition matrix T
* $P(\mathbf{s}) = \pi(s(1)) \prod_{i=1}^{n-1} p(s(i), s(i+1))$, where $p(a, b)$ is the entry of T for the transition from state a to state b
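For a first-order Markov model, the probability of a sequence is the start probability of its first nucleotide multiplied by one transition probability per adjacent pair. The following is a sketch under stated assumptions: the function name is hypothetical, and the start distribution and transition matrix in the example are toy values, not estimates from any genome.

```python
import math


def markov_log_prob(seq: str, pi: dict, T: dict) -> float:
    """log P(s) for a first-order Markov chain with start distribution pi
    and transition probabilities T[previous][current]."""
    seq = seq.upper()
    log_p = math.log(pi[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        log_p += math.log(T[prev][cur])
    return log_p


if __name__ == "__main__":
    bases = "ACGT"
    pi = {b: 0.25 for b in bases}  # toy start distribution
    # Toy transition matrix: stay on the same base with 0.4, move to each other base with 0.2.
    T = {a: {b: (0.4 if a == b else 0.2) for b in bases} for a in bases}
    print(math.exp(markov_log_prob("AAT", pi, T)))  # 0.25 * 0.4 * 0.2
```

Unlike the multinomial model, this model is sensitive to nucleotide order, so it can capture dinucleotide biases.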
68
Introduction to Bioinformatics
LECTURE 1: The first look at a genome (sequence statistics)
1.4 Annotating a genome:
statistical sequence analysis
• Base composition & sliding window plot
• GC content & change point analysis
• Finding unusual DNA words
• Biological relevance of unusual motifs
• Pattern matching versus pattern discovery
69
Introduction to Bioinformatics
LECTURE 1: The first look at a genome (sequence statistics)
1.4 Annotating a genome: statistical sequence analysis
Base composition H. influenzae
BASE AMOUNT FREQUENCY
A 567623 0.3102
C 350723 0.1916
G 347436 0.1898
T 564241 0.3083
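A base composition table like the one above can be produced by counting each nucleotide and dividing by the total. This is a minimal sketch with an assumed function name and a toy input string; characters other than A, C, G and T (such as N) are ignored here, which is a design choice rather than something stated on the slide.

```python
from collections import Counter


def base_composition(seq: str) -> dict:
    """Return {base: (count, frequency)} for A, C, G and T."""
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in "ACGT")  # ignore ambiguous symbols such as N
    return {b: (counts[b], counts[b] / total) for b in "ACGT"}


if __name__ == "__main__":
    for base, (count, freq) in base_composition("AACGTTTANA").items():
        print(base, count, round(freq, 4))
```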
70
Introduction to Bioinformatics
LECTURE 1: The first look at a genome (sequence statistics)
Haemophilus influenzae type b
71
Introduction to Bioinformatics
LECTURE 1: The first look at a genome (sequence statistics)
1.4 Annotating a genome: statistical sequence analysis
Base composition & sliding window plot
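A sliding-window plot is built by computing a statistic in a window of fixed length, then shifting the window along the sequence by a fixed step. The sketch below does this for GC content; the function name, window size and step are illustrative assumptions, and plotting the returned values is left to the reader.

```python
def sliding_gc(seq: str, window: int, step: int) -> list:
    """Return (window start, GC fraction) for each window along the sequence."""
    seq = seq.upper()
    values = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = (chunk.count("G") + chunk.count("C")) / window
        values.append((start, gc))
    return values


if __name__ == "__main__":
    print(sliding_gc("GGCCAATT", window=4, step=2))
    # [(0, 1.0), (2, 0.5), (4, 0.0)]
```

Larger windows give smoother curves but blur sharp changes in composition; smaller windows do the opposite.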
74
Evidence for co-evolution of gene order and recombination rate
Csaba Pál & Laurence D. Hurst
Nature Genetics 33, 392 - 395 (2003)
Figure 3. Sliding-window plot of the number of essential genes
(black line) and standard deviation from chromosomal mean
recombination rate (gray line) along chromosome 9.
Dot indicates the centromere. The windows were each ten genes long, and
one gene jump was made between windows.
75
GC content
Organism GC content (%)
H. influenzae 38.8
M. tuberculosis 65.8
S. enteritidis 49.5
GC versus AT
76
GC content
•Detect foreign genetic material
•Horizontal gene transfer
•Change point analysis
• AT-rich DNA denatures (= strands separate) at lower temperatures
• Thermophilic archaebacteria: high GC
• Evolution:
Archaea > Eubacteria > Eukaryotes
77
GC content
Example of very high GC content
Average GC content: 61%
78
GC content
79
Change points in lambda phage
80
k-mer frequency motif bias
• dimer, trimer, k-mer: nucleotide word of length 2, 3, k
• “unusual” k-mers
• 2-mer in H. influenzae
81
k-mer frequency motif bias
2-mer (dinucleotide) density in H. influenzae
      *A      *C      *G      *T
A*    0.1202  0.0505  0.0483  0.0912
C*    0.0665  0.0372  0.0396  0.0484
G*    0.0514  0.0522  0.0363  0.0499
T*    0.0721  0.0518  0.0656  0.1189
NB: freq('AT') ≠ freq(A or T)
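Dinucleotide (and general k-mer) frequencies can be obtained by sliding a window of length k over the sequence one position at a time and counting each word. This is a minimal sketch with an assumed function name and a toy input; it counts overlapping words on one strand only.

```python
from collections import Counter


def kmer_frequencies(seq: str, k: int) -> dict:
    """Relative frequency of every overlapping word of length k in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


if __name__ == "__main__":
    print(kmer_frequencies("AATAA", 2))
    # {'AA': 0.5, 'AT': 0.25, 'TA': 0.25}
```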
82
k-mer frequency motif bias
Most frequent 10-mers
in H. influenzae:
AAAGTGCGGT
ACCGCACTTT
Why?
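One observation that can be checked directly is that the two 10-mers listed here are reverse complements of each other, so they are the same double-stranded word read from opposite strands. The sketch below verifies this and shows one way to find the most frequent k-mers; the function names and the toy sequence are assumptions for illustration.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def most_common_kmers(seq: str, k: int, top: int = 1) -> list:
    """Return the `top` most frequent overlapping k-mers with their counts."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return counts.most_common(top)


if __name__ == "__main__":
    print(reverse_complement("AAAGTGCGGT"))  # ACCGCACTTT
    print(most_common_kmers("ACGACGT", 3))   # [('ACG', 2)]
```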
83
84
85
Unusual DNA-words
Compare OBSERVED with EXPECTED frequency
of a word using multinomial model
Observed/expected ratio:
      *A      *C      *G      *T
A*    1.2491  0.8496  0.8210  0.9535
C*    1.1182  1.0121  1.0894  0.8190
G*    0.8736  1.4349  1.0076  0.8526
T*    0.7541  0.8763  1.1204  1.2505
This also takes into account the base
frequencies pA, pC, pG, pT.
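The observed/expected ratio divides the measured frequency of a word by the frequency the multinomial model predicts, which is the product of the frequencies of its bases. As a check against the tables in this lecture, 'AA' has observed frequency 0.1202 and pA = 0.3102, giving 0.1202 / 0.3102² ≈ 1.249. The sketch below implements the same calculation; the function name and toy input are assumptions.

```python
from collections import Counter
from math import prod


def observed_expected_ratio(seq: str, word: str) -> float:
    """Observed frequency of `word` divided by its expected frequency
    under the multinomial (i.i.d.) model estimated from `seq` itself."""
    seq, word = seq.upper(), word.upper()
    k = len(word)
    n_windows = len(seq) - k + 1
    observed = sum(seq[i:i + k] == word for i in range(n_windows)) / n_windows
    base_counts = Counter(seq)
    expected = prod(base_counts[b] / len(seq) for b in word)
    return observed / expected  # raises ZeroDivisionError if a base of `word` is absent


if __name__ == "__main__":
    print(observed_expected_ratio("AATT", "AA"))  # (1/3) / 0.25
```

Ratios well above 1 mark over-represented words and ratios well below 1 mark under-represented ones.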
86
Unusual DNA-words
Restriction sites: very unusual words
CTAG -> “kinking” of DNA-strand
87
genome
signature:
Nucleotide motif
bias in four
genomes
88
Introduction to Bioinformatics
LECTURE 1: The first look at a genome (sequence statistics)
1.5 Finding data: GenBank,
EMBL, and DDBJ
• Online databases
•FASTA: a standard data format
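In the FASTA format each record starts with a header line beginning with ">" and is followed by one or more lines of sequence. The parser below is a minimal sketch for small files held in memory; the function name and the example records are made up for illustration, and established libraries offer more complete readers.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA-formatted text into {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {h: "".join(parts) for h, parts in records.items()}


if __name__ == "__main__":
    example = ">seq1 demo record\nACGT\nTTAA\n>seq2\nGG\n"
    print(parse_fasta(example))
    # {'seq1 demo record': 'ACGTTTAA', 'seq2': 'GG'}
```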
89
DATABASES
Generalized (DNA, proteins and carbohydrates, 3D-
structures)
Specialized (EST, STS, SNP, RNA, genomes, protein
families, pathways, microarray data ...)
90
OVERVIEW OF DATABASES
1. Database indexing and specification of search terms
(retrieval, follow-up, analysis)
2. Archives (databases on: nucleic acid sequences, genome,
protein sequences, structures, proteomics, expression,
pathways)
3. Gateways to Archives (NCBI, Entrez, PubMed, ExPasy,
Swiss-Prot, SRS, PIR, Ensembl)
91
Generalized DNA, protein
and carbohydrate databases
Primary sequence databases
EMBL (European Molecular Biology Laboratory nucleotide sequence
database at EBI, Hinxton, UK)
GenBank (at National Center for Biotechnology Information, NCBI,
Bethesda, MD, USA)
DDBJ (DNA Data Bank of Japan at CIB, Mishima, Japan)
92
NCBI: National Center for
Biotechnology Information
Established in 1988 as a national resource for molecular biology information,
NCBI creates public databases, conducts research in computational biology,
develops software tools for analyzing genome data, and disseminates
biomedical information - all for the better understanding of molecular processes
affecting human health and disease.
93
NCBI - GenBank
94
The EMBL Nucleotide Sequence Database (also known as EMBL-Bank) constitutes
Europe's primary nucleotide sequence resource. Main sources for DNA and RNA
sequences are direct submissions from individual researchers, genome
sequencing projects and patent applications.
95
EBI: European
Bioinformatics Institute
The European Bioinformatics Institute (EBI) is a non-profit academic organisation that
forms part of the European Molecular Biology Laboratory (EMBL).
The EBI is a centre for research and services in bioinformatics. The Institute manages
databases of biological data including nucleic acid, protein sequences and
macromolecular structures.
Our mission
To provide freely available data and bioinformatics services to all facets of the scientific
community in ways that promote scientific progress
To contribute to the advancement of biology through basic investigator-driven research
in bioinformatics
To provide advanced bioinformatics training to scientists at all levels, from PhD students
to independent investigators
To help disseminate cutting-edge technologies to industry
96
What is DDBJ
DDBJ (DNA Data Bank of Japan) began DNA data bank activities in earnest in
1986 at the National Institute of Genetics (NIG).
DDBJ has been functioning as the international nucleotide sequence database
in collaboration with EBI/EMBL and NCBI/GenBank.
DNA sequence records organismic evolution more directly than other
biological materials and, thus, is invaluable not only for research in life
sciences, but also for human welfare in general. The databases are, so to speak, a
common treasure of human beings. With this in mind, we make the databases
accessible online to anyone in the world.
97
The ExPASy (Expert Protein Analysis System) proteomics
server of the Swiss Institute of Bioinformatics (SIB) is
dedicated to the analysis of protein sequences and
structures as well as 2-D PAGE
ExPASy Proteomics Server
(SWISS-PROT)
98
Generalized DNA, protein
and carbohydrate databases
Protein sequence databases
SWISS-PROT (Swiss Institute of Bioinformatics, SIB, Geneva, CH)
TrEMBL (=Translated EMBL: computer annotated protein sequence
database at EBI, UK)
PIR-PSD (PIR-International Protein Sequence Database, annotated
protein database by PIR, MIPS and JIPID at NBRF, Georgetown
University, USA)
UniProt (Joined data from Swiss-Prot, TrEMBL and PIR)
UniRef (UniProt NREF (Non-redundant REFerence) database at EBI, UK)
IPI (International Protein Index; human, rat and mouse proteome database
at EBI, UK)
99
Generalized DNA, protein
and carbohydrate databases
Carbohydrate databases
CarbBank (Former complex carbohydrate structure database, CCSD,
discontinued!)
3D structure databases
PDB (Protein Data Bank, curated by RCSB, USA)
EBI-MSD (Macromolecular Structure Database at EBI, UK)
NDB (Nucleic Acid Structure Database at Rutgers, the State University of
New Jersey, USA)
100
PROTEIN DATA BANK
101
DATABASE SEARCH
Text-based (SRS, Entrez ...)
Sequence-based (sequence similarity search) (BLAST, FASTA...)
Motif-based (ScanProsite, eMOTIF)
Structure-based (structure similarity search) (VAST, DALI...)
Mass-based protein search (ProteinProspector, PeptIdent, Prowl …)
102
Welcome to the Entrez cross-database search page
• PubMed: biomedical literature citations and abstracts
• PubMed Central: free, full text journal articles
• Site Search: NCBI web and FTP sites
• Books: online books
• OMIM: online Mendelian Inheritance in Man
• OMIA: online Mendelian Inheritance in Animals
• Nucleotide: sequence database (GenBank)
• Protein: sequence database
• Genome: whole genome sequences
• Structure: three-dimensional macromolecular structures
• Taxonomy: organisms in GenBank
• SNP: single nucleotide polymorphism
• Gene: gene-centered information
• HomoloGene: eukaryotic homology groups
• PubChem Compound: unique small molecule chemical structures
• PubChem Substance: deposited chemical substance records
• Genome Project: genome project information
• UniGene: gene-oriented clusters of transcript sequences
• CDD: conserved protein domain database
• 3D Domains: domains from Entrez Structure
• UniSTS: markers and mapping data
• PopSet: population study data sets
• GEO Profiles: expression and molecular abundance profiles
• GEO DataSets: experimental sets of GEO data
• Cancer Chromosomes: cytogenetic databases
• PubChem BioAssay: bioactivity screens of chemical substances
• GENSAT: gene expression atlas of mouse central nervous system
• Probe: sequence-specific reagents
103
New! Assembly Archive recently created at NCBI links together trace data and finished sequence providing
complete information about a genome assembly. The Assembly Archive's first entries are a set of closely related
strains of Bacillus anthracis. The assemblies are available at TraceAssembly.
See more about Bacillus anthracis genome
Bacillus licheniformis ATCC 14580
Release Date: September 15, 2004
Reference: Rey,M.W.,et al.
Complete genome sequence of the industrial bacterium Bacillus licheniformis and
comparisons with closely related Bacillus species (er) Genome Biol. 5, R77 (2004)
Lineage: Bacteria; Firmicutes; Bacillales; Bacillaceae; Bacillus.
Organism: Bacillus licheniformis ATCC 14580
Genome sequence information
chromosome - CP000002 - NC_006270
Size: 4,222,336 bp Proteins: 4161
Sequence data files submitted to GenBank/EMBL/DDBJ can be found at NCBI FTP:
GenBank or RefSeq Genomes
Bacillus cereus ZK
Release Date: September 15, 2004
Reference: Brettin,T.S., et al. Complete genome sequence of Bacillus cereus ZK
Lineage: Bacteria; Firmicutes; Bacillales; Bacillaceae; Bacillus; Bacillus cereus group.
Organism:
104
NCBI → BLAST
Latest news: 6 December 2005: BLAST 2.2.13 released
The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity
between sequences. The program compares nucleotide or protein sequences to
sequence databases and calculates the statistical significance of matches. BLAST can
be used to infer functional and evolutionary relationships between sequences as well
as help identify members of gene families.
Nucleotide
Quickly search for highly similar sequences (megablast)
Quickly search for divergent sequences (discontiguous megablast)
Nucleotide-nucleotide BLAST (blastn)
Search for short, nearly exact matches
Search trace archives with megablast or discontiguous megablast
Protein
Protein-protein BLAST (blastp)
BLAST
105
Fasta Protein Database Query
Provides sequence similarity searching against nucleotide and protein databases
using the Fasta programs.
Fasta can be very specific when identifying long regions of low similarity especially for
highly diverged sequences.
You can also conduct sequence similarity searching against complete proteome or
genome databases using the Fasta programs.
Download Software
106
Kangaroo
MOTIF-BASED SEARCH
Kangaroo is a program that facilitates searching for gene and protein patterns
and sequences
Kangaroo is a pattern search program. Given a sequence pattern the program
will find all the records that contain that pattern.
To use this program, simply enter a sequence of DNA or Amino Acids in the
pattern window, choose the type of search, the taxonomy and submit your
request.
107
ANALYSIS TOOLS
DNA sequence analysis tools
RNA analysis tools
Protein sequence and structure analysis tools (primary, secondary, tertiary
structure)
Tools for protein function assignment
Phylogeny
Microarray analysis tools
108
MISCELLANEOUS
Literature search
Patent search
Bioinformatics centers and servers
Links to other collections of bioinformatics resources
Medical resources
Bioethics
Protocols
Software
(Bio)chemistry
Educational resources
109
Introduction to Bioinformatics.
END of LECTURE 1

More Related Content

Similar to Introduction-to-Bioinformatics-1.ppt

Isolation of rhizobium species from soil and to
Isolation of rhizobium species from soil and toIsolation of rhizobium species from soil and to
Isolation of rhizobium species from soil and to
tusha madan
 

Similar to Introduction-to-Bioinformatics-1.ppt (20)

Introduction
IntroductionIntroduction
Introduction
 
2014 naples
2014 naples2014 naples
2014 naples
 
Isolation of rhizobium species from soil and to
Isolation of rhizobium species from soil and toIsolation of rhizobium species from soil and to
Isolation of rhizobium species from soil and to
 
A comparative study using different measure of filteration
A comparative study using different measure of filterationA comparative study using different measure of filteration
A comparative study using different measure of filteration
 
Epigenetic Analysis Sequencing
Epigenetic Analysis SequencingEpigenetic Analysis Sequencing
Epigenetic Analysis Sequencing
 
Bio Inspired Computing Final Version
Bio Inspired Computing Final VersionBio Inspired Computing Final Version
Bio Inspired Computing Final Version
 
Enabling next-generation sequencing applications with IBM Storwize V7000 Unif...
Enabling next-generation sequencing applications with IBM Storwize V7000 Unif...Enabling next-generation sequencing applications with IBM Storwize V7000 Unif...
Enabling next-generation sequencing applications with IBM Storwize V7000 Unif...
 
DNA & Bio computer
DNA & Bio computerDNA & Bio computer
DNA & Bio computer
 
rheumatoid arthritis
rheumatoid arthritisrheumatoid arthritis
rheumatoid arthritis
 
Bioinformatics t1-introduction wim-vancriekinge_v2013
Bioinformatics t1-introduction wim-vancriekinge_v2013Bioinformatics t1-introduction wim-vancriekinge_v2013
Bioinformatics t1-introduction wim-vancriekinge_v2013
 
Data Management for Quantitative Biology - Data sources (Next generation tech...
Data Management for Quantitative Biology - Data sources (Next generation tech...Data Management for Quantitative Biology - Data sources (Next generation tech...
Data Management for Quantitative Biology - Data sources (Next generation tech...
 
Molecular biology lecture
Molecular biology lectureMolecular biology lecture
Molecular biology lecture
 
2015 bioinformatics wim_vancriekinge
2015 bioinformatics wim_vancriekinge2015 bioinformatics wim_vancriekinge
2015 bioinformatics wim_vancriekinge
 
2016 bioinformatics i_wim_vancriekinge_vupload
2016 bioinformatics i_wim_vancriekinge_vupload2016 bioinformatics i_wim_vancriekinge_vupload
2016 bioinformatics i_wim_vancriekinge_vupload
 
Biological technologies
Biological technologiesBiological technologies
Biological technologies
 
DNA-based methods for bioaerosol analysis
DNA-based methods for bioaerosol analysisDNA-based methods for bioaerosol analysis
DNA-based methods for bioaerosol analysis
 
Genomic Data Analysis
Genomic Data AnalysisGenomic Data Analysis
Genomic Data Analysis
 
Marzillier_09052014.pdf
Marzillier_09052014.pdfMarzillier_09052014.pdf
Marzillier_09052014.pdf
 
2014 09 30_t1_bioinformatics_wim_vancriekinge
2014 09 30_t1_bioinformatics_wim_vancriekinge2014 09 30_t1_bioinformatics_wim_vancriekinge
2014 09 30_t1_bioinformatics_wim_vancriekinge
 
Bioinformatica 29-09-2011-t1-bioinformatics
Bioinformatica 29-09-2011-t1-bioinformaticsBioinformatica 29-09-2011-t1-bioinformatics
Bioinformatica 29-09-2011-t1-bioinformatics
 

Recently uploaded

Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 

Recently uploaded (20)

Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 

Introduction-to-Bioinformatics-1.ppt

  • 2. 2 GENERAL INFORMATION Course Methodology The course consists of the following components: i. a series of 10 lectures and 10 mini-exams, ii. 7 skills classes, each with one programming task, iii. one final written exam. • In the lectures the main theoretical aspects will be presented. • Each lecture starts with a "mini-exam" with three short questions on the previous lecture. • In the skills classes (SCs) several programming tasks are performed, one of which has to be submitted before the next SC. • Finally, the course ends with an open-book exam.
  • 3. 3 GENERAL INFORMATION 10 lectures and 10 mini-exams Prologue (In praise of cells) Chapter 1. The first look at a genome (sequence statistics) Chapter 2. All the sequence's men (gene finding) Chapter 3. All in the family (sequence alignment) Chapter 4. The boulevard of broken genes (hidden Markov models) Chapter 5. Are Neanderthals among us? (variation within and between species) Chapter 6. Fighting HIV (natural selection at the molecular level) Chapter 7. SARS: a post-genomic epidemic (phylogenetic analysis) Chapter 8. Welcome to the hotel Chlamydia (whole genome comparisons) Chapter 9. The genomics of wine-making (analysis of gene expression) Chapter 10. A bed-time story (identification of regulatory sequences)
  • 4. 4 GENERAL INFORMATION mini-exams * First 15 minutes of the lecture * Closed book * Three short questions on the previous lecture * Count as bonus points for the final mark … * There is a resit, where you can redo individual mini-exams that you missed with a legitimate excuse
  • 5. 5 GENERAL INFORMATION Skills Class: * Each Friday, one hour hands-on with real data * Hand in one task a week for a bonus point
  • 8. 8 GENERAL INFORMATION Final Exam: * 10 short questions regarding the course material * Open book
  • 9. 9 GENERAL INFORMATION Grading: The relative weights of the components are: i. 10 mini-exams: B1 bonus points (max 1) ii. 7 skills class programming tasks: B2 bonus points (max 1) iii. final written exam (open-book, three hours): E points (max 10) Final grade = min(E + (B1 + B2), 10) Study Points: 6 ECTS / 4 NSP
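The grading formula on this slide can be written out as a small function. This is only a sketch of the stated rule; the function name and the clamping of each bonus to its stated maximum of 1 are assumptions made for illustration.

```python
def final_grade(exam: float, b1: float, b2: float) -> float:
    """Final grade = min(E + (B1 + B2), 10), as stated on the slide."""
    if not 0.0 <= exam <= 10.0:
        raise ValueError("exam points E must lie between 0 and 10")
    # Assumption: each bonus is clamped to the range [0, 1] (the slide gives max 1).
    b1 = min(max(b1, 0.0), 1.0)
    b2 = min(max(b2, 0.0), 1.0)
    return min(exam + b1 + b2, 10.0)
```

For instance, an exam score of 9.5 with both bonuses at their maximum is capped at 10 rather than 11.5.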
  • 10. 10 GENERAL INFORMATION Course Book: Introduction to Computational Genomics A Case Studies Approach Nello Cristianini, Matthew W. Hahn
  • 11. 11 GENERAL INFORMATION Additional recommended texts: • Bioinformatics: the machine learning approach, Baldi & Brunak. • Introduction to Bioinformatics, Lesk, and: Introduction to Bioinformatics, Attwood & Parry-Smith.
  • 13. 13 Introduction to Bioinformatics. LECTURE 1: * Prologue (In praise of cells) * Chapter 1. The first look at a genome (sequence statistics)
  • 14. 14 Introduction to Bioinformatics. Prologue : In praise of cells * Nothing in Biology Makes Sense Except in the Light of Evolution (Theodosius Dobzhansky)
  • 15. 15 GENOMICS and PROTEOMICS Genomics is the study of an organism's genome and the use of the genes. It deals with the systematic use of genome information, associated with other data, to provide answers in biology, medicine, and industry. Proteomics is the large-scale study of proteins, particularly their structures and functions. Proteomics is much more complicated than genomics. Most importantly, while the genome is a rather constant entity, the proteome differs from cell to cell and is constantly changing through its biochemical interactions with the genome and the environment. One organism will have radically different protein expression in different parts of its body, in different stages of its life cycle and in different environmental conditions.
  • 17. 17 modern map-makers have mapped the entire human genome Hurrah – we know the entire 3.3 billion bps of the human genome !!! … but what does it mean ???
  • 18. 18 Metabolic activity in GENETIC PATHWAYS
  • 23. 23 Boy, do I want to map the activity of these genes !!!
  • 24. 24 Until recently we lacked tools to measure gene activity. 1989 saw the introduction of the microarray technique by Stephen Fodor, but the technique became generally available only in 1992, and it was still very costly.
  • 28. 28 Some fine day many, many, many years later …
  • 30. 30 Using microarray technology we can now make time series of the activity of our 22,000 genes: so-called genome-wide expression profiles
  • 31. 31 The identification of genetic pathways from microarray time series: sequences of genome-wide expression profiles at consecutive instants become more realistic with decreasing costs …
  • 33. 33 Now the problem is to map these microarray-series of genome-wide expression profiles into something that tells us what the genes are actually doing … for instance a network representing their interaction
  • 36. 36 DNA Deoxyribonucleic acid (DNA) is a nucleic acid that contains the genetic instructions specifying the biological development of all cellular forms of life (and most viruses). DNA is a long polymer of nucleotides and encodes the sequence of the amino acid residues in proteins using the genetic code, a triplet code of nucleotides.
  • 38. 38 DNA under electron microscope
  • 39. 39 3D model of a section of the DNA molecule
  • 40. 40 James Watson and Francis Crick
  • 42. 42 Genetic code The genetic code is a set of rules that maps DNA sequences to proteins in the living cell, and is employed in the process of protein synthesis. Nearly all living things use the same genetic code, called the standard genetic code, although a few organisms use minor variations of the standard code. Fundamental code in DNA: {x(i)|i=1..N,x(i) in {C,A,T,G}} Human: N = 3.3 billion
  • 45. 45 Genetic code: TRANSCRIPTION DNA → RNA Transcription is the process through which a DNA sequence is enzymatically copied by an RNA polymerase to produce a complementary RNA; in other words, the transfer of genetic information from DNA into RNA. In the case of protein-encoding DNA, transcription is the beginning of the process that ultimately leads to the translation of the genetic code (via the mRNA intermediate) into a functional peptide or protein. Transcription has some proofreading mechanisms, but they are fewer and less effective than the controls for DNA replication; therefore, transcription has a lower copying fidelity than DNA replication. Like DNA replication, transcription proceeds in the 5' → 3' direction (i.e., the template strand is read in the 3' → 5' direction and the new, complementary strand is generated in the 5' → 3' direction). In RNA, thymine (T) is replaced by uracil (U).
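As a rough illustration of the transcription rule on this slide, the sketch below produces an mRNA string either from the coding strand (only T → U changes) or from the template strand (reverse complement, then T → U). The function names are hypothetical, and real transcription involves promoters, terminators and processing that are ignored here.

```python
# Complement table for DNA bases.
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def transcribe_coding(coding_strand: str) -> str:
    """mRNA from the coding strand (written 5'->3'): same letters, with T replaced by U."""
    return coding_strand.upper().replace("T", "U")

def transcribe_template(template_strand: str) -> str:
    """mRNA from the template strand (written 5'->3').

    The polymerase reads the template 3'->5' and builds the RNA 5'->3',
    so the RNA is the reverse complement of the template with T replaced by U.
    """
    reverse_complement = template_strand.upper().translate(DNA_COMPLEMENT)[::-1]
    return reverse_complement.replace("T", "U")
```

Both routes give the same mRNA for a matching pair of strands, e.g. coding strand `ATGGCC` and template strand `GGCCAT`.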
  • 46. 46 Genetic code: TRANSLATION DNA triplet → RNA triplet (= codon) → amino acid. RNA codon table: there are 20 standard amino acids used in proteins; here are the RNA codons that code for some of them. Ala (A): GCU, GCC, GCA, GCG; Arg (R): CGU, CGC, CGA, CGG, AGA, AGG; Asn (N): AAU, AAC; Asp (D): GAU, GAC; Cys (C): UGU, UGC; Leu (L): UUA, UUG, CUU, CUC, CUA, CUG; Lys (K): AAA, AAG; Met (M): AUG; Phe (F): UUU, UUC; Pro (P): CCU, CCC, CCA, CCG; ... Start: AUG, GUG; Stop: UAG, UGA, UAA
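The codon lookup described on this slide can be sketched as a dictionary-based translation. The table below contains only the codons listed on the slide, so it is deliberately partial: any other codon is reported as `X`. The function name and the choice to start at the first `AUG` are illustrative assumptions.

```python
# Partial codon table: only the entries shown on the slide.
CODONS_BY_AMINO_ACID = {
    "A": ["GCU", "GCC", "GCA", "GCG"],
    "R": ["CGU", "CGC", "CGA", "CGG", "AGA", "AGG"],
    "N": ["AAU", "AAC"],
    "D": ["GAU", "GAC"],
    "C": ["UGU", "UGC"],
    "L": ["UUA", "UUG", "CUU", "CUC", "CUA", "CUG"],
    "K": ["AAA", "AAG"],
    "M": ["AUG"],
    "F": ["UUU", "UUC"],
    "P": ["CCU", "CCC", "CCA", "CCG"],
}
STOP_CODONS = {"UAG", "UGA", "UAA"}

# Invert the table: codon -> one-letter amino acid code.
CODON_TO_AA = {
    codon: aa for aa, codons in CODONS_BY_AMINO_ACID.items() for codon in codons
}

def translate(mrna: str) -> str:
    """Translate from the first AUG until a stop codon or the end of the string."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:
            break
        # Codons missing from the partial table are marked as unknown.
        protein.append(CODON_TO_AA.get(codon, "X"))
    return "".join(protein)
```

With this table, `translate("AUGGCUAAAUUUUGA")` reads the codons AUG, GCU, AAA, UUU and stops at UGA.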
  • 51. 51 Protein Structure = protein function:
  • 52. 52 EVOLUTION and the origin of SPECIES
  • 57. 57 Unsolved problems in biology Life. How did it start? Is life a cosmic phenomenon? Are the conditions necessary for the origin of life narrow or broad? How did life originate and diversify in hundreds of millions of years? Why, after rapid diversification, do microorganisms remain unchanged for millions of years? Did life start on this planet or was there an extraterrestrial intervention (for example a meteor from another planet)? Why have so many biological systems developed sexual reproduction? How do organisms recognize like species? How are the sizes of cells, organs, and bodies controlled? Is immortality possible? DNA / Genome. Do all organisms link together to a primary source? Given a DNA sequence, what shape will the protein fold into? Given a particular desired shape, what DNA sequence will produce it? What are all the functions of the DNA? Other than the structural genes, which is the simpler part of the system? What is the complete structure and function of the proteome (the proteins expressed by a cell or organ at a particular time and under specific conditions)? What is the complete function of the regulator genes? The building block of life may be a precursor to a generation of electronic devices and computers, but what are the electronic properties of DNA? Does junk DNA function as molecular garbage? Viruses / Immune system. What causes immune system deficiencies? What are the signs of current or past infection to discover where Ebola hides between human outbreaks? What is the origin of antibody diversity? What leads to the complexity of the immune system? What is the relationship between the immune system and the brain? Humanity. Why are there drastic changes in hominid morphology? Why are there giant hominid skeletons and very small hominid skeletons? Is hominid evolution static? Is hominid devolution possible? Are there Human-Neanderthal hybrids? What explains the differences between Human and Neanderthal fossils?
  • 58. 58 Introduction to Bioinformatics. LECTURE 1: CHAPTER 1: The first look at a genome (sequence statistics) * A mathematical model should be as simple as possible, but not too simple! (A. Einstein) * All models are wrong, but some are useful. (G. Box)
  • 59. 59 Introduction to Bioinformatics. The first look at a genome (sequence statistics) • Genome and genomic sequences • Probabilistic models and sequences • Statistical properties of sequences • Standard data formats and databases
  • 60. 60 Introduction to Bioinformatics LECTURE 1: The first look at a genome (sequence statistics) 1.1 Genomic era, year zero • 1958: Fred Sanger (Cambridge, UK): Nobel prize for developing protein sequencing techniques • 1978: Fred Sanger: First complete viral genome • 1980: Fred Sanger: First mitochondrial genome • 1980: Fred Sanger: Nobel prize for developing DNA sequencing techniques • 1995: Craig Venter (TIGR): complete genome of Haemophilus influenzae • 2001: entire genome of Homo sapiens sapiens • Start of post-genomic era (?!)
  • 61. 61 Introduction to Bioinformatics LECTURE 1: The first look at a genome (sequence statistics) 1.1 Genomic era, year zero
      ORGANISM        DATE   SIZE        DESCRIPTION
      Phage phiX174   1978   5,368 bp    1st viral genome
      Human mtDNA     1980   16,571 bp   1st organelle genome
      HIV             1985   9,193 bp    AIDS retrovirus
      H. influenzae   1995   1,830 Kb    1st bacterial genome
      H. sapiens      2001   3,500 Mb    complete human genome
  • 62. 62 Introduction to Bioinformatics LECTURE 1: The first look at a genome (sequence statistics) 1.2 The anatomy of a genome • Definition of genome • Prokaryotic genomes • Eukaryotic genomes • Viral genomes • Organellar genomes
  • 63. 63 Introduction to Bioinformatics LECTURE 1: The first look at a genome (sequence statistics) 1.3 Probabilistic models of genome sequences • Alphabets, sequences, and sequence space • Multinomial sequence model • Markov sequence model
  • 64. 64 Introduction to Bioinformatics LECTURE 1: The first look at a genome (sequence statistics) 1.3 Probabilistic models of genome sequences Alphabets, sequences, and sequence space * 4-letter alphabet N = {A,C,G,T} (= nucleotides) * sequence: s = s1s2…sn, e.g.: s = ATATGCCTGACTG * sequence space: the space of all sequences (up to a certain length)
  • 65. 65 Introduction to Bioinformatics LECTURE 1: The first look at a genome (sequence statistics) 1.3 Probabilistic models of genome sequences Multinomial sequence model * Nucleotides are independent and identically distributed (i.i.d.) * p = {pA,pC,pG,pT}, pA + pC + pG + pT = 1 * $P(\mathbf{s}) = \prod_{i=1}^{n} p(s(i))$
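Under the multinomial model, the probability of a sequence is the product of the probabilities of its individual symbols. A minimal sketch follows; the function name is hypothetical, and the sum is taken in log space only to avoid numerical underflow on long sequences.

```python
import math

def multinomial_log_prob(seq: str, p: dict) -> float:
    """log P(s) = sum over positions of log p(s(i)), assuming i.i.d. nucleotides."""
    return sum(math.log(p[symbol]) for symbol in seq)

# Illustrative parameters: a uniform distribution over the four nucleotides.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
log_p = multinomial_log_prob("ATATGCCTGACTG", uniform)
```

With a uniform distribution every sequence of length n has probability 0.25 to the power n, so the model only discriminates between sequences once the p values differ.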
  • 66. 66 Introduction to Bioinformatics LECTURE 1: The first look at a genome (sequence statistics) 1.3 Probabilistic models of genome sequences Markov sequence model
  • 67. 67 Introduction to Bioinformatics LECTURE 1: The first look at a genome (sequence statistics) 1.3 Probabilistic models of genome sequences Markov sequence model * Probability start state π * State transition matrix T * $P(\mathbf{s}) = \pi(s(1)) \prod_{i=1}^{n-1} p(s(i), s(i+1))$, where p(a, b) is the entry of T giving the probability of moving from symbol a to symbol b
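For a first-order Markov model, the probability of a sequence is the start probability of its first symbol times the transition probabilities of each consecutive pair. The sketch below assumes the start distribution π and the transition matrix T are given as plain dictionaries; the function name and the numeric values are made up for illustration.

```python
import math

def markov_log_prob(seq: str, start: dict, trans: dict) -> float:
    """log P(s) = log pi(s(1)) + sum of log T[s(i)][s(i+1)] over consecutive pairs."""
    if not seq:
        raise ValueError("sequence must not be empty")
    log_p = math.log(start[seq[0]])
    for prev, nxt in zip(seq, seq[1:]):
        log_p += math.log(trans[prev][nxt])
    return log_p

# Illustrative two-letter model (values are assumptions, not estimates from data).
start_probs = {"A": 0.5, "C": 0.5}
transitions = {"A": {"A": 0.9, "C": 0.1}, "C": {"A": 0.2, "C": 0.8}}
log_p_aac = markov_log_prob("AAC", start_probs, transitions)
```

Here P(AAC) = 0.5 × 0.9 × 0.1. Unlike the multinomial model, the probability of each symbol depends on its predecessor.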
  • 68. 68 Introduction to Bioinformatics LECTURE 1: The first look at a genome (sequence statistics) 1.4 Annotating a genome: statistical sequence analysis • Base composition & sliding window plot • GC content & change point analysis • Finding unusual DNA words • Biological relevance of unusual motifs • Pattern matching versus pattern discovery
  • 69. 69 Introduction to Bioinformatics LECTURE 1: The first look at a genome (sequence statistics) 1.4 Annotating a genome: statistical sequence analysis Base composition H. influenzae
      BASE   AMOUNT   FREQUENCY
      A      567623   0.3102
      C      350723   0.1916
      G      347436   0.1898
      T      564241   0.3083
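A base composition table like the one on this slide can be computed by counting symbols. This is a minimal sketch; the function name is hypothetical and the input is assumed to contain only the letters A, C, G and T.

```python
from collections import Counter

def base_composition(seq: str) -> dict:
    """Return {base: (count, frequency)} for the four nucleotides."""
    counts = Counter(seq.upper())
    total = sum(counts[base] for base in "ACGT")
    return {base: (counts[base], counts[base] / total) for base in "ACGT"}

composition = base_composition("AACG")
```

Applied to a whole genome string, the counts correspond to the AMOUNT column and the frequencies to the FREQUENCY column.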
  • 70. 70 Introduction to Bioinformatics LECTURE 1: The first look at a genome (sequence statistics) Haemophilus influenzae type b
  • 71. 71 Introduction to Bioinformatics LECTURE 1: The first look at a genome (sequence statistics) 1.4 Annotating a genome: statistical sequence analysis Base composition & sliding window plot
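A sliding-window plot reports a statistic, such as GC content, for successive windows along the sequence. The sketch below computes the values to plot; the function name, window size and step are illustrative assumptions, and the plotting itself is left out.

```python
def sliding_gc(seq: str, window: int, step: int) -> list:
    """Return (window start, GC fraction) for each full window along the sequence."""
    values = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc_count = sum(1 for base in chunk if base in "GC")
        values.append((start, gc_count / window))
    return values

gc_profile = sliding_gc("GGCCAATT", window=4, step=2)
```

Larger windows smooth the curve, while smaller windows show local variation at the price of more noise.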
  • 74. 74 Evidence for co-evolution of gene order and recombination rate Csaba Pál & Laurence D. Hurst Nature Genetics 33, 392 - 395 (2003) Figure 3. Sliding-window plot of the number of essential genes (black line) and standard deviation from chromosomal mean recombination rate (gray line) along chromosome 9. Dot indicates the centromere. The windows were each ten genes long, and one gene jump was made between windows.
  • 75. 75 GC content
      Organism          GC content (%)
      H. influenzae     38.8
      M. tuberculosis   65.8
      S. enteritidis    49.5
      GC versus AT
  • 76. 76 GC content • Detect foreign genetic material • Horizontal gene transfer • Change point analysis • AT-rich DNA denatures (= strands separate) at lower temperatures • Thermophilic archaebacteria: high GC • Evolution: Archaea > Eubacteria > Eukaryotes
  • 77. 77 GC content Example of very high GC content Average GC content: 61%
  • 79. 79 Change points in lambda phage
  • 80. 80 k-mer frequency motif bias • dimer, trimer, k-mer: nucleotide word of length 2, 3, k • “unusual” k-mers • 2-mer in H. influenzae
  • 81. 81 k-mer frequency motif bias 2-mer (dinucleotide) density in H. influenzae (rows: first nucleotide; columns: second nucleotide)
            *A       *C       *G       *T
      A*    0.1202   0.0505   0.0483   0.0912
      C*    0.0665   0.0372   0.0396   0.0484
      G*    0.0514   0.0522   0.0363   0.0499
      T*    0.0721   0.0518   0.0656   0.1189
      NB: freq('AT') ≠ freq(A or T)
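Dinucleotide (or general k-mer) densities such as those in this table can be obtained by counting all overlapping words of length k. A minimal sketch, with a hypothetical function name:

```python
from collections import Counter

def kmer_frequencies(seq: str, k: int) -> dict:
    """Relative frequency of every overlapping k-mer that occurs in the sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

dimer_freqs = kmer_frequencies("AATAA", 2)
```

The string `AATAA` contains the overlapping dimers AA, AT, TA, AA, so AA gets frequency 0.5 and AT and TA get 0.25 each.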
  • 82. 82 k-mer frequency motif bias Most frequent 10-mers in H. influenzae: AAAGTGCGGT and ACCGCACTTT. Why?
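One observation that is easy to check with code: the two 10-mers on this slide are reverse complements of each other, so they describe the same double-stranded site read from opposite strands. The helper below is a sketch with a hypothetical name; it does not address why this particular word is so frequent.

```python
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse, to read the opposite strand 5'->3'."""
    return seq.upper().translate(DNA_COMPLEMENT)[::-1]

same_site = reverse_complement("AAAGTGCGGT") == "ACCGCACTTT"
```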
  • 85. 85 Unusual DNA-words Compare the OBSERVED with the EXPECTED frequency of a word using the multinomial model. Observed/expected ratio (rows: first nucleotide; columns: second nucleotide):
            *A       *C       *G       *T
      A*    1.2491   0.8496   0.8210   0.9535
      C*    1.1182   1.0121   1.0894   0.8190
      G*    0.8736   1.4349   1.0076   0.8526
      T*    0.7541   0.8763   1.1204   1.2505
      This also takes into account the base frequencies pA, pC, pG, pT.
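The observed/expected ratio for a dimer divides its observed frequency by the product of the two base frequencies, which is what the multinomial model predicts. The sketch below uses hypothetical function names; the first helper reproduces the AA entry from the numbers on the earlier slides (0.1202 observed, pA = 0.3102).

```python
from collections import Counter

def oe_ratio(observed: float, p_first: float, p_second: float) -> float:
    """Observed dimer frequency divided by the multinomial expectation p_first * p_second."""
    return observed / (p_first * p_second)

def dimer_oe_ratios(seq: str) -> dict:
    """Observed/expected ratio for every dimer whose two bases occur in the sequence."""
    n = len(seq)
    base_freq = {b: c / n for b, c in Counter(seq).items()}
    dimer_counts = Counter(seq[i:i + 2] for i in range(n - 1))
    ratios = {}
    for first in base_freq:
        for second in base_freq:
            observed = dimer_counts[first + second] / (n - 1)
            ratios[first + second] = oe_ratio(observed, base_freq[first], base_freq[second])
    return ratios

aa_ratio = oe_ratio(0.1202, 0.3102, 0.3102)
```

Ratios well above 1 mark over-represented words and ratios well below 1 mark under-represented ones.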
  • 86. 86 Unusual DNA-words Restriction sites: very unusual words. CTAG → "kinking" of the DNA strand
  • 88. 88 Introduction to Bioinformatics LECTURE 1: The first look at a genome (sequence statistics) 1.5 Finding data: GenBank, EMBL, and DDBJ • Online databases • FASTA: a standard data format
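In the FASTA format, each record starts with a header line beginning with `>` and is followed by one or more lines of sequence. A minimal reader might look like the sketch below; the function name and the example records are made up for illustration, and production code would more likely rely on an established library.

```python
def parse_fasta(text: str) -> dict:
    """Return {header: sequence} for FASTA-formatted text."""
    records = {}
    header = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record begins; the header is the text after '>'.
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    # Sequence lines of one record are joined into a single string.
    return {name: "".join(parts) for name, parts in records.items()}

example = """>seq1 hypothetical example record
ATATGCCT
GACTG
>seq2
ACGT
"""
sequences = parse_fasta(example)
```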
  • 89. 89 DATABASES Generalized (DNA, proteins and carbohydrates, 3D- structures) Specialized (EST, STS, SNP, RNA, genomes, protein families, pathways, microarray data ...)
  • 90. 90 OVERVIEW OF DATABASES 1. Database indexing and specification of search terms (retrieval, follow-up, analysis) 2. Archives (databases on: nucleic acid sequences, genome, protein sequences, structures, proteomics, expression, pathways) 3. Gateways to Archives (NCBI, Entrez, PubMed, ExPasy, Swiss-Prot, SRS, PIR, Ensembl)
  • 91. 91 Generalized DNA, protein and carbohydrate databases Primary sequence databases EMBL (European Molecular Biology Laboratory nucleotide sequence database at EBI, Hinxton, UK) GenBank (at National Center for Biotechnology Information, NCBI, Bethesda, MD, USA) DDBJ (DNA Data Bank of Japan at CIB, Mishima, Japan)
  • 92. 92 NCBI: National Center for Biotechnology Information Established in 1988 as a national resource for molecular biology information, NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information, all for the better understanding of molecular processes affecting human health and disease.
  • 94. 94 The EMBL Nucleotide Sequence Database (also known as EMBL-Bank) constitutes Europe's primary nucleotide sequence resource. Main sources for DNA and RNA sequences are direct submissions from individual researchers, genome sequencing projects and patent applications.
  • 95. 95 EBI: European Bioinformatics Institute The European Bioinformatics Institute (EBI) is a non-profit academic organisation that forms part of the European Molecular Biology Laboratory (EMBL). The EBI is a centre for research and services in bioinformatics. The Institute manages databases of biological data including nucleic acid, protein sequences and macromolecular structures. Our mission: * To provide freely available data and bioinformatics services to all facets of the scientific community in ways that promote scientific progress * To contribute to the advancement of biology through basic investigator-driven research in bioinformatics * To provide advanced bioinformatics training to scientists at all levels, from PhD students to independent investigators * To help disseminate cutting-edge technologies to industry
  • 96. 96 What is DDBJ? DDBJ (DNA Data Bank of Japan) began DNA data bank activities in earnest in 1986 at the National Institute of Genetics (NIG). DDBJ has been functioning as the international nucleotide sequence database in collaboration with EBI/EMBL and NCBI/GenBank. DNA sequence records organismic evolution more directly than other biological materials and is thus invaluable not only for research in life sciences but also for human welfare in general. The databases are, so to speak, a common treasure of human beings. With this in mind, we make the databases accessible online to anyone in the world.
  • 97. 97 The ExPASy (Expert Protein Analysis System) proteomics server of the Swiss Institute of Bioinformatics (SIB) is dedicated to the analysis of protein sequences and structures as well as 2-D PAGE ExPASy Proteomics Server (SWISS-PROT)
  • 98. 98 Generalized DNA, protein and carbohydrate databases Protein sequence databases SWISS-PROT (Swiss Institute of Bioinformatics, SIB, Geneva, CH) TrEMBL (=Translated EMBL: computer annotated protein sequence database at EBI, UK) PIR-PSD (PIR-International Protein Sequence Database, annotated protein database by PIR, MIPS and JIPID at NBRF, Georgetown University, USA) UniProt (Joined data from Swiss-Prot, TrEMBL and PIR) UniRef (UniProt NREF (Non-redundant REFerence) database at EBI, UK) IPI (International Protein Index; human, rat and mouse proteome database at EBI, UK)
  • 99. 99 Generalized DNA, protein and carbohydrate databases Carbohydrate databases CarbBank (Former complex carbohydrate structure database, CCSD, discontinued!) 3D structure databases PDB (Protein Data Bank curated by RCSB, USA) EBI-MSD (Macromolecular Structure Database at EBI, UK) NDB (Nucleic Acid structure Database at Rutgers State University of New Jersey, USA)
  • 101. 101 DATABASE SEARCH Text-based (SRS, Entrez ...) Sequence-based (sequence similarity search) (BLAST, FASTA...) Motif-based (ScanProsite, eMOTIF) Structure-based (structure similarity search) (VAST, DALI...) Mass-based protein search (ProteinProspector, PeptIdent, Prowl …)
  • 102. 102 Search across databases Help Welcome to the Entrez cross-database search page PubMed: biomedical literature citations and abstracts PubMed Central: free, full text journal articles Site Search: NCBI web and FTP sites Books: online books OMIM: online Mendelian Inheritance in Man OMIA: online Mendelian Inheritance in Animals Nucleotide: sequence database (GenBank) Protein: sequence database Genome: whole genome sequences Structure: three-dimensional macromolecular structures Taxonomy: organisms in GenBank SNP: single nucleotide polymorphism Gene: gene-centered information HomoloGene: eukaryotic homology groups PubChem Compound: unique small molecule chemical structures PubChem Substance: deposited chemical substance records Genome Project: genome project information UniGene: gene-oriented clusters of transcript sequences CDD: conserved protein domain database 3D Domains: domains from Entrez Structure UniSTS: markers and mapping data PopSet: population study data sets GEO Profiles: expression and molecular abundance profiles GEO DataSets: experimental sets of GEO data Cancer Chromosomes: cytogenetic databases PubChem BioAssay: bioactivity screens of chemical substances GENSAT: gene expression atlas of mouse central nervous system Probe: sequence-specific reagents
  • 103. 103 New! Assembly Archive recently created at NCBI links together trace data and finished sequence, providing complete information about a genome assembly. The Assembly Archive's first entries are a set of closely related strains of Bacillus anthracis. The assemblies are available at TraceAssembly. See more about the Bacillus anthracis genome. Bacillus licheniformis ATCC 14580. Release Date: September 15, 2004 Reference: Rey, M.W., et al. Complete genome sequence of the industrial bacterium Bacillus licheniformis and comparisons with closely related Bacillus species. Genome Biol. 5, R77 (2004) Lineage: Bacteria; Firmicutes; Bacillales; Bacillaceae; Bacillus. Organism: Bacillus licheniformis ATCC 14580 Genome sequence information chromosome - CP000002 - NC_006270 Size: 4,222,336 bp Proteins: 4161 Sequence data files submitted to GenBank/EMBL/DDBJ can be found at NCBI FTP: GenBank or RefSeq Genomes. Bacillus cereus ZK. Release Date: September 15, 2004 Reference: Brettin, T.S., et al. Complete genome sequence of Bacillus cereus ZK Lineage: Bacteria; Firmicutes; Bacillales; Bacillaceae; Bacillus; Bacillus cereus group. Organism:
  • 104. 104 NCBI → BLAST Latest news: 6 December 2005 : BLAST 2.2.13 released About Getting started / News / FAQs More info NAR 2004 / NCBI Handbook / The Statistics of Sequence Similarity Scores Software Downloads / Developer info Other resources References / NCBI Contributors / Mailing list / Contact us The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families. Nucleotide Quickly search for highly similar sequences (megablast) Quickly search for divergent sequences (discontiguous megablast) Nucleotide-nucleotide BLAST (blastn) Search for short, nearly exact matches Search trace archives with megablast or discontiguous megablast Protein Protein-protein BLAST (blastp) BLAST
  • 105. 105 Fasta Protein Database Query Provides sequence similarity searching against nucleotide and protein databases using the Fasta programs. Fasta can be very specific when identifying long regions of low similarity especially for highly diverged sequences. You can also conduct sequence similarity searching against complete proteome or genome databases using the Fasta programs. Download Software
  • 106. 106 Kangaroo MOTIF-BASED SEARCH Kangaroo is a pattern search program that facilitates searching for gene and protein patterns and sequences. Given a sequence pattern, the program will find all the records that contain that pattern. To use this program, simply enter a sequence of DNA or amino acids in the pattern window, choose the type of search and the taxonomy, and submit your request.
  • 107. 107 ANALYSIS TOOLS DNA sequence analysis tools RNA analysis tools Protein sequence and structure analysis tools (primary, secondary, tertiary structure) Tools for protein Function assignment Phylogeny Microarray analysis tools
  • 108. 108 MISCELLANEOUS Literature search Patent search Bioinformatics centers and servers Links to other collections of bioinformatics resources Medical resources Bioethics Protocols Software (Bio)chemistry Educational resources