Basic Bioinformatics
Mai Masri
What is the Internet?
• It is the global network of computer networks
that links institutions (government,
academic and business).
• The networks share a communication
protocol, so different types of machine are able
to speak to each other in a common way.
• To facilitate communication, each computer
has a unique identifying number (IP address)
and can also be given a readable name, e.g.
bsmir30.biochemistry.ucl.ac.uk
• Machine name = bsmir30
• Located in the Biochemistry department at University
College London (UCL) = biochemistry.ucl
• Belongs to the academic sub-domain = ac
• Country domain (United Kingdom) = uk
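The breakdown of the host name above can be sketched in a few lines of Python. This is only an illustration: the function name `split_hostname` is hypothetical, and it assumes names of the form machine.department.institution.sub-domain.country, which not every host name follows.

```python
def split_hostname(hostname: str) -> dict:
    """Split a host name into the parts described on the slide.

    Assumes the layout machine.<organisation labels>.sub_domain.country.
    """
    labels = hostname.lower().split(".")
    return {
        "machine": labels[0],
        "organisation": ".".join(labels[1:-2]),
        "sub_domain": labels[-2],
        "country": labels[-1],
    }


parts = split_hostname("Bsmir30.biochemistry.ucl.ac.uk")
for name, value in parts.items():
    print(f"{name:12s} = {value}")
```

Looking up the numeric IP address for a name is a separate step handled by the domain name system; in Python the standard library call `socket.gethostbyname` performs it, but it needs network access and is not run here.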
• Browsers that communicate with servers
make it easy to access information held
at different sites.
• The first point of contact between a browser
and a server is the home page
Bioinformatics as a tool
• Molecular biology has witnessed an
information revolution as a result of rapid
DNA sequencing technologies and
corresponding progress in computer-based
technologies which allowed us to cope with
information in increasingly efficient ways.
What is bioinformatics?
• Application of information technology to the storage,
management and analysis of biological information,
facilitated by the use of computers.
• It covers the computerized annotation of genomic and biological
data (databases) and the transformation and
manipulation of these data (software tools).
• Overall aim of bioinformatics:
– to provide biologically important predictions from data and
from the transformation / manipulation of these data
Biologists
collect molecular data:
DNA & Protein sequences,
gene expression, etc.
Computer scientists
(+Mathematicians, Statisticians, etc.)
Develop tools, software, algorithms
to store and analyze the data.
Bioinformaticians
Study biological questions by
analyzing molecular data
The field of science in which biology, computer science and
information technology merge into a single discipline
What is Bioinformatics?
What can you do using bioinformatics?
• Sequence analysis
– Geneticists/ molecular biologists analyse genome sequence
information to understand disease processes
• Molecular modeling
– Crystallographers/ biochemists design drugs using computer-
aided tools
• Phylogeny/evolution
– Geneticists obtain information about the evolution of
organisms by looking for similarities in gene sequences.
• Ecology and population studies
– Bioinformatics is used to handle large amounts of data
obtained in population studies
• Medical informatics
Bioinformatics includes:
Data analysis
Data storage
Data mining
Knowledge discovery
Biological modeling
Bioinformatics may be applied in:
DNA and protein sequence analysis
Differential gene expression
Protein structure and function
Protein interactions
Biological networks
Variation analysis
Drug design
Where was the start?
Gene sequencing: Automated chemical
sequencing methods allow rapid generation of large
data banks of gene sequences
What is a database?
• A collection of data
– structured
– searchable (index) -> table of contents
– updated periodically (release) -> new edition
– cross-referenced (hyperlinks) -> links with other db
• Also includes the associated tools (software) necessary for
access, updating, and insertion or deletion of
information.
• Data storage management: flat files, relational databases…
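As a rough illustration of the "structured, searchable, cross-referenced" idea, here is a minimal sketch of a relational layout using Python's built-in `sqlite3` module. The table and column names (`sequence_entry`, `cross_reference`, `release`) are invented for this example and do not describe how any real sequence database is organised; the two accession numbers are the HBA1 mRNA and protein records used elsewhere in these slides.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sequence_entry (
        accession   TEXT PRIMARY KEY,   -- structured: one row per entry
        organism    TEXT NOT NULL,
        description TEXT NOT NULL,
        release     INTEGER NOT NULL    -- hypothetical release number
    );
    -- searchable: an index plays the role of a table of contents
    CREATE INDEX idx_organism ON sequence_entry (organism);
    -- cross-referenced: links to entries in other databases
    CREATE TABLE cross_reference (
        accession TEXT NOT NULL REFERENCES sequence_entry (accession),
        other_db  TEXT NOT NULL,
        other_id  TEXT NOT NULL
    );
    """
)
conn.execute(
    "INSERT INTO sequence_entry VALUES (?, ?, ?, ?)",
    ("NM_000558.3", "Homo sapiens", "hemoglobin, alpha 1 (HBA1), mRNA", 1),
)
conn.execute(
    "INSERT INTO cross_reference VALUES (?, ?, ?)",
    ("NM_000558.3", "protein", "NP_000549.1"),
)

# Search by organism and follow the cross-reference.
rows = conn.execute(
    """
    SELECT e.accession, x.other_id
    FROM sequence_entry AS e
    JOIN cross_reference AS x ON x.accession = e.accession
    WHERE e.organism = ?
    """,
    ("Homo sapiens",),
).fetchall()
print(rows)
```

A flat file would instead keep each entry as a block of text; the relational form trades that simplicity for indexed searches and explicit links.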
Sequence Databases and Their Use:
A: Primary Sequence Databases:
Nucleic Acid Databases
NCBI (National Center for Biotechnology Information) - GenBank
http://www.ncbi.nlm.nih.gov/
EBI (European Bioinformatics Institute) - EMBL
http://www.ebi.ac.uk/
DISC - DNA Information and Stock Center, Japan
http://www.dna.affrc.go.jp/
Protein Databases
NCBI - GenPept
http://www.ncbi.nlm.nih.gov/
ExPASy - SwissProt and TrEMBL
http://www.expasy.ch/
EBI (European Bioinformatics Institute)
SwissProt, TrEMBL, PIR
http://www.ebi.ac.uk/
GenBank file format
FASTA file format
In bioinformatics, FASTA format is a text-based format for representing
either nucleotide sequences or peptide sequences, in which nucleotides
or amino acids are represented using single-letter codes.
The format also allows for sequence names and comments to precede the
sequences.
>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
>gi|14456711|ref|NM_000558.3| Homo sapiens hemoglobin, alpha 1 (HBA1), mRNA
ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGGGGTAAGGTCGG
CGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACCACCAAGACCTACTTCCCGCACTTCGACCTGAG
CCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACGCGCTGACCAACGCCGTGGCGCACGTGGACGACATGCCCAACGCGCT
GTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCCGGTCAACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCCGC
CCACCTCCCCGCCGAGTTCACCCCTGCGGTGCACGCCTCCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTTA
AGCTGGAGCCTCGGTGGCCATGCTTCTTGCCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGTACCCCCGTGGTCTTTGAAT
AAAGTCTGAGTGGGCGGC
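The FASTA layout described above (a ">" header line followed by one or more sequence lines) is simple enough to read with a few lines of Python. The sketch below is a minimal reader, not a full implementation: the name `parse_fasta` is hypothetical, and it assumes well-formed input with unique header lines.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} for FASTA-formatted lines."""
    chunks: Dict[str, List[str]] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:]             # drop the leading '>'
            chunks[header] = []
        elif header is not None:
            chunks[header].append(line)   # sequence may span many lines
    return {h: "".join(parts) for h, parts in chunks.items()}


# First two sequence lines of the cytochrome b example above.
example = """>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
""".splitlines()

records = parse_fasta(example)
for header, sequence in records.items():
    accession = header.split("|")[3]      # fourth '|'-separated field
    print(accession, len(sequence))
```

Splitting the header on "|" only works for this older NCBI header style; other sources format the header line differently.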
A cDNA sequence
>gi|14456711|ref|NM_000558.3| Homo sapiens hemoglobin, alpha 1 (HBA1), mRNA
ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGGGGTAAGGTCGG
CGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACCACCAAGACCTACTTCCCGCACTTCGACCTGAG
CCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACGCGCTGACCAACGCCGTGGCGCACGTGGACGACATGCCCAACGCGCT
GTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCCGGTCAACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCCGC
CCACCTCCCCGCCGAGTTCACCCCTGCGGTGCACGCCTCCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTTA
AGCTGGAGCCTCGGTGGCCATGCTTCTTGCCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGTACCCCCGTGGTCTTTGAAT
AAAGTCTGAGTGGGCGGC
A cDNA sequence (reading frame)
A protein sequence
>gi|14456711|ref|NM_000558.3| Homo sapiens hemoglobin, alpha 1 (HBA1), mRNA
ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGGGGTAAGGTC
GGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACCACCAAGACCTACTTCCCGCACTTCGACCTG
AGCCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACGCGCTGACCAACGCCGTGGCGCACGTGGACGACATGCCCAACGCG
CTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCCGGTCAACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCC
GCCCACCTCCCCGCCGAGTTCACCCCTGCGGTGCACGCCTCCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGT
TAAGCTGGAGCCTCGGTGGCCATGCTTCTTGCCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGTACCCCCGTGGTCTTT
GAATAAAGTCTGAGTGGGCGGC
>gi|4504347|ref|NP_000549.1| alpha 1 globin [Homo sapiens]
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDL
HAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
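The step from the cDNA to the protein sequence can be sketched as follows: find the first ATG, then read codons until a stop codon. This is a simplified illustration that assumes the standard genetic code and that the first ATG is the true start codon, which holds for this HBA1 record but not in general. The function name `translate_first_orf` is hypothetical.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_first_orf(cdna: str) -> str:
    """Translate from the first ATG up to (not including) the first stop codon."""
    seq = cdna.upper().replace("U", "T")
    start = seq.find("ATG")
    if start == -1:
        return ""
    protein = []
    for pos in range(start, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[pos:pos + 3], "X")  # X = unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# First line of the HBA1 mRNA record shown above.
fragment = (
    "ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAG"
    "ACCAACGTCAAGGCCGCCTGGGGTAAGGTC"
)
print(translate_first_orf(fragment))
```

The printed peptide should agree with the start of the alpha 1 globin protein sequence listed above.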
Sequence analysis: overview
Sequence entry
• Manual sequence entry
• Sequence database browsing
• Sequencing project management
Nucleotide sequence analysis (input: nucleotide sequence file)
• Search databases for similar sequences
• Sequence comparison
• Design further experiments: restriction mapping, PCR planning
• Search for known motifs
• Search for protein coding regions (coding / non-coding)
• RNA structure prediction
• Translate into protein
Protein sequence analysis (input: protein sequence file)
• Search databases for similar sequences
• Sequence comparison
• Search for known motifs
• Predict secondary structure
• Predict tertiary structure
• Create a multiple sequence alignment
– Edit the alignment
– Format the alignment for publication
• Molecular phylogeny
• Protein family analysis
1. BLAST programs:
http://www.ncbi.nlm.nih.gov/BLAST/
Nucleotide-nucleotide BLAST (blastn)
Translated query vs. protein database (blastx)
Protein query vs. translated database (tblastn)
Translated query vs. translated database (tblastx)
Protein-protein BLAST (blastp)
Align two sequences (bl2seq)
BLAST flavors
• BLASTN: nucleotide query sequence vs. nucleotide database
• BLASTP: protein query sequence vs. protein database
• BLASTX: nucleotide query sequence vs. protein database;
compares all six reading frames of the query with the database
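To make "all six reading frames" concrete: a nucleotide sequence can be read in three frames on the given strand and three on the reverse complement. The sketch below lists the six translations for a short sequence. It assumes the standard genetic code and is only meant to show the idea; it is not how BLAST itself is implemented, and the function names are hypothetical.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement each base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; '*' marks a stop, 'X' an unknown codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna: str) -> dict:
    """Return translations for frames +1, +2, +3, -1, -2, -3."""
    forward = dna.upper()
    reverse = reverse_complement(forward)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


frames = six_frame_translation("ATGGCC")
for name in ("+1", "+2", "+3", "-1", "-2", "-3"):
    print(name, frames[name])
```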
Database searching
Using pairwise alignments to search
databases for similar sequences
Database
Query sequence
Sequence comparison: Gene sequences can be aligned to
see similarities between genes from different sources
 768 TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG 813
     ||     ||   ||  |  | ||| |  |||| |||||    |||  |||
  87 TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG 135
              .         .         .         .         .
 814 AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG 863
     |  | |              |  |||||| |   ||||  | || |   |
 136 AAGGATC.............TCAGTAATTAATCATGCACCTATGTGGCGG 172
              .         .         .         .         .
 864 AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT 913
     ||| | ||| || || |||   |   |||||||||  ||   |||||| |
 173 AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT 216
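The row of "|" characters in an alignment like the one above simply marks columns where both sequences carry the same base. A minimal sketch of how that row, and a percent identity, can be derived from two already-aligned strings is given below. The function names are hypothetical, "." is taken as the gap character as in the alignment shown, and identity is computed over gap-free columns only, which is one of several conventions in use.

```python
def match_line(top: str, bottom: str, gap: str = ".") -> str:
    """Return '|' under identical columns and ' ' elsewhere."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    return "".join(
        "|" if a == b and a != gap else " " for a, b in zip(top, bottom)
    )


def percent_identity(top: str, bottom: str, gap: str = ".") -> float:
    """Identical columns as a percentage of the gap-free columns."""
    columns = [(a, b) for a, b in zip(top, bottom) if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


# First block of the alignment shown above.
top = "TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG"
bottom = "TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG"

print(top)
print(match_line(top, bottom))
print(bottom)
print(f"identity: {percent_identity(top, bottom):.1f}%")
```

Producing the alignment itself, that is, deciding where the gaps go, is the harder problem and is what alignment programs solve; this sketch only scores an alignment that already exists.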
Multiple sequence alignment: Sequences of
proteins from different organisms can be aligned to see
similarities and differences
Alignment formatted using MacBoxshade
Why compare sequences?
• Determination of
evolutionary
relationships
• Prediction of protein
function and structure
(database searches).
Protein 1: binds oxygen
Sequence similarity
Protein 2: binds oxygen ?
Sequences are related
• Darwin: all organisms are related through descent with modification
• Related molecules have similar functions in different organisms
Phylogenetic tree based
on ribosomal RNA:
three domains of life
CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG
CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA
CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC
AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA
AGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAA
TAT GGA CAA TTG GTT TCT TCT CTG AAT .................................
.............. TGAAAAACGTA
TF binding site
promoter
Ribosome binding Site
ORF = Open Reading Frame
CDS = Coding Sequence
Transcription start site
Xenopus MALWMQCLP-LVLVLLFSTPNTEALANQHL
Bos MALWTRLRPLLALLALWPPPPARAFVNQHL
**** : * *.*: *:..* :. *:****
Xenopus CGSHLVEALYLVCGDRGFFYYPKIKRDIEQ
Bos CGSHLVEALYLVCGERGFFYTPKARREVEG
***************:***** ** :*::*
Xenopus AQVNGPQDNELDG-MQFQPQEYQKMKRGIV
Bos PQVG---ALELAGGPGAGGLEGPPQKRGIV
.**. ** * * *****
Xenopus EQCCHSTCSLFQLENYCN
Bos EQCCASVCSLYQLENYCN
**** *.***:*******
Alignment of preproinsulin sequences
Protein structure prediction
Protein docking
PCR primer design:
Oligonucleotides for use in the polymerase chain
reaction can be designed using computer-based
programs
OPTIMAL primer length --> 20
MINIMUM primer length --> 18
MAXIMUM primer length --> 22
OPTIMAL primer melting temperature --> 60.000
MINIMUM acceptable melting temp --> 57.000
MAXIMUM acceptable melting temp --> 63.000
MINIMUM acceptable primer GC% --> 20.000
MAXIMUM acceptable primer GC% --> 80.000
Salt concentration (mM) --> 50.000
DNA concentration (nM) --> 50.000
MAX no. unknown bases (Ns) allowed --> 0
MAX acceptable self-complementarity --> 12
MAXIMUM 3' end self-complementarity --> 8
GC clamp how many 3' bases --> 0
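A few of the settings listed above (length, GC%, melting temperature, unknown bases) can be checked with a short script. This is a rough sketch under stated assumptions: the melting temperature uses the simple Wallace rule, Tm = 2(A+T) + 4(G+C), which is only an approximation for short oligonucleotides and is not the thermodynamic calculation used by primer design programs. The class and function names and the candidate primer are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PrimerLimits:
    """Limits copied from the parameter list above."""
    min_length: int = 18
    max_length: int = 22
    min_tm: float = 57.0
    max_tm: float = 63.0
    min_gc: float = 20.0
    max_gc: float = 80.0
    max_unknown: int = 0


def gc_percent(primer: str) -> float:
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer: str) -> float:
    """Approximate Tm in degrees C (Wallace rule, an assumption of this sketch)."""
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2.0 * at + 4.0 * gc


def primer_problems(primer: str, limits: PrimerLimits = PrimerLimits()) -> List[str]:
    """Return a list of violated limits; an empty list means the primer passes."""
    primer = primer.upper()
    if not primer:
        return ["empty primer"]
    problems = []
    if not limits.min_length <= len(primer) <= limits.max_length:
        problems.append("length out of range")
    if primer.count("N") > limits.max_unknown:
        problems.append("too many unknown bases")
    if not limits.min_gc <= gc_percent(primer) <= limits.max_gc:
        problems.append("GC% out of range")
    if not limits.min_tm <= wallace_tm(primer) <= limits.max_tm:
        problems.append("melting temperature out of range")
    return problems


candidate = "AGCTGACCTGAAGCTGACCT"   # made-up 20-mer, not from a real gene
print(gc_percent(candidate), wallace_tm(candidate), primer_problems(candidate))
```

Self-complementarity and the 3' end checks in the list above need a pairwise comparison of the primer with itself and are left out of this sketch.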
Primer design program
Primer3
Pick PCR primers from nucleotide sequence
http://www.basic.nwu.edu/biotools/Primer3.html
http://frodo.wi.mit.edu/cgi-bin/primer3/primer3.cgi/results_from_primer3
Restriction mapping program
http://www.firstmarket.com/firstmarket/cutter/cut2.html
Restriction mapping: Gene sequences can be analysed to detect
sites that can be cleaved by restriction
enzymes
(columns: enzyme, number of sites, recognition sequence)
AceIII 1 CAGCTCnnnnnnn’nnn...
AluI 2 AG’CT
AlwI 1 GGATCnnnn’n_
ApoI 2 r’AATT_y
BanII 1 G_rGCy’C
BfaI 2 C’TA_G
BfiI 1 ACTGGG
BsaXI 1 ACnnnnnCTCC
BsgI 1 GTGCAGnnnnnnnnnnn...
BsiHKAI 1 G_wGCw’C
Bsp1286I 1 G_dGCh’C
BsrI 2 ACTG_Gn’
BsrFI 1 r’CCGG_y
CjeI 2 CCAnnnnnnGTnnnnnn...
CviJI 4 rG’Cy
CviRI 1 TG’CA
DdeI 2 C’TnA_G
DpnI 2 GA’TC
EcoRI 1 G’AATT_C
HinfI 2 G’AnT_C
MaeIII 1 ’GTnAC_
MnlI 1 CCTCnnnnnn_n’
MseI 2 T’TA_A
MspI 1 C’CG_G
NdeI 1 CA’TA_TG
Sau3AI 2 ’GATC_
SstI 1 G_AGCT’C
TfiI 2 G’AwT_C
Tsp45I 1 ’GTsAC_
Tsp509I 3 ’AATT_
TspRI 1 CAGTGnn’
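Finding restriction sites amounts to searching the sequence for each enzyme's recognition sequence. The sketch below does this for a few enzymes from the table whose sites contain no ambiguity codes (EcoRI, AluI, MspI, NdeI). The cut offsets are read from the position of the ’ mark in the table, the test sequence is made up, and the names used are hypothetical. Sites written with codes such as n, r, y or w would need pattern matching and are not handled here.

```python
from typing import Dict, List, Tuple

# enzyme -> (recognition sequence, bases to the left of the top-strand cut)
ENZYMES: Dict[str, Tuple[str, int]] = {
    "EcoRI": ("GAATTC", 1),   # G'AATT_C
    "AluI": ("AGCT", 2),      # AG'CT
    "MspI": ("CCGG", 1),      # C'CG_G
    "NdeI": ("CATATG", 2),    # CA'TA_TG
}


def find_cut_positions(dna: str) -> Dict[str, List[int]]:
    """Return, per enzyme, how many bases lie to the left of each cut."""
    seq = dna.upper()
    cuts: Dict[str, List[int]] = {}
    for enzyme, (site, offset) in ENZYMES.items():
        positions = []
        start = seq.find(site)
        while start != -1:
            positions.append(start + offset)
            start = seq.find(site, start + 1)   # allow overlapping sites
        cuts[enzyme] = positions
    return cuts


# Made-up test sequence containing one EcoRI, two AluI and one MspI site.
test_sequence = "AAGAATTCTTAGCTGGCCGGAGCTAA"
cut_map = find_cut_positions(test_sequence)
for enzyme, positions in cut_map.items():
    print(f"{enzyme:6s} {len(positions)} {positions}")
```

The count printed for each enzyme corresponds to the middle column of the table above.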
Prediction of RNA secondary structure: an
example
A. Single-stranded RNA (5’ to 3’)
B. Stem and loop (hairpin loop)
RNA structure prediction: Structural
features of RNA can be predicted
[Figure: predicted secondary structure of an example RNA sequence, drawn as base-paired stems and loops]
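One simple way to see how stems can be predicted from sequence alone is to maximise the number of complementary base pairs by dynamic programming (a Nussinov-style approach). The sketch below only counts the pairs. It assumes A-U, G-C and G-U pairs, a minimum hairpin loop of three unpaired bases, and no pseudoknots. Practical prediction programs minimise free energy instead, so this is an illustration of the principle rather than of any particular tool.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs (base-pair maximisation)."""
    seq = rna.upper()
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = most pairs that can form within seq[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = max(best[i + 1][j], best[i][j - 1])   # i or j unpaired
            if (seq[i], seq[j]) in PAIRS:                 # i pairs with j
                score = max(score, best[i + 1][j - 1] + 1)
            for k in range(i + 1, j):                     # two separate substructures
                score = max(score, best[i][k] + best[k + 1][j])
            best[i][j] = score
    return best[0][n - 1]


hairpin = "GGGAAAUCC"   # made-up sequence: three pairs closing an AAA loop
print(max_base_pairs(hairpin))
```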
Sequences are related, II
Phylogenetic tree of
globin-type proteins
found in humans
Molecular evolution
Smith et al. (2009) Nature 459, 1122-1125
Origins and evolutionary genomics of the 2009 swine-origin H1N1 influenza A epidemic
A new swine-origin influenza A (H1N1) virus (S-OIV) emerged in Mexico and the
United States
Analysis of gene expression
Gene expression profile of relapsing versus non-relapsing Wilms tumors.
A set of 39 genes discriminates between the two classes of tumors.
(http://www.biozentrum2.uni-wuerzburg.de/, Prof. Gessler)
Analysis of regulation
Toledo and Bardot (2009) Nature 460, 466-467
Luscombe, Greenbaum, Gerstein (2001)
One day of searching might save a whole
year (or more) of lab work
Bioinformatics
Biology
Computer Science Mathematics
NCBI
(OMIM, BLAST)
Primer3
NEBcutter
