FBW
20-11-2018
Wim Van Criekinge
Google Calendar
Control Structures
if condition:
    statements
[elif condition:
    statements] ...
else:
    statements
while condition:
    statements
for var in sequence:
    statements
break
continue
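A minimal sketch that puts these constructs side by side. The function name `classify` and the sample values are made up for illustration.

```python
def classify(n):
    """Return a label for n using if / elif / else."""
    if n < 0:
        return "negative"
    elif n == 0:
        return "zero"
    else:
        return "positive"

# while: repeat as long as the condition holds
countdown = []
i = 3
while i > 0:
    countdown.append(i)
    i -= 1

# for with continue (skip this item) and break (leave the loop)
picked = []
for value in [1, 2, 3, 4, 5, 6]:
    if value % 2 == 0:
        continue      # skip even numbers
    if value > 4:
        break         # stop at the first odd number above 4
    picked.append(value)

print(classify(-2), countdown, picked)
```

Note that `break` ends the loop entirely, while `continue` only skips to the next item.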
Lists
• Flexible arrays, not Lisp-like linked lists
• a = [99, "bottles of beer", ["on", "the", "wall"]]
• Same operators as for strings
• a+b, a*3, a[0], a[-1], a[1:], len(a)
• Item and slice assignment
• a[0] = 98
• a[1:2] = ["bottles", "of", "beer"]
-> [98, "bottles", "of", "beer", ["on", "the", "wall"]]
• del a[-1] # -> [98, "bottles", "of", "beer"]
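The operations above can be tried in one short session. This is a sketch only; the second list `b` and the helper names are assumptions added for illustration.

```python
a = [99, "bottles of beer", ["on", "the", "wall"]]
b = [1, 2]

concatenated = a + b        # new list with 5 items
repeated = b * 3            # the items of b, three times over
first, last = a[0], a[-1]   # first item and last item (the nested list)
tail = a[1:]                # everything after the first item

a[0] = 98                             # item assignment
a[1:2] = ["bottles", "of", "beer"]    # slice assignment can change the length
after_slice = list(a)                 # copy of the list at this point
del a[-1]                             # remove the nested list
print(a, len(a))
```

Slice assignment replaces one item with three here, which is why the list grows from three to five items before the final `del`.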
Dictionaries
• Hash tables, "associative arrays"
• d = {"duck": "eend", "water": "water"}
• Lookup:
• d["duck"] -> "eend"
• d["back"] # raises KeyError exception
• Delete, insert, overwrite:
• del d["water"] # {"duck": "eend"}
• d["back"] = "rug" # {"duck": "eend", "back": "rug"}
• d["duck"] = "duik" # {"duck": "duik", "back": "rug"}
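A small sketch of the same dictionary operations, including two common ways to deal with a missing key. The variable names `missing` and `fallback` are illustrative only.

```python
d = {"duck": "eend", "water": "water"}

looked_up = d["duck"]            # plain lookup

# d["back"] raises KeyError at this point; catch it ...
try:
    d["back"]
    missing = False
except KeyError:
    missing = True

# ... or ask for a default instead of an exception
fallback = d.get("back", "?")

del d["water"]                   # delete a key
d["back"] = "rug"                # insert a new key
d["duck"] = "duik"               # overwrite an existing key
print(d)
```

`d.get(key, default)` is usually the more convenient choice when a missing key is expected rather than an error.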
Prelude
• Erdos
• Gaussian - towards random DNAwalks
• AAindex parser
Erdos.py
Gaussian.py
import Bio
print(Bio.__version__)
from Bio.Seq import Seq
my_seq = Seq('CATGTAGACTAG')
# print out some details about it
print('seq %s is %i bases long' % (my_seq, len(my_seq)))
print('reverse complement is %s' % my_seq.reverse_complement())
print('protein translation is %s' % my_seq.translate())
from Bio import SeqIO
c = 0
# parse the Swiss-Prot flat file and print the first six records
with open(r'/Users/wvcrieki/Downloads/uniprot_sprot.dat') as handle:
    for seq_rec in SeqIO.parse(handle, "swiss"):
        print(seq_rec.id)
        print(seq_rec.seq)
        print(len(seq_rec))
        c += 1
        if c > 5:
            break
BioPython
• Make a histogram of the MW (in kDa) of all proteins in Swiss-Prot
• Find the most basic and the most acidic protein in Swiss-Prot
• What is the biological relevance of the results?
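One possible starting point for the histogram exercise, sketched in plain Python so it runs without the Swiss-Prot file. The residue masses are commonly tabulated average values rounded to two decimals and should be treated as an assumption; the function names and toy sequences are hypothetical.

```python
from collections import Counter

# Average residue masses in Da (assumed standard values, rounded)
RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER = 18.02  # one water per chain (free N- and C-terminus)

def molecular_weight_kda(sequence):
    """Approximate MW in kDa; residues outside the 20 standard ones are skipped."""
    total = sum(RESIDUE_MASS.get(aa, 0.0) for aa in sequence.upper())
    return (total + WATER) / 1000.0

def histogram(weights_kda, bin_width=10):
    """Count weights per bin; keys are the lower edge of each bin in kDa."""
    return Counter(int(w // bin_width) * bin_width for w in weights_kda)

# Made-up toy sequences standing in for Swiss-Prot records
toy = ["G" * 100, "W" * 100, "A" * 200]
weights = [molecular_weight_kda(s) for s in toy]
for lower, count in sorted(histogram(weights).items()):
    print("%3d-%3d kDa %s" % (lower, lower + 10, "#" * count))
```

For the real exercise, the list of toy sequences would be replaced by the sequences yielded by `SeqIO.parse`. Biopython also provides `Bio.SeqUtils.ProtParam.ProteinAnalysis`, whose `molecular_weight()` method may be the more accurate choice.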
From AAIndex
H ZIMJ680104
D Isoelectric point (Zimmerman et al., 1968)
R LIT:2004109b PMID:5700434
A Zimmerman, J.M., Eliezer, N. and Simha, R.
T The characterization of amino acid sequences in proteins by statistical
  methods
J J. Theor. Biol. 21, 170-201 (1968)
C KLEP840101    0.941  FAUJ880111    0.813  FINA910103    0.805
I    A/L     R/K     N/M     D/F     C/P     Q/S     E/T     G/W     H/Y     I/V
    6.00   10.76    5.41    2.77    5.05    5.65    3.22    5.97    7.59    6.02
    5.98    9.74    5.74    5.48    6.30    5.68    5.66    5.89    5.66    5.96
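The `I` block pairs two amino acids per column: the first letter of each pair takes its value from the first row, the second letter from the second row. A minimal parser sketch for this block follows; the function name `parse_index_block` is hypothetical, and missing values (written `NA` in some AAindex entries) are not handled.

```python
ENTRY_I_BLOCK = """\
I    A/L     R/K     N/M     D/F     C/P     Q/S     E/T     G/W     H/Y     I/V
    6.00   10.76    5.41    2.77    5.05    5.65    3.22    5.97    7.59    6.02
    5.98    9.74    5.74    5.48    6.30    5.68    5.66    5.89    5.66    5.96
"""

def parse_index_block(text):
    """Turn the 'I' block of an AAindex entry into {amino_acid: value}."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()[1:]                 # drop the leading 'I' tag
    first_row = [float(x) for x in lines[1].split()]
    second_row = [float(x) for x in lines[2].split()]
    table = {}
    for pair, top, bottom in zip(header, first_row, second_row):
        first_aa, second_aa = pair.split("/")
        table[first_aa] = top                     # first letter -> first row
        table[second_aa] = bottom                 # second letter -> second row
    return table

zimj680104 = parse_index_block(ENTRY_I_BLOCK)
print(zimj680104["R"], zimj680104["D"])
```

The result is an ordinary dictionary, so any per-residue index from AAindex can be looked up with `table["K"]` once parsed.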
Find_Most_Basic_Protein.py
Extra Questions
• How many human proteins are in Swiss-Prot?
• What is the longest human protein? The shortest?
• Calculate the MW and pI of all human proteins and display them as two histograms (or a 2D scatter plot?)
• How many human proteins have “cancer” in their description?
• Which gene has the highest number of SNPs/somatic mutations (COSMIC)?
• How many human DNA-repair enzymes are represented in Swiss-Prot (using description / GO)?
• List proteins that contain only alpha-helices according to the Chou-Fasman algorithm
• List proteins by their number of predicted transmembrane regions (Kyte-Doolittle)
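A rough sketch for the first two questions, written against plain tuples so it runs without the database. The tuple layout, the function name `summarize_human`, and the toy records are all hypothetical. With Biopython, the organism is assumed to be available as `seq_rec.annotations["organism"]` for records parsed with the `"swiss"` format.

```python
def summarize_human(records):
    """records: iterable of (entry_name, organism, sequence) tuples."""
    human = [r for r in records if "Homo sapiens" in r[1]]
    longest = max(human, key=lambda r: len(r[2]))
    shortest = min(human, key=lambda r: len(r[2]))
    return len(human), longest[0], shortest[0]

# Made-up toy records, not real Swiss-Prot entries
toy_records = [
    ("TOY1_HUMAN", "Homo sapiens (Human)", "MKT" * 50),
    ("TOY2_HUMAN", "Homo sapiens (Human)", "MA"),
    ("TOY3_MOUSE", "Mus musculus (Mouse)", "MKTAYIAK"),
]
count, longest_name, shortest_name = summarize_human(toy_records)
print(count, longest_name, shortest_name)
```

The same filter-then-aggregate pattern extends to the keyword questions, for example by testing whether "cancer" occurs in a record's description.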
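Biopython AAindex -> Dictionary
# Values are the isoelectric point index ZIMJ680104 from AAindex, not a hydrophobicity scale
IsoelectricPoint = {'A':6.00,'L':5.98,'R':10.76,'K':9.74,'N':5.41,'M':5.74,'D':2.77,'F':5.48,
                    'C':5.05,'P':6.30,'Q':5.65,'S':5.68,'E':3.22,'T':5.66,'G':5.97,'W':5.89,
                    'H':7.59,'Y':5.66,'I':6.02,'V':5.96}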
from Bio import SeqIO
c = 0
# parse the Swiss-Prot flat file and print the first six records
with open(r'/Users/wvcrieki/Downloads/uniprot_sprot.dat') as handle:
    for seq_rec in SeqIO.parse(handle, "swiss"):
        print(seq_rec.id)
        print(repr(seq_rec.seq))
        print(len(seq_rec))
        c += 1
        if c > 5:
            break
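One way the dictionary and the parsing loop might be combined is to average the index value over each sequence and keep the extremes. This is a sketch under stated assumptions: the per-residue average is only a crude proxy for a true pI, it is not necessarily what Find_Most_Basic_Protein.py does, and the function name and toy peptides are hypothetical.

```python
# Isoelectric point index from AAindex entry ZIMJ680104 (values as listed above)
ZIMJ680104 = {
    "A": 6.00, "L": 5.98, "R": 10.76, "K": 9.74, "N": 5.41, "M": 5.74,
    "D": 2.77, "F": 5.48, "C": 5.05, "P": 6.30, "Q": 5.65, "S": 5.68,
    "E": 3.22, "T": 5.66, "G": 5.97, "W": 5.89, "H": 7.59, "Y": 5.66,
    "I": 6.02, "V": 5.96,
}

def mean_index(sequence, table):
    """Average per-residue index value; residues missing from the table are ignored."""
    values = [table[aa] for aa in sequence if aa in table]
    return sum(values) / len(values) if values else None

# Made-up toy peptides keyed by made-up identifiers
peptides = {
    "basic_toy": "KRKRKRAG",
    "acidic_toy": "DEDEDEAG",
    "neutral_toy": "AGAGAGAG",
}
scores = {name: mean_index(seq, ZIMJ680104) for name, seq in peptides.items()}
most_basic = max(scores, key=scores.get)
most_acidic = min(scores, key=scores.get)
print(most_basic, most_acidic)
```

On the full database, `peptides` would be replaced by a loop over `SeqIO.parse`, keeping only the running maximum and minimum instead of storing every score.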
Primary sequence reveals important clues about a protein
DnaG E. coli    ...EPNRLLVVEGYMDVVAL...
DnaG S. typ     ...EPQRLLVVEGYMDVVAL...
DnaG B. subt    ...KQERAVLFEGFADVYTA...
gp4 T3          ...GGKKIVVTEGEIDMLTV...
gp4 T7          ...GGKKIVVTEGEIDALTV...
: *: :: * * : :
Color key: small hydrophobic, large hydrophobic, polar, positive charge, negative charge
• Evolution conserves amino acids that are important to protein
structure and function across species. Sequence comparison of
multiple “homologs” of a particular protein reveals highly
conserved regions that are important for function.
• Clusters of conserved residues are called “motifs”. A motif carries out
a particular function or forms a particular structure that is important
for the conserved protein.
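Fully conserved positions in the alignment fragment above can be found by checking each column for a single distinct residue. This is a minimal sketch; the function name `conserved_columns` is hypothetical, and it only detects identical columns, not the weaker similarity marked with ":" on the slide.

```python
alignment = {
    "DnaG E. coli": "EPNRLLVVEGYMDVVAL",
    "DnaG S. typ":  "EPQRLLVVEGYMDVVAL",
    "DnaG B. subt": "KQERAVLFEGFADVYTA",
    "gp4 T3":       "GGKKIVVTEGEIDMLTV",
    "gp4 T7":       "GGKKIVVTEGEIDALTV",
}

def conserved_columns(sequences):
    """Return 0-based indices of columns where every sequence has the same residue."""
    return [i for i, column in enumerate(zip(*sequences))
            if len(set(column)) == 1]

columns = conserved_columns(alignment.values())
width = len(next(iter(alignment.values())))
marker = "".join("*" if i in columns else " " for i in range(width))

for name, seq in alignment.items():
    print("%-14s%s" % (name, seq))
print("%-14s%s" % ("", marker))
```

For these five sequences the identical columns are the E, G and D of the shared EG.xD stretch, which is the kind of cluster the slide calls a motif.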
Hydropathy index of amino acids
• The hydropathy index of an amino acid is a number representing the
hydrophobic or hydrophilic properties of its side chain.
• It was proposed by Jack Kyte and Russell Doolittle in 1982.
• The larger the number, the more hydrophobic the amino acid. The most
hydrophobic amino acids are isoleucine (4.5) and valine (4.2). The most
hydrophilic ones are arginine (-4.5) and lysine (-3.9).
• This is very important in protein structure: hydrophobic amino acids
tend to be buried inside the protein 3D structure, while hydrophilic
amino acids are more commonly found towards the protein surface.
Kyte-Doolittle Hydropathy Plot (http://gcat.davidson.edu/DGPB/kd/kyte-doolittle.htm)
Possible transmembrane fragment
Window size = 9; strong negative peaks indicate possible surface regions
Surface region of a protein
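A hydropathy plot of this kind is a sliding-window average of the per-residue index. The sketch below uses the Kyte-Doolittle scale with a window of 9, as on the slide; the function name `hydropathy_profile` and the toy sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values (1982 scale)
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(sequence, window=9):
    """Mean hydropathy of every full window along the sequence."""
    scores = [KD[aa] for aa in sequence]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

# Made-up toy sequence: charged stretch, hydrophobic stretch, charged stretch
toy = "KRDEKRDEK" + "ILVILVILV" + "KRDEKRDEK"
profile = hydropathy_profile(toy)
peak = max(range(len(profile)), key=profile.__getitem__)
print("most hydrophobic window starts at residue", peak + 1)
```

Positive peaks suggest buried or membrane-spanning stretches and negative peaks suggest surface regions. Larger windows (around 19) are commonly used when looking for transmembrane segments. Non-standard residue letters would raise a `KeyError` in this sketch.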
Prediction of transmembrane helices in proteins (TMHMM)
5-hydroxytryptamine receptor 2A (Mus musculus)
5-hydroxytryptamine receptor 2 (Graphical output)

P6 2018 biopython2b
