P7 2018 biopython3

FBW
20-11-2018
Wim Van Criekinge

Control Structures
if condition:
statements
[elif condition:
statements] ...
else:
statements
while condition:
statements
for var in sequence:
statements
break
continue

Lists
• Flexible arrays, not Lisp-like linked
lists
• a = [99, "bottles of beer", ["on", "the",
"wall"]]
• Same operators as for strings
• a+b, a*3, a[0], a[-1], a[1:], len(a)
• Item and slice assignment
• a[0] = 98
• a[1:2] = ["bottles", "of", "beer"]
-> [98, "bottles", "of", "beer", ["on", "the", "wall"]]
• del a[-1] # -> [98, "bottles", "of", "beer"]

Dictionaries
• Hash tables, "associative arrays"
• d = {"duck": "eend", "water": "water"}
• Lookup:
• d["duck"] -> "eend"
• d["back"] # raises KeyError exception
• Delete, insert, overwrite:
• del d["water"] # {"duck": "eend", "back": "rug"}
• d["back"] = "rug" # {"duck": "eend", "back":
"rug"}
• d["duck"] = "duik" # {"duck": "duik", "back":
"rug"}

Prelude
• Erdos
• Gaussian - towards random
DNAwalks
• AAindex parser

Prelude
• Erdos
• Gaussian - towards random
DNAwalks

import Bio
print(Bio.__version__)
from Bio.Seq import Seq
my_seq = Seq('CATGTAGACTAG')
#print out some details about it
print ('seq %s is %i bases long' % (my_seq, len(my_seq)))
print ('reverse complement is %s' %
my_seq.reverse_complement())
print ('protein translation is %s' % my_seq.translate())

from Bio import SeqIO
c=0
handle = open(r'/Users/wvcrieki/Downloads/uniprot_sprot.dat')
for seq_rec in SeqIO.parse(handle, "swiss"):
print (seq_rec.id)
print (seq_rec.seq)
print (len(seq_rec))
c+=1
if c>5:
break

BioPython
• Make a histogram of the MW (in kDa) of all proteins in
Swiss-Prot
• Find the most basic and most acidic protein in Swiss-Prot?
• Biological relevance of the results ?
From AAIndex
H ZIMJ680104
D Isoelectric point (Zimmerman et al., 1968)
R LIT:2004109b PMID:5700434
A Zimmerman, J.M., Eliezer, N. and Simha, R.
T The characterization of amino acid sequences in proteins by
statistical
methods
J J. Theor. Biol. 21, 170-201 (1968)
C KLEP840101 0.941 FAUJ880111 0.813 FINA910103 0.805
I A/L R/K N/M D/F C/P Q/S E/T G/W H/Y I/V
6.00 10.76 5.41 2.77 5.05 5.65 3.22 5.97 7.59 6.02
5.98 9.74 5.74 5.48 6.30 5.68 5.66 5.89 5.66 5.96

Extra Questions
• How many human proteins in Swiss Prot ?
• What is the longest human protein ? The shortest ?
• Calculate for all human proteins their MW and pI, display as
two histograms (2D scatter ?)
• How many human proteins have “cancer” in their description?
• Which genes has the highest number of SNPs/somatic
mutations (COSMIC)
• How many human DNA-repair enzymes are represented in
Swiss Prot (using description / GO)?
• List proteins that only contain alpha-helices based on the
Chou-Fasman algorithm
• List proteins based on the number of predicted
transmembrane regions (Kyte-Doollittle)

Biopython AAindex ? Dictionary
Hydrophobicity = {A:6.00,L:5.98,R:10.76,K:9.74,N:5.41,M:5.74,D:2.77,F:5.48,
C:5.05,P:6.30,Q:5.65,S:5.68,E:3.22,T:5.66,G:5.97,W:5.89,
H:7.59,Y:5.66,I:6.02,V:5.96}
from Bio import SeqIO
c=0
handle = open(r'/Users/wvcrieki/Downloads/uniprot_sprot.dat')
for seq_rec in SeqIO.parse(handle, "swiss"):
print (seq_rec.id)
print (repr(seq_rec.seq))
print (len(seq_rec))
c+=1
if c>5:
break

Extra Questions (2)
• How many human proteins have “cancer” in their description?
mutations (COSMIC)

Primary sequence reveals important clues about a protein
DnaG E. coli ...EPNRLLVVEGYMDVVAL...
DnaG S. typ ...EPQRLLVVEGYMDVVAL...
DnaG B. subt ...KQERAVLFEGFADVYTA...
gp4 T3 ...GGKKIVVTEGEIDMLTV...
gp4 T7 ...GGKKIVVTEGEIDALTV...
: *: :: * * : :
small hydrophobic
large hydrophobic
polar
positive charge
negative charge
• Evolution conserves amino acids that are important to protein
structure and function across species. Sequence comparison of
multiple “homologs” of a particular protein reveals highly
conserved regions that are important for function.
• Clusters of conserved residues are called “motifs” -- motifs
carry out a particular function or form a particular structure
that is important for the conserved protein.
motif

 The hydropathy index of an amino acid is a number
representing the hydrophobic or hydrophilic properties of its
side-chain.
 It was proposed by Jack Kyte and Russell Doolittle in 1982.
 The larger the number is, the more hydrophobic the amino
acid. The most hydrophobic amino acids are isoleucine (4.5)
and valine (4.2). The most hydrophilic ones are arginine (-4.5)
and lysine (-3.9).
 This is very important in protein structure; hydrophobic
amino acids tend to be internal in the protein 3D structure,
while hydrophilic amino acids are more commonly found
towards the protein surface.
Hydropathy index of amino acids

(http://gcat.davidson.edu/DGPB/kd/kyte-doolittle.htm)Kyte Doolittle Hydropathy Plot

Possible transmembrane fragment

Window size – 9, strong negative peaks indicate possible surface regions
Surface region of a protein

Prediction of transmembrane helices in proteins
(TMHMM)

5-hydroxytryptamine receptor 2A (Mus musculus)

5-hydroxytryptamine receptor 2 (Grapical output)

Exceptions
• Exceptions are events that can
modify the flow or control through a
program.
• They are automatically triggered on
errors.
• try/except : catch and recover from
raised by you or Python exceptions
• try/finally: perform cleanup actions
whether exceptions occur or not
• raise: trigger an exception manually in
your code
• assert: conditionally trigger an
exception in your code

Exception Roles
• Error handling
– Wherever Python detects an error it raises
exceptions
– Default behavior: stops program.
– Otherwise, code try to catch and recover from
the exception (try handler)
• Event notification
– Can signal a valid condition (for example, in
search)
• Special-case handling
– Handles unusual situations
• Termination actions
– Guarantees the required closing-time
operators (try/finally)
• Unusual control-flows
– A sort of high-level “goto”

Numpy – SciPy – Matplib
• Numpy: Fundamental open source package for
scientific computing with Python.
– N-dimensional array object
Linear algebra, Fourier transform, random
number capabilities
• SciPy (prono1unced “Sigh Pie”) is a Python-based
ecosystem of open-source software for
mathematics, science, and engineering.
• Matplotlib is a python 2D plotting library which
produces publication quality figures in a variety of
hardcopy formats and interactive environments
across platforms.

Numpy – SciPy – Matplib -- > PANDAS ?
• Python Data Analysis Library, similar to:
– R
– MATLAB
– SAS
• Combined with the IPython toolkit
• Built on top of NumPy, SciPy, to some extent
matplotlib
• Panel Data System
– Open source, BSD-licensed
• Key Components
– Series is a named Python list (dict with list as value).
{ ‘grades’ : [50,90,100,45] }
– DataFrame is a dictionary of Series (dict of series):
{ { ‘names’ : [‘bob’,’ken’,’art’,’joe’]}
{ ‘grades’ : [50,90,100,45] }
}

Install Extra Packages …matplotlib …
T = +1
A = -1
C = +1G = -1
2D – random walk
1D – random walk
u(i)=1 for pyrimidines (C or T) and u(i)=-1 for purines (A or G)

Install Extra Packages …matplotlib …. and nolds

Extra Questions (2)
• How many human proteins have “cancer” in their
description?
mutations (COSMIC)

 Amino acid sequences fold onto themselves to become a
biologically active molecule.
There are three types of local segments:
Helices: Where protein residues seem to be following the shape
of a spring. The most common are the so-called alpha helices
Extended or Beta-strands: Where residues are in line and
successive residues turn back to each other
Random coils: When the amino acid chain is neither helical nor
extended
Secondary structure of protein

Chou-Fasman Algorithm
Chou, P.Y. and Fasman, G.D. (1974). Conformational parameters for amino acids in helical,
b-sheet, and random coil regions calculated from proteins.Biochemistry 13, 211-221.
Chou, P.Y. and Fasman, G.D. (1974). Prediction of protein conformation. Biochemistry 13,
222-245.
Analyzed the frequency of the 20 amino acids in alpha helices,
Beta sheets and turns.
• Ala (A), Glu (E), Leu (L), and Met (M) are strong predictors of
 helices
• Pro (P) and Gly (G) break  helices.
• When 4 of 5 amino acids have a high probability of being in an alpha helix, it predicts a
alpha helix.
• When 3 of 5 amino acids have a high probability of being in a
b strand, it predicts a b strand.
• 4 amino acids are used to predict turns.

Calculation of Propensities
Pr[i|b-sheet]/Pr[i], Pr[i|-helix]/Pr[i], Pr[i|other]/Pr[i]
determine the probability that amino acid i is in
each structure, normalized by the background
probability that i occurs at all.
Example.
let's say that there are 20,000 amino acids in the database, of
which 2000 are serine, and there are 5000 amino acids in
helical conformation, of which 500 are serine. Then the
helical propensity for serine is: (500/5000) / (2000/20000) =
1.0

Calculation of preference parameters
• Preference parameter > 1.0  specific
residue has a preference for the specific
secondary structure.
• Preference parameter = 1.0  specific
residue does not have a preference for, nor
dislikes the specific secondary structure.
• Preference parameter < 1.0  specific
residue dislikes the specific secondary
structure.

Preference parameters
Residue P(a) P(b) P(t) f(i) f(i+1) f(i+2) f(i+3)
Ala 1.45 0.97 0.57 0.049 0.049 0.034 0.029
Arg 0.79 0.90 1.00 0.051 0.127 0.025 0.101
Asn 0.73 0.65 1.68 0.101 0.086 0.216 0.065
Asp 0.98 0.80 1.26 0.137 0.088 0.069 0.059
Cys 0.77 1.30 1.17 0.089 0.022 0.111 0.089
Gln 1.17 1.23 0.56 0.050 0.089 0.030 0.089
Glu 1.53 0.26 0.44 0.011 0.032 0.053 0.021
Gly 0.53 0.81 1.68 0.104 0.090 0.158 0.113
His 1.24 0.71 0.69 0.083 0.050 0.033 0.033
Ile 1.00 1.60 0.58 0.068 0.034 0.017 0.051
Leu 1.34 1.22 0.53 0.038 0.019 0.032 0.051
Lys 1.07 0.74 1.01 0.060 0.080 0.067 0.073
Met 1.20 1.67 0.67 0.070 0.070 0.036 0.070
Phe 1.12 1.28 0.71 0.031 0.047 0.063 0.063
Pro 0.59 0.62 1.54 0.074 0.272 0.012 0.062
Ser 0.79 0.72 1.56 0.100 0.095 0.095 0.104
Thr 0.82 1.20 1.00 0.062 0.093 0.056 0.068
Trp 1.14 1.19 1.11 0.045 0.000 0.045 0.205
Tyr 0.61 1.29 1.25 0.136 0.025 0.110 0.102
Val 1.14 1.65 0.30 0.023 0.029 0.011 0.029

Applying algorithm
1. Assign parameters (propensities) to residue.
2. Identify regions (nucleation sites) where 4 out of 6 residues have
P(a)>100: a-helix. Extend helix in both directions until four
contiguous residues have an average P(a)<100: end of a-helix. If
segment is longer than 5 residues and P(a)>P(b): a-helix.
3. Repeat this procedure to locate all of the helical regions.
4. Identify regions where 3 out of 5 residues have P(b)>100: b-
sheet. Extend sheet in both directions until four contiguous
residues have an average P(b)<100: end of b-sheet. If P(b)>105
and P(b)>P(a): b-sheet.
5. Rest: P(a)>P(b)  a-helix. P(b)>P(a)  b-sheet.
6. To identify a bend at residue number i, calculate the following
value: p(t) = f(i)f(i+1)f(i+2)f(i+3)
If: (1) p(t) > 0.000075; (2) average P(t)>1.00 in the tetrapeptide;
and (3) averages for tetrapeptide obey P(a)<P(t)>P(b): b-turn.

• Use uniprot_sprot_2018small.dat
– How many entries are in the file ?
– How many different organisms ?
– How many DNA repair enzymes ?
– Find human proteins of at least 250aa that
contain the fewest secondary – structure
elements ? Are they candidates for being IDP ?

Additional fun
• Find proteins that contain no prosite patterns (using
scanner from previous exercise) ?
• Calculate for all human proteins their MW and pI (display
as 2D gel)

P7 2018 biopython3

Recommended

Recommended

More Related Content

What's hot

What's hot (7)

Similar to P7 2018 biopython3

Similar to P7 2018 biopython3 (20)

More from Prof. Wim Van Criekinge

More from Prof. Wim Van Criekinge (20)

Recently uploaded

Recently uploaded (20)

P7 2018 biopython3