P7 2017 biopython3

Biopython fun


  1. 1. FBW 05-12-2017 Wim Van Criekinge
  2. 2. Google Calendar
  3. 3. esearch + efetch
  4. 4. Install Extra Packages … finance?
     pip is the preferred installer program. Starting with Python 3.4, it is included by default with the Python binary installers.
         pip3.5 install Biopython
         #pip3.5 install yahoo_finance
         from yahoo_finance import Share
         yahoo = Share('AAPL')
         print(yahoo.get_open())
  5. 5. NumPy – SciPy – Matplotlib
     • NumPy: fundamental open-source package for scientific computing with Python.
       – N-dimensional array object
       – Linear algebra, Fourier transform, random number capabilities
     • SciPy (pronounced “Sigh Pie”) is a Python-based ecosystem of open-source software for mathematics, science, and engineering.
     • Matplotlib is a Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms.
  6. 6. NumPy – SciPy – Matplotlib --> pandas?
     • Python Data Analysis Library, similar to:
       – R
       – MATLAB
       – SAS
     • Combined with the IPython toolkit
     • Built on top of NumPy, SciPy and, to some extent, Matplotlib
     • Panel Data System
       – Open source, BSD-licensed
     • Key components
       – Series is a named Python list (dict with list as value): { 'grades': [50, 90, 100, 45] }
       – DataFrame is a dictionary of Series (dict of Series): { 'names': ['bob', 'ken', 'art', 'joe'], 'grades': [50, 90, 100, 45] }
  7. 7. Install Extra Packages … googleapiclient?
  8. 8. Install Extra Packages … twitter
  9. 9. Install Extra Packages … matplotlib …
     • T = +1, A = -1, C = +1, G = -1
     • 2D random walk
     • 1D random walk: u(i) = 1 for pyrimidines (C or T) and u(i) = -1 for purines (A or G)
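
The step rules on this slide can be tried out in plain Python before any plotting. The following is a minimal sketch, not the code used in the lecture: the function names `walk_1d` and `walk_2d` are illustrative, and the 2D layout (T/A stepping along x, C/G stepping along y) is an assumption, since the slide gives the step values but not the axes.

```python
from itertools import accumulate

# 1D rule from the slide: u(i) = +1 for pyrimidines (C, T), -1 for purines (A, G).
STEP_1D = {"C": 1, "T": 1, "A": -1, "G": -1}

# Assumed 2D layout (not stated on the slide): T/A move along x, C/G along y.
STEP_2D = {"T": (1, 0), "A": (-1, 0), "C": (0, 1), "G": (0, -1)}


def walk_1d(seq):
    """Return the cumulative 1D walk position after each base."""
    return list(accumulate(STEP_1D[base] for base in seq.upper()))


def walk_2d(seq):
    """Return the (x, y) positions visited, starting at the origin."""
    x, y = 0, 0
    path = [(x, y)]
    for base in seq.upper():
        dx, dy = STEP_2D[base]
        x, y = x + dx, y + dy
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(walk_1d("ATGCC"))
    print(walk_2d("ATGCC"))
```

If matplotlib is installed, the 2D path can be drawn with `plt.plot(*zip(*walk_2d(seq)))`, and the 1D walk with `plt.plot(walk_1d(seq))`.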
  10. 10. Install Extra Packages …matplotlib …. and nolds
  11. 11. Extra Questions (2)
     • How many human proteins are in Swiss-Prot?
     • What is the longest human protein? The shortest?
     • Calculate the MW and pI of all human proteins and display them as two histograms (or a 2D scatter plot?)
     • How many human proteins have “cancer” in their description?
     • Which gene has the highest number of SNPs/somatic mutations (COSMIC)?
     • How many human DNA-repair enzymes are represented in Swiss-Prot (using description / GO)?
     • List proteins that contain only alpha-helices according to the Chou-Fasman algorithm
     • List proteins by their number of predicted transmembrane regions (Kyte-Doolittle)
  12. 12. Secondary structure of protein
      Amino acid sequences fold onto themselves to become a biologically active molecule. There are three types of local segments:
     • Helices: where protein residues seem to follow the shape of a spring. The most common are the so-called alpha helices.
     • Extended or beta-strands: where residues are in line and successive residues turn back to each other.
     • Random coils: when the amino acid chain is neither helical nor extended.
  13. 13. Chou-Fasman Algorithm
      Chou, P.Y. and Fasman, G.D. (1974). Conformational parameters for amino acids in helical, β-sheet, and random coil regions calculated from proteins. Biochemistry 13, 211-221.
      Chou, P.Y. and Fasman, G.D. (1974). Prediction of protein conformation. Biochemistry 13, 222-245.
      Analyzed the frequency of the 20 amino acids in alpha helices, beta sheets and turns.
     • Ala (A), Glu (E), Leu (L), and Met (M) are strong predictors of α-helices.
     • Pro (P) and Gly (G) break α-helices.
     • When 4 of 6 amino acids have a high probability of being in an alpha helix, it predicts an alpha helix.
     • When 3 of 5 amino acids have a high probability of being in a β-strand, it predicts a β-strand.
     • 4 amino acids are used to predict turns.
  14. 14. Calculation of Propensities
      Pr[i|β-sheet]/Pr[i], Pr[i|α-helix]/Pr[i], Pr[i|other]/Pr[i]
      These ratios measure how strongly amino acid i favors each structure: its frequency within that structure, normalized by the background probability that i occurs at all.
      Example: suppose there are 20,000 amino acids in the database, of which 2,000 are serine, and 5,000 amino acids in helical conformation, of which 500 are serine. Then the helical propensity for serine is: (500/5000) / (2000/20000) = 1.0
  15. 15. Calculation of preference parameters
     • Preference parameter > 1.0 → the residue has a preference for the specific secondary structure.
     • Preference parameter = 1.0 → the residue neither prefers nor dislikes the specific secondary structure.
     • Preference parameter < 1.0 → the residue dislikes the specific secondary structure.
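
The propensity calculation and its interpretation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names (`propensity`, `propensities`, `interpret`) are hypothetical, and the toy sequence/structure strings in the test run are invented for demonstration, not taken from a real database.

```python
from collections import Counter
from math import isclose


def propensity(n_res_in_struct, n_struct, n_res_total, n_total):
    """Pr[i | structure] / Pr[i], computed from raw counts."""
    return (n_res_in_struct / n_struct) / (n_res_total / n_total)


def propensities(sequence, structure, state):
    """Propensity of every residue type for `state`, given a sequence and an
    equally long per-residue structure string (for example 'H' for helix)."""
    total = Counter(sequence)
    in_state = Counter(r for r, s in zip(sequence, structure) if s == state)
    n_state = sum(in_state.values())
    if n_state == 0:
        raise ValueError("no residues are annotated with state %r" % state)
    return {
        r: propensity(in_state[r], n_state, total[r], len(sequence))
        for r in total
    }


def interpret(p):
    """Map a preference parameter onto the three cases from the slide."""
    if isclose(p, 1.0):
        return "no preference"
    return "prefers" if p > 1.0 else "dislikes"


if __name__ == "__main__":
    # Serine example from the slide: 500 of 5,000 helical residues,
    # 2,000 of 20,000 residues overall.
    p_ser = propensity(500, 5000, 2000, 20000)
    print(p_ser, interpret(p_ser))
```

The count-based `propensities` helper applies the same ratio to every residue type at once, which is how a full preference table would be derived from an annotated structure database.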
  18. 18. Preference parameters
      Residue  P(a)  P(b)  P(t)  f(i)   f(i+1)  f(i+2)  f(i+3)
      Ala      1.45  0.97  0.57  0.049  0.049   0.034   0.029
      Arg      0.79  0.90  1.00  0.051  0.127   0.025   0.101
      Asn      0.73  0.65  1.68  0.101  0.086   0.216   0.065
      Asp      0.98  0.80  1.26  0.137  0.088   0.069   0.059
      Cys      0.77  1.30  1.17  0.089  0.022   0.111   0.089
      Gln      1.17  1.23  0.56  0.050  0.089   0.030   0.089
      Glu      1.53  0.26  0.44  0.011  0.032   0.053   0.021
      Gly      0.53  0.81  1.68  0.104  0.090   0.158   0.113
      His      1.24  0.71  0.69  0.083  0.050   0.033   0.033
      Ile      1.00  1.60  0.58  0.068  0.034   0.017   0.051
      Leu      1.34  1.22  0.53  0.038  0.019   0.032   0.051
      Lys      1.07  0.74  1.01  0.060  0.080   0.067   0.073
      Met      1.20  1.67  0.67  0.070  0.070   0.036   0.070
      Phe      1.12  1.28  0.71  0.031  0.047   0.063   0.063
      Pro      0.59  0.62  1.54  0.074  0.272   0.012   0.062
      Ser      0.79  0.72  1.56  0.100  0.095   0.095   0.104
      Thr      0.82  1.20  1.00  0.062  0.093   0.056   0.068
      Trp      1.14  1.19  1.11  0.045  0.000   0.045   0.205
      Tyr      0.61  1.29  1.25  0.136  0.025   0.110   0.102
      Val      1.14  1.65  0.30  0.023  0.029   0.011   0.029
  19. 19. Applying algorithm
      1. Assign parameters (propensities) to each residue.
      2. Identify regions (nucleation sites) where 4 out of 6 residues have P(a) > 100: α-helix. Extend the helix in both directions until four contiguous residues have an average P(a) < 100: end of α-helix. If the segment is longer than 5 residues and P(a) > P(b): α-helix.
      3. Repeat this procedure to locate all of the helical regions.
      4. Identify regions where 3 out of 5 residues have P(b) > 100: β-sheet. Extend the sheet in both directions until four contiguous residues have an average P(b) < 100: end of β-sheet. If P(b) > 105 and P(b) > P(a): β-sheet.
      5. Rest: P(a) > P(b) → α-helix. P(b) > P(a) → β-sheet.
      6. To identify a bend at residue number i, calculate the following value: p(t) = f(i) × f(i+1) × f(i+2) × f(i+3), where each factor is taken from the residue at that position. If (1) p(t) > 0.000075; (2) the average P(t) > 1.00 in the tetrapeptide; and (3) the averages for the tetrapeptide obey P(a) < P(t) > P(b): β-turn.
      Note: the thresholds 100 and 105 are on a ×100 scale; they correspond to 1.00 and 1.05 on the scale of the preference-parameter table.
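
Steps 2 and 6 above can be sketched in plain Python using the values from the preference-parameter table. This is one possible reading of the slide, not a reference implementation: the function names are hypothetical, the cutoffs are expressed on the table's scale (1.00 rather than 100), and the extension rule is interpreted as "keep adding a residue while the tetrapeptide ending at that residue still averages P(a) ≥ 1.00".

```python
# Rows copied from the preference-parameter table (one-letter codes):
# residue -> (P(a), P(b), P(t), f(i), f(i+1), f(i+2), f(i+3))
PARAMS = {
    "A": (1.45, 0.97, 0.57, 0.049, 0.049, 0.034, 0.029),
    "R": (0.79, 0.90, 1.00, 0.051, 0.127, 0.025, 0.101),
    "N": (0.73, 0.65, 1.68, 0.101, 0.086, 0.216, 0.065),
    "D": (0.98, 0.80, 1.26, 0.137, 0.088, 0.069, 0.059),
    "C": (0.77, 1.30, 1.17, 0.089, 0.022, 0.111, 0.089),
    "Q": (1.17, 1.23, 0.56, 0.050, 0.089, 0.030, 0.089),
    "E": (1.53, 0.26, 0.44, 0.011, 0.032, 0.053, 0.021),
    "G": (0.53, 0.81, 1.68, 0.104, 0.090, 0.158, 0.113),
    "H": (1.24, 0.71, 0.69, 0.083, 0.050, 0.033, 0.033),
    "I": (1.00, 1.60, 0.58, 0.068, 0.034, 0.017, 0.051),
    "L": (1.34, 1.22, 0.53, 0.038, 0.019, 0.032, 0.051),
    "K": (1.07, 0.74, 1.01, 0.060, 0.080, 0.067, 0.073),
    "M": (1.20, 1.67, 0.67, 0.070, 0.070, 0.036, 0.070),
    "F": (1.12, 1.28, 0.71, 0.031, 0.047, 0.063, 0.063),
    "P": (0.59, 0.62, 1.54, 0.074, 0.272, 0.012, 0.062),
    "S": (0.79, 0.72, 1.56, 0.100, 0.095, 0.095, 0.104),
    "T": (0.82, 1.20, 1.00, 0.062, 0.093, 0.056, 0.068),
    "W": (1.14, 1.19, 1.11, 0.045, 0.000, 0.045, 0.205),
    "Y": (0.61, 1.29, 1.25, 0.136, 0.025, 0.110, 0.102),
    "V": (1.14, 1.65, 0.30, 0.023, 0.029, 0.011, 0.029),
}
PA, PB, PT = 0, 1, 2  # column indices for P(a), P(b), P(t)


def mean(peptide, col):
    """Average of one table column over a peptide."""
    return sum(PARAMS[r][col] for r in peptide) / len(peptide)


def helix_nuclei(seq, window=6, needed=4, cutoff=1.00):
    """Start indices of windows with at least `needed` residues above cutoff."""
    return [
        i for i in range(len(seq) - window + 1)
        if sum(PARAMS[r][PA] > cutoff for r in seq[i:i + window]) >= needed
    ]


def extend_helix(seq, start, window=6, cutoff=1.00):
    """Grow a nucleus in both directions; returns a half-open (left, right).

    Assumed stopping rule: stop on the side where the tetrapeptide that
    would include the next residue averages P(a) below the cutoff.
    """
    left, right = start, start + window
    while right < len(seq) and mean(seq[right - 3:right + 1], PA) >= cutoff:
        right += 1
    while left > 0 and mean(seq[left - 1:left + 3], PA) >= cutoff:
        left -= 1
    return left, right


def is_helix(seq, left, right):
    """Acceptance test: longer than 5 residues and mean P(a) > mean P(b)."""
    segment = seq[left:right]
    return len(segment) > 5 and mean(segment, PA) > mean(segment, PB)


def is_turn(seq, i, cutoff=0.000075):
    """Apply the three bend conditions to the tetrapeptide starting at i."""
    tetra = seq[i:i + 4]
    if len(tetra) < 4:
        return False
    p_t = 1.0
    for offset, residue in enumerate(tetra):
        p_t *= PARAMS[residue][3 + offset]  # f(i), f(i+1), f(i+2), f(i+3)
    avg_a, avg_b, avg_t = (mean(tetra, col) for col in (PA, PB, PT))
    return p_t > cutoff and avg_t > 1.00 and avg_a < avg_t > avg_b


if __name__ == "__main__":
    seq = "GGAELLKAGG"  # invented toy sequence
    for start in helix_nuclei(seq):
        left, right = extend_helix(seq, start)
        print(start, (left, right), is_helix(seq, left, right))
    print(is_turn("NPDG", 0))
```

Overlapping nuclei extend to the same segment, so a fuller version would merge duplicates; the β-sheet search in step 4 follows the same pattern with a window of 5, 3 required residues, and the P(b) column.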
  20. 20. Additional fun
     • Find proteins of at least 250 aa that contain the fewest secondary-structure elements. Are they candidates for being intrinsically disordered proteins (IDPs)?
     • Find proteins that contain no PROSITE patterns (using the scanner from the previous exercise).
     • Calculate for all human proteins their MW and pI (display as a 2D gel).
