FBW
17-11-2015
Wim Van Criekinge
Bioinformatics.be
GitHub: Hosted GIT
• Largest open source git hosting site
• Public and private options
• User-centric rather than project-centric
• http://github.ugent.be (use your UGent login and password)
  – Accept invitation from Bioinformatics-I-2015
URI:
  – https://github.ugent.be/Bioinformatics-I-2015/Python.git
Control Structures
if condition:
    statements
[elif condition:
    statements] ...
else:
    statements

while condition:
    statements

for var in sequence:
    statements

break
continue
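The templates above can be sketched as runnable Python. The function name classify_gc, the cutoffs 0.6 and 0.4, and the sample sequences are illustrative assumptions, not part of the course material.

```python
# Illustrative only: names, cutoffs and data are made up for this sketch.

def classify_gc(seq):
    """Label a DNA string by its GC fraction (if / elif / else)."""
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    if gc > 0.6:
        return "GC-rich"
    elif gc < 0.4:
        return "AT-rich"
    else:
        return "balanced"

sequences = ["GGCCGC", "", "ATATTA", "ATGC", "STOP", "GGGG"]

labels = []
for seq in sequences:          # for var in sequence
    if not seq:
        continue               # skip empty entries and move to the next item
    if seq == "STOP":
        break                  # leave the loop; "GGGG" is never reached
    labels.append(classify_gc(seq))

countdown = []
n = 3
while n > 0:                   # while condition
    countdown.append(n)
    n -= 1

print(labels)
print(countdown)
```

Note that continue only skips the current iteration, while break ends the loop altogether.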
Lists
• Flexible arrays, not Lisp-like linked lists
• a = [99, "bottles of beer", ["on", "the", "wall"]]
• Same operators as for strings
  • a+b, a*3, a[0], a[-1], a[1:], len(a)
• Item and slice assignment
  • a[0] = 98
  • a[1:2] = ["bottles", "of", "beer"]
    -> [98, "bottles", "of", "beer", ["on", "the", "wall"]]
  • del a[-1] # -> [98, "bottles", "of", "beer"]
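A short runnable walk-through of the list operations listed above. The second list b and the helper variable names are assumptions added for illustration.

```python
a = [99, "bottles of beer", ["on", "the", "wall"]]
b = ["more"]                         # hypothetical second list for a + b

joined = a + b                       # concatenation builds a new list
repeated = a * 3                     # repetition: three copies, 9 items
first, last, rest = a[0], a[-1], a[1:]
length = len(a)                      # counts top-level items only

a[0] = 98                            # item assignment
a[1:2] = ["bottles", "of", "beer"]   # slice assignment may change the length
after_slice = list(a)                # shallow copy to keep this state

del a[-1]                            # drop the nested list at the end
print(a)
```

Slice assignment replaces one item with three here, which is why the list grows from three to five entries before the final del.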
Dictionaries
• Hash tables, "associative arrays"
• d = {"duck": "eend", "water": "water"}
• Lookup:
  • d["duck"] -> "eend"
  • d["back"] # raises KeyError exception
• Delete, insert, overwrite:
  • del d["water"] # {"duck": "eend"}
  • d["back"] = "rug" # {"duck": "eend", "back": "rug"}
  • d["duck"] = "duik" # {"duck": "duik", "back": "rug"}
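The dictionary operations above, as a runnable sketch with the state after each step recorded. The variable names missing, fallback and the after_* snapshots are assumptions added for illustration.

```python
d = {"duck": "eend", "water": "water"}

translation = d["duck"]          # plain lookup

try:
    d["back"]                    # key not present yet
    missing = False
except KeyError:
    missing = True

fallback = d.get("back")         # .get() returns None instead of raising

del d["water"]                   # delete
after_delete = dict(d)

d["back"] = "rug"                # insert
after_insert = dict(d)

d["duck"] = "duik"               # overwrite an existing key
print(d)
```

Using d.get(key) or the test `key in d` avoids the KeyError when a key may be absent.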
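Regex.py
import re

text = 'abbaaabbbbaaaaa'
pattern = 'ab'
# Report every non-overlapping occurrence of the pattern with its span
for match in re.finditer(pattern, text):
    s = match.start()
    e = match.end()
    print('Found "%s" at %d:%d' % (text[s:e], s, e))

# Capture a single capital letter followed by a space at the start of a line
line = 'A sample line'  # placeholder input; the slide leaves `line` undefined
m = re.search("^([A-Z]) ", line)
if m:
    from_letter = m.groups()[0]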
Install Biopython
pip is the preferred installer program.
Starting with Python 3.4, it is included
by default with the Python binary
installers.
pip3.5 install Biopython
#pip3.5 install yahoo_finance
from yahoo_finance import Share
yahoo = Share('AAPL')
print (yahoo.get_open())
BioPython
• Make a histogram of the MW (in kDa) of all proteins in Swiss-Prot
• Find the most basic and the most acidic protein in Swiss-Prot
• What is the biological relevance of the results?
From AAIndex
H ZIMJ680104
D Isoelectric point (Zimmerman et al., 1968)
R LIT:2004109b PMID:5700434
A Zimmerman, J.M., Eliezer, N. and Simha, R.
T The characterization of amino acid sequences in proteins by statistical
  methods
J J. Theor. Biol. 21, 170-201 (1968)
C KLEP840101 0.941 FAUJ880111 0.813 FINA910103 0.805
I    A/L     R/K     N/M     D/F     C/P     Q/S     E/T     G/W     H/Y     I/V
    6.00   10.76    5.41    2.77    5.05    5.65    3.22    5.97    7.59    6.02
    5.98    9.74    5.74    5.48    6.30    5.68    5.66    5.89    5.66    5.96
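The "I" block of an AAindex entry pairs two amino acids per column: the first letter of each pair takes the value from the first row, the second letter the value from the second row. A minimal sketch of turning that block into a Python dictionary follows; the function name parse_aaindex_values and the variable names are assumptions, and only the standard library is used.

```python
# Text of the "I" block from the ZIMJ680104 entry above.
I_BLOCK = """\
I    A/L     R/K     N/M     D/F     C/P     Q/S     E/T     G/W     H/Y     I/V
    6.00   10.76    5.41    2.77    5.05    5.65    3.22    5.97    7.59    6.02
    5.98    9.74    5.74    5.48    6.30    5.68    5.66    5.89    5.66    5.96
"""

def parse_aaindex_values(block):
    """Map one-letter amino acid codes to the values of an AAindex 'I' block."""
    header, row1, row2 = block.strip().splitlines()
    pairs = header.split()[1:]               # drop the leading "I" tag
    first = [float(x) for x in row1.split()]
    second = [float(x) for x in row2.split()]
    values = {}
    for pair, v1, v2 in zip(pairs, first, second):
        aa1, aa2 = pair.split("/")
        values[aa1] = v1                     # first letter -> first row
        values[aa2] = v2                     # second letter -> second row
    return values

isoelectric_point = parse_aaindex_values(I_BLOCK)
most_basic = max(isoelectric_point, key=isoelectric_point.get)
most_acidic = min(isoelectric_point, key=isoelectric_point.get)
print(most_basic, most_acidic)
```

On this index, arginine has the highest value and aspartate the lowest, which is a useful sanity check on the column pairing.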
Biopython AAindex -> Dictionary
# The values are the ZIMJ680104 isoelectric points, so the dictionary is named accordingly
isoelectric_point = {'A':6.00,'L':5.98,'R':10.76,'K':9.74,'N':5.41,'M':5.74,'D':2.77,'F':5.48,
                     'C':5.05,'P':6.30,'Q':5.65,'S':5.68,'E':3.22,'T':5.66,'G':5.97,'W':5.89,
                     'H':7.59,'Y':5.66,'I':6.02,'V':5.96}

from Bio import SeqIO

c = 0
# Print the first six Swiss-Prot records as a sanity check
with open(r'/Users/wvcrieki/Downloads/uniprot_sprot.dat') as handle:
    for seq_rec in SeqIO.parse(handle, "swiss"):
        print(seq_rec.id)
        print(repr(seq_rec.seq))
        print(len(seq_rec))
        c += 1
        if c > 5:
            break
Find_Most_Basic_Protein.py
Parsing sequences from the net
Handles are not always from files

Parsing GenBank records from the net
>>> from Bio import Entrez
>>> from Bio import SeqIO
>>> Entrez.email = "you@example.org"  # placeholder; NCBI asks for a contact address
>>> handle = Entrez.efetch(db="nucleotide", rettype="fasta", id="6273291")
>>> seq_record = SeqIO.read(handle, "fasta")
>>> handle.close()
>>> seq_record.description

Parsing SwissProt sequence from the net
>>> from Bio import ExPASy
>>> from Bio import SeqIO
>>> # get_sprot_raw expects a Swiss-Prot accession, not a GenBank id;
>>> # O23729 is the accession used in the Biopython tutorial
>>> handle = ExPASy.get_sprot_raw("O23729")
>>> seq_record = SeqIO.read(handle, "swiss")
>>> handle.close()
>>> print(seq_record.id)
>>> print(seq_record.name)
>>> print(seq_record.description)
Biopython_live.py
Extra Questions (2)
• How many human proteins are in Swiss-Prot?
• What is the longest human protein? The shortest?
• Calculate the MW and pI of all human proteins and display them as two histograms (or a 2D scatter plot?)
• How many human proteins have “cancer” in their description?
• Which gene has the highest number of SNPs/somatic mutations (COSMIC)?
• How many human DNA-repair enzymes are represented in Swiss-Prot (using description / GO)?
• List proteins that only contain alpha-helices based on the Chou-Fasman algorithm
• List proteins based on the number of predicted transmembrane regions (Kyte-Doolittle)
Secondary structure of protein
• Amino acid sequences fold onto themselves to become a biologically active molecule.
• There are three types of local segments:
  – Helices: protein residues follow the shape of a spring. The most common are the so-called alpha helices.
  – Extended or beta-strands: residues are in line and successive residues turn back to each other.
  – Random coils: the amino acid chain is neither helical nor extended.
Chou-Fasman Algorithm
Chou, P.Y. and Fasman, G.D. (1974). Conformational parameters for amino acids in helical, β-sheet, and random coil regions calculated from proteins. Biochemistry 13, 211-221.
Chou, P.Y. and Fasman, G.D. (1974). Prediction of protein conformation. Biochemistry 13, 222-245.
Analyzed the frequency of the 20 amino acids in alpha helices, beta sheets and turns.
• Ala (A), Glu (E), Leu (L), and Met (M) are strong predictors of α helices.
• Pro (P) and Gly (G) break α helices.
• When 4 of 6 amino acids have a high probability of being in an alpha helix, it predicts an alpha helix.
• When 3 of 5 amino acids have a high probability of being in a β strand, it predicts a β strand.
• 4 amino acids are used to predict turns.
Calculation of Propensities
Pr[i|β-sheet]/Pr[i], Pr[i|α-helix]/Pr[i], Pr[i|other]/Pr[i]
give the probability that amino acid i is in each structure, normalized by the background probability that i occurs at all.
Example.
Let's say that there are 20,000 amino acids in the database, of which 2,000 are serine, and there are 5,000 amino acids in helical conformation, of which 500 are serine. Then the helical propensity for serine is: (500/5000) / (2000/20000) = 1.0
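The serine example can be checked with a few lines of Python. The function name propensity and its argument names are assumptions chosen for this sketch; the counts are the ones from the example.

```python
def propensity(n_res_in_struct, n_struct, n_res_total, n_total):
    """Pr[i | structure] / Pr[i], computed from raw residue counts."""
    freq_in_structure = n_res_in_struct / n_struct   # Pr[i | structure]
    background_freq = n_res_total / n_total          # Pr[i]
    return freq_in_structure / background_freq

# 500 of 5000 helical residues are serine; 2000 of 20000 residues overall.
serine_helix = propensity(500, 5000, 2000, 20000)
print(serine_helix)
```

A value of 1.0 means serine is exactly as frequent inside helices as in the database as a whole.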
Calculation of preference parameters
• Preference parameter > 1.0 -> the residue has a preference for the specific secondary structure.
• Preference parameter = 1.0 -> the residue neither prefers nor dislikes the specific secondary structure.
• Preference parameter < 1.0 -> the residue dislikes the specific secondary structure.
Preference parameters
Residue  P(α)  P(β)  P(t)   f(i)  f(i+1)  f(i+2)  f(i+3)
Ala      1.45  0.97  0.57  0.049   0.049   0.034   0.029
Arg      0.79  0.90  1.00  0.051   0.127   0.025   0.101
Asn      0.73  0.65  1.68  0.101   0.086   0.216   0.065
Asp      0.98  0.80  1.26  0.137   0.088   0.069   0.059
Cys      0.77  1.30  1.17  0.089   0.022   0.111   0.089
Gln      1.17  1.23  0.56  0.050   0.089   0.030   0.089
Glu      1.53  0.26  0.44  0.011   0.032   0.053   0.021
Gly      0.53  0.81  1.68  0.104   0.090   0.158   0.113
His      1.24  0.71  0.69  0.083   0.050   0.033   0.033
Ile      1.00  1.60  0.58  0.068   0.034   0.017   0.051
Leu      1.34  1.22  0.53  0.038   0.019   0.032   0.051
Lys      1.07  0.74  1.01  0.060   0.080   0.067   0.073
Met      1.20  1.67  0.67  0.070   0.070   0.036   0.070
Phe      1.12  1.28  0.71  0.031   0.047   0.063   0.063
Pro      0.59  0.62  1.54  0.074   0.272   0.012   0.062
Ser      0.79  0.72  1.56  0.100   0.095   0.095   0.104
Thr      0.82  1.20  1.00  0.062   0.093   0.056   0.068
Trp      1.14  1.19  1.11  0.045   0.000   0.045   0.205
Tyr      0.61  1.29  1.25  0.136   0.025   0.110   0.102
Val      1.14  1.65  0.30  0.023   0.029   0.011   0.029
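To see how the table is read against the "> 1.0 / = 1.0 / < 1.0" rule, the sketch below labels a few residues for helix, sheet and turn. Only four rows are copied from the table; the function name interpret and the label strings are assumptions for illustration.

```python
import math

# Four rows copied from the table: (helix, sheet, turn) preference parameters.
PREFERENCES = {
    "Glu": (1.53, 0.26, 0.44),
    "Ile": (1.00, 1.60, 0.58),
    "Gly": (0.53, 0.81, 1.68),
    "Val": (1.14, 1.65, 0.30),
}
STRUCTURES = ("helix", "sheet", "turn")

def interpret(value):
    """Translate one preference parameter into a qualitative label."""
    if math.isclose(value, 1.0):
        return "indifferent"
    return "favours" if value > 1.0 else "disfavours"

labels = {
    residue: {s: interpret(v) for s, v in zip(STRUCTURES, values)}
    for residue, values in PREFERENCES.items()
}

for residue, row in labels.items():
    print(residue, row)
```

Glu comes out as a helix former that avoids sheets and turns, while Gly shows the opposite pattern, in line with the qualitative statements on the Chou-Fasman slide.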
Applying algorithm
Thresholds are given on the scale of the preference table (1.00 is the same cutoff as 100 on a percent scale).
1. Assign parameters (propensities) to each residue.
2. Identify regions (nucleation sites) where 4 out of 6 residues have P(α)>1.00: α-helix. Extend the helix in both directions until four contiguous residues have an average P(α)<1.00: end of α-helix. If the segment is longer than 5 residues and P(α)>P(β): α-helix.
3. Repeat this procedure to locate all of the helical regions.
4. Identify regions where 3 out of 5 residues have P(β)>1.00: β-sheet. Extend the sheet in both directions until four contiguous residues have an average P(β)<1.00: end of β-sheet. If P(β)>1.05 and P(β)>P(α): β-sheet.
5. Rest: P(α)>P(β) -> α-helix. P(β)>P(α) -> β-sheet.
6. To identify a bend at residue number i, calculate the following value: p(t) = f(i)f(i+1)f(i+2)f(i+3)
   If: (1) p(t) > 0.000075; (2) average P(t)>1.00 in the tetrapeptide; and (3) averages for the tetrapeptide obey P(α)<P(t)>P(β): β-turn.
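A minimal sketch of the nucleation part of step 2 only: it scans windows of 6 residues and reports those where at least 4 residues are helix formers. It does not extend helices, and it does not handle sheets or turns. The helix values are the P(a) column of the preference table keyed by one-letter code, with a cutoff of 1.00 on that scale (equivalent to 100 on a percent scale). The function name, its parameters and the test peptide are assumptions for illustration.

```python
# Helix preference parameters, copied from the P(a) column of the table.
P_ALPHA = {
    "A": 1.45, "R": 0.79, "N": 0.73, "D": 0.98, "C": 0.77,
    "Q": 1.17, "E": 1.53, "G": 0.53, "H": 1.24, "I": 1.00,
    "L": 1.34, "K": 1.07, "M": 1.20, "F": 1.12, "P": 0.59,
    "S": 0.79, "T": 0.82, "W": 1.14, "Y": 0.61, "V": 1.14,
}

def helix_nucleation_sites(seq, window=6, min_formers=4, threshold=1.00):
    """Return 0-based start indices of windows that satisfy the nucleation rule."""
    sites = []
    for start in range(len(seq) - window + 1):
        segment = seq[start:start + window]
        formers = sum(1 for aa in segment if P_ALPHA[aa] > threshold)
        if formers >= min_formers:
            sites.append(start)
    return sites

# Hypothetical peptide: a helix-friendly core flanked by Gly/Pro/Ser.
peptide = "GPGSAEALMKGPGS"
sites = helix_nucleation_sites(peptide)
print(sites)
```

Overlapping windows that pass the test would normally be merged into one candidate region before the extension step is applied.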
Primary sequence reveals important clues about a protein
DnaG E. coli ...EPNRLLVVEGYMDVVAL...
DnaG S. typ ...EPQRLLVVEGYMDVVAL...
DnaG B. subt ...KQERAVLFEGFADVYTA...
gp4 T3 ...GGKKIVVTEGEIDMLTV...
gp4 T7 ...GGKKIVVTEGEIDALTV...
: *: :: * * : :
Color legend: small hydrophobic, large hydrophobic, polar, positive charge, negative charge
• Evolution conserves amino acids that are important to protein
structure and function across species. Sequence comparison of
multiple “homologs” of a particular protein reveals highly
conserved regions that are important for function.
• Clusters of conserved residues are called “motifs” -- motifs
carry out a particular function or form a particular structure
that is important for the conserved protein.
motif
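The idea of reading conserved positions off a set of homologs can be illustrated with the five aligned fragments from this slide. The sketch below reports the columns where all sequences carry the same residue; the function name conserved_columns is an assumption, and the fragments are treated as an already aligned, gap-free block.

```python
# Aligned fragments copied from the slide (equal length, no gaps).
alignment = {
    "DnaG E. coli":  "EPNRLLVVEGYMDVVAL",
    "DnaG S. typ":   "EPQRLLVVEGYMDVVAL",
    "DnaG B. subt":  "KQERAVLFEGFADVYTA",
    "gp4 T3":        "GGKKIVVTEGEIDMLTV",
    "gp4 T7":        "GGKKIVVTEGEIDALTV",
}

def conserved_columns(seqs):
    """Return (index, residue) for every column where all sequences agree."""
    hits = []
    for i, column in enumerate(zip(*seqs)):
        if len(set(column)) == 1:
            hits.append((i, column[0]))
    return hits

conserved = conserved_columns(list(alignment.values()))
print(conserved)
```

Three columns are identical in all five sequences: the E and G of the shared "EG" pair and a D three positions further on, which together mark the core of the motif.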
Hydropathy index of amino acids
• The hydropathy index of an amino acid is a number representing the hydrophobic or hydrophilic properties of its side-chain.
• It was proposed by Jack Kyte and Russell Doolittle in 1982.
• The larger the number is, the more hydrophobic the amino acid. The most hydrophobic amino acids are isoleucine (4.5) and valine (4.2). The most hydrophilic ones are arginine (-4.5) and lysine (-3.9).
• This is very important in protein structure; hydrophobic amino acids tend to be internal in the protein 3D structure, while hydrophilic amino acids are more commonly found towards the protein surface.
Kyte-Doolittle Hydropathy Plot (http://gcat.davidson.edu/DGPB/kd/kyte-doolittle.htm)
Possible transmembrane fragment
Window size – 9, strong negative peaks indicate possible surface regions
Surface region of a protein
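A hydropathy plot is a sliding-window average of per-residue hydropathy values. The sketch below computes such a profile with a window of 9, as in the plot caption. The scale is the Kyte-Doolittle scale as it is commonly tabulated; the function name hydropathy_profile and the test peptide are assumptions for illustration.

```python
# Kyte-Doolittle hydropathy values (one-letter codes), as commonly tabulated.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=9):
    """Average hydropathy of every full window along the sequence."""
    return [
        sum(KD[aa] for aa in seq[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]

# Hypothetical peptide: charged ends around a hydrophobic stretch.
peptide = "KKRDE" + "ILVILVILV" + "EDRKK"
profile = hydropathy_profile(peptide)
peak_start = profile.index(max(profile))
print(peak_start, round(max(profile), 2))
```

Positive stretches of the profile point to buried or membrane-spanning segments, and strongly negative stretches to likely surface regions; each value belongs to the window starting at that index.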
Prediction of transmembrane helices in proteins
(TMHMM)
5-hydroxytryptamine receptor 2A (Mus musculus)
5-hydroxytryptamine receptor 2 (Graphical output)

2015 bioinformatics bio_python_part3

  • 1.
  • 4.
  • 5. GitHub: Hosted GIT • Largest open source git hosting site • Public and private options • User-centric rather than project-centric • http://github.ugent.be (use your Ugent login and password) – Accept invitation from Bioinformatics-I- 2015 URI: – https://github.ugent.be/Bioinformatics-I- 2015/Python.git
  • 6. Control Structures if condition: statements [elif condition: statements] ... else: statements while condition: statements for var in sequence: statements break continue
  • 7. Lists • Flexible arrays, not Lisp-like linked lists • a = [99, "bottles of beer", ["on", "the", "wall"]] • Same operators as for strings • a+b, a*3, a[0], a[-1], a[1:], len(a) • Item and slice assignment • a[0] = 98 • a[1:2] = ["bottles", "of", "beer"] -> [98, "bottles", "of", "beer", ["on", "the", "wall"]] • del a[-1] # -> [98, "bottles", "of", "beer"]
  • 8. Dictionaries • Hash tables, "associative arrays" • d = {"duck": "eend", "water": "water"} • Lookup: • d["duck"] -> "eend" • d["back"] # raises KeyError exception • Delete, insert, overwrite: • del d["water"] # {"duck": "eend", "back": "rug"} • d["back"] = "rug" # {"duck": "eend", "back": "rug"} • d["duck"] = "duik" # {"duck": "duik", "back": "rug"}
  • 9. Regex.py text = 'abbaaabbbbaaaaa' pattern = 'ab' for match in re.finditer(pattern, text): s = match.start() e = match.end() print ('Found "%s" at %d:%d' % (text[s:e], s, e)) m = re.search("^([A-Z]) ",line) if m: from_letter = m.groups()[0]
  • 10. Install Biopython pip is the preferred installer program. Starting with Python 3.4, it is included by default with the Python binary installers. pip3.5 install Biopython #pip3.5 install yahoo_finance from yahoo_finance import Share yahoo = Share('AAPL') print (yahoo.get_open())
  • 11.
  • 12. BioPython • Make a histogram of the MW (in kDa) of all proteins in Swiss-Prot • Find the most basic and most acidic protein in Swiss-Prot? • Biological relevance of the results ? From AAIndex H ZIMJ680104 D Isoelectric point (Zimmerman et al., 1968) R LIT:2004109b PMID:5700434 A Zimmerman, J.M., Eliezer, N. and Simha, R. T The characterization of amino acid sequences in proteins by statistical methods J J. Theor. Biol. 21, 170-201 (1968) C KLEP840101 0.941 FAUJ880111 0.813 FINA910103 0.805 I A/L R/K N/M D/F C/P Q/S E/T G/W H/Y I/V 6.00 10.76 5.41 2.77 5.05 5.65 3.22 5.97 7.59 6.02 5.98 9.74 5.74 5.48 6.30 5.68 5.66 5.89 5.66 5.96
  • 13. Biopython AAindex ? Dictionary Hydrophobicity = {A:6.00,L:5.98,R:10.76,K:9.74,N:5.41,M:5.74,D:2.77,F:5.48, C:5.05,P:6.30,Q:5.65,S:5.68,E:3.22,T:5.66,G:5.97,W:5.89, H:7.59,Y:5.66,I:6.02,V:5.96} from Bio import SeqIO c=0 handle = open(r'/Users/wvcrieki/Downloads/uniprot_sprot.dat') for seq_rec in SeqIO.parse(handle, "swiss"): print (seq_rec.id) print (repr(seq_rec.seq)) print (len(seq_rec)) c+=1 if c>5: break
  • 16.
  • 17. Parsing sequences from the net Parsing GenBank records from the net Parsing SwissProt sequence from the net Handles are not always from files >>>from Bio import Entrez >>>from Bio import SeqIO >>>handle = Entrez.efetch(db="nucleotide",rettype="fasta",id="6273291") >>>seq_record = SeqIO.read(handle,”fasta”) >>>handle.close() >>>seq_record.description >>>from Bio import ExPASy >>>from Bio import SeqIO >>>handle = ExPASy.get_sprot_raw("6273291") >>>seq_record = SeqIO.read(handle,”swiss”) >>>handle.close() >>>print seq_record.id >>>print seq_record.name >>>prin seq_record.description
  • 20. Extra Questions (2) • How many human proteins in Swiss Prot ? • What is the longest human protein ? The shortest ? • Calculate for all human proteins their MW and pI, display as two histograms (2D scatter ?) • How many human proteins have “cancer” in their description? • Which genes has the highest number of SNPs/somatic mutations (COSMIC) • How many human DNA-repair enzymes are represented in Swiss Prot (using description / GO)? • List proteins that only contain alpha-helices based on the Chou-Fasman algorithm • List proteins based on the number of predicted transmembrane regions (Kyte-Doollittle)
  • 21.  Amino acid sequences fold onto themselves to become a biologically active molecule. There are three types of local segments: Helices: Where protein residues seem to be following the shape of a spring. The most common are the so-called alpha helices Extended or Beta-strands: Where residues are in line and successive residues turn back to each other Random coils: When the amino acid chain is neither helical nor extended Secondary structure of protein
  • 22. Chou-Fasman Algorithm Chou, P.Y. and Fasman, G.D. (1974). Conformational parameters for amino acids in helical, b-sheet, and random coil regions calculated from proteins.Biochemistry 13, 211-221. Chou, P.Y. and Fasman, G.D. (1974). Prediction of protein conformation. Biochemistry 13, 222-245. Analyzed the frequency of the 20 amino acids in alpha helices, Beta sheets and turns. • Ala (A), Glu (E), Leu (L), and Met (M) are strong predictors of  helices • Pro (P) and Gly (G) break  helices. • When 4 of 5 amino acids have a high probability of being in an alpha helix, it predicts a alpha helix. • When 3 of 5 amino acids have a high probability of being in a b strand, it predicts a b strand. • 4 amino acids are used to predict turns.
  • 23. Calculation of Propensities Pr[i|b-sheet]/Pr[i], Pr[i|-helix]/Pr[i], Pr[i|other]/Pr[i] determine the probability that amino acid i is in each structure, normalized by the background probability that i occurs at all. Example. let's say that there are 20,000 amino acids in the database, of which 2000 are serine, and there are 5000 amino acids in helical conformation, of which 500 are serine. Then the helical propensity for serine is: (500/5000) / (2000/20000) = 1.0
  • 24. Calculation of preference parameters • Preference parameter > 1.0 -> the residue has a preference for the given secondary structure. • Preference parameter = 1.0 -> the residue neither prefers nor disfavors the given secondary structure. • Preference parameter < 1.0 -> the residue disfavors the given secondary structure.
  • 25. Preference parameters
      Residue  P(a)  P(b)  P(t)  f(i)   f(i+1)  f(i+2)  f(i+3)
      Ala      1.45  0.97  0.57  0.049  0.049   0.034   0.029
      Arg      0.79  0.90  1.00  0.051  0.127   0.025   0.101
      Asn      0.73  0.65  1.68  0.101  0.086   0.216   0.065
      Asp      0.98  0.80  1.26  0.137  0.088   0.069   0.059
      Cys      0.77  1.30  1.17  0.089  0.022   0.111   0.089
      Gln      1.17  1.23  0.56  0.050  0.089   0.030   0.089
      Glu      1.53  0.26  0.44  0.011  0.032   0.053   0.021
      Gly      0.53  0.81  1.68  0.104  0.090   0.158   0.113
      His      1.24  0.71  0.69  0.083  0.050   0.033   0.033
      Ile      1.00  1.60  0.58  0.068  0.034   0.017   0.051
      Leu      1.34  1.22  0.53  0.038  0.019   0.032   0.051
      Lys      1.07  0.74  1.01  0.060  0.080   0.067   0.073
      Met      1.20  1.67  0.67  0.070  0.070   0.036   0.070
      Phe      1.12  1.28  0.71  0.031  0.047   0.063   0.063
      Pro      0.59  0.62  1.54  0.074  0.272   0.012   0.062
      Ser      0.79  0.72  1.56  0.100  0.095   0.095   0.104
      Thr      0.82  1.20  1.00  0.062  0.093   0.056   0.068
      Trp      1.14  1.19  1.11  0.045  0.000   0.045   0.205
      Tyr      0.61  1.29  1.25  0.136  0.025   0.110   0.102
      Val      1.14  1.65  0.30  0.023  0.029   0.011   0.029
  • 26. Applying algorithm 1. Assign parameters (propensities) to each residue. 2. Identify regions (nucleation sites) where 4 out of 6 residues have P(a)>1.00: α-helix. Extend the helix in both directions until four contiguous residues have an average P(a)<1.00: end of α-helix. If the segment is longer than 5 residues and its average P(a)>P(b): α-helix. 3. Repeat this procedure to locate all of the helical regions. 4. Identify regions where 3 out of 5 residues have P(b)>1.00: β-sheet. Extend the sheet in both directions until four contiguous residues have an average P(b)<1.00: end of β-sheet. If the average P(b)>1.05 and P(b)>P(a): β-sheet. 5. Rest: P(a)>P(b) -> α-helix. P(b)>P(a) -> β-sheet. 6. To identify a bend at residue number i, calculate the following value: p(t) = f(i)f(i+1)f(i+2)f(i+3). If (1) p(t) > 0.000075; (2) the average P(t)>1.00 in the tetrapeptide; and (3) the averages for the tetrapeptide obey P(a)<P(t)>P(b): β-turn.
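Step 2 (finding helix nucleation sites) can be sketched in Python as follows. This is a partial illustration only, not a full Chou-Fasman implementation: it covers the "4 out of 6" nucleation test and leaves out helix extension, sheets and turns. The P(a) values are taken from the preference-parameter table (slide 25), on its scale where 1.00 corresponds to a threshold of 100. The names `P_ALPHA` and `helix_nuclei` are hypothetical.

```python
# P(a) values from the preference-parameter table, keyed by one-letter code.
P_ALPHA = {
    "A": 1.45, "R": 0.79, "N": 0.73, "D": 0.98, "C": 0.77,
    "Q": 1.17, "E": 1.53, "G": 0.53, "H": 1.24, "I": 1.00,
    "L": 1.34, "K": 1.07, "M": 1.20, "F": 1.12, "P": 0.59,
    "S": 0.79, "T": 0.82, "W": 1.14, "Y": 0.61, "V": 1.14,
}


def helix_nuclei(seq, window=6, needed=4, threshold=1.00):
    """Return start positions (0-based) of windows that qualify as
    helix nucleation sites: at least `needed` of `window` residues
    have P(a) strictly above `threshold`."""
    starts = []
    for start in range(len(seq) - window + 1):
        formers = sum(1 for aa in seq[start:start + window]
                      if P_ALPHA[aa] > threshold)
        if formers >= needed:
            starts.append(start)
    return starts


# Toy sequence (made up): four helix formers followed by helix breakers.
nuclei = helix_nuclei("AELMGPGPGP")
print(nuclei)
```

Only the first window (AELMGP) contains four residues with P(a) > 1.00, so it is the only nucleation site reported for this toy sequence.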
  • 28. Primary sequence reveals important clues about a protein
      DnaG E. coli   ...EPNRLLVVEGYMDVVAL...
      DnaG S. typ    ...EPQRLLVVEGYMDVVAL...
      DnaG B. subt   ...KQERAVLFEGFADVYTA...
      gp4 T3         ...GGKKIVVTEGEIDMLTV...
      gp4 T7         ...GGKKIVVTEGEIDALTV...
      Conservation marks: : *: :: * * : :
      Color legend: small hydrophobic, large hydrophobic, polar, positive charge, negative charge
    • Evolution conserves amino acids that are important to protein structure and function across species. Sequence comparison of multiple “homologs” of a particular protein reveals highly conserved regions that are important for function. • Clusters of conserved residues are called “motifs”: a motif carries out a particular function or forms a particular structure that is important for the conserved protein.
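As a small illustration of how conserved positions can be read off an alignment, the sketch below reports the columns that are identical in all five fragments shown on the slide. It assumes the fragments are already aligned without gaps, as displayed. The helper name `conserved_columns` is made up for this example.

```python
# The five aligned fragments from the slide (flanking "..." removed).
alignment = {
    "DnaG E. coli": "EPNRLLVVEGYMDVVAL",
    "DnaG S. typ":  "EPQRLLVVEGYMDVVAL",
    "DnaG B. subt": "KQERAVLFEGFADVYTA",
    "gp4 T3":       "GGKKIVVTEGEIDMLTV",
    "gp4 T7":       "GGKKIVVTEGEIDALTV",
}


def conserved_columns(seqs):
    """Return 0-based indices of columns where every sequence
    carries the same residue."""
    return [i for i, column in enumerate(zip(*seqs))
            if len(set(column)) == 1]


seqs = list(alignment.values())
cols = conserved_columns(seqs)
print(cols)                          # [8, 9, 12]
print([seqs[0][i] for i in cols])    # ['E', 'G', 'D']
```

The three fully conserved positions (E, G and D) correspond to the three "*" marks in the conservation line.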
  • 29. Hydropathy index of amino acids • The hydropathy index of an amino acid is a number representing the hydrophobic or hydrophilic properties of its side chain. • It was proposed by Jack Kyte and Russell Doolittle in 1982. • The larger the number, the more hydrophobic the amino acid. The most hydrophobic amino acids are isoleucine (4.5) and valine (4.2). The most hydrophilic ones are arginine (-4.5) and lysine (-3.9). • This is very important in protein structure: hydrophobic amino acids tend to be buried inside the protein 3D structure, while hydrophilic amino acids are more commonly found towards the protein surface.
  • 32. Surface region of a protein • With a window size of 9, strong negative peaks in the hydropathy plot indicate possible surface regions.
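A hydropathy plot like this one can be sketched as a sliding-window average over the Kyte-Doolittle scale. The code below is a minimal, self-contained sketch. The scale values are the published Kyte-Doolittle (1982) indices, while the names `KD` and `hydropathy_profile` and the toy sequence are made up for illustration.

```python
# Kyte-Doolittle hydropathy index per amino acid (one-letter code).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=9):
    """Return the mean hydropathy of every window of length `window`.

    Element k of the result covers residues k .. k+window-1, so the
    profile has len(seq) - window + 1 values.
    """
    scores = [KD[aa] for aa in seq]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


# Toy sequence: a strongly hydrophilic stretch followed by a hydrophobic one.
profile = hydropathy_profile("RRRRRRRRRIIIIIIIII")
print(min(profile), max(profile))
```

Windows with strongly negative averages are candidate surface regions. Long stretches with strongly positive averages, typically scanned with a wider window, are candidate transmembrane segments.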
  • 33. Prediction of transmembrane helices in proteins (TMHMM)
  • 35. 5-hydroxytryptamine receptor 2 (Graphical output)