Protein Modeling using MODELLER
Asian University For Women
BINF 3016: Protein Modeling (Lab#06)
Protein Modeling: MODELLER
Fall 2022
This content is prepared by Syed Mohammad Lokman (Adjunct Faculty, Bioinformatics & Environmental Sciences),
Asian University For Women, Chittagong, Bangladesh.
References are cited in between the contents.
6.1. Setting up your Google Colab:
1. Visit https://colab.research.google.com/ to access Google Colab, and switch to your
academic account if it is not already selected.
2. Create a new notebook: File > New notebook
3. Rename the notebook “Protein_Modeling_Modeller_yourIDnumber”.
4. Connect to a hosted runtime by clicking the “Connect” button.
5. Click on the “Files” menu in the left panel.
6. Take a moment to explore the Google Colab UI.
Fig: (1) To insert Code or Text; (2) To access files in the current runtime; (3) Code Snippet
7. Enter the following code in a code cell and click the “Play/Run” button to check
that Google Colab is working properly:
import time
print(time.ctime())
8. If the current time is displayed, Google Colab is working properly.
6.2. Installing the Necessary Software: MODELLER, BioPython, and py3Dmol:
1. Remember some Google Colab Shortcuts:
Ctrl + M + M => to create Text Cell
Ctrl + M + B => to create Code Snippet
Ctrl + M + D => to Delete Current Cell
Ctrl + Enter => Execute the Code
2. Installing MODELLER:
a. In order to use MODELLER, you will need to obtain an Academic License by
registering on this website https://salilab.org/modeller/registration.html. The
license key will be immediately sent to your email address.
b. Before running this script, make sure to replace YOUR_LICENSE_KEY with the
license key sent to you after registering on the MODELLER website.
#1. Installing MODELLER
!wget https://salilab.org/modeller/10.3/modeller-10.3.tar.gz
!tar -zxf modeller-10.3.tar.gz
!echo "MODELLER extraction completed"
%cd modeller-10.3
#2. For installing, including a license key
# Each answer to the installer's prompts must end with a newline ("\n")
with open('modeller_config', 'a') as f:
    f.write("2\n") #2 for selecting x86_64 (Opteron/EM64T) box (Linux)
    f.write("/content/compiled/MODELLER\n")
    f.write("YOUR_LICENSE_KEY\n") #ADD YOUR LICENSE KEY HERE!
!./Install < modeller_config
!echo "MODELLER set up completed"
!ln -sf /content/compiled/MODELLER/bin/mod10.3 /usr/bin/
#Checking if MODELLER works (keep the whole command on one line)
!mod10.3 | awk 'NR==1{if($1=="usage:") print "MODELLER successfully installed"; else if($1!="usage:") print "Something went wrong. Please install again"}'
%pwd
3. Installing BioPython and py3Dmol:
a. Execute the following code in Google Colab:
#1. Installing biopython using pip
!pip install biopython
#2. Installing py3Dmol using pip
!pip install py3Dmol
#3. And importing the py3Dmol module
import py3Dmol
6.3. Building Profile for Target Protein Sequence
1. Create a Directory (Folder) in your Colab Files Section to gather all the necessary files
together:
a. In Files section, Right Click > New Folder > Rename the folder as “lab6”
2. Prepare your Target Sequence File:
a. Click on the three-dot menu button beside the “lab6” directory. Then click on
“New File” to create a new file for the sequence. Rename the file as “target.ali”.
b. Open “target.ali” by double clicking on the file. Paste the following code in the
editor section:
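The target sequence itself is not reproduced here. As a rough sketch of the PIR layout that MODELLER reads, the hypothetical helper make_pir_entry below builds an entry whose code is “target” (the align code used later by the alignment script). The residues shown are a made-up placeholder, not the lab's target sequence.

```python
def make_pir_entry(code, sequence, width=75):
    """Format one sequence as a MODELLER-style PIR entry (illustrative helper)."""
    # Remove whitespace, upper-case the residues, and add the terminating '*'
    seq = "".join(sequence.split()).upper() + "*"
    body = [seq[i:i + width] for i in range(0, len(seq), width)]
    header = ">P1;%s" % code
    # Ten colon-separated fields; 'sequence' marks an entry without a structure
    info = "sequence:%s:::::::0.00: 0.00" % code
    return "\n".join([header, info] + body) + "\n"

# Placeholder residues for illustration only (NOT the lab's target protein)
entry = make_pir_entry("target", "MKTAYIAKQR QISFVKSHFS")
print(entry)
```

To use it, replace the placeholder string with your own sequence and save the printed text as target.ali.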
4. Search for templates for your target sequence
a. Create a new file named “build_profile.py” in the “lab6” folder and open the file.
b. Modify “build_profile.py” script as follows:
from modeller import *
log.verbose()
env = Environ()
sdb = SequenceDB(env)
sdb.read(seq_database_file='pdb_95.pir', seq_database_format='PIR',
         chains_list='ALL', minmax_db_seq_len=(30, 4000),
         clean_sequences=True)
sdb.write(seq_database_file='pdb_95.bin', seq_database_format='BINARY',
          chains_list='ALL')
sdb.read(seq_database_file='pdb_95.bin', seq_database_format='BINARY',
         chains_list='ALL')
aln = Alignment(env)
#Change according to your File Name here
aln.append(file='target.ali', alignment_format='PIR', align_codes='ALL')
prf = aln.to_profile()
prf.build(sdb, matrix_offset=-450, rr_file='${LIB}/blosum62.sim.mat',
          gap_penalties_1d=(-500, -50), n_prof_iterations=1,
          check_profile=False, max_aln_evalue=0.01)
prf.write(file='build_profile.prf', profile_format='TEXT')
aln = prf.to_alignment()
#-- Write out the alignment file
aln.write(file='build_profile.ali', alignment_format='PIR')
c. Execute build_profile.py using the following code to find templates for the target:
#1. Running the profile-build script
!mod10.3 build_profile.py
#2. Printing only the list of potential templates (keep the whole command on one line)
!sed -n '/HITS FOUND IN ITERATION: 1/,/Weight Matrix/p;/Weight Matrix/q' build_profile.log
d. The most important columns in the Profile.build() output are the second, tenth,
eleventh, and twelfth columns.
i. The second column reports the code of the PDB sequence that was
compared with the target sequence. The PDB code in each line is the
representative of a group of PDB sequences that share 95% or more
sequence identity to each other and have less than 30 residues or 30%
sequence length difference.
ii. The eleventh column reports the percentage sequence identities
between target and a PDB sequence normalized by the lengths of the
alignment (indicated in the tenth column). In general, a sequence identity
value above approximately 25% indicates a potential template unless the
alignment is short (i.e., less than 100 residues).
iii. A better measure of the significance of the alignment is given in the
twelfth column by the e-value of the alignment.
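The column rules above can be sketched in Python. This is a minimal illustration that assumes each hit line is whitespace-separated with the PDB code in field 2, the alignment length in field 10, the percent identity in field 11, and the e-value in field 12. The function names and the sample lines are made up for the example, and the thresholds (25% identity, 100 residues, e-value 0.01) are taken from the text and from build_profile.py.

```python
def parse_profile_hits(lines):
    """Collect code, alignment length, identity and e-value from hit lines."""
    hits = []
    for line in lines:
        if line.lstrip().startswith("#"):
            continue  # skip comment/header lines
        fields = line.split()
        if len(fields) < 12:
            continue  # not a hit line
        try:
            hit = {"code": fields[1],
                   "aln_length": int(fields[9]),
                   "identity": float(fields[10]),
                   "evalue": float(fields[11])}
        except ValueError:
            continue  # a field was not numeric, so this is not a hit line
        hits.append(hit)
    return hits

def candidate_templates(hits, min_identity=25.0, min_length=100, max_evalue=0.01):
    """Keep hits that are long enough, similar enough and significant."""
    keep = [h for h in hits
            if h["identity"] > min_identity
            and h["aln_length"] >= min_length
            and h["evalue"] <= max_evalue]
    return sorted(keep, key=lambda h: h["evalue"])  # most significant first

# Synthetic lines for illustration only (codes and numbers are invented)
sample = [
    "# Number of sequences: 3",
    "    1 target    S  0  280   1  280   0   0    0   0.   0.0",
    "    2 1abcA     X  1  300   5  270  10 275  260  38.  0.1E-20",
    "    3 2xyzB     X  1  150  40  110  35 105   70  31.  0.5E-03",
]
selected = candidate_templates(parse_profile_hits(sample))
print([h["code"] for h in selected])
```

In the notebook, the same functions could be applied to the lines of build_profile.prf once you have checked that its columns match the assumed layout.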
e. To select the most appropriate template for the query sequence, compare the
candidate templates with one another.
i. Download the PDB files using the following BioPython code:
#Downloading the PDB files using biopython
import os
from Bio.PDB import PDBList
templates = ['5ftw', '5xlx', '5xly', '1af7']
pdbl = PDBList()
for s in templates:
    # Each file is saved as pdb<code>.ent in the current directory,
    # then renamed to <code>.pdb
    pdbl.retrieve_pdb_file(s, pdir='.', file_format="pdb", overwrite=True)
    os.rename("pdb"+s+".ent", s+".pdb")
ii. Create a new file named “compare.py” and modify it as follows:
from modeller import *
env = Environ()
aln = Alignment(env)
# Read chain A of each candidate template and add it to the alignment
for (pdb, chain) in (('5ftw', 'A'), ('5xlx', 'A'), ('5xly', 'A'),
                     ('1af7', 'A')):
    m = Model(env, file=pdb, model_segment=('FIRST:'+chain, 'LAST:'+chain))
    aln.append_model(m, atom_files=pdb, align_codes=pdb+chain)
aln.malign()
aln.malign3d()
aln.compare_structures()
aln.id_table(matrix_file='family.mat')
env.dendrogram(matrix_file='family.mat', cluster_cut=-1.0)
iii. Execute compare.py:
#1. Running the compare script
!mod10.3 compare.py
#2. Check the log file
!sed -ne '/Sequence identity comparison (ID_TABLE):/,$ p' compare.log
f. From the comparison, select the best template for modeling
5. Aligning Target-Template:
a. Create a new file named “align2d.py” and modify it as follows:
from modeller import *
env = Environ()
aln = Alignment(env)
#Provide the PDB code of your template in the next two lines.
mdl = Model(env, file='1af7', model_segment=('FIRST:A','LAST:A'))
aln.append_model(mdl, align_codes='1af7A', atom_files='1af7.pdb')
aln.append(file='target.ali', align_codes='target')
aln.align2d(max_gap_length=50)
aln.write(file='aligned.fasta', alignment_format='FASTA')
aln.write(file='aligned.ali', alignment_format='PIR')
aln.write(file='aligned.pap', alignment_format='PAP')
b. Execute align2d.py as follows:
#1. Running the align2D script
!mod10.3 align2d.py
c. You will end up with three new files (aligned.ali, aligned.fasta, and aligned.pap) that
contain the pairwise alignment of the target and template sequences.
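As a quick sanity check on the alignment, the percent identity between the two aligned sequences can be computed by hand. The sketch below assumes a FASTA file with two gapped sequences of equal length in which “-” marks a gap. The helper names and the short sequences are invented for illustration, and the result will not necessarily match MODELLER's own identity figures, which may be normalized differently.

```python
def read_fasta(text):
    """Return {name: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # first word after '>' is the name
            records[name] = []
        elif name is not None:
            records[name].append(line)
    # Join the sequence lines; drop a trailing '*' terminator if one is present
    return {k: "".join(v).rstrip("*") for k, v in records.items()}

def percent_identity(seq_a, seq_b):
    """Identity over the columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    aligned = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not aligned:
        return 0.0
    matches = sum(1 for a, b in aligned if a == b)
    return 100.0 * matches / len(aligned)

# Made-up alignment for illustration only
sample_fasta = """>templateA
MKT-AYIAKQ
>target
MKTLAYLA-Q
"""
seqs = read_fasta(sample_fasta)
identity = percent_identity(seqs["templateA"], seqs["target"])
print("Identity: %.1f%%" % identity)
```

In the notebook, sample_fasta could be replaced with open('aligned.fasta').read() and the two names with the record names found in that file.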
6. Model Building:
a. You now have the three inputs needed to build models: (1) the target
sequence, (2) the template structure, and (3) the alignment.
b. Create a new file named “model-single.py” and modify it as follows:
from modeller import *
from modeller.automodel import *
env = Environ()
a = AutoModel(env, alnfile='aligned.ali',
              knowns='1af7A', sequence='target',
              assess_methods=(assess.DOPE,
                              #soap_protein_od.Scorer(),
                              assess.GA341))
a.starting_model = 1
a.ending_model = 50
a.make()
# Get a list of all successfully built models from a.outputs
ok_models = [x for x in a.outputs if x['failure'] is None]
# Rank the models by DOPE score (lower is better). Pairing each score with
# its model name lets a plain sort work under both Python 2 and Python 3.
key = 'DOPE score'
ranked = [(x[key], x['name']) for x in ok_models]
ranked.sort()
# Get top model
score, name = ranked[0]
print("Top model: %s (DOPE score %.3f)" % (name, score))
c. Execute model-single.py as follows:
#1.Running the model-single script
!mod10.3 model-single.py
d. The model-single.log output reports a score for each structure according to
MODELLER’s DOPE (Discrete Optimized Protein Energy) potential. The log file
gives a summary of all the models built, and its last line names the best
model according to the DOPE score.
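To pick the model numbers for the visualization step without reading the log by eye, the summary can be parsed in the notebook. This is a sketch that assumes each summary line starts with a model file name ending in “.pdb”, followed by the molpdf value and then the DOPE score. The function names and the sample log text are made up for illustration.

```python
def parse_model_summary(log_text):
    """Extract model name, molpdf and DOPE score from summary lines."""
    models = []
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) < 3 or not fields[0].endswith(".pdb"):
            continue  # not a model summary line
        try:
            models.append({"name": fields[0],
                           "molpdf": float(fields[1]),
                           "dope": float(fields[2])})
        except ValueError:
            continue  # non-numeric columns, so skip the line
    return models

def best_by_dope(models, n=2):
    """Return the n models with the lowest (best) DOPE scores."""
    return sorted(models, key=lambda m: m["dope"])[:n]

# Synthetic log excerpt for illustration only (values are invented)
sample_log = """>> Summary of successfully produced models:
Filename                          molpdf     DOPE score    GA341 score
----------------------------------------------------------------------
target.B99990001.pdb          1500.10000   -30100.50000        1.00000
target.B99990002.pdb          1420.30000   -30550.25000        1.00000
target.B99990003.pdb          1610.80000   -29800.75000        1.00000
"""
top = best_by_dope(parse_model_summary(sample_log), n=2)
for m in top:
    print(m["name"], m["dope"])
```

With the real log, sample_log could be replaced by open('model-single.log').read(), after confirming that the summary table has the assumed columns.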
7. Model Visualization using py3Dmol:
a. Execute the following code (change the model numbers if necessary):
#1. Copying our best model with a new chain id (To Superimpose)
!sed "s/ A / E /g" target.B99990046.pdb > bestmodel.pdb
!sed "s/ A / D /g" target.B99990040.pdb > best2model.pdb
#2. Setting up py3Dmol for visualization
view=py3Dmol.view()
#3. Loading template
view.addModel(open('1af7.pdb', 'r').read(),'pdb')
#4. Loading best DOPE score model
view.addModel(open('bestmodel.pdb', 'r').read(),'pdb')
view.addModel(open('best2model.pdb', 'r').read(),'pdb')
#5. Zooming into all visualized structures
view.zoomTo()
#6. Here we set the background color as white
view.setBackgroundColor('white')
#7. Here we set the visualization style for chains
view.setStyle({'chain':'A'},{'cartoon': {'color':'purple'}})
view.setStyle({'chain':'E'},{'cartoon': {'color':'yellow'}})
view.setStyle({'chain':'D'},{'cartoon': {'color':'green'}})
#8. And we finally visualize the structures using the command below
view.show()
8. Download all the data of your “lab6” directory:
a. Execute the following code:
#1. Archive your files
!zip -r /content/lab6.zip /content/lab6
#2. Download from Google Colab
from google.colab import files
files.download("/content/lab6.zip")
9. Model Evaluation:
a. Visit the SAVES server (https://saves.mbi.ucla.edu/) and upload your best model
after renaming it “best-model.pdb”.
b. Consider the following factors:
i. VERIFY3D (i.e., the compatibility of an atomic 3D model with its own 1D
sequence, compared to the energetics of good structures from
the PDB).
Check the VERIFY3D results: more than 80% of the residues should have an
average score ≥ 0.2; the score profile lets you identify
conflicting regions.
ii. PROCHECK (stereochemical and geometrical quality of the model,
via Ramachandran plots, sidechain rotamers, etc).
Check the Ramachandran plot: Are there any residues outside the
allowed regions? What types of residues are found within those
regions? (Check it by clicking on each dot in the plot).
Check the errors in PROCHECK: are the errors located within the
loop regions?
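The VERIFY3D criterion above is simple arithmetic, sketched below. The per-residue scores are invented for illustration. In practice they would be read off the VERIFY3D output, and the function name is hypothetical.

```python
def verify3d_check(scores, cutoff=0.2, required_fraction=0.80):
    """Return (fraction of residues scoring >= cutoff, pass flag, low positions)."""
    if not scores:
        raise ValueError("no scores given")
    good = sum(1 for s in scores if s >= cutoff)
    fraction = good / len(scores)
    # 1-based positions of residues below the cutoff (candidate problem regions)
    low = [i for i, s in enumerate(scores, start=1) if s < cutoff]
    # The criterion is MORE than 80%, so exactly 80% does not pass
    return fraction, fraction > required_fraction, low

# Made-up averaged 3D-1D scores for a 10-residue stretch
scores = [0.35, 0.41, 0.18, 0.52, 0.27, 0.05, 0.33, 0.48, 0.29, 0.44]
fraction, passed, low = verify3d_check(scores)
print("%.0f%% of residues score >= 0.2; pass: %s; low positions: %s"
      % (100 * fraction, passed, low))
```

Here 8 of 10 residues reach the cutoff, which is exactly 80% and therefore does not satisfy the “more than 80%” rule; positions 3 and 6 are the ones to inspect.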
Lab#6: Exercise:
● Build a Model of Prolyl 4-hydroxylase 13 · Arabidopsis thaliana (UniProt: F4ILF8)
using MODELLER. You should follow the instructions given below:
○ Target File Name: “target_YOURidNUMBER.ali”
○ Submit Colab Notebook (Download the ipynb format)
○ Attach the Validation Report and, based on that report, interpret your
model
○ Download the lab6 folder and rename it as “lab6_YOURidNumber”
○ Upload all the files in a Drive Folder (Notebook, Validation Report,
Interpretation of Validation Report, Lab Folder) and Share the Drive Folder
as Assignment file.
● Submit the Assignment at least two days before the next class (23rd October,
2022). Submission after 23rd October will not be accepted.
References:
● https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007449
● https://saves.mbi.ucla.edu/
● http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2242414/
● http://salilab.org/modeller/9.13/manual/node255.html
About DOPE Score:
DOPE (Discrete Optimized Protein Energy) is a pairwise atomistic statistical potential
used to distinguish "good" models from "bad" ones. The lower the DOPE score, the
better the model. It works by comparing the energies of different models generated
from the same sequence, so it helps to select the best model in terms of energy.
The DOPE score is only useful for ranking the models generated for a single amino
acid sequence.