Zarlish attique 187104 project assignment modeller

Subject: Pharmacoinformatics
Government Post Graduate College Mandian Abbottabad
Assignment no 2: PROJECT ASSIGNMENT MODELLER
Submitted by:
Name: Zarlish Attique
Registration no: 187104
Subject: Pharmacoinformatics
Department: Bioinformatics
Semester: 5th
Submitted to:
Teacher Name: Sir Muhammad Imran Sharif
Department of Bioinformatics
Date of Submission: January 7, 2020
PROJECT QUESTIONS:
1. Take any protein sequence (make sure that its 3D structure is not present in the PDB
database) and predict the structure using MODELLER.
2. Write down the functions of the protein and its structural organization (number of beta sheets,
helices, etc.).
3. Write the methodology and results of the modeling procedure.

Homology Modeling
Homology modeling, also known as comparative modeling of proteins, refers to constructing an
atomic-resolution model of the "target" protein from its amino acid sequence and an
experimental three-dimensional structure of a related homologous protein (the "template").

ABOUT PROTEIN: dACE2
Truncated angiotensin converting enzyme 2
Primate-specific isoform of ACE2
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which causes COVID-19,
utilizes angiotensin-converting enzyme 2 (ACE2) for entry into target cells. ACE2 has been
proposed as an interferon-stimulated gene (ISG). Thus, interferon-induced variability in ACE2
expression levels could be important for susceptibility to COVID-19 or its outcomes. The
discovery of a novel, primate-specific isoform of ACE2, designated deltaACE2 (dACE2), has been
reported. The authors demonstrate that dACE2, but not ACE2, is an ISG. In vitro, dACE2, which
lacks 356 N-terminal amino acids, was non-functional in binding the SARS-CoV-2 spike protein
and as a carboxypeptidase. Their results reconcile current knowledge on ACE2 expression and
suggest that the ISG-type induction of dACE2 in IFN-high conditions created by treatments, an
inflammatory tumor microenvironment, or viral co-infections is unlikely to affect the cellular
entry of SARS-CoV-2 and promote infection.
An interferon-stimulated gene (ISG) is a gene whose expression is stimulated by interferon.
Interferons (IFNs) are a group of signaling proteins made and released by host cells in response
to the presence of several viruses. In a typical scenario, a virus-infected cell will
release interferons causing nearby cells to heighten their anti-viral defenses.

METHODOLOGY AND THE RESULTS OF PROTEIN STRUCTURE
PREDICTION AND MODELLER
1. Protein selection from UniProt with no known structure.
UniProt is a freely accessible database of protein sequence and functional information, with many
entries derived from genome sequencing projects. It contains a large amount of information
about the biological function of proteins derived from the research literature. In the first step, the
protein with UniProtKB entry A0A7D6JAD5_HUMAN was selected for the study.
Figure: As of 2-Jan-2020 the structure is unknown and is not present in the PDB either.
NCBI Nucleotide: https://www.ncbi.nlm.nih.gov/nucleotide/MT505392
NCBI Protein: https://www.ncbi.nlm.nih.gov/protein/1878857681
NCBI Taxonomy: https://www.ncbi.nlm.nih.gov/taxonomy/?term=9606
24-JUL-2020

Table: Entry information from UniProt.
Entry name:     A0A7D6JAD5_HUMAN
Accession:      A0A7D6JAD5 (primary, citable accession number)
Entry history:  Integrated into UniProtKB/TrEMBL: December 2, 2020
                Last sequence update: December 2, 2020
                Last modified: December 2, 2020
Entry status:   Unreviewed (UniProtKB/TrEMBL)
FASTA Sequence
A0A7D6JAD5_HUMAN taken from UniProt.
Length: 459
Mass (Da): 52,737
>tr|A0A7D6JAD5|A0A7D6JAD5_HUMAN Truncated angiotensin converting enzyme 2 OS=Homo sapiens OX=9606 GN=ACE2 PE=2 SV=1
MREAGWDKGGRILMCTKVTMDDFLTAHHEMGHIQYDMAYAAQPFLLRNGANEGFHEAVGE
IMSLSAATPKHLKSIGLLSPDFQEDNETEINFLLKQALTIVGTLPFTYMLEKWRWMVFKG
EIPKDQWMKKWWEMKREIVGVVEPVPHDETYCDPASLFHVSNDYSFIRYYTRTLYQFQFQ
EALCQAAKHEGPLHKCDISNSTEAGQKLFNMLRLGKSEPWTLALENVVGAKNMNVRPLLN
YFEPLFTWLKDQNKNSFVGWSTDWSPYADQSIKVRISLKSALGDKAYEWNDNEMYLFRSS
VAYAMRQYFLKVKNQMILFGEEDVRVANLKPRISFNFFVTAPKNVSDIIPRTEVEKAIRM
SRSRINDAFRLNDNSLEFLGIQPTLGPPNQPPVSIWLIVFGVVMGVIVVGIVILIFTGIR
DRKKKNKARSGENPYASIDISKGENNPGFQNTDDVQTSF
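As a quick sanity check, the FASTA record can be parsed to confirm that the sequence length matches the 459 residues reported by UniProt. The sketch below is a minimal, dependency-free parser; the helper name parse_fasta is an illustrative assumption and not part of any library.

```python
# Minimal single-record FASTA parser (parse_fasta is a hypothetical helper name).
FASTA_RECORD = """\
>tr|A0A7D6JAD5|A0A7D6JAD5_HUMAN Truncated angiotensin converting enzyme 2 OS=Homo sapiens OX=9606 GN=ACE2 PE=2 SV=1
MREAGWDKGGRILMCTKVTMDDFLTAHHEMGHIQYDMAYAAQPFLLRNGANEGFHEAVGE
IMSLSAATPKHLKSIGLLSPDFQEDNETEINFLLKQALTIVGTLPFTYMLEKWRWMVFKG
EIPKDQWMKKWWEMKREIVGVVEPVPHDETYCDPASLFHVSNDYSFIRYYTRTLYQFQFQ
EALCQAAKHEGPLHKCDISNSTEAGQKLFNMLRLGKSEPWTLALENVVGAKNMNVRPLLN
YFEPLFTWLKDQNKNSFVGWSTDWSPYADQSIKVRISLKSALGDKAYEWNDNEMYLFRSS
VAYAMRQYFLKVKNQMILFGEEDVRVANLKPRISFNFFVTAPKNVSDIIPRTEVEKAIRM
SRSRINDAFRLNDNSLEFLGIQPTLGPPNQPPVSIWLIVFGVVMGVIVVGIVILIFTGIR
DRKKKNKARSGENPYASIDISKGENNPGFQNTDDVQTSF
"""


def parse_fasta(text):
    """Return (header, sequence) for a single-record FASTA string."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not a FASTA record")
    # Header without the leading '>', sequence lines joined into one string
    return lines[0][1:], "".join(lines[1:])


header, sequence = parse_fasta(FASTA_RECORD)
accession = header.split("|")[1]   # second '|'-separated field of a UniProt header
print(accession, len(sequence))
```

If the printed length differs from the value on the UniProt entry page, the sequence was probably damaged while copying (for example by stray line breaks).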

2. Template recognition and initial alignment using BLASTp and PDB.
Template recognition and selection involved searching the PDB for homologous proteins with
determined structures. The search was performed with a simple sequence alignment program
(BLAST; FASTA is an alternative), because the percentage identity between the target sequence
and a possible template is high enough, in the "safe zone", to be detected by such programs. As a
rule of thumb, roughly 30 to 40% sequence identity is needed to generate a useful model. Here, in
the second step, the dACE2 sequence in FASTA format was submitted to BLASTp and searched
against the PDB database.
Figure: FASTA format of the query sequence with unknown structure.

Figure: Performing BLASTp from BLAST website.
Figure: The results of the BLASTp showing multiple outputs.

Table: The four selected template structures, chosen for the lowest E-value, greatest query
coverage and greatest percent identity. The description, scientific name, maximum score,
total score, query coverage, E-value, percent identity, accession length and PDB
accession are given.
Description | Scientific Name | Max Score | Total Score | Query Cover | E value | Per. Ident | Acc. Len | Accession
The 2019-nCoV RBD/ACE2-B0AT1 complex | Homo sapiens | 942 | 942 | 99% | 0.0 | 98.47% | 814 | 6M17_B
S protein of SARS-CoV-2 in complex bound with T-ACE2 | Homo sapiens | 942 | 942 | 99% | 0.0 | 98.47% | 817 | 7CT5_D
Cryo-EM structure of cat ACE2 and SARS-CoV-2 RBD | Felis catus | 723 | 723 | 85% | 0.0 | 86.55% | 732 | 7C8D_A
SARS Spike Glycoprotein - human ACE2 complex, Stabilized variant, all ACE2-bound particles | Homo sapiens | 560 | 560 | 59% | 0.0 | 96.35% | 605 | 6CS2_D
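The selection criteria named in the table caption (lowest E-value, then greatest query coverage, then greatest percent identity) can be expressed as a simple sort. The sketch below uses only the values from the table; the function name rank_hits and the tuple layout are illustrative assumptions.

```python
# BLASTp hits copied from the table:
# (accession, query cover in %, E-value, percent identity)
hits = [
    ("6CS2_D", 59, 0.0, 96.35),
    ("7C8D_A", 85, 0.0, 86.55),
    ("7CT5_D", 99, 0.0, 98.47),
    ("6M17_B", 99, 0.0, 98.47),
]


def rank_hits(hits):
    """Sort by lowest E-value, then highest query cover, then highest identity."""
    return sorted(hits, key=lambda h: (h[2], -h[1], -h[3]))


ranked = rank_hits(hits)
for accession, cover, evalue, identity in ranked:
    print(f"{accession}: cover={cover}% E={evalue} identity={identity}%")
```

Note that 6M17_B and 7CT5_D tie on all three criteria, so the sort alone cannot separate them; an additional criterion such as structural resolution is needed to pick a single template.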

3. Refinements of the structures taken from PDB using Chimera 1.15rc
Refinement of structure 1 using CHIMERA 1.15rc : 6M17
Figure: 6M17, chains A to V
Figure: 6M17, chain B
The structure was renamed seq1 for convenience; renaming is not required.

Refinement of structure 2 using CHIMERA 1.15rc : 7CT5
Figure: 7CT5, chains A to Z
Figure: 7CT5, chain D

Refinement of structure 3 using CHIMERA 1.15rc : 7C8D
Figure: 7C8D, chains A and B
Figure: 7C8D, chain A

Refinement of structure 4 using CHIMERA 1.15rc: 6CS2
Figure: 6CS2, chains A to Z
Figure: 6CS2, chain D

PREPARATION OF THE FIVE SCRIPTS FROM MODELLER TUTORIAL WEBSITE:
MODELLER STEPS
Now that we have our query sequence and the 3D templates have been identified, the next step is
to prepare the five scripts for MODELLER from the MODELLER tutorial website:
https://salilab.org/modeller/tutorial/basic.html
MODELLER
MODELLER is used for homology or comparative modeling of protein three-dimensional
structures. The user provides an alignment of a sequence to be modeled with known related
structures and MODELLER automatically calculates a model containing all non-hydrogen
atoms. MODELLER implements comparative protein structure modeling by satisfaction of
spatial restraints, and can perform many additional tasks, including de novo modeling of loops
in protein structures, optimization of various models of protein structure with respect to a
flexibly defined objective function, multiple alignment of protein sequences and/or structures,
clustering, searching of sequence databases, comparison of protein structures, etc. Some of these
tasks are used in the modelling steps below.
Figure: Modeller program interface for scripts execution.

4. MODELLER Step 1: Script_1 preparation to analyze the query sequence and
build a profile.
The query file is written in PIR format. The first line contains the sequence code, in the format
">P1;code". The second line, with ten fields separated by colons, generally contains information
about the structure file. Only two of these fields are used for sequences: "sequence" (indicating
that the file contains a sequence without known structure) and "TvLDH" (the model file name). The
rest of the file contains the sequence, with "*" marking its end. The code TvLDH comes from the
MODELLER tutorial and was kept for the dACE2 query, which is why the output models below are
named TvLDH.B9999....pdb.
>P1;TvLDH
sequence:TvLDH:::::::0.00: 0.00
(query sequence placed here)*
Figure: The query sequence, saved as an .ALI file with proper formatting.
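The PIR layout described above can also be produced from a FASTA record with a few lines of Python instead of editing by hand. This is a minimal sketch: the function name fasta_to_pir, the 75-character line width and the toy input are assumptions for illustration; only the two header lines follow the format given in the text.

```python
import textwrap


def fasta_to_pir(fasta_text, code):
    """Convert a single-record FASTA string into a MODELLER PIR sequence entry."""
    lines = [ln.strip() for ln in fasta_text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not a FASTA record")
    sequence = "".join(lines[1:])
    # The sequence must end with '*'; wrap it for readability
    body = textwrap.wrap(sequence + "*", width=75)
    header = [f">P1;{code}", f"sequence:{code}:::::::0.00: 0.00"]
    return "\n".join(header + body) + "\n"


# Toy example: the first 20 residues of the query, split over two FASTA lines
example = ">query\nMREAGWDKGG\nRILMCTKVTM\n"
pir_text = fasta_to_pir(example, "TvLDH")
print(pir_text)
```

Writing the returned string to a file named TvLDH.ali gives an input that matches the file name used by the tutorial scripts.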

Script 1 from the MODELLER website needs no changes; just save it as a .PY
file.
from modeller import *
log.verbose()
env = environ()
#-- Prepare the input files
#-- Read in the sequence database
sdb = sequence_db(env)
sdb.read(seq_database_file='pdb_95.pir', seq_database_format='PIR',
chains_list='ALL', minmax_db_seq_len=(30, 4000), clean_sequences=True)
#-- Write the sequence database in binary form
sdb.write(seq_database_file='pdb_95.bin', seq_database_format='BINARY',
chains_list='ALL')
#-- Now, read in the binary database
sdb.read(seq_database_file='pdb_95.bin', seq_database_format='BINARY',
chains_list='ALL')
#-- Read in the target sequence/alignment
aln = alignment(env)
aln.append(file='TvLDH.ali', alignment_format='PIR', align_codes='ALL')
#-- Convert the input sequence/alignment into
# profile format
prf = aln.to_profile()

#-- Scan sequence database to pick up homologous sequences
prf.build(sdb, matrix_offset=-450, rr_file='${LIB}/blosum62.sim.mat',
gap_penalties_1d=(-500, -50), n_prof_iterations=1,
check_profile=False, max_aln_evalue=0.01)
#-- Write out the profile in text format
prf.write(file='build_profile.prf', profile_format='TEXT')
#-- Convert the profile back to alignment format
aln = prf.to_alignment()
#-- Write out the alignment file
aln.write(file='build_profile.ali', alignment_format='PIR')
Figure: Script 1; save it as PY file.

5. MODELLER Step 2: Script_2 preparation to carry out MULTIPLE
SEQUENCE ALIGNMENT and PHYLOGENETIC TREE construction and
check the crystallographic resolution.
In script 2, the tutorial's PDB codes are replaced by the names of our four PDB structures, saved
as seq1, seq2, seq3 and seq4, together with their chain identifiers, i.e. B, D, A and D. The listing
below is the tutorial version.
Note: If you have more or fewer structures, add or delete entries in the list accordingly.
from modeller import *

env = environ()
aln = alignment(env)
# Read each template chain and add it to the alignment
for (pdb, chain) in (('1b8p', 'A'), ('1bdm', 'A'), ('1civ', 'A'),
                     ('5mdh', 'A'), ('7mdh', 'A'), ('1smk', 'A')):
    m = model(env, file=pdb, model_segment=('FIRST:'+chain, 'LAST:'+chain))
    aln.append_model(m, atom_files=pdb, align_codes=pdb+chain)
aln.malign()                # multiple sequence alignment
aln.malign3d()              # structure-based (3D) alignment
aln.compare_structures()    # pairwise structural comparison
aln.id_table(matrix_file='family.mat')                      # sequence identity matrix
env.dendrogram(matrix_file='family.mat', cluster_cut=-1.0)  # clustering tree

Figure: Script 2; save it as PY file.

--------Run script1 and script2--------
Place script1 and script2, together with the query file (.ALI), the PIR database file
(pdb_95.pir, read by script 1) and the four PDB structures, in the MODELLER folder; here they
were placed in the bin folder, as shown below.
Now run MODELLER.

Figure: Running script 1 generates additional files.
Figure: Additional files: pdb_95.bin, build_profile.prf, build_profile.ali and script1.log

Run script2 now.
Open the script2 log file and check the MSA scores and the phylogenetic tree (dendrogram).
Figure: MODELLER performs the MSA and constructs the phylogenetic tree on the basis of the MSA.
One template is now picked on the basis of structural resolution: seq1B @2.9 (2.9 Å) was
chosen because of its low resolution value (a lower value in Å means a better-resolved structure).

6. MODELLER Step 3: Script_3 preparation for pairwise alignment using dynamic
programming to align the best template with the query.
In this step, the align2d() command takes into account structural information from the template
when constructing the alignment. This is achieved through a variable gap penalty function that tends
to place gaps in solvent-exposed and curved regions, outside secondary structure segments, and
between two positions that are close in space. As a result, alignment errors are reduced by
approximately one third relative to those that occur with standard sequence alignment techniques.
This improvement becomes more important as the similarity between the sequences decreases and
the number of gaps increases.
In the script, insert the template chosen in MODELLER step 2, i.e. seq1 (chain B), in place of the
tutorial's 1bdm.
from modeller import *

env = environ()
aln = alignment(env)
# Tutorial names are shown; in this project 1bdm / chain A is replaced by seq1 / chain B
mdl = model(env, file='1bdm', model_segment=('FIRST:A','LAST:A'))
aln.append_model(mdl, align_codes='1bdmA', atom_files='1bdm.pdb')
aln.append(file='TvLDH.ali', align_codes='TvLDH')
aln.align2d()
aln.write(file='TvLDH-1bdmA.ali', alignment_format='PIR')
aln.write(file='TvLDH-1bdmA.pap', alignment_format='PAP')
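The tutorial script can be adapted to this project's names by treating the query code, template file and chain as parameters. The sketch below is one possible way to do that, not the script actually run for this report: the function names alignment_codes and run_align2d are hypothetical, the defaults (TvLDH, seq1, chain B) are taken from the text, and it assumes that seq1.pdb and TvLDH.ali sit in the working directory and that the MODELLER 9.25 class names (environ, alignment, model) are available.

```python
def alignment_codes(query_code, template_file, chain):
    """Derive the template align code and the output file names."""
    template_code = template_file + chain          # e.g. 'seq1' + 'B' -> 'seq1B'
    stem = f"{query_code}-{template_code}"
    return template_code, stem + ".ali", stem + ".pap"


def run_align2d(query_code="TvLDH", template_file="seq1", chain="B"):
    """Align the query to one template chain with MODELLER's align2d()."""
    import modeller  # imported here so the helper above works without MODELLER

    template_code, ali_out, pap_out = alignment_codes(query_code, template_file, chain)
    env = modeller.environ()
    aln = modeller.alignment(env)
    # Read only the requested chain of the template structure
    mdl = modeller.model(env, file=template_file,
                         model_segment=("FIRST:" + chain, "LAST:" + chain))
    aln.append_model(mdl, align_codes=template_code,
                     atom_files=template_file + ".pdb")
    aln.append(file=query_code + ".ali", align_codes=query_code)
    aln.align2d()
    aln.write(file=ali_out, alignment_format="PIR")
    aln.write(file=pap_out, alignment_format="PAP")
    return ali_out, pap_out


# The naming helper can be checked without MODELLER installed
print(alignment_codes("TvLDH", "seq1", "B"))
```

Calling run_align2d() inside a MODELLER-enabled Python session would then write the .ali and .pap files; the .ali file name and the template code are what the model-building script expects as alnfile and knowns.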

Now run script3 to obtain the pairwise alignment of the query with the best template.
This pairwise alignment guides the building of the model's conserved regions. Because it uses
dynamic programming, an exhaustive algorithm, the run takes some time.

Script 3 output file
Figure: The pairwise alignment, a necessary step in model building, is now complete.
Time: 172.75

7. MODELLER Step 4: Script_4 preparation for Model Building and Backbone R-chain.
Once a target-template alignment is constructed, MODELLER calculates a 3D model of the target
completely automatically, using its automodel class. The following script generates ten similar
models of our protein based on the seq1 template:
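from modeller import *
from modeller.automodel import *
#from modeller import soap_protein_od

env = environ()
# Tutorial names are shown; the alignment file and template code are replaced by our own
a = automodel(env, alnfile='TvLDH-1bdmA.ali',
              knowns='1bdmA', sequence='TvLDH',
              assess_methods=(assess.DOPE,
                              #soap_protein_od.Scorer(),
                              assess.GA341))
a.starting_model = 1
a.ending_model = 10   # ten models were generated here (the tutorial default is 5)
a.make()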

Figure: Ten models are needed, so ending_model is set to 10 and the PDB name is replaced.
Figure: Now run MODELLER for script_4.
Figure: Running script4, which generates the models.
Figure: The ten models have been generated successfully.

Open the script4 output (log) file.
Here, according to the DOPE score and molpdf value, I chose one of the best models for our query
protein.
When several models are calculated for the same target, the "best" model can be selected in
several ways. For example, you could pick the model with the lowest value of the MODELLER objective
function (molpdf) or of the DOPE or SOAP assessment scores, or with the highest GA341 assessment
score, which are reported at the end of the log file, shown below.
TvLDH.B99990010.pdb 3124.62549 -35491.57422 0.77961

>> Summary of successfully produced models:
Filename molpdf DOPE score GA341 score
----------------------------------------------------------------------
TvLDH.B99990001.pdb 3195.83813 -33944.30859 0.75628
TvLDH.B99990002.pdb 3074.42725 -34495.23438 0.66094
TvLDH.B99990003.pdb 2914.32275 -34914.96875 0.62078
TvLDH.B99990004.pdb 3244.13867 -35109.01563 0.79621
TvLDH.B99990005.pdb 3072.41846 -34744.25781 0.94353
TvLDH.B99990006.pdb 2985.87280 -34632.87891 0.80809
TvLDH.B99990007.pdb 3338.26465 -35036.42578 0.78566
TvLDH.B99990008.pdb 3178.71118 -34787.64063 0.54706
TvLDH.B99990009.pdb 3354.25049 -34837.94922 0.72689
TvLDH.B99990010.pdb 3124.62549 -35491.57422 0.77961
Total CPU time [seconds] : 325.75
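The model TvLDH.B99990010.pdb was chosen because it has the lowest (most negative) DOPE score of
the ten models (-35491.57422). Its molpdf (3124.62549) is moderate rather than the lowest: model 3
has the lowest molpdf (2914.32275), and model 5 has the highest GA341 score (0.94353).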
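The comparison of the ten models can be reproduced by parsing the summary table from the log. The sketch below uses the values listed above; the function name parse_summary and the dictionary keys are illustrative assumptions.

```python
# Rows copied from the ">> Summary of successfully produced models" table:
# file name, molpdf, DOPE score, GA341 score
SUMMARY = """\
TvLDH.B99990001.pdb 3195.83813 -33944.30859 0.75628
TvLDH.B99990002.pdb 3074.42725 -34495.23438 0.66094
TvLDH.B99990003.pdb 2914.32275 -34914.96875 0.62078
TvLDH.B99990004.pdb 3244.13867 -35109.01563 0.79621
TvLDH.B99990005.pdb 3072.41846 -34744.25781 0.94353
TvLDH.B99990006.pdb 2985.87280 -34632.87891 0.80809
TvLDH.B99990007.pdb 3338.26465 -35036.42578 0.78566
TvLDH.B99990008.pdb 3178.71118 -34787.64063 0.54706
TvLDH.B99990009.pdb 3354.25049 -34837.94922 0.72689
TvLDH.B99990010.pdb 3124.62549 -35491.57422 0.77961
"""


def parse_summary(text):
    """Turn the whitespace-separated summary rows into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        name, molpdf, dope, ga341 = line.split()
        rows.append({"file": name, "molpdf": float(molpdf),
                     "dope": float(dope), "ga341": float(ga341)})
    return rows


models = parse_summary(SUMMARY)
best_dope = min(models, key=lambda m: m["dope"])       # lower DOPE is better
best_molpdf = min(models, key=lambda m: m["molpdf"])   # lower molpdf is better
best_ga341 = max(models, key=lambda m: m["ga341"])     # higher GA341 is better

print("lowest DOPE:   ", best_dope["file"])
print("lowest molpdf: ", best_molpdf["file"])
print("highest GA341: ", best_ga341["file"])
```

The three criteria point to three different models here, so the choice of model 10 rests on the DOPE score specifically.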

8. MODELLER Step 5: Script_5 preparation for Model optimization
Before any external evaluation of the model, one should check the log file from the modeling run
for runtime errors and restraint violations. The file "evaluate_model.py", here named script-5,
evaluates an input model with the DOPE potential.
Note that here we picked the tenth generated model, TvLDH.B99990010.pdb.
from modeller import *
from modeller.scripts import complete_pdb

log.verbose()    # request verbose output
env = environ()
env.libs.topology.read(file='$(LIB)/top_heav.lib')  # read topology
env.libs.parameters.read(file='$(LIB)/par.lib')     # read parameters

# read model file (the tutorial uses TvLDH.B99990002.pdb; the tenth model was picked here)
mdl = complete_pdb(env, 'TvLDH.B99990010.pdb')

# Assess with DOPE:
s = selection(mdl)   # all atom selection
s.assess_dope(output='ENERGY_PROFILE NO_REPORT',
              file='TvLDH.profile',
              normalize_profile=True, smoothing_window=15)
Figure: Preparing and Running the script5.

Figure: Running script5. It evaluates our model TvLDH.B99990010.pdb with the DOPE potential.
Note: A second run (Loop_2) was also performed to check whether better models could be
generated, but the Loop_1 results were more acceptable than the Loop_2 model when both were
later analyzed with the Ramachandran plot.
Figure: Representation of Loop_2. The models obtained from Loop_2 (10 disallowed regions)
were not validated as well by the Ramachandran plot as those from Loop_1.

9. Validation and structural organization of 3D Model using Ramachandran Plot
and Chimera.
The best generated model was opened in Chimera 1.15rc and visualized in PyMOL.
Figure: TvLDH.B99990010.pdb opened in Chimera.
Figure: The 3D structural organization of the protein, showing the turns, coils and beta
strands.

Figure: Electrostatic potential surface of the protein. Red represents the acidic, blue the
basic and grey the neutral parts of the protein.
Figure: Surface model coloured by element: C green, H grey, N blue, O red, S orange.

Figure: Labelled 3D model showing the residues and the main-chain atomic structure.

VALIDATION USING RAMACHANDRAN PLOT
In biochemistry, a Ramachandran plot (also known as a Rama plot, a Ramachandran
diagram or a [φ,ψ] plot), originally developed in 1963 by G. N. Ramachandran, C.
Ramakrishnan, and V. Sasisekharan, is a way to visualize energetically allowed regions for
backbone dihedral angles ψ against φ of amino acid residues in protein structure.
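To make the two plotted angles concrete: φ is the dihedral defined by the atoms C(i-1), N(i), CA(i), C(i), and ψ by N(i), CA(i), C(i), N(i+1). The sketch below computes a signed dihedral angle from four 3D points using plain Python; the function name dihedral and the toy coordinates are illustrative assumptions, not output from this project's model.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees for four points (e.g. backbone atoms)."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)                    # unit vector along the central bond
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))   # b0 projected off the central bond
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))   # b2 projected off the central bond
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


# Toy geometry: central bond along z, first and last atoms placed around it
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
quarter = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(cis, trans, quarter)
```

Applying such a function to every residue's backbone atoms yields the (φ, ψ) pairs that tools like PROCHECK place on the plot.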
PROCHECK JOB TITLE: https://saves.mbi.ucla.edu/?job=602392
Figure: Ramachandran plot generated for the predicted 3D model of the dACE2
sequence, a protein that contains β-sheet, α-helix
and random coil. The red, brown, and yellow regions represent the favored, allowed, and
"generously allowed" regions as defined by PROCHECK.
Figure: The Plot statistics generated by PROCHECK shows its several characteristics.

On the basis of amino acid stereochemistry
Figure: Different residues shown on the basis of amino acid stereochemistry.
On the basis of statistics
Figure: Statistics for each residue involved.

On the basis of residue properties
Figure: Absolute deviation from the mean Chi-1 value, omega torsion, C-alpha chirality,
secondary structure and G-factor of the protein along the sequence.

From these plots we can also estimate the secondary structure content: 22 helices, 27 random
coils and 4 beta strands.

Functions of the Protein in the literature
The sequence of the dACE2 protein was published on July 20, 2020, but its functions are not
described in UniProt or any other source except the paper that reported it. The authors identified
a novel, primate-specific isoform of ACE2, which they designate deltaACE2 (dACE2). They showed
that dACE2, but not ACE2, is induced in various human cell types by IFNs and viruses; this
information is important to consider for future therapeutic strategies and for understanding
susceptibility to and outcomes of COVID-19.
1. dACE2 is a novel inducible and primate-specific isoform of ACE2. The
novel ACE2 isoform at the Xp22.2 locus of the human X chromosome is predicted to encode a
protein of 459 aa, in which Ex1c encodes the first 10 aa, which are unique. Compared to
the full-length ACE2 protein of 805 aa, the truncation eliminates 17 aa of the signal peptide
and 339 aa of the N-terminal peptidase domain (356 aa in total, so 805 - 356 + 10 = 459 aa),
as shown in figure 1.
Figure 1: Designation of the novel inducible isoform.
2. dACE2 is induced by IFNs in vitro. In most cell lines tested, dACE2 but not ACE2 was
strongly upregulated by Sendai virus (SeV) infection (Figure 2B, C). Treatment with IFN-β or a
cocktail of IFN-λ1–3 significantly induced expression of dACE2 only, and not ACE2 (Figure 2E).

Figure 2: Panels B, C and E.
3. dACE2 is induced in virally infected human respiratory epithelial cells. dACE2 but
not ACE2 was induced in an RSV-infected human pulmonary carcinoma cell line (H292).
4. dACE2 is enriched in squamous epithelial tumors. The authors hypothesized that, as an
ISG, dACE2 might be absent or expressed at low levels in normal tissues but could be
induced by the inflammatory tissue microenvironment. They explored data from The
Cancer Genome Atlas (TCGA), which represents the largest collection of tumors and
tumor-adjacent normal tissues. Expression of both ACE2 and dACE2 was detectable in
many tumor-adjacent normal tissues.
5. dACE2 is induced by SARS-CoV-2 in vitro. The results confirm that dACE2 is
inducible by SARS-CoV-2 infection. Expression of ACE2 and dACE2 was much higher
in the lung adenocarcinoma cell line Calu3 than in the colon adenocarcinoma cell lines
Caco2 and T84.
6. dACE2 is non-functional as a SARS-CoV-2 receptor and carboxypeptidase. The main
activities that involve the peptidase domain of ACE2 appear to be abrogated in dACE2.

In conclusion, the authors present the first report of the discovery and functional annotation of
dACE2, an IFN-inducible isoform of ACE2. The existence of two functionally distinct ACE2 isoforms
reconciles several biological properties previously attributed to ACE2, with dACE2 being an ISG,
and ACE2 acting as the SARS-CoV-2 entry receptor and carboxypeptidase, without being
regulated by IFNs.
Major contribution to the discovery of dACE2: Laboratory of Translational Genomics, Division of
Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health,
Bethesda, MD, USA.

Conclusion
In this project assignment, I predicted the three-dimensional structure of the dACE2 protein by
homology modelling with MODELLER. The dACE2 sequence was reported by the Laboratory of
Translational Genomics, Division of Cancer Epidemiology and Genetics, National Cancer
Institute, National Institutes of Health, Bethesda, MD, USA, in July 2020. According to the
Ramachandran plot analysis by PROCHECK, 99.3% of the residues of my predicted structure lie in
allowed regions (87.7% in the most favoured regions, 9.3% in allowed regions and
2.2% in generously allowed regions). The structural organization also shows the presence of
helices, beta strands and random coils.

References
1) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7386494/
2) https://www.uniprot.org/uniprot/A0A7D6JAD5
3) https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome
4) https://www.rcsb.org/structure/6M17#entity-2
5) https://www.rcsb.org/structure/7CT5
6) https://www.rcsb.org/structure/7C8D
7) https://www.rcsb.org/structure/6CS2
8) https://salilab.org/modeller/
9) http://services.mbi.ucla.edu/SAVES/Ramachandran/
10) Tools: chimera 1.15rc, MODELLER 9.25, PyMOL.
