Seeking glimmers of light in Pharos “Tdark” proteins
1. Seeking glimmers of light: Intersecting Pharos “Tdark” with
UniProt chemistry cross-references
Introduction
The 20,409 proteins in Pharos are usefully divided into development levels (DLs):
1) Tclin = 613: associated with a drug Mechanism of Action (MoA)
2) Tchem = 1598: exceed target-class bioactivity thresholds from ChEMBL,
DrugCentral or other curation
3) Tbio = 1145: lack small-molecule annotation and satisfy one of the following:
• Exceed the cut-off criteria for Tdark
• GO Molecular Function or Biological Process with Experimental Evidence
• An OMIM phenotype
4) Tdark = 6368: little information available, and satisfy these criteria:
• PubMed Jensen Lab text-mining score below 5
• Fewer than 3 Gene References into Function (RIFs)
• Fewer than 50 antibodies available in Antibodypedia
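The three Tdark thresholds listed above can be written as a simple check. This is a minimal sketch, not the Pharos implementation: the class and field names are hypothetical, and because the text does not state how many of the criteria must hold, that number is exposed as a parameter (defaulting to all three, which is an assumption).

```python
from dataclasses import dataclass


@dataclass
class ProteinEvidence:
    """Hypothetical container for the three evidence values named in the text."""
    jensen_pubmed_score: float  # PubMed Jensen Lab text-mining score
    gene_rif_count: int         # number of Gene RIFs
    antibody_count: int         # antibodies listed in Antibodypedia


def dark_criteria_met(p: ProteinEvidence) -> int:
    """Count how many of the three Tdark thresholds the protein falls below."""
    return sum([
        p.jensen_pubmed_score < 5,
        p.gene_rif_count < 3,
        p.antibody_count < 50,
    ])


def looks_tdark(p: ProteinEvidence, min_criteria: int = 3) -> bool:
    """Apply the thresholds; min_criteria is an assumption, not from the source."""
    return dark_criteria_met(p) >= min_criteria


# Made-up values for illustration only.
sparse = ProteinEvidence(jensen_pubmed_score=1.2, gene_rif_count=0, antibody_count=10)
studied = ProteinEvidence(jensen_pubmed_score=120.0, gene_rif_count=40, antibody_count=300)
print(looks_tdark(sparse), looks_tdark(studied))
```

Keeping the count of satisfied criteria separate from the final decision makes it easy to rerun the classification under a different reading of "satisfy these criteria".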
Objectives
We decided to investigate the complementarity between Pharos DL annotations
and selected cross-references in UniProtKB. Our initial focus is on the six resources
in the “Chemistry” section indicated below (with total species counts).
These were selected because they a) include associations between proteins and
their activity modulators that were manually curated by experts and b) exhibit
complementary differences in primary source selection and curatorial stringency.
ChEMBL and DrugCentral are already used in the compilation of Tchem. SwissLipids
is a (useful) odd man out in focusing on enzyme substrates rather than modulators.
Human protein counts
Below are the human Swiss-Prot counts for each of the sources, plus the union of all
six in blue. Counts unique to each source are shown in orange.
The proportional uniqueness of the sources needs further examination but is
interpretable in some cases, for example the lower curation stringency of DrugBank
and ChEMBL's extraction from a larger literature corpus of 75K papers. The union at
~500 represents a surprisingly large potentially druggable genome of ~25%.
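The union and per-source unique counts described in this section reduce to plain set operations. The sketch below assumes each source has already been reduced to a set of UniProt accessions; the function name and the toy identifiers are made up for illustration and are not the poster's data.

```python
def union_and_unique(sources: dict[str, set[str]]) -> tuple[set[str], dict[str, set[str]]]:
    """Return the union of all sources and, per source, the entries found nowhere else."""
    union = set().union(*sources.values())
    unique = {}
    for name, accessions in sources.items():
        # Everything that appears in any source other than this one.
        others = set().union(*(a for n, a in sources.items() if n != name))
        unique[name] = accessions - others
    return union, unique


# Toy identifiers, not real accessions.
toy_sources = {
    "ChEMBL": {"A1", "A2", "A3"},
    "DrugBank": {"A2", "A4"},
    "GtoPdb": {"A3"},
}
toy_union, toy_unique = union_and_unique(toy_sources)
print(len(toy_union), {name: len(accs) for name, accs in toy_unique.items()})
```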
Christopher Southan1 and Tudor Oprea2. 1. TW2Informatics, Gothenburg,
Sweden 42166. 2. Department of Internal Medicine, Comprehensive Cancer
Center, University of New Mexico School of Medicine, Albuquerque, NM, USA.
Comparison against Tdark
As a next step we compared the union of the six cross-references against Tdark. The
intersect is shown in the Venn diagram below.
The 144 proteins in common were a surprise, since we would not expect a Tdark
protein to have a curated modulating ligand (i.e. it would be less functionally dark).
Tdark matches by source
We then split the 144 by source as shown in the chart below.
The per-source matches totalled 167, 23 more than the 144 distinct proteins,
indicating some corroboration (i.e. more than one source having annotated the same
Tdark protein).
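The comparison and the per-source split can be sketched as follows. The total of per-source matches exceeds the number of distinct proteins exactly when some Tdark protein is annotated by more than one source. The function name and the toy identifiers are hypothetical; only the set logic is intended to mirror the analysis described here.

```python
from collections import Counter


def tdark_matches(sources: dict[str, set[str]], tdark: set[str]):
    """Intersect each source with Tdark and summarise the overlap."""
    per_source = {name: accs & tdark for name, accs in sources.items()}
    # Number of sources annotating each matched Tdark protein.
    hits = Counter(acc for accs in per_source.values() for acc in accs)
    distinct = set(hits)             # proteins matched by at least one source
    total = sum(hits.values())       # sum of per-source match counts
    corroborated = {acc for acc, n in hits.items() if n > 1}
    return per_source, distinct, total, corroborated


# Toy identifiers, not real accessions.
toy_sources = {
    "ChEMBL": {"X1", "X2", "X9"},
    "DrugCentral": {"X2", "X3"},
    "SwissLipids": {"X4"},
}
toy_tdark = {"X1", "X2", "X3", "X7"}
per_source, distinct, total, corroborated = tdark_matches(toy_sources, toy_tdark)
print(len(distinct), total, sorted(corroborated))
```

The difference `total - len(distinct)` counts the extra annotations contributed by corroborating sources, which is the quantity the text infers from the two totals.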
Analysis of the matches
Examples of the DrugCentral matches with Tdark are shown below.
We cannot show all lists here (data available on request), but manual inspection of
selected records, both in UniProt and at source, indicated the following themes:
• Many could be classified as secondary cross-reactivities of lower potency than
against their primary efficacy targets (e.g. some of the ChEMBL matches).
• The 11 SwissLipids matches had been characterised bioinformatically and
enzymatically (i.e. were not completely dark).
• DrugBank does not record affinity data, which gives rise to false positives in
terms of potentially illuminating Tdark proteins.
• Some had only recently “come into the light” (e.g. GtoPdb annotated a NUDT7
inhibitor with a pEC50 of 6.0 from a 2018 bioRxiv paper).
Plans
Manual inspection of the lists is continuing to tease out further trends and nuances.
Discussions with the IDG teams will consider which of the 144 could either “come
in from the dark” in Pharos or might even pass Tchem thresholds. Other informative
UniProt annotations and cross-references will be intersected against the DL
classifications (e.g. ~350 Tdark proteins match PDB entries).