Structure-Activity Relationship (SAR) analysis is important for the development of novel small molecule drugs. Such analyses rely on bioactivity data from either in-house or published sources, with the latter currently being extracted manually at great expense.
Here we report on an entirely automated system for extracting bioactivity data that we are developing, initially targeting US patents. The system relies on combining the results of many technologies: chemical entity recognition, chemical name to structure, table processing, chemical compound number resolution, chemical sketch interpretation, and even in some cases reconstitution of molecules from a generic core and R-group definitions. Where possible, the target and the assay description are also identified.
To assess the precision/recall of our system we compare our results with those manually extracted from US patents by BindingDB. We also compare the data we’ve extracted with the data present in ChEMBL from journal articles, to analyse whether there are significant differences between activity data in journal articles and patents, e.g. differences in targets of interest.
Automatic extraction of bioactivity data from patents
1. 253rd ACS National Meeting, San Francisco CA, USA 4th April 2017
Automatic extraction of bioactivity
data from patents
Daniel Lowe*, Stefan Senger† and Roger Sayle*
*NextMove Software Cambridge, UK
†GlaxoSmithKline, Stevenage, UK
2.
Example Use cases
• “A patent has recently come out on a topic of
interest, can the key compounds be extracted
with their activity data?”
• “Which compounds have been found to be
active against this target?”
3.
US Patent data freely available
patents.reedtech.com
(Or from the USPTO: bulkdata.uspto.gov)
4.
= text-mined
What are these compounds?
5.
Understanding table semantics
SureChEMBL Google Patents
After text-mining for chemical entities:
Green = substituent
Purple = molecule
Source: US20170050925A9
6.
SureChEMBL
Google Patents
Patent PDF
PatFetch (NextMove Software)
Source: US20010016661A1
7.
Understanding table semantics
5 columns
6 columns
• Columns merged such that header and body
have same number of columns
8.
Getting the compound
structures
• Chemical names
• Chemical sketches
• R-group tables
• Compound identifier associated with any of
the above
9.
Chemical names
• OPSIN (Open Parser for systematic IUPAC
nomenclature)
• Dictionaries (ChEMBL/PubChem/NextMove)
• Chemical line formula parsing, especially
useful for peptide names and R-group
definitions
10.
Chemical sketches
• Utilize the ChemDraw sketches provided by
the USPTO
• Detection and handling of repeat brackets and
positional variation
• Fixing obvious errors e.g. undervalent
nitrogen near to H atom with no bond
• Labels reinterpreted
11.
Formula Interpretation
Input ChemDraw 15 This work
HATU
C4F9
H3PO4
CON(cHex)2 No result
III-2 No result
[Structure sketches: interpretation of each input by ChemDraw 15 and by this work]
12.
R-group tables
13.
Resolving Identifiers
• Need to “name space” identifiers
– “Compound 1”, “Reference compound 1”,
“Example 1”
– But “Compound 1” = “cmpd 1” = “cpd. #1”
• Where a column is just called “#”, is it a
compound number, example number or just a
table row number?
• Identifier may be defined multiple times e.g.
as a sketch and chemical name
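The name-spacing idea above can be sketched as follows. This is a minimal illustration rather than the system's actual implementation: the function name `normalize_identifier`, the synonym table and the regular expression are all assumptions made for the example, and a real patent corpus would need far more variants.

```python
import re

# Hypothetical synonym table: surface forms mapped to a name space.
# The entries are assumptions for illustration, not an exhaustive list.
_NAMESPACES = {
    "compound": "compound",
    "cmpd": "compound",
    "cpd": "compound",
    "example": "example",
    "ex": "example",
    "reference compound": "reference compound",
}

# Prefix, optional full stop, optional "no."/"#", then the label itself.
_PATTERN = re.compile(
    r"^\s*(reference\s+compound|compound|cmpd|cpd|example|ex)"
    r"\.?\s*(?:no\.?|#)?\s*([0-9]+[a-z]?)\s*$",
    re.IGNORECASE,
)


def normalize_identifier(text):
    """Return (name space, label) for an identifier, or None if unrecognised."""
    match = _PATTERN.match(text)
    if match is None:
        return None
    prefix = re.sub(r"\s+", " ", match.group(1).lower())
    return (_NAMESPACES[prefix], match.group(2).lower())


# Different spellings of the same identifier collapse to one key,
# while "Reference compound 1" and "Example 1" stay distinct.
for raw in ["Compound 1", "cmpd 1", "cpd. #1", "Reference compound 1", "Example 1"]:
    print(raw, "->", normalize_identifier(raw))
```

A bare “#” column header returns `None` here, which reflects the ambiguity noted above: the header alone does not say which name space the numbers belong to.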
14.
Resolving Identifiers
(text-mining)
15.
Resolving Identifiers
(Sketches)
16.
Resolving Identifiers
(Tables)
17.
Extracting compound-activity
relationships
18.
Excel table export
19.
Extracting compound-activity
relationships
What is the target?
20.
Assay identification
• Naïve Bayes classifier trained from assay
descriptions identified by BindingDB curators
• 10-fold cross validation: 98.9% recall, 94.7%
precision
• Paragraph associated with next table or table
mentioned in paragraph
• Target/organism detected
• Care taken to avoid common irrelevant
organisms/proteins e.g. bovine serum albumin
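To make the classification step concrete, here is a minimal multinomial Naïve Bayes sketch in plain Python. It is not the classifier described above: the class name `NaiveBayes`, the tokenizer, the add-one smoothing and especially the four training sentences are assumptions invented for illustration (the real model was trained on assay descriptions identified by BindingDB curators), and no cross validation is performed.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()              # label -> number of paragraphs
        self.vocab = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        tokens = [t for t in tokenize(text) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training paragraphs, invented for this example only.
clf = NaiveBayes()
clf.train("Inhibition of kinase activity was measured in an enzyme assay "
          "and IC50 values were determined", "assay")
clf.train("Compounds were incubated with cells expressing the receptor "
          "and binding was measured", "assay")
clf.train("The mixture was stirred at room temperature and the solvent "
          "was removed under reduced pressure", "other")
clf.train("The crude product was purified by column chromatography "
          "to give the title compound", "other")

print(clf.predict("IC50 values were measured in a binding assay"))
```

In a pipeline like the one on this slide, a paragraph labelled as an assay description would then be associated with the next table, or with a table it mentions.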
21.
Results
22.
Results From US Patent
applications (2001-Mar 2017)
Red = Bioactivity
23.
Activities with associated structures per year
[Chart: activity-structure relationships extracted (y-axis, 0 to 600,000) by publication year (x-axis)]
24.
Comparison with BindingDB
• Activity data from ~1500 US patent grants (2013-
2016) manually extracted over the course of 3 years
• ~150,000 activities
• Comparison done on the subset that was made
available in ChEMBL 22_1 (98,898 activity values,
1012 patents)
• As some assay results are missed by the automatic
extraction, and some are considered out of scope by
BindingDB, difficult to distinguish differences in
coverage from genuine disagreements
25.
Comparison with BindingDB
• Values normalized into nM
– 1000s of instances of measurements in nanometers!
• Mid point of ranges taken
• Structures compared by StdInChI
• Target name normalized to ChEMBL target ID
(organism specific), using either:
– ChEMBL target synonyms
– Normalize to HGNC symbol and check if HGNC symbol is a
ChEMBL target synonym
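The value normalization described above might look like the following sketch. The function name `to_nanomolar`, the accepted unit spellings and the range syntax are assumptions for illustration. In particular, reading a lower-case “nm” as nanomolar mirrors the observation on this slide that many patents write nanometers where nanomolar is meant.

```python
import re

# Multipliers from a molar-unit prefix to nanomolar ("" means molar).
_PREFIX_TO_NM = {"p": 1e-3, "n": 1.0, "u": 1e3, "µ": 1e3, "μ": 1e3, "m": 1e6, "": 1e9}

# A number, optionally followed by "-", an en dash or "to" and a second number,
# then a unit. A trailing lower-case "m" (as in "nm") is tolerated and treated
# as molar; this is an assumption of the sketch.
_VALUE = re.compile(
    r"^\s*([0-9]*\.?[0-9]+)\s*"
    r"(?:(?:-|\u2013|to)\s*([0-9]*\.?[0-9]+)\s*)?"
    r"([pnuµμm][Mm]|M)\s*$"
)


def to_nanomolar(text):
    """Convert e.g. '0.5 uM' or '10-20 nm' to a value in nM, or return None."""
    match = _VALUE.match(text)
    if match is None:
        return None
    low = float(match.group(1))
    high = float(match.group(2)) if match.group(2) else low
    unit = match.group(3)
    prefix = unit[0] if len(unit) == 2 else ""
    # Ranges are represented by their mid point.
    return (low + high) / 2 * _PREFIX_TO_NM[prefix]


for raw in ["12 nM", "0.5 uM", "10-20 nm", "1 mM"]:
    print(raw, "->", to_nanomolar(raw))
```

Taking the mid point keeps a range comparable with a single curated value, at the cost of discarding the width of the range.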
26.
Comparison
• Expected values found: 75%
• Expected structures found: 65%
• Expected value + structure found: 53%
• Expected value + structure + target: 18%
27.
Unclear structure assignment
? ?
28.
Stereochemistry and salts
[Structure sketches compared across three sources: Patent, BindingDB, This work]
29.
Long tail of difficult cases
What does this superscript term mean?
What are the units?
30.
Targets of patent data compared to journal data
Common Target Classes
[Two charts, one for ChEMBL 22_1 (excluding BindingDB) and one for US Patent Applications: % per year (y-axis, 0% to 40%) by year (x-axis, 2002 to 2016) for each target class: Kinase, GPCR (Family A), Protease, Nuclear receptor, Voltage-gated ion channel, Electrochemical transporter, Oxidoreductase]
31.
Upcoming target classes
[Chart: percentage of documents with activity values against target class (y-axis, 0.0% to 3.0%) by year (x-axis, 2002 to 2016); series: Epigenetic writer (Patents), Epigenetic reader (Patents), Epigenetic writer (ChEMBL ex BindingDB), Epigenetic reader (ChEMBL ex BindingDB)]
32.
Future work
• Support for more complex R-group tables
• Improve recognition and resolution of protein
target names
• Support for activities specified in text e.g.
Example 1 has an IC50 of 12 nM measured at rat EP4
• Resolution of symbols for activity ranges e.g.
“A” indicates an IC50 value of less than 100 nM
• Improve assay metadata extraction
cf. BioAssay Express
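The two text patterns quoted in the bullets above can be matched with simple regular expressions, as in the sketch below. This only illustrates the shape of the problem and is not the approach the authors plan to take: the function names `extract_activity` and `parse_symbol_legend`, the list of activity measures and both patterns are assumptions, and real patent prose varies far more than these two sentences.

```python
import re

# Pattern for a sentence such as:
#   Example 1 has an IC50 of 12 nM measured at rat EP4
_ACTIVITY = re.compile(
    r"(?P<identifier>(?:Example|Compound)\s+\w+)\s+has\s+an?\s+"
    r"(?P<measure>IC50|EC50|Ki|Kd)\s+of\s+"
    r"(?P<value>[0-9]*\.?[0-9]+)\s*(?P<unit>[pnuµ]?M)"
)

# Pattern for a legend such as:
#   "A" indicates an IC50 value of less than 100 nM
_LEGEND = re.compile(
    r"[\"\u201c](?P<symbol>[A-Z]+)[\"\u201d]\s+indicates\s+an?\s+"
    r"(?P<measure>IC50|EC50|Ki|Kd)\s+value\s+of\s+"
    r"(?P<relation>less than|greater than)\s+"
    r"(?P<value>[0-9]*\.?[0-9]+)\s*(?P<unit>[pnuµ]?M)"
)


def extract_activity(sentence):
    """Return the identifier, measure, value and unit stated in a sentence."""
    match = _ACTIVITY.search(sentence)
    if match is None:
        return None
    return {
        "identifier": match.group("identifier"),
        "measure": match.group("measure"),
        "value": float(match.group("value")),
        "unit": match.group("unit"),
    }


def parse_symbol_legend(sentence):
    """Return what an activity-range symbol such as "A" stands for."""
    match = _LEGEND.search(sentence)
    if match is None:
        return None
    return {
        "symbol": match.group("symbol"),
        "measure": match.group("measure"),
        "relation": "<" if match.group("relation") == "less than" else ">",
        "value": float(match.group("value")),
        "unit": match.group("unit"),
    }


print(extract_activity("Example 1 has an IC50 of 12 nM measured at rat EP4"))
print(parse_symbol_legend('"A" indicates an IC50 value of less than 100 nM'))
```

The target mentioned in the first sentence ("rat EP4") is deliberately left unparsed here, since resolving protein target names is listed above as a separate item.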
33.
Disambiguation of Conflicting
structure descriptions
Image from
original filing
Redrawn by US
patent office in
ChemDraw
Intended
structure from
chemical name
34.
Conclusions
• Processing all US patents from 2001 to present
can be done in less than a day on a desktop PC
• Technique applicable to chemical properties
other than activity values
• Compound number <-> structure relationships
useful for key compound identification
• For the majority of patents, extracting
structure-activity relationships can be
significantly expedited
35.
Acknowledgements
• Noel O'Boyle
• John Mayfield
• Funding provided by:
36.
Thank you for your time!
http://nextmovesoftware.com
http://nextmovesoftware.com/blog
daniel@nextmovesoftware.com