253rd ACS National Meeting, San Francisco CA, USA 4th April 2017

Automatic extraction of bioactivity data from patents

Structure-Activity Relationship (SAR) analysis is important for the development of novel small molecule drugs. Such analyses rely on bioactivity data, either in-house or published, with the latter currently being extracted manually at great expense.
Here we report on an entirely automated system for extracting bioactivity data that we are developing, initially targeting US patents. The system relies on combining the results of many technologies: chemical entity recognition, chemical name to structure, table processing, chemical compound number resolution, chemical sketch interpretation, and even in some cases reconstitution of molecules from a generic core and R-group definitions. Where possible, the target and the assay description are also identified.
To assess the precision/recall of our system we compare our results with those manually extracted from US patents by BindingDB. We also compare the data we’ve extracted with the data present in ChEMBL from journal articles, to analyse whether there are significant differences between activity data in journal articles and patents e.g. differences in targets of interest.

Published in: Software

Automatic extraction of bioactivity data from patents

  1. 253rd ACS National Meeting, San Francisco CA, USA, 4th April 2017. Automatic extraction of bioactivity data from patents. Daniel Lowe*, Stefan Senger† and Roger Sayle*. *NextMove Software, Cambridge, UK; †GlaxoSmithKline, Stevenage, UK
  2. Example use cases • “A patent has recently come out on a topic of interest; can the key compounds be extracted with their activity data?” • “Which compounds have been found to be active against this target?”
  3. US patent data freely available: patents.reedtech.com (or from the USPTO: bulkdata.uspto.gov)
  4. (Figure: highlighted = text-mined) What are these compounds?
  5. Understanding table semantics. SureChEMBL, Google Patents. After text-mining for chemical entities: green = substituent, purple = molecule. Source: US20170050925A9
  6. SureChEMBL, Google Patents, Patent PDF, PatFetch (NextMove Software). Source: US20010016661A1
  7. Understanding table semantics: 5 columns, 6 columns • Columns merged such that header and body have the same number of columns
  8. Getting the compound structures • Chemical names • Chemical sketches • R-group tables • Compound identifier associated with any of the above
  9. Chemical names • OPSIN (Open Parser for Systematic IUPAC Nomenclature) • Dictionaries (ChEMBL/PubChem/NextMove) • Chemical line formula parsing, especially useful for peptide names and R-group definitions
  10. Chemical sketches • Utilize the ChemDraw sketches provided by the USPTO • Detection and handling of repeat brackets and positional variation • Fixing obvious errors e.g. undervalent nitrogen near to H atom with no bond • Labels reinterpreted
  11. Formula interpretation [Figure: comparison of ChemDraw 15 and this work on the input labels HATU, C4F9, H3PO4, CON(cHex)2 and III-2; two of the outputs shown are "No result"; the structure drawings are not recoverable from the transcript]
  12. R-group tables
  13. Resolving identifiers • Need to “name space” identifiers – “Compound 1”, “Reference compound 1”, “Example 1” – But “Compound 1” = “cmpd 1” = “cpd. #1” • Where a column is just called “#”, is it a compound number, an example number or just a table row number? • Identifier may be defined multiple times e.g. as a sketch and chemical name
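The identifier "name spacing" on slide 13 can be illustrated with a short sketch. This is not the authors' implementation: the synonym table, the regular expression and the function name `normalize_identifier` are assumptions introduced here purely for illustration, written in Python.

```python
import re

# Hypothetical synonym table: each label spelling maps to one namespace.
# "Compound 1" and "Reference compound 1" must stay in different namespaces.
_NAMESPACES = {
    "compound": "compound",
    "cmpd": "compound",
    "cpd": "compound",
    "example": "example",
    "ex": "example",
    "reference compound": "reference compound",
}

# Label, optional full stop, optional "#" or "No.", then the number itself.
_ID_PATTERN = re.compile(
    r"^\s*(?P<label>[A-Za-z][A-Za-z ]*?)\.?\s*(?:#|no\.?)?\s*"
    r"(?P<num>\d+[A-Za-z]?)\s*$",
    re.IGNORECASE,
)


def normalize_identifier(text):
    """Return (namespace, number) for a compound label, or None if unrecognised."""
    match = _ID_PATTERN.match(text)
    if match is None:
        return None
    label = " ".join(match.group("label").lower().split())
    namespace = _NAMESPACES.get(label)
    if namespace is None:
        return None
    return (namespace, match.group("num").lower())
```

With this sketch, "Compound 1", "cmpd 1" and "cpd. #1" all map to the same key, while "Reference compound 1" and "Example 1" remain distinct. A bare "1" from a column headed "#" returns `None`, since the namespace cannot be decided from the cell alone.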
  14. Resolving identifiers (text-mining)
  15. Resolving identifiers (sketches)
  16. Resolving identifiers (tables)
  17. Extracting compound-activity relationships
  18. Excel table export
  19. Extracting compound-activity relationships: what is the target?
  20. Assay identification • Naïve Bayes classifier trained from assay descriptions identified by BindingDB curators • 10-fold cross validation: 98.9% recall, 94.7% precision • Paragraph associated with next table or table mentioned in paragraph • Target/organism detected • Care taken to avoid common irrelevant organisms/proteins e.g. bovine serum albumin
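As a rough illustration of the kind of classifier named on slide 20, the sketch below implements a multinomial Naïve Bayes text classifier with add-one smoothing in plain Python. Everything specific in it is an assumption: the tokenizer, the class labels and the four training sentences are invented toy examples, not BindingDB data, and the slides do not say which features or Naïve Bayes variant were actually used.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = tokenize(doc)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each known token.
            score = math.log(n_docs / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for token in tokenize(doc):
                if token in self.vocab:
                    count = self.word_counts[label][token]
                    score += math.log((count + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Toy training data, invented for illustration only.
TRAIN = [
    ("Compounds were tested in a kinase inhibition assay and IC50 values were determined", "assay"),
    ("Binding affinity was measured by radioligand displacement assay", "assay"),
    ("The mixture was stirred at room temperature and the solvent was evaporated", "other"),
    ("The residue was purified by column chromatography", "other"),
]
model = NaiveBayes().fit([text for text, _ in TRAIN], [label for _, label in TRAIN])
```

Calling `model.predict(paragraph)` on a patent paragraph returns either "assay" or "other" under these toy labels; a real system would need a far larger curated training set and cross-validation, as the slide reports.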
  21. Results
  22. Results from US patent applications (2001-Mar 2017). Red = bioactivity
  23. Activities with associated structures per year [Chart: activity-structure relationships extracted (y-axis, 0 to 600,000) against publication year (x-axis)]
  24. Comparison with BindingDB • Activity data from ~1500 US patent grants (2013-2016) manually extracted over the course of 3 years • ~150,000 activities • Comparison done on the subset that was made available in ChEMBL 22_1 (98,898 activity values, 1012 patents) • As some assay results are missed by the automatic extraction, and some are considered out of scope by BindingDB, it is difficult to distinguish differences in coverage from genuine disagreements
  25. Comparison with BindingDB • Values normalized into nM – 1000s of instances of measurements in nanometers! • Midpoint of ranges taken • Structures compared by StdInChI • Target name normalized to ChEMBL target ID (organism specific), using either: – ChEMBL target synonyms – Normalize to HGNC symbol and check if HGNC symbol is a ChEMBL target synonym
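The value normalization on slide 25 (conversion to nM, midpoint of ranges, tolerance of "nm" written for nM) could look roughly like the following Python sketch. The function name `to_nanomolar`, the accepted unit spellings and the range syntax are assumptions made for illustration; the slides do not describe the actual parsing rules.

```python
import re

# Conversion factors to nM, keyed by lower-cased unit. Lower-casing means a
# cell written "nm" (nanometres) is read as nM, the error slide 25 mentions.
_TO_NM = {
    "pm": 1e-3,
    "nm": 1.0,
    "um": 1e3,
    "\u00b5m": 1e3,  # micro sign
    "\u03bcm": 1e3,  # Greek small letter mu
    "mm": 1e6,
    "m": 1e9,
}

# A number, an optional "-", en dash or "to" plus a second number, then a unit.
_VALUE = re.compile(
    r"^\s*(?P<low>\d+(?:\.\d+)?)"
    r"(?:\s*(?:-|\u2013|to)\s*(?P<high>\d+(?:\.\d+)?))?"
    r"\s*(?P<unit>[pnu\u00b5\u03bcm]?M)\s*$",
    re.IGNORECASE,
)


def to_nanomolar(text):
    """Parse a concentration or range and return its value in nM, or None."""
    match = _VALUE.match(text)
    if match is None:
        return None
    value = float(match.group("low"))
    if match.group("high") is not None:
        # Ranges are collapsed to their midpoint.
        value = (value + float(match.group("high"))) / 2
    return value * _TO_NM[match.group("unit").lower()]
```

Because units are lower-cased before lookup, this sketch cannot tell "mM" from a mistyped "MM"; a production parser would need context, such as the column header, to resolve such cases.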
  26. Comparison • Expected values found: 75% • Expected structures found: 65% • Expected value + structure found: 53% • Expected value + structure + target: 18%
  27. Unclear structure assignment [figure]
  28. Stereochemistry and salts [Figure: the same compound as drawn in the patent, in BindingDB and in this work]
  29. Long tail of difficult cases: What does this superscript term mean? What are the units?
  30. Targets of patent data compared to journal data [Charts: common target classes as % per year, 2002-2016, for ChEMBL 22_1 (excluding BindingDB) and for US patent applications; classes shown: Kinase, GPCR (Family A), Protease, Nuclear receptor, Voltage-gated ion channel, Electrochemical transporter, Oxidoreductase]
  31. Upcoming target classes [Chart: percentage of documents with activity values against target class (y-axis, 0.0% to 3.0%) per year, 2002-2016; series: Epigenetic writer (Patents), Epigenetic reader (Patents), Epigenetic writer (ChEMBL ex BindingDB), Epigenetic reader (ChEMBL ex BindingDB)]
  32. Future work • Support for more complex R-group tables • Improve recognition and resolution of protein target names • Support for activities specified in text e.g. “Example 1 has an IC50 of 12 nM measured at rat EP4” • Resolution of symbols for activity ranges e.g. “A” indicates an IC50 value of less than 100 nM • Improve assay metadata extraction cf. BioAssay Express
  33. Disambiguation of conflicting structure descriptions [Figure: image from original filing; structure redrawn by US patent office in ChemDraw; intended structure from chemical name]
  34. Conclusions • Processing all US patents from 2001 to present can be done in less than a day on a desktop PC • Technique applicable to chemical properties other than activity values • Compound number <-> structure relationships useful for key compound identification • For the majority of patents, extracting structure-activity relationships can be significantly expedited
  35. Acknowledgements • Noel O'Boyle • John Mayfield • Funding provided by:
  36. Thank you for your time! http://nextmovesoftware.com http://nextmovesoftware.com/blog daniel@nextmovesoftware.com

×