Jean-Claude Bradley Memorial Symposium, University of Cambridge, 14th July 2014

From Open text mining solutions to Open Data resources

OPSIN (Open Parser for Systematic IUPAC nomenclature) has developed into a mature solution for chemical name to structure conversion. Together with other Open Source utilities such as OSCAR4, ChemSpot, and ChemicalTagger, we now have the tools to address many of the problems in chemical text mining. This ecosystem of tools has facilitated the extraction of over a million reactions from the US patent literature, which are now freely available to all under CC-Zero. I will describe advances in OPSIN, explain how reactions can be extracted from text, and present some interesting analyses made possible by the public availability of this dataset.


  1. From Open text mining solutions to Open Data resources. Daniel Lowe, NextMove Software, Cambridge, UK
  2. The idea: Accessible text (e.g. US patents) feeds a Reaction Extraction System, which produces an Open Reaction Data resource
  3. Building on existing projects
  4. What is chemical name to structure?
     Example name: (2S)-2-Aminobutan-1-ol, labelled part by part:
     (2S)- : stereochemistry
     2- : locant
     amino : substituent
     but : alk
     an : unsaturation
     -1- : locant
     ol : suffix
  5. Supported chain nomenclature
     Alkanes: dodectetractkiliane
     Heteroatom hydrides: pentaphosphane
     Heterogeneous heteroatom hydrides: disilazane
     Trivial acids: butyric acid
  6. Supported ring nomenclature
     Monocyclic spiro: dispiro[4.2.4.2]tetradecane
     Hantzsch-Widman: 1,3,5-triazine
     Fused ring: furo[3,2-b]thieno[2,3-e]pyridine
     Ring assembly: 2,2':6',2''-terpyridyl
     Von Baeyer: tricyclo[2.2.1.1^2,5]octane
     Polycyclic spiro: spiro[piperidine-4,9'-xanthene]
  7. Structural assembly nomenclature
     Conjunctive nomenclature: benzeneethanol
     Substitutive nomenclature: 2,4,6-trinitrotoluene
     Additive nomenclature: methylsulfonyl
     Multiplicative nomenclature: 4,4'-methylenedioxydibenzoic acid
     Functional class nomenclature: ethyl alcohol
  8. Structural modifications
     Heteroatom replacement: 1-thia-4-aza-2,6-disilacyclohexane
     Unsaturation: hexa-1,3-dien-5-yne
     Hydro, dehydro, indicated hydrogen and added hydrogen: 2,7-dihydro-1H-azepine
     Suffixes including infixed suffixes: methanedithioic acid
     Functional replacement: 1-chloro-2,4-diimidotricarbonic acid
     Lambda convention: 2λ^6-trisulfane
  9. Bridges and stereochemistry
     Bridges: 4a,8a-propanoquinoline
     E/Z stereochemistry: (Z)-2-chloro-but-2-ene
     Relative cis/trans stereochemistry: trans-2,6-dimethyl-2,6-dihydronaphthalene
     R/S stereochemistry: (1R,3S)-3-amino-3-methylcyclohexanol
  10. Miscellaneous nomenclature
      Groups with indeterminately positioned structural features: 1,3-xylene
      Charge and oxidation numbers: methylmercury(1+) or methylmercury(II)
      Subtractive nomenclature: 2-deoxy-ᴅ-ribose
      “per-nomenclature”: perhydroanthracene, perchlorobenzene
  11. Polymer nomenclature
      Structure-based polymer nomenclature: poly[(benzo[1,2-d:4,5-d']bis[1,3]thiazole-2,6-diyl)-1,4-phenyleneoxy-1,3-phenylene(1,3,5,7-tetraoxo-1,2,3,5,6,7-hexahydrobenzo[1,2-c:4,5-c']dipyrrole-2,6-diyl)-1,3-phenyleneoxy-1,4-phenylene]
  12. Domain specific nomenclature
      Steroid nomenclature: 17β-Hydroxy-8α,9β,10α-androst-4-en-3-one
      Amino acid: ʟ-leucinamide
      Oligopeptide: ʟ-arginyl-O-phosphono-ʟ-seryl-ʟ-alanyl-ʟ-proline
      Cyclic peptide: cyclo(ᴅ-alanyl-ʟ-phenylalanyl)
      Nucleotide nomenclature: guanylyl(3'-5')uridine 3'-monophosphate
  13. Carbohydrates
      ʟ-ribo-ᴅ-manno-nonose
      2,7-anhydro-D-glycero-β-D-galacto-oct-2-ulopyranosonic acid
      β-ᴅ-Fructofuranosyl α-ᴅ-glucopyranoside
      β-ᴅ-glucopyranosyl-(1→3)-β-ᴅ-glucopyranosyl-(1→3)-ᴅ-glucopyranose
  14. Usage
      Batch conversion on the command line:
          java -jar opsin-1.6.0-jar-with-dependencies.jar -osmi input.txt output.smi
      RESTful web service (opsin.ch.cam.ac.uk)
      Java API:
          NameToStructure nts = NameToStructure.getInstance();
          String chemicalName = "acetonitrile";
          String smiles = nts.parseToSmiles(chemicalName);
  15. Who is using OPSIN?
      • Commercial software
      • Cinfony (interface to Python)
      • Many text mining efforts
      • Workflows
      • Web services
  16. Steps involved
      • Identifying experimental sections
      • Identifying chemical entities
      • Chemical name to structure conversion (including anaphora resolution)
      • Associating chemical entities with quantities
      • Assigning chemical roles
      • Atom-atom mapping
  17. Example
      Methyl 4-[(pentafluorophenoxy)sulfonyl]benzoate
      To a solution of methyl 4-(chlorosulfonyl)benzoate (606 mg, 2.1 mmol, 1 eq) in DCM (35 ml) was added pentafluorophenol (412 mg, 2.2 mmol, 1.1 eq) and Et3N (540 mg, 5.4 mmol, 2.5 eq) and the reaction mixture stirred at room temperature until all of the starting material was consumed. The solvent was evaporated in vacuo and the residue redissolved in ethyl acetate (10 ml), washed with water (10 ml), saturated sodium hydrogen carbonate (10 ml), dried over sodium sulphate, filtered and evaporated to yield the title compound as a white solid (690 mg, 1.8 mmol, 85%).
  18. Graphical Output
  19. CML output
      <reaction xmlns="http://www.xml-cml.org/schema" xmlns:cmlDict="http://www.xml-cml.org/dictionary/cml/" xmlns:nameDict="http://www.xml-..
        <dl:reactionSmiles>Cl[S:2]([c:5]1[cH:14][cH:13][c:8]([C:9]([O:11][CH3:12])=[O:10])[cH:7][cH:6]1)(=[O:4])=[O:3].[F:15][c:16]1[c:21]([OH:22])[c:20]([..
        <productList>
          <product role="product">
            <molecule id="m0">
              <name dictRef="nameDict:unknown">title compound</name>
            </molecule>
            <amount units="unit:mmol">1.8</amount>
            <amount units="unit:mg">690</amount>
            <amount units="unit:percentYield">85.0</amount>
            <identifier dictRef="cml:smiles" value="FC1=C(C(=C(C(=C1OS(=O)(=O)C1=CC=C(C(=O)OC)C=C1)F)F)F)F"/>
            <identifier dictRef="cml:inchi" value="InChI=1/C14H7F5O5S/c1-23-14(20)6-2-4-7(5-3-6)25(21,22)24-13-11(18)9(16)8(15)10(17)12(13)19/h2-5H..
            <dl:entityType>definiteReference</dl:entityType>
            <dl:state>solid</dl:state>
          </product>
        </productList>
        <reactantList>
          <reactant role="reactant" count="1">
            <molecule id="m1">
              <name dictRef="nameDict:unknown">methyl 4-(chlorosulfonyl)benzoate</name>
            </molecule>
            <amount units="unit:mmol">2.1</amount>
            <amount units="unit:mg">606</amount>
            <amount units="unit:eq">1.0</amount>
            <identifier dictRef="cml:smiles" value="ClS(=O)(=O)C1=CC=C(C(=O)OC)C=C1"/>
      Annotations on the slide:
      • Reaction SMILES
      • Quantities including yield are extracted
      • Entity is classified as an exact compound, definite reference, chemical class or fragment
      • SMILES and InChIs for every structure resolvable reagent/product
  20. Current status
      • ~1 million reactions from US patent applications (2001-2013)
      • ~1 million reactions from US patent grants (1976-2013)
      • At minimum over a million constitutionally distinct reactions
  21. Current status: https://bitbucket.org/dan2097/patent-reaction-extraction/downloads
  22. Identify Synthetic Routes
      [Chart: occurrences (axis 0 to 700000) against number of steps, split into Intermediates and Terminal Products]

      | Number of steps | Intermediates | Terminal Products |
      |-----------------|---------------|-------------------|
      | 1               | 197702        | 385149            |
      | 2               | 103114        | 149445            |
      | 3               | 56611         | 81837             |
      | 4               | 31403         | 47579             |
      | 5               | 17268         | 27670             |
      | 6               | 9230          | 16619             |
      | 7               | 5057          | 9320              |
      | 8               | 2701          | 5263              |
      | 9               | 1256          | 2511              |
      | 10              | 639           | 1330              |
      | 11              | 301           | 678               |
      | 12              | 136           | 373               |
      | 13              | 58            | 111               |
      | 14              | 15            | 63                |
      | 15              | 5             | 8                 |
      | 16              | 2             | 6                 |
      | 17              | (none listed) | 5                 |
  23. Trends in Reaction Types
      [Chart: Suzuki couplings as a percentage of reactions in a year, 1976 to 2013; vertical axis 0.0% to 8.0%]
  24. Trends In Solvent Use
      [Chart: percentage of reactions in that year using each solvent, 1976 to 2013; vertical axis 0.0% to 20.0%; series: Tetrahydrofuran, Dichloromethane, Water, Dimethylformamide, Methanol, Ethyl acetate, Ethanol, 1,4-Dioxane, Toluene, Acetonitrile, Acetic acid, Chloroform, Acetone, Benzene]
  25. Are solvents getting greener?

      | Rank             | 1976                   | 2013                    |
      |------------------|------------------------|-------------------------|
      | 1                | Water (21%)            | Tetrahydrofuran (15%)   |
      | 2                | Ethanol (11%)          | Dichloromethane (14%)   |
      | 3                | Benzene (8%)           | Water (13%)             |
      | 4                | Methanol (7%)          | Dimethylformamide (10%) |
      | 5                | Tetrahydrofuran (5%)   | Methanol (8%)           |
      | 6                | Dichloromethane (4%)   | Ethyl acetate (7%)      |
      | 7                | Dimethylformamide (4%) | Ethanol (5%)            |
      | 8                | Acetic acid (4%)       | 1,4-Dioxane (4%)        |
      | 9                | Chloroform (3%)        | Toluene (3%)            |
      | 10               | Acetone (3%)           | Acetonitrile (3%)       |
      | Total for top 10 | 71%                    | 82%                     |
  26. Conclusions
      • Open Source tools facilitate reuse and remixing of code
      • Open Data allows reuse in an infinite number of potential applications and analyses
  27. Acknowledgements
      • Albina Asadulina
      • Peter Corbett
      • Robert Glen
      • David Jessop
      • Lezan Hawizy
      • Peter Murray-Rust
      • Roger Sayle
  28. Thank you for your time!
      http://nextmovesoftware.com
      http://nextmovesoftware.com/blog
      daniel@nextmovesoftware.com
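
Slide 16 lists "Associating chemical entities with quantities" as one extraction step, and slide 17 gives a paragraph in which each reagent is followed by a bracketed list of amounts. The following is a toy sketch of that one step only, not the method used by the extraction system described in the talk. The regular expressions, the unit list, and the function name `extract_quantities` are all assumptions made for illustration, and the "entity" captured is just the last whitespace-delimited token before the bracket.

```python
import re

# Units that occur in the slide 17 paragraph; a real system needs many more.
UNIT = r"(?:mg|mmol|ml|eq|%)"
NUMBER = r"\d+(?:\.\d+)?"

# A token followed by a bracketed, comma-separated list of "number unit" items.
ENTITY_WITH_QUANTITIES = re.compile(
    rf"(\S+)\s+\(((?:\s*{NUMBER}\s*{UNIT}\s*,?)+)\)"
)
QUANTITY = re.compile(rf"({NUMBER})\s*({UNIT})")

# First sentence fragment of the example on slide 17.
TEXT = (
    "To a solution of methyl 4-(chlorosulfonyl)benzoate (606 mg, 2.1 mmol, 1 eq) "
    "in DCM (35 ml) was added pentafluorophenol (412 mg, 2.2 mmol, 1.1 eq) "
    "and Et3N (540 mg, 5.4 mmol, 2.5 eq)"
)


def extract_quantities(text):
    """Return (preceding token, [(value, unit), ...]) for each bracketed amount list."""
    results = []
    for match in ENTITY_WITH_QUANTITIES.finditer(text):
        token, amounts = match.group(1), match.group(2)
        parsed = [(float(value), unit) for value, unit in QUANTITY.findall(amounts)]
        results.append((token, parsed))
    return results


if __name__ == "__main__":
    for token, amounts in extract_quantities(TEXT):
        print(token, amounts)
```

The first hit is attached to the token `4-(chlorosulfonyl)benzoate` rather than the full name `methyl 4-(chlorosulfonyl)benzoate`, which illustrates why chemical entity recognition has to run before quantities are associated.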
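
Slide 16 ends with atom-atom mapping, and the reaction SMILES on slide 19 carries the result as `:n` labels inside bracket atoms. As a small illustration, the sketch below pulls those map numbers out of the first reactant shown on slide 19. The regular expression and the function name `atom_map_numbers` are assumptions for this example; it is not a SMILES parser and does nothing beyond reading the labels.

```python
import re

# Matches a bracket atom that carries an atom-map label, e.g. [CH3:12].
ATOM_MAP = re.compile(r"\[[^\]]*?:(\d+)\]")

# First reactant of the reaction SMILES shown on slide 19 (the rest is truncated there).
REACTANT = (
    "Cl[S:2]([c:5]1[cH:14][cH:13][c:8]([C:9]([O:11][CH3:12])=[O:10])"
    "[cH:7][cH:6]1)(=[O:4])=[O:3]"
)


def atom_map_numbers(smiles):
    """Return the atom-map numbers in the order they appear in the string."""
    return [int(number) for number in ATOM_MAP.findall(smiles)]


if __name__ == "__main__":
    print(sorted(atom_map_numbers(REACTANT)))
```

In this fragment the chlorine atom is written without a map number while every other heavy atom has one, which is what one would expect for an atom that does not appear in the product.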
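
The CML shown on slide 19 stores each product's name, amounts, and identifiers as child elements. A minimal sketch of reading those fields with Python's standard library follows. The `SAMPLE` document is a cut-down fragment assembled from the elements visible on the slide; the `dl:` elements are left out because their namespace URI is truncated in the source, and the function name `summarise_products` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Default namespace taken from the <reaction> element on slide 19.
CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Cut-down fragment built from the slide; not a complete record from the dataset.
SAMPLE = """<reaction xmlns="http://www.xml-cml.org/schema">
  <productList>
    <product role="product">
      <molecule id="m0">
        <name dictRef="nameDict:unknown">title compound</name>
      </molecule>
      <amount units="unit:mmol">1.8</amount>
      <amount units="unit:mg">690</amount>
      <amount units="unit:percentYield">85.0</amount>
      <identifier dictRef="cml:smiles"
                  value="FC1=C(C(=C(C(=C1OS(=O)(=O)C1=CC=C(C(=O)OC)C=C1)F)F)F)F"/>
    </product>
  </productList>
</reaction>"""


def summarise_products(cml_text):
    """Return one dict per <product> with its name, amounts by unit, and SMILES."""
    root = ET.fromstring(cml_text)
    products = []
    for product in root.findall(".//cml:product", CML_NS):
        name = product.findtext("cml:molecule/cml:name", namespaces=CML_NS)
        amounts = {
            amount.get("units"): float(amount.text)
            for amount in product.findall("cml:amount", CML_NS)
        }
        smiles = next(
            (
                identifier.get("value")
                for identifier in product.findall("cml:identifier", CML_NS)
                if identifier.get("dictRef") == "cml:smiles"
            ),
            None,
        )
        products.append({"name": name, "amounts": amounts, "smiles": smiles})
    return products


if __name__ == "__main__":
    print(summarise_products(SAMPLE))
```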
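
Slide 22 counts products by the number of steps in their synthetic route and splits them into intermediates and terminal products. The slide does not say how those counts were computed, so the sketch below is only one plausible reading: reactions are linked wherever the product of one is a reactant of another, the step count is the longest such chain, and a product is an "intermediate" if some other reaction consumes it. The function name `classify_products` and the single-letter compound identifiers are hypothetical; in practice one would key on something like the InChI.

```python
from collections import Counter


def classify_products(reactions):
    """reactions: iterable of (reactant_ids, product_id).

    Returns {product_id: (steps, "intermediate" | "terminal")}.
    """
    made_by = {}
    for reactants, product in reactions:
        made_by.setdefault(product, []).append(tuple(reactants))
    consumed = {r for reactants, _ in reactions for r in reactants}
    depth_cache = {}

    def depth(compound, visiting):
        # Starting materials (never produced) count as zero steps.
        if compound in depth_cache:
            return depth_cache[compound]
        if compound not in made_by or compound in visiting:
            return 0  # the second condition guards against cycles in noisy data
        visiting = visiting | {compound}
        longest = 1 + max(
            max((depth(r, visiting) for r in reactants), default=0)
            for reactants in made_by[compound]
        )
        depth_cache[compound] = longest
        return longest

    return {
        product: (
            depth(product, frozenset()),
            "intermediate" if product in consumed else "terminal",
        )
        for product in made_by
    }


if __name__ == "__main__":
    toy = [(("A", "B"), "C"), (("C", "D"), "E"), (("E",), "F")]
    labelled = classify_products(toy)
    print(labelled)
    # Tally in the same shape as the slide: (steps, kind) -> occurrences.
    print(Counter(labelled.values()))
```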
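
Slide 25 compares the ten most used solvents in 1976 and 2013. The short check below re-adds the percentages printed on the slide and lists which solvents left or joined the top ten. The assignment of each figure to a year assumes the flattened transcript alternates 1976 and 2013 entries, which is a reconstruction rather than something the source states. The printed values are rounded, so their sum need not match the slide's own totals exactly: the 1976 figures add up to 70 against a stated 71%.

```python
# Percentages as printed on slide 25 (rounded to whole numbers on the slide).
TOP10_1976 = {
    "Water": 21, "Ethanol": 11, "Benzene": 8, "Methanol": 7,
    "Tetrahydrofuran": 5, "Dichloromethane": 4, "Dimethylformamide": 4,
    "Acetic acid": 4, "Chloroform": 3, "Acetone": 3,
}
TOP10_2013 = {
    "Tetrahydrofuran": 15, "Dichloromethane": 14, "Water": 13,
    "Dimethylformamide": 10, "Methanol": 8, "Ethyl acetate": 7,
    "Ethanol": 5, "1,4-Dioxane": 4, "Toluene": 3, "Acetonitrile": 3,
}

total_1976 = sum(TOP10_1976.values())
total_2013 = sum(TOP10_2013.values())

# Solvents that dropped out of, or entered, the top ten between the two years.
dropped = sorted(set(TOP10_1976) - set(TOP10_2013))
entered = sorted(set(TOP10_2013) - set(TOP10_1976))

if __name__ == "__main__":
    print("Sum of printed 1976 percentages:", total_1976)
    print("Sum of printed 2013 percentages:", total_2013)
    print("Left the top ten:", dropped)
    print("Joined the top ten:", entered)
```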
