From Open text mining solutions to Open Data resources

OPSIN (Open Parser for Systematic IUPAC Nomenclature) has developed into a mature solution for chemical name to structure conversion. Together with other Open Source utilities such as OSCAR4, ChemSpot, and ChemicalTagger, we now have the tools to address many of the problems in chemical text mining. This ecosystem of tools has facilitated the extraction of over a million reactions from the US patent literature, which are now freely available to all under CC-Zero. I will describe advances in OPSIN, explain how reactions can be extracted from text, and present some interesting analyses that are made possible by the public availability of this dataset.

Published in: Software

Transcript

  • 1. From Open text mining solutions to Open Data resources. Daniel Lowe, NextMove Software, Cambridge, UK. Jean-Claude Bradley Memorial Symposium, University of Cambridge, 14th July 2014.
  • 2. The idea: Accessible text (e.g. US patents) → Reaction Extraction System → Open Reaction Data resource
  • 3. Building on existing projects
  • 4. What is chemical name to structure? Example: (2S)-2-Aminobutan-1-ol, with its parts labelled: (2S) stereochemistry; 2 locant; Amino substituent; but alk; an unsaturation; 1 locant; ol suffix. [Structure diagram showing the NH2 group.]
  • 5. Supported chain nomenclature. Alkanes: dodectetractkiliane; Heteroatom hydrides: pentaphosphane; Heterogeneous heteroatom hydrides: disilazane; Trivial acids: butyric acid
  • 6. Supported ring nomenclature. Monocyclic spiro: dispiro[4.2.4.2]tetradecane; Hantzsch-Widman: 1,3,5-triazine; Fused ring: furo[3,2-b]thieno[2,3-e]pyridine; Ring assembly: 2,2':6',2''-terpyridyl; Von Baeyer: tricyclo[2.2.1.12,5]octane; Polycyclic spiro: spiro[piperidine-4,9'-xanthene]
  • 7. Structural assembly nomenclature. Conjunctive nomenclature: benzeneethanol; Substitutive nomenclature: 2,4,6-trinitrotoluene; Additive nomenclature: methylsulfonyl; Multiplicative nomenclature: 4,4'-methylenedioxydibenzoic acid; Functional class nomenclature: ethyl alcohol
  • 8. Structural modifications. Heteroatom replacement: 1-thia-4-aza-2,6-disilacyclohexane; Unsaturation: hexa-1,3-dien-5-yne; Hydro, dehydro, indicated hydrogen and added hydrogen: 2,7-dihydro-1H-azepine; Functional replacement: 1-chloro-2,4-diimidotricarbonic acid; Suffixes including infixed suffixes: methanedithioic acid; Lambda convention: 2λ⁶-trisulfane
  • 9. Bridges and stereochemistry. Bridges: 4a,8a-propanoquinoline; E/Z stereochemistry: (Z)-2-chloro-but-2-ene; Relative cis/trans stereochemistry: trans-2,6-dimethyl-2,6-dihydronaphthalene; R/S stereochemistry: (1R,3S)-3-amino-3-methylcyclohexanol
  • 10. Miscellaneous nomenclature. Groups with indeterminately positioned structural features: 1,3-xylene; Charge and oxidation numbers: methylmercury(1+) or methylmercury(II); “per-nomenclature”: perhydroanthracene, perchlorobenzene; Subtractive nomenclature: 2-deoxy-ᴅ-ribose
  • 11. Polymer nomenclature. Structure-based polymer nomenclature: poly[(benzo[1,2-d:4,5-d']bis[1,3]thiazole-2,6-diyl)-1,4-phenyleneoxy-1,3-phenylene(1,3,5,7-tetraoxo-1,2,3,5,6,7-hexahydrobenzo[1,2-c:4,5-c']dipyrrole-2,6-diyl)-1,3-phenyleneoxy-1,4-phenylene]
  • 12. Domain specific nomenclature. Steroid nomenclature: 17β-Hydroxy-8α,9β,10α-androst-4-en-3-one; Amino acid: ʟ-leucinamide; Oligopeptide: ʟ-arginyl-O-phosphono-ʟ-seryl-ʟ-alanyl-ʟ-proline; Cyclic peptide: cyclo(ᴅ-alanyl-ʟ-phenylalanyl); Nucleotide nomenclature: guanylyl(3'-5')uridine 3'-monophosphate
  • 13. Carbohydrates: ʟ-ribo-ᴅ-manno-nonose; 2,7-anhydro-D-glycero-β-D-galacto-oct-2-ulopyranosonic acid; β-ᴅ-Fructofuranosyl α-ᴅ-glucopyranoside; β-ᴅ-glucopyranosyl-(1→3)-β-ᴅ-glucopyranosyl-(1→3)-ᴅ-glucopyranose
  • 14. Usage. Java API: NameToStructure nts = NameToStructure.getInstance(); String chemicalName = "acetonitrile"; String smiles = nts.parseToSmiles(chemicalName); Batch conversion on the command line: java -jar opsin-1.6.0-jar-with-dependencies.jar -osmi input.txt output.smi; RESTful web service (opsin.ch.cam.ac.uk)
  • 15. Who is using OPSIN? Commercial software; Cinfony (interface to Python); Many text mining efforts; Workflows; Web services
  • 16. Steps involved: identifying experimental sections; identifying chemical entities; chemical name to structure conversion (including anaphora resolution); associating chemical entities with quantities; assigning chemical roles; atom-atom mapping
  • 17. Example: Methyl 4-[(pentafluorophenoxy)sulfonyl]benzoate. To a solution of methyl 4-(chlorosulfonyl)benzoate (606 mg, 2.1 mmol, 1 eq) in DCM (35 ml) was added pentafluorophenol (412 mg, 2.2 mmol, 1.1 eq) and Et3N (540 mg, 5.4 mmol, 2.5 eq) and the reaction mixture stirred at room temperature until all of the starting material was consumed. The solvent was evaporated in vacuo and the residue redissolved in ethyl acetate (10 ml), washed with water (10 ml), saturated sodium hydrogen carbonate (10 ml), dried over sodium sulphate, filtered and evaporated to yield the title compound as a white solid (690 mg, 1.8 mmol, 85%).
  • 18. Graphical Output
  • 19. CML output. Slide annotations: Reaction SMILES; quantities including yield are extracted; each entity is classified as an exact compound, definite reference, chemical class or fragment; SMILES and InChIs for every structure-resolvable reagent/product. Excerpt (truncated on the slide): <reaction xmlns="http://www.xml-cml.org/schema" xmlns:cmlDict="http://www.xml-cml.org/dictionary/cml/" xmlns:nameDict="http://www.xml-.. <dl:reactionSmiles>Cl[S:2]([c:5]1[cH:14][cH:13][c:8]([C:9]([O:11][CH3:12])=[O:10])[cH:7][cH:6]1)(=[O:4])=[O:3].[F:15][c:16]1[c:21]([OH:22])[c:20]([.. <productList> <product role="product"> <molecule id="m0"> <name dictRef="nameDict:unknown">title compound</name> </molecule> <amount units="unit:mmol">1.8</amount> <amount units="unit:mg">690</amount> <amount units="unit:percentYield">85.0</amount> <identifier dictRef="cml:smiles" value="FC1=C(C(=C(C(=C1OS(=O)(=O)C1=CC=C(C(=O)OC)C=C1)F)F)F)F"/> <identifier dictRef="cml:inchi" value="InChI=1/C14H7F5O5S/c1-23-14(20)6-2-4-7(5-3-6)25(21,22)24-13-11(18)9(16)8(15)10(17)12(13)19/h2-5H.. <dl:entityType>definiteReference</dl:entityType> <dl:state>solid</dl:state> </product> </productList> <reactantList> <reactant role="reactant" count="1"> <molecule id="m1"> <name dictRef="nameDict:unknown">methyl 4-(chlorosulfonyl)benzoate</name> </molecule> <amount units="unit:mmol">2.1</amount> <amount units="unit:mg">606</amount> <amount units="unit:eq">1.0</amount> <identifier dictRef="cml:smiles" value="ClS(=O)(=O)C1=CC=C(C(=O)OC)C=C1"/>
  • 20. Current status: ~1 million reactions from US patent applications (2001-2013); ~1 million reactions from US patent grants (1976-2013); over a million constitutionally distinct reactions at minimum
  • 21. Current status: https://bitbucket.org/dan2097/patent-reaction-extraction/downloads
  • 22. Identify Synthetic Routes. [Bar chart: occurrences against number of steps.] Values by number of steps, given as intermediates / terminal products: 1: 197702 / 385149; 2: 103114 / 149445; 3: 56611 / 81837; 4: 31403 / 47579; 5: 17268 / 27670; 6: 9230 / 16619; 7: 5057 / 9320; 8: 2701 / 5263; 9: 1256 / 2511; 10: 639 / 1330; 11: 301 / 678; 12: 136 / 373; 13: 58 / 111; 14: 15 / 63; 15: 5 / 8; 16: 2 / 6; 17: (no value shown) / 5
  • 23. Trends in Reaction Types. [Chart: Suzuki couplings as a percentage of reactions in a year, 1976 to 2013; vertical axis 0.0% to 8.0%.]
  • 24. Trends In Solvent Use. [Chart: percentage of reactions in that year, 1976 to 2013; vertical axis 0.0% to 20.0%. Series: Tetrahydrofuran, Dichloromethane, Water, Dimethylformamide, Methanol, Ethyl acetate, Ethanol, 1,4-Dioxane, Toluene, Acetonitrile, Acetic acid, Chloroform, Acetone, Benzene.]
  • 25. Are solvents getting greener? Top 10 solvents in 1976: Water (21%), Ethanol (11%), Benzene (8%), Methanol (7%), Tetrahydrofuran (5%), Dichloromethane (4%), Dimethylformamide (4%), Acetic acid (4%), Chloroform (3%), Acetone (3%); total for top 10: 71%. Top 10 solvents in 2013: Tetrahydrofuran (15%), Dichloromethane (14%), Water (13%), Dimethylformamide (10%), Methanol (8%), Ethyl acetate (7%), Ethanol (5%), 1,4-Dioxane (4%), Toluene (3%), Acetonitrile (3%); total for top 10: 82%.
  • 26. Conclusions: Open Source tools facilitate reuse and remixing of code. Open Data allows reuse in an infinite number of potential applications and analyses.
  • 27. Acknowledgements: Albina Asadulina, Peter Corbett, Robert Glen, David Jessop, Lezan Hawizy, Peter Murray-Rust, Roger Sayle
  • 28. Thank you for your time! http://nextmovesoftware.com http://nextmovesoftware.com/blog daniel@nextmovesoftware.com
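
The "associating chemical entities with quantities" step from slide 16 can be illustrated on the slide 17 paragraph. What follows is only a minimal sketch in plain Java and is not the method used by the extraction system described in the talk: the class name QuantityExtractor, its two regular expressions and the short unit list are all assumptions made for illustration. It collects the quantities found inside each parenthesised group and skips groups that belong to a chemical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of OPSIN or the talk's system.
class QuantityExtractor {
    // Innermost parenthesised groups, e.g. "(606 mg, 2.1 mmol, 1 eq)".
    private static final Pattern GROUP = Pattern.compile("\\(([^()]*)\\)");
    // A number followed by one of the units seen in the slide 17 text (assumed list).
    private static final Pattern QUANTITY =
            Pattern.compile("(\\d+(?:\\.\\d+)?)\\s*(mmol|mg|ml|eq|%)");

    // Returns one list of "value unit" strings per parenthesised group that
    // contains at least one quantity; groups without quantities are skipped.
    static List<List<String>> extract(String text) {
        List<List<String>> result = new ArrayList<>();
        Matcher group = GROUP.matcher(text);
        while (group.find()) {
            List<String> quantities = new ArrayList<>();
            Matcher q = QUANTITY.matcher(group.group(1));
            while (q.find()) {
                quantities.add(q.group(1) + " " + q.group(2));
            }
            if (!quantities.isEmpty()) {
                result.add(quantities);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "To a solution of methyl 4-(chlorosulfonyl)benzoate "
                + "(606 mg, 2.1 mmol, 1 eq) in DCM (35 ml) was added "
                + "pentafluorophenol (412 mg, 2.2 mmol, 1.1 eq)";
        System.out.println(extract(text));
    }
}
```

Run on that sentence, the sketch yields three groups, one each for the sulfonyl chloride, the DCM and the pentafluorophenol, while "(chlorosulfonyl)" is ignored because it contains no number with a unit. Deciding which entity each group belongs to, and what role that entity plays, is the harder part and is left out here.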

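The reaction SMILES on slide 19 carries the result of the atom-atom mapping step as ":n" labels inside bracket atoms. As a rough sketch, assuming only that notation, the hypothetical class AtomMapReader below (not taken from the talk or from any library) lists the map numbers in one molecule of such a string. It is applied to the first reactant from the slide, which is the only molecule shown there in full.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class AtomMapReader {
    // A bracket atom with an atom-map label, e.g. "[CH3:12]" gives 12.
    private static final Pattern MAPPED_ATOM =
            Pattern.compile("\\[[^\\]:]*:(\\d+)\\]");

    // Returns the atom-map numbers in the order they appear in the SMILES.
    static List<Integer> mapNumbers(String smiles) {
        List<Integer> maps = new ArrayList<>();
        Matcher m = MAPPED_ATOM.matcher(smiles);
        while (m.find()) {
            maps.add(Integer.parseInt(m.group(1)));
        }
        return maps;
    }

    public static void main(String[] args) {
        // First reactant of the slide 19 reaction SMILES (text before the first ".").
        String reactant = "Cl[S:2]([c:5]1[cH:14][cH:13][c:8]([C:9]([O:11][CH3:12])"
                + "=[O:10])[cH:7][cH:6]1)(=[O:4])=[O:3]";
        System.out.println(mapNumbers(reactant));
    }
}
```

The chlorine atom is written without brackets and therefore has no map number. A plausible reading is that it does not appear in the product, which fits a sulfonyl chloride reacting with a phenol, but the slide itself does not say so.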