Data Mining Dissertations and Adventures and Experiences in the World of Chemistry



This presentation was given at the CLIR/DLF Postdoctoral Fellowship Summer Seminar at Bryn Mawr College in Pennsylvania on July 29th, 2014. The intention was to communicate what we are doing in text and data mining in the domain of chemistry, specifically around mining the RSC publication archive and chemistry dissertations and theses. How would these experiences map over to the humanities?

Published in: Science


  1. 1. Data mining dissertations Adventures and Experiences in the World of Chemistry Antony Williams CLIR/DLF Postdoctoral Fellowship Summer Seminar, July 2014
  2. 2. What a small world…
  3. 3. • Who’s got an ORCID? • Who has heard of, or been involved with, AltMetrics? • Who has edited a Wikipedia page? • Who has direct experience of text mining? • All slides already on Slideshare here: • Before we start….
  4. 4. • Context – why do we want to mine data? • Our experiences in extracting theses: – Text and data mining – Chemistry as an example – Before you start – Resources and tools Contents
  5. 5. • Let’s map together all historical chemistry data and build systems to integrate • Heck, let’s integrate chemistry and biology data and add in disease data too • Let’s model the data and see if we can extract new relationships – quantitative and qualitative • Let’s make it all available on the web Taking on a big challenge…
  6. 6. • We’re going to map the world • We’re going to take photos of as many places as we can and link them together • We’ll let people annotate and curate the map • Then let’s make it available free on the web • We’ll make it available for decision making • Put it on Mobile Devices, Give it Away What about this….
  7. 7. I’m from here…on Google
  8. 8. Wikipedia
  9. 9. Wikipedia
  10. 10. The Power of Contribution
  11. 11. How do you spell Afonwen?
  12. 12. And there’s Denbigh…
  13. 13. • So the world can be mapped… • We can enter a 3D world within the map • We can add annotations • We can use the data, reference it, we can extract it, we can make decisions with it • And we can do it on our lap, in our hands • Let’s do this for chemistry… Whoa…
  14. 14. • Once upon a time we built a database…. In a basement not far away…
  15. 15. ChemSpider
  16. 16. ChemSpider and Data Validation
  17. 17. Dictionary Linking
  18. 18. Dictionary Linking
  19. 19. • This is not new, you know the story… • So much data of value contained within a publication and delivered in a PDF form • “PDF files, and especially unclear licensing, don’t allow me at the data so I can rework, reuse, repurpose, text mine etc.” • “I specialize in XXXX. I want a database of YYYY extracted from publications and made available, for free, with capabilities I need, and the publishers should just do it” Data in a Scientific Publication
  20. 20. It is so difficult to navigate… What’s the structure? Are they in our file? What’s similar? What’s the target? Pharmacology data? Known Pathways? Working On Now? Connections to disease? Expressed in right cell type? Competitors? IP?
  21. 21. • Manage “all” of the chemistry data associated with chemical substances • Data to be downloadable, reusable, interactive • Build a platform that enables the scientist • Data storage, validation, standardization and curation • Collaborative data sharing • Provide data platform that can enable and enhance publishing of scientific papers We set a vision…
  22. 22. • Every compound from every article at RSC is extracted, in a database, and linked • Chemical properties are extracted, databased and used for predictive models • Data tables are downloadable, interactive and not just “dumb-PDFs” • …and what can we extract from chemistry theses too? XXX Years from Now at RSC
  23. 23. • We are seen as one of the repositories for published AND unpublished research data • An intuitive platform for research data management in the cloud • Individual, collaborative and public data management of diverse data in the cloud • …and where all data referenced in a thesis is available at a button click XXX Years from Now at RSC
  24. 24. • But how does it map onto your domain?? So this is chemistry…
  25. 25. Mining as an allegory
  26. 26. • You have a mountain of stuff which contains valuable nuggets • You (more or less) know what you’re looking for • You know what you’re going to do with it once you have it Mining as an allegory - intent
  27. 27. • You get lots of stuff out • It requires sifting and grading • It’s a triumph if you manage to extract 80-90% of what is there • You will go back to the heap and redo it Mining as an allegory - result
  28. 28. • That which is easy to get out - is well known and unlikely to be novel • The novel and interesting stuff is likely to be rare and not easily defined Mining as an allegory - effort
  29. 29. • Do the initial investigations by hand • Send in the machines later • Still needs some humans tweaking Mining as an allegory - automation
  30. 30. Context
  31. 31. • From Utopia Documents team • Good at extracting structure from typeset pdfs • PDFX
  32. 32. OCR recognition • Underlining doesn’t help OCR • In this case it was the only signpost to the department, supervisor and funding details
  33. 33. • Hardcopy • Scanned and OCR’d PDF • PDF derived from Word • Word or LaTeX • …and for OCR not all are born equal • …and of course history and language are a major influence. “Oil of vitriol” Building blocks to mine…
  34. 34. • Ontologies, taxonomies, dictionaries • But these are very domain focussed… • As an example, Open PHACTS spend a lot of effort mapping biology to chemistry to disease over many data sources More building blocks
  35. 35. • Provide a controlled vocabulary – what your data describes, where it came from • Provide a shared vocabulary for integrating with other people’s data What can ontologies do for me?
  36. 36. Questions to ask: (1) Has someone already produced an ontology covering your area? (Places to look: BioPortal, OBO Foundry.) (2) Do they take requests? (3) Are they responsive? (4) Is the ontology kept up to date? It is early days for ontologies, and any ontology will almost certainly be a long way from complete! Best practices: experiences from biomedical ontologies
  37. 37. • Best that these don’t change • Best that everyone calls them the same things • Best that they are unambiguous • Meanwhile, back in the real world What things are you looking for?
  38. 38. • Place names – somewhat ambiguous • Species names – can change with time • Diseases – every pharmaceutical company has a different list • People – can be very ambiguous: Authors and researchers are hard to map…except for Google it seems! How easy?
  39. 39.
  40. 40.
  41. 41. Thankfully people follow…
  42. 42. Google Scholar Citations
  43. 43. ORCID take up???
  44. 44. • All publications easily connected but also – Important in early scientific career – consider every data point contribution, every “research object” – Every article – Every presentation – Thesis and dissertation – Provenance….and feeding AltMetrics So the benefits of ORCIDs?
  45. 45. The Alt-Metrics Manifesto
  46. 46. AltMetrics via Plum Analytics
  47. 47. Usage, Citations, Social Media
  48. 48. Detailed Usage Statistics
  49. 49. Indexed and Searchable
  50. 50. ORCIDS for reputation…
  51. 51. Tinman - mutant fly embryos lack a heart. Van Gogh - hair-like bristles on wings have a swirling pattern. INDY - acronym for I'm Not Dead Yet, they live twice as long as normal; from the scene in the movie "Monty Python and the Holy Grail". Ken and Barbie - males and females lack external genitalia. Tribbles - some cells divide uncontrollably. Cheap date - flies are extra-sensitive to alcohol. Cleopatra - flies die when Cleopatra gene interacts with another gene, Asp. Kojak - no hairs on wings. Maggie - fly development is arrested; named after Maggie Simpson, whose development also seems to be arrested. Oh my... Fruit fly gene names
  52. 52. • those that belong to the Emperor, • embalmed ones, • those that are trained, • suckling pigs, • mermaids, • fabulous ones, • stray dogs, • those included in the present classification, • those that tremble as if they were mad, • innumerable ones, • those drawn with a very fine camelhair brush, • others, • those that have just broken a flower vase, • those that from a long way off look like flies. Allegedly from “Celestial Emporium of Benevolent Knowledge” The Analytical Language of John Wilkins, Jorge Luis Borges Animal classification
  53. 53. • Are you just identifying entities? • Are you looking for sentiment? • In chemistry names will lead you to a recipe for synthesis, and analytical data about that compound Classification after “things”
  54. 54. • Used to aid discovery - directly • Used to aid discovery - indirectly • Extract data in electronic form for reuse • Needs to be use case driven – why, then what/how comes later End result
  55. 55. • Automation can give good results • Especially looked at in bulk • Less easy to judge at the article level • People accept discovery is fuzzy • Not so with data points • (but maybe can screen out) Quality
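Judging automation "in bulk" usually means comparing what was extracted against a hand-curated sample. The sketch below is one common way to do that, using precision and recall; it is not taken from the talk, and the two compound lists are hypothetical examples.

```python
def precision_recall(extracted, gold):
    """Compare automatically extracted entities with a hand-curated gold list.

    precision = fraction of extracted items that are correct
    recall    = fraction of gold items that were found
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical data: what a curator found by hand vs. what the miner returned.
GOLD = ["iloprost", "2-heptyne", "thionyl chloride", "benzene", "epichlorohydrin"]
EXTRACTED = ["iloprost", "benzene", "thionyl chloride", "epichlorohydrin", "reflux"]

precision, recall = precision_recall(EXTRACTED, GOLD)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A recall figure from a check like this is the kind of number behind the "80-90% of what is there" remark on slide 27, while precision captures how much sifting and grading remains.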
  56. 56. • Chemical names are both difficult and rewarding. • Difficult in the sense that they can break standard software. • Rewarding in the sense that you can extract useful information about the molecule they’re referring to without a dictionary. • Some examples… Chemistry-specific challenges and opportunities
  57. 57. • …and it gets worse
  58. 58. A series of mono and di-N-2,3-epoxypropyl N-phenylhydrazones have been prepared on a large scale by reaction of the corresponding N-phenylhydrazones of 9-ethyl-3-carbazolecarbaldehyde, 9-ethyl-3,6-carbazoledicarbaldehyde, 4-dimethyl-amino-, 4-diethylamino-, 4-benzylethylamino-, 4-(diphenylamino)-, 4-(4,4-4′-dimethyl-diphenylamino)-, 4-(4-formyldiphenylamino)- and 4-(4-formyl-4′-methyldiphenyl-amino)benzaldehyde with epichlorohydrin in the presence of KOH and anhydrous Na(2)SO(4). From Molecules, via the BioNLP list Annotate this...
  59. 59. How many explicit compounds? • How many numbered compounds actually are named in a given paper? • iloprost (1) • tributyl-1-hexynylstannane (2) • the desired 2-heptyne (3) • methyl–Pd(II) iodide 4 or 4′ • alkynylstannane 5 • the hypervalent stannate 6 • (alkynyl)(methyl)Pd(II) complex 7 • the desired methylalkyne 8 • compounds 9–14 • the stannyl precursors 15 and 16 • methylated compounds 17 and 18 • stannyl precursor 19 • iloprost methyl ester 20 • “iloprost methyl ester” is the real name, but you need to know that iloprost is a monocarboxylic acid!
  60. 60. Names from structures • Systematic names can be generated FROM chemical structures algorithmically
  61. 61. General-purpose parsers do NOT get chemical names Visualization by using d3.js; parsing by Stanford’s CoreNLP.
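To make the point concrete without a full parser, here is a minimal sketch of how a punctuation-splitting tokenizer shreds a systematic name. It does not use CoreNLP; both tokenizers are simplified stand-ins written for illustration, and the sentence is adapted from the hydrazone abstract quoted on slide 58.

```python
import re


def naive_tokenize(text):
    """Split on every non-alphanumeric character, as a simple general-purpose tokenizer might."""
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", text) if tok]


def chem_aware_tokenize(text):
    """Split on whitespace only, then strip trailing sentence punctuation.

    Hyphens, digits and brackets inside a token are kept, so many
    systematic names survive as a single token.
    """
    tokens = (tok.rstrip(".,;:") for tok in text.split())
    return [tok for tok in tokens if tok]


SENTENCE = "Reaction of 9-ethyl-3-carbazolecarbaldehyde with epichlorohydrin."

print(naive_tokenize(SENTENCE))
print(chem_aware_tokenize(SENTENCE))
```

The whitespace-only version is still far from sufficient: names that contain spaces, or names broken across lines by typesetting, will defeat it, which is why chemistry-specific tools exist.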
  62. 62. But names can reverse back to structures…
  63. 63. • OPSIN (chemical name to structure) Tools to try
  64. 64. Not all names are systematic.. Antony Williams vs Identifiers Passport ID Dad, Tony, others SSN Green Card License 5 email addresses ChemSpiderman (blog, Twitter account, Facebook, Friendfeed) OpenID ….
  65. 65. Many Names, One Structure
  66. 66. Aspirin on ChemSpider
  67. 67. Unique Structure Identifiers
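The "many names, one structure" idea can be sketched as a synonym dictionary that resolves every name to a single structure identifier. This is only an illustration of the principle, not how ChemSpider is implemented; the dictionary is hand-built for the example, and the identifier shown is the standard InChIKey for aspirin.

```python
# Standard InChIKey for aspirin, used here as the unique structure identifier.
ASPIRIN_INCHIKEY = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"

# Hypothetical, hand-built synonym dictionary: many names, one structure.
SYNONYMS = {
    "aspirin": ASPIRIN_INCHIKEY,
    "acetylsalicylic acid": ASPIRIN_INCHIKEY,
    "2-acetoxybenzoic acid": ASPIRIN_INCHIKEY,
    "2-(acetyloxy)benzoic acid": ASPIRIN_INCHIKEY,
}


def normalize(name):
    """Lower-case the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def lookup(name):
    """Return the structure identifier for a name, or None if it is not in the dictionary."""
    return SYNONYMS.get(normalize(name))


print(lookup("Aspirin"))
print(lookup("Acetylsalicylic  Acid"))
print(lookup("oil of vitriol"))  # not in this toy dictionary
```

Names the dictionary does not know return `None`; those are the cases that would be passed on to automated name-to-structure conversion.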
  68. 68. Structure Searching the Web
  69. 69. Certainly happens with Welsh!
  70. 70. • All of the tasks below are possible to varying extents. Pioneered on journal abstracts and journal full text. – Named entity recognition: what is this about? Where are the places mentioned? Who are the people? – Clustering and classification: which other dissertations are like this one? What genres of dissertations are there? – Event extraction: what processes (chemical reactions, gene expression) occur? What are the participants? – Citation analysis: who do dissertations cite? – What sentiments towards the citations do authors express? Dissertation analysis
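As one illustration of the "which other dissertations are like this one?" task, the sketch below ranks documents by cosine similarity of simple word counts. It is a deliberately small stand-in for real clustering tools; the three abstracts are invented for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lower-cased alphabetic words in a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_title, corpus):
    """Return the title of the document most similar to the query document."""
    query = bag_of_words(corpus[query_title])
    scores = {
        title: cosine_similarity(query, bag_of_words(text))
        for title, text in corpus.items()
        if title != query_title
    }
    return max(scores, key=scores.get)


# Hypothetical abstracts, invented for illustration.
CORPUS = {
    "thesis_a": "synthesis of carbazole hydrazones by reaction with epichlorohydrin",
    "thesis_b": "reaction of hydrazones with epichlorohydrin under basic conditions",
    "thesis_c": "stylometric analysis of authorship in medieval welsh poetry",
}

print(most_similar("thesis_a", CORPUS))
```

Raw word counts are a crude choice; weighting schemes and domain dictionaries would normally be layered on top, but the overall shape of the task stays the same.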
  71. 71. • Dissertation copyright varies • Institution • Author • Published or not? Copyright issues
  72. 72. • Probably less structured than papers • Not much work has been done here before Dissertation specifics
  73. 73. • In addition to the above, there are different tasks we can perform on scientific publications and dissertations • For example: • Stylometrics (to find out who wrote this) • Language identification • What else?.... Digital Humanities textual analysis tasks
  74. 74. • We would LOVE to bring data out of our archive • What could we do? • Find chemical names and generate structures • Find chemical images and generate structures • Find reactions – and make a database! • Find data (MP, BP, LogP) and host. Build models! • Find figures and database them • Find spectra (and link to structures) • Validate the data algorithmically “Data enable” publications?
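The "find data (MP, BP, LogP)" item can be sketched with a regular expression that pulls melting points out of running text. This is a rough illustration under stated assumptions, not the RSC pipeline: the pattern covers only a few common phrasings in degrees Celsius, and the sample sentence is invented.

```python
import re

# Matches e.g. "mp 134–136 °C", "m.p. 98 °C", "melting point of 210 °C".
MP_PATTERN = re.compile(
    r"\b(?:m\.?p\.?|melting point)(?:\s+of)?\s*[:=]?\s*"
    r"(\d+(?:\.\d+)?)"                  # single value or lower end of a range
    r"(?:\s*[-–]\s*(\d+(?:\.\d+)?))?"   # optional upper end of a range
    r"\s*°\s*C",
    re.IGNORECASE,
)


def extract_melting_points(text):
    """Return a list of (low, high) melting points in °C found in the text."""
    results = []
    for match in MP_PATTERN.finditer(text):
        low = float(match.group(1))
        high = float(match.group(2)) if match.group(2) else low
        results.append((low, high))
    return results


# Hypothetical experimental text, invented for illustration.
SAMPLE = "White solid, mp 134–136 °C. The salt had a melting point of 210 °C."

print(extract_melting_points(SAMPLE))
```

Values extracted this way still need linking to the right compound and validating, which is where most of the real effort goes.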
  75. 75. RSC Archive – since 1841
  76. 76. Text Mining The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue
  77. 77. Text Mining The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue
  78. 78. • 13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC) Text spectra?
  79. 79. 1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
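Text spectra like the one on slide 79 are regular enough to parse. The sketch below extracts each peak's shift, multiplicity and hydrogen count with a regular expression and sums the hydrogens as a basic sanity check. The pattern is an assumption tuned to this one example and would need extending for other reporting styles.

```python
import re

# Spectrum text copied from slide 79.
SPECTRUM = (
    "1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), "
    "4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), "
    "4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), "
    "6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)"
)

PEAK = re.compile(
    r"(\d+\.\d+)"             # chemical shift in ppm (or lower end of a range)
    r"(?:[–-](\d+\.\d+))?"    # optional upper end of a range
    r"\s*\((\w+),\s*(\d+)H"   # "(multiplicity, nH"
)


def parse_peaks(text):
    """Return a list of (low_ppm, high_ppm, multiplicity, hydrogen_count) tuples."""
    peaks = []
    for match in PEAK.finditer(text):
        low = float(match.group(1))
        high = float(match.group(2)) if match.group(2) else low
        peaks.append((low, high, match.group(3), int(match.group(4))))
    return peaks


peaks = parse_peaks(SPECTRUM)
for peak in peaks:
    print(peak)
print("total H:", sum(peak[3] for peak in peaks))
```

Coupling constants such as "J = 4.8 Hz" are skipped because they are not followed by an opening bracket. Comparing the hydrogen total against the molecular formula is the kind of test an experimental data checker can automate.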
  80. 80. Turn “Figures” Into Data
  81. 81. Make it interactive
  82. 82. SO MANY reactions!
  83. 83. Reactions From Patents
  84. 84. Experimental data checker
  85. 85. • Tools to try: ChemicalTagger
  86. 86. Tools to try: ChemicalTagger
  87. 87. • ChemicalTagger Tools to try
  88. 88. How is DERA going? • We have text-mined all 21st century articles… >100k articles from 2000-2013 • Marked up with XML and published onto the HTML forms of the articles • Required multiple iterations based on dictionaries, markup, text mining iterations • New visualization tools in development – not just chemical names. Add chemical and biomedical terms markup also!
  89. 89. Work in Progress
  90. 90. Work in Progress
  91. 91. Work in Progress
  92. 92. Work in Progress
  93. 93. Dictionary (ontologies) RSC ontologies (methods, reactions) Dictionary (chemistry) Text-mining Curated dictionaries for known names ACD N2S OPSIN Unknown names: automated name to structure conversion XML ready for publication Marked-up XML Production processes CDX integration (coming soon) Chemical structures SD file Is It Easy?
  94. 94. Our Supporting Ontologies
  95. 95. • The ‘National Compound Collection’ • Extracting compounds manually from theses • 700 theses, 44,000 compounds (growing…) • 4 months, 12 UK institutions • Deposited into ChemSpider A pilot examining theses
  96. 96. • Screening for interesting drug candidates • Mapping the chain from author to institution to data to industry • British Library involved (EThOS collection) • Build a business model for this Pilot objectives
  97. 97. • Funders encouraging submission from new dissertations • Mining of old collections (mostly automated, likely to need manual QA) • Extension to other areas of chemical science …and future (ideal)
  98. 98. • Don’t reinvent the wheel • Research your domain to find work already underway and test tools for value/utility In your domain??? Most Domains are Active
  99. 99. A good place to start
  100. 100. • NaCTeM tools for e.g. sentiment analysis Tools to try
  101. 101. • NaCTeM tools for e.g. sentiment analysis Tools to try
  102. 102. There is always something new
  103. 103. Email: ORCID: 0000-0002-2668-4821 Twitter: @ChemConnector Personal Blog: SLIDES: Thank you