Digitally enabling the RSC archive

The Royal Society of Chemistry has an archive of published journals and books stretching back to 1841. In the past decade we have digitized this archive and semantically enriched our frontfile data with chemical structures linked to our free online chemical compound database, ChemSpider. In this talk we will survey our recent efforts to extract all kinds of data – chemical structures, experimental and bibliographic data – from both our backfile and frontfile. We will also discuss our future work to extract chemical reactions to host in our ChemSpider Reactions database and will discuss the potential applications of optical structure recognition technologies for converting structure images to structures as well as using similar techniques to convert experimental spectral data into interactive data formats. A key aspect of this project is the delivery of a crowdsourcing platform for the interactive annotation and validation of the extracted data.

Usage Rights

CC Attribution-NonCommercial License

    Digitally enabling the RSC archive: Presentation Transcript

    • Data Enhancing the RSC Archive. Colin Batchelor, Ken Karapetyan, Alexey Pshenichov, Dave Sharpe, Jon Steele, Valery Tkachenko and Antony Williams. ACS New Orleans, April 2013
    • Overview
        • The big picture
        • Where we’ve been
        • Statistics as well as semantics
        • New directions in experimental data
        • Where we’re going
    • The big picture. We have journal articles going back to 1841 and the aim is to extract:
        • Every small molecule we can (graphics and text)
        • Reactions
        • Spectra
        • Data in tables
      and classify every paper in a way that makes sense to the reader.
    • Background
        • RSC Publishing moved to an all-XML workflow at the turn of the millennium.
        • We digitized the backfile (to 1841) in 2005.
        • We launched Project Prospect in 2007.
        • We acquired ChemSpider in 2009.
    • RSC Advances. New high-volume journal covering all of chemistry, launched in 2011. Need a sensible way of navigating all this. http://www.rsc.org/advances http://www.rsc.org/RSCAdvancesSubjects
    • Strategy
        • Use topic modelling: latent Dirichlet allocation (LDA) and Gibbs sampling to determine a set of “true” topics. Thomas L. Griffiths and Mark Steyvers, “Finding scientific topics”, Proc. Natl. Acad. Sci. USA, 2004, 101, 5228–5235.
        • Publishing expertise gives us 12 broad subjects that will be intuitive to users
        • Merge first set to form second
        • Tweak
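The topic-modelling step on the Strategy slide can be illustrated with a minimal collapsed Gibbs sampler for LDA, in the spirit of the Griffiths and Steyvers paper cited there. This is only a sketch under stated assumptions: the function name `lda_gibbs`, the hyperparameter values `alpha` and `beta`, and the toy documents are all hypothetical, and nothing here is taken from the RSC pipeline itself.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents (illustrative only)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Start from a random topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                # Take the token out of the counts before resampling it.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # P(z = k | everything else) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


# Hypothetical toy corpus; a real run would use tokenized article text.
docs = [["acid", "base", "acid"], ["cell", "protein", "cell"], ["acid", "cell"]]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2, n_iter=50)
for k, words in enumerate(topic_word):
    print(k, [w for w, n in words.most_common(3) if n > 0])
```

The most frequent words per topic are what a human reviewer would inspect (for example as word clouds) before naming or rejecting a topic.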
    • Classify that classification. Generated 128 topics based on 2009 and 2010’s articles (> 20000 papers). Generated Wordle images (www.wordle.net) of the topics for internal staff.
    • Classify that classification: results. 7 topics (75, 57, 65, 67, 82, 113, 123) were rejected for being nonsense. 1 topic (127) was rejected for being too general. 120 topics were classified under the 12 headings and given names. Examples…
    • Examples
        1: “kinetics” → Physical
        2: “coordination complexes” → Inorganic
        3: “general materials” → Materials
        4: “misc. organic” → Organic
        5: “bacteria” → Biological + Food and health
        6: “theoretical” → Physical
        7: “cells” → Bio
        8: “water and solution chemistry” → Physical
        9: “gels” → Materials
        10: “inorganic material properties” → Physical + Inorganic + Materials
        11: “general organic” → Organic
        12: “coordination chemistry” → Inorganic
        13: “photochemistry” → Inorganic + Materials + Energy
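One way the "merge first set to form second" step could work is sketched below: an article's topic weights are rolled up into the broad subject headings, skipping the rejected topics. The topic-to-subject entries and the rejected topic numbers come from the slides above; the function name `subjects_for`, the `threshold` value and the example weights are assumptions made purely for illustration.

```python
# A few topic -> subject mappings taken from the Examples slide.
TOPIC_SUBJECTS = {
    1: ["Physical"],
    2: ["Inorganic"],
    5: ["Biological", "Food and health"],
    10: ["Physical", "Inorganic", "Materials"],
    13: ["Inorganic", "Materials", "Energy"],
}
# Topics rejected as nonsense or as too general (from the results slide).
REJECTED = {57, 65, 67, 75, 82, 113, 123, 127}


def subjects_for(topic_weights, threshold=0.2):
    """Return subject headings for one article, strongest first.

    topic_weights maps topic number -> weight in the article.
    The threshold is a hypothetical cut-off, not a documented value.
    """
    scores = {}
    for topic, weight in topic_weights.items():
        if topic in REJECTED or weight < threshold:
            continue
        for subject in TOPIC_SUBJECTS.get(topic, []):
            scores[subject] = scores.get(subject, 0.0) + weight
    # Sort by descending score, then alphabetically to break ties.
    return sorted(scores, key=lambda s: (-scores[s], s))


# Hypothetical article: mostly kinetics, some inorganic material properties.
print(subjects_for({1: 0.5, 10: 0.3, 127: 0.2}))
```

Because one topic can feed several headings, an article can end up under more than one subject, which matches the multi-heading examples on the slide.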
    • “Very useful!” “Superb!” “… will make it easier for readers to identify papers which might be interesting to them.”
    • What now? Shortly rolling out the subject classification to other general journals:
        • Chemical Communications
        • Chemical Science
        • Journal of Materials Chemistry A, B and C
        • New Journal of Chemistry
    • Beyond Prospect: further steps in text-mining
        • Migration to Oscar 4: https://bitbucket.org/wwmm/oscar4/wiki/Home
        • Multiple name to structure engines: OPSIN, ACD/Labs, Lexichem
        • ACD/Labs Dictionary
        • Better disambiguation
        • Parallelization with Hadoop
        • Structure validation and standardization (see later)
        • Reaction extraction from text (see later)
    • On an experimental run with names from Organic and Biomolecular Chemistry: is any structure returned at all by a given n2s engine? Lexichem = a (2798), ACD = b (3049), OPSIN = c (3309)
    • Structure disagreements. Out of 2588 names where at least one of the engines differed or didn’t return a result: A = ACD (1538 in total), B = Lexichem (1301 in total), C = OPSIN (2097 in total)
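The two comparisons on these slides (does an engine return anything, and do the engines agree) can be tallied along the following lines. This is a sketch only: the function name `compare_engines`, the input layout and the placeholder structure keys are hypothetical, and in practice the keys would need to be some canonical structure identifier so that equal structures compare equal.

```python
from collections import Counter


def compare_engines(results):
    """Tally name-to-structure results.

    results maps each chemical name to {engine: structure key or None},
    where None means the engine returned no structure.
    """
    returned = Counter()
    disputed = []
    for name, by_engine in results.items():
        for engine, key in by_engine.items():
            if key is not None:
                returned[engine] += 1
        keys = set(by_engine.values())
        # Disputed: some engine failed, or the engines gave different structures.
        if None in keys or len(keys) > 1:
            disputed.append(name)
    return returned, disputed


# Placeholder data; "key-1" etc. stand in for canonical structure identifiers.
results = {
    "name one": {"OPSIN": "key-1", "ACD": "key-1", "Lexichem": "key-1"},
    "name two": {"OPSIN": "key-2", "ACD": "key-3", "Lexichem": None},
    "name three": {"OPSIN": "key-4", "ACD": None, "Lexichem": None},
}
returned, disputed = compare_engines(results)
print(dict(returned), disputed)
```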
    • Iterations. With the Hadoop cluster, we can mine thousands of articles a night. We’re initially iterating over the material back to 2000, for which we have native XML. Then it’s a case of going back and testing out the OCRed material.
    • http://cv.beta.rsc-us.org/ This is the beta site for:
        • Extracting chemical structures from ChemDraw files
        • Most importantly: structure validation and standardization
      We will be using this for all of the extracted structures.
    • Reaction extraction from text. We have had some preliminary experience of this with Daniel Lowe (NextMove, formerly Cambridge)’s ChemicalTagger work. To go to ChemSpider Reactions: http://csr.dev.rsc-us.org/
    • Experimental data. We’ve already seen the possibilities for extracting data from organic experimental sections, but what about other sorts of data? Given chemical structures and extracted data we may be able to start building models and making them available.
    • New directions in experimental data (1). We are working with William Brouwer (Penn State) to extract data from graphs. Obviously this is faute de mieux and we’d rather have the original data, but we’re giving a flavour of what might be possible.
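One basic ingredient of extracting data from a graph image is converting pixel positions along a traced curve into data values, using two known tick marks on each axis. The sketch below shows only that calibration step for linear axes; it is not the method used in the work described on the slide, and the function name `make_axis_map` and all pixel and axis values are hypothetical.

```python
def make_axis_map(pixel_a, value_a, pixel_b, value_b):
    """Build a pixel -> data-value converter for one linear axis.

    (pixel_a, value_a) and (pixel_b, value_b) are two reference ticks
    whose pixel positions and axis values are both known.
    """
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale


# Hypothetical calibration: an x axis running from 4000 down to 400 and a
# y axis from 100 down to 0, with pixel rows counted from the top of the image.
x_map = make_axis_map(100, 4000.0, 900, 400.0)
y_map = make_axis_map(50, 100.0, 450, 0.0)

# Pixel coordinates of points traced along the curve (made up for illustration).
traced = [(100, 50), (500, 250), (900, 450)]
digitized = [(x_map(px), y_map(py)) for px, py in traced]
print(digitized)
```

Logarithmic axes, curve tracing and separating overlapping curves all need more than this, which is part of why the original data is preferable when it exists.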
    • Recent Work
    • Digitized Spectrum
    • Comparison of Spectra
    • And now on ChemSpider…
    • New directions in experimental data (2). Dye solar cell data is every bit as systematic as organic experimental sections.
    • Human curation of results
        • Previously: built into partly-manual annotation workflow.
        • Currently: macro-scale, iterative.
        • Coming: Challenger
    • DERA
        • DERA will unveil from our archive:
            • Chemicals
            • Reactions
            • Figures
            • Spectra/Analytical Data
            • Property Data
        • And yes… it will need curation and filtering!