Data Enhancing the
RSC Archive
Colin Batchelor, Ken Karapetyan, Alexey
Pshenichov, Dave Sharpe, Jon Steele, Valery
Tkachenko and Antony Williams
ACS New Orleans April 2013
Overview
• The big picture
• Where we’ve been
• Statistics as well as semantics
• New directions in experimental data
• Where we’re going
The big picture
We have journal articles going back to 1841 and the
aim is to extract:
• Every small molecule we can (graphics and text)
• Reactions
• Spectra
• Data in tables
and classify every paper in a way that makes sense
to the reader.
Background
• RSC Publishing moved to an all-XML workflow
at the turn of the millennium.
• We digitized the backfile (to 1841) in 2005.
• We launched Project Prospect in 2007.
• We acquired ChemSpider in 2009.
RSC Advances
New high-volume journal covering all of chemistry
launched in 2011.
Need a sensible way of navigating all this.
http://www.rsc.org/advances
http://www.rsc.org/RSCAdvancesSubjects
Strategy
• Use topic modelling: latent Dirichlet allocation (LDA)
and Gibbs sampling to determine a set of “true” topics
Thomas L. Griffiths and Mark Steyvers, “Finding scientific topics”, Proc. Natl. Acad. Sci. USA, 2004, 101, 5228–5235.
• Publishing expertise gives us 12 broad subjects that
will be intuitive to users
• Merge first set to form second
• Tweak
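The first bullet can be made concrete with a minimal collapsed Gibbs sampler for LDA, in the spirit of Griffiths and Steyvers. This is a sketch only, not the code used for RSC Advances: the toy corpus, the function name `lda_gibbs` and the hyperparameter values `alpha` and `beta` are assumptions chosen for illustration.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on a list of tokenised documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]       # tokens in doc d given topic k
    topic_word = [dict() for _ in range(n_topics)]   # times word w is given topic k
    topic_total = [0] * n_topics                     # tokens given topic k overall
    assignments = []

    def update(d, w, k, delta):
        doc_topic[d][k] += delta
        topic_word[k][w] = topic_word[k].get(w, 0) + delta
        topic_total[k] += delta

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(n_topics) for _ in doc]
        for w, k in zip(doc, zs):
            update(d, w, k, +1)
        assignments.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token, then resample its topic from the
                # conditional distribution given all other assignments.
                update(d, w, assignments[d][i], -1)
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k].get(w, 0) + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                k_new = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k_new
                update(d, w, k_new, +1)
    return doc_topic, topic_word


# Hypothetical four-document corpus, already tokenised.
docs = [
    "rate constant kinetics rate activation".split(),
    "ligand complex metal ligand coordination".split(),
    "kinetics rate constant activation energy".split(),
    "metal complex coordination ligand metal".split(),
]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2)
```

Each row of `doc_topic` is a per-paper topic count; the most frequent words in each `topic_word` entry are what a word cloud of that topic would display.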
Classify that classification
Generated 128 topics from the articles published in 2009 and
2010 (> 20,000 papers).
Generated Wordle images (www.wordle.net) of
the topics for internal staff.
Classify that classification: results
7 topics (75, 57, 65, 67, 82, 113, 123) were
rejected for being nonsense.
1 topic (127) was rejected for being too general.
120 topics were classified under the 12 headings
and given names.
Examples…
Examples
1: “kinetics” → Physical
2: “coordination complexes” → Inorganic
3: “general materials” → Materials
4: “misc. organic” → Organic
5: “bacteria” → Biological + Food and health
6: “theoretical” → Physical
7: “cells” → Bio
8: “water and solution chemistry” → Physical
9: “gels” → Materials
10: “inorganic material properties” → Physical + Inorganic + Materials
11: “general organic” → Organic
12: “coordination chemistry” → Inorganic
13: “photochemistry” → Inorganic + Materials + Energy
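The merge step can be sketched as a lookup from topic number to subject headings, applied to a paper's topic mixture. The mapping below copies the thirteen examples above and the eight rejected topic numbers from the results slide; the function `subjects_for_paper`, the `min_share` threshold and the example paper are hypothetical, since the slides do not say how topic weights were turned into subject labels.

```python
# Topic number -> subject headings, copied from the examples above.
TOPIC_SUBJECTS = {
    1: ["Physical"],
    2: ["Inorganic"],
    3: ["Materials"],
    4: ["Organic"],
    5: ["Biological", "Food and health"],
    6: ["Physical"],
    7: ["Bio"],  # abbreviated as on the slide
    8: ["Physical"],
    9: ["Materials"],
    10: ["Physical", "Inorganic", "Materials"],
    11: ["Organic"],
    12: ["Inorganic"],
    13: ["Inorganic", "Materials", "Energy"],
}

# Topics rejected as nonsense or as too general.
REJECTED = {57, 65, 67, 75, 82, 113, 123, 127}


def subjects_for_paper(topic_weights, min_share=0.2):
    """Return the subjects that receive at least min_share of the kept weight."""
    scores = {}
    kept = 0.0
    for topic, weight in topic_weights.items():
        if topic in REJECTED or topic not in TOPIC_SUBJECTS:
            continue
        kept += weight
        for subject in TOPIC_SUBJECTS[topic]:
            scores[subject] = scores.get(subject, 0.0) + weight
    if kept == 0:
        return []
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [subject for subject, score in ranked if score / kept >= min_share]


# Hypothetical paper: mostly topic 1, some topic 10, some of the rejected topic 127.
paper = {1: 0.5, 10: 0.3, 127: 0.2}
broad = subjects_for_paper(paper)
narrow = subjects_for_paper(paper, min_share=0.5)
```

A low threshold lets a paper appear under several headings, which matches the multi-heading examples such as topic 10; a higher one keeps only the dominant subject.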
“Very useful!”
“… will make it easier for readers to identify papers which might be interesting to them.”
“Superb!”
What now?
Shortly rolling out the subject classification to
other general journals:
• Chemical Communications
• Chemical Science
• Journal of Materials Chemistry A, B and C
• New Journal of Chemistry
Beyond Prospect: further steps in text-mining
Migration to Oscar 4
https://bitbucket.org/wwmm/oscar4/wiki/Home
Multiple name-to-structure engines
OPSIN, ACD/Labs, Lexichem
ACD/Labs Dictionary
Better disambiguation
Parallelization with Hadoop
Structure validation and standardization (see later)
Reaction extraction from text (see later)
On an experimental run with names from Organic and Biomolecular Chemistry:
is any structure returned at all by a given name-to-structure (n2s) engine?
Lexichem = a (2798)
ACD = b (3049)
OPSIN = c (3309)
Structure disagreements
Out of 2588 names where at least one of the engines differed or didn’t return a result:
A = ACD (1538 in total)
B = Lexichem (1301 in total)
C = OPSIN (2097 in total)
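Counts like these can be produced by a small comparison routine once each engine's output has been reduced to a comparable structure key. The sketch below does not call any of the engines: the `results` dictionary, the placeholder keys `S1` to `S4` and the function `compare_engines` are all hypothetical stand-ins.

```python
from collections import Counter


def compare_engines(results):
    """Summarise per-name engine output.

    results maps each chemical name to {engine: structure key, or None
    when the engine returned nothing}.
    """
    coverage = Counter()   # names for which each engine returned something
    regions = Counter()    # Venn regions: which set of engines answered
    disagreements = []     # names with a missing or differing result
    for name, by_engine in results.items():
        answered = {e: s for e, s in by_engine.items() if s is not None}
        coverage.update(answered.keys())
        regions[frozenset(answered)] += 1
        if len(answered) < len(by_engine) or len(set(answered.values())) > 1:
            disagreements.append(name)
    return coverage, regions, disagreements


# Placeholder data; S1..S4 stand in for canonical structure identifiers.
results = {
    "name-1": {"OPSIN": "S1", "ACD": "S1", "Lexichem": "S1"},
    "name-2": {"OPSIN": "S2", "ACD": "S2", "Lexichem": None},
    "name-3": {"OPSIN": "S3", "ACD": "S4", "Lexichem": "S3"},
    "name-4": {"OPSIN": None, "ACD": None, "Lexichem": None},
}
coverage, regions, disagreements = compare_engines(results)
```

Comparing on a canonical key rather than on raw output matters here, because two engines can draw the same structure differently.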
Iterations
With the Hadoop cluster, we can mine
thousands of articles a night.
We’re initially iterating over the material back to
2000, for which we have native XML. Then it’s a
case of going back and testing out the OCRed
material.
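The slides do not describe how the Hadoop jobs are organised, so the following is only an assumed map/reduce decomposition, run in-process for illustration: the mapper emits a (name, article) pair for every dictionary hit and the reducer collects the articles per name. The dictionary `KNOWN_NAMES`, the article texts and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical dictionary of chemical names to look for.
KNOWN_NAMES = {"benzene", "toluene", "pyridine"}


def mapper(article_id, text):
    """Emit (name, article_id) for every dictionary name found in the text."""
    for token in text.lower().split():
        token = token.strip(".,;()")
        if token in KNOWN_NAMES:
            yield token, article_id


def reducer(name, article_ids):
    """Collapse all hits for one name into a sorted list of distinct articles."""
    return name, sorted(set(article_ids))


def run_job(articles):
    """Local stand-in for the shuffle phase: group mapper output by key."""
    grouped = defaultdict(list)
    for article_id, text in articles.items():
        for key, value in mapper(article_id, text):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


articles = {
    "a1": "Benzene and toluene were distilled.",
    "a2": "Pyridine (dry) was added to benzene.",
}
index = run_job(articles)
```

Because each article is mapped independently, the same job reruns unchanged over the OCRed backfile once the native-XML material has been processed.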
http://cv.beta.rsc-us.org/
This is the beta site for
• Extracting chemical structures from
ChemDraw files
• Most importantly: structure validation and
standardization
We will be using this for all of the extracted
structures.
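As a flavour of what a validation rule can look like, the sketch below flags atoms whose bond orders add up to more than a typical valence. It is a hypothetical rule, not one taken from the platform: the `MAX_VALENCE` table covers only a few neutral elements, hydrogens are assumed implicit, and charges and standardization are ignored.

```python
# Typical maximum valences for a few neutral atoms (illustrative only).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1}


def overvalent_atoms(atoms, bonds):
    """Return (index, element, valence used) for atoms exceeding MAX_VALENCE.

    atoms is a list of element symbols; bonds is a list of
    (atom index, atom index, bond order) tuples.
    """
    used = [0] * len(atoms)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order
    return [
        (index, element, used[index])
        for index, element in enumerate(atoms)
        if element in MAX_VALENCE and used[index] > MAX_VALENCE[element]
    ]


# Hypothetical extracted structure: a carbon with a double bond to O
# and a triple bond to N, i.e. five bonds on one carbon.
atoms = ["C", "O", "N"]
bonds = [(0, 1, 2), (0, 2, 3)]
problems = overvalent_atoms(atoms, bonds)
```

A structure that trips a rule like this would be held back for curation rather than deposited.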
Reaction extraction from text
We have had some preliminary experience of this through the
ChemicalTagger work of Daniel Lowe (NextMove, formerly Cambridge).
The extracted reactions are to go into ChemSpider Reactions:
http://csr.dev.rsc-us.org/
Experimental data
We’ve already seen the possibilities for
extracting data from organic experimental
sections, but what about other sorts of data?
Given chemical structures and extracted data we
may be able to start building models and making
them available.
New directions in experimental data (1)
We are working with William Brouwer (Penn
State) to extract data from graphs.
Obviously this is faute de mieux and we’d rather
have the original data, but we’re giving a flavour
of what might be possible.
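One step any graph digitization needs is converting pixel positions on the plotted curve into data values using the axis ticks. The sketch below shows only that linear calibration; it is not the method used in the collaboration, and the pixel positions, tick values and the name `make_axis_map` are invented for illustration.

```python
def make_axis_map(pixel_a, value_a, pixel_b, value_b):
    """Build a linear pixel-to-value map from two labelled axis ticks."""
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale


# Hypothetical spectrum image: the x axis runs from 4000 down to 400
# (as wavenumber axes often do), and pixel rows increase downwards.
x_map = make_axis_map(100, 4000.0, 900, 400.0)
y_map = make_axis_map(500, 0.0, 100, 100.0)

# Pixel coordinates traced along the curve (invented values).
curve_pixels = [(100, 500), (500, 300), (900, 100)]
points = [(x_map(px), y_map(py)) for px, py in curve_pixels]
```

A logarithmic axis would need the same two-tick calibration applied to the logarithm of the values instead.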
Recent Work
Digitized Spectrum
Comparison of Spectra
And now on ChemSpider…
New directions in experimental data (2)
Dye solar cell data is every bit as systematic as
organic experimental sections.
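Because the same four quantities are reported again and again, a few patterns plus a consistency check go a long way. In the sketch below the sentence, the regular expressions and the tolerance are hypothetical; the check uses the standard relation efficiency = Jsc × Voc × FF / P_in and assumes illumination at 100 mW cm-2.

```python
import re

NUMBER = r"(\d+(?:\.\d+)?)"
PATTERNS = {
    "jsc_mA_cm2": rf"Jsc\s*=\s*{NUMBER}\s*mA",
    "voc_V": rf"Voc\s*=\s*{NUMBER}\s*V",
    "ff": rf"FF\s*=\s*{NUMBER}",
    "eta_percent": rf"efficiency of\s*{NUMBER}\s*%",
}


def extract_cell_data(sentence):
    """Pull out whichever of the four quantities the sentence reports."""
    data = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, sentence)
        if match:
            data[key] = float(match.group(1))
    return data


def computed_efficiency(data, p_in_mW_cm2=100.0):
    """Efficiency in percent from Jsc (mA cm-2), Voc (V) and fill factor."""
    return data["jsc_mA_cm2"] * data["voc_V"] * data["ff"] / p_in_mW_cm2 * 100.0


# Invented example sentence in the style of a results section.
sentence = (
    "The device gave Jsc = 15.2 mA cm-2, Voc = 0.72 V, "
    "FF = 0.68 and an efficiency of 7.4%."
)
data = extract_cell_data(sentence)
eta = computed_efficiency(data)
is_consistent = abs(eta - data["eta_percent"]) < 0.1
```

Values that fail the check are natural candidates for the human curation step.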
Human curation of results
Previously: built into partly-manual annotation
workflow.
Currently: macro-scale, iterative.
Coming: Challenger
DERA
• DERA will unveil from our archive
– Chemicals
– Reactions
– Figures
– Spectra/Analytical Data
– Property Data
– And yes… it will need curation and filtering!
