Text-mining to produce large chemistry
datasets for community access
Valery Tkachenko1, Aileen Day1, Daniel Lowe2, Igor Tetko3, Carlos Coba4, Antony Williams5
1 Royal Society of Chemistry, UK
2 NextMove Software, UK
3 HelmholtzZentrum München, Germany
4 Mestrelab Research, Santiago de Compostela, Spain
5 EPA, US
ACS Fall 2015
Boston, MA
August 17th 2015
ChemSpider
Refs - we live in a linked world
Properties
ChemSpider spectra
Knowledge systems
Datastore
Raw data
Data in process
Data out process
UI, API, Services, etc
RSC Archive – since 1841
Prospecting RSC articles
Further work – properties and spectra mining
Text mining of chemical documents
Term Examples of text matched
FromLiterature “lit.”
MeltingPoint “mpt”, “melting point”, “m.p.”
Qualifier “>”; “approximately”
Value “75° C”, “200° F”, “one hundred degrees Celsius”
Range “184-186° C”, “191.5 to 192.4° C”
MeasurementError “50±° C”
OutcomeQualifier “decomp.”, “with decomposition”, “subl.”
FromLiterature? MeltingPoint Qualifier? (Value | Range | MeasurementError) OutcomeQualifier?
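The grammar above can be approximated with a single regular expression. The following is only a minimal sketch, not the parser used for the work described here: the names `MELTING_POINT` and `parse_melting_point` are hypothetical, only a few of the listed surface forms are covered, and spelled-out values such as “one hundred degrees Celsius” are not handled.

```python
import re
from typing import Dict, Optional

NUM = r"\d+(?:\.\d+)?"   # integer or decimal number

# Hypothetical pattern mirroring:
# FromLiterature? MeltingPoint Qualifier? (Value | Range | MeasurementError) OutcomeQualifier?
MELTING_POINT = re.compile(
    rf"""
    (?:(?P<from_literature>lit\.)\s*)?                  # FromLiterature?
    (?P<label>melting\ point|m\.p\.|mpt)[:\s]*          # MeltingPoint
    (?:(?P<qualifier>>|approximately)\s*)?              # Qualifier?
    (?:
        (?P<low>{NUM})\s*(?:[-–]|to)\s*(?P<high>{NUM})  # Range
      | (?P<center>{NUM})\s*±\s*(?P<error>{NUM})?       # MeasurementError
      | (?P<value>{NUM})                                # Value
    )
    \s*(?P<unit>°\s?[CF])                               # temperature unit
    (?:\s*\(?(?P<outcome>with\ decomposition|decomp\.|subl\.)\)?)?  # OutcomeQualifier?
    """,
    re.VERBOSE | re.IGNORECASE,
)


def parse_melting_point(text: str) -> Optional[Dict[str, str]]:
    """Return the matched grammar terms as strings, or None if nothing matches."""
    match = MELTING_POINT.search(text)
    if match is None:
        return None
    return {name: val for name, val in match.groupdict().items() if val is not None}


if __name__ == "__main__":
    print(parse_melting_point("lit. m.p. 184-186° C (decomp.)"))
    print(parse_melting_point("mpt > 200° F"))
```

Trying the range and error alternatives before the plain value keeps “184-186° C” from being read as the single value 184.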
Why MP?
Used for water solubility prediction
Yalkowsky equation:
logS = 0.5 – 0.01(MP-25) – log Kow
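As a worked illustration of the equation above, the sketch below evaluates it directly. The function name is hypothetical, and it assumes MP is given in °C and that logS and log Kow are base-10 logarithms, as the equation is usually stated.

```python
def log_s_yalkowsky(mp_celsius: float, log_kow: float) -> float:
    """Evaluate logS = 0.5 - 0.01 * (MP - 25) - log Kow, exactly as written above."""
    return 0.5 - 0.01 * (mp_celsius - 25.0) - log_kow


if __name__ == "__main__":
    # Illustrative inputs only, not measured data.
    print(log_s_yalkowsky(mp_celsius=125.0, log_kow=2.0))
```

Because MP enters with a coefficient of 0.01, an error of 10 °C in the melting point shifts the predicted logS by 0.1 log units.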
Detecting suspicious melting points
• Value was greater than 500° C
• Value was a range wider than 50° C
• Value was a range where the second temperature was lower than the first temperature
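The three checks listed above translate into a small filter. This is a sketch under assumptions: the function name is hypothetical, temperatures are taken to be already converted to °C, and applying the 500 °C limit to either end of a range is one possible reading of the first rule.

```python
from typing import Optional


def is_suspicious_mp(first_c: float, second_c: Optional[float] = None) -> bool:
    """Flag a melting point (single value or range, in deg C) using the rules above."""
    if second_c is None:
        return first_c > 500.0          # single value above 500 deg C
    if second_c < first_c:
        return True                     # second temperature lower than the first
    if second_c - first_c > 50.0:
        return True                     # range wider than 50 deg C
    # Assumption: the 500 deg C limit also applies to the ends of a range.
    return second_c > 500.0


if __name__ == "__main__":
    for args in [(184.0, 186.0), (600.0,), (100.0, 200.0), (186.0, 184.0)]:
        print(args, is_suspicious_mp(*args))
```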
300k Melting Point Datasets
Bergström 277
Bradley 2886
OCHEM 22404
Enamine 21883
Patents 228079
Tetko et al., J. Cheminformatics, in preparation
Melting point model: data distribution
Some modeling highlights
LibSVM parameters were selected by grid search (ca. 1.5 years of CPU time spent on optimization)
Largest model (by number of descriptors):
668k descriptors (MolPrint), ~0.2 trillion entries
Biggest model (by size):
618Mb (Dragon descriptors)
Most accurate model: consensus, the average of 5 models
RMSE < 32°C for the drug-like region, MP in [50, 250]°C
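To make the last two points concrete, the sketch below averages the predictions of several models into a consensus and computes the RMSE only for compounds whose observed melting point lies in a given interval. The function names and all numbers are made up for illustration; none of them come from the models described here.

```python
import math
from typing import List, Sequence


def consensus(per_model: Sequence[Sequence[float]]) -> List[float]:
    """Average the predictions of several models, compound by compound."""
    return [sum(values) / len(values) for values in zip(*per_model)]


def rmse_in_range(predicted: Sequence[float], observed: Sequence[float],
                  low: float = 50.0, high: float = 250.0) -> float:
    """RMSE over compounds whose observed value lies within [low, high]."""
    errors = [(p - o) ** 2 for p, o in zip(predicted, observed) if low <= o <= high]
    if not errors:
        raise ValueError("no observations inside the requested range")
    return math.sqrt(sum(errors) / len(errors))


if __name__ == "__main__":
    # Five hypothetical models, three hypothetical compounds (deg C).
    models = [
        [100.0, 200.0, 310.0],
        [110.0, 190.0, 290.0],
        [90.0, 210.0, 300.0],
        [105.0, 195.0, 305.0],
        [95.0, 205.0, 295.0],
    ]
    observed = [103.0, 196.0, 330.0]
    averaged = consensus(models)
    print(averaged)
    print(rmse_in_range(averaged, observed))  # third compound falls outside [50, 250]
```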
Prediction error
NMR data
• Extracted from 1976-2014 USPTO applications
Nucleus Count
H 975543
C 56536
unknown* 44306
F 9429
P 3241
B 91
Si 62
Sn 22
Se 11
N 8
*unknown – the text starts with “NMR:” followed by a peak list, with no nucleus stated
NMR text mining
• We can find and index text spectra: 13C NMR
(CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH,
benzylic methane), 30.77 (CH, benzylic
methane), 66.12 (CH2), 68.49 (CH2), 117.72,
118.19, 120.29, 122.67, 123.37, 125.69, 125.84,
129.03, 130.00, 130.53 (ArCH), 99.42, 123.60,
134.69, 139.23, 147.21, 147.61, 149.41,
152.62, 154.88 (ArC)
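A text spectrum like the one above can be split into nucleus, solvent, frequency and a list of chemical shifts. The following is a minimal sketch with hypothetical names (`parse_nmr`, `split_top_level`), not the extraction code used for the patent data; it assumes the header has the form "<nucleus> NMR (<conditions>):" and that peaks are separated by commas outside parentheses.

```python
import re
from typing import Dict, List, Optional

HEADER = re.compile(
    r"(?P<nucleus>\d{1,3}[A-Z][a-z]?)[\s-]*NMR\s*\((?P<conditions>[^)]*)\)\s*:"
    r"\s*(?:δ\s*=?\s*)?(?P<body>.*)",
    re.DOTALL,
)
FREQUENCY = re.compile(r"(\d+(?:\.\d+)?)\s*MHz", re.IGNORECASE)
SHIFT = re.compile(r"\d+(?:\.\d+)?")


def split_top_level(body: str) -> List[str]:
    """Split on commas that are not inside parentheses."""
    parts, current, depth = [], [], 0
    for char in body:
        if char == "(":
            depth += 1
        elif char == ")":
            depth = max(depth - 1, 0)
        if char == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(char)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


def parse_nmr(text: str) -> Optional[Dict[str, object]]:
    """Extract nucleus, solvent, frequency (MHz) and shifts from a text spectrum."""
    match = HEADER.search(text)
    if match is None:
        return None
    solvent, frequency = None, None
    for part in (p.strip() for p in match["conditions"].split(",")):
        freq = FREQUENCY.fullmatch(part)
        if freq:
            frequency = float(freq.group(1))
        elif part and solvent is None:
            solvent = part
    shifts = []
    for peak in split_top_level(match["body"]):
        shift = SHIFT.match(peak)   # leading number of each peak entry
        if shift:
            shifts.append(float(shift.group()))
    return {"nucleus": match["nucleus"], "solvent": solvent,
            "frequency_mhz": frequency, "shifts": shifts}


if __name__ == "__main__":
    sample = ("13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), "
              "66.12 (CH2), 117.72, 118.19, 130.53 (ArCH), 154.88 (ArC)")
    print(parse_nmr(sample))
```

Splitting only on top-level commas matters because annotations such as “(CH, benzylic methane)” contain commas themselves.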
NMR extracted by year of publication
[Figure: cumulative distinct NMR extracted (axis 0 to 2,500,000) by year of publication, 1976-2014; series: USPTO grants, USPTO applications]
NMR solvents
Solvent Share
CDCl3 48.5%
DMSO-d6 38.3%
CD3OD 8.7%
D2O 1.1%
Acetone-d6 1.0%
MeOD 1.0%
Others 1.4%
Others: CD2Cl2, CD3CN-d3, C6D6, Pyridine-d5, THF-d8, CD3Cl, dimethylformamide-d7, d1-trifluoroacetic acid, methanol-d3, acetic acid-d4, toluene-d8, sulfuric acid-d2, 1,1,2,2-tetrachloroethane-d2, CD3OCD3, dioxane-d8, 1,2-dichloroethane-d4
1H-NMR frequency over time
[Figure: 1H-NMR frequency (axis 0 to 450 MHz) by year of patent filing, 1976-2014]
Mestrelab Mnova NMR
1H NMR (CDCl3, 400 MHz):
δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t,
1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz,
C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane),
30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19,
120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42,
123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
Detecting suspicious NMR spectra
• Last peak of the NMR spectrum is unannotated and:
– All other peaks are annotated
– Spectrum has 1 peak and is proton or unknown NMR
> <SuspiciousValue>
true
> <Value>
1H-NMR (400 MHz, d6-Acetone): 11.8-10.8 (brs, 1H), 7.78
Comments: Only the labile proton is reported in the spectrum. The other aromatic and aliphatic protons are completely missing in the spectrum.
> <SuspiciousValue>
true
> <Value>
1H-NMR (400 MHz, CDCl3): 6.85 (1H, d, J=7.8 Hz), 6.10 (1H, dd, J=7.8 and 2.2 Hz), 6.06 (1H, d, J=2.2 Hz), 4.66 (1H, m), 3.75 (4H, br s), 3.40 (2H, s), 1.97
Comments: There are only 11 protons reported in the spectrum whilst the molecule contains more than 50 protons.
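The truncation heuristic above, illustrated by the two flagged records, can be sketched as follows. This is one possible reading of the rule, not the code that produced the `SuspiciousValue` flags: the function names are hypothetical, a peak counts as annotated when it carries a parenthesised note, and the two sub-conditions are treated as alternatives (several peaks with only the last unannotated, or a single unannotated peak in a proton or unknown spectrum).

```python
import re
from typing import List

NUCLEUS = re.compile(r"(\d{1,3}[A-Z][a-z]?)[\s-]*NMR")


def split_top_level(body: str) -> List[str]:
    """Split on commas that are not inside parentheses."""
    parts, current, depth = [], [], 0
    for char in body:
        if char == "(":
            depth += 1
        elif char == ")":
            depth = max(depth - 1, 0)
        if char == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(char)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


def is_suspicious_nmr(value: str) -> bool:
    """Flag a text spectrum whose final peak looks cut off."""
    header, separator, body = value.partition(":")
    if not separator:                      # no header present, treat all as peaks
        header, body = "", value
    found = NUCLEUS.search(header)
    nucleus = found.group(1) if found else "unknown"
    peaks = split_top_level(body)
    if not peaks:
        return False
    annotated = ["(" in peak for peak in peaks]
    if annotated[-1]:
        return False                       # last peak carries an annotation
    if len(peaks) == 1:
        return nucleus in ("1H", "unknown")
    return all(annotated[:-1])             # only the last peak lacks annotation


if __name__ == "__main__":
    print(is_suspicious_nmr("1H-NMR (400 MHz, d6-Acetone): 11.8-10.8 (brs, 1H), 7.78"))
```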
Knowledge systems
Datastore
Raw data
Data in process
Data out process
UI, API, Services, etc
Synthetic chemistry article
Compounds
Reaction
Analytical Data
Text and References
RSC Databases
RSC Compounds
RSC Reactions
RSC Spectra
RSC Crystals
RSC Polymers
RSC Materials
RSC Assays
RSC Algorithms
RSC Models
…and on…
Input pipeline
Deposition Gateway
Staging databases
Compounds, Reactions, Spectra, Crystals, Materials
Compounds Module, Spectra Module, Reactions Module, Materials Module, Textmining Module
Web UI for unified depositions
DropBox, Google Drive, SkyDrive, etc
ELNs, templated data input
Documents
API, FTP, etc
Raw data
Validated data
All databases are sliced by data sources/data collections and have a simple security model in which each data slice/source is private, public or embargoed
Etc
Experiments
Research
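The per-slice security model mentioned in the input pipeline (each data slice or source is private, public or embargoed) could be represented roughly as below. This is a purely illustrative sketch: the class and field names, the notion of a single owner, and the rule that an embargoed slice becomes readable on its embargo date are all assumptions, not details of the RSC repository.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Visibility(Enum):
    PRIVATE = "private"
    PUBLIC = "public"
    EMBARGOED = "embargoed"


@dataclass(frozen=True)
class DataSlice:
    source: str                           # data source / collection name
    visibility: Visibility
    owner: str                            # assumption: one owning user per slice
    embargo_until: Optional[date] = None  # assumption: embargo ends on a date


def can_read(data_slice: DataSlice, user: str, today: date) -> bool:
    """Decide whether `user` may read records from `data_slice` on `today`."""
    if data_slice.visibility is Visibility.PUBLIC:
        return True
    if user == data_slice.owner:
        return True
    if (data_slice.visibility is Visibility.EMBARGOED
            and data_slice.embargo_until is not None):
        return today >= data_slice.embargo_until
    return False


if __name__ == "__main__":
    embargoed = DataSlice("example-collection", Visibility.EMBARGOED,
                          owner="alice", embargo_until=date(2016, 1, 1))
    print(can_read(embargoed, "bob", date(2015, 8, 17)))
    print(can_read(embargoed, "bob", date(2016, 1, 1)))
```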
Output pipeline
Data layer: Compounds, Reactions, Spectra, Crystals, Documents
Data access layer: Compounds API, Reactions API, Spectra API, Crystals API, Documents API
User interface widgets layer: Compounds Widgets, Reactions Widgets, Spectra Widgets, Crystals Widgets, Documents Widgets
User interface layer (examples): Analytical Laboratory application, Electronic Laboratory Notebook, Chemical Inventory application, paid 3rd party integrations (various platforms – SharePoint, Google, etc)
ChemSpider 2.0
Cross-database links
Compounds domain
Data quality issue and CVSP
– Robochemistry
– Proliferation of errors in public and
private databases
• ChemSpider
• PubChem
• DrugBank
• KEGG
• ChEBI/ChEMBL
– Automated quality control system
Chemistry Validation and Standardization Platform
Reactions domain
Reactions domain
Analytical data domain
Crystallography domain
3D printable structures
New Repository Architecture
doi: 10.1007/s10822-014-9784-5
Thank you
Email: tkachenkov@rsc.org
Slides:
http://www.slideshare.net/valerytkachenko16

 
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptxBREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
RASHMI M G
 
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdfTopic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
TinyAnderson
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
RenuJangid3
 

Recently uploaded (20)

Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
 
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxThe use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
 
Richard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlandsRichard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlands
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
 
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốtmô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
 
bordetella pertussis.................................ppt
bordetella pertussis.................................pptbordetella pertussis.................................ppt
bordetella pertussis.................................ppt
 
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptxANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
 
Red blood cells- genesis-maturation.pptx
Red blood cells- genesis-maturation.pptxRed blood cells- genesis-maturation.pptx
Red blood cells- genesis-maturation.pptx
 
Lateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensiveLateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensive
 
Anemia_ types_clinical significance.pptx
Anemia_ types_clinical significance.pptxAnemia_ types_clinical significance.pptx
Anemia_ types_clinical significance.pptx
 
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdfDMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
 
Nucleophilic Addition of carbonyl compounds.pptx
Nucleophilic Addition of carbonyl  compounds.pptxNucleophilic Addition of carbonyl  compounds.pptx
Nucleophilic Addition of carbonyl compounds.pptx
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
 
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptxBREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
 
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdfTopic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
 

Text mining to produce large chemistry datasets for community access

  • 1. Text-mining to produce large chemistry datasets for community access Valery Tkachenko1, Aileen Day1, Daniel Lowe2, Igor Tetko3, Carlos Coba4 , Antony Williams5 1 Royal Society of Chemistry, UK 2 NextMove Software, UK 3 HelmholtzZentrum München, Germany 4 Mestrelab Research, Santiago de Compostela, Spain 5 EPA, US ACS Fall 2015 Boston, MA August 17th 2015
  • 4. Refs - we live in linked world
  • 7. Knowledge systems Datastore Raw data Data in process Data out process UI, API, Services, etc
  • 8. RSC Archive – since 1841
  • 10. Further work – properties and spectra mining
  • 11. Text mining of the chemical documents
      Term: examples of text matched
      FromLiterature: “lit.”
      MeltingPoint: “mpt”, “melting point”, “m.p.”
      Qualifier: “>”; “approximately”
      Value: “75° C”, “200° F”, “one hundred degrees Celsius”
      Range: “184-186° C”, “191.5 to 192.4° C”
      MeasurementError: “50±° C”
      OutcomeQualifier: “decomp.”, “with decomposition”, “subl.”
      Pattern: FromLiterature? MeltingPoint Qualifier? (Value | Range | MeasurementError) OutcomeQualifier?
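
The term pattern on slide 11 can be approximated with a regular expression. The sketch below is only an illustration, not the grammar actually used for the patent mining: the names `MP_PATTERN` and `parse_melting_point` are hypothetical, spelled-out numbers and the MeasurementError form are left out, and only the literal examples listed on the slide are recognised.

```python
import re

# A decimal number such as "75" or "191.5".
NUMBER = r"\d+(?:\.\d+)?"

# FromLiterature? MeltingPoint Qualifier? (Value | Range) OutcomeQualifier?
MP_PATTERN = re.compile(
    rf"""
    (?P<from_lit>lit\.\s*)?                      # FromLiterature?
    (?:melting\ point|m\.p\.|mpt)\s*:?\s*        # MeltingPoint
    (?P<qualifier>>|approximately)?\s*           # Qualifier?
    (?:
        (?P<low>{NUMBER})\s*(?:-|to)\s*(?P<high>{NUMBER})   # Range
      | (?P<value>{NUMBER})                                 # Value
    )
    \s*°\s?(?P<unit>[CF])                        # temperature unit
    (?:\s*\(?(?P<outcome>decomp\.|with\ decomposition|subl\.))?  # OutcomeQualifier?
    """,
    re.IGNORECASE | re.VERBOSE,
)


def parse_melting_point(text):
    """Return a dict describing the first melting point found, or None."""
    m = MP_PATTERN.search(text)
    if m is None:
        return None
    to_float = lambda s: float(s) if s is not None else None
    return {
        "from_literature": m.group("from_lit") is not None,
        "qualifier": m.group("qualifier"),
        "value": to_float(m.group("value")),
        "low": to_float(m.group("low")),
        "high": to_float(m.group("high")),
        "unit": m.group("unit").upper(),
        "outcome": m.group("outcome"),
    }
```

For example, `parse_melting_point("m.p. 184-186° C")` yields a range with `low` 184.0 and `high` 186.0, while text without a melting point label returns `None`.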
  • 12. Why MP?
      Used for water solubility prediction
      Yalkowsky equation: logS = 0.5 – 0.01(MP-25) – log Kow
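
As a worked example of the equation on slide 12, the sketch below evaluates it directly. The function name `log_s` is hypothetical, and MP is assumed to be in °C (suggested by the 25 offset, not stated on the slide).

```python
def log_s(mp_celsius: float, log_kow: float) -> float:
    """Yalkowsky equation as given on the slide:
    logS = 0.5 - 0.01 * (MP - 25) - log Kow
    """
    return 0.5 - 0.01 * (mp_celsius - 25.0) - log_kow


# A compound melting at 125 °C with log Kow = 2:
# 0.5 - 0.01 * 100 - 2 = -2.5
print(log_s(125.0, 2.0))
```

Every 100 °C of melting point lowers the predicted logS by one unit, which is why errors in mined melting points feed straight into solubility predictions.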
  • 13. Detecting suspicious melting points
      • Value was greater than 500° C
      • Value was a range wider than 50° C
      • Value was a range where the second temperature was lower than the first temperature
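
The three checks on slide 13 translate into a small filter. This is a minimal sketch with a hypothetical function name; it assumes values are already in °C and, as a further assumption, applies the 500 °C limit to either end of a range.

```python
from typing import Optional


def is_suspicious_mp(low_c: float, high_c: Optional[float] = None) -> bool:
    """Flag a melting point (single value or low/high range) per slide 13."""
    if high_c is None:
        # Single value: only the upper limit applies.
        return low_c > 500.0
    if high_c < low_c:
        # Second temperature lower than the first.
        return True
    if high_c - low_c > 50.0:
        # Range wider than 50 °C.
        return True
    # Assumption: a range is also flagged if its upper end exceeds 500 °C.
    return high_c > 500.0
```

With this, `is_suspicious_mp(184, 186)` passes, while `is_suspicious_mp(186, 184)` and `is_suspicious_mp(100, 160)` are flagged.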
  • 14. 300k Melting Point Datasets
      Bergström: 277
      Bradley: 2886
      OCHEM: 22404
      Enamine: 21883
      Patents: 228079
      Tetko et al J. Chemoinformatics, in preparation
  • 15. Melting point model: data distribution
  • 16. Some modeling highlights
      LibSVM grid search was used to select parameters (ca. 1.5 years of CPU time for optimization)
      Largest model: 668k descriptors (MolPrint), ~0.2 trillion entries
      Biggest model: 618Mb (Dragon descriptors)
      Most accurate model: consensus, average of 5 models
      RMSE < 32°C for the drug-like region, MP [50,250]°C
  • 18. NMR data
      • Extract from 1976-2014 USPTO applications
      Nucleus: count
      H: 975543
      C: 56536
      unknown: 44306
      F: 9429
      P: 3241
      B: 91
      Si: 62
      Sn: 22
      Se: 11
      N: 8
      *unknown – starts off with NMR: peak list (no nucleus)
  • 19. NMR text mining • We can find and index text spectra: 13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
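
To illustrate what indexing a text spectrum like the one on slide 19 might involve, here is a minimal sketch that splits the peak list into shifts and optional annotations. The names `PEAK` and `parse_peaks` are hypothetical, and the approach is an assumption rather than the extraction tool's actual method. It handles only flat annotations (no nested parentheses, so the 1H style `C(5a)H` would not parse), and an annotation shared by a run of shifts, such as `(ArCH)`, is attached only to the shift directly before it.

```python
import re

# A chemical shift with a decimal point, optionally followed by "(annotation)".
PEAK = re.compile(r"(\d+\.\d+)(?:\s*\(([^()]*)\))?")


def parse_peaks(spectrum: str):
    """Return [(shift, annotation_or_None), ...] for the text after 'δ ='."""
    _, sep, body = spectrum.partition("δ =")
    if not sep:
        return []  # no peak list found
    return [(float(m.group(1)), m.group(2)) for m in PEAK.finditer(body)]


text = ("13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), "
        "30.11 (CH, benzylic methane), 117.72, 130.53 (ArCH)")
for shift, note in parse_peaks(text):
    print(shift, note)
```

Splitting on `δ =` first keeps the header values (such as the 100 MHz frequency) out of the peak list.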
  • 20. NMR extracted by year of publication [Chart: cumulative distinct NMR extracted (axis 0 to 2,500,000) by year of publication, 1976-2014; series: USPTO grants, USPTO applications]
  • 21. NMR solvents
      [Pie chart] Shares as extracted: 48.5%, 38.3%, 8.7%, 1.1%, 1.0%, 1.0%, 1.4%
      Legend as extracted: CDCl3, DMSO-d6, CD3OD, D2O, Acetone-d6, MeOD, Others
      Others: CD2Cl2, CD3CN-d3, C6D6, Pyridine-d5, THF-d8, CD3Cl, dimethylformamide-d7, d1-trifluoroacetic acid, methanol-d3, acetic acid-d4, toluene-d8, sulfuric acid-d2, 1,1,2,2-tetrachloroethane-d2, CD3OCD3, dioxane-d8, 1,2-dichloroethane-d4
  • 22. 1H-NMR frequency over time [Chart: frequency (axis 0 to 450 MHz) by year of patent filing, 1976-2014]
  • 24. 1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
  • 25. 13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
  • 26. Detecting suspicious NMR spectra
      • Last peak of NMR spectra is unannotated and:
        – All other peaks are annotated
        – Spectrum has 1 peak and is proton or unknown NMR
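
The rule on slide 26 can be read in more than one way. The sketch below takes one reading as an assumption: a spectrum is suspicious when its last peak is unannotated and either (a) there are other peaks and all of them are annotated, or (b) it is the only peak and the nucleus is proton or unknown. The function name and the `(shift, annotation)` peak representation are hypothetical.

```python
from typing import List, Optional, Tuple

# (shift text, annotation or None), e.g. ("7.78", None)
Peak = Tuple[str, Optional[str]]


def is_suspicious_nmr(peaks: List[Peak], nucleus: Optional[str]) -> bool:
    """Flag a likely truncated spectrum; nucleus is "H", "C", ... or None if unknown."""
    if not peaks:
        return False
    *others, last = peaks
    if last[1] is not None:
        return False  # last peak is annotated: nothing to flag
    if others:
        # Every earlier peak is annotated, only the last one is bare.
        return all(annotation is not None for _, annotation in others)
    # Single unannotated peak: flag only proton or unknown-nucleus spectra.
    return nucleus in ("H", None)
```

Applied to a proton spectrum reported as `11.8-10.8 (brs, 1H), 7.78`, the trailing bare `7.78` trips the check.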
  • 27. > <SuspiciousValue> true
      > <Value> 1H-NMR (400 MHz, d6-Acetone): 11.8-10.8 (brs, 1H), 7.78
      Comments: Only the labile proton is reported in the spectrum. The other aromatic and aliphatic protons are completely missing in the spectrum.
  • 28. > <SuspiciousValue> true
      > <Value> 1H-NMR (400 MHz, CDCl3): 6.85 (1H, d, J=7.8 Hz), 6.10 (1H, dd, J=7.8 and 2.2 Hz), 6.06 (1H, d, J=2.2 Hz), 4.66 (1H, m), 3.75 (4H, br s), 3.40 (2H, s), 1.97
      Comments: There are only 11 protons reported in the spectrum whilst the molecule contains more than 50 protons.
  • 29. Knowledge systems Datastore Raw data Data in process Data out process UI, API, Services, etc
  • 31. RSC Databases RSC Compounds RSC Reactions RSC Spectra RSC Crystals RSC Polymers RSC Materials RSC Assays RSC Algorithms RSC Models …and on…
  • 32. Input pipeline Deposition Gateway Staging databases Compounds Reactions Spectra Crystals Materials Compounds Module Spectra Module Reactions Module Materials Module Textmining Module Module Web UI for unified depositions DropBox, Google Drive, SkyDrive, etc ELNs, templated data input Documents API, FTP, etc Raw data Validated data Staging databases All databases are sliced by data sources/ data collections and have simple security model where each data slice/ source is private, public or embargoed Etc Experiments Research
  • 33. Output pipeline Compounds Reactions Spectra Crystals Documents Compounds API Reactions API Spectra API Crystals API Documents API Compounds Widgets Reactions Widgets Spectra Widgets Crystals Widgets Documents Widgets Data layer Data access layer User interface widgets layer Analytical Laboratory application User interface layer (examples) Electronic Laboratory Notebook Paid 3rd party integrations (various platforms – SharePoint, Google, etc) Chemical Inventory application ChemSpider 2.0
  • 36. Data quality issue and CVSP – Robochemistry – Proliferation of errors in public and private databases • ChemSpider • PubChem • DrugBank • KEGG • ChEBI/ChEMBL – Automated quality control system
  • 37. Chemistry Validation and Standardization Platform
  • 43. New Repository Architecture doi: 10.1007/s10822-014-9784-5

Editor's Notes

  1. List of others probably isn’t completely comprehensive (solvent is free text!). 2 million spectra (from USPTO applications) have identified solvents
  2. Excluded results < 1MHz and >1GHz (…mixing up Hz and MHz not uncommon!). Just to confuse things this is from the grant data while the previous data was from applications :-p
  3. The extracted NMR spectrum is truncated because extraction keeps the valid part of the spectrum only up to the point of the error. Examples: US20140378645A1 0057 (typo?); US20140378687A1 0195 (missing open bracket)
  4. Change to add more database, rearrange
  5. Information typically associated with reactions