P. Tiikkainen / Drug repurposing
Estimating Error Rates
in Bioactivity Databases
Pekka Tiikkainen
6th Joint Sheffield Conference on Chemoinformatics
July 24, 2013
Presentation overview
• Merz Virtual Bioactivity Database
• Discrepancies and error rate estimates
• Conclusions and summary
Bioactivity databases
• Central to modern drug discovery.
• Built largely by manual curation of
scientific articles and patents. Some suppliers
also include other sources such as screening data sets.
• Both commercial and public databases are available.
• Pharmacokinetic and pharmacodynamic data.
• In this talk, I will concentrate on activity data
where the following parameters have been
defined:
- ligand structure
- target protein (UniProt accession)
- quantitative activity value
- activity type (Ki, IC50, EC50 etc.)
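As a rough illustration, the four parameters listed above could be held in a small record type. This is only a sketch: the class name, field names, the use of a SMILES string for the ligand structure and the separate unit field are assumptions, not details given in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bioactivity:
    """One curated data point with the four parameters listed above."""
    ligand: str          # ligand structure, here a SMILES string (assumed encoding)
    target: str          # target protein as a UniProt accession
    activity_type: str   # "Ki", "IC50", "EC50", ...
    value: float         # quantitative activity value
    unit: str            # concentration unit of the value, e.g. "nM"


# Illustrative record only: the SMILES is a placeholder, not a real measurement.
example = Bioactivity(ligand="CCO", target="P42574",
                      activity_type="Ki", value=10.0, unit="nM")
```

Making the record immutable (frozen) lets two curated data points be compared field by field, which is what the later discrepancy analysis relies on.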
Merz Virtual Bioactivity Database (MVBD)
• Our central resource for bioactivity data.
• Used for target prediction and enriching other data resources.
• Integrated from public, commercial and in-house data resources.
• Implemented as a MySQL database.
Bioactivity breakdown by target class
Tiikkainen and Franke. Analysis of commercial and public bioactivity databases. J Chem Inf Model. 2012 Feb 27;52(2):319-26.
Whereas GPCRs (still) are the largest target
class for launched drugs, enzymes are the
largest target class in the MVBD.
Historical breakdown of target data
Tiikkainen and Franke. Analysis of commercial and public bioactivity databases. J Chem Inf Model. 2012 Feb 27;52(2):319-26.
Why integrate?
• Much of the bioactivity data is unique to a single database.
• All databases use scientific papers as data sources, but...
- not all databases cite the same journals
- the extent to which a journal is cited varies across databases
- databases use additional data sources, e.g. PubChem
and screening data sets for ChEMBL, patents for Liceptor
Overlap of bioactivities
Bioactivity standardization
When your data come from heterogeneous sources,
standardization becomes extremely important.
Source       Target               Activity
Supplier A   P42574               Ki = 10 nM
Supplier B   Caspase-3 (human)    Ki = 0.01 µM
Supplier C   Hs.141125            pKi = 8
MVBD         P42574               Ki = 0.01 µM
Standardize structure, target and the activity value
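The activity part of this example can be sketched in a few lines. The function name, the unit table and the choice of µM as the common unit are assumptions made for illustration; the talk does not describe how MVBD implements the conversion. The only chemistry used is the definition pKi = -log10(Ki in mol/L).

```python
import math
from typing import Optional, Tuple

# Conversion factors from a concentration unit to micromolar (assumed target unit).
TO_MICROMOLAR = {"M": 1e6, "mM": 1e3, "µM": 1.0, "uM": 1.0, "nM": 1e-3, "pM": 1e-6}

# Negative-log activity types and the plain type they map back to.
LOG_TYPES = {"pKi": "Ki", "pIC50": "IC50", "pEC50": "EC50"}


def standardize_activity(activity_type: str, value: float,
                         unit: Optional[str] = None) -> Tuple[str, float]:
    """Return (plain activity type, value in µM)."""
    if activity_type in LOG_TYPES:
        molar = 10.0 ** (-value)  # undo the negative log10 to get mol/L
        return LOG_TYPES[activity_type], molar * TO_MICROMOLAR["M"]
    if unit not in TO_MICROMOLAR:
        raise ValueError(f"unknown unit: {unit!r}")
    return activity_type, value * TO_MICROMOLAR[unit]


# The three supplier values from the slide all map to Ki = 0.01 µM.
for args in [("Ki", 10, "nM"), ("Ki", 0.01, "µM"), ("pKi", 8)]:
    kind, micromolar = standardize_activity(*args)
    assert kind == "Ki" and math.isclose(micromolar, 0.01)
```

Target standardization (protein name or UniGene cluster to UniProt accession) would need an external mapping table and is not shown here.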
Discrepancies
Tiikkainen and Franke. Analysis of commercial and public bioactivity databases. J Chem Inf Model. 2012 Feb 27;52(2):319-26.
Standardization and integration
allow us to compare the
different database suppliers.
When comparing bioactivities that
different suppliers have curated
from the same article, it is not
uncommon to find discrepancies.
From discrepancies to error rate estimates
• Discrepancies alone do not tell us
which of the suppliers is correct.
• However, we can calculate error
rate estimates using a special
subset of discrepancies: if two
database suppliers agree on
a parameter value while a third
one disagrees, the third supplier's value is
considered incorrect.
• The underlying assumption is that all
suppliers have independently curated
the articles.
Diagram: suppliers A and B agree (A = B); supplier C differs from both (C ≠ A, C ≠ B).
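The two-against-one rule described above can be sketched as a small function. The function name and the dictionary input are assumptions for illustration; the values are expected to be already standardized so that equal values compare equal.

```python
from collections import Counter
from typing import Dict, Optional


def odd_one_out(values: Dict[str, str]) -> Optional[str]:
    """Given one standardized parameter value per supplier (exactly three
    suppliers), return the supplier that disagrees with the other two.

    Returns None when all three agree or when all three differ, because
    neither case says which supplier is wrong.
    """
    if len(values) != 3:
        raise ValueError("expected values from exactly three suppliers")
    counts = Counter(values.values())
    if len(counts) != 2:
        return None
    # With three values and two distinct ones, the counts are 2 and 1.
    minority_value = min(counts, key=counts.get)
    return next(s for s, v in values.items() if v == minority_value)


# Suppliers A and B agree on the target, C disagrees (accessions are arbitrary labels here).
assert odd_one_out({"A": "P42574", "B": "P42574", "C": "P00000"}) == "C"
```

Cases where all three suppliers disagree are deliberately left unclassified, since the rule needs a two-supplier consensus to point at an error.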
Error rate estimation workflow
Error rate estimates
Figures inside and above the bars indicate the absolute number of bioactivities.
Calculating discrepancy frequencies gives us error rate estimates.
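A minimal sketch of turning discrepancy counts into frequencies is shown here. It assumes each compared bioactivity has already been labelled with the single discrepant supplier, or with None when no supplier was singled out; the function name and input shape are hypothetical and the numbers are made up for the example.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def error_rate_estimates(flags: Iterable[Optional[str]],
                         suppliers: Iterable[str]) -> Dict[str, float]:
    """Fraction of compared bioactivities in which each supplier was the
    sole discrepant one, for a single parameter."""
    flags = list(flags)
    if not flags:
        raise ValueError("no compared bioactivities")
    flagged = Counter(f for f in flags if f is not None)
    return {s: flagged[s] / len(flags) for s in suppliers}


# Ten compared bioactivities: A flagged once, C flagged twice (toy data).
toy_flags = ["A", None, None, "C", None, None, None, "C", None, None]
rates = error_rate_estimates(toy_flags, suppliers=["A", "B", "C"])
```

Running the same calculation separately for ligand, target, activity value and activity type gives one estimate per supplier and parameter.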
Types of ligand errors
Discrepancies in ligand structures can be split into two categories:
1) discrepancies in atom connectivity and
2) identical atom connectivity but discrepant stereochemistry
* The majority of stereochemistry discrepancies in WOMBAT are probably
due to the lack of stereochemical features in the other databases.
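One way to separate the two categories is sketched below, assuming the structures are available as standard InChI strings. The talk does not say how structures were compared, so this is an illustration rather than the method used. It relies on the InChI convention that stereochemistry is carried in the /b, /t, /m and /s layers.

```python
STEREO_PREFIXES = ("b", "t", "m", "s")  # InChI stereo layers


def strip_stereo(inchi: str) -> str:
    """Drop the stereo layers from an InChI string, keeping everything else."""
    head, *layers = inchi.split("/")
    kept = [layer for layer in layers if not layer.startswith(STEREO_PREFIXES)]
    return "/".join([head, *kept])


def classify_ligand_discrepancy(inchi_a: str, inchi_b: str) -> str:
    """Return 'identical', 'stereochemistry' or 'connectivity'."""
    if inchi_a == inchi_b:
        return "identical"
    if strip_stereo(inchi_a) == strip_stereo(inchi_b):
        return "stereochemistry"
    return "connectivity"


# Alanine with a defined stereocentre, alanine without stereo, and glycine.
ala_stereo = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
ala_flat = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"
glycine = "InChI=1S/C2H5NO2/c3-1-2(4)5/h1,3H2,(H,4,5)"
```

Comparing `ala_stereo` with `ala_flat` falls into the second category, which mirrors the footnote above: a structure recorded without stereochemical features in one database shows up as a stereochemistry discrepancy against a database that records them. The "connectivity" bucket in this sketch also catches differences in charge or isotope layers, which a real pipeline would probably want to treat separately.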
Types of target errors
Target discrepancies can also be split into two categories:
1) target protein itself is discrepant (e.g. 5-HT1a vs. 5-HT2a)
2) target protein is correct but the ortholog (source species) is discrepant
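The same split can be sketched for targets, assuming a lookup from target identifier to protein and species is available. The identifiers below are hypothetical placeholders standing in for UniProt accessions, and the table and function names are invented for the example.

```python
from typing import Dict, NamedTuple


class TargetInfo(NamedTuple):
    protein: str   # protein name shared by all orthologs
    species: str   # source species


# Toy lookup table; the keys are placeholders, not real accessions.
TARGETS: Dict[str, TargetInfo] = {
    "ID_5HT1A_HUMAN": TargetInfo("5-HT1a", "human"),
    "ID_5HT2A_HUMAN": TargetInfo("5-HT2a", "human"),
    "ID_5HT1A_RAT": TargetInfo("5-HT1a", "rat"),
}


def classify_target_discrepancy(id_a: str, id_b: str) -> str:
    """Return 'identical', 'protein' or 'ortholog'."""
    a, b = TARGETS[id_a], TARGETS[id_b]
    if a.protein != b.protein:
        return "protein"    # category 1: different target protein
    if a.species != b.species:
        return "ortholog"   # category 2: same protein, different species
    return "identical"
```

For instance, the human 5-HT1a and 5-HT2a entries differ in the protein itself, while the human and rat 5-HT1a entries differ only in the ortholog.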
Validating the approach, part 1
For the credibility of the approach, it was necessary to test
how often the underlying assumption holds.
For each activity parameter (excl. activity type) and supplier, we picked five
activities where the supplier had provided a discrepant value
(i.e. 3 parameters x 3 suppliers x 5 = 45 activities).
These activities were manually checked against the original
source articles.
In 37 cases (82.2%), the discrepant activity value turned out
in fact to be wrong -> assumption correct.
In 3 cases (6.7%), the opposite was true, and the discrepant
supplier in fact had the correct value -> assumption incorrect.
In 5 cases (11.1%), we could not draw a conclusion since the
source article lacked clarity on the exact parameter value.
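The sample size and percentages above can be checked with a few lines of arithmetic. The breakdown into three parameters and three suppliers is a reading of the slide (activity type excluded, three suppliers compared), and the variable names are made up for the example.

```python
# 3 parameters (ligand, target, value) x 3 suppliers x 5 activities each
n_checked = 3 * 3 * 5

outcomes = {
    "assumption correct": 37,
    "assumption incorrect": 3,
    "inconclusive": 5,
}

# The three outcome counts should account for every checked activity.
assert sum(outcomes.values()) == n_checked

percentages = {label: round(100 * count / n_checked, 1)
               for label, count in outcomes.items()}
```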
Validating the approach, part 2
A more extensive validation was performed by the ChEMBL team
while checking discrepancies identified in ChEMBL release 14.
Louisa Bellis and Yvonne Light. ChEMBL team. European Bioinformatics Institute.
Parameter: Ligand structure
  Set: 1,936 ligands (corresponding to 2,181 activities) discrepant only in ChEMBL.
  Results: 310 (16.0%) correctly curated in ChEMBL while the remaining 1,626 (84.0%) required some changes.

Parameter: Ligand structure
  Set: 1,486 ligands (2,429 activities) where all suppliers disagreed.
  Results: 280 ligands (18.8%) correctly curated in ChEMBL. For 1,206 ligands (81.2%) some changes had to be made.

Parameter: Activity type/value
  Set: 259 cases checked so far.
  Results: In 83 cases (32.0%), ChEMBL had the correct activity value and type. For 68.0% of the cases, some corrections were made.

Parameter: Target
  Set: 764 bioactivities where ChEMBL was the sole discrepant supplier.
  Results: In 137 cases (18.0%), ChEMBL was correct while either the target or the species had to be corrected in the remaining 627 cases.
Summary
• By comparing bioactivities that three database suppliers have
extracted from the same source article, we were able to
identify discrepancies and calculate error rate estimates.
- Error rate estimates vary by parameter:
ligand > target > value > type
• Validation of the approach shows that it identifies
an incorrectly curated value ~65-80% of the time.
• Database suppliers have been notified of discrepancies
in their respective databases for re-curation.
- thousands of data points have already
been corrected in the ChEMBL database
- similar work is being undertaken by companies
representing the WOMBAT and Liceptor databases
• Users of bioactivity data are encouraged, if possible,
to double-check the data from the original source.
Acknowledgements
Lutz Franke
Louisa Bellis
Yvonne Light

More Related Content

What's hot

Poster on systems pharmacology of the cholesterol biosynthesis pathway
Poster on systems pharmacology of the cholesterol biosynthesis pathwayPoster on systems pharmacology of the cholesterol biosynthesis pathway
Poster on systems pharmacology of the cholesterol biosynthesis pathwayGuide to PHARMACOLOGY
 
Will the correct drugs please stand up?
Will  the correct drugs please stand up?Will  the correct drugs please stand up?
Will the correct drugs please stand up?Chris Southan
 
Assessing GtoPdb ligand content in PubChem
Assessing GtoPdb ligand content in PubChemAssessing GtoPdb ligand content in PubChem
Assessing GtoPdb ligand content in PubChemChris Southan
 
Comparing ChEMBL, DrugBank, Human Metabolome db and Therapeutic Target db at ...
Comparing ChEMBL, DrugBank, Human Metabolome db and Therapeutic Target db at ...Comparing ChEMBL, DrugBank, Human Metabolome db and Therapeutic Target db at ...
Comparing ChEMBL, DrugBank, Human Metabolome db and Therapeutic Target db at ...Chris Southan
 
GtoPdb: A resource for cell-based perturbogens
GtoPdb:  A resource for cell-based perturbogensGtoPdb:  A resource for cell-based perturbogens
GtoPdb: A resource for cell-based perturbogensChris Southan
 
Significance of Analytics in Biosimilars
Significance of Analytics in BiosimilarsSignificance of Analytics in Biosimilars
Significance of Analytics in BiosimilarsEMMAIntl
 
Curatorial data wrangling for the Guide to PHARMACOLGY
Curatorial data wrangling for the Guide to PHARMACOLGY Curatorial data wrangling for the Guide to PHARMACOLGY
Curatorial data wrangling for the Guide to PHARMACOLGY Chris Southan
 
5 data analysis case study
5  data analysis case study5  data analysis case study
5 data analysis case studyDmitry Grapov
 
Predictive comparative qsar analysis of as 5 nitrofuran-2-yl derivatives myco...
Predictive comparative qsar analysis of as 5 nitrofuran-2-yl derivatives myco...Predictive comparative qsar analysis of as 5 nitrofuran-2-yl derivatives myco...
Predictive comparative qsar analysis of as 5 nitrofuran-2-yl derivatives myco...hiij
 
Guide to PHARMACOLOGY: a web-Based Compendium for Research and Education
Guide to PHARMACOLOGY: a web-Based Compendium for Research and EducationGuide to PHARMACOLOGY: a web-Based Compendium for Research and Education
Guide to PHARMACOLOGY: a web-Based Compendium for Research and EducationChris Southan
 
IUPHAR/BPS Guide to Pharmacology in 2018
IUPHAR/BPS Guide to Pharmacology in 2018IUPHAR/BPS Guide to Pharmacology in 2018
IUPHAR/BPS Guide to Pharmacology in 2018Guide to PHARMACOLOGY
 
Analysing curated protein targets: Partitioning the drugged and the druggable
Analysing curated protein targets: Partitioning the drugged and the druggable Analysing curated protein targets: Partitioning the drugged and the druggable
Analysing curated protein targets: Partitioning the drugged and the druggable Chris Southan
 
Chemical database preparation ppt
Chemical database preparation pptChemical database preparation ppt
Chemical database preparation pptsamantlalit
 
Sagar alone qsar studies of saponin analogues for anticancer activity
Sagar alone  qsar studies of saponin analogues for anticancer activitySagar alone  qsar studies of saponin analogues for anticancer activity
Sagar alone qsar studies of saponin analogues for anticancer activitysagar alone
 
Capturing BIA-10-2474 and related FAAH inhibitor data
Capturing BIA-10-2474 and related FAAH inhibitor dataCapturing BIA-10-2474 and related FAAH inhibitor data
Capturing BIA-10-2474 and related FAAH inhibitor dataChris Southan
 
Molecular docking and its importance in drug design
Molecular docking and its importance in drug designMolecular docking and its importance in drug design
Molecular docking and its importance in drug designdevilpicassa01
 

What's hot (20)

Poster on systems pharmacology of the cholesterol biosynthesis pathway
Poster on systems pharmacology of the cholesterol biosynthesis pathwayPoster on systems pharmacology of the cholesterol biosynthesis pathway
Poster on systems pharmacology of the cholesterol biosynthesis pathway
 
IUPHAR/BPS Guide to Pharmacology
IUPHAR/BPS Guide to PharmacologyIUPHAR/BPS Guide to Pharmacology
IUPHAR/BPS Guide to Pharmacology
 
Virtual screening
Virtual screeningVirtual screening
Virtual screening
 
ChemInform RxnFinder
ChemInform RxnFinderChemInform RxnFinder
ChemInform RxnFinder
 
Will the correct drugs please stand up?
Will  the correct drugs please stand up?Will  the correct drugs please stand up?
Will the correct drugs please stand up?
 
Assessing GtoPdb ligand content in PubChem
Assessing GtoPdb ligand content in PubChemAssessing GtoPdb ligand content in PubChem
Assessing GtoPdb ligand content in PubChem
 
Comparing ChEMBL, DrugBank, Human Metabolome db and Therapeutic Target db at ...
Comparing ChEMBL, DrugBank, Human Metabolome db and Therapeutic Target db at ...Comparing ChEMBL, DrugBank, Human Metabolome db and Therapeutic Target db at ...
Comparing ChEMBL, DrugBank, Human Metabolome db and Therapeutic Target db at ...
 
GtoPdb: A resource for cell-based perturbogens
GtoPdb:  A resource for cell-based perturbogensGtoPdb:  A resource for cell-based perturbogens
GtoPdb: A resource for cell-based perturbogens
 
Significance of Analytics in Biosimilars
Significance of Analytics in BiosimilarsSignificance of Analytics in Biosimilars
Significance of Analytics in Biosimilars
 
Curatorial data wrangling for the Guide to PHARMACOLGY
Curatorial data wrangling for the Guide to PHARMACOLGY Curatorial data wrangling for the Guide to PHARMACOLGY
Curatorial data wrangling for the Guide to PHARMACOLGY
 
5 data analysis case study
5  data analysis case study5  data analysis case study
5 data analysis case study
 
Predictive comparative qsar analysis of as 5 nitrofuran-2-yl derivatives myco...
Predictive comparative qsar analysis of as 5 nitrofuran-2-yl derivatives myco...Predictive comparative qsar analysis of as 5 nitrofuran-2-yl derivatives myco...
Predictive comparative qsar analysis of as 5 nitrofuran-2-yl derivatives myco...
 
Guide to PHARMACOLOGY: a web-Based Compendium for Research and Education
Guide to PHARMACOLOGY: a web-Based Compendium for Research and EducationGuide to PHARMACOLOGY: a web-Based Compendium for Research and Education
Guide to PHARMACOLOGY: a web-Based Compendium for Research and Education
 
IUPHAR/BPS Guide to Pharmacology in 2018
IUPHAR/BPS Guide to Pharmacology in 2018IUPHAR/BPS Guide to Pharmacology in 2018
IUPHAR/BPS Guide to Pharmacology in 2018
 
Analysing curated protein targets: Partitioning the drugged and the druggable
Analysing curated protein targets: Partitioning the drugged and the druggable Analysing curated protein targets: Partitioning the drugged and the druggable
Analysing curated protein targets: Partitioning the drugged and the druggable
 
Chemical database preparation ppt
Chemical database preparation pptChemical database preparation ppt
Chemical database preparation ppt
 
Sagar alone qsar studies of saponin analogues for anticancer activity
Sagar alone  qsar studies of saponin analogues for anticancer activitySagar alone  qsar studies of saponin analogues for anticancer activity
Sagar alone qsar studies of saponin analogues for anticancer activity
 
Capturing BIA-10-2474 and related FAAH inhibitor data
Capturing BIA-10-2474 and related FAAH inhibitor dataCapturing BIA-10-2474 and related FAAH inhibitor data
Capturing BIA-10-2474 and related FAAH inhibitor data
 
Molecular docking and its importance in drug design
Molecular docking and its importance in drug designMolecular docking and its importance in drug design
Molecular docking and its importance in drug design
 
Bioinformatics and Drug Discovery
Bioinformatics and Drug DiscoveryBioinformatics and Drug Discovery
Bioinformatics and Drug Discovery
 

Similar to Estimating bioactivity database error rates, tiikkainen

Data drivenapproach to medicinalchemistry
Data drivenapproach to medicinalchemistryData drivenapproach to medicinalchemistry
Data drivenapproach to medicinalchemistryAnn-Marie Roche
 
Metabolic engineering approaches in medicinal plants
Metabolic engineering approaches in medicinal plantsMetabolic engineering approaches in medicinal plants
Metabolic engineering approaches in medicinal plantsN Poorin
 
Multiplexing analysis of 1000 approved drugs in PubChem
Multiplexing analysis of 1000 approved drugs in PubChemMultiplexing analysis of 1000 approved drugs in PubChem
Multiplexing analysis of 1000 approved drugs in PubChemChris Southan
 
Quantifying the content of biomedical semantic resources as a core for drug d...
Quantifying the content of biomedical semantic resources as a core for drug d...Quantifying the content of biomedical semantic resources as a core for drug d...
Quantifying the content of biomedical semantic resources as a core for drug d...Syed Muhammad Ali Hasnain
 
Extending the "Web of Drug Identity" with knowledge extracted from United Sta...
Extending the "Web of Drug Identity" with knowledge extracted from United Sta...Extending the "Web of Drug Identity" with knowledge extracted from United Sta...
Extending the "Web of Drug Identity" with knowledge extracted from United Sta...Richard Boyce, PhD
 
Correct drug structures for pharmacology
Correct drug structures for pharmacologyCorrect drug structures for pharmacology
Correct drug structures for pharmacologyChris Southan
 
2015 bioinformatics bio_cheminformatics_wim_vancriekinge
2015 bioinformatics bio_cheminformatics_wim_vancriekinge2015 bioinformatics bio_cheminformatics_wim_vancriekinge
2015 bioinformatics bio_cheminformatics_wim_vancriekingeProf. Wim Van Criekinge
 
Alternative animal model for compound characterization
Alternative animal model for compound characterizationAlternative animal model for compound characterization
Alternative animal model for compound characterizationWenlan Hu
 
Peptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdbPeptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdbChris Southan
 
SOT short course on computational toxicology
SOT short course on computational toxicology SOT short course on computational toxicology
SOT short course on computational toxicology Sean Ekins
 
Bioinformatics t9-t10-biocheminformatics v2014
Bioinformatics t9-t10-biocheminformatics v2014Bioinformatics t9-t10-biocheminformatics v2014
Bioinformatics t9-t10-biocheminformatics v2014Prof. Wim Van Criekinge
 
Best compound characterization protocol
Best compound characterization protocolBest compound characterization protocol
Best compound characterization protocolWenlan Hu
 
Bioinformatics t9-t10-bio cheminformatics-wimvancriekinge_v2013
Bioinformatics t9-t10-bio cheminformatics-wimvancriekinge_v2013Bioinformatics t9-t10-bio cheminformatics-wimvancriekinge_v2013
Bioinformatics t9-t10-bio cheminformatics-wimvancriekinge_v2013Prof. Wim Van Criekinge
 
2016 bioinformatics i_bio_cheminformatics_wimvancriekinge
2016 bioinformatics i_bio_cheminformatics_wimvancriekinge2016 bioinformatics i_bio_cheminformatics_wimvancriekinge
2016 bioinformatics i_bio_cheminformatics_wimvancriekingeProf. Wim Van Criekinge
 
PubChem as a source of systems biology perturbagens
PubChem as a source of  systems biology perturbagensPubChem as a source of  systems biology perturbagens
PubChem as a source of systems biology perturbagensChris Southan
 
Guide to Pharmacology Poster - ELIXIR All Hands 2020
Guide to Pharmacology Poster - ELIXIR All Hands 2020Guide to Pharmacology Poster - ELIXIR All Hands 2020
Guide to Pharmacology Poster - ELIXIR All Hands 2020Guide to PHARMACOLOGY
 
Research Avenues in Drug discovery of natural products
Research Avenues in Drug discovery of natural productsResearch Avenues in Drug discovery of natural products
Research Avenues in Drug discovery of natural productsDevakumar Jain
 

Similar to Estimating bioactivity database error rates, tiikkainen (20)

Data drivenapproach to medicinalchemistry
Data drivenapproach to medicinalchemistryData drivenapproach to medicinalchemistry
Data drivenapproach to medicinalchemistry
 
Metabolic engineering approaches in medicinal plants
Metabolic engineering approaches in medicinal plantsMetabolic engineering approaches in medicinal plants
Metabolic engineering approaches in medicinal plants
 
Multiplexing analysis of 1000 approved drugs in PubChem
Multiplexing analysis of 1000 approved drugs in PubChemMultiplexing analysis of 1000 approved drugs in PubChem
Multiplexing analysis of 1000 approved drugs in PubChem
 
Quantifying the content of biomedical semantic resources as a core for drug d...
Quantifying the content of biomedical semantic resources as a core for drug d...Quantifying the content of biomedical semantic resources as a core for drug d...
Quantifying the content of biomedical semantic resources as a core for drug d...
 
Extending the "Web of Drug Identity" with knowledge extracted from United Sta...
Extending the "Web of Drug Identity" with knowledge extracted from United Sta...Extending the "Web of Drug Identity" with knowledge extracted from United Sta...
Extending the "Web of Drug Identity" with knowledge extracted from United Sta...
 
Correct drug structures for pharmacology
Correct drug structures for pharmacologyCorrect drug structures for pharmacology
Correct drug structures for pharmacology
 
2015 bioinformatics bio_cheminformatics_wim_vancriekinge
2015 bioinformatics bio_cheminformatics_wim_vancriekinge2015 bioinformatics bio_cheminformatics_wim_vancriekinge
2015 bioinformatics bio_cheminformatics_wim_vancriekinge
 
Web-based access to experimental and predicted data for environmental fate, t...
Web-based access to experimental and predicted data for environmental fate, t...Web-based access to experimental and predicted data for environmental fate, t...
Web-based access to experimental and predicted data for environmental fate, t...
 
Alternative animal model for compound characterization
Alternative animal model for compound characterizationAlternative animal model for compound characterization
Alternative animal model for compound characterization
 
Peptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdbPeptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdb
 
SOT short course on computational toxicology
SOT short course on computational toxicology SOT short course on computational toxicology
SOT short course on computational toxicology
 
Bioinformatics t9-t10-biocheminformatics v2014
Bioinformatics t9-t10-biocheminformatics v2014Bioinformatics t9-t10-biocheminformatics v2014
Bioinformatics t9-t10-biocheminformatics v2014
 
AXP302
AXP302AXP302
AXP302
 
Best compound characterization protocol
Best compound characterization protocolBest compound characterization protocol
Best compound characterization protocol
 
Trends in Early Development
Trends in Early DevelopmentTrends in Early Development
Trends in Early Development
 
Bioinformatics t9-t10-bio cheminformatics-wimvancriekinge_v2013
Bioinformatics t9-t10-bio cheminformatics-wimvancriekinge_v2013Bioinformatics t9-t10-bio cheminformatics-wimvancriekinge_v2013
Bioinformatics t9-t10-bio cheminformatics-wimvancriekinge_v2013
 
2016 bioinformatics i_bio_cheminformatics_wimvancriekinge
2016 bioinformatics i_bio_cheminformatics_wimvancriekinge2016 bioinformatics i_bio_cheminformatics_wimvancriekinge
2016 bioinformatics i_bio_cheminformatics_wimvancriekinge
 
PubChem as a source of systems biology perturbagens
PubChem as a source of  systems biology perturbagensPubChem as a source of  systems biology perturbagens
PubChem as a source of systems biology perturbagens
 
Guide to Pharmacology Poster - ELIXIR All Hands 2020
Guide to Pharmacology Poster - ELIXIR All Hands 2020Guide to Pharmacology Poster - ELIXIR All Hands 2020
Guide to Pharmacology Poster - ELIXIR All Hands 2020
 
Research Avenues in Drug discovery of natural products
Research Avenues in Drug discovery of natural productsResearch Avenues in Drug discovery of natural products
Research Avenues in Drug discovery of natural products
 

Recently uploaded

Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaCzechDreamin
 
AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101vincent683379
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfFIDO Alliance
 
Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessUXDXConf
 
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfIntroduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfFIDO Alliance
 
The UX of Automation by AJ King, Senior UX Researcher, Ocado
The UX of Automation by AJ King, Senior UX Researcher, OcadoThe UX of Automation by AJ King, Senior UX Researcher, Ocado
The UX of Automation by AJ King, Senior UX Researcher, OcadoUXDXConf
 
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...FIDO Alliance
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxDavid Michel
 
Strategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering TeamsStrategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering TeamsUXDXConf
 
AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekCzechDreamin
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxJennifer Lim
 
ECS 2024 Teams Premium - Pretty Secure
ECS 2024   Teams Premium - Pretty SecureECS 2024   Teams Premium - Pretty Secure
ECS 2024 Teams Premium - Pretty SecureFemke de Vroome
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfSrushith Repakula
 
Connecting the Dots in Product Design at KAYAK
Connecting the Dots in Product Design at KAYAKConnecting the Dots in Product Design at KAYAK
Connecting the Dots in Product Design at KAYAKUXDXConf
 
Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityScyllaDB
 
PLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsPLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsStefano
 
Oauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftOauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftshyamraj55
 
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfThe Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfFIDO Alliance
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...FIDO Alliance
 

Recently uploaded (20)

Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara Laskowska
 
AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
 
Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for Success
 
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfIntroduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
 
The UX of Automation by AJ King, Senior UX Researcher, Ocado
The UX of Automation by AJ King, Senior UX Researcher, OcadoThe UX of Automation by AJ King, Senior UX Researcher, Ocado
The UX of Automation by AJ King, Senior UX Researcher, Ocado
 

Estimating bioactivity database error rates, Tiikkainen

  • 1. P. Tiikkainen / Drug repurposing. Estimating Error Rates in Bioactivity Databases. Pekka Tiikkainen. 6th Joint Sheffield Conference on Chemoinformatics, July 24, 2013
  • 2. Presentation overview • Merz Virtual Bioactivity Database • Discrepancies and error rate estimates • Conclusions and summary
  • 3. Presentation overview • Merz Virtual Bioactivity Database • Discrepancies and error rate estimates • Conclusions and summary
  • 4. Bioactivity databases • Central to modern drug discovery. • Built with a largely manual curation of scientific articles and patents. Some suppliers include screening data sets etc. • Both commercial and public databases available. • Pharmacokinetic and pharmacodynamic data. • In this talk, I will concentrate on activity data where the following parameters have been defined: - ligand structure - target protein (Uniprot accession) - quantitative activity value - activity type (Ki, IC50, EC50 etc.)
  • 5. Merz Virtual Bioactivity Database (MVBD) • Our central resource for bioactivity data. • Used for target prediction and enriching other data resources. • Integrated from public, commercial and in-house data resources. • Implemented as a MySQL database.
  • 6. Bioactivity breakdown by target class Tiikkainen and Franke. Analysis of commercial and public bioactivity databases. J Chem Inf Model. 2012 Feb 27;52(2):319-26. Whereas GPCRs (still) are the largest target class for launched drugs, enzymes are the largest target class in the MVBD.
  • 7. Historical breakdown of target data Tiikkainen and Franke. Analysis of commercial and public bioactivity databases. J Chem Inf Model. 2012 Feb 27;52(2):319-26.
  • 8. Why integrate? • Much of the bioactivity data is unique to a single database. • All databases use scientific papers as data sources, but... - not all databases cite the same journals - the extent to which a journal is cited varies across databases - databases use additional data sources, e.g. PubChem and screening data sets for ChEMBL, patents for Liceptor (Figure: overlap of bioactivities.)
  • 9. Bioactivity standardization When your data comes from heterogeneous sources, standardization becomes extremely important. Standardize structure, target and the activity value. Example: - Supplier A: target P42574, activity Ki = 10 nM - Supplier B: target Caspase-3 (human), activity Ki = 0.01 µM - Supplier C: target Hs.141125, activity pKi = 8 - MVBD (standardized): target P42574, activity Ki = 0.01 µM
  • 10. Presentation overview • Merz Virtual Bioactivity Database • Discrepancies and error rate estimates • Conclusions and summary
  • 11. Discrepancies Tiikkainen and Franke. Analysis of commercial and public bioactivity databases. J Chem Inf Model. 2012 Feb 27;52(2):319-26. Standardization and integration allow us to compare the different database suppliers. When comparing bioactivities that different suppliers have curated from the same article, it is not uncommon to find discrepancies.
  • 12. From discrepancies to error rate estimates • Discrepancies alone do not tell which of the suppliers is correct. • However, we can calculate error rate estimates using a special subset of discrepancies: if two database suppliers agree on a parameter value while a third one disagrees, the third supplier's value is considered incorrect. • The underlying assumption is that all suppliers have independently curated the articles. (Diagram: A = B, A ≠ C, B ≠ C.)
  • 13. Error rate estimation workflow
  • 14. Error rate estimates Figures inside and above the bars indicate the absolute number of bioactivities. Calculating discrepancy frequencies gives us error rate estimates.
  • 15. Types of ligand errors Discrepancies in ligand structures can be split into two categories: 1) discrepancies in atom connectivity and 2) identical atom connectivity but discrepant stereochemistry. * The majority of stereochemistry discrepancies in WOMBAT are probably due to the lack of any stereochemical features in other databases.
  • 16. Types of target errors Target discrepancies can also be split into two categories: 1) the target protein itself is discrepant (e.g. 5-HT1a vs. 5-HT2a) and 2) the target protein is correct but the ortholog (source species) is discrepant.
  • 17. Validating the approach, part 1 For the credibility of the approach, it was necessary to test how often the underlying assumption holds. For each activity parameter (excl. activity type) and supplier, we picked five activities where the supplier had provided a discrepant value (3 parameters × 3 suppliers × 5 = 45 activities). These activities were manually checked against the original source articles. In 37 cases (82.2%), the discrepant activity value did in fact turn out to be wrong -> assumption correct. In 3 cases (6.7%), the opposite was true, and the discrepant supplier in fact had the correct value -> assumption incorrect. In 5 cases (11.1%), we could not draw a conclusion since the source article lacked clarity on the exact parameter value.
  • 18. Validating the approach, part 2 A more extensive validation was performed by the ChEMBL team (Louisa Bellis and Yvonne Light, European Bioinformatics Institute) while checking discrepancies identified in ChEMBL release 14. Results by parameter set: • Ligand structure: 1,936 ligands (corresponding to 2,181 activities) were discrepant only in ChEMBL; 310 (16.0%) were correctly curated in ChEMBL, while the remaining 1,626 (84.0%) required some changes. For 1,486 ligands (2,429 activities), all suppliers disagreed; 280 ligands (18.8%) were correctly curated in ChEMBL, and for 1,206 ligands (81.2%) some changes had to be made. • Activity type/value: 259 cases checked so far; in 83 cases (32.0%), ChEMBL had the correct activity value and type, and for 68.0% of the cases some corrections were made. • Target: 764 bioactivities where ChEMBL was the sole discrepant supplier; in 137 cases (18.0%), ChEMBL was correct, while either the target or the species had to be corrected in the remaining 627 cases.
  • 19. Summary • By comparing bioactivities that three database suppliers have extracted from the same source article, we were able to identify discrepancies and calculate error rate estimates. - Error rate estimates vary by parameter: ligand > target > value > type • Validation of the approach shows that it identifies an incorrectly curated value ~65-80% of the time. • Database suppliers have been notified of discrepancies in their respective databases for re-curation. - thousands of data points have already been corrected in the ChEMBL database - similar work is being undertaken by companies representing the WOMBAT and Liceptor databases • Users of bioactivity data are encouraged, if possible, to double-check the data against the original source.
  • 20. Acknowledgements Lutz Franke, Louisa Bellis, Yvonne Light
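
The workflow figure on slide 13 did not survive in this transcript, so the following Python sketch is only one possible reading of slides 9, 12 and 14, not the author's implementation. It standardizes activity values to µM, flags a supplier when the other two agree and it disagrees, and divides the flag count by the number of compared activities. The function names, the supplier labels A/B/C, the sample records, the comparison tolerance and the choice of denominator are all assumptions made for illustration.

```python
import math
from collections import Counter

# Multipliers that convert a concentration unit to micromolar (µM).
TO_MICROMOLAR = {"M": 1e6, "mM": 1e3, "µM": 1.0, "nM": 1e-3, "pM": 1e-6}


def to_micromolar(value, unit):
    """Standardize an activity value to µM; unit 'p' marks a p-scale value such as pKi."""
    if unit == "p":
        # pKi = -log10(Ki in mol/L), hence Ki = 10**(-pKi) mol/L.
        return 10 ** (-value) * TO_MICROMOLAR["M"]
    return value * TO_MICROMOLAR[unit]


def same_value(a, b):
    """Treat two standardized values as equal within a small relative tolerance (assumed)."""
    return math.isclose(a, b, rel_tol=1e-6)


def odd_one_out(values):
    """Return the supplier that disagrees while the other two agree, else None.

    `values` maps exactly three supplier names to standardized values.
    """
    for name in values:
        first, second = (n for n in values if n != name)
        if same_value(values[first], values[second]) and not same_value(
            values[name], values[first]
        ):
            return name
    return None  # all three agree, or all three disagree


def error_rates(records):
    """Estimate per-supplier error rates as flagged / compared activities (assumed denominator)."""
    flagged = Counter()
    suppliers = set()
    for record in records:
        standardized = {s: to_micromolar(v, u) for s, (v, u) in record.items()}
        suppliers.update(standardized)
        culprit = odd_one_out(standardized)
        if culprit is not None:
            flagged[culprit] += 1
    return {s: flagged[s] / len(records) for s in sorted(suppliers)}


# Hypothetical Ki values for four activities curated from the same articles.
records = [
    {"A": (10, "nM"), "B": (0.01, "µM"), "C": (8, "p")},  # all agree after standardization
    {"A": (5, "nM"), "B": (5, "nM"), "C": (50, "nM")},    # C is the odd one out
    {"A": (1, "µM"), "B": (2, "µM"), "C": (3, "µM")},     # all disagree: no one is flagged
    {"A": (7, "p"), "B": (10, "nM"), "C": (0.01, "µM")},  # A is the odd one out
]

rates = error_rates(records)
print(rates)
```

In this sketch, records where all three suppliers disagree contribute to the denominator but flag no one, which matches the point on slide 12 that such discrepancies alone do not tell which supplier is correct. Ligand structures and targets would need their own equality tests (for example, comparing standardized structure identifiers or UniProt accessions) in place of `same_value`.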