Nice, April 8
SciWalker
integrating open access & private sources
Ontologies: we understand life and material sciences.
Data Streams: normalized patents, scientific articles (open access and behind
paywalls), news, etc.
Software & Analytics tools: indexing, extracting compound properties, combining
internal & external sources - documents, databases, linked data.
We support our customers by ...
Compound Registration
OC|processor - chemistry annotator
dictionary
WebAPI
190000000809
if no: register
deliver OCID and data
• provides chemistry registration system based on InChI
• unique, stable OCID
• substructure searchable (JChem SQL Database)
• synonyms & classification connected to OCID
• can also be used for local compounds in databases or documents
is it known?
… composition comprising a neonicotinoid such as imidacloprid <190000000809> and a …
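The annotated snippet above (a compound name followed by its OCID in angle brackets) suggests a dictionary-based annotator. A minimal sketch of that idea, assuming a plain name-to-OCID dictionary and regular-expression matching; the function `annotate` and the dictionary layout are hypothetical and are not the OC|processor implementation. Only the imidacloprid entry is taken from the slide.

```python
import re

# Hypothetical dictionary mapping lowercase compound names to OCIDs;
# only the imidacloprid entry comes from the slide.
DICTIONARY = {"imidacloprid": "190000000809"}


def annotate(text, dictionary=DICTIONARY):
    """Append <OCID> after each dictionary term found in the text."""
    if not dictionary:
        return text
    # Longest names first so that longer synonyms win over their prefixes.
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, names)) + r")\b", re.IGNORECASE
    )
    # Keep the original spelling and add the ID in angle brackets.
    return pattern.sub(
        lambda m: f"{m.group(0)} <{dictionary[m.group(0).lower()]}>", text
    )


print(annotate("composition comprising a neonicotinoid such as imidacloprid and a second agent"))
```

A production annotator would also need synonym handling and disambiguation, which this sketch leaves out.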
formula & molpuzzler
name-2-structure
image-2-structure
class/group
classify
compound store
other applications
[O-][N+](=O)NC1=NCCN1CC1=CC=C(Cl)N=C1
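The registration flow on this slide (is it known? if no, register; deliver OCID and data) can be sketched as a lookup-or-register step. This is an illustration only: the class `CompoundRegistry`, the in-memory dictionary, and the starting ID are assumptions. The slide states that the real system keys on InChI; here the raw structure string is used as the key, and normalization to InChI is omitted.

```python
from itertools import count


class CompoundRegistry:
    """Toy in-memory registry: one stable ID per unique structure key."""

    def __init__(self, first_id=190000000001):
        self._ids = {}                 # structure key -> OCID
        self._next = count(first_id)   # hypothetical ID sequence

    def lookup_or_register(self, structure_key):
        """Return (ocid, was_known); register the key if it is new."""
        if structure_key in self._ids:          # is it known?
            return self._ids[structure_key], True
        ocid = str(next(self._next))            # if no: register
        self._ids[structure_key] = ocid
        return ocid, False                      # deliver OCID


registry = CompoundRegistry()
smiles = "[O-][N+](=O)NC1=NCCN1CC1=CC=C(Cl)N=C1"
first = registry.lookup_or_register(smiles)
second = registry.lookup_or_register(smiles)
print(first, second)
```

The second call returns the same ID with `was_known` set to `True`, which is the "unique, stable OCID" property in miniature.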
About 50,000 new compounds per month from new patents and articles
Growth of registered unique compounds
Compound Registration
your chemistry
WebAPI
190000000809
if no: register
deliver OCID and data
is it known?
[O-][N+](=O)NC1=NCCN1CC1=CC=C(Cl)N=C1
Call WebAPI client
compound store
Get the client & description from our public FTP server: transfer.ontochem.com
User: Mievoogh pwd: phae9Goo
other applications
compound database
● Molecules: about 126 million unique InChI compounds from the OntoChem registration server, 74 million
nucleotide and 11 million peptide sequences = 211 million unique molecules
○ OntoChem, PubChem, ChEMBL, ZINC, DSSTox
○ Nucleotide sequences, protein and peptide sequences
● Ontologies - public OCID and hierarchy
○ anatomy, biomarker, chemistry, clinicalTrials, compound_classes, cosmetology, drugs, effects,
herbal_drugs, human_genes, inorganic_materials, institutions, magnitudes, methods,
natural_products, nutrition, proteins, regions, species, substances, toxicity
● Genes & Proteins
○ GWAS - Genome-wide association studies, ClinVar
● Clinical & Drug Data
○ ClinicalTrials.gov, Drug Central, Drug labels
“SciWalker Open Data” in BigQuery
Use Examples
Data on sitagliptin?
SciWalker + BigQuery
WebAPI
compound link-outs
Application structure changes
[Architecture diagram comparing the structure so far with the new one]
Components: Frontend, Middleware, Backend, Backend API, OIS, OIS API, Google API, BigQuery, your data
Visualization tools: Google Data Studio, Looker, Tableau, Qlik
Text sources: PMC, MedLine, Patents
Databases: Integrity, COSMIC, GWAS
● Technology integration:
○ Lucene or SQL indexes: speed with indexed data
○ BigQuery: fast sorting, deduplication and aggregation of non-indexed data
○ Seamless prototyping of OLAP search and visualization using Google Data Studio, Tableau,
Looker, Qlik …
● Data integration:
○ Faster integration of novel data sources
○ JOIN operations on different sources: public + third party + private data
○ Scalability and speed
○ BigQuery: large amount of open access data
Advantages of hybrid BigQuery architectures
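The data-integration points above (JOIN across public and private sources, deduplication, aggregation) can be illustrated with a small query. This is a sketch under stated assumptions: SQLite from the Python standard library stands in for BigQuery so the example runs locally, and the table names, columns, and assay values are hypothetical toy data. Only the OCID and compound name come from the slides.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE public_compounds (ocid TEXT, name TEXT);
    CREATE TABLE private_assays   (ocid TEXT, result REAL);
    INSERT INTO public_compounds VALUES ('190000000809', 'imidacloprid');
    INSERT INTO public_compounds VALUES ('190000000809', 'imidacloprid');
    INSERT INTO private_assays   VALUES ('190000000809', 0.5);
    INSERT INTO private_assays   VALUES ('190000000809', 1.5);
""")

# Deduplicate the public rows, join them to private data on OCID,
# then aggregate the private results per compound.
rows = con.execute("""
    SELECT p.name, COUNT(*) AS n_results, AVG(a.result) AS mean_result
    FROM (SELECT DISTINCT ocid, name FROM public_compounds) AS p
    JOIN private_assays AS a ON a.ocid = p.ocid
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
print(rows)
```

The same SELECT shape (DISTINCT subquery, JOIN on a shared identifier, GROUP BY) carries over to BigQuery standard SQL; only the table references would change.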
sciwalker-open-data:sequences.proteins
seq = DRVYIHPF
Searching for patents with AT2 sequences in BQ
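A sequence lookup like the one on this slide could be expressed as a parameterized BigQuery standard SQL query. The sketch below only builds the query text and its parameters; it does not call any BigQuery client. The table path is taken from the slide and rewritten in dotted form for standard SQL, the column name `seq` is inferred from the slide's `seq = DRVYIHPF`, and the helper `build_sequence_query` is hypothetical.

```python
def build_sequence_query(sequence, table="sciwalker-open-data.sequences.proteins"):
    """Return (sql, params) for an exact-match sequence lookup."""
    # Backticks are needed because the project ID contains hyphens.
    # @seq is a named query parameter, so the sequence is never pasted into the SQL.
    sql = f"SELECT * FROM `{table}` WHERE seq = @seq"
    return sql, {"seq": sequence}


sql, params = build_sequence_query("DRVYIHPF")
print(sql)
print(params)
```

Keeping the sequence as a named parameter rather than formatting it into the string avoids quoting problems when the query is later submitted through a client library.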
Q: Which tetrazole-containing drug candidates are in how many clinical trials, and for which diseases?
A: Of 303,973 published tetrazoles, 35 compounds were in 926 trials.
Answering pharma-related questions
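The counting step behind an answer of this kind (distinct compounds, distinct trials, trials per disease) can be sketched as follows. The rows are made-up toy data, not results from SciWalker, and the function `summarize` is hypothetical; the substructure search that selects the tetrazoles in the first place is not shown.

```python
from collections import defaultdict

# Toy rows (compound, trial ID, disease); all values are made up for illustration.
rows = [
    ("compound_A", "trial_1", "hypertension"),
    ("compound_A", "trial_2", "hypertension"),
    ("compound_B", "trial_2", "hypertension"),
    ("compound_B", "trial_3", "heart failure"),
]


def summarize(rows):
    """Count distinct compounds and trials overall, and distinct trials per disease."""
    compounds = {compound for compound, _, _ in rows}
    trials = {trial for _, trial, _ in rows}
    per_disease = defaultdict(set)
    for _, trial, disease in rows:
        per_disease[disease].add(trial)
    return len(compounds), len(trials), {d: len(t) for d, t in per_disease.items()}


print(summarize(rows))
```

Counting distinct trial IDs matters here because one trial can involve several compounds and would otherwise be counted more than once.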
Ian Wetherbee
@Google
https://datastudio.google.com/u/0/reporting/1xKxWJ9RTOaJCjjzrBgB2p_s-OstfCikt/page/M6yj
Thanks to
Stephen Boyer, Collabra
Ian Wetherbee, Google
Aleksandar Kapisoda, Karlheinz Spenny, Boehringer-Ingelheim
Team, OntoChem

IC-SDV 2019: OntoChem
