The #cleandata company
• Majority of scientific information is
unstructured and underused
• Information overload (Volume,
Variety, Velocity, Quality)
• Highly synonymous and ambiguous
terminology
• Complex hierarchical relationships
Science isn’t simple
• Poor results when applying computational/AI
approaches
• Up to 80% of time taken to prepare data
• Inaccessible and underused data
• Duplication of research
• Not building on existing knowledge
The downstream impact
Our Purpose
To enable scientists to use insights locked in
unstructured data to power their decisions and speed up
innovation by:
• Using world class ontologies to revolutionise the
access to and utilisation of scientific information
• Transforming unstructured text into contextualised,
machine readable data suitable for computational
analysis
The SciBite Platform
Harmonise terminology
Scientific ontologies
Adhering to public standards
Manage / augment / curate your
own
#MANAGE
Automated cleansing of semi-structured data
Text to data
Standardise data formats
Indexing data at point of entry
#CLEAN
Semantic search
Regular expressions
Knowledge networks
Visualise results
Platform enrichment
#DISCOVER
TERMite
VOCab
TEXT IN
Any format of
biomedical text-based
document can be
processed by
TERMite
STRUCTURED
DATA OUT
Contextualized,
machine readable
data ready for
analysis
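The text-in, structured-data-out idea can be sketched with a toy dictionary tagger. This is not TERMite's implementation or API: the mini-vocabulary, the `tag_text` function, the `IND_HEADACHE` identifier and the output fields are all assumptions made for illustration.

```python
import json
import re

# Hypothetical mini-vocabulary: entity type -> {ID: [synonyms]}.
VOCAB = {
    "DRUG": {"CHEMBL25": ["aspirin", "acetylsalicylic acid"]},
    "INDICATION": {"IND_HEADACHE": ["headache", "cephalalgia"]},
}

def tag_text(text):
    """Return one record per dictionary hit, ordered by position in the text."""
    hits = []
    for entity_type, entries in VOCAB.items():
        for entity_id, synonyms in entries.items():
            for synonym in synonyms:
                # Whole-word, case-insensitive match of each synonym.
                pattern = r"\b" + re.escape(synonym) + r"\b"
                for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                    hits.append({
                        "type": entity_type,
                        "id": entity_id,
                        "matched": match.group(0),
                        "start": match.start(),
                    })
    return sorted(hits, key=lambda hit: hit["start"])

records = tag_text("Acetylsalicylic acid is widely used to treat headache.")
print(json.dumps(records, indent=2))
```

Each record carries the entity type and a stable identifier rather than the surface string, which is what makes the output usable for downstream analysis.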
Augment VOCab
VOCab Creation
Variation engine
e.g. breast cancer is automatically expanded to
include the synonyms breast neoplasm,
cancer of the breast & mammary tumour
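A minimal sketch of rule-based synonym expansion for the breast cancer example. The word-level equivalences and the two phrase templates are assumptions chosen for illustration, not the rules of the actual variation engine.

```python
from itertools import product

# Hypothetical word-level equivalences; a real engine would use far richer rules.
WORD_VARIANTS = {
    "breast": ["breast", "mammary"],
    "cancer": ["cancer", "neoplasm", "tumour"],
}

def expand(term):
    """Generate candidate synonyms for a two-word 'site disease' term."""
    site, disease = term.lower().split()
    site_options = WORD_VARIANTS.get(site, [site])
    disease_options = WORD_VARIANTS.get(disease, [disease])
    variants = set()
    for s, d in product(site_options, disease_options):
        variants.add(f"{s} {d}")         # e.g. "mammary tumour"
        variants.add(f"{d} of the {s}")  # e.g. "cancer of the breast"
    variants.discard(term.lower())       # the input term is not its own synonym
    return sorted(variants)

candidates = expand("breast cancer")
print(candidates)
```

Blind expansion like this over-generates, which is why the generated candidates would still pass through expert curation and disambiguation before being used.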
Source
Ontology Expert Curation
Synonym
Expansion
Disambiguation
settings
Iterative testing
Users can update VOCabs with simple 3-column augment
files. The following file (saved as drug.dictionary.aug) would add the
extra synonym extrasyn to the DRUG entity aspirin:
# ID name syns
CHEMBL25 extrasyn
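An augment file like the one above could be generated and read back programmatically. The sketch below assumes the three columns are tab-separated and that the name column may be left empty; neither detail is confirmed by the slide, and both helper functions are hypothetical.

```python
import csv
import io

def build_augment_file(rows):
    """Serialise (entity_id, name, synonym) rows; the tab delimiter is an assumption."""
    buffer = io.StringIO()
    buffer.write("# ID\tname\tsyns\n")
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for entity_id, name, synonym in rows:
        writer.writerow([entity_id, name, synonym])
    return buffer.getvalue()

def parse_augment_file(content):
    """Read rows back, skipping blank lines and comment lines starting with '#'."""
    lines = [line for line in content.splitlines() if line and not line.startswith("#")]
    return [tuple(row) for row in csv.reader(lines, delimiter="\t")]

# Add the synonym "extrasyn" to CHEMBL25 (aspirin), leaving the name column empty.
content = build_augment_file([("CHEMBL25", "", "extrasyn")])
rows = parse_augment_file(content)
print(content)
```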
Java based, RESTful service
RDBMS, NOSQL, Solr/Elastic, Hadoop, RDF, AWS &
Docker compatible
Scalable & fast. Runs on a server, cloud, laptop
• Hand curated and maintained by our expert
team
• Comprehensive coverage
• Aligned to industry standards to maintain
interoperability
• Enriched with synonyms and rules to
manage the complexity of scientific
language
• Customise or augment our existing vocabularies,
or deploy your own
VOCabs
Ontologies are at the heart of everything
CORE
CLINICAL
AGRO
BIO-PHARM
BUS INT
GEN PHEN
Modular Microservices Architecture
Compile / test vocabularies
Manage / distribute ontologies
VOCabulary curation
Data cleaning platform
Smart forms (HTML/JS)
Automated data ingestion
Semantic search UI
Pattern matching
Browser-based enrichment
Workflow automation (PLP/KNIME)
AI-based classification
Partnership Ecosystem
The SciBite Platform Principles
• Proven track record in semantic analytics
• Specialists in life sciences
• Micro-services architecture. Built for integration. Scalable
• End-to-end solution for processing, mining and querying
• Combined benefits of machine learning & ontologies
• Supports IT, Data Science, Info Management & Comp.
Biology
• Great support, with a direct connection to our SciTech team
• Best in class vocabularies covering >100 concepts
with tooling to create your own
[Comparison matrix: ✔/✘ marks for SciBite and "Others" against the principles above; row and column labels were lost in extraction]
One platform supports the most diverse set of use-cases: ELN enrichment, Target Identification, Pharmacovigilance,
Enterprise Search, Literature Analysis, Opportunity/C.I. Analysis, Data Integration, Drug Repurposing, Machine Learning…
What’s your use case?
• Document Search
• ELN Enrichment
• Vocabulary building & Mapping
• Intelligent Forms
• Pharmacovigilance
• Patient Forums
• CI / Horizon Scanning
• Clinical Phenotype Mining
• Disease Networks
• Connecting Silos
• Document Classification
Use Case Library
The Problem
• Poor keyword search results
• Inability to search across a specific concept e.g. [GENE]
• Unable to manage synonymy/ambiguity
The Solution
• SciBite Vocabularies cover >80 different Scientific concepts
• Rule-based system to translate language of science
• Flexible architecture to integrate seamlessly with partner systems
The Outcome
• Enterprise search transformed into a powerful, scientifically aware system
Enterprise Search
© 2018 SciBite Limited
DOCstore – A Biomedically-Aware Search Engine
Articles identified that don’t
use the word “Gilenya” but
do use a synonym
Articles must mention an
indication. We don’t care
which at this stage
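The two callouts above amount to a concept-level query: match the drug by any synonym, and require at least one indication of any kind. A minimal sketch follows; it is not DOCstore's query engine. The vocabularies, concept IDs and sample documents are assumptions, and naive substring matching stands in for real entity recognition.

```python
# Hypothetical vocabularies: lower-cased synonym -> concept ID.
DRUGS = {
    "gilenya": "DRUG:fingolimod",
    "fingolimod": "DRUG:fingolimod",
    "fty720": "DRUG:fingolimod",
}
INDICATIONS = {
    "multiple sclerosis": "IND:multiple_sclerosis",
    "psoriasis": "IND:psoriasis",
}

def concepts_in(text, vocabulary):
    """Return the set of concept IDs whose synonyms occur in the text."""
    lowered = text.lower()
    return {concept for synonym, concept in vocabulary.items() if synonym in lowered}

def search(documents, drug_name):
    """Find documents mentioning the drug (by any synonym) plus at least one indication."""
    target = DRUGS[drug_name.lower()]
    return [
        doc_id
        for doc_id, text in documents.items()
        if target in concepts_in(text, DRUGS) and concepts_in(text, INDICATIONS)
    ]

documents = {
    "a1": "Fingolimod reduced relapse rates in multiple sclerosis.",
    "a2": "FTY720 pharmacokinetics in healthy volunteers.",
    "a3": "New biologics for psoriasis.",
}
hits = search(documents, "Gilenya")
print(hits)
```

Document a1 is returned even though it never uses the word "Gilenya"; a2 is excluded because it names no indication, and a3 because it names no synonym of the drug.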
The Problem
• Search functionality is often limited to keywords, with no synonym
support
• Difficult to gain an aggregate view of the innovation within the
business
• Data not structured/tagged to facilitate linking with other stores
The Solution
• Extraction and semantic enrichment of ELN records transforms the
knowledge into richly annotated, machine-readable data
• Interoperable output that can be delivered into various downstream
environments
The Outcome
• Greatly improved ability to search and analyse internal R&D
information
• Understand who is researching what and how people/groups are
interacting
Improving searchability in an ELN
• Highly experienced curation team
• Active engagement with initiatives such as Pistoia, OBO, ICBO…
• Supported by custom-developed curation software to rapidly develop and
maintain new or existing ontologies
Ontology & Curation Services
For new domains or
when using internal
data sources
Bespoke
Ontology
Curation
For example in the
areas of bioassays,
technologies and
devices
Enrich &
Manage Public
Ontologies
Expand/customise
our hand-curated
vocabs (gene,
indication, etc…)
Augment
SciBite Vocabs
Thought leadership
on standards,
ontologies and
metadata
Engage Our
Experts
For new domains or
to find novel entity
relationships
Bespoke
Semantic
Queries
BioAssay Data Repositories
The Problem
• Legacy systems missing metadata
• Limited ability to search leads to high duplication
The Solution
• Retrospective generation of metadata using semantic
entity recognition to find relevant terms
• Prospective auto-metadata curation using intelligent
forms incorporating semantic autocomplete
• Flexible architecture to integrate seamlessly into existing
systems
The Outcome
• Greater ability to find relevant information improves reuse
of legacy data and reduces duplication.
• Semantically annotated, interoperable assay data
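The semantic autocomplete mentioned for intelligent forms can be sketched as a prefix lookup over labels and synonyms that stores a term ID instead of free text. The assay vocabulary, its `ASSAY:` identifiers and the `autocomplete` function are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical assay-type vocabulary: ID -> (preferred label, synonyms).
ASSAY_TERMS = {
    "ASSAY:001": ("enzyme-linked immunosorbent assay", ["ELISA"]),
    "ASSAY:002": ("polymerase chain reaction", ["PCR"]),
    "ASSAY:003": ("electrophoretic mobility shift assay", ["EMSA", "gel shift assay"]),
}

def autocomplete(prefix, limit=5):
    """Suggest (ID, preferred label) pairs whose label or any synonym starts with the prefix."""
    needle = prefix.strip().lower()
    if not needle:
        return []
    suggestions = []
    for term_id, (label, synonyms) in ASSAY_TERMS.items():
        names = [label, *synonyms]
        if any(name.lower().startswith(needle) for name in names):
            suggestions.append((term_id, label))
    return suggestions[:limit]

print(autocomplete("pcr"))
print(autocomplete("el"))
```

Because the form records the identifier, two scientists typing "PCR" and "polymerase chain reaction" produce the same metadata value.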
Monitoring Patient Forums
The Problem
• The data from Patient/Social media forums are of increasing
interest/value to researchers but present a challenge to monitor
• Multiple formats, locations, structures of data make integration a difficult
process
• Consumer language used in these forums doesn’t map to standardised
ontologies
The Solution
• Index data irrespective of source and store in central repository for
analysis
• Customisable vocabularies can accommodate consumer language
and map it to existing public standards
• DOCstore provides customisable and extensible search capabilities
• Alerting function allows monitoring of relevant threads.
The Outcome
• Ability to transform, integrate and analyse patient forum data alongside
existing workflows.
• Powerful multi-source search through simple, easy to use interface.
• Tailored vocabularies provide unique search environment
CI/ Horizon Scanning
The Problem
• Many sources of unstructured external data are difficult to monitor and
search consistently
• Data aggregation and review is a time-consuming process
• Persisting legacy data – not all information is relevant right now
The Solution
• Index data irrespective of source and store in central repository
for analysis
• Customisable vocabularies allow for unique / proprietary search
methodologies
• DOCstore provides customisable and extensible search
capabilities
• Alerting function allows monitoring of many pre-defined search
strategies
The Outcome
• Powerful multi-source search through simple, easy to use
interface.
• Tailored vocabularies provide unique search environment
• Reduce data review times by up to 80%
https://www.scibite.com/artificial-intelligence-platform/
DOCstore
News, Grants, Publications – any Data Source
Semantic
enrichment + text
analytics using
customised
vocabularies
Phenotypic Triangulation
The Problem
• Many diseases are understudied and lack clear molecular
mechanisms
• Some entities (e.g. Phenotypes) are highly synonymous and
difficult to standardise
• Scraping, standardising, and analysing research is time-consuming
The Solution
• Standardise terminology using SciBite VOCabularies
• Transform unstructured text into interoperable machine-readable
data compatible with downstream applications
• Build network views of disease-phenotype mappings to identify
common mechanistic pathways and shared knowledge
The Outcome
• Uncovering novel relationships in disease biology not previously
evident in the source data
• Scalable, structured analysis mappable to public ontologies with
the flexibility to integrate additional sources over time
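A network view of disease-phenotype mappings can be sketched with plain sets. The disease names and phenotype annotations below are placeholder data, not results, and the two functions are assumptions about how such a view might be assembled.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical disease -> phenotype annotations, as might come out of text mining.
ANNOTATIONS = {
    "disease_A": {"seizure", "ataxia", "hypotonia"},
    "disease_B": {"ataxia", "hypotonia", "cardiomyopathy"},
    "disease_C": {"cardiomyopathy"},
}

def shared_phenotypes(annotations):
    """Return {(disease1, disease2): shared phenotypes} for every pair with an overlap."""
    edges = {}
    for left, right in combinations(sorted(annotations), 2):
        overlap = annotations[left] & annotations[right]
        if overlap:
            edges[(left, right)] = overlap
    return edges

def phenotype_index(annotations):
    """Invert the mapping: phenotype -> set of diseases that present it."""
    index = defaultdict(set)
    for disease, phenotypes in annotations.items():
        for phenotype in phenotypes:
            index[phenotype].add(disease)
    return dict(index)

edges = shared_phenotypes(ANNOTATIONS)
index = phenotype_index(ANNOTATIONS)
for pair, overlap in edges.items():
    print(pair, sorted(overlap))
```

Each edge links two diseases through the phenotypes they share, which only works if the phenotype terms were standardised first.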
Data Preparation / Cleansing
The Problem
• Many sources of internal data are ‘messy’; even when
structured, the data is not always consistently tagged
• Messy data in = Messy data out
• Cleaning/curating data is a time-consuming manual
process
The Solution
• FactBio + SciBite integration = automated cleaning/annotation
using highly curated vocabularies spanning
life science research
• User-friendly blend of automated tagging augmented
with manual review where necessary
• Flexible architecture to integrate seamlessly into existing
systems
The Outcome
• Greatly reduced effort required to cleanse / prepare data
for downstream utility
• Semantically annotated, interoperable assay data
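The blend of automated tagging and manual review can be sketched as a normalisation pass that maps messy values to vocabulary terms and queues the rest for a curator. This is not the FactBio or SciBite implementation: the lookup table and `clean_values` function are assumptions, though the NCBI Taxonomy identifiers shown are the real ones for these two species.

```python
# Hypothetical lookup: lower-cased synonym -> (ID, preferred label).
SPECIES = {
    "human": ("NCBITaxon:9606", "Homo sapiens"),
    "homo sapiens": ("NCBITaxon:9606", "Homo sapiens"),
    "mouse": ("NCBITaxon:10090", "Mus musculus"),
    "mus musculus": ("NCBITaxon:10090", "Mus musculus"),
}

def clean_values(values):
    """Split raw values into normalised annotations and items needing manual review."""
    cleaned, review = [], []
    for raw in values:
        key = " ".join(raw.lower().split())  # lower-case, trim and collapse whitespace
        if key in SPECIES:
            term_id, label = SPECIES[key]
            cleaned.append({"raw": raw, "id": term_id, "label": label})
        else:
            review.append(raw)  # unrecognised: leave for a human curator
    return cleaned, review

cleaned, review = clean_values(["Human", " homo  sapiens ", "MOUSE", "rat?"])
print(cleaned)
print(review)
```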
Julien Debeauvais – Head of Sales
Email: julien@scibite.com
Tel: +44 (0) 7825 732 364
