Exploration of multidimensional biomedical data in PubChem
Lianyi Han
National Center for Biotechnology Information
Advances science and health by providing access to biomedical and genomic
information.
Literature
• PubMed
• PMC
• PubMed Health
• …
Sequences
• Proteins
• Genes & Expression
• Genome & Maps
• …
Chemicals & Bioassays
• PubChem Databases
• BioSystems
• …
Software & tools
• BLAST
• Structure Search
• Entrez/Eutils
Structure & Domains
• Structure
• CDD
• …
Provides information on the biological activities of small molecules and beyond
PubChem
Substance
Compound
Bioactivities
Literature (link)
Target
Patent
Pathways
23 million citations
The Challenge
• Variety: heterogeneous documents with many-to-many relationships
• Volume
– 200M+ bioactivity data points
– 40M+ compounds
– 600K+ bioassays
– 20K+ pathways
– 9K targets
• Velocity: query wide quickly, query deep quickly, facet search quickly
Answers
The Direction
Velocity
Volume
Existing Search Systems
• ASN.1, XML schema
• RDBMS (SQL)
• In-house NoSQL Search Engine
• Specialized Search Engine
• Homebrewed messaging system
• Queue systems
A new search system
• Features?
• Scalability?
• Accessibility?
• Maintenance?
• Reusability?
• Extensibility?
• Cost-effectiveness?
Archive Analysis
The feature requirements for the new search system
• Full text search
• Highlighting
• Faceting
• Molecular formula search
• 2D similarity search
• Molecular superstructure/substructure search
• Joins and cascading joins, to search wide and deep
• Transfer search results effectively across services
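
To make the molecular formula requirement concrete, here is a minimal sketch of how a formula string could be normalized into element counts before indexing or matching. The function names `parseFormula` and `sameFormula` are hypothetical and not part of PubChem or Solr. The sketch assumes simple formulas only (no parentheses, charges, or isotopes), and it is not the implementation described in the talk.

```javascript
// Parse a simple molecular formula such as "C9H8O4" into element counts.
// Assumption: no parentheses, charges, or isotope labels are supported.
function parseFormula(formula) {
  const counts = {};
  let consumed = 0;
  // An element symbol is one uppercase letter plus an optional lowercase
  // letter, followed by an optional count (a missing count means 1).
  for (const m of formula.matchAll(/([A-Z][a-z]?)(\d*)/g)) {
    const n = m[2] === "" ? 1 : parseInt(m[2], 10);
    counts[m[1]] = (counts[m[1]] || 0) + n;
    consumed += m[0].length;
  }
  // Reject input containing characters the pattern could not account for.
  if (formula.length === 0 || consumed !== formula.length) {
    throw new Error("Unsupported formula: " + formula);
  }
  return counts;
}

// Build an order-independent key, e.g. "CH3COOH" -> "C2H4O2".
function canonicalKey(counts) {
  return Object.keys(counts)
    .sort()
    .map((element) => element + counts[element])
    .join("");
}

// Two formulas match exactly when their element counts are identical.
function sameFormula(a, b) {
  return canonicalKey(parseFormula(a)) === canonicalKey(parseFormula(b));
}
```

For example, `sameFormula("CH3COOH", "C2H4O2")` treats the two spellings of acetic acid as the same formula. A canonical key like this could be stored in a plain string field for exact lookup.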
We can make the feature set complete in SOLR!
• Full text search (SOLR)
• Highlighting (SOLR)
• Faceting (SOLR)
• Molecular formula search (implement MF search in SOLR)
• 2D similarity search (implement 2D fingerprint search in SOLR)
• Molecular superstructure/substructure search (SOLR-5244)
• Joins and cascading joins, to search wide and deep (SOLR-4787)
• Transfer search results effectively across services (SOLR-4787, SOLR-5244)
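
The 2D fingerprint search above depends on a similarity measure between bit fingerprints. The slides do not say which measure is used; the sketch below assumes the Tanimoto coefficient, a common choice for this task. It also assumes fingerprints are packed into `Uint32Array` words. The function names are hypothetical.

```javascript
// Count the set bits in a 32-bit word (classic SWAR popcount).
function popcount32(x) {
  x >>>= 0; // treat the input as an unsigned 32-bit integer
  x = x - ((x >>> 1) & 0x55555555);
  x = (x & 0x33333333) + ((x >>> 2) & 0x33333333);
  x = (x + (x >>> 4)) & 0x0f0f0f0f;
  return Math.imul(x, 0x01010101) >>> 24;
}

// Tanimoto coefficient: |A AND B| / |A OR B| over two bit fingerprints.
// Assumption: both fingerprints are Uint32Array values of equal length.
function tanimoto(a, b) {
  if (a.length !== b.length) {
    throw new Error("Fingerprints must have the same length");
  }
  let both = 0;
  let either = 0;
  for (let i = 0; i < a.length; i++) {
    both += popcount32(a[i] & b[i]);
    either += popcount32(a[i] | b[i]);
  }
  // Two empty fingerprints share no features; report 0 rather than NaN.
  return either === 0 ? 0 : both / either;
}
```

A similarity search would then keep candidates whose score against the query fingerprint exceeds a chosen threshold. The threshold value is an application decision and is not given in the slides.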
Architecture
The Backend
• Backend components (SOLR + SQL + specialized search engine)
– Configuration
– Importing pipeline
    • Dumping & importing (SGE farm)
    • DIH (JDBC)
– Replication
– Warm up
• Web API
– Encapsulate the backend implementation
– Load balancing and throttling
– Generic data model for heterogeneous documents
– Query language
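
As a rough illustration of how a Web API layer could hide the backend, the sketch below turns a simple request object into standard Solr request parameters (`q`, `rows`, `wt`, `facet`, `facet.field`, `hl`, `hl.fl`) and builds a query string for Solr's standard join query parser. The function names, the request shape, and the field names in the usage note are hypothetical. The actual PubChem API and schema are not described in the slides, and the SOLR-4787 join contribution may use different syntax.

```javascript
// Translate a simple, backend-agnostic request into Solr query parameters.
// Assumption: the request shape is invented here for illustration only.
function buildSolrParams({ text, facetFields = [], highlightFields = [], rows = 10 }) {
  const params = new URLSearchParams();
  params.append("q", text);
  params.append("rows", String(rows));
  params.append("wt", "json");
  if (facetFields.length > 0) {
    params.append("facet", "true");
    // Solr accepts one facet.field parameter per faceted field.
    for (const field of facetFields) {
      params.append("facet.field", field);
    }
  }
  if (highlightFields.length > 0) {
    params.append("hl", "true");
    params.append("hl.fl", highlightFields.join(","));
  }
  return params;
}

// Wrap an inner query in Solr's standard join query parser syntax.
function joinQuery(fromField, toField, innerQuery) {
  return "{!join from=" + fromField + " to=" + toField + "}" + innerQuery;
}
```

For instance, `buildSolrParams({ text: "aspirin", facetFields: ["source"] }).toString()` yields a query string that can be appended to a Solr `/select` URL, and `joinQuery("cid", "cid", "activity:active")` would express "documents whose `cid` matches a document with an active result" under the assumed field names.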
The Frontend
• Easier to develop or expand based on modern web technologies.
– One backend, multiple frontends
– One data model, multiple presentations
• UI/UX design
– MVC
– Reusability
– Mobile browser friendly
– Interactivity & Accessibility
The Frontends
• PubChem widgets (beta)
– Reusable UI components
• PubChem new search (beta)
– A new search system that delivers multiple search features
Briefly on UI architecture
• PubChem widgets as an example
Demo : PubChem widget
• http://jsfiddle.net/Gtbg7/
PubChem.widget.CreateGridTable({
  gridtabletype: 'pcassay',
  cid: 2244,          // PubChem compound ID (2244 is aspirin)
  renderTo: 'table',
  width: "90%",
  height: 400
});
More PubChem widgets
Demo : PubChem Search
• https://pubchem.ncbi.nlm.nih.gov/search/
Desktop Mobile
Brief Summary on PubChem Search Demo
• Faceting
• Molecular formula search
• Superstructure/substructure search
• Full-text search
Thanks
• Yu Bo
• Renata Geer
• Asta Gindulyte
• Siqian He
• Paul Thiessen
• Jiyao Wang
• Jeff Zhang
• Steve Bryant
• Lewis Geer
• Evan Bolton
• Yanli Wang
• NCBI IEB and IRB
This research was supported [in part] by
the Intramural Research Program of the NIH, National Library of Medicine.
Questions
About this talk: hanl@mail.nih.gov
PubChem: https://www.facebook.com/pubchem
NCBI: https://www.facebook.com/ncbi.nlm

Exploration of multidimensional biomedical data in PubChem, presented by Lianyi Han at Solr Exchange DC
