
Comprehensive Self-Service Life Science Data Federation with SADI Semantic Web Services and HYDRA


This is a moderately technical overview of SADI principles and capabilities, and IPSNP tools, including an overview of Life Science case studies. It is designed to be accessible to the general Computer Science and Software Engineering audience.

See also the sequel talk "A practical introduction to SADI semantic Web services and HYDRA query tool"

Published in: Science

  1. COMPREHENSIVE SELF-SERVICE LIFE SCIENCE DATA FEDERATION WITH SADI SEMANTIC WEB SERVICES AND HYDRA
     Alexandre Riazanov, CTO, IPSNP Computing Inc
     Oslo University, Sep 23, 2015
  2. WHO WE ARE
     • IPSNP Computing Inc -- a Canadian startup, building on and commercializing prior academic research on SADI.
     • Founded to develop an industrial-strength query tool for SADI, to supersede a research proof-of-concept prototype.
     • Looking for customers/partners and investors.
  3. BIOMEDICAL RESEARCHERS AND CLINICIANS USE DATA FROM MULTIPLE SOURCES
     • Online and in-house databases, spreadsheets.
     • Web services, e.g., literature search, etc.
     • Nomenclatures, ontologies, controlled vocabularies.
     • Web sites, scientific publications, patents, etc.
     • Algorithms, e.g., BLAST, molecular structure prediction, various text mining programs, etc.
  4. BIG VISION: FEDERATED QUERYING OF HETEROGENEOUS AND DISTRIBUTED DATA SOURCES
     • We want to query 1000s of data sources as a single database.
     • We want more agility than data warehousing can provide: e.g., just-in-time algorithm execution, plug-and-play data source addition, live data querying.
     • We want to use simple and declarative queries, not to program workflow scripts.
  5. IS THIS SCI-FI?
  6. WE CAN ACTUALLY DO THIS WITH SEMANTIC WEB SERVICES
     Here is how our data federation engine HYDRA works:
  7. HOW IS THIS ALL POSSIBLE?
     • Key ingredient: the SADI framework for Semantic Web services (Semantic Automated Discovery and Integration).
     • SADI services are:
       – RESTful services,
       – consuming and producing one format -- RDF,
       – with semantic descriptions (in OWL) fully defining their functionality.
  8. PLAN OF THE TALK
     • What are SADI services?
     • Automatic service discovery and invocation in query engines (HYDRA).
     • Self-service querying vision.
     • Query composition with HYDRA GUI.
     • An overview of Bioinformatics and Clinical Intelligence case studies.
     Tons of screenshots!
  9. SADI SERVICE I/O
     “Describe your input, and I will tell you something else about it”
     • Input: RDF description of an input object.
     • Output: another RDF graph providing more (computed or retrieved) info about the input object or linking it to other objects.
     • Since all SADI services “talk the same language” (RDF), they are 100% syntactically interoperable:
       – the output of one SADI service can be directly consumed by any other SADI service.
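The input/output contract on slide 9 can be illustrated with a toy model. The sketch below is not the SADI API: it assumes RDF triples are represented as plain Python tuples, and the `ex:` property names and both service functions are hypothetical. It only shows why services that share one data model can be chained without any glue code.

```python
# Toy illustration only: a "graph" is a set of (subject, predicate, object)
# tuples, and every ex: name below is made up for this example.

def length_service(input_graph):
    """Return new triples giving the length of every ex:hasSequence value."""
    output = set()
    for subject, predicate, obj in input_graph:
        if predicate == "ex:hasSequence":
            output.add((subject, "ex:hasLength", str(len(obj))))
    return output


def short_flag_service(input_graph):
    """Return new triples flagging nodes whose ex:hasLength is below 10."""
    output = set()
    for subject, predicate, obj in input_graph:
        if predicate == "ex:hasLength":
            output.add((subject, "ex:isShort", str(int(obj) < 10).lower()))
    return output


# Describe the input object, then let each service add what it knows.
graph = {("ex:seq1", "ex:hasSequence", "ACGTAC")}
graph |= length_service(graph)       # adds ex:hasLength
graph |= short_flag_service(graph)   # consumes the first service's output

for triple in sorted(graph):
    print(triple)
```

Each service only adds statements about nodes it was given, so the second service can consume the first one's output directly. A real SADI service would exchange serialised RDF over HTTP rather than in-memory tuples.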
  10. COMPLETE SEMANTIC DESCRIPTIONS OF SERVICE FUNCTIONALITY
      • SADI services carry semantic descriptions of their I/O that completely define what the service expects and can accept as input, and what RDF assertions the service can output.
      • Unique and extremely powerful property: it facilitates completely automatic discovery and orchestration of services.
  11. HYDRA QUERY ENGINE
      • Given a SPARQL query, HYDRA analyses it by using an intelligent logic-based algorithm (proprietary, unlike SADI itself).
      • HYDRA requests descriptions of potentially useful services from available SADI service registries.
      • HYDRA processes the descriptions and figures out which services have to be invoked, on what data and in what order.
      SPARQL is a W3C standard semantic query language -- much more intuitive than SQL.
  12. QUERY EXAMPLE
      • Find documents mentioning "haloalkane dehalogenase activity", extract information about mutations and visualise the mutations on 3D protein structure images.
      • HYDRA automatically finds and orchestrates 5 services from our registry:
        – PubMed search: keyword query ⟶ document PubMed IDs
        – PDF retrieval: PubMed ID ⟶ PDF file URL
        – ASCII extraction: PDF file ⟶ ASCII text
        – Text mining: ASCII text ⟶ mutation info
        – Visualisation: mutation & protein ⟶ 3D image (Jmol)
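HYDRA's planning algorithm is proprietary and is not described in these slides. Purely as an illustration of how an invocation order could be derived from service descriptions, the sketch below applies naive forward chaining to the five services of slide 12. The `ServiceDescription` class, the `plan` and `describe` functions and the string labels are assumptions made for this example; real SADI descriptions are OWL classes, not sets of strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceDescription:
    name: str
    needs: frozenset      # what the input must already carry
    provides: frozenset   # what the service attaches to it


def plan(registry, known, wanted):
    """Naive forward chaining: keep picking any service whose inputs are
    available and that contributes something new, until the goal is covered."""
    known, wanted, order = set(known), set(wanted), []
    progress = True
    while not wanted <= known and progress:
        progress = False
        for svc in registry:
            if svc.name in order:
                continue
            if svc.needs <= known and not svc.provides <= known:
                order.append(svc.name)
                known |= svc.provides
                progress = True
    if not wanted <= known:
        raise ValueError("no combination of registered services covers the query")
    return order


def describe(name, needs, provides):
    return ServiceDescription(name, frozenset(needs), frozenset(provides))


# The five services of slide 12, deliberately registered out of order.
# Assumption: "mutation info" stands for the mutation and protein data
# that the visualisation service takes as input.
REGISTRY = [
    describe("Visualisation", {"mutation info"}, {"3D image"}),
    describe("ASCII extraction", {"PDF file URL"}, {"ASCII text"}),
    describe("PubMed search", {"keyword query"}, {"PubMed ID"}),
    describe("Text mining", {"ASCII text"}, {"mutation info"}),
    describe("PDF retrieval", {"PubMed ID"}, {"PDF file URL"}),
]

order = plan(REGISTRY, known={"keyword query"}, wanted={"3D image"})
print(" -> ".join(order))
```

This greedy loop would also pick up services that happen to be applicable but do not contribute to the goal, so a practical engine would need to prune the plan and decide which data each call runs on.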
  13. RESULTS
      Deploying mutation impact text-mining software with the SADI Semantic Web Services framework
      http://www.biomedcentral.com/qc/1471-2105/12/S4/S6
  14. WHAT IS SO COOL ABOUT IT?
      • Data federation at its best:
        – independent, heterogeneous data sources (PubMed doc search, PubMed Central for PDFs);
        – not only data is integrated: ASCII extraction, text mining and 3D visualisation are algorithms!
      • Execution is completely automatic: HYDRA finds and invokes the services without any help from the user.
  15. MORE QUERY EXAMPLES
      • Find drug products that contain active ingredient X.
      • Find drugs that have been studied in clinical trials targeting infections caused by bacteria X.
      • Annotate a DNA sequence X with molecular functions of proteins produced by the corresponding gene.
      • Find patients with precondition X diagnosed with infections Y resulting from procedure Z.
      • Many, many other questions that Life Scientists and Clinicians ask on a daily basis.
  16. IT’S ONLY ½ OF THE STORY
  17. REMEMBER THE BIG VISION?
  18. HERE IS AN EVEN BIGGER VISION:
      Self-service ad hoc querying of federated data.
  19. HYDRA IMPLEMENTS SEMANTIC QUERYING
      • Users need not know how the source data is organised or accessed.
      • They just need to know the terminology of their subject domain.
      • Queries are completely declarative: specify what you want to find, not how.
  20. HYDRA ALSO SUPPORTS CONCEPT HIERARCHIES AND RULES
      • Some queries would be too complex if we could not exploit generality:
        – a query concerning all antibiotics requires generalisation, otherwise all types of antibiotics would have to be enumerated in the query.
      • A much better way to do this is to import a classification of drugs and use it in query execution.
      • HYDRA facilitates such reasoning and even more complex reasoning with rules.
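The antibiotics point on slide 20 comes down to following subclass links instead of enumerating every class in the query. A minimal sketch, assuming a tiny hand-written classification (the class names, the two dictionaries and both functions are illustrative and are not part of HYDRA or of any imported drug ontology):

```python
# Toy classification: child class -> parent class.
SUBCLASS_OF = {
    "Penicillin": "Antibiotic",
    "Cephalosporin": "Antibiotic",
    "Aminopenicillin": "Penicillin",
    "NSAID": "AntiInflammatory",
}

# Each drug is asserted to belong to its most specific class only.
DRUG_CLASS = {
    "amoxicillin": "Aminopenicillin",
    "cefalexin": "Cephalosporin",
    "ibuprofen": "NSAID",
}


def is_a(cls, ancestor):
    """Walk up the hierarchy until the ancestor is found or the root is passed."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def drugs_in_class(ancestor):
    """All drugs whose asserted class is the ancestor or one of its subclasses."""
    return sorted(drug for drug, cls in DRUG_CLASS.items() if is_a(cls, ancestor))


antibiotics = drugs_in_class("Antibiotic")
print(antibiotics)
```

The query names only `Antibiotic`; the hierarchy supplies the rest, so adding a new subclass to the classification requires no change to the query.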
  21. THERE ARE NO OBSTACLES IN PRINCIPLE TO SELF-SERVICE QUERYING
      We just need an adequate user interface for building queries.
  22. HYDRA QUERY TOOL = ENGINE + GUI
  23. QUERY COMPOSITION
      Queries are built by entering “Google-like” keyphrases:
      Keyphrase: “document mentions protein “P22607””
  24. A QUERY GRAPH IS GENERATED FOR THE KEYPHRASE
      “document mentions protein “P22607””
  25. ADDING ANOTHER KEYPHRASE
      Keyphrase: “has pubmed id”
  26. QUERY GRAPH IS EXTENDED WITH NODES CORRESPONDING TO THE SECOND KEYPHRASE
      Keyphrase: “has pubmed id”
      Keyphrase: “document mentions protein “P22607””
  27. OPTION 2: MANUALLY ADD/DELETE CLASSES, INCOMING AND OUTGOING PROPERTIES
  28. MANUALLY ADDED PROPERTY
  29. FINISHED QUERY: FIND PUBMED IDS OF DOCUMENTS MENTIONING PROTEIN P22607 AND CO-MENTIONED PROTEINS
  30. SERVICES IN THE REGISTRY
  31. SPARQL GENERATION
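The SPARQL that the tool generates appears on the slide only as a screenshot and is not reproduced in this transcript. As a rough sketch of what turning a query graph into SPARQL could look like for the finished query of slide 29, the code below serialises a list of triple patterns. The `to_sparql` function, the `ex:` prefix and IRI, and the property names are all assumptions for illustration, not the vocabulary or the generator used by the HYDRA GUI.

```python
def to_sparql(patterns, select, prefixes, filters=()):
    """Serialise triple patterns into a SPARQL SELECT query string."""
    lines = [f"PREFIX {prefix}: <{iri}>" for prefix, iri in prefixes.items()]
    lines.append("SELECT " + " ".join(select))
    lines.append("WHERE {")
    lines += [f"  {s} {p} {o} ." for s, p, o in patterns]
    lines += [f"  FILTER ({expr})" for expr in filters]
    lines.append("}")
    return "\n".join(lines)


# Query graph edges: documents mentioning protein P22607, their PubMed IDs,
# and any other protein mentioned in the same document.
patterns = [
    ("?doc", "ex:mentions", "?protein"),
    ("?protein", "ex:hasAccession", '"P22607"'),
    ("?doc", "ex:hasPubMedId", "?pmid"),
    ("?doc", "ex:mentions", "?other"),
    ("?other", "ex:hasAccession", "?otherAccession"),
]

query = to_sparql(
    patterns,
    select=["?pmid", "?otherAccession"],
    prefixes={"ex": "http://example.org/vocab#"},
    filters=["?other != ?protein"],
)
print(query)
```

Each edge of the query graph becomes one triple pattern, and nodes left unbound in the GUI become variables.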
  32. QUERY EXECUTION WITH THE HYDRA ENGINE
  33. EXPORTED RESULTS IN AN EXCEL SPREADSHEET
  34. SADI AND HYDRA QUERY TOOL AT WORK
  35. BIOINFORMATICS AND CHEMINFORMATICS CASE STUDIES AND PILOTS WITH SADI AND HYDRA
      • Integrating genomics text mining results with online biomedical data and visualisation algorithms.
      • Integrating programs for lipid molecule structural analysis and classification.
      • Interpreting toxicity experiment data by discovering relevant info in online databases.
      • Large-scale retrieval of toxicity information from publications.
  36. INTERPRETING TOXICITY EXPERIMENT DATA
      • Partner: university lab studying effects of environmental pollutants.
      • Querying needs: finding relevant prior experiments, gene annotation, protein domain annotation, etc.
      • Data sources: ArrayExpress, BLAST, HMMER3, RefSeq, Pfam, ORFPredictor, GO, UniProt, NCBI Taxonomy -- all queried as a single DB!
  37. SUBTASK: DNA MICROARRAY ANNOTATION
      • Toxicity experiments with microarrays: which DNA sequences are under/overexpressed after an organism’s exposure to toxin X?
      • Interpretation requires knowing affected protein functions and domains.
      • HYDRA virtually implements this workflow:
  38. RETRIEVAL OF TOXICITY DATA FROM PUBLICATIONS
      • Customer: government agency (Canada).
      • Querying needs: online publication search by organism and chemical types, text-mining for toxicity data.
      • Data sources: NCBI Taxonomy and ChEBI with free-text search, PubMed search, electronic libraries, journal Web sites, Google Scholar, specialised text-mining algorithm, text utilities.
      Apparent value: some queries save many man-weeks of a postdoc's work.
  39. CLASSIFYING NEW LIPID MOLECULES
      • One of the early experiments with SADI.
      • A group at Carleton U. had a program for identifying functional groups in a molecule structure.
      • A group at U. of New Brunswick had a classifier estimating lipid classes based on presence/absence of functional groups.
      • Publishing the prototypes as SADI services allowed us to integrate them with each other and relevant external resources.
  40. CLINICAL IT CASE STUDIES AND PILOTS WITH SADI AND HYDRA
      • Ad hoc querying of clinical data for Hospital Acquired Infections surveillance and research (with UNB, McGill SoM and Ottawa H.)
      • Ongoing pilot with a US hospital.
      • Looking for pilot opportunities for Clinical Trial Cohort selection:
        – trial eligibility criteria can be implemented as queries over heterogeneous and distributed clinical data;
        – benefits: cost reduction and timely alerts.
  41. THANK YOU!
      Further materials/services are available on request:
      • Live and recorded demos.
      • Publications on previous (academic) case studies.
      • Training/consulting.
      • http://ipsnp.com/ (Canada) and http://ipsnp.co/ (UK)
