Pistoia Alliance SESL pilot Bio IT World Hanover 12 Oct 2011

Towards a brokering framework for knowledge-based services: learning from the Pistoia Alliance SESL pilot
Ian Harrow PhD for the Pistoia Alliance

This presentation describes a pilot project to determine the feasibility of biomedical knowledge brokering. It demonstrates querying across multiple disparate data sources through a brokering demonstrator built on RDF triple store technology. The lessons from this pilot are contributing to larger-scale projects such as the Innovative Medicines Initiative project Open PHACTS.

Published in: Technology, Business

Transcript

  • 1. Towards a brokering framework for knowledge-based services: Learning from the Pistoia Alliance SESL pilot
      Ian Harrow, PhD
      Co-Leader of Pistoia Alliance SESL pilot (ex-Pfizer)
      Founder, Director & Principal Consultant at Ian Harrow Consulting Ltd
      Bio IT World, Hanover, October 2011
      http://pistoiaalliance.org
  • 2. Outline
      • Industry Drivers
      • Mission and Strategy of Pistoia
      • Vision for the SESL pilot
      • Minimal configuration to test a brokering service
      • Public demonstrator and standards
      • Deliverables achieved by SESL pilot
      • Learning and future direction
  • 3. What is Core to your Business? What is Critical?
      [Figure: 2x2 matrix on the axes "Core?" and "Critical?", with year markers 1990 and 2012. Quadrant labels: Focus Staff on Innovation; Externalize for Best Practices; Reduce Non-Value Added Work; Externalize for Cost Reduction]
  • 4. Why the Pistoia Alliance?
      • Industry was at a crossroads (Henry Chesbrough, UC Berkeley, 2011)
          – Change in business models required
      • We are all in this (mess) together (Life Science, technology vendors, service IT, academia, etc.)
      • Need industry-applicable services and standards
      • Collect all the stakeholders together
          – Agree on commonly shared, pre-competitive use cases
      • Focus on delivery of proofs of concept to stimulate and foster new business models
  • 5. The Mission of the Pistoia Alliance
      Lowering the barriers to innovation by improving the interoperability of R&D business processes via pre-competitive collaborations
  • 6. [Slide contains no transcribed text]
  • 7. Pistoia Alliance Membership, Sept 2011
  • 8. A Reality Check: Setting Expectations
  • 9. Signpost clearly
  • 10. Pistoia Strategy
  • 11. Domains of Action
      • Biology & Translational Medicine
      • Chemistry
      • Scientific Collaboration
  • 12. The Focus of Each Domain
      • Biology: Big Data, Analytics, Semantics
      • Chemistry: Supply Chain, Tech Transfer
      • Scientific Collaboration: Vocabularies, Use Cases, Best Practices
  • 13. Try this at your desk…
      Which diseases are correlated to the gene TCF7L2?
      Sources to search separately:
      • Gene/Protein
      • Literature – Abstracts
      • Literature – Full Text
      • Inherited diseases
      • Gene expression
  • 14. Try it again with Pistoia’s SESL…
      • Gene naming/synonyms
      • Gene Function
      • Literature statistics
      • Disease co-occurrences
      • Gene/protein interactions
      …all in one report from one search
      HOW? A standard vocabulary, data model, query language, report structure, etc.
  • 15. SESL Pilot project description
      • Deliverables:
          – Publication of standards and recommendations for brokering service implementation
          – Public demonstrator service for a single disease area
          – Dialogue and assessment of potential business impact with key content suppliers
      • Scope:
          – Development of an assertion database in combination with a user interface and associated web services for one disease/indication/phenotype of broad interest: Type II Diabetes
          – Assertional content derived from 3 structured data sources and limited journal content (co-occurrence and statistical derivation from full text)
          – Assertional evidence for filtering and drill-down to primary data
          – Limited vocabulary development for area of focus: Type II Diabetes
      • Participants and Cost:
          – AZ, Pfizer, GSK, Roche, Unilever, EMBL-EBI, NPG, OUP, Elsevier & RSC
          – Single contract between Pistoia Alliance & EMBL-EBI
          – £200K cost (= 2 x FTEs), shared by industry
          – 12-month project, January 2010 start
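The scope on slide 15 mentions assertions derived from gene and disease co-occurrence in full text. A minimal sketch of sentence-level co-occurrence counting follows. The two vocabularies, the naive sentence splitter, and the sample text are all assumptions made for illustration; the slides do not describe the pilot's actual text-mining pipeline at this level.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical mini-vocabularies for illustration only.
GENES = {"TCF7L2", "PPARG", "KCNJ11"}
DISEASES = {"type 2 diabetes", "obesity"}


def cooccurrences(text: str) -> Counter:
    """Count (gene, disease) pairs that are mentioned in the same sentence."""
    counts: Counter = Counter()
    # Naive split: break after ., ! or ? when followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        genes = {g for g in GENES if re.search(rf"\b{re.escape(g)}\b", sentence)}
        diseases = {d for d in DISEASES if d in lowered}
        for pair in product(sorted(genes), sorted(diseases)):
            counts[pair] += 1
    return counts


# Invented sample text, not taken from any corpus used in the pilot.
sample = (
    "Variants in TCF7L2 are discussed alongside type 2 diabetes. "
    "PPARG is mentioned with type 2 diabetes and obesity. "
    "KCNJ11 encodes a potassium channel."
)
result = cooccurrences(sample)
for (gene, disease), n in sorted(result.items()):
    print(gene, disease, n)
```

Counting per sentence rather than per document is a design choice in this sketch: it yields fewer but more specific pairs, which suits drill-down to text extracts.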
  • 16. The Knowledge Service Framework
      [Figure: layered architecture, top to bottom]
      – ‘Consumer’ Firewall: Multiple Consumers; Disease Dossier; Knowledge Applications
      – Open Standards (Broker): Service Layer; Assertion & Meta Data Management; Transform / Translate (RDF triples); Integrator / Aggregator (Triple store); alongside Std Public Vocabularies and Common Service Business Rules
      – Supplier Firewall: Content Suppliers (Db 2, Db 4, Corpus 1, Db 3, Corpus 5)
  • 17. Minimal configuration to test the technical feasibility of a Knowledge Broker Service
      [Figure: two brokers side by side beneath a shared user interface]
      – Interface Layer: User Interface
      – Brokering service Layer: Broker #1 and Broker #2, each with a Service Layer, Std Public Vocabularies, Assertion & Meta Data Mgmt, Transform / Translate, Query templates, and its own triple store (Triple store 1, Triple store 2)
      – Condition: identical structure; different content, which can overlap
      – Primary source Layer: RSC corpus; UK-Pubmed Central corpus; NPG corpus; OUP corpus; Elsevier corpus; EBI Uniprot database (shown twice); EBI Array Express database; NCBI OMIM database
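The two-broker configuration on slide 17 (identical structure, different content that can overlap) can be sketched as below. This is an in-memory toy under stated assumptions, not the demonstrator's implementation: the `TripleStore` class, the `broker_query` function, the predicate names, and the triples themselves are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class TripleStore:
    """Hypothetical stand-in for one broker's triple store."""

    def __init__(self, name: str, triples: Iterable[Triple]) -> None:
        self.name = name
        self._triples: Set[Triple] = set(triples)

    def match(self, s: Optional[str] = None, p: Optional[str] = None,
              o: Optional[str] = None) -> Set[Triple]:
        """Return triples matching the pattern; None acts as a wildcard."""
        return {
            t for t in self._triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        }


def broker_query(stores: List[TripleStore], s: Optional[str] = None,
                 p: Optional[str] = None,
                 o: Optional[str] = None) -> Dict[Triple, List[str]]:
    """Send one pattern to every store, merge hits, and keep provenance."""
    merged: Dict[Triple, List[str]] = {}
    for store in stores:
        for triple in store.match(s, p, o):
            merged.setdefault(triple, []).append(store.name)
    return merged


# Invented example content; predicate names are assumptions.
broker1 = TripleStore("broker1", [
    ("TCF7L2", "associatedWith", "type 2 diabetes"),
    ("TCF7L2", "expressedIn", "pancreas"),
])
broker2 = TripleStore("broker2", [
    ("TCF7L2", "associatedWith", "type 2 diabetes"),  # overlaps with broker1
    ("TCF7L2", "interactsWith", "CTNNB1"),
])

hits = broker_query([broker1, broker2], s="TCF7L2")
for triple, sources in sorted(hits.items()):
    print(triple, sources)
```

Because both stores share one structure, the same pattern can be sent to each unchanged; overlapping assertions collapse into one result that lists every store asserting it.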
  • 18. Simple Graphical User Interface to the SESL public demonstrator
      1. Single point of query through a simple GUI
          – A. Gene Query and/or B. Disease Query
          – Show: filtered by 1) Everything 2) Consensus 3) Co-occurrence 4) OMIM 5) Array Express
      2. Aggregated results on a single web page
          – A. Gene query results summary: 1) Co-occurrence documents 2) Uniprot names and annotation 3) OMIM disease names 4) Array Express disease and/or pancreas expression 5) Uniprot GO terms 6) Uniprot binary interactions
          – B. Disease query results summary: 1) Co-occurrence documents 2) OMIM disease names 3) Array Express disease expression
          – Full text detail (for A and B): Title, Authors, Citation; co-occurrence of gene and disease mentions in text extracts
      The results include links out to the primary sources
      SESL public demonstrator: http://www.pistoia-sesl.org
  • 19. Type 2 diabetes genes in SESL demonstrator
      Columns:
        (a) Source: UniProt diabetes mention
        (b) SESL: UniProt diabetes mention
        (c) Google Scholar: type 2 diabetes, 2006 to June 2011
        (d) Pubmed: type 2 diabetes, June 2011
        (e) SESL: gene and type 2 diabetes co-occurrence in Full Text
        (f) Source: OMIM diabetes mention
        (g) SESL: OMIM diabetes mention
        (h) Source: Array Express Atlas pancreas
        (i) SESL: Array Express pancreas
        (j) Source: Uniprot GO terms
        (k) SESL: GO terms
        (l) Source: Uniprot Intact binary interactions
        (m) SESL: Binary interactions

      Human protein name | Human gene name | a | b | c | d | e | f | g | h | i | j | k | l | m
      ATP-binding cassette sub-family C member 8 | ABCC8 | 1 | 1 | 753 | 37 | 6 | 6 | 6 | 5 | 7 | 7 | 9 | 0 | 0
      Calpain-10 | CAPN10 | 1 | 1 | 810 | 168 | 21 | 1 | 1 | 1 | 1 | 12 | 12 | 0 | 0
      Glucokinase | GCK | 1 | 1 | 3,950 | 708 | 12 | 7 | 7 | 0 | 0 | 19 | 19 | 2 | 2
      Hematopoietically-expressed homeobox protein | HHEX | 0 | 0 | 626 | 91 | 24 | 1 | 0 | 2 | 2 | 21 | 23 | 3 | 0
      Hepatocyte nuclear factor 1-alpha | HNF1A | 1 | 1 | 633 | 340 | 23 | 3 | 4 | 2 | 2 | 12 | 12 | 6 | 6
      Hepatocyte nuclear factor 1-beta | HNF1B | 1 | 1 | 408 | 269 | 20 | 1 | 1 | 2 | 2 | 9 | 8 | 1 | 0
      Hepatocyte nuclear factor 4-alpha | HNF4A | 1 | 1 | 811 | 173 | 34 | 2 | 2 | 3 | 3 | 22 | 20 | 5 | 5
      Insulin | INS | 2 | 1 | 166,000 | 37,670 | 5 | 9 | 0 | 7 | 0 | 59 | 59 | 0 | 0
      Insulin receptor substrate 1 | IRS1 | 1 | 1 | 7,970 | 616 | 9 | 1 | 0 | 2 | 2 | 24 | 24 | 3 | 0
      Insulin receptor | INSR | 1 | 1 | 14,00 | 4,830 | 16 | 2 | 4 | 6 | 6 | 41 | 43 | 9 | 9
      ATP-sensitive inward rectifier potassium channel 11 | KCNJ11 | 1 | 1 | 1,260 | 45 | 35 | 3 | 1 | 0 | 0 | 12 | 12 | 1 | 0
      Hepatic triacylglycerol lipase | LIPC | 1 | 0 | 2,090 | 89 | 1 | 1 | 1 | 1 | 1 | 17 | 17 | 0 | 0
      C-Jun-amino-terminal kinase-interacting protein 1 | MAPK8IP1 | 1 | 1 | 248 | 4 | 1 | 1 | 1 | 1 | 1 | 6 | 6 | 4 | 4
      Neurogenic differentiation factor 1 | NEUROD1 | 1 | 1 | 549 | 50 | 7 | 2 | 2 | 2 | 4 | 13 | 14 | 0 | 0
      Pancreas/duodenum homeobox protein 1 | PDX1 | 1 | 1 | 2,270 | 154 | 9 | 2 | 0 | 1 | 1 | 9 | 9 | 0 | 0
      Peroxisome proliferator-activated receptor gamma | PPARG | 1 | 1 | 9,540 | 1,556 | 48 | 1 | 1 | 2 | 2 | 40 | 42 | 7 | 7
      Protein phosphatase 1 regulatory subunit 3A | PPP1R3A | 1 | 1 | 141 | 23 | 3 | 1 | 0 | 1 | 1 | 2 | 2 | 0 | 0
      Zinc transporter 8 | SLC30A8 | 1 | 0 | 724 | 117 | 0 | 2 | 1 | 3 | 4 | 13 | 13 | 0 | 0
      Transcription factor 7-like 2 | TCF7L2 | 1 | 1 | 2,000 | 284 | 65 | 1 | 1 | 3 | 3 | 33 | 31 | 5 | 5
      Mitochondrial brown fat uncoupling protein 1 | UCP1 | 1 | 0 | 1,760 | 50 | 3 | 0 | 0 | 0 | 0 | 6 | 6 | 0 | 0
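The demonstrator's results page offers a "Consensus" filter, but the slides do not define it. The sketch below assumes one plausible reading: a gene passes when the SESL counts for full-text co-occurrence, OMIM, and Array Express are all non-zero. The `GeneEvidence` type and the `consensus` function are hypothetical; the counts are copied from five rows of the slide 19 table.

```python
from typing import List, NamedTuple


class GeneEvidence(NamedTuple):
    gene: str
    cooccurrence: int   # SESL: gene and type 2 diabetes co-occurrence in Full Text
    omim: int           # SESL: OMIM diabetes mention
    array_express: int  # SESL: Array Express pancreas


# Counts taken from the slide 19 table for five of the twenty genes.
ROWS = [
    GeneEvidence("TCF7L2", 65, 1, 3),
    GeneEvidence("GCK", 12, 7, 0),
    GeneEvidence("SLC30A8", 0, 1, 4),
    GeneEvidence("UCP1", 3, 0, 0),
    GeneEvidence("INS", 5, 0, 0),
]


def consensus(rows: List[GeneEvidence]) -> List[str]:
    """Assumed rule: keep genes with evidence in all three SESL sources."""
    return [
        r.gene for r in rows
        if r.cooccurrence > 0 and r.omim > 0 and r.array_express > 0
    ]


consensus_genes = consensus(ROWS)
print(consensus_genes)
```

Under this assumed rule only TCF7L2 passes among the five rows, since each of the others has a zero in at least one source.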
  • 20. Gene discovery in SESL demonstrator
      [Figure: Venn diagram of gene count intersections from the data sources in the demonstrator. Sets: pancreas gene expression in Array Express db; T2D disease mention in OMIM db; T2D disease genes in Full Text documents; T2D disease gene mention in Uniprot db]
  • 21. Selected content loaded as RDF triples
      Source description | # triples | %
      Expression data from Array Express | 182,840 | 0.5%
      Experimental Factor Ontology from Array Express | 49,026 | 0.1%
      Disease vocabulary from UMLS | 6,906,735 | 18.8%
      Vocabulary from Disease Ontology | 1,863,664 | 5.1%
      Terms from Gene Ontology | 495,595 | 1.3%
      Human genes from Uniprot | 12,552,239 | 34.1%
      Meta data from Full Text documents | 3,485,212 | 9.5%
      Gene annotations from Full Text documents | 2,373,584 | 6.5%
      Disease annotations from Full Text documents | 4,983,788 | 13.6%
      GO annotations from Full Text documents | 3,870,834 | 10.5%
      Totals | 36,763,517 | 100%
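The total and percentage columns on slide 21 can be recomputed from the per-source triple counts. The short check below uses only the counts from the slide; the variable names are illustrative.

```python
# Triple counts per source, copied from slide 21.
TRIPLE_COUNTS = {
    "Expression data from Array Express": 182_840,
    "Experimental Factor Ontology from Array Express": 49_026,
    "Disease vocabulary from UMLS": 6_906_735,
    "Vocabulary from Disease Ontology": 1_863_664,
    "Terms from Gene Ontology": 495_595,
    "Human genes from Uniprot": 12_552_239,
    "Meta data from Full Text documents": 3_485_212,
    "Gene annotations from Full Text documents": 2_373_584,
    "Disease annotations from Full Text documents": 4_983_788,
    "GO annotations from Full Text documents": 3_870_834,
}

total = sum(TRIPLE_COUNTS.values())

# Share of each source as a percentage, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in TRIPLE_COUNTS.items()}

print(f"{total:,} triples in total")
for name, pct in shares.items():
    print(f"{pct:5.1f}%  {name}")
```

The recomputed total is 36,763,517, matching the slide. Roughly 40% of the triples come from annotations and metadata of full-text documents (the last four rows combined).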
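  • 22. Signposting: Standards used in SESL
      Category | Name | Community
      Triple Store | RDF | W3C
      Triple Store | SPARQL | W3C
      Triple Store | Jena, Sesame, Virtuoso | Open Source
      Text Mining | IeXML | EBI & CALBC
      Text Mining | LexEBI/BioLexicon | EBI, NaCTeM, U of Pisa
      Text Mining | CALBC | EBI & CALBC
      URIs (blending of existing standards) | UniProt | EBI, PIR, SIB, etc.
      URIs (blending of existing standards) | Disease Ontology and UMLS | OBO, NIH/NLM
      URIs (blending of existing standards) | ArrayExpress | EBI
      URIs (blending of existing standards) | NCBI Taxonomy | NCBI
      URIs (blending of existing standards) | Dublin Core | W3C
      RDF Schema | N3 notation | W3C
      RDF Schema | Co-occurrence of gene-disease | EBI
      RDF Schema | PMC doc standard | NCBI
      RDF Schema | Relation ontology | OBO
      RDF Schema | Ontology URI server | W3C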
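Slides 17 and 22 mention query templates and SPARQL as the query language. A minimal sketch of filling a SPARQL template for a gene query is shown below. The `ex:` namespace, the predicates `ex:symbol`, `ex:associatedWith`, and `ex:assertedBy`, and the function name are all hypothetical; the slides do not give the demonstrator's actual schema.

```python
import re
from string import Template

# Hypothetical query template; the namespace and predicates are assumptions.
QUERY_TEMPLATE = Template("""\
PREFIX ex: <http://example.org/sesl/>
SELECT ?disease ?source
WHERE {
  ?gene ex:symbol "$symbol" .
  ?gene ex:associatedWith ?disease .
  ?gene ex:assertedBy ?source .
}
""")


def build_gene_query(symbol: str) -> str:
    """Fill the template after checking the symbol is a plain identifier."""
    # Reject anything that could break out of the quoted literal.
    if not re.fullmatch(r"[A-Za-z0-9-]+", symbol):
        raise ValueError(f"unexpected gene symbol: {symbol!r}")
    return QUERY_TEMPLATE.substitute(symbol=symbol)


query = build_gene_query("TCF7L2")
print(query)
```

Keeping the query in a template means each broker can run the same text against its own triple store, which fits the "identical structure" condition on slide 17.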
  • 23. The Deliverables of the SESL pilot
      • A proof-of-concept to demonstrate feasibility and clarify requirements
          – http://www.pistoia-sesl.org
      • A functional specification for query brokering, result filtering, report generation
          – Expect publication by end 2011
          – http://www.pistoiaalliance.com/workinggroups/sesl.html
      • Academia, Life Science Industry and Publishers
          – Attained a better understanding of each other’s needs
          – Demonstration of potential for a new business model
          – Explore follow-on via Open Innovation consortia
  • 24. Learning and Future Direction
      • Framework to maximise re-use of existing standards
          – Minimise use of bespoke, hard-coded implementations
      • Crucial features of a knowledge brokering service:
          – RDF triples for a scalable meta-index to broker across primary sources (both databases and literature)
          – Important to define business rules for query & extraction
          – Recommend a registry of suitable data sources, similar to a web services registry
      • What is next?
          – Example follow-on to the SESL pilot: Open PHACTS consortium => www.openphacts.org
          – 3-year IMI pre-competitive project (started early 2011)
          – Data providers and Life Science industry working together
  • 25. Acknowledgements
      Industry
          – Wendy Filsell, Unilever (SESL co-leader)
          – Ian Stott, Unilever
          – Nigel Wilkinson, PFE
          – Catherine Marshall, PFE
          – Peter Woollard, GSK
          – Ashley George, GSK
          – Mike Westaway, AZ
          – Nick Lynch, AZ
          – Ian Dix, AZ
          – Michael Braxenthaler, Roche
          – John Wise, Pistoia Alliance
      EMBL-EBI
          – Dietrich Rebholz-Schuhmann (Technical Team Leader)
          – Christoph Grabmueller
          – Silvestras Kavaliauskas
          – Dominic Clark
          – Rodrigo Lopez
          – Jo McEntyre, UK-PMC
          – Janet Thornton
      Publishers
          – Claire Bird, OUP
          – Richard O’Beirne, OUP
          – Colin Batchelor, RSC
          – Richard Kidd, RSC
          – David Hoole, NPG
          – Alf Eaton, NPG
          – Jabe Wilson, Elsevier
          – Bradley Allen, Elsevier