Qiagram: A Novel Interface for Scientists to Mine and Understand Big Data with Natural-Language Queries

Life sciences is fast becoming a data problem. In this presentation we explore the challenges faced by scientists wishing to leverage life science and healthcare big data. We demonstrate Qiagram, a collaborative, visual, ad hoc query tool for exploring these large, complex data sets. Using examples from the Adverse Event Reporting System (AERS) database, MedDRA, and SNOMED, we illustrate how scientists with little IT knowledge can mine these data sets and unlock their potential.

Transcript

  • 1. QIAGRAM: A NOVEL INTERFACE TO MINE AND UNDERSTAND LARGE DATA SETS WITH NATURAL LANGUAGE QUERIES. Making data useful by making data smarter. Matthew Clark, Ph.D., BioFortis. February 6, 2013. Copyright © 2012 - Proprietary & Confidential. Copyright © 2009 Proprietary & Confidential
  • 2. Life sciences is fast becoming a data problem. "Beyond the obvious issues of scale and reproducibility, the complexity and diversity of …data poses the greatest challenge to unlocking knowledge and scientific discovery."* *Higdon et al. (2012), Unraveling the Complexities of Life Sciences Data, DOI: 10.1089/big.2012.1505
  • 3. Big Data Challenges
      • Volume: large amounts of data
      • Veracity: the credibility/quality of the data, how trusted it is
      • Velocity: need for rapid analysis
      • Value: actionable outcomes for an organization
      • Variety: many disparate types
  • 4. A multitude of 'omics and more
      • 'Omics: genomics, proteomics, cellomics, metabolomics, lipidomics, transcriptomics
      • High-throughput technologies: NGS, imaging, mass spectrometry, high-capacity flow, arrays
      • Other data: healthcare data (EMR), demographics, adverse events, clinical trials
    Data is being collected at a prodigious rate, but it is not always clear how to use it
  • 5. Potential is huge…
      • Targeted trials
      • Adverse events from pharmacy/clinical data
      • Segment patients based on profiles/responses
      • Biomarker discovery
      • Observational/outcome studies
      • "Virtual" clinical trials
      • In-silico discovery
    Enormous promise… Enormous challenges
  • 6. Big data challenges - general. Source: The Economist, 2011, Big Data, Harnessing a Game-changing Asset
  • 7. Barriers to extracting value. Source: The Economist, 2011, Big Data, Harnessing a Game-changing Asset
  • 8. Big Data in life sciences/healthcare
      • Multiple disparate data sources
      • Lack of integration across patient, molecular, clinical, assay, and payer data
      • "Swiss cheese" problem
      • Data cleansing/verification/credibility
      • Standards for data interchange
      • Privacy concerns
      • Lack of good tools for cross-domain analytics
  • 9. Changing paradigms
      • Hypothesis driven
          – Traditional: test the hypotheses, scientific method
          – Example: what's the mechanism of action of this drug?
      • Discovery driven
          – More open, questioning; enumerates elements to drive hypotheses
          – Example: what data do I have, and what's interesting?
      • Hybrid
          – Discovery driven + hypotheses
          – Example: Human Genome Project
  • 10. Data Exploration. Often neglected, but now key to getting value from life science big data
  • 11. What is Data Exploration?
      • Occurs before in-depth statistics/analytics
      • Explore and probe the data
      • Determine "what's interesting" and "what's relevant"
      • Generate hypotheses
      • Ensure data is there to support hypotheses
  • 12. Asking questions of the data. Very complex query that touches many of the 5 V's of Big Data
  • 13. Asking questions of the data. Diagram labels: multiple data sources; more data sources; domain expertise. Requires considerable IT resources to program this query
  • 14. Data Exploration challenges
      • Programming is hard
          – Mostly SQL, SAS
      • Lack of a shared language to support collaboration
          – Multidisciplinary data requires domain experts
      • Meaningful access to data
          – Sensitive to regulatory/compliance
    "The hands-on analytics time to write the SAS code and specify clearly what you need for each hypothesis is very time-consuming," Felix Frueh, CEO, Medco*
    *Miller, K. Big Data Analytics, Biomedical Computation Review, Winter 2011/2012
  • 15. The Problem
      • Data managers, overwhelmed by researchers' questions on complex data sources
      • Researchers with many questions across disciplines
      • No common language for questions. Example query against clinical and molecular data:
          SELECT DISTINCT S.PATIENT_ID, S.SAMPLE_ID, S.SAMPLE_NAME
          FROM SAMPLE_INVENTORY S
          INNER JOIN PATIENTS P ON S.PATIENT_ID = P.PATIENT_ID
          INNER JOIN DIAGNOSIS D ON S.PATIENT_ID = D.PATIENT_ID
          INNER JOIN MEDICATIONS M ON S.PATIENT_ID = M.PATIENT_ID
          INNER JOIN BIOMARKERS B ON S.PATIENT_ID = B.PATIENT_ID
          WHERE D.DIAGNOSIS_NAME = 'LUNG CANCER'
            AND M.MEDICATION_GENERIC_NAME = 'CETUXIMAB'
            AND B.BIOMARKER_NAME = 'EGFR'
            AND B.OBSERVATION = 1
          ORDER BY S.PATIENT_ID, S.SAMPLE_NAME
      • "Weeks to months to NEVER"
      • "Lost in Translation"
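To make the query on this slide concrete, here is a minimal sketch that runs it against an in-memory SQLite database. The table and column names come from the slide; the column types, the sample rows, and the use of SQLite are assumptions made purely for illustration and say nothing about how the real data is stored.

```python
import sqlite3

# Toy in-memory database. Table/column names follow the slide's query;
# the rows are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PATIENTS (PATIENT_ID INTEGER PRIMARY KEY);
CREATE TABLE SAMPLE_INVENTORY (SAMPLE_ID INTEGER, SAMPLE_NAME TEXT, PATIENT_ID INTEGER);
CREATE TABLE DIAGNOSIS (PATIENT_ID INTEGER, DIAGNOSIS_NAME TEXT);
CREATE TABLE MEDICATIONS (PATIENT_ID INTEGER, MEDICATION_GENERIC_NAME TEXT);
CREATE TABLE BIOMARKERS (PATIENT_ID INTEGER, BIOMARKER_NAME TEXT, OBSERVATION INTEGER);

INSERT INTO PATIENTS VALUES (1), (2), (3);
INSERT INTO SAMPLE_INVENTORY VALUES
    (10, 'serum-A', 1), (11, 'tissue-B', 1), (20, 'serum-C', 2), (30, 'serum-D', 3);
INSERT INTO DIAGNOSIS VALUES (1, 'LUNG CANCER'), (2, 'LUNG CANCER'), (3, 'BREAST CANCER');
INSERT INTO MEDICATIONS VALUES (1, 'CETUXIMAB'), (2, 'ERLOTINIB'), (3, 'CETUXIMAB');
INSERT INTO BIOMARKERS VALUES (1, 'EGFR', 1), (2, 'EGFR', 1), (3, 'EGFR', 0);
""")

# The slide's query, with columns qualified by table alias and the literal
# values passed as parameters.
QUERY = """
SELECT DISTINCT S.PATIENT_ID, S.SAMPLE_ID, S.SAMPLE_NAME
FROM SAMPLE_INVENTORY S
INNER JOIN PATIENTS P ON S.PATIENT_ID = P.PATIENT_ID
INNER JOIN DIAGNOSIS D ON S.PATIENT_ID = D.PATIENT_ID
INNER JOIN MEDICATIONS M ON S.PATIENT_ID = M.PATIENT_ID
INNER JOIN BIOMARKERS B ON S.PATIENT_ID = B.PATIENT_ID
WHERE D.DIAGNOSIS_NAME = ?
  AND M.MEDICATION_GENERIC_NAME = ?
  AND B.BIOMARKER_NAME = ?
  AND B.OBSERVATION = 1
ORDER BY S.PATIENT_ID, S.SAMPLE_NAME
"""

rows = conn.execute(QUERY, ("LUNG CANCER", "CETUXIMAB", "EGFR")).fetchall()
for patient_id, sample_id, sample_name in rows:
    print(patient_id, sample_id, sample_name)
```

In this toy data only patient 1 meets all three conditions, so only that patient's samples are returned. Even this small question needs four joins and knowledge of the schema, which is the barrier the slide describes.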
  • 16. Overcoming the challenges
      • Deep Collaboration
          – Easy access to dynamic data
          – Intuitive tools
          – Secure holistic view of data
          – Collaboration
  • 17. Big Data - Deep Collaboration (Qiagram)
      • A single researcher in a silo can often go deep into the data, but may be limited by their domain expertise
      • Small groups of researchers may be able to collaborate on asking questions, but can't go very deep with the tools they have today
      • Deep Collaboration is when multiple groups of researchers can collaborate in asking questions deeper into the layers of data
    Shared domain knowledge allows deeper insights
  • 18. Qiagram – Collaborative Scientific Intelligence
      • Researchers and data managers can collaborate on creating queries
      • Qiagram acts as a shared, visual language for queries over clinical and molecular data
      • More efficient and effective query creation
      • Transparent to all stakeholders
  • 19. Examples: Mining AERS Data
  • 20. Small Data Can Become Big Data. 1000 drugs × 1000 drug categories × 1000 adverse events = 10^9 possible combinations
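The arithmetic behind this slide can be checked in a few lines. The three counts are the ones on the slide; the variable names are just illustrative.

```python
# Three modest lists of 1000 entries each (counts taken from the slide).
n_drugs = 1000
n_drug_categories = 1000
n_adverse_events = 1000

# Every (drug, category, adverse event) triple is a possible combination.
combinations = n_drugs * n_drug_categories * n_adverse_events
print(f"{combinations:,} possible combinations")
```

Each list is small on its own; it is the cross product that reaches a billion.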
  • 21. Introduction to Qiagram
  • 22. Answer the Question: Which Sources Have Data on Cholestasis?
  • 23. Joining Data Sources is Simple
  • 24. Combining Data Sources
      • SNOMED contains a hierarchy of drug and medical terms
          – ~12M records
      • AERS contains reports of adverse events
          – ~70M records
      • MedDRA contains a hierarchy of adverse event terms
          – 150k records
  • 25. SNOMED Ontology
  • 26. MedDRA Hierarchy
      low level term | pref term | hlt pref term | hlgt pref term | soc term
      abdominal migraine | migraine | migraine headaches | headaches | nervous system disorders
      acute migraine | migraine | migraine headaches | headaches | nervous system disorders
      band-like headache | tension headache | headaches nec | headaches | nervous system disorders
      basilar migraine | basilar migraine | migraine headaches | headaches | nervous system disorders
      cephalalgia | headache | headaches nec | headaches | nervous system disorders
      cephalalgia or cephalgia | headache | headaches nec | headaches | nervous system disorders
      cephalgia | headache | headaches nec | headaches | nervous system disorders
  • 27. AERS
      • All drug-related adverse events reported to the FDA since 2000
      • Tables for drugs, demographics, indications, therapy, reactions, outcomes
  • 28. Filter AERS Drugs by SNOMED Categories. Input: AERS drug list. Results in all analgesics in AERS, with associated case #s
  • 29. Count the various MedDRA high-level group terms reported for all drugs in AERS from the SNOMED "antibiotic" category. Diagram inputs: SNOMED categories; AERS drug list; MedDRA hierarchy mapping
  • 30. Top Ten Antibiotic Adverse Event High-Level Group Terms
      Count | hlgt pref term (MedDRA browser)
      692,058 | general system disorders nec
      522,739 | epidermal and dermal conditions
      491,409 | neurological disorders nec
      385,375 | joint disorders
      297,433 | gastrointestinal signs and symptoms
      290,760 | respiratory disorders nec
      278,723 | allergic conditions
      230,134 | cardiac disorder signs and symptoms
      208,985 | injuries nec
      197,446 | gastrointestinal motility and defaecation conditions
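The kind of query behind a ranking like this can be sketched as a chain of joins: SNOMED category to AERS drug list to reported reactions to the MedDRA hierarchy, followed by a count per high-level group term. The sketch below uses SQLite with hypothetical table names, column names, and rows; none of them reflect the actual AERS, SNOMED, or MedDRA schemas, and the counts are not real.

```python
import sqlite3

# Hypothetical, heavily simplified stand-ins for the three sources.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE snomed_category (drug TEXT, category TEXT);   -- drug -> SNOMED category
CREATE TABLE aers_drug (isr INTEGER, drug TEXT);            -- report id -> drug
CREATE TABLE aers_reaction (isr INTEGER, pref_term TEXT);   -- report id -> reaction term
CREATE TABLE meddra (pref_term TEXT, hlgt TEXT);            -- term -> high-level group term

INSERT INTO snomed_category VALUES
    ('amoxicillin', 'antibiotic'), ('levofloxacin', 'antibiotic'), ('ibuprofen', 'analgesic');
INSERT INTO aers_drug VALUES
    (1, 'amoxicillin'), (2, 'levofloxacin'), (3, 'ibuprofen'), (4, 'amoxicillin'), (5, 'levofloxacin');
INSERT INTO aers_reaction VALUES
    (1, 'migraine'), (1, 'rash'), (2, 'headache'), (3, 'migraine'), (4, 'rash'), (5, 'rash');
INSERT INTO meddra VALUES
    ('migraine', 'headaches'), ('headache', 'headaches'),
    ('rash', 'epidermal and dermal conditions');
""")

# Filter drugs by category, follow each report to its reactions, map each
# reaction to its high-level group term, then count and rank.
TOP_HLGT = """
SELECT m.hlgt, COUNT(*) AS n
FROM snomed_category c
JOIN aers_drug d     ON d.drug = c.drug
JOIN aers_reaction r ON r.isr = d.isr
JOIN meddra m        ON m.pref_term = r.pref_term
WHERE c.category = ?
GROUP BY m.hlgt
ORDER BY n DESC, m.hlgt
LIMIT 10
"""

top_terms = conn.execute(TOP_HLGT, ("antibiotic",)).fetchall()
for hlgt, n in top_terms:
    print(f"{n:>3}  {hlgt}")
```

The ibuprofen report is excluded by the category filter, so only the antibiotic reports contribute to the counts.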
  • 31. Data Quality
      • With more data, there is more chance for inconsistencies
      • Need easy ways to dynamically check the data and identify errant records
      • Example: AERS data
  • 32. Query to Locate Patients with Treatment Dates After Death Dates
  • 33. Example Results: over 2,000 results
      isr (drugs) | drug | age | death_dt (demographics) | start_dt (therapies) | days after death
      4016857 | darbepoetin | 59 | 9/5/2002 | 8/12/3003 | 365,583
      4006065 | interferon I | 46 | 8/23/2002 | 1/27/2991 | 361,017
      6013473 | naloxone | 54 | 6/15/1953 | 6/15/2008 | 20,089
      6038344 | combivent | 50 | 12/10/1958 | 12/8/2008 | 18,261
      6038344 | levofloxacin | 50 | 12/10/1958 | 12/8/2008 | 18,261
      6105245 | dexamethasone | 49 | 1/27/1959 | 7/15/2008 | 18,067
      6105245 | bortezomib | 49 | 1/27/1959 | 7/12/2008 | 18,064
      6252126 | enfuvirtide | 50 | 8/29/1956 | 7/8/2005 | 17,845
      6252126 | efavirenz | 50 | 8/29/1956 | 5/10/2002 | 16,690
      6252126 | didanosine | 50 | 8/29/1956 | 5/10/2002 | 16,690
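The check itself is simple once the dates sit side by side: flag every record whose therapy start date falls after the recorded death date and report the gap in days. The sketch below reuses three rows from the example results; the fourth row and the function name `therapy_after_death` are invented for illustration, and reading the first date as the death date and the second as the therapy start date is an interpretation of the flattened table.

```python
from datetime import date

# (isr, drug, death_dt, start_dt). The first three rows come from the example
# results; the last row is invented to show a consistent record being skipped.
records = [
    (4016857, "darbepoetin", date(2002, 9, 5), date(3003, 8, 12)),
    (6013473, "naloxone", date(1953, 6, 15), date(2008, 6, 15)),
    (6038344, "combivent", date(1958, 12, 10), date(2008, 12, 8)),
    (9999999, "hypothetical-drug", date(2008, 1, 10), date(2007, 12, 1)),
]


def therapy_after_death(rows):
    """Return (isr, drug, days_after_death) for therapies starting after death."""
    flagged = []
    for isr, drug, death_dt, start_dt in rows:
        days_after_death = (start_dt - death_dt).days
        if days_after_death > 0:
            flagged.append((isr, drug, days_after_death))
    # Largest gaps first, matching the ordering of the slide's table.
    return sorted(flagged, key=lambda row: row[2], reverse=True)


flagged = therapy_after_death(records)
for isr, drug, days in flagged:
    print(f"{isr}  {drug:<12} {days:>8,} days after death")
```

Gaps this large point to data-entry problems, such as a year typed as 3003 or a birth date stored in the death date field, rather than real treatment histories.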
  • 34. Answer Questions At the Speed of Thought
      • Many "purpose built" systems answer predefined questions
      • However, in data exploration we need the ability to explore new questions
  • 35. Collaborative Experience
      • A team of physicians, informaticians, and safety experts collaboratively explored questions based on large amounts of clinical (SDTM) data:
          – Did subjects who were pre-treated with certain drug classes have the most change in cardiac function?
          – What did the subjects that were outliers in cardiac function change have in common?
      • The team defined baselines, changes, etc.
  • 36. CPRD
      • The Clinical Practice Research Datalink (CPRD) is the new English NHS observational data and interventional research service, jointly funded by the NHS National Institute for Health Research (NIHR) and the Medicines and Healthcare products Regulatory Agency (MHRA)
      • 6 large fact tables with 1 B to 2 B rows
      • Example query: identify patients with coronary artery disease who have taken aspirin, then study readmission rates
  • 37. Thomson Reuters MarketScan
      • Several characteristics set MarketScan databases apart from other research databases. The core databases (Commercial, Medicare Supplemental, and Medicaid) are huge: over 170 million patients since 1995
      • Over 25 fact tables, from 100 M up to 1.5 B rows
      • Example: identify cancer patients, look at opiate treatment, and study the duration of the escalation
  • 38. Premier Research Services
      • Patient-level data is available from more than 600 hospitals, 45 million records and 310 million hospital visits
      • 5 large fact tables, from 100 M to over 4 B rows
  • 39. Cerner Health Facts Database
      • 8 large fact tables, most in the 10s of millions of records
      • Example: look for type II diabetes patients, then study infection rates of these patients based on hospital types
  • 40. Launching in March 2013: cloud-based Qiagram offering with AERS and TCGA data