Qiagram: A Novel Interface for Scientists to Mine and Understand Big Data with Natural-Language Queries

Life sciences is fast becoming a data problem. In this presentation we explore the challenges faced by scientists wishing to leverage life science and healthcare big data. We demonstrate Qiagram, a collaborative, visual, ad hoc query tool for exploring these large, complex data sets. Using examples from the Adverse Event Reporting Database, MedDRA and SNOMED, we illustrate how scientists with little IT knowledge can mine these data sets and unlock their potential.

    1. QIAGRAM: A NOVEL INTERFACE TO MINE AND UNDERSTAND LARGE DATA SETS WITH NATURAL LANGUAGE QUERIES
       Making data useful by making data smarter
       Matthew Clark, Ph.D., BioFortis
       February 6, 2013
       Copyright © 2012 - Proprietary & Confidential
       Copyright © 2009 Proprietary & Confidential
    2. Life sciences is fast becoming a data problem
       "Beyond the obvious issues of scale and reproducibility, the complexity and diversity of …data poses the greatest challenge to unlocking knowledge and scientific discovery."*
       * Higdon et al. (2012), Unraveling the Complexities of Life Sciences Data, DOI: 10.1089/big.2012.1505
    3. Big Data Challenges
       • Volume: large amounts of data
       • Veracity: the credibility/quality of the data, how far it is trusted
       • Velocity: need for rapid analysis
       • Value: actionable outcomes for an organization
       • Variety: many disparate types
    4. A multitude of 'omics and more
       • genomics, proteomics, cellomics, metabolomics, lipidomics, transcriptomics
       • High-throughput technologies: NGS, imaging, mass spectrometry, high-capacity flow, arrays
       • Other data: healthcare data (EMR), demographics, adverse events, clinical trials
       Collecting data at a prodigious rate, but not always clear on how to use it
    5. Potential is huge…
       • Targeted trials
       • Adverse events from pharmacy/clinical data
       • Segment patients based on profiles/responses
       • Biomarker discovery
       • Observational/outcome studies
       • "Virtual" clinical trials
       • In-silico discovery
       Enormous promise… Enormous challenges
    6. Big data challenges - general
       Source: The Economist, 2011, Big Data: Harnessing a Game-Changing Asset
    7. Barriers to extracting value
       Source: The Economist, 2011, Big Data: Harnessing a Game-Changing Asset
    8. Big Data in life sciences/healthcare
       • Multiple disparate data sources
       • Lack of integration: patient-molecular-clinical-assay-payer
       • "Swiss cheese" problem
       • Data cleansing/verification/credibility
       • Standards for data interchange
       • Privacy concerns
       • Lack of good tools for cross-domain analytics
    9. Changing paradigms
       • Hypothesis driven
         – Traditional, test the hypotheses, scientific method
         – Example: what's the mechanism of action of this drug?
       • Discovery driven
         – More open, questioning, enumerates elements to drive hypotheses
         – Example: what data do I have, what's interesting?
       • Hybrid
         – Discovery driven + hypotheses
         – Example: Human Genome Project
    10. Data Exploration
        Often neglected, but now key to getting value from life science big data
    11. What is Data Exploration?
        • Occurs before in-depth statistics/analytics
        • Explore and probe the data
        • Determine "what's interesting", "what's relevant"
        • Generate hypotheses
        • Ensure data is there to support hypotheses
    12. Asking questions of the data
        Very complex query that touches many of the 5 V's of Big Data
    13. Asking questions of the data
        Figure labels: More data sources; Multiple data sources; More data sources; Domain expertise
        Requires considerable IT resources to program this query
    14. Data Exploration challenges
        • Programming is hard
          – Mostly SQL, SAS
        • Lack of a shared language to support collaboration
          – Multidisciplinary data requires domain experts
        • Meaningful access to data
          – Sensitive to regulatory/compliance
        "The hands-on analytics time to write the SAS code and specify clearly what you need for each hypothesis is very time-consuming," Felix Frueh, CEO, Medco*
        * Miller, K., Big Data Analytics, Biomedical Computation Review, Winter 2011/2012
    15. The Problem
        • Data Managers, overwhelmed by researchers' questions on complex data sources
        • Researchers with many questions across disciplines
        • Clinical and molecular data
        • No common language for questions:
            SELECT DISTINCT S.PATIENT_ID, SAMPLE_ID, SAMPLE_NAME
            FROM SAMPLE_INVENTORY S
            INNER JOIN PATIENTS P ON S.PATIENT_ID = P.PATIENT_ID
            INNER JOIN DIAGNOSIS D ON S.PATIENT_ID = D.PATIENT_ID
            INNER JOIN MEDICATIONS M ON S.PATIENT_ID = M.PATIENT_ID
            INNER JOIN BIOMARKERS B ON S.PATIENT_ID = B.PATIENT_ID
            WHERE D.DIAGNOSIS_NAME = 'LUNG CANCER'
              AND M.MEDICATION_GENERIC_NAME = 'CETUXIMAB'
              AND B.BIOMARKER_NAME = 'EGFR'
              AND B.OBSERVATION = 1
            ORDER BY S.PATIENT_ID, SAMPLE_NAME
        • "Weeks to months to NEVER"
        • "Lost in Translation"
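To see what the query on this slide asks of a database, here is a minimal sketch that runs it with Python's built-in `sqlite3` module. The table and column names come from the slide. The column types, the toy rows, and the sample names are assumptions made up for illustration. `PATIENT_ID` is written as `S.PATIENT_ID` because SQLite rejects an unqualified column name that exists in several joined tables.

```python
import sqlite3

# Hypothetical toy schema: only the columns the slide's query touches.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PATIENTS (PATIENT_ID INTEGER PRIMARY KEY);
CREATE TABLE SAMPLE_INVENTORY (SAMPLE_ID INTEGER, SAMPLE_NAME TEXT, PATIENT_ID INTEGER);
CREATE TABLE DIAGNOSIS (PATIENT_ID INTEGER, DIAGNOSIS_NAME TEXT);
CREATE TABLE MEDICATIONS (PATIENT_ID INTEGER, MEDICATION_GENERIC_NAME TEXT);
CREATE TABLE BIOMARKERS (PATIENT_ID INTEGER, BIOMARKER_NAME TEXT, OBSERVATION INTEGER);

-- Invented rows: only patient 1 satisfies all three conditions.
INSERT INTO PATIENTS VALUES (1), (2), (3);
INSERT INTO SAMPLE_INVENTORY VALUES (10, 'S-A', 1), (11, 'S-B', 2), (12, 'S-C', 3);
INSERT INTO DIAGNOSIS VALUES (1, 'LUNG CANCER'), (2, 'LUNG CANCER'), (3, 'MELANOMA');
INSERT INTO MEDICATIONS VALUES (1, 'CETUXIMAB'), (2, 'CETUXIMAB'), (3, 'CETUXIMAB');
INSERT INTO BIOMARKERS VALUES (1, 'EGFR', 1), (2, 'EGFR', 0), (3, 'EGFR', 1);
""")

# The slide's query, with the literals passed as bound parameters.
QUERY = """
SELECT DISTINCT S.PATIENT_ID, SAMPLE_ID, SAMPLE_NAME
FROM SAMPLE_INVENTORY S
INNER JOIN PATIENTS P ON S.PATIENT_ID = P.PATIENT_ID
INNER JOIN DIAGNOSIS D ON S.PATIENT_ID = D.PATIENT_ID
INNER JOIN MEDICATIONS M ON S.PATIENT_ID = M.PATIENT_ID
INNER JOIN BIOMARKERS B ON S.PATIENT_ID = B.PATIENT_ID
WHERE D.DIAGNOSIS_NAME = ?
  AND M.MEDICATION_GENERIC_NAME = ?
  AND B.BIOMARKER_NAME = ?
  AND B.OBSERVATION = 1
ORDER BY S.PATIENT_ID, SAMPLE_NAME
"""

rows = conn.execute(QUERY, ("LUNG CANCER", "CETUXIMAB", "EGFR")).fetchall()
print(rows)
```

Even at this toy scale the question "which samples come from EGFR-positive lung cancer patients treated with cetuximab" needs four joins, which is the translation burden the slide points at.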
    16. Overcoming the challenges
        • Deep Collaboration
          – Easy access to dynamic data
          – Intuitive tools
          – Secure holistic view of data
          – Collaboration
    17. Big Data - Deep Collaboration (QIAGRAM)
        • A single researcher in a silo often can go deep into the data, but may be limited by their domain expertise
        • Small groups of researchers may be able to collaborate on asking questions but can't go very deep with the tools they have today
        • Deep Collaboration is when multiple groups of researchers can collaborate in asking questions deeper into the layers of data
        Shared domain knowledge allows deeper insights
    18. Qiagram: Collaborative Scientific Intelligence
        • Researchers and data managers can collaborate on creating queries
        • Qiagram acts as a shared, visual language for queries
        • Clinical and molecular data
        • More efficient and effective query creation
        • Transparent to all stakeholders
    19. Examples: MINING AERS DATA
    20. Small Data Can Become Big Data
        1000 Drugs × 1000 Drug Categories × 1000 Adverse Events = 10^9 Possible Combinations
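The count on this slide is a plain product of three list sizes. A small sketch of the arithmetic follows, with a toy-scale enumeration to show what one "combination" means here. The list contents are placeholders, not data from the talk.

```python
from itertools import product

# Sizes quoted on the slide.
n_drugs = n_categories = n_events = 1000
total = n_drugs * n_categories * n_events
print(total == 10**9)

# Toy-scale enumeration with placeholder names (hypothetical values).
drugs = ["drug_a", "drug_b"]
categories = ["category_x", "category_y"]
events = ["event_1", "event_2", "event_3"]
combos = list(product(drugs, categories, events))
print(len(combos), combos[0])
```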
    21. Introduction to Qiagram
    22. Answer the Question: Which Sources Have Data on Cholestasis?
    23. Joining Data Sources is Simple
    24. Combining Data Sources
        • SNOMED contains hierarchy of drug and medical terms
          – ~12M records
        • AERS contains reports of adverse events
          – ~70M records
        • MedDRA contains hierarchy of adverse event terms
          – 150k records
    25. SNOMED Ontology
    26. MedDRA Hierarchy
        low level term            pref term         hlt pref term       hlgt pref term  soc term
        abdominal migraine        migraine          migraine headaches  headaches       nervous system disorders
        acute migraine            migraine          migraine headaches  headaches       nervous system disorders
        band-like headache        tension headache  headaches nec       headaches       nervous system disorders
        basilar migraine          basilar migraine  migraine headaches  headaches       nervous system disorders
        cephalalgia               headache          headaches nec       headaches       nervous system disorders
        cephalalgia or cephalgia  headache          headaches nec       headaches       nervous system disorders
        cephalgia                 headache          headaches nec       headaches       nervous system disorders
    27. AERS
        • All drug-related adverse events reported to FDA since 2000
        • Tables for drugs, demographics, indications, therapy, reactions, outcomes
    28. Filter AERS Drugs by SNOMED Categories
        Figure labels: SNOMED Categories; AERS Drug list
        Results in all analgesics in AERS, with associated case #s
    29. Count the Various MedDRA high-level group terms reported for all drugs in AERS from the SNOMED "antibiotic" category
        Figure labels: SNOMED Categories; AERS Drug list; MedDRA Hierarchy Mapping
    30. Top Ten Antibiotic Adverse Event High-Level Group Terms
        Count    hlgt pref term - MedDRA browser
        692,058  general system disorders nec
        522,739  epidermal and dermal conditions
        491,409  neurological disorders nec
        385,375  joint disorders
        297,433  gastrointestinal signs and symptoms
        290,760  respiratory disorders nec
        278,723  allergic conditions
        230,134  cardiac disorder signs and symptoms
        208,985  injuries nec
        197,446  gastrointestinal motility and defaecation conditions
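The count described on the two slides above (filter AERS drugs by a SNOMED category, map the reported reactions through the MedDRA hierarchy, then count by high-level group term) can be sketched as a join followed by a GROUP BY. This is not Qiagram's implementation. The table names, column names, and rows below are hypothetical stand-ins, and the real sources have different layouts and far more records.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, heavily simplified stand-ins for the three sources.
CREATE TABLE snomed_category (drug TEXT, category TEXT);
CREATE TABLE aers_drug (isr INTEGER, drug TEXT);
CREATE TABLE aers_reaction (isr INTEGER, pref_term TEXT);
CREATE TABLE meddra (pref_term TEXT, hlgt_pref_term TEXT);

-- Toy rows for illustration only.
INSERT INTO snomed_category VALUES
  ('amoxicillin', 'antibiotic'), ('levofloxacin', 'antibiotic'), ('ibuprofen', 'analgesic');
INSERT INTO aers_drug VALUES
  (1, 'amoxicillin'), (2, 'levofloxacin'), (3, 'ibuprofen'), (4, 'levofloxacin'), (5, 'amoxicillin');
INSERT INTO aers_reaction VALUES
  (1, 'migraine'), (2, 'headache'), (2, 'rash'), (3, 'migraine'), (4, 'rash'), (5, 'rash');
INSERT INTO meddra VALUES
  ('migraine', 'headaches'), ('headache', 'headaches'),
  ('rash', 'epidermal and dermal conditions');
""")

# Category filter -> drug reports -> reactions -> HLGT roll-up -> top ten counts.
top_hlgt = conn.execute("""
SELECT m.hlgt_pref_term, COUNT(*) AS n
FROM snomed_category s
JOIN aers_drug d     ON d.drug = s.drug
JOIN aers_reaction r ON r.isr = d.isr
JOIN meddra m        ON m.pref_term = r.pref_term
WHERE s.category = ?
GROUP BY m.hlgt_pref_term
ORDER BY n DESC, m.hlgt_pref_term
LIMIT 10
""", ("antibiotic",)).fetchall()

for term, n in top_hlgt:
    print(n, term)
```

Swapping the bound parameter for `"analgesic"` gives the corresponding filter for the analgesics example on the earlier slide.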
    31. Data Quality
        • With more data, there is more chance for inconsistencies
        • Need easy ways to dynamically check the data and identify errant records
        • Example: AERS data
    32. Query to Locate Patients with Treatment Dates After Death Dates
    33. Example Results
        Over 2,000 results
        isr (drugs)  drug           age  death_dt (demographics)  start_dt (therapies)  days after death
        4016857      darbepoetin    59   9/5/2002                 8/12/3003             365,583
        4006065      interferon I   46   8/23/2002                1/27/2991             361,017
        6013473      naloxone       54   6/15/1953                6/15/2008             20,089
        6038344      combivent      50   12/10/1958               12/8/2008             18,261
        6038344      levofloxacin   50   12/10/1958               12/8/2008             18,261
        6105245      dexamethasone  49   1/27/1959                7/15/2008             18,067
        6105245      bortezomib     49   1/27/1959                7/12/2008             18,064
        6252126      enfuvirtide    50   8/29/1956                7/8/2005              17,845
        6252126      efavirenz      50   8/29/1956                5/10/2002             16,690
        6252126      didanosine     50   8/29/1956                5/10/2002             16,690
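A consistency check like this one reduces to a date comparison across joined tables. The sketch below is one way to express it in SQL through Python's `sqlite3`, not the query built in Qiagram. The simplified three-table layout joined on `isr` is an assumption. The two flagged rows reuse values from the results above, rewritten as ISO dates, and the third row is invented to show a record that passes the check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical simplified layout: one row per report (isr) in each table.
CREATE TABLE drugs (isr INTEGER, drug TEXT);
CREATE TABLE demographics (isr INTEGER, age INTEGER, death_dt TEXT);
CREATE TABLE therapies (isr INTEGER, start_dt TEXT);

INSERT INTO drugs VALUES (6013473, 'naloxone'), (6105245, 'bortezomib'), (9000001, 'example_drug');
INSERT INTO demographics VALUES
  (6013473, 54, '1953-06-15'), (6105245, 49, '1959-01-27'), (9000001, 70, '2008-07-01');
INSERT INTO therapies VALUES
  (6013473, '2008-06-15'), (6105245, '2008-07-12'), (9000001, '2008-01-01');
""")

# Flag therapies that start after the recorded death date, largest gap first.
errant = conn.execute("""
SELECT d.isr, d.drug, demo.death_dt, t.start_dt,
       CAST(julianday(t.start_dt) - julianday(demo.death_dt) AS INTEGER) AS days_after_death
FROM drugs d
JOIN demographics demo ON demo.isr = d.isr
JOIN therapies t       ON t.isr = d.isr
WHERE julianday(t.start_dt) > julianday(demo.death_dt)
ORDER BY days_after_death DESC
""").fetchall()

for row in errant:
    print(row)
```

Storing dates as ISO `YYYY-MM-DD` text lets SQLite's `julianday` do the day arithmetic; source data in `M/D/YYYY` form would need converting first.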
    34. Answer Questions At the Speed of Thought
        • Many "purpose built" systems answer pre-defined questions
        • However, in data exploration we need the ability to explore new questions
    35. Collaborative Experience
        • Team of physicians, informaticians, and safety experts collaboratively explored questions based on large amounts of clinical (SDTM) data:
          – Did subjects who were pre-treated with certain drug classes have the most change in cardiac function?
          – What was in common with the subjects that were outliers in cardiac function change?
        • Team defined baselines, changes, etc.
    36. CPRD
        • The Clinical Practice Research Datalink (CPRD) is the new English NHS observational data and interventional research service, jointly funded by the NHS National Institute for Health Research (NIHR) and the Medicines and Healthcare products Regulatory Agency (MHRA).
        • 6 large fact tables with 1 B to 2 B rows
        • Example query
          – Identify patients with coronary artery disease who have taken aspirin, then study readmission rates.
    37. Thomson Reuters MarketScan
        • Several characteristics set MarketScan databases apart from other research databases. The core databases, Commercial, Medicare Supplemental, and Medicaid, are huge: over 170 million patients since 1995.
        • Over 25 fact tables, from 100 M up to 1.5 B rows
        • Example
          – Identify cancer patients, look at opiate treatment, and study the duration of the escalation
    38. Premier Research Services
        • Patient-level data is available from more than 600 hospitals, 45 million records and 310 million hospital visits
        • 5 large fact tables, from 100 M to over 4 B rows
    39. Cerner Health Facts Database
        • 8 large fact tables, most in the 10s of millions of records
        • Example
          – Look for type II diabetes patients and study infection rates of these patients based on hospital types.
    40. Launching in March 2013: cloud-based Qiagram offering with AERS and TCGA data
