Zen and the Art of Data Science Maintenance

Dr. Jabe Wilson, Elsevier's Consulting Director of Text and Data Analytics, gave this presentation at the Bio-IT World Conference in Boston on May 17, 2018.

Published in: Technology

  1. 1. | 0 Dr Jabe Wilson, Elsevier R&D Solutions Professional Services Zen and the Art of Data Science Maintenance Bio-IT World, 17 May 2018
  2. 2. | 1 The experience of doing Data Science “Programming is never easy … you’re kind of always on this frontier where you are out of your depth. And one of the things you have to learn is to accept this feeling – of being constantly wrong. Which makes coding sound like a branch of Zen Buddhism” - Andrew Smith, “Code to Joy”, The Economist 1843, April/May.
  3. 3. | 2 Data Science as an Art • Intuition • Qualitative insights • Exploring a problem through solutions • “Inspiration exists, but it has to find you working.” - Pablo Picasso Studio, Tony Wilson (1973) my father. http://www.tonywilsonpainterprintmaker.com/
  4. 4. | 3 What can go wrong doing Data Science • Bad Data • Bad Models • Opaque Predictions
  5. 5. | 4 Good Data Curation: • Tagging against dictionaries • Mapping dictionaries • Regularising numeric units
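The "regularising numeric units" step can be illustrated with a minimal sketch. It assumes concentrations appear as a number followed by a molar unit (for example "6.3nM" or "10mM"); the function name `to_molar` and the unit table are hypothetical and not taken from the presentation.

```python
import re

# Multipliers that convert common concentration units to molar (M).
# The set of units covered here is an assumption for illustration.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "µM": 1e-6, "nM": 1e-9, "pM": 1e-12}

# A number, optional whitespace, then one of the known units.
_CONCENTRATION = re.compile(r"(\d+(?:\.\d+)?)\s*(pM|nM|uM|µM|mM|M)\b")


def to_molar(text):
    """Return the first concentration found in `text`, expressed in molar."""
    match = _CONCENTRATION.search(text)
    if match is None:
        raise ValueError(f"no concentration found in {text!r}")
    value, unit = match.groups()
    return float(value) * UNIT_TO_MOLAR[unit]
```

Once every value is on one scale, activities reported in different units can be compared or thresholded directly.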
  6. 6. | 5 Right Data Depends on your choice of model (semantic or machine learning). What happens when the model changes (do you still have enough data)?
  7. 7. | 6 In-time Data • Transactional workflows • Dynamic knowledge hubs • Opportunity costs
  8. 8. | 7 Examples of Data Science in practice
  9. 9. | 8 Examples of Data Science in practice • Rare disease treatment: Highly curated data allows us to make predictions, but also required judgement in the building of the model • Translational safety: Concordance data is predictive; but also shows the importance of curating taxonomies • Evidence selection: In order to select the right information sets you need to be able to filter on the context of parameter based assertions (machine learning can help improve data selection) • Real World Data interpretation: Machine learning classification can be enhanced with taxonomies, but also deliver across multimodal data sets
  11. 11. | 10 Biological pathways extracted via semantic text mining: A upregulates B; B upregulates C; C increases disease (A → B → C → disease). Bioactivities through text analysis: IC50 6.3 nM, kinase binding assay, 10 mM concentration. Chemical structures and properties: InChI, name. NCBI, UniProt, EMTREE, ReaxysTree, structures. Normalizing vocabularies required: proteins, diseases, drugs, chemicals
  12. 12. | 11 What Constitutes Big Data? • Very large data sets - Order of ~$10^7$ documents published (patents, journals, books) - Each document has ~200 sentences, giving ~$10^9$ statements - Statements are about molecules, properties, reactions, indications etc. • Combinatorial connections between large data sets - “Connecting the dots” among these facts results in a very large number of possible connections - $\frac{n!}{k!\,(n-k)!}$ combinations of k elements chosen from a pool of n. Pathways • Relationships mined from 12,000 titles, 25M documents • <subject> <verb> <object> relationships • Each subject, object, verb has a taxonomy • Example: “protein” causes/induces disease Compounds • 16,000 journal titles plus patent offices • Compounds, Reactions, Properties • Over 6 million compounds with bioactivity Bioassays • Biological relationships mined from journals/patents (over 16 million) • <compound> <verb> <object> <quantity> • Example: Sunitinib binds-to Bcr-abl in <assay type> at 1nM
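To give a feel for the combinatorial growth this slide refers to, the sketch below evaluates the "k elements chosen from a pool of n" count with Python's standard `math.comb`. The function name `possible_connections` is hypothetical, and treating each connection as an unordered set of k statements is an assumption made for illustration.

```python
import math


def possible_connections(n_statements, k=2):
    """Number of ways to choose k statements from a pool of n: n! / (k! (n-k)!)."""
    return math.comb(n_statements, k)


if __name__ == "__main__":
    # Roughly 10^9 statements, as estimated on the slide.
    for k in (2, 3):
        print(k, possible_connections(10**9, k))
```

Even for pairs alone (k = 2), a pool of about a billion statements yields on the order of 5 × 10^17 candidate connections, which is why exhaustive enumeration is impractical.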
  13. 13. | 12 Building and refining the disease model for hyperinsulinism • Picked relevant pathways (from a collection of 1800 models) • Explored functions of proteins using 6.2M pre-text-mined relations and embedded Gene Ontology • Summarized what is known about CHI mechanism in an overview model
  14. 14. | 13 From pathways to treatments: automated analysis combines bioassay data with text-mined data. Step 1: Find all targets that could be used to affect the disease state • 88 targets related to hyperinsulinism with ≥3 literature references • Full relationship information
  15. 15. | 14 From pathways to treatments: automated analysis combines bioassay data with text-mined data. Step 1: Find all targets that could be used to affect the disease state. Step 2: Query for each protein to find compounds that target it (>6 log units). Diagram labels: targets based on text mining; approved compounds; bioassay data
  16. 16. | 15 From pathways to treatments: automated analysis combines bioassay data with text-mined data. Step 1: Find all targets that could be used to affect the disease state. Step 2: Query for each protein to find compounds that target it (>6 log units). Step 3: Collate data by compound to summarize the targets/activities related to disease that the compound hits • Compute geometric mean of activities for ranking • Rank by number of targets and geometric mean of activities against targets. Result columns: targets and activities for each compound; mean of activities among these targets; drug-likeness metrics for sorting/classification • All compounds that were observed to bind to targets in pathway • Sorted by number of active targets. Too many targets may suggest lack of specificity.
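The collation and ranking described in this step could look roughly like the sketch below. It is not the presenters' implementation: the record layout `(compound, target, p_activity)`, the function name `rank_compounds`, and the reading of ">6 log units" as a negative-log activity above 6 are all assumptions, and the geometric mean is taken directly over those log-scale values.

```python
from collections import defaultdict
from statistics import geometric_mean


def rank_compounds(records, disease_targets, min_p_activity=6.0):
    """Collate activities by compound and rank them.

    records: iterable of (compound, target, p_activity) tuples (hypothetical schema).
    disease_targets: set of targets linked to the disease.
    """
    hits = defaultdict(dict)
    for compound, target, p_activity in records:
        if target in disease_targets and p_activity > min_p_activity:
            # Keep the strongest reported activity per compound/target pair.
            best = hits[compound].get(target, 0.0)
            hits[compound][target] = max(best, p_activity)

    summary = [
        (compound, len(targets), geometric_mean(list(targets.values())))
        for compound, targets in hits.items()
    ]
    # Most active targets first; ties broken by higher mean activity.
    return sorted(summary, key=lambda row: (-row[1], -row[2]))
```

Sorting by target count first mirrors the slide, though a very high count may call for a closer look, since the slide notes that too many targets can suggest a lack of specificity.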
  17. 17. | 16 Approved compounds that may treat hyperinsulinism • Each binds to one or more targets related to the disease • Can easily be obtained and tested in preclinical studies • List includes a compound known to treat hyperinsulinism, sirolimus
  18. 18. | 17 Example: Process for Finding New Indications for a Drug (Ruxolitinib) • Find all targets for which the compound has high affinity • Collate the diseases by targets and activity of the compound • Using unique set of proteins from steps 1 and search for all diseases reported to be related to them • Find all compound-protein/gene relationships with > 1 reference using text analysis. Diagram labels: Step 1; Step 2; Step 3; targets inhibited; targets related to disease
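One possible reading of this process is a join between a compound's targets and the diseases reported for each target. The sketch below shows that join under stated assumptions: the function name `candidate_indications` and both input structures are hypothetical, and the placeholder identifiers stand in for real proteins and diseases.

```python
from collections import defaultdict


def candidate_indications(compound_targets, target_diseases):
    """Map each disease to the subset of the compound's targets linked to it.

    compound_targets: set of targets the compound hits (hypothetical input).
    target_diseases: dict mapping a target to the diseases reported for it.
    """
    by_disease = defaultdict(set)
    for target in compound_targets:
        for disease in target_diseases.get(target, ()):
            by_disease[disease].add(target)
    # Diseases supported by more targets come first; names break ties.
    return sorted(by_disease.items(), key=lambda item: (-len(item[1]), item[0]))
```

A disease that surfaces through several independent targets is a stronger repurposing lead than one reached through a single relationship, which is why the result is ordered by target count.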
  19. 19. | 18 This Analysis Shows Connections of Ruxolitinib to Alopecia. A cancer drug that grows hair! Trials are under way. “Alopecia areata is driven by cytotoxic T lymphocytes and is reversed by JAK inhibition”, Nature Medicine 20, 1043–1049 (2014), doi:10.1038/nm.3645. Global transcriptional profiling of mouse and human AA skin revealed gene expression signatures indicative of cytotoxic T cell infiltration, an interferon-γ (IFNG) response and upregulation of several γ-chain (γc) cytokines known to promote the activation and survival of IFN-γ–producing CD8+NKG2D+ effector T cells. Therapeutically, antibody-mediated blockade of IFN-γ, interleukin-2 (IL-2) or interleukin-15 receptor β (IL-15Rβ) prevented disease development, reducing the accumulation of CD8+NKG2D+ T cells in the skin and the dermal IFN response in a mouse model of AA.
  20. 20. | 19 Examples of Data Science in practice • Rare disease treatment: Highly curated data allows us to make predictions, but also required judgement in the building of the model • Translational safety: Concordance data is predictive; but also shows the importance of curating taxonomies • Evidence selection: In order to select the right information sets you need to be able to filter on the context of parameter based assertions (machine learning can help improve data selection) • Real World Data interpretation: Machine learning classification can be enhanced with taxonomies, but also deliver across multimodal data sets
  21. 21. | 20 • Concordance between preclinical studies and human adverse events, based on the calculation of positive likelihood ratios. - Chi-squared tells us if there is a statistically significant relationship of any kind between the human and animal observations (which is used as a filter). - The likelihood ratio measures the predictive value of the animal observation. A translational safety big data analysis
  22. 22. | 21 • If the chi-squared is high, and the likelihood ratio is low, one can state that there is high confidence that the animal observation does not predict human observation. • In which case the animal model should not be used. A translational safety big data analysis
  23. 23. | 22 • If the chi-squared is high, and the likelihood ratio is high, one can state that there is high confidence that the animal observation does predict human observation. • In which case checks for adverse events can be added to clinical trials. A translational safety big data analysis
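The two statistics used in this concordance analysis can be computed from a 2×2 table of animal versus human observations. The sketch below uses the standard textbook definitions; the function names, the argument layout (tp = seen in both animal and human, fp = animal only, fn = human only, tn = neither), and the `lr_cutoff` default are assumptions, since the slides describe the ratio only as "high" or "low".

```python
import math


def positive_likelihood_ratio(tp, fp, fn, tn):
    """LR+ = sensitivity / (1 - specificity) for a 2x2 concordance table."""
    sensitivity = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)
    if false_positive_rate == 0:
        return math.inf
    return sensitivity / false_positive_rate


def chi_squared(tp, fp, fn, tn):
    """Pearson chi-squared statistic (1 degree of freedom, no continuity correction)."""
    n = tp + fp + fn + tn
    numerator = n * (tp * tn - fp * fn) ** 2
    denominator = (tp + fp) * (fn + tn) * (tp + fn) * (fp + tn)
    return numerator / denominator


def interpret(chi2, lr_plus, chi2_critical=3.841, lr_cutoff=1.0):
    """Apply the filter-then-rank logic from the slides.

    chi2_critical=3.841 is the 5% critical value for 1 degree of freedom.
    lr_cutoff is a placeholder; the slides give no numeric threshold.
    """
    if chi2 < chi2_critical:
        return "no statistically significant relationship"
    if lr_plus > lr_cutoff:
        return "animal observation predicts human observation"
    return "animal observation does not predict human observation"
```

Chi-squared acts only as the gate here: it says whether any relationship exists, while the likelihood ratio says whether that relationship is useful for prediction.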
  24. 24. | 23 • Curation of taxonomy data. • The higher levels of the MedDRA hierarchy sometimes include such a variety of events that the additional false positives and negatives result in no statistical confidence in the relationship. A translational safety big data analysis
  25. 25. | 24 Examples of Data Science in practice • Rare disease treatment: Highly curated data allows us to make predictions, but also required judgement in the building of the model • Translational safety: Concordance data is predictive; but also shows the importance of curating taxonomies • Evidence selection: In order to select the right information sets you need to be able to filter on the context of parameter based assertions (machine learning can help improve data selection) • Real World Data interpretation: Machine learning classification can be enhanced with taxonomies, but also deliver across multimodal data sets
  26. 26. | 25 Cold mice problem • If we can interpret and classify complex parameter-based statements, this allows us to select the right data. 22°C cage (standard housing): stress/immune response to cold; decreased response to chemotoxic drugs. 30°C cage (thermoneutrality): no immune response to cold; increased response to chemotoxic drugs
  27. 27. | 26 Cold mice problem: results. Example statements: “All mice were maintained in a temperature controlled (22 ± 2 °C) environment 12-h light 12-h dark photocycle and fed rodent chow meal.” “The mice were individually placed into an acrylic cylinder (25 cm height 10 cm diameter) containing 8 cm of water maintained at 22–24 °C.” Allowing research reports to be filtered based on whether results will be reliable due to experimental conditions.
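As a rule-based stand-in for the classification this slide describes (the presentation points to machine learning, which is not reproduced here), the sketch below pulls a housing temperature out of a methods statement and labels it. The cue words, the 28 °C threshold, and the function names are assumptions for illustration only.

```python
import re

# Words assumed to signal that a sentence describes animal housing.
HOUSING_CUES = ("housed", "housing", "environment", "cage")

# Matches "22 °C", "22 ± 2 °C" (tolerance ignored) or "22–24 °C" (range).
_TEMPERATURE = re.compile(
    r"(?P<low>\d+(?:\.\d+)?)"
    r"(?:\s*±\s*\d+(?:\.\d+)?"
    r"|\s*[–-]\s*(?P<high>\d+(?:\.\d+)?))?"
    r"\s*°\s*C"
)


def housing_temperature(text):
    """Return the housing temperature in °C, or None if none is stated."""
    for sentence in re.split(r"(?<=\.)\s+", text):
        if not any(cue in sentence.lower() for cue in HOUSING_CUES):
            continue  # e.g. a water-bath temperature is not a housing condition
        match = _TEMPERATURE.search(sentence)
        if match:
            low = float(match.group("low"))
            high = float(match.group("high") or low)
            return (low + high) / 2  # midpoint of a range
    return None


def classify_housing(text, thermoneutral_min=28.0):
    """Label a statement as thermoneutral or standard housing (threshold assumed)."""
    temperature = housing_temperature(text)
    if temperature is None:
        return "unknown"
    return "thermoneutral" if temperature >= thermoneutral_min else "standard"
```

The cue-word check is what separates the housing temperature from the water temperature in the example statements; a learned classifier would replace that hand-written context rule.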
  28. 28. | 27 Use case examples • Rare disease treatment: Highly curated data allows us to make predictions, but also required judgement in the building of the model • Translational safety: Concordance data is predictive; but also shows the importance of curating taxonomies • Evidence selection: In order to select the right information sets you need to be able to filter on the context of parameter based assertions (machine learning can help improve data selection) • Real World Data interpretation: Machine learning classification can be enhanced with taxonomies, but also deliver across multimodal data sets
  29. 29. | 28 Real World Data interpretation • Machine Learning: - Classify images. - Classify concepts (combining taxonomies with word embeddings improves performance on similarity measurement and entity classification). • Opportunities for developing multimodal classification of data sources with unstructured text and unlabelled images.
  30. 30. | 29 • These use case examples illustrate the challenges and creativity required to deliver Data Science. • We are developing a platform to help support these activities. o Good data: curated data. o Right data: export graph and feature data. o In-time data: bringing data sets together in a knowledge hub to enable in-time data. Supporting Data Scientists to deliver results
  31. 31. | 30 A platform for supporting Data Scientists • Inspiration exists, but it has to find you working. - Pablo Picasso • If you want to become a Data Science Platform development partner, or wish to hear more about continuing developments around Data Science at Elsevier please contact me: • www.linkedin.com/in/jabewilson/ • jabe.wilson@elsevier.com Studio, Tony Wilson (1973) my father. http://www.tonywilsonpainterprintmaker.com/
  32. 32. | 31 Acknowledgements • Helena F. Deus, Corey Harper, Darin McBeath and Ron Daniel Jr – Elsevier Labs • Matthew Clark, Frederik van den Broek, Anton Yuryev, Maria Shkrob – Elsevier Professional services • Thomas Steger-Hartmann, Investigational Toxicology, Bayer AG
