Big Data and Analytics Across the Interdisciplinary Divide

4th International Conference on Big Data and Information Analytics, Theories, Algorithms and Applications in Data Science, December 17-19, 2018, Houston Texas. https://sph.uth.edu/divisions/biostatistics/bigdia/

Published in: Education

  1. 1. Big Data & Analytics Across the Interdisciplinary Divide Philip E. Bourne PhD, FACMI Stephenson Chair of Data Science Director, Data Science Institute Professor of Biomedical Engineering peb6a@virginia.edu https://www.slideshare.net/pebourne 12/17/18 BigDIA 1 @pebourne
  2. 2. Perspective • I was not trained as a data scientist or computer scientist - I started as a physical chemist • At this point I can’t give you a deep technical perspective • My examples are taken from biomedicine, but broadly applicable • Deeply engaged in preparing one academic institution for a very different data driven interdisciplinary future 12/17/18 BigDIA 2
  3. 3. My motivation The biggest gains for our society are going to come through interdisciplinary research where data and analytics catalyze the collaboration 12/17/18 BigDIA 3
  4. 4. Consider a wake up call of sorts 12/17/18 BigDIA 4
  5. 5. A wake up call of sorts 12/17/18 BigDIA 5 https://www.sciencemag.org/news/2018/12/google-s-deepmind-aces-protein-folding https://moalquraishi.wordpress.com/2018/12/09/alphafold-casp13-what-just-happened/
  6. 6. Data as driver 12/17/18 BigDIA 6 https://www.ebi.ac.uk/uniprot/TrEMBLstats Contents of the Protein Data Bank
  7. 7. This is a somewhat predictable outcome.. The real excitement comes from the unexpected … Witness the tale of the trauma surgeon … 12/17/18 BigDIA 7 But there is more…
  8. 8. Air pollution-ecosystem feedback: unmanned aerial vehicles and ecosystem models to quantify ozone-forest interactions 12/17/18 BigDIA 8 • Spatial heterogeneity • Novel sampling • Sensor data Departments: Environmental Sciences, Electrical Engineering
  9. 9. A working definition of what we are doing … It is the unexpected re-use of information which is the value added by the web Tim Berners-Lee 12/17/18 BigDIA 9 https://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/#116a5a2d55cf
  10. 10. A working definition of what we are doing … It is the unexpected re-use of information which is the value added by the web and subsequent analysis of that information for societal benefit Tim Berners-Lee / Phil Bourne 12/17/18 BigDIA 10 https://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/#116a5a2d55cf
  11. 11. Of course this was all predicted by smart people .. 12/17/18 BigDIA 11
  12. 12. https://en.wikipedia.org/wiki/Jim_Gray_(computer_scientist) https://www.microsoft.com/en-us/research/wp-content/uploads/2009/10/Fourth_Paradigm.pdf https://twitter.com/aip_publishing/status/856825353645559808 12/17/18 BigDIA 12
  13. 13. I would suggest that this audience has a responsibility to promote the fourth paradigm which is not a well recognized phenomenon across disciplines … Here is one example of how to do so 12/17/18 BigDIA 13
  14. 14. How Will Science Change? 12/17/18 BigDIA 14
  15. 15. Example - Photography (stages plotted against time; vertical axis: Volume, Velocity, Variety) • Digitization: digital camera invented by Kodak but shelved • Deception: megapixels & quality improve slowly; Kodak slow to react • Disruption: film market collapses; Kodak goes bankrupt • Demonetization: phones replace cameras • Dematerialization: Instagram, Flickr become the value proposition • Democratization: digital media becomes bona fide form of communication From a presentation to the Advisory Board to the NIH Director 12/17/18 BigDIA 15
  16. 16. Model Transportability Horizontal Integration Multi-scale Integration human mouse zebrafish DNA Gene/Protein Network Cell Tissue Organ Body Population CNV SNP methylation 3D structure Gene expression Proteomics Metabolomics Metabolic Signaling transduction Gene regulation Hepatic Myoepithelial Erythrocyte Epithelial Muscle Nervous Liver Kidney Pancreas Heart Physiologically based pharmacokinetics GWAS Population dynamics Microbiota Open, complex, diverse digital data Systems Pharmacology Xie et al. Annu Rev Pharmacol Toxicol. 2017 57:245-262 12/17/18 BigDIA 16
  17. 17. How should we think about organizing ourselves in an interdisciplinary way to maximize the opportunities offered by the fourth paradigm? 12/17/18 BigDIA 17
  18. 18. The Pillars of Data Science Application Domains 12/17/18 BigDIA 18
  19. 19. Let's briefly focus on those five pillars in the context of one area of biomedical informatics – structural bioinformatics What kinds of interchange should be taking place between this field and data science? 12/17/18 BigDIA 19 Mura et al. 2018 Curr Opin Struct Biol. 52:95-102
  20. 20. Data Acquisition • Persistence of raw data not clear • Some level of consistency across instrument manufacturers • Lessons in community/society drive 12/17/18 BigDIA 20 Mura et al. 2018 Curr Opin Struct Biol. 52:95-102
  21. 21. Data Integration and Engineering • URIs – no, steeped in tradition • Ontologies – somewhat • Linked data – somewhat Years of experience to convey 12/17/18 BigDIA 21
  22. 22. Data Analytics – SVMs – Random forest – Neural nets – Deep learning – ?? Opportunity to learn from many domains 12/17/18 BigDIA 22
  23. 23. Visualization & Dissemination • Avoid the curse of the ribbon • Think sonics • Look to video games 12/17/18 BigDIA 23
  24. 24. Ethics, Law & Policy – Community Driven Data Sharing 12/17/18 BigDIA 24
  25. 25. How to implement this at any level? 12/17/18 BigDIA 25
  26. 26. Guiding Principles • Be constantly strategic and nimble - think supply chain • Be sustainable - do not overreach • Be interdisciplinary • Be an organization without walls • Be diverse, accessible and open • Be team not individually driven • Strive for quality not quantity in education & research • Be innovative and translational through new forms of engagement with the private sector, government, NGOs, local, state, national and international partners 12/17/18 BigDIA 26
  27. 27. Guiding Principles • Be constantly strategic and nimble - think supply chain • Be sustainable - do not overreach • Be interdisciplinary • Be an organization without walls • Be diverse, accessible and open • Be team not individually driven • Strive for quality not quantity in education & research • Be innovative and translational through new forms of engagement with the private sector, government, NGOs, local, state, national and international partners 12/17/18 BigDIA 27
  28. 28. Be Interdisciplinary – Be Without Walls • Satellites – discipline driven - located in another School focusing on the mission of that School where data and analytics play a role, e.g., – SOM – data governance and clinical translation – Education – working on educational analytics • Centers – Focus area driven e.g. – Ethics and justice – Neurodegenerative disorders – Alzheimer's, autism, TBI – Sports analytics 12/17/18 BigDIA 28
  29. 29. Guiding Principles • Be constantly strategic and nimble - think supply chain • Be sustainable - do not overreach • Be interdisciplinary • Be an organization without walls • Be diverse, accessible and open • Be team not individually driven • Strive for quality not quantity in education & research • Be innovative and translational through new forms of engagement with the private sector, government, NGOs, local, state, national and international partners 12/17/18 BigDIA 29
  30. 30. Be Diverse, Accessible and Open – Why? • Data science exists largely because of open data • Open knowledge encourages disciplinary and interdisciplinary collaboration • Yet much of the scholarship we produce is not accessible at all and certainly not accessible to socioeconomically disadvantaged groups • Gouging by commercial knowledge providers is making the knowledge produced by others less accessible to us • Research is suffering from a reproducibility crisis addressable through greater access to all aspects of the research lifecycle 12/17/18 BigDIA 30
  31. 31. Be Diverse, Accessible and Open – Why? Consider Biomedicine • Big Data – Total data from NIH-funded research back in 2016 estimated at 650 PB* – 20 PB of that is in NCBI/NLM (3%) and it is expected to grow by 10 PB in 2016 • Dark Data – Only 12% of data described in published papers is in recognized archives – 88% is dark data^ • Cost – 2007-2014: NIH spent ~$1.2Bn extramurally on maintaining data archives * In 2012 Library of Congress was 3 PB ^ http://www.ncbi.nlm.nih.gov/pubmed/26207759 12/17/18 BigDIA 31
  32. 32. A call for making these data open • Mandates – NIH, NSF, Data Management Plans • Business models can be protected yet everyone benefits • It saves lives …. 12/17/18 BigDIA 32
  33. 33. Why a more open process? Use case: Diffuse Intrinsic Pontine Gliomas (DIPG) • Occur 1:100,000 individuals • Peak incidence 6-8 years of age • Median survival 9-12 months • Surgery is not an option • Chemotherapy ineffective and radiotherapy only transitory From Adam Resnick 12/17/18 BigDIA 33
  34. 34. Timeline of genomic studies in DIPG • Landmark studies identify histone mutations as recurrent driver mutations in DIPG ~2012 • Almost 3 years later, in largely the same datasets, but partially expanded, the same two groups and 2 others identify ACVR1 mutations as a secondary, co-occurring mutation From Adam Resnick 12/17/18 BigDIA 34
  35. 35. What do we need to do differently to reveal ACVR1? • ACVR1 is a targetable kinase • Inhibition of ACVR1 inhibited tumor progression in vitro • ~300 DIPG patients a year • ~60 are predicted to have ACVR1 mutations • If only large-scale data sets had been integrated with TCGA and/or rare disease data in 2012, ACVR1 mutations would have been identified • 60 patients/year × 3 years = 180 children’s lives (who likely succumbed to the disease during that time) could have been impacted if only data were FAIR From Adam Resnick 12/17/18 BigDIA 35
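The estimate on slide 35 is simple arithmetic, and a minimal sketch makes the reasoning explicit. The variable names are illustrative only; the figures are the approximate ones quoted on the slide (~300 patients a year, ~60 predicted ACVR1 cases, a roughly 3-year gap between studies).

```python
# Approximate figures quoted on the slide (not independently verified here).
dipg_patients_per_year = 300    # ~300 DIPG patients a year
acvr1_patients_per_year = 60    # ~60 predicted to carry ACVR1 mutations
years_of_delay = 3              # ~2012 histone studies to ACVR1 discovery

# Share of DIPG patients predicted to carry an ACVR1 mutation.
acvr1_fraction = acvr1_patients_per_year / dipg_patients_per_year

# Patients who might have been affected during the delay.
patients_affected = acvr1_patients_per_year * years_of_delay

print(f"ACVR1 share of DIPG cases: {acvr1_fraction:.0%}")
print(f"Patients over the delay:   {patients_affected}")
```

Changing `years_of_delay` shows how directly the estimate scales with the time it takes to integrate data sets.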
  36. 36. Research Data Infrastructure … Both funders and some institutions see the need to move from pipes to platforms to accelerate research… 12/17/18 BigDIA 36 https://blog.lexicata.com/wp-content/uploads/2015/03/platform-model-750x410.png
  37. 37. If platforms are the answer we could ask the question… Will {biomedical} research become more like Airbnb? 12/17/18 BigDIA 37 Vivien Bonazzi Should biomedical research be Like Airbnb? doi: 10.1371/journal.pbio.2001818
  38. 38. I am not crazy, hear me out • Airbnb is a platform that supports a trusted relationship between consumer (renter) and supplier (host) • The platform focuses on maximizing the exchange of services between supplier and consumer and maximizing the amount of trust associated with a given stakeholder • It seems to be working: – 60 million users searching 2 million listings in 192 countries – Average of 500,000 stays per night – Valuation of US $25bn 12/17/18 BigDIA 38 Should biomedical research be Like Airbnb? doi: 10.1371/journal.pbio.2001818
  39. 39. Platforms will ultimately digitally integrate the scholarly workflow for human and machine analysis Should biomedical research be Like Airbnb? doi: 10.1371/journal.pbio.2001818 12/17/18 BigDIA 39
  40. 40. Paper Author Paper Reader Data Provider Data Consumer Employer Employee Reagent Provider Reagent Consumer Software Provider Software Consumer Grant Writer Grant Reviewer Supplier Consumer Platform MS Project Google Drive Coursera Researchgate Academia.edu Open Science Framework Synapse F1000 Rio Educator Student Pilot Open Data Lab (ODL) underway 12/17/18 BigDIA 40
  41. 41. The NIH through the Big Data to Knowledge (BD2K) is experimenting with a platform, keeping in mind the need to overcome these impediments Enter The Commons https://en.wikipedia.org/wiki/Ealing_Common#/media/File:Ealing_Common_-_geograph.org.uk_-_17075.jpg 12/17/18 BigDIA 41
  42. 42. Paper Author Paper Reader Data Provider Data Consumer Employer Employee Reagent Provider Reagent Consumer Software Provider Software Consumer Grant Writer Grant Reviewer Supplier Consumer Platform MS Project Google Drive Coursera Researchgate Academia.edu Open Science Framework Synapse F1000 Rio Educator Student Commons – Initial focus is on integrating two layers of the scholarly workflow 12/17/18 BigDIA 42
  43. 43. Commons topology Compute Platform: Cloud or HPC Services: APIs, Containers, Indexing, Software: Services & Tools scientific analysis tools/workflows Data “Reference” Data Sets User defined data DigitalObjectCompliance App store/User Interface PaaS SaaS IaaS https://datascience.nih.gov/commons 12/17/18 BigDIA 43
  44. 44. Commons Compliance • Treat products of research – data, methods, papers etc. as digital objects • These digital objects exist in a shared virtual space • Digital object compliance through FAIR principles: – Findable – Accessible (and usable) – Interoperable – Reusable https://commonfund.nih.gov/bd2k/commons 12/17/18 BigDIA 44
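One way to picture "digital object compliance" is as a checklist applied to each object's metadata. The sketch below is an assumption-laden illustration, not the Commons' actual compliance rules: the `DigitalObject` fields and the mapping of each field to a FAIR letter are hypothetical choices made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DigitalObject:
    """Hypothetical minimal record for a research product (data, method, paper)."""
    identifier: Optional[str] = None   # persistent ID, e.g. a DOI
    access_url: Optional[str] = None   # where the object can be retrieved
    media_type: Optional[str] = None   # standard format, e.g. "text/csv"
    license: Optional[str] = None      # terms governing reuse
    metadata: Dict[str, str] = field(default_factory=dict)


def fair_gaps(obj: DigitalObject) -> List[str]:
    """Return the FAIR principles this object fails under the toy rules below."""
    gaps = []
    if not obj.identifier or not obj.metadata:
        gaps.append("Findable")        # assumed rule: needs an ID and descriptive metadata
    if not obj.access_url:
        gaps.append("Accessible")      # assumed rule: needs a retrieval location
    if not obj.media_type:
        gaps.append("Interoperable")   # assumed rule: needs a declared standard format
    if not obj.license:
        gaps.append("Reusable")        # assumed rule: needs explicit reuse terms
    return gaps


# Placeholder values for illustration only.
dataset = DigitalObject(
    identifier="doi:10.0000/example",
    access_url="https://example.org/data.csv",
    media_type="text/csv",
    metadata={"title": "Example data set"},
)
print(fair_gaps(dataset))  # the license is missing in this example
```

Real FAIR assessments are considerably richer than four field checks; the point is only that compliance can be evaluated by machine once products of research are treated as digital objects.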
  45. 45. Why a comparison to Airbnb is not fair • Airbnb was born digital • The exchange of services on Airbnb is simple compared to what is required of a platform to support biomedical research Nevertheless there is much to be learnt 12/17/18 BigDIA 45
  46. 46. Impediments to platforms • Current work practices by all stakeholders • Entrenched business models • Size of the undertaking aka resources needed • Trust • Incentives to use the platform http://www.forbes.com/sites/johnhall/2013/04/29/10-barriers-to-employee-innovation/#8bdbaa811133 12/17/18 BigDIA 46
  47. 47. Even if they are successful, platforms are likely to be domain specific and only address the infrastructure.. What else is needed? 12/17/18 BigDIA 47
  48. 48. We need to promote openness • Encourage persistent identifiers e.g., ORCID • Encourage preprints • Encourage Open Access (OA) • Recognize openness in hiring and P&T • Teach open scholarship • Promote institutional openness – repositories, wikimedian in residence • Support institutional open data governance • Support global community efforts…. 12/17/18 BigDIA 48
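Persistent identifiers such as ORCID iDs are designed to be machine-checkable. ORCID documents that the final character of an iD is a check character computed with the ISO 7064 MOD 11-2 algorithm; the sketch below applies it. The function names are hypothetical, and the check only confirms that an iD is well formed, not that it is registered to anyone.

```python
import re


def orcid_check_character(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the 0000-0000-0000-000X layout and the trailing check character."""
    if not re.fullmatch(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]", orcid):
        return False
    compact = orcid.replace("-", "")
    return orcid_check_character(compact[:15]) == compact[15]


# 0000-0002-1825-0097 is the sample iD used in ORCID's own documentation.
print(is_valid_orcid("0000-0002-1825-0097"))
```

A check like this can catch typos when collecting author identifiers in a repository or submission form.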
  49. 49. Wikidata – fast growing 12/17/18 BigDIA 49 • Get on board with developments in schema.org, knowledge graphs, etc… as part of the rule rather than the exception • Provide metadata and opinion for data we produce or use
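As a minimal sketch of what "providing metadata" in schema.org terms can look like, the snippet below builds a JSON-LD description of a data set. `Dataset`, `name`, `description`, `identifier`, `license`, `creator` and `url` are real schema.org terms; every value shown is a placeholder invented for illustration.

```python
import json

# Hypothetical data set description; all values are placeholders.
dataset_metadata = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example ozone-forest sensor readings",
    "description": "Placeholder description of a data set produced by a lab.",
    "identifier": "doi:10.0000/example",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "url": "https://example.org/datasets/ozone-forest",
    "creator": {"@type": "Organization", "name": "Example Lab"},
}

# Serialize as JSON-LD, e.g. for embedding in a data set's landing page.
jsonld = json.dumps(dataset_metadata, indent=2)
print(jsonld)
```

Markup of this kind is what lets search engines and knowledge graphs pick up a data set without a custom integration.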
  50. 50. Let me summarize: How do we address the interdisciplinary divide? • Promote the fourth paradigm • Work within your institutions to promote data science as an interdisciplinary field • Establish an open and integrated environment for data and analytics • Be patient and do not oversell … 12/17/18 BigDIA 50
  51. 51. 12/17/18 BigDIA 51 Haas & Schmidt 2018 http://iswc2018.semanticweb.org/workshops-tutorials/#ekg
  52. 52. Acknowledgements 12/17/18 BigDIA 52 The BD2K Team at NIH The 150 folks who have passed through my laboratory https://docs.google.com/spreadsheets/d/1QZ48UaKcwDl_iFCvBmJsT03FK-bMchdfuIHe9Oxc-rw/edit#gid=0
  53. 53. Thank You peb6a@virginia.edu 12/17/18 BigDIA 53