
Data Science, Data Curation, and Human-Data Interaction


Data science remains a high-touch activity, especially in life, physical, and social sciences. Data management and manipulation tasks consume too much bandwidth: Specialized tools and technologies are difficult to use together, issues of scale persist despite the Cambrian explosion of big data systems, and public data sources (including the scientific literature itself) suffer curation and quality problems.

Together, these problems motivate a research agenda around “human-data interaction”: understanding and optimizing how people use and share quantitative information.

I’ll describe some of our ongoing work in this area at the University of Washington eScience Institute.

In the context of the Myria project, we're building a big data "polystore" system that can hide the idiosyncrasies of specialized systems behind a common interface without sacrificing performance. In scientific data curation, we are automatically correcting metadata errors in public data repositories with cooperative machine learning approaches. In the Viziometrics project, we are mining patterns of visual information in the scientific literature using machine vision, machine learning, and graph analytics. In the VizDeck and Voyager projects, we are developing automatic visualization recommendation techniques. In graph analytics, we are working on parallelizing best-of-breed graph clustering algorithms to handle multi-billion-edge graphs.

The common thread in these projects is the goal of democratizing data science techniques, especially in the sciences.

Published in: Data & Analytics


  1. 1. Data Science, Data Curation, Human-Data Interaction Bill Howe, Ph.D. Associate Professor, Information School Adjunct Associate Professor, Computer Science & Engineering Associate Director and Senior Data Science Fellow, eScience Institute 7/26/2016 Bill Howe, UW 1
  2. 2. Dave Beck, Director of Research, Life Sciences (Ph.D., Medicinal Chemistry, Biomolecular Structure & Design); Jake VanderPlas, Director of Research, Physical Sciences (Ph.D., Astronomy); Valentina Staneva, Data Scientist (Ph.D., Applied Mathematics and Statistics); Ariel Rokem, Data Scientist (Ph.D., Neuroscience); Andrew Gartland, Research Scientist (Ph.D., Biostatistics); Bryna Hazelton, Research Scientist (Ph.D., Physics); Bernease Herman, Data Scientist (B.S., Statistics; previously SE at Amazon); Vaughn Iverson, Research Scientist (Ph.D., Oceanography); Rob Fatland, Director of Cloud and Data Solutions, Senior Data Science Fellow (Ph.D., Geophysics); Joe Hellerstein, Senior Data Science Fellow (IBM Research, Microsoft Research, Google, ret.); Brittany Fiore-Gartland, Ethnographer, Director of Ethnography (Ph.D., Communication). Data Scientists, Research Scientists, Research Faculty, Cyberinfrastructure
  3. 3. What is the rate-limiting step in data understanding? Processing power: Moore’s Law. Amount of data in the world. [charts: processing power vs. time, and amount of data in the world vs. time]
  4. 4. What is the rate-limiting step in data understanding? Human cognitive capacity, not processing power (Moore’s Law) or the amount of data in the world. Idea adapted from “Less is More” by Bill Buxton (2001). Slide src: Cecilia Aragon, UW HCDE
  5. 5. How much time do you spend “handling data” as opposed to “doing science”? Mode answer: “90%” 7/26/2016 Bill Howe, UW 5
  6. 6. 7/26/2016 Bill Howe, UW 8 Goal: Understand and optimize how people use and share quantitative information “Human-Data Interaction”
  7. 7. The SQLShare Corpus: A multi-year log of hand-written SQL queries. Queries: 24,275; Views: 4,535; Tables: 3,891; Users: 591. SIGMOD 2016. Shrainik Jain
  8. 8. lifetime = days between first and last access of table SIGMOD 2016 Shrainik Jain Data “Grazing”: Short dataset lifetimes
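The “lifetime” metric above can be computed directly from an access log. A minimal sketch, assuming a hypothetical log of (table, access-date) pairs rather than the actual SQLShare schema:

```python
# Sketch: per-table "lifetime" = days between first and last access.
# The log shape and table names below are illustrative, not SQLShare's.
from datetime import date

def table_lifetimes(log):
    """log: iterable of (table_name, access_date) pairs."""
    first, last = {}, {}
    for table, day in log:
        if table not in first or day < first[table]:
            first[table] = day
        if table not in last or day > last[table]:
            last[table] = day
    return {t: (last[t] - first[t]).days for t in first}

log = [
    ("sensor_readings", date(2015, 3, 1)),
    ("sensor_readings", date(2015, 3, 4)),
    ("taxa", date(2015, 3, 1)),
]
print(table_lifetimes(log))  # {'sensor_readings': 3, 'taxa': 0}
```

A histogram of these values is what reveals the short-lifetime “grazing” pattern.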
  9. 9. MYRIA: POLYSTORE MGMT Human-Data Interaction 7/26/2016 Bill Howe, UW 18
  10. 10. Modern Big Data Ecosystems: many different platforms, complex analytics
  11. 11. Myria Algebra: Tables, KeyVals, Arrays, Graphs. RACO: Relational Algebra COmpiler
  12. 12. Spark Accumulo CombBLAS GraphX Parallel Algebra Logical Algebra RACO Relational Algebra COmpiler CombBLAS API Spark API Accumulo Graph API rewrite rules Array Algebra MyriaL Services: visualization, logging, discovery, history, browsing Orchestration
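The rewrite rules in the RACO stack can be illustrated with a toy plan rewrite. The nested-tuple plan encoding and the assumption that the predicate touches only the join's left input are invented for illustration; this is not RACO's actual representation:

```python
# Toy sketch of an algebraic rewrite rule: logical plans as nested tuples,
# with one rule pushing a selection below a join. A real compiler (like
# RACO targeting Spark, Accumulo, CombBLAS, ...) applies many such rules.
def push_select_below_join(plan):
    # ("select", pred, ("join", L, R)) -> ("join", ("select", pred, L), R)
    # (only valid when pred references the left input alone)
    if plan[0] == "select" and plan[2][0] == "join":
        pred, (_, left, right) = plan[1], plan[2]
        return ("join", ("select", pred, left), right)
    return plan

plan = ("select", "x > 0", ("join", "R", "S"))
print(push_select_below_join(plan))
# ('join', ('select', 'x > 0', 'R'), 'S')
```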
  13. 13. 7/26/2016 Bill Howe, UW 22 ISMIR 2016
  14. 14. SeaFlow. [instrument diagram: laser, microscope objective, pinhole lens, nozzle, d1, d2, forward scatter (FSC), orange and red fluorescence] Francois Ribalet, Jarred Swalwell, Ginger Armbrust
  15. 15. 7/26/2016 Bill Howe, UW 25
  16. 16. Ashes CAMHD
  17. 17. Extract synchronized slices Co-register (camera jitter, bad time synch) Separate fore- and back-ground Classify critters in the foreground Measure growth rate over time
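The fore-/background separation step in the pipeline above can be sketched with a running-median background model. The toy 1-D “frames” and the threshold are illustrative assumptions; real CAMHD data is 2-D video:

```python
# Hedged sketch of foreground/background separation via a per-pixel
# median background model. Frames here are toy 1-D pixel rows.
def background_model(frames):
    # per-pixel median across time approximates the static background
    n = len(frames[0])
    return [sorted(f[i] for f in frames)[len(frames) // 2] for i in range(n)]

def foreground_mask(frame, background, threshold=10):
    # a pixel is foreground if it deviates strongly from the background
    return [abs(p - b) > threshold for p, b in zip(frame, background)]

frames = [[5, 5, 5, 5], [5, 50, 5, 5], [5, 5, 5, 5]]
bg = background_model(frames)
print(foreground_mask(frames[1], bg))  # [False, True, False, False]
```

The foreground masks then feed the classification and growth-measurement steps.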
  18. 18. “DEEP” CURATION Human-Data Interaction
  19. 19. Microarray experiments
  20. 20. 7/26/2016 Bill Howe, UW 33 Microarray samples submitted to the Gene Expression Omnibus Curation is fast becoming the bottleneck to data sharing Maxim Gretchkin Hoifung Poon
  21. 21. color = labels supplied as metadata clusters = 1st two PCA dimensions on the gene expression data itself Can we use the expression data directly to curate algorithmically? Maxim Gretchkin Hoifung Poon The expression data and the text labels appear to disagree
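The disagreement check described on this slide can be sketched with synthetic data: project expression profiles onto the first two principal components and see whether metadata labels track the clusters. Tissue names, sample counts, and dimensions below are stand-ins, not real GEO data:

```python
# Sketch: PCA on (synthetic) expression data to test whether samples
# cluster by tissue. Real GEO samples have thousands of genes.
import numpy as np

def pca_2d(X):
    Xc = X - X.mean(axis=0)
    # right singular vectors of the centered data are the principal axes
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

rng = np.random.default_rng(0)
brain = rng.normal(0.0, 0.1, size=(20, 5))   # stand-in "brain" profiles
liver = rng.normal(5.0, 0.1, size=(20, 5))   # stand-in "liver" profiles
X = np.vstack([brain, liver])
coords = pca_2d(X)
# samples with the same (correct) label should land together on PC1;
# mislabeled samples would sit in the wrong cluster
gap = abs(coords[:20, 0].mean() - coords[20:, 0].mean())
print(gap > 1.0)  # True: the two tissues separate cleanly
```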
  22. 22. Maxim Gretchkin, Hoifung Poon. Better tissue-type labels from domain knowledge (ontology), expression data, and free-text metadata, via two deep networks (text, expr) and an SVM
  23. 23. Deep Curation Maxim Gretchkin Hoifung Poon Distant supervision and co-learning between a text-based classifier and an expression-based classifier: both models improve by training on each other’s results. Free-text classifier Expression classifier
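The co-learning loop can be sketched with toy one-dimensional “views” and threshold classifiers standing in for the two deep networks; everything below is an illustrative assumption, not the actual Deep Curation model:

```python
# Co-training sketch: the text-view classifier's predictions supervise
# the expression-view classifier. Toy 1-D views; real system uses two
# deep networks plus an ontology.
def train(view_vals, labels):
    # threshold classifier: midpoint between the two class means
    m0 = sum(v for v, y in zip(view_vals, labels) if y == 0) / max(1, labels.count(0))
    m1 = sum(v for v, y in zip(view_vals, labels) if y == 1) / max(1, labels.count(1))
    t = (m0 + m1) / 2
    return lambda v: int(v > t)

# two views of the same samples
text_view = [0.1, 0.2, 0.9, 1.0]
expr_view = [0.0, 0.3, 0.8, 1.1]
seed_labels = [0, 0, 1, 1]          # distant supervision from metadata

text_clf = train(text_view, seed_labels)
pseudo = [text_clf(v) for v in text_view]   # pseudo-labels from the text view
expr_clf = train(expr_view, pseudo)
print([expr_clf(v) for v in expr_view])  # [0, 0, 1, 1]
```

In the full system this exchange runs in both directions until the two classifiers agree.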
  25. 25. Observations • Figures in the literature are the currency of scientific ideas • Almost entirely unexplored • Our thought: Mine patterns in the visual literature
  26. 26. Step 1: Dismantling Composite Figures Poshen Lee ICPRAM 2015
  27. 27. Step 2: Classification • Divide images into small patches • Take a random sample • Run k-means on samples (k = 200) • For each figure in training set, generate a length-200 feature vector by similarity to clusters. Train a model. • For each test image, create the vector and classify by the model
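The featurization steps above amount to a bag-of-visual-words model. A minimal sketch with k reduced from 200 to 3 and hand-picked stand-in centroids (the real pipeline learns them with k-means over sampled patches):

```python
# Bag-of-visual-words sketch: assign each image patch to its nearest
# cluster center and build a histogram feature vector for the figure.
# Centroids here are hand-picked stand-ins for learned k-means centers.
def nearest(patch, centroids):
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(patch, centroids[i])))

def feature_vector(patches, centroids):
    hist = [0] * len(centroids)
    for p in patches:
        hist[nearest(p, centroids)] += 1
    return hist

centroids = [(0, 0), (1, 1), (0, 1)]
patches = [(0.1, 0.0), (0.9, 1.0), (0.9, 0.9), (0.0, 1.1)]
print(feature_vector(patches, centroids))  # [1, 2, 1]
```

The resulting length-k vectors are what the classifier is trained and evaluated on.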
  28. 28. Do high-impact papers have fewer equations, as indicated by Fawcett and Higginson? (Yes) Poshen Lee, Jevin West. [chart: high-impact papers vs. low-impact papers]
  29. 29. Do high-impact papers have more diagrams? (Yes) Poshen Lee, Jevin West
  30. 30. Do papers in top journals tend to involve more or less visual information? (More) Poshen Lee, Jevin West
  31. 31. 7/26/2016 Poshen Lee, UW 52
  32. 32. 7/26/2016 Poshen Lee, UW 53. Burrows-Wheeler Alignment / Computation / DNA Sequencing (Citations: 7,807, +11 since 2016; Eigenfactor: 0.0000574719). DNA Methylation / Brain Cancer / Chromosomal Aberrations / Cancer Genome Atlas (Citations: 2,094, +7 since 2016; Eigenfactor: 0.0000279023). Memory-efficient Computation / DNA Sequencing (Citations: 7,459, +17 since 2016; Eigenfactor: 0.0000875579). Molecular biology / Genetics / Genomics / DNA (Citations: 3,766, +15 since 2016; Eigenfactor: 0.0000183255)
  33. 33. INFORMATION EXTRACTION FROM FIGURES Information-critical figures Metabolic pathway diagrams Phylogenetic heat maps Architecture diagrams
  34. 34. Sean Yang
  35. 35. Normalize Sean Yang
  36. 36. Corner Detection Line Detection
  37. 37. Extract Tree Structure Sean Yang
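The corner-detection step of the tree-extraction pipeline can be sketched on a binary line drawing: mark pixels where a horizontal stroke meets a vertical one. A real implementation would likely use a standard detector (e.g. Harris corners); this toy version is only illustrative:

```python
# Sketch: corners in a binary line drawing are pixels with both a
# horizontal and a vertical on-pixel neighbor (e.g. branch junctions
# in a phylogenetic tree figure).
def corners(img):
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        for x in range(w):
            if not img[y][x]:
                continue
            horiz = (x + 1 < w and img[y][x + 1]) or (x > 0 and img[y][x - 1])
            vert = (y + 1 < h and img[y + 1][x]) or (y > 0 and img[y - 1][x])
            if horiz and vert:
                out.append((y, x))
    return out

# an "L"-shaped stroke: the bend is at (2, 0)
img = [[1, 0, 0],
       [1, 0, 0],
       [1, 1, 1]]
print(corners(img))  # [(2, 0)]
```

Detected corners and line segments are then assembled into the tree topology.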
  38. 38. VISUALIZATION RECOMMENDATION 7/26/2016 Bill Howe, UW 59
  39. 39. 60
  40. 40. Example of a Learned Rule (1): low x-entropy => bad scatter plot. 7/26/2016 Bill Howe, UW 61. [examples: good scatter plot vs. bad scatter plot]
  41. 41. Example of a Learned Rule (3) 63 high x-periodicity => timeseries plot (periodicity = 1 / variance in gap length between successive values)
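The periodicity feature as defined on the slide can be computed directly; the epsilon guard for perfectly regular spacing is an added assumption:

```python
# Sketch of the slide's periodicity feature:
# periodicity = 1 / variance of gaps between successive x values.
# A tiny epsilon guards the perfectly regular (zero-variance) case.
def periodicity(xs, eps=1e-9):
    xs = sorted(xs)
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    mean = sum(gaps) / len(gaps)
    var = sum((g - mean) ** 2 for g in gaps) / len(gaps)
    return 1.0 / (var + eps)

regular = [0, 1, 2, 3, 4, 5]        # evenly spaced: timeseries-like x axis
irregular = [0, 0.1, 2.5, 2.6, 7]   # scattered x values
print(periodicity(regular) > periodicity(irregular))  # True
```

High values of this feature push the recommender toward a timeseries plot.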
  42. 42. Voyager 7/26/2016 Bill Howe, UW 64. Kanit “Ham” Wongsuphasawat, Dominik Moritz, Jeff Heer, Jock Mackinlay, Anushka Anand. InfoVis 2015
  43. 43. SCALABLE GRAPH CLUSTERING 7/26/2016 Bill Howe, UW 65
  44. 44. Seung-Hee Bae. Scalable Graph Clustering. Version 1: parallelize the best-known serial algorithm (ICDM 2013). Version 2: a free 30% improvement for any algorithm (TKDD 2014). Version 3: distributed approximation algorithm, 1.5B edges (SC 2015)
  45. 45. Recap • “Human-Data Interaction” is the bottleneck! – SQLShare: Mining SQL logs to uncover user behavior – Myria/RACO: Polystore Optimization – Deep Curation: Zero-training labeling of scientific datasets – Viziometrics: Mining the scientific literature – Voyager: Visualization Recommendation – GossipMap: Scalable Graph Clustering
  46. 46. Voyager @billghowe github: billhowe
  47. 47. • OCCs: Big Data / Database researcher with broad impact and expertise in research data management • Democratizing Data Science – Ourselves: Reduce overhead in attention-scarce regimes – Other fields: Reduce overhead of interdisciplinary research – The public: Reduce overhead of communicating with the public and policymakers • SQLShare – Why? What? Impact? – Key: RDM, NSF-funded, hundreds of users – Are these workloads any different than a typical database? • HaLoop – Why? What? Impact? – Key: Papers, new subfield in big data • Myria – Why? What? Impact? – Key: Funding • Viziometrics – Why? What? Impact? • Data Curation through an Algorithmic Lens – Why? What? Impact? – Volume, variety, velocity. Volume: tasks that scale with the number of records: movement, validation. Variety: tasks that scale with the number of datasets: metadata attachment, cataloging, metadata verification. Velocity: tasks that scale with the time since release. Data journalism, legal cases – Example? Maxim’s work. Prevalence of missing and incorrect labels. – Is this dataset what it says it is? – Why? Reproducibility crisis – Is this fully automatic? No. Training data, computational steering
  48. 48. • guide--librarians-as-fact-checkers-innovation- 722.php?page_id=167
  51. 51. • Available – Can you get it if you know where to look? • Discoverable – Can you get it if you don’t know where to look? • Manipulable – What can you do with it, besides download it? Can the structure be readily parsed and transformed? • Interpretable – Is the information internally consistent with respect to provenance, metadata, column names, etc.? • Contextualizable – Is the information externally consistent with respect to other related datasets? Can it be connected to other datasets through standards or conventions? Does it admit connections to other datasets?
  52. 52. Services emphasizing discovery, citation, and preservation
  53. 53. Query, Viz, and Analytics Services Google Fusion Tables
  54. 54. Predict, Download, Query, Join, Visualize. url, doi, tags, space and time, ontologies, standards
  55. 55. Server software, locally installed
  56. 56. • ISMIR paper
  57. 57. • Allen Institute example: • flexibility gap between high level and low level – Domain-specific languages http://casestudies.brain- -dashboards-from-jupyter-notebooks/
  58. 58. What is the rate-limiting step in data understanding? Processing power: Moore’s Law. Amount of data in the world. [charts: processing power vs. time, and amount of data in the world vs. time]
  59. 59. What is the rate-limiting step in data understanding? Human cognitive capacity, not processing power (Moore’s Law) or the amount of data in the world. Idea adapted from “Less is More” by Bill Buxton (2001). Slide src: Cecilia Aragon, UW HCDE
  60. 60. A Typical Data Science Workflow 1) Preparing to run a model 2) Running the model 3) Interpreting the results Gathering, cleaning, integrating, restructuring, transforming, loading, filtering, deleting, combining, merging, verifying, extracting, shaping, massaging “80% of the work” -- Aaron Kimball “The other 80% of the work”
  61. 61. How much time do you spend “handling data” as opposed to “doing science”? Mode answer: “90%” 7/26/2016 Bill Howe, UW 93