Science Data, Responsibly

Dagstuhl talk on Data Science Education and issues around data responsibility in science contexts.

Published in: Data & Analytics


  1. 1. Data Ethics in Data Science Education (plus: Science Data, Responsibly) Bill Howe University of Washington
  2. 2. Plan • context: eScience Institute (1 min) • context: Data Science MOOC (3 min) • Vignette on Teaching Data Ethics (5 min) • Science Data, Responsibly (6 min) – Automated Curation – Viziometrics 9/25/2016 Data, Responsibly @ Dagstuhl 2
  3. 3. • People – Research Staff (~4 100% Data Scientists, ~4 50% Research Scientists) – Postdocs (~12 at steady state) – Faculty (~9 Exec Committee, ~20 Steering Committee, ~100 Affiliates) – Administrative Staff (Program Managers, Finance, Admin) • Programs – Short- and long-term research, education programs (ugrad/masters/PhD), software, research consulting – Leadership on all things data science around campus • Funding – $700k / yr permanent appropriation from the state of WA – $32.8M for 5 years jointly with NYU and UC Berkeley from the Gordon and Betty Moore Foundation and the Alfred P. Sloan Foundation to build a “Data Science Environment” – $9M for 5 years from the Washington Research Foundation – $500k / yr from the Provost for half-lines for recruiting in relevant fields
  4. 4. 9/25/2016 Bill Howe, UW 4
  5. 5. Data Science Education • Audiences: Students (CS/Informatics: undergrads, grads; Non-Major: undergrads, grads) and Non-Students (professionals, researchers) • Programs: (2011) Data Science Certificate; (2013) Data Science MOOC; (2013) NSF IGERT Big Data PhD; (2013) New CS Courses; (2016) Data Science Masters; (2015) Data Sci. for Social Good • Data Ethics being incorporated in all programs
  6. 6. Introduction to Data Science MOOC on Coursera • Session 1, Spring 2013: 119,504 students • Session 2, Summer 2014: 121,215 students
  7. 7. Participation numbers • “Registered”: 119,517 (totally irrelevant) • Clicked play in first 2 weeks: 78,589 • Turned in 1st homework: 10,663 • Completed all assignments: ~9000 (typical for a MOOC) • “Passed”: 7022 • Forum threads: 4661 • Forum posts: 22,900 Fairly consistent with Coursera data across “hard” courses. Define success however you want – Many love it in parts, start late, don’t turn in homework, etc. – Learning rather than watching television
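
The participation numbers on the slide above form a funnel, and the drop-off between stages is easier to see as percentages. The following is a minimal sketch of that arithmetic using only the exact counts quoted on the slide; the names `funnel` and `retention` are illustrative, and the approximate "~9000" figure is left out because it is not an exact count.

```python
# Participation counts quoted on the slide (Session 1 of the MOOC).
funnel = [
    ("Registered", 119_517),
    ("Clicked play in first 2 weeks", 78_589),
    ("Turned in 1st homework", 10_663),
    ("Passed", 7_022),
]


def retention(stages):
    """Return (label, count, share_of_first, share_of_previous) for each stage."""
    first = stages[0][1]
    previous = first
    rows = []
    for label, count in stages:
        rows.append((label, count, count / first, count / previous))
        previous = count
    return rows


for label, count, of_first, of_prev in retention(funnel):
    print(f"{label:32s} {count:>8,d}  {of_first:6.1%} of registered  "
          f"{of_prev:6.1%} of previous stage")
```

Read this way, the largest single drop is between pressing play and turning in the first homework, which fits the slide's remark that the registration count says little on its own.
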
  8. 8. Syllabus • Data Science Landscape (~1 week) • Data Manipulation at Scale – Relational Databases (~1 week) – MapReduce (~1 week) – NoSQL (~1 week) • Analytics – Statistics Topics (~1 week) – Machine Learning Topics (~2 weeks) • Visualization (~1 week) • Graph Analytics (~1 week)
  9. 9. 2015: MOOC Recast as a 4-course “Specialization” • Data Manipulation at Scale: Databases, Systems, Algorithms • Practical Predictive Analytics: Stats (resampling methods, multiple hypothesis testing, more), ML (rules/trees/forests, ensembles/boosting/bagging, SVMs, GD, eval…) • Communicating Data Science: Visualization, ethics and privacy • Capstone
  10. 10. VIGNETTE ON TEACHING DATA ETHICS
  11. 11. Alcohol Study, Barrow, Alaska, 1979: Native leaders and city officials, worried about drinking and associated violence in their community, invited a group of sociology researchers to assess the problem and work with them to devise solutions.
  12. 12. Methods • 10% representative sample (N=88) of everyone over the age of 15 using a 1972 demographic survey • Interviewed on attitudes and values about use of alcohol • Obtained psychological histories including drinking behavior • Given the Michigan Alcoholism Screening Test (Seltzer, 1971) • Asked to draw a picture of a person – Used to determine cultural identity
  13. 13. Results announced unilaterally and publicly: At the conclusion of the study, researchers formulated a report entitled “The Inupiat, Economics and Alcohol on the Alaskan North Slope”, which was released simultaneously at a press conference and to the Barrow community. The press release was picked up by the New York Times, which ran a front-page story entitled “Alcohol Plagues Eskimos”.
  14. 14. Backlash: The results of the Barrow Alcohol Study in Alaska were revealed in the context of a press conference that was held far from the Native village, and without the presence, much less the knowledge or consent, of any community member who might have been able to present any context concerning the socioeconomic conditions of the village. Study results suggested that nearly all adults in the community were alcoholics. In addition to the shame felt by community members, the town’s Standard & Poor’s bond rating suffered as a result, which in turn decreased the tribe’s ability to secure funding for much needed projects.
  15. 15. Methodological Problems “The authors once again met with the Barrow Technical Advisory Group, who stated their concern that only Natives were studied, and that outsiders in town had not been included.” “The estimates of the frequency of intoxication based on association with the probability of being detained were termed ‘ludicrous, both logically and statistically.’” Edward F. Foulks, M.D., “Misalliances in the Barrow Alcohol Study”
  16. 16. Ethical Problems • Participants were not in control of their data nor the context in which they were presented. • Easy to demonstrate specific, significant harms: – Social: Stigmatization – Financial: Bond rating lowered • Important: Nothing to do with individual privacy – No PII revealed at any point, to anyone – No violations of best practices in data handling – But even those who did not participate in the study incurred harm
  17. 17. Two Topics • Social Component: Codes of Conduct • Technical Component: Managing Sensitive Data
  18. 18. Ethical principles vs. ethical rules • In the Barrow example, ethical rules were generally followed • But ethical principles were violated: The researchers appear to have placed their own interests ahead of those of the research subjects, the client, and society
  19. 19. Principles: Codes of Conduct • American Statistical Association – http://www.amstat.org/committees/ethics/ • Certified Analytics Professional – https://www.certifiedanalytics.org/ethics.php • Data Science Association – http://www.datascienceassn.org/code-of-conduct.html
  20. 20. SCIENCE DATA, RESPONSIBLY
  21. 21. Science is a complete mess • Reproducibility – Begley & Ellis, Nature 2012: 6 out of 53 cancer studies reproducible – Only about half of 100 psychology studies had effect sizes that approximated the original result (Science, 2015) – Ioannidis 2005: Why Most Published Research Findings Are False – Reinhart & Rogoff: global economic policy based on spreadsheet fuck ups
  22. 22. Science, 2015
  23. 23. Retractions are increasing…
  24. 24. Science is a complete mess • Reproducibility – Begley & Ellis, Nature 2012: 6 out of 53 cancer studies reproducible – Only about half of 100 psychology studies had effect sizes that approximated the original result (Science, 2015) – Ioannidis 2005: Why Most Published Research Findings Are False – Reinhart & Rogoff: global economic policy based on spreadsheet fuck ups • Fraud – Diederik Stapel: 38 articles with fictitious data – Bharat Aggarwal: a huge number of images with evidence of manipulation
  25. 25. Bharat Aggarwal alleged data manipulation
  26. 26. Science is a complete mess • Reproducibility – Begley & Ellis, Nature 2012: 6 out of 53 cancer studies reproducible – Only about half of 100 psychology studies had effect sizes that approximated the original result (Science, 2015) – Ioannidis 2005: Why Most Published Research Findings Are False – Reinhart & Rogoff: global economic policy based on spreadsheet fuck ups • Fraud – Diederik Stapel: 38 articles with fictitious data – Bharat Aggarwal: a huge number of images with evidence of manipulation • Public Trust – Churn: Chocolate, egg yolks, red meat, red wine, etc. – Climate change, vaccines
  27. 27. Vision: Validate scientific claims automatically – Check for manipulation (manipulated images, Benford’s Law) – Extract claims from papers – Check claims against the authors’ data – Check claims against related data sets – Automatic meta-analysis across the literature + public datasets • First steps – Automatic curation: Validate and attach metadata to public datasets – Longitudinal analysis of the visual literature
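
The slide above names Benford's Law as one manipulation check. As a rough illustration of what such a check could look like (not the speaker's implementation), the sketch below compares the first-digit distribution of a list of numbers with the Benford distribution P(d) = log10(1 + 1/d) using a chi-square statistic. The function names and the two demo datasets are assumptions made for the example, and a large statistic is only a prompt for closer inspection, not evidence of fraud.

```python
import math
from collections import Counter

# Benford's Law: P(first digit = d) = log10(1 + 1/d) for d = 1..9.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# 95th percentile of the chi-square distribution with 8 degrees of freedom.
CRITICAL_5PCT_DF8 = 15.507


def leading_digit(x):
    """First significant decimal digit of a nonzero number."""
    if x == 0:
        raise ValueError("zero has no leading digit")
    # In scientific notation the first character is the leading digit.
    return int(f"{abs(x):.15e}"[0])


def benford_chi_square(values):
    """Chi-square statistic of observed first digits against Benford's Law."""
    counts = Counter(leading_digit(v) for v in values if v != 0)
    n = sum(counts.values())
    return sum((counts[d] - n * p) ** 2 / (n * p) for d, p in BENFORD.items())


# Demo data (illustrative only): powers of two are known to follow Benford's
# Law closely, while the integers 100..999 have uniform first digits.
powers_of_two = [2 ** k for k in range(1, 201)]
uniform_digits = list(range(100, 1000))

for name, data in [("powers of two", powers_of_two), ("100..999", uniform_digits)]:
    stat = benford_chi_square(data)
    verdict = "consistent with" if stat < CRITICAL_5PCT_DF8 else "deviates from"
    print(f"{name}: chi-square = {stat:.2f}, {verdict} Benford's Law at the 5% level")
```

Benford's Law only applies to data spanning several orders of magnitude, so a check like this would need to be restricted to quantities where the law is expected to hold.
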
  28. 28. “DEEP” CURATION Science Data, Responsibly
  29. 29. Microarray experiments
  30. 30. Microarray samples submitted to the Gene Expression Omnibus: Curation is fast becoming the bottleneck to data sharing (Maxim Grechkin, Hoifung Poon)
  31. 31. Can we use the expression data directly to curate algorithmically? The expression data and the text labels appear to disagree • color = labels supplied as metadata • clusters = 1st two PCA dimensions on the gene expression data itself (Maxim Grechkin, Hoifung Poon)
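
The slide above plots samples on the first two PCA dimensions of the expression data and colors them by their metadata label. The sketch below shows the same idea on a tiny made-up dataset: it computes two principal components with plain power iteration, then flags samples whose label disagrees with the majority label on their side of the first component. The data, the labels, and every function name are hypothetical, and the flagging rule is a simplification chosen for the example rather than the method used in the talk.

```python
import math
from collections import Counter


def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_component(cov, iterations=500):
    """Largest eigenpair by power iteration (fixed iteration count, no convergence check)."""
    d = len(cov)
    v = [1.0 / (i + 1) for i in range(d)]  # arbitrary fixed start vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return eigenvalue, v


def pca_2d(rows):
    """Project rows onto their first two principal components."""
    x = center(rows)
    cov = covariance(x)
    lam1, pc1 = top_component(cov)
    d = len(cov)
    # Deflation: remove the first component, then find the next one.
    deflated = [[cov[i][j] - lam1 * pc1[i] * pc1[j] for j in range(d)] for i in range(d)]
    _, pc2 = top_component(deflated)
    return [(sum(a * b for a, b in zip(r, pc1)), sum(a * b for a, b in zip(r, pc2)))
            for r in x]


# Made-up expression values for 6 samples and 3 genes (illustrative only).
expression = [
    (9.0, 1.0, 5.0), (8.5, 1.5, 5.2), (9.2, 0.8, 4.9),
    (1.0, 9.0, 5.1), (1.2, 8.7, 4.8), (0.8, 9.1, 5.0),
]
# Hypothetical metadata labels; the third one is deliberately wrong.
labels = ["liver", "liver", "brain", "brain", "brain", "brain"]

coords = pca_2d(expression)
sides = [c[0] > 0 for c in coords]
majority = {}
for side in (True, False):
    votes = Counter(lab for lab, s in zip(labels, sides) if s == side)
    majority[side] = votes.most_common(1)[0][0]
suspects = [i for i, (lab, s) in enumerate(zip(labels, sides)) if lab != majority[s]]
print("samples whose label disagrees with their expression cluster:", suspects)
```
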
  32. 32. Better Tissue Type Labels [diagram labels: Domain knowledge (Ontology); Expression data; Free-text Metadata; 2 Deep Networks (text, expr); SVM] (Maxim Grechkin, Hoifung Poon)
  33. 33. Deep Curation: Distant supervision and co-learning between a text-based classifier and an expression-based classifier: both models improve by training on each other’s results. [diagram labels: Free-text classifier; Expression classifier] (Maxim Grechkin, Hoifung Poon)
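
The co-learning loop described on the slide above can be sketched in a few lines. This is a toy illustration under heavy assumptions, not the authors' system: the deck mentions deep networks, whereas the sketch swaps in a nearest-centroid classifier for the expression view and simple word counts for the text view. The ontology terms, sample data, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

ONTOLOGY_TERMS = {"liver", "brain"}  # hypothetical tissue ontology


def distant_labels(samples):
    """Distant supervision: label a sample only if its text names exactly one term."""
    labels = []
    for text, _ in samples:
        hits = ONTOLOGY_TERMS & set(text.lower().split())
        labels.append(hits.pop() if len(hits) == 1 else None)
    return labels


def train_centroids(samples, labels):
    """Expression view: mean expression vector per label (unlabeled samples skipped)."""
    sums, counts = {}, Counter()
    for (_, expr), label in zip(samples, labels):
        if label is None:
            continue
        acc = sums.setdefault(label, [0.0] * len(expr))
        for i, value in enumerate(expr):
            acc[i] += value
        counts[label] += 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}


def predict_expr(centroids, expr):
    """Label of the nearest centroid by squared Euclidean distance."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(expr, centroid))
    return min(centroids, key=lambda lab: dist2(centroids[lab]))


def train_words(samples, labels):
    """Text view: count how often each word co-occurs with each label."""
    model = defaultdict(Counter)
    for (text, _), label in zip(samples, labels):
        if label is not None:
            for word in set(text.lower().split()):
                model[word][label] += 1
    return model


def predict_text(model, text):
    """Label with the most word votes, or None if no known word appears."""
    votes = Counter()
    for word in set(text.lower().split()):
        votes.update(model.get(word, {}))
    return votes.most_common(1)[0][0] if votes else None


def co_learn(samples, rounds=3):
    """Alternate: each classifier is retrained on the other's latest predictions."""
    labels = distant_labels(samples)
    for _ in range(rounds):
        centroids = train_centroids(samples, labels)
        expr_labels = [predict_expr(centroids, expr) for _, expr in samples]
        words = train_words(samples, expr_labels)
        text_labels = [predict_text(words, text) for text, _ in samples]
        labels = [t if t is not None else e for t, e in zip(text_labels, expr_labels)]
    return centroids, words


# Made-up samples: (free-text metadata, expression vector).
samples = [
    ("liver biopsy", (9.0, 1.0)), ("hepatocyte culture", (8.5, 1.5)),
    ("hepatocyte prep", (9.2, 0.8)), ("brain cortex", (1.0, 9.0)),
    ("cortex slice", (1.2, 8.7)), ("sample 17", (0.8, 9.1)),
]
centroids, words = co_learn(samples)
print(predict_text(words, "hepatocyte lysate"))
print(predict_expr(centroids, (8.0, 2.0)))
```

In this toy run only two samples mention an ontology term directly, yet the text model ends up associating "hepatocyte" with liver because the expression model labeled those samples first. That is the sense in which each view supplies training signal to the other.
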
  34. 34. Deep Curation: Our stuff wins, with no training data [chart; axis label: amount of training data used; series: state of the art, our reimplementation of the state of the art, our dueling pianos NN] (Maxim Grechkin, Hoifung Poon)
  35. 35. VIZIOMETRICS: COMPREHENDING VISUAL INFORMATION IN THE SCIENTIFIC LITERATURE Human-Data Interaction
  36. 36. Step 1: Dismantling Composite Figures (Poshen Lee, ICPRAM 2015)
  37. 37. Do high-impact papers have fewer equations, as indicated by Fawcett and Higginson? (Yes) [chart: high-impact papers vs. low-impact papers] (Poshen Lee, Jevin West)
  38. 38. Do high-impact papers have more diagrams? (Yes) (Poshen Lee, Jevin West)
  39. 39. TEACHING DATA ETHICS IN DATA SCIENCE
  42. 42. Lectures • Data Science Context and Case Studies (~1 week) • Data Management at Scale – Relational Databases (~1 week) – MapReduce (~1 week) – NoSQL (~1 week) • Topics in Analytics – Permutation Methods, Bayesian Methods (~1 week) – Machine Learning Algorithms and Evaluation (~1 week) • Visualization (~1 week) • Graph Analytics (~1 week) • Guest Lectures
  43. 43. Who took the course?
  44. 44. Who took the course?
  45. 45. Who took the course? What programming language do you typically use?
  46. 46. 9/25/2016 Bill Howe, UW 59
  47. 47. 9/25/2016 Bill Howe, UW 60
  48. 48. Attrition, video lectures [chart: number of students watching videos by segment, ordered by time; y-axis 0 to 50,000, x-axis segments 1 to 96]
  49. 49. Attrition, assignments [chart: number of students completing assignments by part; y-axis 0 to 18,000; parts: Twitter1 to Twitter6, Database1 to Database9, MapReduce1 to MapReduce6, Kaggle, Tableau]
  50. 50. Who took the course? In a directory with 1000 text files, you are asked to create a list of files that contain the word Drosophila
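
For reference, one way to answer the survey question on the slide above in Python is sketched below. The function name, the `*.txt` pattern, and the demo files are assumptions for illustration; the slide does not say which tool respondents were expected to use.

```python
import tempfile
from pathlib import Path


def files_containing(directory, word, pattern="*.txt"):
    """Return the sorted names of matching files whose text contains `word`."""
    matches = []
    for path in sorted(Path(directory).glob(pattern)):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a file is not valid UTF-8.
        text = path.read_text(encoding="utf-8", errors="replace")
        if word in text:
            matches.append(path.name)
    return matches


# Demo on three throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("Drosophila melanogaster genome", encoding="utf-8")
    Path(tmp, "b.txt").write_text("Mus musculus genome", encoding="utf-8")
    Path(tmp, "c.txt").write_text("wild-type Drosophila larvae", encoding="utf-8")
    found = files_containing(tmp, "Drosophila")

print(found)
```

At 1000 files this runs comfortably on one machine, which is the contrast the next slide draws.
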
  51. 51. Who took the course? What if you were given a billion documents spread across many computers and asked to count the occurrences of a given phrase?
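
The question on the slide above is the classic motivation for MapReduce, which the syllabus covers. The sketch below simulates the pattern in a single process: a map step emits a count per document, each "machine" reduces its own shard, and the partial results are merged. The shard contents and function names are hypothetical, and a real deployment would run the map and reduce steps on separate workers through a framework rather than in one Python process.

```python
from collections import Counter
from functools import reduce


def map_count(document, phrase):
    """Map step: emit (phrase, non-overlapping occurrences) for one document."""
    return (phrase, document.count(phrase))


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


# Hypothetical shards: each inner list stands for the documents on one machine.
shards = [
    ["Drosophila melanogaster is a model organism", "no match here"],
    ["fruit fly, Drosophila melanogaster; Drosophila melanogaster again"],
]
phrase = "Drosophila melanogaster"

# Each machine reduces its own documents locally...
per_shard = [reduce_counts(map_count(doc, phrase) for doc in shard) for shard in shards]
# ...and the small partial results are merged into a global total.
total = reduce(lambda a, b: a + b, per_shard, Counter())

print(total[phrase])
```

Only the per-shard totals cross the network in this scheme, which is why the approach scales to documents spread over many computers.
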
  52. 52. “I left the company I co-founded in 2005 to do data analytics with Wibidata, with whom I was introduced as a result of their guest lecture in your course.
