Science Data, Responsibly
1. Data Ethics in Data Science Education
(plus: Science Data, Responsibly)
Bill Howe
University of Washington
2. Plan
• context: eScience Institute (1 min)
• context: Data Science MOOC (3 min)
• Vignette on Teaching Data Ethics (5 min)
• Science Data, Responsibly (6 min)
– Automated Curation
– Viziometrics
9/25/2016 Data, Responsibly @ Dagstuhl 2
3. • People
• Research Staff (~4 100% Data Scientists, ~4 50% Research Scientists)
• Postdocs (~12 at steady state)
• Faculty (~9 Exec Committee, ~20 Steering Committee, ~100 Affiliates)
• Administrative Staff (Program Managers, Finance, Admin)
• Programs
– Short- and long-term research, education programs (undergrad/master's/PhD),
software, research consulting
– Leadership on all things data science around campus
• Funding
• $700k / yr permanent appropriation from the state of WA
• $32.8M for 5 years jointly with NYU and UC Berkeley from the Gordon and
Betty Moore Foundation and the Alfred P. Sloan Foundation to build a “Data
Science Environment”
• $9M for 5 years from the Washington Research Foundation
• $500k / yr from the Provost for half-lines for recruiting in relevant fields
5. Data Science Education
9/25/2016 Bill Howe, UW 5
Students Non-Students
CS/Informatics Non-Major
professionals researchers
undergrads grads undergrads grads
(2011) Data Science Certificate
(2013) Data Science MOOC
(2013) NSF IGERT Big Data PhD
(2013) New CS Courses
(2016) Data Science Masters
(2015) Data Sci. for Social Good
Data Ethics being incorporated in all programs
6. Introduction to Data Science MOOC on Coursera
Session 1: Spring 2013, 119,504 students
Session 2: Summer 2014, 121,215 students
7. Participation numbers
• “Registered”: 119,517 (totally irrelevant)
• Clicked play in first 2 weeks: 78,589
• Turned in 1st homework: 10,663
• Completed all assignments: ~9,000 (typical for a MOOC)
• “Passed”: 7,022
• Forum threads: 4661
• Forum posts: 22,900
Fairly consistent with Coursera data across “hard” courses
Define success however you want
– Many love it in parts, start late, don’t turn in homework, etc.
– Learning rather than watching television
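The funnel above is easier to read as rates. A minimal sketch, using only the counts on the slide; the stage keys and the `rates` helper are illustrative names, and "~9000" is taken as 9,000:

```python
# Counts copied from the slide; the stage keys are illustrative labels.
funnel = {
    "registered": 119_517,
    "clicked_play_first_2_weeks": 78_589,
    "turned_in_first_homework": 10_663,
    "completed_all_assignments": 9_000,  # the slide gives "~9000"
    "passed": 7_022,
}


def rates(counts, baseline):
    """Return each stage as a fraction of the chosen baseline stage."""
    base = counts[baseline]
    return {stage: n / base for stage, n in counts.items()}


vs_registered = rates(funnel, "registered")
vs_first_homework = rates(funnel, "turned_in_first_homework")

for stage in funnel:
    print(f"{stage:28s} {vs_registered[stage]:7.1%} of registered, "
          f"{vs_first_homework[stage]:8.1%} of first-homework submitters")
```

Which baseline to use is the "define success however you want" point: the pass rate looks very different against registrations than against students who turned in the first homework.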
11. Alcohol Study, Barrow Alaska, 1979
Native leaders and city officials,
worried about drinking and associated
violence in their community, invited a
group of sociology researchers to
assess the problem and work with
them to devise solutions.
12. Methods
• 10% representative sample (N=88)
of everyone over the age of 15 using
a 1972 demographic survey
• Interviewed on attitudes and values
about use of alcohol
• Obtained psychological histories
including drinking behavior
• Given the Michigan Alcoholism
Screening Test (Seltzer, 1971)
• Asked to draw a picture of a person
– Used to determine cultural identity
13. Results announced unilaterally and publicly
At the conclusion of the study, researchers produced a report entitled “The
Inupiat, Economics and Alcohol on the Alaskan North Slope”, which was released
simultaneously in a press release and to the Barrow community. The press
release was picked up by the New York Times, which ran a front-page story
entitled “Alcohol Plagues Eskimos”.
14. Backlash
The results of the Barrow Alcohol Study in Alaska were revealed in the context of a
press conference that was held far from the Native village, and without the
presence, much less the knowledge or consent, of any community member who
might have been able to present any context concerning the socioeconomic
conditions of the village. Study results suggested that nearly all adults in the
community were alcoholics. In addition to the shame felt by community members,
the town’s Standard & Poor’s bond rating suffered as a result, which in turn
decreased the tribe’s ability to secure funding for much-needed projects.
15. Methodological Problems
“The authors once again met with the Barrow Technical Advisory
Group, who stated their concern that only Natives were studied,
and that outsiders in town had not been included.”
“The estimates of the frequency of intoxication based on
association with the probability of being detained were termed
‘ludicrous, both logically and statistically.’”
Edward F. Foulks, M.D., Misalliances In The Barrow Alcohol Study
16. Ethical Problems
• Participants were not in control of their data nor
the context in which they were presented.
• Easy to demonstrate specific, significant harms:
– Social: Stigmatization
– Financial: Bond rating lowered
• Important: Nothing to do with individual privacy
– No PII revealed at any point, to anyone
– No violations of best practices in data handling
– But even those who did not participate in the study
incurred harm
17. Two Topics
• Social Component: Codes of Conduct
• Technical Component: Managing Sensitive Data
18. Ethical principles vs. ethical rules
• In the Barrow example, ethical rules were
generally followed
• But ethical principles were violated: The
researchers appear to have placed their own
interests ahead of those of the research
subjects, the client, and society
19. Principles: Codes of Conduct
• American Statistical Association
– http://www.amstat.org/committees/ethics/
• Certified Analytics Professional
– https://www.certifiedanalytics.org/ethics.php
• Data Science Association
– http://www.datascienceassn.org/code-of-conduct.html
26. Science is a complete mess
• Reproducibility
– Begley & Ellis, Nature 2012: 6 out of 53 cancer studies reproducible
– Only about half of 100 psychology studies had effect sizes that approximated
the original result (Science, 2015)
– Ioannidis 2005: Why Most Published Research Findings Are False
– Reinhart & Rogoff: global economic policy based on spreadsheet fuck ups
• Fraud
– Diederik Stapel: 38 articles with fictitious data
– Bharat Aggarwal: a huge number of images with evidence of manipulation
• Public Trust
– Churn: Chocolate, egg yolks, red meat, red wine, etc.
– Climate change, vaccines
29. Vision: Validate scientific claims automatically
– Check for manipulation (manipulated images, Benford’s Law)
– Extract claims from papers
– Check claims against the authors’ data
– Check claims against related data sets
– Automatic meta-analysis across the literature + public datasets
• First steps
– Automatic curation: Validate and attach metadata to public datasets
– Longitudinal analysis of the visual literature
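One of the manipulation checks named above, Benford's Law, can be sketched in a few lines. This is an illustration only, not the system described in the talk: it compares the leading-digit distribution of a list of numbers with the Benford expectation using a chi-square statistic. The function names and the two demo datasets are assumptions made for the example.

```python
import math
from collections import Counter

# Benford's Law: P(leading digit = d) = log10(1 + 1/d) for d = 1..9.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Chi-square critical value for 8 degrees of freedom at the 5% level.
CHI2_CRITICAL_DF8 = 15.507


def leading_digit(x):
    """First significant digit of a nonzero number, read off scientific notation."""
    return int(f"{abs(x):.15e}"[0])


def benford_chi_square(values):
    """Chi-square distance between observed leading digits and Benford's Law."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    n = len(digits)
    return sum((counts.get(d, 0) - n * p) ** 2 / (n * p)
               for d, p in BENFORD.items())


# Demo data (hypothetical): powers of two follow Benford's Law closely,
# while evenly spread three-digit numbers do not.
powers_of_two = [2 ** k for k in range(1, 301)]
three_digit_ids = list(range(100, 1000))

for name, data in [("powers of two", powers_of_two),
                   ("three-digit ids", three_digit_ids)]:
    stat = benford_chi_square(data)
    verdict = "consistent with" if stat < CHI2_CRITICAL_DF8 else "deviates from"
    print(f"{name}: chi-square = {stat:.1f}, {verdict} Benford's Law")
```

A large statistic is a prompt for a closer look rather than proof of manipulation; many legitimate datasets (bounded measurements, identifiers) are not expected to follow Benford's Law at all.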
32. Microarray samples submitted to the Gene Expression Omnibus
Curation is fast becoming the bottleneck to data sharing
Maxim Gretchkin, Hoifung Poon
33. Can we use the expression data directly to curate algorithmically?
Maxim Gretchkin, Hoifung Poon
[Figure: color = labels supplied as metadata; clusters = 1st two PCA dimensions on the gene expression data itself]
The expression data and the text labels appear to disagree
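The projection behind a plot like this one can be sketched as follows: center the expression matrix, then project each sample onto the first two principal components. This is a minimal pure-Python sketch using power iteration with deflation; the tiny `samples` matrix and all function names are made up for illustration, and a real analysis would use a numerical library on the actual expression data.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_component(cov, iters=500, seed=0):
    """Dominant eigenvector and eigenvalue of cov via power iteration."""
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in cov]
    for _ in range(iters):
        w = [dot(row, v) for row in cov]
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    eigenvalue = dot(v, [dot(row, v) for row in cov])
    return v, eigenvalue


def pca_2d(rows):
    """Return 2-D scores for each row plus the two component vectors."""
    x = center(rows)
    cov = covariance(x)
    v1, l1 = top_component(cov)
    d = len(cov)
    # Deflation: remove the first component so the next one dominates.
    deflated = [[cov[i][j] - l1 * v1[i] * v1[j] for j in range(d)]
                for i in range(d)]
    v2, _ = top_component(deflated, seed=1)
    return [(dot(r, v1), dot(r, v2)) for r in x], (v1, v2)


# Hypothetical expression values: 6 samples x 3 genes, two obvious groups.
samples = [
    [1.0, 0.0, 5.0], [1.2, 0.1, 5.1], [0.9, -0.1, 4.9],
    [5.0, 0.0, 1.0], [5.1, 0.2, 1.1], [4.9, -0.2, 0.9],
]
scores, components = pca_2d(samples)
for label, (pc1, pc2) in zip("AAABBB", scores):
    print(label, round(pc1, 2), round(pc2, 2))
```

Coloring the resulting points by the submitted metadata label is what exposes the disagreement: samples that cluster together in expression space but carry different text labels are candidates for curation.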
35. Deep Curation
Maxim Gretchkin, Hoifung Poon
Distant supervision and co-learning between a text-based classifier and an
expression-based classifier: both models improve by training on each other’s results.
Free-text classifier
Expression classifier
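The co-learning idea can be illustrated with a generic co-training loop. This is a sketch under stated assumptions, not the model from the talk: it uses two nearest-centroid classifiers (one per view), seeds them with a couple of labels standing in for distant supervision, and lets each classifier add its most confident prediction to a shared label pool. The data, the `min_margin` threshold, and every function name are hypothetical.

```python
import math


def centroids(points, labels):
    """Mean vector per class from the labeled points."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(cents, p):
    """Return (label, margin): nearest centroid and its lead over the runner-up."""
    dists = sorted((math.dist(p, c), y) for y, c in cents.items())
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else 0.0
    return dists[0][1], margin


def co_train(view_a, view_b, seed_labels, min_margin=0.5, max_rounds=20):
    """Grow a shared label pool; each view labels its most confident sample."""
    labels = dict(seed_labels)
    for _ in range(max_rounds):
        progress = False
        for view in (view_a, view_b):
            candidates = [i for i in range(len(view)) if i not in labels]
            if not candidates:
                return labels
            known = list(labels)
            cents = centroids([view[i] for i in known],
                              [labels[i] for i in known])
            scored = [(predict(cents, view[i]), i) for i in candidates]
            (label, margin), best = max(scored, key=lambda s: s[0][1])
            if margin >= min_margin:  # only trust confident predictions
                labels[best] = label
                progress = True
        if not progress:
            break
    return labels


# Hypothetical toy data: samples 2 and 5 have uninformative text features
# but clear expression profiles, so the expression view has to label them.
text_view = [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5],
             [0.0, 1.0], [0.1, 0.9], [0.5, 0.5]]
expr_view = [[5.0, 1.0], [4.8, 1.2], [5.2, 0.9],
             [1.0, 5.0], [1.1, 4.9], [0.9, 5.2]]
seeds = {0: "brain", 3: "blood"}  # stand-in for distant supervision

result = co_train(text_view, expr_view, seeds)
print(dict(sorted(result.items())))
```

The confidence threshold is the design choice that matters here: without it, a classifier forced to label an ambiguous sample can inject errors that the other view then trains on.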
36. Deep Curation: Our stuff wins, with no training data
Maxim Gretchkin, Hoifung Poon
[Figure: x-axis = amount of training data used; series = state of the art, our reimplementation of the state of the art, our dueling pianos NN]
46. Lectures
• Data Science Context and Case Studies (~1 week)
• Data Management at Scale
– Relational Databases (~1 week)
– MapReduce (~1 week)
– NoSQL (~1 week)
• Topics in Analytics
– Permutation Methods, Bayesian Methods (~1 week)
– Machine Learning Algorithms and Evaluation (~1 week)
• Visualization (~1 week)
• Graph Analytics (~1 week)
• Guest Lectures
53. Attrition, assignments
[Figure: Number of students completing assignments by part; y-axis from 0 to 18,000; x-axis: Twitter1 to Twitter6, Database1 to Database9, MapReduce1 to MapReduce6, Kaggle, Tableau]
55. Who took the course?
In a directory with 1000 text files, you are asked to
create a list of files that contain the word Drosophila
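For reference, one possible answer to this question is sketched below. It assumes the files use a `.txt` extension and fit in memory one at a time; the function name and the temporary demo directory are illustrative. On a Unix shell, `grep -l Drosophila *.txt` does the same job.

```python
import tempfile
from pathlib import Path


def files_containing(directory, word):
    """Return the sorted names of .txt files in directory that contain word."""
    matches = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.suffix == ".txt":
            text = path.read_text(encoding="utf-8", errors="ignore")
            if word in text:
                matches.append(path.name)
    return matches


# Demo on a throwaway directory with three small files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("Drosophila melanogaster wing data", encoding="utf-8")
    Path(tmp, "b.txt").write_text("Mus musculus expression data", encoding="utf-8")
    Path(tmp, "c.txt").write_text("notes on Drosophila genetics", encoding="utf-8")
    demo_matches = files_containing(tmp, "Drosophila")

print(demo_matches)
```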
56. Who took the course?
What if you were given a billion documents spread across many
computers and asked to count the occurrences of a given phrase?
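The shape of an answer to this question is the map/reduce pattern taught in the course: count locally on each machine, then sum the partial counts. The sketch below simulates that in a single process; the "shards" are plain lists standing in for machines, and all names are illustrative. A real deployment would run the map step on a cluster framework.

```python
from functools import reduce


def map_count(document, phrase):
    """Map step: occurrences of phrase in one document (non-overlapping)."""
    return document.count(phrase)


def count_shard(shard, phrase):
    """Work done locally on one machine: sum over its own documents."""
    return sum(map_count(doc, phrase) for doc in shard)


def count_phrase(shards, phrase):
    """Reduce step: combine the per-shard partial counts into one total."""
    partials = [count_shard(shard, phrase) for shard in shards]
    return reduce(lambda a, b: a + b, partials, 0)


# Hypothetical corpus split across two "machines".
shards = [
    ["big data is big", "data"],
    ["big data big data"],
]
total = count_phrase(shards, "big data")
print(total)
```

Because addition is associative and commutative, the partial counts can be combined in any order, which is what lets the work spread across many computers.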
57. “I left the company I co-founded in 2005 to do data
analytics with Wibidata, with whom I was introduced
as a result of their guest lecture in your course.
Editor's Notes
We use this device to talk about this idea: the pi-shaped researcher.
Native leaders and city officials in Barrow, Alaska, worried about drinking and associated violence and accidental deaths in their community, invited a group of sociology researchers to assess the problem and work with them to devise solutions. At the conclusion of the study, researchers produced a report entitled “The Inupiat, Economics and Alcohol on the Alaskan North Slope”, which was released simultaneously in a press release and to the Barrow community. The press release was picked up by the New York Times, which ran a front-page story entitled Alcohol Plagues
Responsibility to which parties?
* Society
* Employers and Clients
* Colleagues
* Research Subjects
ASA:
Professionalism
Responsibilities to Funders, Clients, Employers
Responsibilities in Publications and Testimony
Responsibilities to Research Subjects
Responsibilities to Research Team Colleagues
Responsibilities to Other Statisticians or Statistical Practitioners
Responsibilities Regarding Allegations of Misconduct
Responsibilities of Employers
Code of Conduct: Rules
Competence
Do what your client asks, unless it violates the law
Communication with clients
Confidential information
Conflicts of interest
Rule 7: More on conflicts of interest and confidentiality
Rule 8: Scientific integrity
+++ Interesting: If a data scientist reasonably believes a client is misusing data science to communicate a false reality or promote an illusion of understanding, the data scientist shall take reasonable remedial measures, including disclosure to the client, and including, if necessary, disclosure to the proper authorities. The data scientist shall take reasonable measures to persuade the client to use data science appropriately.
Rule 9: Misconduct (follow the rules)
This week we’re going to talk about estimation and prediction.
I want to begin with a non-research article from 2010 by Jonah Lehrer. In this article, the author describes cases where once-promising research results become weaker over time – they become harder to replicate, or the effect size becomes smaller.
He quotes John Davis speaking about the efficacy of antidepressants, saying…
He talks about Anders Moller, a biologist who made an important discovery based on precise measurements of symmetry in the plumage of barn swallows, only to find the effect size shrank by 80 percent in the studies following the initial paper.
Jonathan Schooler made a discovery he called verbal overshadowing, which showed, counter-intuitively, that talking about someone’s face made it harder to recognize later rather than easier. But this effect too became weaker over time.
Back in the 1930s, Joseph Rhine, a researcher at Duke University who coined the terms parapsychology and extrasensory perception, reported data showing that some individuals could correctly guess the symbols on special cards without seeing them in remarkably long streaks. But the same individuals’ performance would decline over time. He called it the decline effect.
What’s going on? The article offers some sensible and some not-so-sensible ideas about the root cause.
One culprit is publication bias.
Joober et al. in 2012
You can’t roll the dice a bunch of times then yell “Yahtzee!”
Here’s a simulation of what Rhine in the 1930s referred to as the decline effect.
As the study size increases, the effect size diminishes. Other metrics on the x and y axes are possible: x-axis might be improvements in experimental design, y-axis might be statistical significance.
The units of effect size will be application specific – number of smokers who quit, number of T-cells in the blood, amount of ad revenue generated, etc. Something that measures how “good” the result is.
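A simulation along these lines can be sketched as follows. It is not the one shown in the lecture: it assumes a small true effect (0.1 in arbitrary units, unit variance), runs many studies at each sample size, and "publishes" only those that reach one-sided significance at z > 1.96. All parameter values and names are assumptions chosen for illustration.

```python
import random
import statistics


def published_effect(n, true_effect=0.1, studies=500, rng=None):
    """Mean observed effect among 'published' (significant) studies of size n."""
    rng = rng or random.Random(0)
    published = []
    for _ in range(studies):
        sample = [rng.gauss(true_effect, 1.0) for _ in range(n)]
        mean = statistics.fmean(sample)
        z = mean * n ** 0.5  # z-test with known standard deviation 1
        if z > 1.96:         # publication filter: significant results only
            published.append(mean)
    if not published:
        return None, 0
    return statistics.fmean(published), len(published)


rng = random.Random(42)
sizes = [10, 40, 160, 640]
results = {n: published_effect(n, rng=rng) for n in sizes}

for n, (effect, count) in results.items():
    print(f"n={n:4d}  published studies={count:3d}  mean published effect={effect:.3f}")
```

Small studies can only clear the significance bar by overestimating the effect, so the published effect starts high and shrinks toward the true value as sample sizes grow, which is publication bias producing an apparent decline.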
You can’t roll the dice a bunch of times then yell “Yahtzee!”
Google knowledge graph
Specialized Ontologies
"HeLa", "K562", "MCF-7" and "brain tumor"
PCA on expression values
Google knowledge graph – common knowledge, high redundancy, possibly crowdsourcing (visual: question answering via Google)
Text features:
presence of ontology terms
sibling of ontology term
Expression features