Science Data, Responsibly

Data Ethics in Data Science Education
(plus: Science Data, Responsibly)
Bill Howe
University of Washington

Plan
• context: eScience Institute (1 min)
• context: Data Science MOOC (3 min)
• Vignette on Teaching Data Ethics (5 min)
• Science Data, Responsibly (6 min)
– Automated Curation
– Viziometrics
9/25/2016 Data, Responsibly @ Dagstuhl 2

eScience Institute
• People
– Research Staff (~4 100% Data Scientists, ~4 50% Research Scientists)
– Postdocs (~12 at steady state)
– Faculty (~9 Exec Committee, ~20 Steering Committee, ~100 Affiliates)
– Administrative Staff (Program Managers, Finance, Admin)
• Programs
– Short- and long-term research; education programs (undergraduate, master's, PhD); software; research consulting
– Leadership on all things data science around campus
• Funding
– $700k / yr permanent appropriation from the state of WA
– $32.8M for 5 years jointly with NYU and UC Berkeley from the Gordon and Betty Moore Foundation and the Alfred P. Sloan Foundation to build a “Data Science Environment”
– $9M for 5 years from the Washington Research Foundation
– $500k / yr from the Provost for half-lines for recruiting in relevant fields

Data Science Education
Audiences: Students (CS/Informatics, Non-Major; undergrads, grads) and Non-Students (professionals, researchers)
(2011) Data Science Certificate
(2013) Data Science MOOC
(2013) NSF IGERT Big Data PhD
(2013) New CS Courses
(2016) Data Science Masters
(2015) Data Sci. for Social Good
Data Ethics being incorporated in all programs

Introduction to Data Science MOOC on Coursera
• Session 1, Spring 2013: 119,504 students
• Session 2, Summer 2014: 121,215 students

Participation numbers
• “Registered”: 119,517 (totally irrelevant)
• Clicked play in first 2 weeks: 78,589
• Turned in 1st homework: 10,663
• Completed all assignments: ~9,000 (typical for a MOOC)
• “Passed”: 7,022
• Forum threads: 4,661
• Forum posts: 22,900
Fairly consistent with Coursera data across “hard” courses
Define success however you want
– Many love it in parts, start late, don’t turn in homework, etc.
– Learning rather than watching television
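
The participation numbers above form a funnel. A minimal sketch of the arithmetic, using only the counts on this slide (the stage list and the function name `conversion_rates` are illustrative, and "~9000" is treated as exactly 9,000):

```python
# Participation counts copied from the slide; the "~9000" figure is approximate.
funnel = [
    ("Registered", 119_517),
    ("Clicked play in first 2 weeks", 78_589),
    ("Turned in 1st homework", 10_663),
    ("Completed all assignments", 9_000),
    ("Passed", 7_022),
]

def conversion_rates(stages):
    """Return (label, count, share of first stage, share of previous stage)."""
    base = stages[0][1]
    previous = base
    rows = []
    for label, count in stages:
        rows.append((label, count, count / base, count / previous))
        previous = count
    return rows

rates = conversion_rates(funnel)
for label, count, of_registered, of_previous in rates:
    print(f"{label:32s} {count:>8,d} {of_registered:7.1%} {of_previous:7.1%}")
```

Which denominator counts as "success" is exactly the choice the slide leaves open: the share of registrants who passed is small, while the share of homework submitters who passed is large.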

Syllabus
• Data Science Landscape (~1 week)
• Data Manipulation at Scale
– Relational Databases (~1 week)
– MapReduce (~1 week)
– NoSQL (~1 week)
• Analytics
– Statistics Topics (~1 week)
– Machine Learning Topics (~2 weeks)
• Visualization (~1 week)
• Graph Analytics (~1 week)

2015: MOOC Recast as a 4-course “Specialization”
• Data Manipulation at Scale
– Databases, Systems, Algorithms
• Practical Predictive Analytics
– Stats (resampling methods, multiple hypothesis testing, more)
– ML (rules/trees/forests, ensembles/boosting/bagging, SVMs, GD, eval…)
• Communicating Data Science
– Visualization, ethics and privacy
• Capstone

VIGNETTE ON TEACHING
DATA ETHICS

Alcohol Study, Barrow, Alaska, 1979
Native leaders and city officials, worried about drinking and associated
violence in their community, invited a group of sociology researchers to
assess the problem and work with them to devise solutions.

Methods
• 10% representative sample (N=88)
of everyone over the age of 15 using
a 1972 demographic survey
• Interviewed on attitudes and values
about use of alcohol
• Obtained psychological histories
including drinking behavior
• Given the Michigan Alcoholism
Screening Test (Seltzer, 1971)
• Asked to draw a picture of a person
– Used to determine cultural identity

Results announced unilaterally and publicly
At the conclusion of the study, researchers wrote a report entitled “The
Inupiat, Economics and Alcohol on the Alaskan North Slope”, which was released
simultaneously to the press and to the Barrow community. The press
release was picked up by the New York Times, which ran a front-page story
entitled “Alcohol Plagues Eskimos”.

Backlash
The results of the Barrow Alcohol Study in Alaska were revealed at a
press conference held far from the Native village, without the
presence, much less the knowledge or consent, of any community member who
might have been able to provide context on the socioeconomic
conditions of the village. Study results suggested that nearly all adults in the
community were alcoholics. In addition to the shame felt by community members,
the town’s Standard & Poor’s bond rating suffered as a result, which in turn
decreased the tribe’s ability to secure funding for much-needed projects.

Methodological Problems
“The authors once again met with the Barrow Technical Advisory
Group, who stated their concern that only Natives were studied,
and that outsiders in town had not been included.”
“The estimates of the frequency of intoxication based on
association with the probability of being detained were termed
‘ludicrous, both logically and statistically.’”
Edward F. Foulks, M.D., Misalliances In The Barrow Alcohol Study

Ethical Problems
• Participants controlled neither their data nor
the context in which the data were presented.
• Easy to demonstrate specific, significant harms:
– Social: Stigmatization
– Financial: Bond rating lowered
• Important: Nothing to do with individual privacy
– No PII revealed at any point, to anyone
– No violations of best practices in data handling
– But even those who did not participate in the study
incurred harm

Two Topics
• Social Component: Codes of Conduct
• Technical Component: Managing Sensitive Data

Ethical principles vs. ethical rules
• In the Barrow example, ethical rules were
generally followed
• But ethical principles were violated: The
researchers appear to have placed their own
interests ahead of those of the research
subjects, the client, and society

Principles: Codes of Conduct
• American Statistical Association
– http://www.amstat.org/committees/ethics/
• Certified Analytics Professional
– https://www.certifiedanalytics.org/ethics.php
• Data Science Association
– http://www.datascienceassn.org/code-of-conduct.html

SCIENCE DATA, RESPONSIBLY

Science is a complete mess
• Reproducibility
– Begley & Ellis, Nature 2012: 6 out of 53 cancer studies reproducible
– Only about half of 100 psychology studies had effect sizes that approximated
the original result (Science, 2015)
– Ioannidis 2005: Why Most Published Research Findings Are False
– Reinhart & Rogoff: global economic policy based on spreadsheet errors

Retractions are increasing…

• Reproducibility
• Fraud
– Diederik Stapel: 38 articles with fictitious data
– Bharat Aggarwal: a huge number of images with evidence of manipulation

Bharat Aggarwal
alleged data manipulation

• Public Trust
– Churn: Chocolate, egg yolks, red meat, red wine, etc.
– Climate change, vaccines

Vision: Validate scientific claims automatically
– Check for manipulation (manipulated images, Benford’s Law)
– Extract claims from papers
– Check claims against the authors’ data
– Check claims against related data sets
– Automatic meta-analysis across the literature + public datasets
• First steps
– Automatic curation: Validate and attach metadata to public datasets
– Longitudinal analysis of the visual literature
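
One of the manipulation checks named above, Benford's Law, can be sketched in a few lines. This is a minimal illustration, not the project's actual pipeline: the function names, the demo data, and the use of a chi-square goodness-of-fit test at the 5% level are all assumptions. A large statistic only flags a dataset for a closer look; many legitimate datasets do not follow Benford's Law.

```python
import math
from collections import Counter

def leading_digit(value):
    """First significant digit (1-9) of a nonzero number."""
    # Scientific notation puts the first significant digit first, e.g. '2.34e-03'.
    return int(f"{abs(float(value)):.15e}"[0])

def benford_chi_square(values):
    """Chi-square statistic of observed leading digits against Benford's Law."""
    digits = [leading_digit(v) for v in values if v != 0]
    if not digits:
        raise ValueError("need at least one nonzero value")
    counts = Counter(digits)
    n = len(digits)
    statistic = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)  # Benford: P(d) = log10(1 + 1/d)
        statistic += (counts[d] - expected) ** 2 / expected
    return statistic

CRITICAL_5_PERCENT = 15.507  # chi-square critical value for 8 degrees of freedom

# Hypothetical demo data: powers of two track Benford closely, while
# the integers 100..999 have uniformly distributed leading digits.
growth_stat = benford_chi_square([2 ** k for k in range(200)])
uniform_stat = benford_chi_square(range(100, 1000))
print(growth_stat < CRITICAL_5_PERCENT, uniform_stat > CRITICAL_5_PERCENT)
```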

“DEEP” CURATION
Science Data, Responsibly

Microarray samples submitted to the Gene Expression Omnibus
Curation is fast becoming the bottleneck to data sharing
Maxim Gretchkin, Hoifung Poon
Can we use the expression data directly to curate algorithmically?
[Figure legend: color = labels supplied as metadata; clusters = first two PCA dimensions of the gene expression data itself]
The expression data and the text labels appear to disagree
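
A minimal sketch of the comparison described above: project expression vectors onto their first two principal components, group samples by the projection, and flag samples whose metadata label disagrees with their group. Everything here is an assumption made for illustration: the toy matrix `EXPR`, the labels, the power-iteration PCA, the crude split on the sign of the first component, and the majority-vote rule. It is not the method behind the figure.

```python
import math
from collections import Counter

# Hypothetical data: rows = samples, columns = genes. Sample 2 has liver-like
# expression but carries a "brain" metadata label.
EXPR = [(9.0, 1.0, 5.0), (8.5, 1.5, 5.2), (9.2, 0.8, 4.9),
        (1.0, 9.0, 5.1), (1.4, 8.6, 5.0), (0.9, 9.1, 4.8)]
LABELS = ["liver", "liver", "brain", "brain", "brain", "brain"]

def center(rows):
    """Subtract each column's mean."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def dominant_eigenvector(matrix, iterations=200):
    """Power iteration: repeatedly apply the matrix and renormalize."""
    d = len(matrix)
    vec = [float(k + 1) for k in range(d)]  # fixed, asymmetric start vector
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    return vec

def first_two_components(rows):
    cov = covariance(rows)
    d = len(cov)
    pc1 = dominant_eigenvector(cov)
    eigenvalue = sum(pc1[i] * cov[i][j] * pc1[j]
                     for i in range(d) for j in range(d))
    # Deflation removes the variance already explained by pc1.
    deflated = [[cov[i][j] - eigenvalue * pc1[i] * pc1[j] for j in range(d)]
                for i in range(d)]
    return pc1, dominant_eigenvector(deflated)

centered = center(EXPR)
pc1, pc2 = first_two_components(centered)
projected = [(sum(x * w for x, w in zip(row, pc1)),
              sum(x * w for x, w in zip(row, pc2))) for row in centered]

# Crude clustering: split samples by the sign of their first-component score.
clusters = [point[0] > 0 for point in projected]
majority = {c: Counter(l for l, k in zip(LABELS, clusters) if k == c)
               .most_common(1)[0][0] for c in set(clusters)}
suspect = [i for i, (label, c) in enumerate(zip(LABELS, clusters))
           if label != majority[c]]
print(suspect)
```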

Better Tissue Type Labels
• Domain knowledge (Ontology)
• Expression data
• Free-text Metadata
• 2 Deep Networks (text, expr)
• SVM

Deep Curation
Distant supervision and co-learning between a text-based classifier and an
expression-based classifier: both models improve by training on each other’s results.
Free-text classifier
Expression classifier
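
The co-learning loop can be sketched with two deliberately simple stand-ins. This is a toy under stated assumptions, not the deep networks from the talk: the samples, the two-term `ONTOLOGY`, the nearest-centroid expression classifier, the word-vote text classifier, and the abstention rules are all hypothetical. Ontology terms found in the free text supply the seed labels (distant supervision), and each classifier's confident guesses join a shared pool that the other one trains on.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: (free-text metadata, expression vector) per sample.
SAMPLES = [
    ("liver biopsy", (9.0, 1.0)),
    ("hepatocyte culture", (8.5, 1.2)),
    ("hepatocyte extract", (5.2, 4.9)),  # expression alone is ambiguous
    ("brain cortex", (1.0, 9.0)),
    ("cortex slice", (1.3, 8.7)),
    ("neuron culture", (0.8, 9.2)),
]
ONTOLOGY = {"liver": "liver", "brain": "brain"}  # ontology term -> tissue

def seed_labels():
    """Distant supervision: label a sample when its text names an ontology term."""
    return {i: ONTOLOGY[w] for i, (text, _) in enumerate(SAMPLES)
            for w in text.split() if w in ONTOLOGY}

def train_expr(labels):
    """Expression view: one centroid per tissue."""
    groups = defaultdict(list)
    for i, tissue in labels.items():
        groups[tissue].append(SAMPLES[i][1])
    return {t: tuple(sum(col) / len(vecs) for col in zip(*vecs))
            for t, vecs in groups.items()}

def predict_expr(model, vec, min_margin=2.0):
    """Nearest centroid; abstain (None) when the two closest are too similar."""
    ranked = sorted((math.dist(vec, c), t) for t, c in model.items())
    if not ranked or (len(ranked) > 1 and ranked[1][0] - ranked[0][0] < min_margin):
        return None
    return ranked[0][1]

def train_text(labels):
    """Text view: count which tissues each word has been seen with."""
    votes = defaultdict(Counter)
    for i, tissue in labels.items():
        for word in SAMPLES[i][0].split():
            votes[word][tissue] += 1
    return votes

def predict_text(model, text):
    """Sum word votes; abstain (None) on unknown words or a tie."""
    tally = Counter()
    for word in text.split():
        tally.update(model.get(word, {}))
    ranked = tally.most_common(2)
    if not ranked or (len(ranked) == 2 and ranked[0][1] == ranked[1][1]):
        return None
    return ranked[0][0]

def co_train(rounds=3):
    labels = seed_labels()
    views = ((train_expr, predict_expr, 1), (train_text, predict_text, 0))
    for _ in range(rounds):
        for train, predict, view in views:
            model = train(labels)  # trained on everything labeled so far
            for i, sample in enumerate(SAMPLES):
                if i not in labels:
                    guess = predict(model, sample[view])
                    if guess is not None:
                        labels[i] = guess  # confident guesses join the shared pool
    return labels

final_labels = co_train()
print(final_labels)
```

In this toy, neither view can label "hepatocyte extract" from the seeds alone. The text classifier learns the word "hepatocyte" only after the expression classifier has labeled sample 1, which is the sense in which each model trains on the other's results.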

Deep Curation: Our stuff wins, with no training data
[Chart: series are the state of the art, our reimplementation of the state of the art, and our “dueling pianos” NN; axis: amount of training data used]

VIZIOMETRICS:
COMPREHENDING VISUAL INFORMATION
IN THE SCIENTIFIC LITERATURE
Human-Data Interaction

Step 1: Dismantling Composite Figures
Poshen Lee, ICPRAM 2015

Do high-impact papers have fewer equations, as indicated by Fawcett and
Higginson? (Yes)
Poshen Lee, Jevin West
[Chart: high-impact papers vs. low-impact papers]

Do high-impact papers have more diagrams? (Yes)
Poshen Lee, Jevin West

TEACHING
DATA ETHICS IN DATA SCIENCE


Lectures
• Data Science Context and Case Studies (~1 week)
• Data Management at Scale
– Relational Databases (~1 week)
– MapReduce (~1 week)
– NoSQL (~1 week)
• Topics in Analytics
– Permutation Methods, Bayesian Methods (~1 week)
– Machine Learning Algorithms and Evaluation (~1 week)
• Visualization (~1 week)
• Graph Analytics (~1 week)
• Guest Lectures

Who took the course?

What programming language do you typically use?

Attrition, video lectures
[Chart: number of students watching videos by segment, ordered by time. Y-axis: 0 to 50,000 students; x-axis ticks: segments 1 to 96.]

Attrition, assignments
[Chart: number of students completing assignments by part. Y-axis: 0 to 18,000 students; parts in order: Twitter1 to Twitter6, Database1 to Database9, MapReduce1 to MapReduce6, Kaggle, Tableau.]

In a directory with 1000 text files, you are asked to
create a list of files that contain the word Drosophila
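
On a single machine this task is a short loop. A minimal sketch, assuming the files are UTF-8 text with a `.txt` extension and that a case-sensitive substring match is what is wanted (the function name `files_containing` and the two demo files are made up for illustration):

```python
import tempfile
from pathlib import Path

def files_containing(directory, word, pattern="*.txt"):
    """Return the sorted names of files in `directory` whose text contains `word`."""
    matches = []
    for path in sorted(Path(directory).glob(pattern)):
        if not path.is_file():
            continue
        # Stream line by line so large files are never loaded whole.
        with path.open(encoding="utf-8", errors="replace") as handle:
            if any(word in line for line in handle):
                matches.append(path.name)
    return matches

# Demo on a throwaway directory with two hypothetical files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("Drosophila melanogaster wing data\n", encoding="utf-8")
    Path(tmp, "b.txt").write_text("Mus musculus liver data\n", encoding="utf-8")
    found = files_containing(tmp, "Drosophila")
print(found)
```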

What if you were given a billion documents spread across many
computers and asked to count the occurrences of a given phrase?
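
At that scale the same question is usually split into a map step that runs next to each document and a reduce step that combines the partial counts. The sketch below simulates this in one process; it is not tied to any particular MapReduce framework. The nested lists stand in for machines, the function names are illustrative, and matching is case-insensitive and non-overlapping by assumption (it relies on `str.count`).

```python
from collections import defaultdict

def map_phase(document, phrase):
    """Emit one (key, value) pair: the phrase and its count in this document."""
    yield phrase, document.lower().count(phrase.lower())

def reduce_phase(key, values):
    """Combine all partial counts emitted for one key."""
    return key, sum(values)

def count_phrase(partitions, phrase):
    grouped = defaultdict(list)
    for partition in partitions:  # each partition stands in for one machine
        for document in partition:
            for key, value in map_phase(document, phrase):
                grouped[key].append(value)  # shuffle: group values by key
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Hypothetical documents spread over two "machines".
partitions = [
    ["big data is big", "data science"],
    ["Big Data everywhere", "no match here"],
]
totals = count_phrase(partitions, "big data")
print(totals)
```

Because each map call looks at one document only, the map step can run wherever the documents live, and only the small per-document counts need to cross the network.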

“I left the company I co-founded in 2005 to do data
analytics with Wibidata, with whom I was introduced
as a result of their guest lecture in your course.”
