Beyond the Belmont Principles:
Ethical Challenges, Practices,
and Beliefs in the Online Data
Research Community
Jessica Vitak, Katie Shilton, and Zahra Ashktorab
College of Information Studies, University of Maryland
CSCW 2016 | San Francisco, CA
Twitter: @jvitak, @katieshilton, @zashktorab
#cscw2016
Google processed more than 20 petabytes of data per day in 2014.
Source: Mashable
Data Free for All?
Not so fast, (research) cowboy… Let's talk about data ethics first.
Belmont Report & Common Rule
Belmont Report published in 1979 in response to
unethical studies in psychology & medicine
Further articulated in Common Rule legislation
Three main principles:
1. Respect for persons: participants know about and
consent to research
2. Beneficence: do no harm; maximize benefits, minimize
risks
3. Justice: fair distribution of costs/benefits to all potential
participants
Online data create gray area
Is it feasible to collect informed
consent?
Should you be more
transparent about your
research?
Who is being left out by your
data collection strategies?
Isn’t public data public?
Is it possible to truly
anonymize a dataset?
Statement from the SIGCOMM 2015 Program Committee: The
SIGCOMM 2015 PC appreciated the technical contributions made in
this paper, but found the paper controversial because some of the
experiments the authors conducted raise ethical concerns. The
controversy arose in large part because the networking research
community does not yet have widely accepted guidelines or rules
for the ethics of experiments that measure online censorship. In
accordance with the published submission guidelines for SIGCOMM
2015, had the authors not engaged with their Institutional Review
Boards (IRBs) or had their IRBs determined that their research was
unethical, the PC would have rejected the paper without review. But the
authors did engage with their IRBs, which did not flag the research as
unethical. The PC hopes that discussion of the ethical concerns these
experiments raise will advance the development of ethical guidelines in
this area. It is the PC’s view that future guidelines should include
as a core principle that researchers should not engage in
experiments that subject users to an appreciable risk of
substantial harm absent informed consent. The PC endorses […]
Research Questions
1. What are the research ethics practices of
researchers using online datasets?
2. What do researchers using online
datasets believe constitutes ethical
research?
3. How do these practices and beliefs vary
among social computing researchers?
Method: Survey Instrument
Survey items were created based on findings from
20 interviews with scholars in social-computing
fields1
Items captured:
Data sources, methods, and analyses employed
Data collection, analysis, and sharing practices
Attitudes toward the appropriateness of various methods and practices
Researchers' personal codes of ethics
Demographics
1 See Shilton & Sayles (2016) for more information on this initial qualitative work.
Method: Identifying population
Focused on researchers in social computing-related
fields where online data analysis is common.
Used databases to identify articles published since 2011
at eight conferences: CSCW, CHI, ICWSM, iConference,
WWW, UbiComp, CIKM, and KDD.
Search terms included: “trace ethnography,” “big data,”
“twitter,” “forums,” “text mining,” “logs,” “activity traces,”
and “social network”
Also posted the survey invitation to relevant listservs, including
AoIR, AIS, CITASA, ICA, STS, and NCA
Method: Participant Details
Based on the database search, we compiled a list of ~2,800 unique
names (an illustrative sketch of this step follows below).
Sent initial invites plus two reminders through SurveyGizmo over a
two-week period
Received 263 completed surveys, along with some
interesting emails from researchers…
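As an illustration of the compilation step, a sketch like the following could merge and deduplicate author lists exported from bibliographic databases; the file and column names are hypothetical, and this is not the authors' actual pipeline.

```python
# Illustrative only: merge author lists exported from bibliographic databases
# and keep one row per unique author name. File/column names are hypothetical.
import pandas as pd

exports = ["cscw.csv", "chi.csv", "icwsm.csv", "www.csv"]   # hypothetical per-venue exports
authors = pd.concat([pd.read_csv(f)[["author_name", "email"]] for f in exports])

authors["author_name"] = authors["author_name"].str.strip().str.title()
unique_authors = authors.drop_duplicates(subset="author_name")

print(len(unique_authors))          # the slide reports ~2,800 unique names
unique_authors.to_csv("invite_list.csv", index=False)
```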
Sample Demographics (N=263)

Sex: Male 159 (60.5%); Female 93 (35.4%)
Education: Bachelor's 15 (5.7%); Master's 61 (23.2%); PhD 180 (68.4%)
Current Location: United States 162 (61.6%); UK 21 (8.0%); Canada 14 (5.3%); Germany 12 (4.6%); 23 other countries (<10 participants each) 47 (17.8%)
Degree In: Communication & Media 33 (12.5%); Computer Science/Engineering 100 (38.0%); HCI 10 (3.8%); Information 50 (19.0%); Social Sciences 37 (14.1%); Other 27 (10.3%)
Current Field of Work: Academia (research focus) 207 (78.7%); Industry 35 (13.3%); Policy/Government/Non-Profit 12 (4.6%)
Findings: Researcher Code of
Ethics
Analyzed two ways:
1. Iterative coding by authors of open-ended
responses to the question, “How would you describe
your personal code of ethics regarding online
data?”*
2. Exploratory factor analysis (EFA) of 35 survey items relating to
research attitudes and practices (an illustrative sketch follows below)
*We received 162 (63%) responses, with an average length of 35 words (median=23.5; SD=35).
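For readers unfamiliar with the second step, here is a minimal sketch of how an EFA over Likert-type items might be run; the file name, column layout, and factor count are assumptions, not the authors' actual analysis.

```python
# Illustrative only: exploratory factor analysis over hypothetical survey items.
# Assumes "attitude_items.csv" holds the 35 Likert-type items as numeric columns.
import pandas as pd
from sklearn.decomposition import FactorAnalysis

items = pd.read_csv("attitude_items.csv").dropna()    # listwise deletion for simplicity

fa = FactorAnalysis(n_components=3, random_state=0)   # number of factors is an assumption
fa.fit(items)

# Loadings: rows = factors, columns = items; inspect which items cluster together.
loadings = pd.DataFrame(fa.components_, columns=items.columns)
print(loadings.round(2).T)
```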
Code | Definition | Example Statement
Public Data | Only using public data / public data being okay to collect and analyze | "In general, I feel that what is posted online is a matter of public record, though every case needs to be looked at individually in order to evaluate the ethical risks."
Do No Harm | Comments related to the Golden Rule | "Golden rule, do to others what you'd have them do to you."
Informed Consent | Always get informed consent / stressing importance of informed consent | "I think at this point for any new study I started using online data, I would try to get informed consent when collecting identifiable information (e.g. usernames)."
Greater Good | Data collection should have a social benefit | "The work I do should address larger social challenges, and not just offer incremental improvements for companies to deploy."
Established Guidelines | Including the Belmont Report, IRBs, Terms of Service, legal frameworks, community norms | "I generally follow the ethical guidelines for human subjects research as reflected in the Belmont Report and codified in 45.CFR.46 when collecting online data."
Risks vs. Benefits | Discussion of weighing potential harms and benefits or gains | "I think I focus on potential harm, and all the ethical procedures I put in place work towards minimizing potential harm."
Protect Participants | Data aggregation, deleting PII, anonymizing/obfuscating data | "I aggregate unique cases into larger categories rather than removing them from the data set."
Data Judgments | Efforts to not make inferences or judge participants or data | "Do not expose users to the outside world by inferring features that they have not personally disclosed."
Transparency | Contact with participants or methods of informing participants about research | "I prefer to engage individual participants in the data collection process, and to provide them with explicit information about data collection practices."
Codification of Ethical Attitudes Measure

Item | M | SD
...notify participants about why they're collecting online data¹ | 3.89 | 0.96
...share research results with research subjects¹ | 3.90 | 0.80
...ask colleagues about their research ethics practices¹ | 4.27 | 0.74
...ask their IRB/internal reviewers for advice about research ethics¹ | 4.03 | 0.90
...think about possible edge cases/outliers when designing studies¹ | 4.33 | 0.71
...only collect online data when the benefits outweigh the potential harms¹ | 3.62 | 1.10
...remove individuals from datasets upon their request¹ | 4.56 | 0.71
Researchers should be held to a higher ethical standard than others who use online data² | 3.46 | 1.22
I think about ethics a lot when I'm designing a new research project² | 3.96 | 0.93
Full Scale (α = .71) | 4.00 | 0.49

¹ Prompt: “I think researchers should....”
² Prompt: “To what extent do you agree with the following statements?”
Both sets of items were measured on five-point, Likert-type scales (Strongly Agree to Strongly Disagree).
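As a reference for the "Full Scale (α = .71)" row, a minimal sketch of how Cronbach's α and the scale mean could be computed is below; the file and column names are hypothetical.

```python
# Illustrative only: Cronbach's alpha for a multi-item Likert-type scale.
# Assumes each column of the DataFrame is one scale item (coded 1-5).
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    k = items.shape[1]                               # number of items in the scale
    item_variances = items.var(axis=0, ddof=1)       # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of the summed scale
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

scale_items = pd.read_csv("codified_ethics_items.csv").dropna()   # hypothetical file
print(f"alpha      = {cronbach_alpha(scale_items):.2f}")
print(f"scale mean = {scale_items.mean(axis=1).mean():.2f}")
```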
OLS Regression for Codified Ethics

Independent Variables | β (t) | p
Sex: Female | .01 (.13) | .90
Academic Degree: Social Science (1) | -.06 (-.72) | .48
Academic Degree: Computer Science (1) | -.13 (-1.45) | .15
Field of Work: Academia (1) | -.16 (-1.23) | .22
Field of Work: Industry (1) | -.31 (-2.43) | .02*
Level of Education | .01 (.07) | .94
Location: U.S. | -.01 (-.20) | .84
F-test | 1.56 | .15
Adjusted R² | .02 | --

* Industry workers' average agreement with the codified ethics scale (M=3.75) was significantly lower than that of non-profit/policy workers (M=4.11). Academics (M=4.02) were not significantly different from either group.
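A minimal sketch of fitting a model of this shape with statsmodels follows; the variable names and dummy coding are placeholders, not the authors' analysis code.

```python
# Illustrative only: OLS regression of a codified-ethics score on dummy-coded
# predictors, using statsmodels' formula interface. Column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("researcher_survey.csv")   # hypothetical analysis file

model = smf.ols(
    "codified_ethics ~ female + social_science + computer_science"
    " + academia + industry + education_level + located_in_us",
    data=df,
).fit()

print(model.summary())   # coefficients, t-values, p-values, adjusted R², F-test
```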
Comparing level of agreement and item variance across the corpus

Agreement (>3.5), low variance (<.8):
- Remove subjects from datasets upon request¹
- Ask (1) colleagues or (2) IRBs about research¹
- Share results with participants¹
- Think about edge cases/outliers¹

Agreement (>3.5), high variance (>1.2):
- Use non-representative samples²
- Remove unique individuals before sharing¹
- Researchers can't collect large-scale online data if consent is required

Disagreement (<3), low variance (<.8):
- No items.

Disagreement (<3), high variance (>1.2):
- Ignore ToS when necessary²
- Deceive participants²
- Share raw data with key stakeholders¹
- It's possible to obtain informed consent with large-scale studies

Notes:
¹ “I think researchers should…”
² “It's permissible for researchers to…”
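A minimal sketch of bucketing items this way, using the cut-offs shown above, is below; the input file and column names are hypothetical.

```python
# Illustrative only: classify survey items by mean agreement and response variance,
# using the slide's cut-offs (agreement > 3.5, disagreement < 3; variance < .8 vs > 1.2).
import pandas as pd

items = pd.read_csv("attitude_items.csv")   # hypothetical: one column per item, coded 1-5

summary = pd.DataFrame({"mean": items.mean(), "variance": items.var(ddof=1)})
summary["agreement"] = pd.cut(summary["mean"], bins=[0, 3, 3.5, 5],
                              labels=["disagreement", "middle", "agreement"])
summary["spread"] = pd.cut(summary["variance"], bins=[0, 0.8, 1.2, float("inf")],
                           labels=["low", "middle", "high"])
print(summary.sort_values(["agreement", "spread"]))
```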
Key Takeaways
Context matters. A lot.
Dependencies between
ethical principles.
Beliefs & norms are often
in flux.
Disciplinary differences?
The data suggest no.
Assuming such differences risks researchers talking past one another
rather than learning from each other's experiences.
Ethics Heuristics for Online Data
Research: Beyond the Belmont
Report
1. Focus on transparency
Openness about data collection
Sharing results with community
leaders or research subjects
2. Data minimization (see the sketch after this list)
Collecting only what you need to
answer an RQ
Letting individuals opt out
Sharing data at aggregate levels
3. Increased caution in sharing results
4. Respect the norms of the contexts in which online
data were generated.
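A minimal sketch of what heuristic 2 (data minimization) might look like in practice follows; the DataFrame, files, and column names are hypothetical, not a prescribed implementation.

```python
# Illustrative only: keep just the fields a research question needs, honor opt-outs,
# and release counts at an aggregate level rather than row-level records.
import pandas as pd

posts = pd.read_csv("collected_posts.csv")                 # hypothetical raw collection
opted_out = set(pd.read_csv("opt_outs.csv")["user_id"])    # hypothetical opt-out list

minimal = (
    posts[~posts["user_id"].isin(opted_out)]   # remove individuals who opted out
         .loc[:, ["topic", "week"]]            # drop usernames, text, locations, etc.
)

# Share only aggregate counts, not individual rows.
aggregate = minimal.groupby(["topic", "week"]).size().reset_index(name="n_posts")
aggregate.to_csv("aggregate_counts.csv", index=False)
```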
Future directions for this
research
This is the first of three surveys by our research group. Each
survey focuses on a different stakeholder:
1. Researchers (this study)
2. IRBs (paper in progress)
3. Users (study in development)
Other avenues for discussion and action:
Workshops (e.g., CSCW 2015, SIGCOMM 2015)
Panels (e.g., iConference 2016)
Working groups/meetings with policymakers (e.g., Future of
Privacy Forum, December 2015)
Thanks!
Jessica Vitak: jvitak@umd.edu | @jvitak
Katie Shilton: kshilton@umd.edu | @katieshilton
Zahra Ashktorab: parnia@umd.edu | @zashktorab


Editor's Notes

  • #16 Deception (mainly comm) & “in flux” were also themes
  • #22 IRB paper notes: