Big data in social sciences and IT developments (ethics considerations)
1. Big data in social sciences and IT
developments
Efthimios Tambouris
Associate Professor, University of Macedonia, Greece
tambouris@uom.gr
2. ERC Workshop on Ethics in Research
26-27 November 2015, Brussels
E. Tambouris http://egov.it.uom.gr/wiki
How much data is produced? The road to big data
• More data is being produced than ever before:
– From the beginning of recorded history until 2003: 5 billion gigabytes in total
– 2011: 5 billion gigabytes every two days
– 2013: 5 billion gigabytes every 10 min
– 2015: 5 billion gigabytes every 10 s
Source: Smolan R, Erwitt J (2012) The Human Face of Big Data, Sausalito, CA: Against All Odds Productions as
quoted by ZWITTER, A. (2014) Big Data ethics. [Online], SAGE journals. Feb. Available from:
http://bds.sagepub.com/content/1/2/2053951714559253
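To put the figures quoted on this slide on a common scale, they can be converted into gigabytes per second. The sketch below takes the quoted numbers at face value; the variable and function names are illustrative only.

```python
# Volume quoted on the slide for each interval (5 billion gigabytes).
FIVE_BILLION_GB = 5e9

# Interval, in seconds, over which that volume was produced in each year.
intervals_s = {
    2011: 2 * 24 * 60 * 60,  # every two days
    2013: 10 * 60,           # every 10 minutes
    2015: 10,                # every 10 seconds
}


def rate_gb_per_s(interval_s):
    """Average production rate implied by one interval."""
    return FIVE_BILLION_GB / interval_s


rates = {year: rate_gb_per_s(s) for year, s in intervals_s.items()}

# Growth factor between the 2011 and 2015 figures.
speedup_2011_to_2015 = rates[2015] / rates[2011]

for year, rate in rates.items():
    print(f"{year}: {rate:,.0f} GB/s")
print(f"2011 to 2015: about {speedup_2011_to_2015:,.0f} times faster")
```

On these numbers, the implied production rate grows by roughly four orders of magnitude between 2011 and 2015.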
What is Big Data? The 4V’s
• User-generated data, mainly from social media (SM) but also from
lifestyle applications etc., are very important to researchers
Image from http://parasdoshi.com/2012/11/22/three-vs-of-big-data-with-example/
What do they look like? Big data metaphors
• A number of metaphors for big data have been proposed (oil, bacon,
teenage sex, nuclear waste, etc.)
• Teenage sex (Dan Ariely, 2013)
– Everyone talks about it, nobody really knows how to do it, everyone
thinks everyone else is doing it, so everyone claims they are doing it …
• Nuclear waste (see
http://www.theguardian.com/technology/2008/jan/15/data.security)
– E.g. London Underground holds personal information (needed to obtain
an Oyster card) and travel habits
– What happens if these data are hijacked?
– How should these data be secured?
– For how long should data be preserved?
How are big data used? Current Practices
• Many companies use big data (mainly user-generated content) for a number
of purposes
– E.g. Facebook, to run a large-scale emotion experiment,
– E.g. Twitter, to explore the likelihood of new mothers experiencing postnatal
depression based on their Twitter posts (De Choudhury et al., 2013),
– E.g. Yahoo! Finance message board to predict stock market volatility
(Antweiler and Frank, 2004),
– E.g. weblog content to predict movie success (Mishne and Glance,
2006),
– E.g. Google search queries to track influenza-like illnesses (Ginsberg et
al., 2009),
– E.g. Amazon reviews to predict product sales (Ghose and Ipeirotis, 2011)
– E.g. Twitter posts (aka tweets) to infer levels of rainfall (Lampos and
Cristianini, 2012) and make a number of “predictions”
“Predictions” based on big data
Why are researchers interested in digital data?
• Digital data can be easily collected or even harvested
• Digital data can be easily combined with other data
• Digital data can be easily stored even without explicit location knowledge
(e.g. cloud storage)
• Digital data can be easily and quickly transferred
• Digital data can be easily re-used
• Digital data can be processed in
– traditional ways but also in
– new (even unexpected or unpredicted) ways
– E.g. using data mining, data linkage etc.
Why is this relevant to ethics?
• Harvesting data from SM raises issues of informed consent
• Big data raise issues of identifiability
• …and at the same time they raise problems in authenticating online
identities
• Data linkage presents risks to privacy and confidentiality
• There are emerging and pressing issues of ensuring data security, privacy
and governance particularly in cross-border research (like most EU-
funded research)
Source: Carlton Connect Initiative, 2015, Guidelines for the Ethical Use of Digital Data in Human
Research, http://carltonconnect.com.au/wp-content/uploads/2015/06/Ethical-Use-of-Digital-Data.pdf
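The privacy risk of data linkage mentioned on this slide can be made concrete with a small sketch. All records below are made up for illustration, and the choice of quasi-identifiers (postcode, birth year, gender) is an assumption rather than something taken from the cited guidelines. The point is only that a dataset without names can still be matched against a second source that has them.

```python
# Entirely fictional records; no real people or datasets are involved.
survey = [  # research data with names removed but quasi-identifiers kept
    {"postcode": "1000", "birth_year": 1980, "gender": "F", "answer": "sensitive A"},
    {"postcode": "1050", "birth_year": 1975, "gender": "M", "answer": "sensitive B"},
]
public_profiles = [  # e.g. harvested from public social media profiles
    {"name": "Mary Y", "postcode": "1000", "birth_year": 1980, "gender": "F"},
    {"name": "Fred X", "postcode": "1050", "birth_year": 1975, "gender": "M"},
    {"name": "Ann Z", "postcode": "1050", "birth_year": 1990, "gender": "F"},
]

QUASI_IDENTIFIERS = ("postcode", "birth_year", "gender")


def key(record):
    """Combination of quasi-identifier values used to match records."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(anonymous, identified):
    """Attach a name to each anonymous record that matches exactly one profile."""
    names_by_key = {}
    for profile in identified:
        names_by_key.setdefault(key(profile), []).append(profile["name"])
    linked = []
    for record in anonymous:
        names = names_by_key.get(key(record), [])
        if len(names) == 1:  # a unique match re-identifies the participant
            linked.append({"name": names[0], "answer": record["answer"]})
    return linked


reidentified = link(survey, public_profiles)
print(reidentified)
```

In this toy case both survey answers end up attached to a name, even though the survey itself stored none.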
How does social network (SN)-based research relate to traditional research? Example
Under non-digital circumstances, ethics guidelines suggest that collecting
information from a public space where people could ‘reasonably expect to
be observed by strangers’ is considered appropriate even without
informed consent.
E.g. under these guidelines, tweets count as text that users publish for
the purpose of sharing with others.
The weakness of this argument is that it fails to distinguish between
population-level research and research focused on selected individuals.
Source: Rivers CM and Lewis BL. Ethical research standards in a world of big data [version 1; referees: 2
approved with reservations] F1000Research 2014, 3:38, http://f1000research.com/articles/3-38/v1
[Accessed 05th November 2015]
How does social network (SN)-based research relate to traditional research? Example
It would be clearly unethical for a researcher to follow one specific
shopper around the mall and gather data exclusively about him without
his consent.
However, simply counting or observing behavior in aggregate at a mall is
an acceptable research practice.
The difference is that the latter example adheres to a level of privacy that
the observed individual might expect from being in public, whereas the
former violates those natural privacy boundaries.
A similar distinction is needed in digital research.
Source: Rivers CM and Lewis BL. Ethical research standards in a world of big data [version 1; referees: 2
approved with reservations] F1000Research 2014, 3:38, http://f1000research.com/articles/3-38/v1
[Accessed 05th November 2015]
How can SM data be used? (E.g. does Facebook “use” its users?)
• For one week in 2012, 500,000 Facebook users took part in a massive
psychological experiment aimed at discovering whether emotions could be spread
through social media.
Source: http://www.scu.edu/r/ethics-center/ethicsblog/business-ethics-news/20119/FACEBOOK:-The-Psychological-Experiment-You-Consented-to-in-FB's-Terms-of-Service
• In brief, the study separated its users into two groups. One was subjected to
a newsfeed of primarily positive posts; the other was flooded with
emotionally negative items.
• The results “suggest that the emotions expressed by friends, via online social
networks, influence our own moods, constituting, to our knowledge, the first
experimental evidence for massive-scale emotional contagion via social
networks”
Source: http://www.newsweek.com/facebook-performed-psychology-experiment-thousands-users-without-telling-them-256914
The problem?
Users had no idea about this experiment!
Is it ethical?
Image source: https://media.licdn.com/mpr/mpr/shrinknp_400_400/p/8/005/06f/0f4/0ddc9f1.jpg
Why are SM data so useful? (E.g. Twitter data)
• Each tweet is not just 140 characters
• Each tweet also contains metadata about its author, e.g. the text
location from their profile (e.g. ‘Baltimore’), their time zone, the time
they sent the tweet, the number of followers they have, the number of
tweets they have ever sent, and more
• This information is available through application programming interfaces
(APIs) provided by Twitter that allow real-time access
• There are numerous ways to access the data through these APIs
• In 2010 Twitter donated its entire historical record of tweets to the US
Library of Congress
[Figure: An excerpt of a single tweet returned through the Twitter API]
Source: RIVERS, C. & LEWIS, B. (2014) Ethical research standards in a world of big data. [Online], F1000Research. 06th Feb. Available from:
http://f1000research.com/articles/3-38/v1 [Accessed 05th November 2015].
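As a rough illustration of how much travels with a single tweet, the sketch below models one record as a nested dictionary and lists every field other than the text itself. The key names are hypothetical: they mirror the metadata listed on this slide, not the exact field names of any version of the Twitter API, and the values are invented.

```python
# Hypothetical tweet record. Keys follow the metadata named on the slide,
# not the real payload of any Twitter API version; values are made up.
tweet = {
    "text": "Example tweet text",
    "sent_at": "2015-11-05T10:30:00Z",
    "author": {
        "profile_location": "Baltimore",
        "time_zone": "Eastern Time (US & Canada)",
        "followers": 120,
        "tweets_sent": 3456,
    },
}


def metadata_fields(record, prefix=""):
    """Return dotted paths of every field except the top-level tweet text."""
    paths = []
    for field, value in record.items():
        path = f"{prefix}{field}"
        if isinstance(value, dict):
            paths.extend(metadata_fields(value, prefix=path + "."))
        elif path != "text":
            paths.append(path)
    return paths


fields = metadata_fields(tweet)
print(fields)
```

Each of the listed fields is a candidate quasi-identifier, so a researcher may want to decide field by field what is actually needed before storing it.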
What are the main ethics issues in research using digital data?
1. Consent
2. Privacy and confidentiality
3. Ownership and authorship
4. Governance and custodianship
5. Data sharing: assessing the social benefits of research
Source: Carlton Connect Initiative, 2015, Guidelines for the Ethical Use of Digital Data in Human
Research, http://carltonconnect.com.au/wp-content/uploads/2015/06/Ethical-Use-of-Digital-Data.pdf
1. Consent
• Researchers are increasingly using data posted in SM, blogs, etc.
• What kind of consent procedures must be implemented?
• Face-to-face (f2f) interaction with participants is not possible; registration is used instead
• Registration provides consent to data collection, processing, storage etc.,
but the terms are rarely read and understood
• Is consent an ‘implied contract’ giving up personal data in exchange for a
service (“a social contract”) as seems to be the case in many digital
environments? Is ticking ‘I agree’ when asked online enough?
– In one experiment, six Londoners agreed to give up their first-born
child in exchange for free Wi-Fi (by pressing “I agree” without reading the terms)
• Identity is not always clear; e.g. false identities, multiple identities etc.
• Vulnerable, under-age etc. populations may be unintentionally included in
research
Source: Carlton Connect Initiative, 2015, Guidelines for the Ethical Use of Digital Data in Human
Research, http://carltonconnect.com.au/wp-content/uploads/2015/06/Ethical-Use-of-Digital-Data.pdf
2. Privacy and confidentiality
• There are differences between data collected directly and data “harvested”
without the owner’s awareness
• Note the difference between:
– Individual and institutional protection of privacy; and
– Invasion of privacy (e.g. spam, personal harm etc).
• Data on the web are not necessarily free for any kind of reuse
• Are the data personal? Are they sensitive?
• The Data Protection Directive has different implementations in different
Member States
3. Ownership and authorship
• There is no consensus among academics about data ownership
• Good practice: clarify these issues in your research design
• Cloud storage further complicates things
• What are the risks associated with the use of a data depository?
• Who has authority to access, release and manage this data?
• What processes have been used to anonymise this data?
• What potential harms may result from stripping data of identifiable info?
• Who is accountable for data quality, protection and access to data?
• Who is responsible for providing documentation and meta-data?
• Who is responsible for long-term maintenance of this data?
• Is data destruction (as a requirement of ethics applications) a relevant
approach to digital data?
Source: Carlton Connect Initiative, 2015, Guidelines for the Ethical Use of Digital Data in Human
Research, http://carltonconnect.com.au/wp-content/uploads/2015/06/Ethical-Use-of-Digital-Data.pdf
4. Governance and custodianship
• Overlaps with, but is distinct from, authorship, as it deals with (a) data
storage, (b) access to data and (c) data reuse after the research.
• Clear agreements regarding accountability need to be developed
• Research applications should include information about data storage,
management and access
• Institutions need to put relevant processes in place
• For researchers, the key ethical concern is establishing good data
governance practices in order to ensure data security and thus protect
participants’ privacy and confidentiality.
Source: Carlton Connect Initiative, 2015, Guidelines for the Ethical Use of Digital Data in Human
Research, http://carltonconnect.com.au/wp-content/uploads/2015/06/Ethical-Use-of-Digital-Data.pdf
5. Data sharing: assessing the social benefits of research
• Ethical challenges: “when digital data produced by one project is used in
another project or combined with data from another source, where such
re-use must be approved or justified under the same framework as the
original use of the data”
• Does the approval/permission regime for the original data include or
preclude the new use of the data?
• Do researchers accessing data gathered in another context have a
responsibility to understand the conditions of its original collection?
• Do researchers have a responsibility to assess whether the secondary use
of the data is aligned with the original intent for which it was collected?
• Do researchers using data gathered by another research project have a
responsibility to ensure that access to, and use of, the data does not pose
a risk to individuals from whom it was originally collected?
Source: Carlton Connect Initiative, 2015, Guidelines for the Ethical Use of Digital Data in Human
Research, http://carltonconnect.com.au/wp-content/uploads/2015/06/Ethical-Use-of-Digital-Data.pdf
How does ethics relate to everything else? Ethical position
• This figure has been proposed for organisations
• It can also be useful for researchers!
Source: http://www.ibmbigdatahub.com/sites/default/files/whitepapers_reports_file/TCG%20Study%20Report%20-%20Ethics%20for%20BD%26A.pdf
How to approach ethics? Various Ethics Views
• As a lawyer: identify, understand and comply with current legislation
• As a scientist: review the scientific literature to get a better understanding of
state-of-the-art ethics issues
• As yourself: would you mind if your own data were handled like that?
• As someone else: do you think someone else would mind if his/her data
were handled like that?
Images from http://www.lagiostradeidiritti.org/, https://hutchinson-page.wikispaces.com/A+Study+of+Archetypes
and http://thomaspmbarnett.com/globlogization/tag/us-military
What is happening in the USA?
• Guidelines from the US Consumer Privacy Bill of Rights
Source: CONSUMER DATA PRIVACY IN A NETWORKED WORLD: A FRAMEWORK FOR
PROTECTING PRIVACY AND PROMOTING INNOVATION IN THE GLOBAL DIGITAL ECONOMY
https://www.whitehouse.gov/sites/default/files/privacy-final.pdf
Are there any guidelines for the ethical use of Twitter data?
Based on the US Consumer Privacy Bill of Rights, these guidelines aim to protect the
privacy and ethical treatment of participants:
1. The objectives, methodologies, and data handling practices of the project are
transparent and easily accessible
2. Study design and analyses respect the context in which a tweet was sent
3. The anonymity of tweet authors is protected, so that subjects are
not identifiable in any way
4. Tweet data are not used to harvest additional information from other sources
5. Twitter users’ efforts to control their personal data are honored
6. Researchers work collaboratively with ethics committees/data protection
authorities just as they would for any other human subject data collection
Rivers CM and Lewis BL. Ethical research standards in a world of big data [version 1; referees: 2
approved with reservations] F1000Research 2014, 3:38, http://f1000research.com/articles/3-38/v1
[Accessed 05th November 2015] .
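One common first step towards guideline 3 is to replace account handles with keyed pseudonyms before analysis. The sketch below is an assumption about how that could be done with Python's standard `hmac` and `hashlib` modules; it is not a method described by Rivers and Lewis, and the record layout and function names are hypothetical. Pseudonymising the handle does not by itself guarantee anonymity, since the tweet text or other metadata may still identify the author.

```python
import hashlib
import hmac
import secrets

# Secret key held only by the research team. It is generated here for
# illustration; a real project would need to store and protect it.
PROJECT_KEY = secrets.token_bytes(32)


def pseudonym(handle: str, key: bytes = PROJECT_KEY) -> str:
    """Map an account handle to a stable keyed pseudonym (HMAC-SHA256)."""
    normalised = handle.strip().lstrip("@").lower()
    digest = hmac.new(key, normalised.encode("utf-8"), hashlib.sha256)
    return "user_" + digest.hexdigest()[:16]


def strip_author(record: dict) -> dict:
    """Return a copy that keeps the text, replaces the handle and drops the rest."""
    return {
        "author": pseudonym(record["handle"]),
        "text": record["text"],
    }


# Made-up example record.
sample = {"handle": "@Fred_X", "text": "Rain again today", "location": "Baltimore"}
safe = strip_author(sample)
print(safe)
```

Using a keyed hash rather than a plain hash means that someone without the project key cannot recompute the pseudonym from a known handle.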
Any more guidance? Here is some …
• Is the research problem clear?
• Do you have a research method in place?
– If no, one should be designed considering ethics
– If yes, the method should be examined for ethics issues
• To identify ethics issues in an existing research method, consider these concerns:
– What data do you collect? Which social media will you use? (check
terms of use, country of registration, etc.) How do you recruit? Is
privacy/anonymity a concern? What is the informed consent process? How do you
verify participants’ age? Where are data stored? Etc.
• How do the usual processes/concerns change when it comes to big data?
• Speak to your ethics committee!
For concerns see also:
SWATMAN, P. (2012) Ethical issues in social networking research. [Online]. Available from:
http://www.deakin.edu.au/__data/assets/pdf_file/0007/269701/Swatman-Ethics-and-Social-Media-Research.pdf
Conclusions
• Big data (mainly Social Media data) provide opportunities for
– new research
– new methods for conducting research (data gathering,
processing, re-using, transferring, storing)
• But “with big data comes big responsibility”
• This research may raise multiple ethics issues
• Some issues are easy to identify and cope with
• Others might be more difficult to identify and cope with
• Investigate these issues; it can be rewarding both personally and
professionally
• Discuss these issues with others including ethics experts, ethics
committees etc.
Selected References
Carlton Connect Initiative, 2015, Guidelines for the Ethical Use of Digital Data in Human Research, http://carltonconnect.com.au/wp-content/uploads/2015/06/Ethical-Use-of-Digital-Data.pdf
Chessell M. (2014) Ethics for big data and analytics, http://www.ibmbigdatahub.com/sites/default/files/whitepapers_reports_file/TCG%20Study%20Report%20-%20Ethics%20for%20BD%26A.pdf
Rivers CM and Lewis BL. Ethical research standards in a world of big data [version 1; referees: 2 approved with reservations] F1000Research 2014, 3:38, http://f1000research.com/articles/3-38/v1
SCHROEDER, R. (2014) Big Data and the brave new world of social media research. [Online], SAGE journals. Feb. Available from: http://bds.sagepub.com/content/1/2/2053951714563194.
Smolan R, Erwitt J (2012) The Human Face of Big Data, Sausalito, CA: Against All Odds Productions.
SWATMAN, P. (2012) Ethical issues in social networking research. [Online]. Available from: http://www.deakin.edu.au/__data/assets/pdf_file/0007/269701/Swatman-Ethics-and-Social-Media-Research.pdf.
ZWITTER, A. (2014) Big Data ethics. [Online], SAGE journals. Feb. Available from: http://bds.sagepub.com/content/1/2/2053951714559253.
Appendix
• Some practical guidelines follow
• Source: SWATMAN, P. (2012) Ethical issues in social networking research. [Online]. Available from:
http://www.deakin.edu.au/__data/assets/pdf_file/0007/269701/Swatman-Ethics-and-Social-Media-Research.pdf.
Recruitment
• Traditional participant recruitment is generally “push” based
– Researchers know who they’re targeting
– Even with snowball sampling, participant groups are ‘controlled’
• Social media participant recruitment is generally “pull” based
– Potential participants can discuss the invitation, and tell others about it,
without any researcher control or influence
– Response to the participant invitation is interactive rather than static
(and can lead to unexpected outcomes)
– Subsequent posts may modify already-posted information
Privacy
• Is the space being researched seen as private by its users?
– Are they aware they are being observed?
– What is the researcher’s role?
• Is everything what it seems?
– Are Fred X and Mary Y really who they claim to be?
– How often did Fred X vote /
• How do researchers ensure their participants really are anonymous?
– IP addresses are (usually) traceable
– Tweets may contain identifiers ...
Informed Consent
• Adults: consent is relatively straightforward
– Links from social networking sites to other, more reliable, sites solve most problems
– E.g. surveys (Qualtrics, Survey Monkey, etc.)
• Children / young people: this is a minefield!
– No way of ensuring a participant’s age / level of maturity
– E.g. are all Facebook users really 18+?
– Difficult to obtain parental consent
– And even harder to be sure who actually consented!
• Think carefully about whether some forms of research are worth doing via social
media sites …
Criteria for making data processing legitimate (Directive 95/46/EC)
(a) the data subject has unambiguously given his consent; or
(b) processing is necessary for the performance of a contract to which the data subject is party or in
order to take steps at the request of the data subject prior to entering into a contract; or
(c) processing is necessary for compliance with a legal obligation to which the controller is subject; or
(d) processing is necessary in order to protect the vital interests of the data subject; or
(e) processing is necessary for the performance of a task carried out in the public interest or in the
exercise of official authority vested in the controller or in a third party to whom the data are disclosed; or
(f) processing is necessary for the purposes of the legitimate interests pursued by the controller or by
the third party or parties to whom the data are disclosed, except where such interests are
overridden by the interests for fundamental rights
Your data and social media sites
• How secure are your data in a social networking site?
– Can you depend on privacy / reliability claims from a social networking service?
– Can you access them as/when you need? E.g. is your social network of choice
always available?
– Can anyone else access them at will?
Foreign government access to your data
• Social networking platforms are mostly US-based, with real implications for your data
• US-based data are subject to the Patriot Act and the Foreign Intelligence
Surveillance Act: data can be accessed by US federal law enforcement agencies, no
matter who owns them!
• Australians storing data on US sites cannot claim protection under the Fourth
Amendment of the US Constitution (which protects against unlawful search and
seizure of property and information) because their data are stored by a 3rd-party
provider
• Inter-governmental treaties and agreements (e.g. the European Convention on
Social Network Terms of Service
Who owns the data you create
in a social networking site?
How different are the ToS in
other social networking sites?
Facebook claims the rights to any data collected from applications
(including surveys) created within it
Jaquith (2009): “Facebook’s definition of data ownership does not include the right
to export that data. It’s “mine,” so long as I leave it under Facebook’s control”
Protalinski (2012): “In short, Facebook owns any IP you give it, because you gave it
permission to own it. If there is content you don't want Facebook to own, don't upload
it to Facebook”
As mentioned before, Twitter freely provides several application
programming interfaces (APIs) that allow real-time access to vast
amounts of content. Furthermore, in 2010 it donated its entire
historical record of tweets to the US Library of Congress.