Stories from the Field: Data are Messy and that’s
(kind of) ok
Jude Towers, Lecturer in Sociology and Quantitative Methods
David Ellis, Lecturer in Computational Social Science
Introductions: who we are and why
we care about (even messy) data
Jude Towers
• Doctor of Applied Social Statistics, Lecturer in Sociology and
Quantitative Methods, Associate Director of the Violence &
Society UNESCO Centre and lead for the N8 Policing Research
Partnership, Training and Learning strand
• Current research is focused on the measurement of violence
• Work with data which are highly confidential and very, very
‘messy’ (e.g. individualised police records, NGO datasets)
• Teach Making Research Count: Engaging with Quantitative
Data, the Faculty of Arts & Social Sciences ‘prequel to technical
methods courses’: thinking critically about data
• JISC-sponsored Data Champion
Introductions: why we care about
(even messy) data
David Ellis
• Doctor of Psychology, Lecturer in Computational Social Science
at Lancaster, Core Researcher as part of CREST Research Centre,
Honorary Research Fellow at Lincoln
• Current research considers the measurement of digital traces
• Data collected is often messy and cloud-based
• JISC-sponsored Data Champion
Data: what counts?
• Inclusive understanding of ‘data’: the collection, use and
management of myriad forms of data
– ‘field’ data
• Policing
• Health
• Replication crisis within social science
Why bother with (messy) data?
Data, and the analysis of data, can entrench or contest our
understanding of the world: we can neither accept them at
face value nor dismiss them as positivistic and of no use for
progressive social change…
• We need to better support academics, students, policy-makers,
practitioners and the general public to understand the
implications of the construction and analysis of data, the
presentation of data (especially statistical findings), and the use
and interpretation of ‘evidence’
-> a key tool is robust management of data
Contribution to a progressive society, the common good,
a public academia
Messy Data:
• All data are ‘messy’ to some degree: data from ‘the field’ can be
especially messy
• Concepts and definitions can be wildly different
• Getting data is hard
– Sources; collection methods; confidentiality and anonymity;
access; sampling frames -> consequences of explicit and implicit
inclusions and exclusions
• ‘Cleaning’ data is time-consuming and can be highly political
– E.g. outliers: important anomalies or data ‘mistakes’? (see the
sketch after this list)
• Units of measurement matter: e.g. counting crimes vs. counting
victims can reverse a trend
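As a toy illustration of the outlier question, the sketch below flags an
extreme value with a standard interquartile-range rule. The data and the
rule are hypothetical: a rule can flag a point, but deciding whether it is
a data ‘mistake’ or an important anomaly remains an analytic (and
political) judgement.

import statistics

# Hypothetical yearly incident counts; the final value may be a
# recording error or a genuinely exceptional year.
counts = [12, 13, 14, 14, 15, 16, 95]

q1, _, q3 = statistics.quantiles(counts, n=4)  # quartiles
iqr = q3 - q1
flagged = [x for x in counts
           if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print(flagged)  # [95]: flagged by the rule, but removing it is a choice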
Data are messy
– but that’s (kind of) OK
GOAL: Distinguish between the signal and
the noise
• SIGNAL: real variation we want to explain
• NOISE: random variation, probably caused by the process of
collecting and using data, e.g. measurement, sampling and
human error (caveat: tomorrow, with new knowledge or new
techniques/technology, we might return to this seemingly
random ‘noise’ and impose a new meaning); a small simulation
sketch follows below
Nate Silver (2012) The Signal and The Noise: The Art and Science of Prediction. London, Penguin.
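A minimal simulation of this distinction, under made-up assumptions: a
known linear signal (2x) plus Gaussian noise standing in for measurement,
sampling and human error. Averaging repeated measurements shrinks the
noise while the signal persists.

import random

random.seed(1)

def measure(x, repeats=1):
    # True signal is 2 * x; each reading adds random noise (sd = 5)
    # standing in for measurement, sampling and human error.
    readings = [2 * x + random.gauss(0, 5) for _ in range(repeats)]
    return sum(readings) / len(readings)

print(measure(10))                # one noisy reading
print(measure(10, repeats=1000))  # averaging approaches the signal (20)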
Learning
GOAL: to expand the current knowledge base to improve
understanding of a particular issue/topic: learning is more than
collecting or producing (new) data -> data need to be integrated
into, and to change, the existing knowledge base
Example 1. NHS Administrative Data
Ellis, McQueenie, McConnachie et al. (2017). The Lancet Public Health.
Example 1. NHS Administrative Data
Data-cleaning pipeline:
• Inputs: appointments.csv (N = 892,216), patients.csv (N = 73,012),
clinical.csv (N = 704,828)
• Code appointments: attended = 830,039; DNA (did not attend) = 56,441
• Remove administrative/secretary appointments: N = 891,921 remaining
(<.01% removed)
• Remove non-appointments based on time rules and reclassify based on
codes of interest: N = 825,784 remaining (7.4% removed)
• Compute the number of appointments attended/missed for each patient
-> appointmenthistory dataframe (patient ID, DNA, attended, total,
percentage missed, annual DNA rate)
• Remove duplicate patients (N = 2,356) and the N = 491 patients (<1%)
with no appointment data
• Merge appointmenthistory with the patients file (using patient ID as
the link) -> patientappointments dataset (N = 70,165): ID, sex, age,
distance, Rur8, PracticeRur8, SIMD, PracticeSIMD, Ethnic, attended,
DNA, total, percentage missed, category, annual rate (attended)
• Categorise each patient by missed appointments: zero N = 44,685
(63.7%); low N = 19,281 (27.5%); medium N = 5,097 (7.3%);
high N = 1,102 (1.6%)
• Classify patients as frequent/non-frequent attenders (10th centile:
annual attendance rate >= 8.66): yes = 7,283; no = 62,882
• Subset to remove ethnicity data; add age categories; remove patients
with missing data (N = 2,460, 3.5%)
• Ready for analysis and visualization (N = 67,705)
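A minimal pandas sketch of the central aggregate-and-merge steps above,
assuming hypothetical column names (a ‘patient_id’ key and a ‘status’
column coded ‘attended’/‘DNA’); the actual variable names, exclusion
rules and category thresholds come from the published paper and are not
fully shown on the slide.

import pandas as pd

appointments = pd.read_csv("appointments.csv")  # one row per appointment
patients = pd.read_csv("patients.csv")          # one row per patient

# Count attended and missed (DNA) appointments for each patient.
history = (appointments
           .groupby("patient_id")["status"]
           .value_counts()
           .unstack(fill_value=0)
           .reset_index())
history["total"] = history["attended"] + history["DNA"]
history["percentage_missed"] = 100 * history["DNA"] / history["total"]

# Categorise each patient as zero/low/medium/high; these bin edges
# are illustrative, not the paper's definitions.
history["category"] = pd.cut(history["DNA"],
                             bins=[-1, 0, 2, 11, float("inf")],
                             labels=["zero", "low", "medium", "high"])

# Merge with the patients file using patient ID as the link, then
# drop duplicates and patients with missing data, mirroring the
# exclusions above.
patient_appointments = (patients
                        .drop_duplicates(subset="patient_id")
                        .merge(history, on="patient_id", how="inner")
                        .dropna())

Logging the N removed at each step is what makes a flow diagram like the
one above auditable and reproducible.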
Example 2. Problems within Social Science
Shaw, Ellis, Kendrick et al. (2016). Cyberpsychology, Behavior and
Social Networking.
Thank you!
j.towers1@Lancaster.ac.uk (@towersjude)
d.a.ellis@Lancaster.ac.uk (@davidaellis)
rdm@lancaster.ac.uk

Editor's Notes

• #7 My idea for this slide is for David to give an example from health and for me to give one from crime, as below; we could talk to one slide, or separate each bullet point into its own slide and add examples, whichever you think is best. [I’ve just added some slides pointing to 2 examples, but the points you raise here apply to everything.]
– Messy data: e.g. 80% of respondents reporting domestic violence to the Crime Survey for England and Wales have not reported it to the police.
– Concepts and definitions: what counts as violence is the most controversial question in the field. Is it narrow and specific (e.g. a physical act which causes injury, fear or distress) or wide (e.g. Zizek’s ‘violence of capitalism’; Galtung’s ‘any unnecessary civilian death’)? Either choice has implications for the data.
– Access often clashes with the new ‘open data’ agenda: e.g. for CSEW Intimate Violence data, researchers need to be certified; access is via a secure server from a PC with a static IP address, in a locked room with no public access; all outputs have to be checked and signed off before removal from the server; and only those certified can see data in ‘raw’ form or during the analysis process.
– Sampling frames: the CSEW excludes the groups most likely to be victims of crime: homeless people, anyone in an institutional setting (e.g. prison, hospital, refuge), and anyone staying temporarily with friends or family (insecurely housed).
– Outliers: remove serial killers from homicide trends?
– Unit of measurement: violent crime in the UK is going up if you count crimes, going down if you count victims.
  • #8 Success stories…
• #19 The truth is often far messier than what is presented within a journal:
https://psychology.shinyapps.io/example3/
https://psychology.shinyapps.io/smartphonepersonality/
https://t.co/DurJDuJHQM