1. Data for the Humanities
February 21, 2017
Rafia Mirza
Digital Humanities Librarian
rafia@uta.edu @librarianrafia
Peace Ossom Williamson
Director of Research Data Services
peace@uta.edu @123POW
2. Learning Outcomes
• Understand the use of data in answering humanities research
questions
• Understand descriptive metadata and the rationale for its use
• Recognize areas of potential bias and ambiguous or misleading
representation in reporting
4. “All content in digital formats can
be characterized as structured or
unstructured data.”
Introduction to Digital Humanities: Concepts, Methods, and Tutorials
11. “Humanists have data, and
they need data skills.”
Digital Humanities Data Curation
Data in the Humanities
12. Types of Humanities Data
• Scholarly editions
• Text corpora
• Text with markup
• Thematic research collections
• Data with accompanying analysis or annotation
• Finding aids and other information maps, such as
bibliographies
Digital Humanities Data Curation Introduction
13. Big Data Digital Humanities vs.
Small Data Digital Humanities
• “Research in Big Data Digital Humanities focuses on large or dense
cultural datasets, which call for new processing and interpretation
methods”
• “..Small Data Digital Humanities regroup more focused works that do
not use massive data processing..”
• A map for big data research in digital humanities, Frédéric Kaplan
14. 1. research the context:
know the data about the data (so meta!)
How to understand data
15. Data versus Metadata
Big? Smart? Clean? Messy? Data in the Humanities, Christof Schöch
Metadata Metadata Metadata Metadata
data data data data
data data data data
data data data data
data data data data
About this dataset:
Title: Metadata
Date Created: Metadata
Creator: Metadata
Methods Used: Metadata
19. Zine Librarians Code of Ethics
• “Zines are not like mass-distributed books. They are often self-
published and self-distributed, and sometimes printed in very
small runs, intended for a small audience. In addition, perzines
are by definition “personal”, and zinesters may feel different
about having their zines distributed in print than they would
about having them openly available on the internet or print.
This can be especially true in the case of “historical” zines in
library collections — for example, a teen girl writing a zine for
her close friends in 1994 may not want her zine distributed
online or in print 20 years later.”
• Via Zinelibraries
22. Recognizing uncertainty and bias
Data on killings in the Syrian conflict.
https://responsibledata.io/reflection-stories/uncertainty-statistics/
Let’s investigate the source…
23. Recognizing uncertainty and bias
Sources include
• Syrian government
• Syrian Center for Statistics and Research
• Syrian Network for Human Rights
• Syrian Observatory for Human Rights
and many more.
https://responsibledata.io/reflection-stories/uncertainty-statistics/
25. there are lots of human decisions that go into
creating these statistics
without knowing how these deaths have
been coded, it’s difficult to trust in the
figures
29. Bivariate descriptive statistics
fancy way of saying
we are looking at two
variables at once
           Hamlet  Macbeth  Othello
Similes        50        9       59
Metaphors      20       38       58
Total          70       47      117
Evaluating Comparison Methods
31. Finding Data
What type of data are you looking for?
List of Data Repositories
DH Toychest: Data Collections and Datasets
• Texts: HathiTrust Digital Library
• Spatial or numeric datasets: Data.gov
• Images: British Library Images
• Hybrid data sets: Digital Public Library of America
33. How to data
1. Determine what to say
2. Find/collect/create the data
you need
3. Wrangle!
4. Clean!
5. Do it many more times.
34.
ID     Religion     Income    Age  Q1   Q2  Q3
26371  Jewish       <$10K     19   Yes  6   20
26372  Atheist      $50-75K   24   -    4   21
26373  Catholic     $75-100K  56   Yes  3   21
26374  Withheld     $75-100K  33   No   6   21
26375  Pentecostal  withheld  49   Yes  8   20
26376  Jewish       $40-50K   29   Yes  5   19
26377  Catholic     $20-30K   37   No   4   22
http://vita.had.co.nz/papers/tidy-data.pdf
Tidy Data
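A minimal cleaning sketch for the survey table on this slide, using only the Python standard library. The rows are copied from the slide; the decision to treat "-", "Withheld", and "withheld" as the same kind of missing answer is an assumption made for illustration, and the helper names are hypothetical.

```python
# Rows copied from the slide's survey table, kept as raw strings.
COLUMNS = ("ID", "Religion", "Income", "Age", "Q1", "Q2", "Q3")
RAW_ROWS = [
    ("26371", "Jewish", "<$10K", "19", "Yes", "6", "20"),
    ("26372", "Atheist", "$50-75K", "24", "-", "4", "21"),
    ("26373", "Catholic", "$75-100K", "56", "Yes", "3", "21"),
    ("26374", "Withheld", "$75-100K", "33", "No", "6", "21"),
    ("26375", "Pentecostal", "withheld", "49", "Yes", "8", "20"),
    ("26376", "Jewish", "$40-50K", "29", "Yes", "5", "19"),
    ("26377", "Catholic", "$20-30K", "37", "No", "4", "22"),
]

# Assumption: all of these spellings mean "no answer recorded".
MISSING_MARKERS = {"-", "withheld", ""}
NUMERIC_COLUMNS = ("Age", "Q2", "Q3")


def clean_value(value):
    """Return None for a missing-value marker, else the trimmed text."""
    text = value.strip()
    return None if text.lower() in MISSING_MARKERS else text


def clean_rows(rows):
    """Turn raw tuples into dicts with consistent missing values and integer columns."""
    cleaned = []
    for row in rows:
        record = {col: clean_value(val) for col, val in zip(COLUMNS, row)}
        for col in NUMERIC_COLUMNS:
            if record[col] is not None:
                record[col] = int(record[col])
        cleaned.append(record)
    return cleaned


def count_missing(records):
    """Count missing values per column."""
    return {col: sum(1 for r in records if r[col] is None) for col in COLUMNS}


if __name__ == "__main__":
    print(count_missing(clean_rows(RAW_ROWS)))
```

Whether "withheld" and "-" really mean the same thing is exactly the kind of question the dataset's documentation should answer before the values are merged.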
35. Most common problems
• Column headers are values, not variable names.
• Multiple variables are stored in one column.
• Variables are stored in both rows and columns.
• Multiple types of observational units are stored in the same table.
• A single observational unit is stored in multiple tables
http://vita.had.co.nz/papers/tidy-data.pdf
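The first problem in the list, column headers that are values rather than variable names, can be sketched as a reshaping step. This is an illustrative sketch only: the counts below are hypothetical, and `melt` is a hand-written helper, not a library call.

```python
# Hypothetical wide table: the income brackets are values, yet they appear as column headers.
WIDE = [
    {"religion": "Catholic", "<$10K": 4, "$20-30K": 7},
    {"religion": "Jewish", "<$10K": 2, "$20-30K": 3},
]


def melt(rows, id_column, variable_name, value_name):
    """Reshape wide rows into one row per (id, header) pair."""
    long_rows = []
    for row in rows:
        for header, value in row.items():
            if header == id_column:
                continue
            long_rows.append(
                {id_column: row[id_column], variable_name: header, value_name: value}
            )
    return long_rows


TIDY = melt(WIDE, "religion", "income", "count")

if __name__ == "__main__":
    for row in TIDY:
        print(row)
```

After reshaping, each row is one observation and each column is one variable (religion, income, count).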
36. if you torture data long enough,
it will confess to anything
42. Open Data: Things to Consider
http://www.slideshare.net/libereurope/humanities-data-literacy-student-perspective-on-digital-cultural-heritage-collections?qid=70bd86f2-10c5-43a6-b053-56d264ca28ab&v=&b=&from_search=1
43. Recommended Reading / Viewing
“Numbers are Only Human” – Brian Root
“Ethical Principles of Psychologists and Code of Conduct” –
American Psychological Association
“On Not Looking: Ethics and Access in the Digital Humanities” –
Kimberly Christen-Withey
Data are anything which is used or created to generate new knowledge and interpretations. “Anything” may be objective or subjective; physical or emotional; persistent or ephemeral; personal or public; explicit or tacit; and is consciously or unconsciously referenced by the researcher at some point during the course of their research. Research data may or may not lead to a research output, which, regardless of method of presentation, is a planned public statement of new knowledge or interpretation. (Garrett, 2012)
Involves knowledge of quantitative (statistical) methods, metadata standards, and the data curation lifecycle. It also means:
• Identifying problems that a dataset can answer
• Recognizing the research potential of an existing heritage collection, or identifying ways to answer questions or problems, and developing a hypothesis based on the data
• Becoming aware of the data features that enable new quantitative methods (including access, format, systematicity, and metadata)
• Understanding the context and provenience of a collection (including extension, representativeness, openness, and copyright and privacy issues)
“As the materials and analytical practices of research become increasingly digital, the theoretical knowledge and practical skills of information science, librarianship, and archival science will become ever more vital to humanists and to anyone working with cultural heritage.”
Heritage collections
“Another important distinction is between data and metadata. Here, the term “data” refers to the part of a file or dataset which contains the actual representation of an object of inquiry, while the term “metadata” refers to data about that data: metadata explicitly describes selected aspects of a dataset, such as the time of its creation, or the way it was collected, or what entity external to the dataset it is supposed to represent.”
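One way to picture the distinction is a small container that keeps the descriptive fields from the "About this dataset" slide apart from the rows they describe. This is a sketch under assumptions: the `Dataset` class, the field values, and the sample rows are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Keeps metadata (data about the data) separate from the data rows."""

    metadata: dict
    rows: list = field(default_factory=list)

    def describe(self):
        """Render the metadata in the 'About this dataset' layout."""
        lines = ["About this dataset:"]
        lines += [f"{key}: {value}" for key, value in self.metadata.items()]
        return "\n".join(lines)


# Hypothetical example values, for illustration only.
letters = Dataset(
    metadata={
        "Title": "Sample correspondence collection",
        "Date Created": "2017-02-21",
        "Creator": "Example Library",
        "Methods Used": "Manual transcription",
    },
    rows=[
        {"sender": "A", "recipient": "B", "year": 1850},
        {"sender": "B", "recipient": "A", "year": 1851},
    ],
)

if __name__ == "__main__":
    print(letters.describe())
    print(len(letters.rows), "rows of data")
```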
Read papers and study accompanying documentation
Need to know your data. What is the background of the person, time period, or language you are studying? What elements are represented and how were they obtained? What elements are missing or misrepresented? What are other questions you can ask or ways you can find out answers?
How does this reflect itself in writing? How does one show their background? How does one signify?
The present moment is filled with DH practitioners creating visualizations of ‘big data,’ mapping connections between people and ancient cities, and building archives dedicated to long-dead authors. These worthwhile academic and practical pursuits point us to the center of the digital humanities landscape. But, if we move to the margins and begin to look at the projects and tools that emerge from indigenous communities, archivists and cultural specialists, we see a different pattern: images are purposely removed, archives are not ‘open to the public,’ maps of sacred sites are consciously not created, defined or linked to. How do we integrate these varied practices and philosophies into the possibilities offered by digital humanities scholars? It is one thing to call attention to difference, it is another to alter our display practices, question access parameters, and redefine our own ways of knowing based on systems of accountability that define an ethical field of visuality based on not looking. If seeing is believing and a picture is worth a thousand words, what can we learn from the act of not looking, or perhaps, more specifically, not seeing?
PEACE
Talk to subject experts, read papers, and study accompanying documentation
More often than not, it is not the writer that is twisting the numbers but the numbers themselves twisting up the writer. Manipulation of the facts or of the reader is usually not intentional.
Estimations from the Syrian Observatory for Human Rights.
Imagine the decision that might have to be made to categorize a typical citizen with no military training, who has picked up a gun shortly before his death. Perhaps the coder might have a bias to continue calling this person a civilian. But this person took up arms against the government, did they not? How would you code a Syrian army defector now fighting with an opposition group? … Without some sort of standard protocol, rigorously followed, the coding of affiliation allows for a degree of subjectivity
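The effect of coding rules on a reported figure can be sketched with a toy example. Everything here is hypothetical: the four records, the field names, and both coding protocols are invented for illustration and do not come from any of the sources named on the slides.

```python
from collections import Counter

# Hypothetical records; no real data is represented here.
RECORDS = [
    {"id": 1, "military_training": False, "armed_at_death": False},
    {"id": 2, "military_training": False, "armed_at_death": True},
    {"id": 3, "military_training": True, "armed_at_death": True},
    {"id": 4, "military_training": True, "armed_at_death": True},
]


def code_by_training(record):
    """Protocol A (assumed): only people with military training count as combatants."""
    return "combatant" if record["military_training"] else "civilian"


def code_by_arms(record):
    """Protocol B (assumed): anyone armed at the time of death counts as a combatant."""
    return "combatant" if record["armed_at_death"] else "civilian"


def tally(records, protocol):
    """Count how many records each category receives under a protocol."""
    return Counter(protocol(record) for record in records)


if __name__ == "__main__":
    print("Protocol A:", dict(tally(RECORDS, code_by_training)))
    print("Protocol B:", dict(tally(RECORDS, code_by_arms)))
```

The same four records yield different civilian counts under the two protocols, which is why a published figure is hard to interpret without knowing which protocol was followed.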
What is a numerical way to describe data?
• Frequency – how often a value occurs (e.g., a name or gender): 50 women, 38 men, 2 other, 10 unknown
• Rank – order values from highest to lowest (or lowest to highest), e.g., list albums by their use of personal pronouns from high to low
• Average – a measure of central tendency (mean, median, or mode)
• Variability – how different or similar scores are to each other (range, standard deviation)
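The four descriptions above can be sketched with Python's standard library. The gender counts are the ones given in these notes; the album names and pronoun counts are hypothetical values made up for illustration.

```python
import statistics
from collections import Counter

# Frequency: the gender counts from the notes, expanded into individual responses.
responses = ["women"] * 50 + ["men"] * 38 + ["other"] * 2 + ["unknown"] * 10
frequency = Counter(responses)

# Rank: hypothetical counts of personal pronouns per album, ordered high to low.
pronoun_counts = {"Album A": 120, "Album B": 85, "Album C": 240, "Album D": 95}
ranked = sorted(pronoun_counts.items(), key=lambda item: item[1], reverse=True)

# Average: two measures of central tendency for the same values.
values = list(pronoun_counts.values())
mean = statistics.mean(values)
median = statistics.median(values)

# Variability: range and sample standard deviation.
value_range = max(values) - min(values)
spread = statistics.stdev(values)

if __name__ == "__main__":
    print("Frequency:", dict(frequency))
    print("Ranked:", ranked)
    print("Mean:", mean, "Median:", median)
    print("Range:", value_range, "Std. dev.:", round(spread, 1))
```

Here the mean is pulled well above the median by one album with an unusually high count, a reminder that the choice of "average" shapes the story.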
Contingency table, a type of summary table. This shows frequency distribution.
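A short sketch of working with the contingency table from the bivariate statistics slide. The counts are taken as printed there; note that in each row the third column also equals the sum of the first two, so it may be a row total rather than a third play. The sketch does not depend on either reading, and the function names are hypothetical.

```python
# Counts as printed on the slide, in column order.
COLUMNS = ("Hamlet", "Macbeth", "Othello")
COUNTS = {
    "Similes": (50, 9, 59),
    "Metaphors": (20, 38, 58),
}


def column_totals(counts):
    """Sum each column across all rows (the table's 'Total' row)."""
    return tuple(sum(column) for column in zip(*counts.values()))


def column_shares(counts):
    """Express each cell as a share of its column total."""
    totals = column_totals(counts)
    return {
        label: tuple(value / total for value, total in zip(values, totals))
        for label, values in counts.items()
    }


if __name__ == "__main__":
    print("Totals:", dict(zip(COLUMNS, column_totals(COUNTS))))
    for label, shares in column_shares(COUNTS).items():
        print(label, [f"{share:.0%}" for share in shares])
```

Shares make the two variables easier to compare than raw counts, because the columns have different totals.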
HathiTrust: More than 2 million volumes are in the public domain and freely viewable on the Web. Information about obtaining the texts is available from HathiTrust.
British Library Images: Millions of images from the pages of digitized 17th-, 18th-, and 19th-century books
DPLA: "brings together the riches of America’s libraries, archives, and museums, and makes them freely available to the world"; API enabled
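As a rough sketch of what "API enabled" means in practice, the snippet below only builds a search URL and does not send a request. It assumes the DPLA API v2 items endpoint and its `q`, `page_size`, and `api_key` parameters; check the current DPLA API documentation before relying on them. The function name and the placeholder key are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the DPLA API v2 items endpoint; verify against the DPLA documentation.
DPLA_ITEMS_ENDPOINT = "https://api.dp.la/v2/items"


def build_dpla_query(search_terms, api_key, page_size=10):
    """Build a search URL for DPLA items; the caller supplies their own API key."""
    params = {"q": search_terms, "page_size": page_size, "api_key": api_key}
    return f"{DPLA_ITEMS_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    # "YOUR_KEY" is a placeholder, not a working key.
    print(build_dpla_query("Hamlet promptbook", api_key="YOUR_KEY"))
```

The response to such a request would be structured data (JSON) rather than a web page, which is what makes a collection usable as a dataset.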
Data literacy is important so that we can
• Tell compelling stories that others are more likely to repeat, remember, and act on
• Determine when someone else is trying to mislead using visualizations
This is similar to when you got that mosquito bite and were sure you were getting Zika Virus or West Nile, only to realize your arm itched for 24 hours.
Business Insider published the chart but sought permission to reverse it, keeping the same design but turning it upside down for their readers.
It’s true though, that images such as the visualisation above draw attention to some important issues. Though they state their data source (the Syria Network for Human Rights) what we’ve explored here so far makes it clear that this data has flaws. We can’t know for sure the extent of those flaws, though, and some might argue that as long as the main message is transmitted, the details don’t matter so much.