Slides from my presentation of "How Databases Learn" by Thomer & Twidale, at iConference 2014
For full paper:
https://www.ideals.illinois.edu/handle/2142/47268
1. How Databases Learn
Andrea K. Thomer
Michael B. Twidale
Graduate School of Library and Information Science
University of Illinois at Urbana-Champaign
5 Mar 2014 – iConference - Berlin
3. From ethnography to trace ethnography
Hine 2006, Bietz & Lee 2009 – traditional ethnographies of
scientific databases
Schuurman, 2008: “database ethnographies”
Geiger & Ribes, 2011: document-driven ethnography -> trace
ethnographies
looks at "how, where, and by whom [documents] are produced,
edited, revised or filed" -- in a database this is much the same but
we ask the same questions of tables and fields
Study of edits on Wikipedia
1000s of agents making 1000s of changes over hours
Our study: 1000s of changes by few people over many years;
looking for traces of schema change, reappropriation of fields
4. We ask: how do databases, like
buildings, learn?
5. Case study: the Universal Chalcidoidea
Database
Parasitic wasps: “gem-like
inhabitants of the woodlands
heretofore unknown and by
most never seen nor dreamt of.”
– Girault, 1925
8. How the database learned:
“CRENCYRT”
• Additional table added for separate project
• Duplicate, non-normal tables as an ad hoc way to manage workflow
“REFNEW”
9. Brand’s concept of shearing
Stuff (days – months)
Space plan (months to years)
Services (years – decades)
Skin (decades)
Structure (decades to centuries)
Site (eternal)
11. Additional questions:
What makes a database able to adapt to changing uses and
users?
How does a database evolve when the people in charge of it
change?
How are fields repurposed over time?
How is it that some databases adapt to changing needs and
circumstances better than others?
How are major renovations (refactorings) best handled?
How do we work with unchangeable traces of earlier designs?
How and when are “best practices” not actually for the best?
12. Conclusions (and more questions)
For databases, thinking about preservation as a simple binary
between migration and emulation is too simplistic
They need to evolve, but how?
Buildings face a similar challenge – can we gain insights from
comparison?
Questions for you:
From Mike: Do you know of more examples of databases that
have gone through gradual tweaks or punctuated leaps?
From me: What else should we be reading?
13. Thank you!
Refs:
Bietz, M., & Lee, C. (2009). Collaboration in metagenomics: Sequence databases and the organization of
scientific work. ECSCW 2009, (September), 7–11. Retrieved from
http://www.springerlink.com/index/t7124470143464r9.pdf
Brand, S. (1995). How Buildings Learn: What Happens After They’re Built. Penguin Books.
Girault, A.A. (1925). “Some Gem Like Inhabitants of the Woodlands by Most Never Seen Nor Dreamt Of.” The
Literature of Platygastroidea. Retrieved from http://plazi.org:8080/dspace/handle/10199/15794
Geiger, R. S., & Ribes, D. (2011). Trace Ethnography: Following Coordination through Documentary Practices.
2011 44th Hawaii International Conference on System Sciences, 1–10. doi:10.1109/HICSS.2011.455
Hine, C. (2006). Databases as Scientific Instruments and Their Role in the Ordering of Scientific Work. Social
Studies of Science, 36(2), 269–298. doi:10.1177/0306312706054047
Schuurman, N. (2008). Database Ethnographies Using Social Science Methodologies to Enhance Data
Analysis and Interpretation. Geography Compass, 2(5), 1529–1548. doi:10.1111/j.1749-8198.2008.00150.
Acknowledgements: thanks to Katrina Fenlon, Nic Weber and Karen Wickett for feedback;
and thanks to CIRSS for funding
Today I’m going to be presenting a case study as part of a work-in-progress looking at how research databases, particularly relational databases, “learn” and change over time. This is something that I was particularly excited to bring to iConference as a note because when I attended last year, I was so impressed by the feedback I saw notes authors getting from their audience. So I’m really hoping to hear from you guys at the end of this talk!
In particular, we’re hoping you can help us think up more examples of the objects at the center of our study: long-lived research databases. As the title of this talk hopefully implies, we’re interested in how these databases – particularly relational databases that have been used more or less continuously for more than five years – change, grow and are repurposed over time.
Motivation: But before we get into the examples of long-lived databases that we have found, I want to explain why we’re looking at them in the first place.
Tony Hey this morning talked about the move toward “data intensive science” and the computational turn brought about by the “fourth paradigm” of data-driven discovery. In doing so, he referenced the network of databases that makes up PubMed Central, using a nice, tidy little diagram showing a bunch of circles linked together by arrows.
While tidy little diagrams involving circles and cylinders look nice on slides and grant proposals, they’re a terrible representation of the reality and messiness of a relational database as it’s maintained by a number of people over a number of years.
It’s our opinion that while we as a discipline are very good at abstracting databases into formalisms like normal forms and entity-relationship diagrams, we need to be better at relating those formalisms to long-term use, and we need to work harder to account for use beyond an initial set of use cases.
Prior work has used ethnographic methods to document and describe that complex interaction between humans and information infrastructures.
However, as Geiger and Ribes point out, traditional ethnographies are simply impossible when you’re hoping to describe distributed work.
In their case study of edits made to Wikipedia, they’re talking about geographically distributed but concurrent work. Here, we’re talking about temporally distributed but often spatially or site-constrained work: edits made to a particular database situated in a laboratory.
In this work we’re conducting something similar to what they call a trace ethnography, which builds on the tradition of document-driven ethnography by looking at "how, where, and by whom [documents] are produced, edited, revised or filed." In a database this is much the same, but we ask the same questions of records, tables and fields.
In addition to conducting this trace ethnography, we’re exploring the application of concepts outlined by Stewart Brand in his 1995 book, “How Buildings Learn”. In this book, Brand describes how buildings “learn” from their owners and inhabitants, and how their structure, skin and space change in response to changing needs. Architecture-as-metaphor isn’t new in software engineering, but we think it can help clarify fuzzy concepts and phenomena, particularly for database use.
Our case study focuses on the UCD – a database of chalcid wasp names, references and geolocations. Chalcid wasps are tiny but plentiful – there are 22k described species but up to ½ a million in existence – and this database contained some 10k records.
Originally built by John Noyes for the British NHM, the database was in need of migration – specifically to a larger taxonomic database called “TaxonWorks”.
I was originally brought to this database as a part-time employee of the natural history survey, with the job title of “taxonomic data modeler”. My primary goal was to interpret the UCD’s original creators’ original “schema” and turn it into something more formal so that the records could be migrated. Our ethnography at this point involved looking through the collection of files we were handed and trying to interpret the many partially or poorly described tables. The “Flowchart” on the left is the primary descriptive document that we had to work with, along with a collection of text files representing the individual tables.
However, in comparing the flowchart to the actual files and tables, we began finding discrepancies. Notably, we found far more tables than were described by the flowchart.
We realized that Noyes had made extensive alterations and edits to his db’s structure after creating his first set of documentation.
The files we were given included 34 tables, whereas Noyes’ original schema contained only 22. After consulting with Noyes, we learned that he had “ingested” several other datasets into his own over the course of the UCD’s lifespan and had, furthermore, begun using the UCD for local data management of some related-but-separate projects. One example is a table titled “crencyrt”, which contains data from a survey of Costa Rican Encyrtidae ranges otherwise unrelated to the rest of Noyes’ data aggregation efforts.
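To make that comparison concrete, here is a minimal sketch of checking a documented table list against the tables actually found in a set of files. It is an illustration only, not the procedure used in the study. The function name and the table lists are hypothetical; only the names CRENCYRT and REFNEW come from the slides, and the sketch does not claim which tables the real flowchart did or did not describe.

```python
def undocumented_tables(documented, found):
    """Return, sorted, the tables that exist in the files but not in the documentation.

    Names are compared case-insensitively, since documentation and file
    names often differ in capitalization.
    """
    documented_norm = {name.strip().upper() for name in documented}
    found_norm = {name.strip().upper() for name in found}
    return sorted(found_norm - documented_norm)


# Hypothetical example data: a documented schema and the tables found on disk.
documented = ["TAXON", "REFERENCE", "DISTRIBUTION"]
found = ["taxon", "reference", "distribution", "crencyrt", "refnew"]

extra = undocumented_tables(documented, found)
print(extra)
```

Each table in the resulting list is a prompt for a question to the database's maintainer rather than an error to be fixed: an undocumented table is a trace of something the database "learned" after its documentation was written.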
Additionally, Noyes relied on non-expert, unpaid museum volunteers for data entry, so he stringently checked all of their work before “accepting” it into the database. In order to do this Noyes created proxy tables into which volunteers could enter their data. Noyes then would manually migrate these new records to the main set of tables.
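The proxy-table workflow can be sketched roughly as follows. This is a hedged illustration, not a reconstruction of the UCD: SQLite stands in for whatever software the database actually ran on, the table and column names (`reference`, `reference_new`, `checked`) are hypothetical, and the `checked` flag is an assumed stand-in for Noyes’ manual review, which the source describes only as checking and then manually migrating records.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Main table: holds only records the curator has accepted.
    CREATE TABLE reference (
        id INTEGER PRIMARY KEY,
        citation TEXT NOT NULL
    );
    -- Proxy (staging) table: a near-duplicate where volunteers enter data.
    CREATE TABLE reference_new (
        id INTEGER PRIMARY KEY,
        citation TEXT NOT NULL,
        checked INTEGER NOT NULL DEFAULT 0  -- 1 once the curator has reviewed it
    );
""")


def accept_checked(conn):
    """Move curator-approved rows from the proxy table into the main table."""
    with conn:  # one transaction: commit on success, roll back on error
        count = conn.execute(
            "SELECT COUNT(*) FROM reference_new WHERE checked = 1"
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO reference (citation) "
            "SELECT citation FROM reference_new WHERE checked = 1"
        )
        conn.execute("DELETE FROM reference_new WHERE checked = 1")
    return count


# Placeholder volunteer entries; two have been reviewed, one has not.
conn.executemany(
    "INSERT INTO reference_new (citation, checked) VALUES (?, ?)",
    [("Volunteer entry A", 1), ("Volunteer entry B", 1), ("Volunteer entry C", 0)],
)

moved = accept_checked(conn)
in_main = conn.execute("SELECT COUNT(*) FROM reference").fetchone()[0]
remaining = conn.execute("SELECT COUNT(*) FROM reference_new").fetchone()[0]
print(moved, in_main, remaining)
```

The duplicated structure is exactly what a normalization-minded review would flag, yet here it does useful work: the proxy table keeps unreviewed entries out of the curated records.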
Like I said before, a lot of the prior work looking at databases in the workplace has been ethnographic, and thus a bit difficult to generalize. And our representations of databases are too static – they don’t reflect change.
A combination of trace ethnography and Brand’s framing allows us to study the interplay between engineering, a particular lab’s culture, and the day-to-day getting on with it. In the case of the UCD, the changes to this database particularly lend themselves to Brand’s architectural metaphors: tables were “added on” like spare rooms to make room for an expanding “family” of projects. Because Noyes so carefully curated his data, we did not find some of the quirks of long-term use that we have observed in our own prior work with databases, such as gradual change in the use of certain fields over time (the repurposing of a room, in Brand’s rendering), or the shearing of large tables into smaller subsections.
There have been some references to shearing related to software development in general, but nothing too well fleshed out.
Systems of Record, Systems of Differentiation and Systems of Innovation.
We often discuss digital preservation as a straightforward dichotomy between emulation and migration, but for databases that binary is too simplistic.
In conclusion: this opens up a lot of research questions like: