How databases learn - iconference 2014

Slides from my presentation of "How Databases Learn" by Thomer & Twidale, at iConference 2014. For the full paper: https://www.ideals.illinois.edu/handle/2142/47268

How Databases Learn
Andrea K. Thomer
Michael B. Twidale
Graduate School of Library and Information Science
University of Illinois at Urbana-Champaign
5 Mar 2014 – iConference - Berlin

1. How Databases Learn Andrea K. Thomer Michael B. Twidale Graduate School of Library and Information Science University of Illinois at Urbana-Champaign 5 Mar 2014 – iConference - Berlin

2. Ceci n’est pas un database.

3. From ethnography to trace ethnography
• Hine 2006, Bietz & Lee 2009 – traditional ethnographies of scientific databases
• Schuurman, 2008: “database ethnographies”
• Geiger & Ribes, 2011: document-driven ethnography -> trace ethnographies
• Looks at "how, where, and by whom [documents] are produced, edited, revised or filed"; in a database this is much the same, but we ask the same questions of tables and fields
• Study of edits on Wikipedia
• 1000s of agents making 1000s of changes over hours
• Our study: 1000s of changes by few people over many years; looking for traces of schema change, reappropriation of fields

4. We ask: how do databases, like buildings, learn?

5. Case study: the Universal Chalcidoidea Database
Parasitic wasps: “gem-like inhabitants of the woodlands heretofore unknown and by most never seen nor dreamt of.” – Girault, 1925

6. How we learned the database

7. How the database learned:

8. How the database learned:
• Additional table added for separate project
• Duplicate, non-normal tables as an ad hoc way to manage workflow
Labeled tables: “CRENCYRT”, “REFNEW”

9. Brand’s concept of shearing
• Stuff (days – months)
• Space plan (months to years)
• Services (years – decades)
• Skin (decades)
• Structure (decades to centuries)
• Site (eternal)

10. How do we account for shearing in databases?

11. Additional questions:
• What makes a database able to adapt to changing uses and users?
• How does a database evolve when the people in charge of it change?
• How are fields repurposed over time?
• How is it that some databases adapt to changing needs and circumstances better than others?
• How are major renovations (refactorings) best handled?
• How do we work with unchangeable traces of earlier designs?
• How and when are “best practices” not actually for the best?

12. Conclusions (and more questions)
• For databases, thinking about preservation as a simple binary between migration and emulation is too simplistic
• They need to evolve, but how?
• Buildings face a similar challenge – can we gain insights from comparison?
• Questions for you:
  • From Mike: Do you know of more examples of databases that have gone through gradual tweaks or punctuated leaps?
  • From me: What else should we be reading?

13. Thank you!
Refs:
• Bietz, M., & Lee, C. (2009). Collaboration in metagenomics: Sequence databases and the organization of scientific work. ECSCW 2009, (September), 7–11. Retrieved from http://www.springerlink.com/index/t7124470143464r9.pdf
• Brand, S. (1995). How Buildings Learn: What Happens After They’re Built. Penguin Books.
• Girault, A.A. (1925). “Some Gem Like Inhabitants of the Woodlands by Most Never Seen Nor Dreamt Of.” The Literature of Platygastroidea. Retrieved from http://plazi.org:8080/dspace/handle/10199/15794
• Geiger, R. S., & Ribes, D. (2011). Trace Ethnography: Following Coordination through Documentary Practices. 2011 44th Hawaii International Conference on System Sciences, 1–10. doi:10.1109/HICSS.2011.455
• Hine, C. (2006). Databases as Scientific Instruments and Their Role in the Ordering of Scientific Work. Social Studies of Science, 36(2), 269–298. doi:10.1177/0306312706054047
• Schuurman, N. (2008). Database Ethnographies Using Social Science Methodologies to Enhance Data Analysis and Interpretation. Geography Compass, 2(5), 1529–1548. doi:10.1111/j.1749-8198.2008.00150.
Acknowledgements: thanks to Katrina Fenlon, Nic Weber and Karen Wickett for feedback; and thanks to CIRSS for funding

14. Databases at RLB

Editor's Notes

Today I’m going to be presenting a case study as part of a work-in-progress looking at how research databases, particularly relational databases, “learn” and change over time. This is something that I was particularly excited to bring to iConference as a note because when I attended last year, I was so impressed by the feedback I saw notes authors getting from their audience. So I’m really hoping to hear from you guys at the end of this talk! In particular, we’re hoping you can help us think up more examples of the objects at the center of our study: long-lived research databases. As the title of this talk hopefully implies, we’re interested in how these databases – particularly relational databases that have been used more or less continuously for more than five years – change, grow and are repurposed over time.
Motivation: Before we get into the examples of long-lived databases that we have found, I want to explain why we’re looking at them in the first place. Tony Hey this morning talked about the move toward “data intensive science” and the computational turn brought about by the “fourth paradigm” of data-driven discovery. In doing so, he referenced the network of databases that makes up PubMed Central, using a nice, tidy little diagram showing a bunch of circles linked together by arrows. While tidy little diagrams involving circles and cylinders look nice on slides and grant proposals, they’re a terrible representation of the reality and messiness of a relational database as it’s maintained by a number of people over a number of years. It’s our opinion that while we as a discipline are very good at abstracting databases into formalisms like normal forms and entity-relationship diagrams, we need to be better at relating those formalisms to long-term use, and furthermore need to work harder to account for use beyond an initial set of use cases.
Prior work has used ethnographic methods to document and describe that complex interaction between humans and information infrastructures. However, as Geiger and Ribes point out, traditional ethnographies are simply impossible when you’re hoping to describe distributed work. In their case study of edits made to Wikipedia, they’re talking about geographically distributed but concurrent work. Here, we’re talking about temporally distributed but often spatially or site-constrained work: edits made to a particular database situated in a laboratory. In this work we’re conducting something similar to what they call a trace ethnography, which builds on the tradition of document-driven ethnography, looking at "how, where, and by whom they are produced, edited, revised or filed". In a database this is much the same, but we ask these questions of records, tables and fields.
In addition to conducting this trace ethnography, we’re exploring the application of concepts outlined by Stewart Brand in his 1995 book, “How Buildings Learn”. In this book, Brand describes how buildings “learn” from their owners and inhabitants, and how their structure, skin and space change in response to changing needs. Architecture-as-metaphor isn’t new in software engineering, but we think it can help clarify fuzzy concepts and phenomena, particularly for database use.
Our case study focuses on the UCD, a database of chalcid wasp names, references and geolocations. Chalcid wasps are tiny but plentiful – there are 22k described but up to half a million in existence – and this database contained some 10k records. Originally built by John Noyes for the British NHM, the database was in need of migration, specifically to a larger taxonomic database called “TaxonWorks”.
I was originally brought to this database as a part-time employee of the Natural History Survey, with the job title of “taxonomic data modeler”. My primary goal was to interpret the UCD creators’ original “schema” and turn it into something more formal so that the records could be migrated. Our ethnography at this point involved looking through the collection of files we were handed and trying to interpret the many partially or poorly described tables. The “Flowchart” on the left was the primary descriptive document that we had to work with, along with a collection of text files representing the individual tables.
However, in comparing the flowchart to the actual files and tables, we began finding discrepancies. Notably, we found far more tables than were described by the flowchart.
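The comparison described above (documented tables versus the tables actually present) can be sketched in a few lines. This is only an illustration under stated assumptions: it uses an in-memory SQLite database rather than the UCD's actual text files, the table names TAXON and REFS are hypothetical, and CRENCYRT and REFNEW simply borrow the labels shown on slide 8. The helper `undocumented_tables` is likewise a made-up name.

```python
import sqlite3


def undocumented_tables(conn, documented):
    """Return tables present in the database but missing from the documentation."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    ).fetchall()
    actual = {name.upper() for (name,) in rows}
    return sorted(actual - {d.upper() for d in documented})


# Hypothetical stand-in for a database that has outgrown its flowchart.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE TAXON (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE REFS (id INTEGER PRIMARY KEY, citation TEXT);
    CREATE TABLE CRENCYRT (id INTEGER PRIMARY KEY, locality TEXT);
    CREATE TABLE REFNEW (id INTEGER PRIMARY KEY, citation TEXT);
    """
)

documented = ["TAXON", "REFS"]  # assumed contents of the original documentation
extra = undocumented_tables(conn, documented)
print(extra)
```

Each name in `extra` is a trace worth asking about: when was the table added, by whom, and for what purpose.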
We realized that Noyes had made extensive alterations and edits to his database’s structure after creating his first set of documentation. The files we were given included 34 tables, whereas Noyes’ original schema contained only 22. After consulting with Noyes, we learned that he had “ingested” several other datasets into his database over the course of the UCD’s lifespan, and had furthermore begun using the UCD for local data management of some related-but-separate projects, such as a table titled “crencyrt”, which contains data from a survey of Costa Rican Encyrtidae ranges otherwise unrelated to the rest of Noyes’ data aggregation efforts. Additionally, Noyes relied on non-expert, unpaid museum volunteers for data entry, so he stringently checked all of their work before “accepting” it into the database. In order to do this, Noyes created proxy tables into which volunteers could enter their data. Noyes would then manually migrate these new records to the main set of tables.
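The proxy-table workflow described above resembles what is often called a staging table. A minimal sketch follows, assuming SQLite and hypothetical table names (`refs` for the main table, `refnew` for the proxy, loosely echoing the “REFNEW” label on slide 8). The source says the migration was done manually; the `accept` function here is only an illustration of the same review-then-move step, not Noyes’ actual procedure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Main, curated table (hypothetical name).
    CREATE TABLE refs (id INTEGER PRIMARY KEY, citation TEXT NOT NULL);
    -- Proxy table with the same shape, used for unreviewed volunteer entries.
    CREATE TABLE refnew (id INTEGER PRIMARY KEY, citation TEXT NOT NULL);
    """
)

# Volunteers only ever write to the proxy table.
conn.executemany(
    "INSERT INTO refnew (citation) VALUES (?)",
    [("Example citation A",), ("Example citation B",)],
)


def accept(conn, staged_id):
    """Move one reviewed record from the proxy table into the main table."""
    with conn:  # both statements commit together, or roll back together
        conn.execute(
            "INSERT INTO refs (citation) SELECT citation FROM refnew WHERE id = ?",
            (staged_id,),
        )
        conn.execute("DELETE FROM refnew WHERE id = ?", (staged_id,))


accept(conn, 1)  # the curator has checked record 1 and accepts it

accepted = [c for (c,) in conn.execute("SELECT citation FROM refs")]
pending = [c for (c,) in conn.execute("SELECT citation FROM refnew")]
print(accepted, pending)
```

The duplicated table shape is non-normal by textbook standards, but it keeps unchecked records out of the curated data, which is the workflow need the notes describe.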
Like I said before, a lot of the prior work looking at databases in the workplace has been ethnographic, and thus a bit difficult to generalize. And our representations of databases are too static: they don’t reflect change. A combination of a trace ethnography and Brand’s framing allows us to study the interplay between engineering, a particular lab’s culture, and the day-to-day getting on with it. In the case of the UCD, the changes to this database particularly lend themselves to Brand’s architectural metaphors: tables were “added on” like spare rooms to make room for an expanding “family” of projects. Because Noyes so carefully curated his data, we did not find some of the quirks of long-term use that we have observed in our own prior work with databases, such as gradual change in the use of certain fields over time (the repurposing of a room, in Brand’s rendering), or shearing of large tables into smaller subsections.
There have been some references to shearing in software development in general, but nothing too well fleshed out (e.g., Systems of Record, Systems of Differentiation and Systems of Innovation). We often discuss digital preservation as a straightforward dichotomy between emulation and migration, but for databases that binary is too simplistic.
In conclusion: this opens up a lot of research questions like:
