A presentation given at the "Data Stewardship: Increasing the Integrity and Effectiveness of Science and Scholarship" session on Friday, June 8, 2012, at the IASSIST 2012 conference in Washington, DC.
This presentation introduced data publishing, using a social science (archaeology) case study to explore editorial processes and dissemination outcomes that increasingly demand “Linked Data” capabilities.
IASSIST Kansa Presentation
1. Case-Study: Publishing to the “Web of Data” in Archaeology
Quality and Workflows
Eric Kansa
UC Berkeley / OpenContext.org
Unless otherwise indicated, this work is licensed under a Creative Commons
Attribution 3.0 License <http://creativecommons.org/licenses/by/3.0/>
2. “Small Science” data sharing is hard:
(1) Complexity
(2) Scalability
(3) Ethics, cultural property claims, IP
(4) Incentives
(5) Preservation
Image Credit: “Grand Canyon NPS” via Flickr (CC-By)
http://www.flickr.com/photos/grand_canyon_nps/5975537378/
3. Thousand Flowers
● Open Context: Open access, open licensed data for archaeology
● Archiving by California Digital Library
● Persistent Identifiers (DOIs, ARKs)
● Web services
● NSF/NEH links for data management plans
4. Thousand Flowers
Fills a Gap:
Most data sources are institutional. Open Context publishes individual, small group contributions.
5. Thousand Flowers
Fills a Gap:
Most data sources are institutional. Open Context publishes individual, small group contributions.
Challenge:
Diverse contributions, needing lots of work to clean up and “link” to the Web of Data.
6. • 3-year project, Oct 2010 – Sep 2013
• Funded with a National Leadership Grant from the Institute of Museum and Library Services, LG-06-10-0140-10, “Dissemination Information Packages for Information Reuse”
• Ixchel Faniel, PI & Elizabeth Yakel, Co-PI
http://www.dipir.org
8. The Big DIPIR Questions
Research Questions
1. What are the significant properties of data that facilitate reuse by the designated communities at the three sites?
2. How can these significant properties be expressed as representation information to ensure the preservation of meaning and enable data reuse?
9. Open Context Interviewees
• 22 Ph.D. holders or graduate students interviewed
– 13 men
– 9 women
• Novices / Experts
– 19 experts
– 3 novices
• Interviewees who were curators, or professors who also held a curatorial role = 6
11. Data Documentation Practices
“I use an Excel spreadsheet…which I…inherited from my research advisers. …my dissertation advisor was still recording data for each specimen on paper when I was in graduate school so that's what I started…then quickly, I was like, ‘This is ridiculous.’…I just started using an Excel spreadsheet that has sort of slowly gotten bigger and bigger over time with more variables or columns…I've added…color coding…I also use…a very sort of primitive numerical coding system, again, that I inherited from my research advisers…So, this little book that goes with me of codes which is sort of odd, but…we all know that a 14 is a sheep.” (CCU13)
12. Data Documentation Practices
A long way to go before we get usable, intelligible data
14. Thousand Flowers
● Clean up and document contributed data
● Map to ArchaeoML (general ontology)
● Mint URIs for entities (potsherds, projects, contexts, people)
● Link to important vocabularies / collections (Pleiades, Encyclopedia of Life)
● Working on CIDOC-CRM (RDF) representations (not straightforward)
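The “mint URIs” step above can be sketched as follows. The base URL pattern and the record key are hypothetical (Open Context's actual URI scheme may differ); the point is only that each entity receives exactly one stable, opaque identifier, even across repeated imports:

```python
import uuid

# Hypothetical base URL; Open Context's real URI patterns may differ.
BASE = "http://opencontext.org/subjects/"

def mint_uri(existing, record_key):
    """Assign a stable, opaque URI to a record exactly once.

    `existing` maps record keys (e.g. a project-local find number)
    to already-minted URIs, so re-running an import never re-mints.
    """
    if record_key not in existing:
        existing[record_key] = BASE + str(uuid.uuid4())
    return existing[record_key]

minted = {}
uri1 = mint_uri(minted, "potsherd-0042")
uri2 = mint_uri(minted, "potsherd-0042")  # same key -> same URI
```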
16. Open Context: Record
● XHTML + RDFa (Dublin Core, Open Annotation, etc.)
● XML (ArchaeoML)
● Atom
● RDF (draft CIDOC)
● Link to GitHub versioned file
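As a rough illustration of the XHTML + RDFa output, a record can be rendered with Dublin Core terms embedded as RDFa attributes. This is a sketch only, not Open Context's actual markup; the URI and field names are made up:

```python
from xml.sax.saxutils import escape, quoteattr

def record_as_rdfa(uri, title, creator):
    """Render a record as a minimal XHTML+RDFa fragment using
    Dublin Core terms (illustrative; not Open Context's markup)."""
    return (
        '<div xmlns:dc="http://purl.org/dc/terms/" about=%s>\n'
        '  <h1 property="dc:title">%s</h1>\n'
        '  <p property="dc:creator">%s</p>\n'
        '</div>'
    ) % (quoteattr(uri), escape(title), escape(creator))

html = record_as_rdfa("http://example.org/subjects/1",
                      "Potsherd 42", "E. Kansa")
```

Embedding the terms as attributes lets the same page serve human readers and Linked Data crawlers, which is the motivation for the XHTML + RDFa format listed above.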
23. Publishing
Data Quality and Standards Alignment
(1) Check consistency
(2) Edit functions
(3) Align to common standards (“Linked Data” if applicable)
(4) Issue tracking, version control
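Step (1), consistency checking, can be sketched as validating coded values against a contributor's code book. The code book below is hypothetical, echoing the “14 is a sheep” example from the interviews:

```python
# Hypothetical contributor code book (cf. "we all know that a 14
# is a sheep" from interviewee CCU13).
CODE_BOOK = {14: "sheep", 15: "goat", 16: "cattle"}

def check_codes(rows, column):
    """Return (row_index, value) pairs whose code is not in the book,
    flagging them for editorial follow-up."""
    problems = []
    for i, row in enumerate(rows):
        if row.get(column) not in CODE_BOOK:
            problems.append((i, row.get(column)))
    return problems

rows = [{"taxon_code": 14}, {"taxon_code": 99}, {"taxon_code": 16}]
bad = check_codes(rows, "taxon_code")
```

Flagged rows would then feed into step (4): each becomes a ticket in the issue tracker rather than a silent in-place fix.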
24. Publishing
Tools of the Trade
(1) Google Refine (check, edit, consistency)
(2) Mantis (issue tracker; coordinate edits, metadata creation)
25. Publishing
Tools of the Trade
(1) Domain scientists (Editorial Board) check data
(2) Iterative “coproduction” between contributors and editors
27. Web of Data (2011)
Main Contributors:
● Institutions (esp. government)
● Thematic collections / projects
28. Publishing
Entity Reconciliation
(1) With Google Refine
(2) Implemented: EOL and Pleiades (gazetteer)
(3) Use existing mappings to improve future reconciliation
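A minimal sketch of the reconciliation idea, assuming a toy gazetteer with example.org URIs (real work targets Pleiades for places and EOL for taxa, typically via Google Refine's reconciliation services). Note how confirmed matches are cached so prior editorial decisions improve later runs, as in step (3):

```python
import difflib

# Toy gazetteer of name -> URI; the names and URIs are invented.
GAZETTEER = {
    "Petra": "https://example.org/places/petra",
    "Gordion": "https://example.org/places/gordion",
}

def reconcile(name, known_mappings, cutoff=0.8):
    """Match a local place name to a gazetteer URI by fuzzy matching.

    `known_mappings` holds previously confirmed matches, so earlier
    editorial decisions carry forward to future reconciliation."""
    if name in known_mappings:
        return known_mappings[name]
    hits = difflib.get_close_matches(name, list(GAZETTEER), n=1,
                                     cutoff=cutoff)
    if hits:
        known_mappings[name] = GAZETTEER[hits[0]]
        return known_mappings[name]
    return None  # no close match: leave for a human editor

known = {}
match = reconcile("Petr", known)   # misspelled local name
```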
29. ● CDL Archiving Service
● EZID for persistent identifiers: DOIs (aggregate resources), ARKs (granular resources), and the Merritt Repository
● Helps build trust in the community
30. CDL as Infrastructure
● Platform / Services disciplinary communities can use for “Data Publishing”
● Different communities work out semantic/interoperability needs, editorial policies, incentives, etc.
[Diagram: University of California (System) Repository, all disciplines (UC-funded library, grants)]
31. CDL as Infrastructure
[Same diagram as slide 30, with “Future data publisher” boxes added]
35. Summary
Outcomes of Publishing Data:
(1) Communicate and set expectations about content and quality
(2) Organize workflows to improve data quality and usability
(3) Make “datasets” first-class citizens in the world of scholarly communications
36. Final Thoughts
Publication needs to evolve!
(1) Participating in Linked Data is a great goal, but far removed from most everyday practice
(2) Researchers need help
(3) 19th-century publication norms are poorly suited to 21st-century methods, research, and public goals