Case-Study: Publishing to the
“Web of Data” in Archaeology

      Quality and Workflows



                              Eric Kansa
                UC Berkeley / OpenContext.org



      Unless otherwise indicated, this work is licensed under a Creative Commons
         Attribution 3.0 License <http://creativecommons.org/licenses/by/3.0/>
“Small Science” data sharing is hard:
(1) Complexity
(2) Scalability
(3) Ethics, cultural property claims, IP
(4) Incentives
(5) Preservation

Image Credit: “Grand Canyon NPS” via Flickr (CC-BY)
http://www.flickr.com/photos/grand_canyon_nps/5975537378/
Thousand Flowers

●  Open Context: open access, openly licensed data for archaeology
●  Archiving by the California Digital Library
●  Persistent identifiers (DOIs, ARKs)
●  Web services
●  NSF/NEH links for data management plans
Thousand Flowers

Fills a gap: most data sources are institutional. Open Context publishes individual and small-group contributions.

Challenge: diverse contributions need a lot of clean-up work to “link” to the Web of Data.
•  3-year project, Oct 2010 – Sep 2013
•  Funded with a National Leadership Grant from the Institute of Museum and Library Services, LG-06-10-0140-10, “Dissemination Information Packages for Information Reuse”
•  Ixchel Faniel, PI & Elizabeth Yakel, Co-PI

http://www.dipir.org
DIPIR Collaboration
The Big DIPIR Questions

Research Questions
1. What are the significant properties of data that facilitate reuse by the designated communities at the three sites?
2. How can these significant properties be expressed as representation information to ensure the preservation of meaning and enable data reuse?
Open Context Interviewees

•  22 Ph.D. or graduate students interviewed
   –  13 men
   –  9 women
•  Novices / experts
   –  19 experts
   –  3 novices
•  Interviewees who were curators, or professors who also had a curatorial role: 6
Raw Data is Unappetizing?
Data Documentation Practices

“I use an Excel spreadsheet…which I…inherited from my research advisers. …my dissertation advisor was still recording data for each specimen on paper when I was in graduate school so that's what I started…then quickly, I was like, ‘This is ridiculous.’…I just started using an Excel spreadsheet that has sort of slowly gotten bigger and bigger over time with more variables or columns…I've added…color coding…I also use…a very sort of primitive numerical coding system, again, that I inherited from my research advisers…So, this little book that goes with me of codes which is sort of odd, but…we all know that a 14 is a sheep.” (CCU13)
A long way to go before we get usable, intelligible data.
Sometimes data is better
served cooked.
Thousand Flowers

●  Clean up and document contributed data
●  Map to ArchaeoML (a general ontology)
●  Mint URIs for entities (potsherds, projects, contexts, people)
●  Link to important vocabularies / collections (Pleiades, Encyclopedia of Life); see the sketch after this list
●  Working on CIDOC-CRM (RDF) representations (not straightforward)
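
As a rough illustration of the minting and linking steps above, the following minimal sketch (Python with rdflib, not Open Context's actual code) builds an RDF description of a single record, assigns it a made-up URI, and links it to placeholder Pleiades and Encyclopedia of Life URIs. Every identifier shown is a hypothetical example.

    # Minimal sketch (not Open Context's code): mint a URI for one record and
    # link it to shared reference resources. All URIs are hypothetical.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import DCTERMS, RDF

    OC = Namespace("http://example.org/opencontext/")   # placeholder base URI
    g = Graph()

    # "Mint" a URI for a single record (here, one animal-bone specimen)
    bone = OC["subjects/specimen-0001"]
    g.add((bone, RDF.type, OC["types/AnimalBone"]))
    g.add((bone, DCTERMS.title, Literal("Specimen 0001, sheep (code 14)")))

    # Link to a Pleiades place and an Encyclopedia of Life taxon page
    # (both IDs below are placeholders, not real identifiers)
    g.add((bone, DCTERMS.spatial, URIRef("https://pleiades.stoa.org/places/000000")))
    g.add((bone, DCTERMS.subject, URIRef("https://eol.org/pages/000000")))

    print(g.serialize(format="turtle"))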
Open Context: Record

●  XHTML + RDFa (Dublin Core, Open Annotation, etc.)
●  XML (ArchaeoML)
●  Atom
●  RDF (draft CIDOC-CRM)
●  Link to a GitHub-versioned file
(See the content-negotiation sketch below.)
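
The slide does not say how these representations are served. As a purely hypothetical sketch, the snippet below requests alternate representations of a record URI via HTTP content negotiation; the URL and the assumption that each format is reachable this way are illustrative, not a statement about Open Context's actual API.

    # Hypothetical sketch: fetch alternate representations of a record.
    # The URL is a placeholder; Open Context's real URL patterns and media
    # types may differ from what is assumed here.
    import requests

    RECORD_URI = "http://example.org/opencontext/subjects/specimen-0001"

    for media_type in ("application/xhtml+xml",   # XHTML + RDFa
                       "application/atom+xml",    # Atom
                       "application/rdf+xml"):    # RDF (draft CIDOC-CRM)
        resp = requests.get(RECORD_URI, headers={"Accept": media_type}, timeout=30)
        print(media_type, resp.status_code, resp.headers.get("Content-Type"))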
Open Context: Visualization of Data Linked to the EOL
My Precious Data




Image Credit: “Lord of the Rings” (2003, New Line), copyright, all rights reserved
Data sharing as publication
Data Publishing
Publishing

Data Quality and Standards Alignment
(1) Check consistency (see the consistency-check sketch below)
(2) Edit functions
(3) Align to common standards (“Linked Data” if applicable)
(4) Issue tracking, version control
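
A minimal consistency-check sketch, assuming a contributed spreadsheet with a "taxon" column and a small controlled vocabulary; the file name, column names, and vocabulary are invented for illustration and are not Open Context's actual checks.

    # Minimal sketch of a consistency check on a contributed spreadsheet.
    # File name, column names, and vocabulary are illustrative assumptions.
    import pandas as pd

    ALLOWED_TAXA = {"Ovis aries", "Capra hircus", "Bos taurus"}  # example vocabulary

    df = pd.read_csv("contributed_fauna.csv")   # placeholder file name

    # Flag rows whose taxon value is missing or falls outside the vocabulary
    problems = df[df["taxon"].isna() | ~df["taxon"].isin(ALLOWED_TAXA)]
    print(f"{len(problems)} of {len(df)} rows need editorial review")
    print(problems[["specimen_id", "taxon"]].head())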
Publishing

Tools of the Trade

(1) Google Refine (checking, editing, consistency); see the clustering sketch below
(2) Mantis (issue tracker; coordinate edits, metadata creation)
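
As a rough illustration of the kind of consistency work Google Refine supports, the sketch below groups near-duplicate labels using a fingerprint-style key (lowercase, strip punctuation, sort unique tokens). It mimics the spirit of Refine's key-collision clustering; it is not Refine's own code.

    # Rough illustration of fingerprint-style clustering, similar in spirit to
    # Google Refine's key-collision method (not Refine's own code).
    import re
    from collections import defaultdict

    def fingerprint(value):
        """Lowercase, strip punctuation, and sort the unique tokens."""
        tokens = re.sub(r"[^\w\s]", " ", value.lower()).split()
        return " ".join(sorted(set(tokens)))

    labels = ["Sheep/Goat", "sheep goat", "Goat, sheep", "Cattle", "cattle "]

    clusters = defaultdict(list)
    for label in labels:
        clusters[fingerprint(label)].append(label)

    for key, members in clusters.items():
        if len(members) > 1:
            print(f"Possible duplicates for '{key}': {members}")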
Publishing

Tools of the Trade

(1) Domain scientists (Editorial Board) check data
(2) Iterative “coproduction” between contributors and editors
Publishing

Project Metadata
Column Descriptions
(An illustrative example follows.)
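
The original slide shows screenshots of project metadata and column descriptions that are not reproduced here. As a stand-in, the sketch below records column descriptions as a simple data dictionary; all field names and wording are invented for illustration.

    # Invented example of a simple data dictionary ("column descriptions")
    # that could accompany a published dataset; all text is illustrative.
    import csv

    column_descriptions = [
        {"column": "specimen_id", "description": "Unique identifier for each bone specimen"},
        {"column": "taxon", "description": "Animal taxon, from the project's controlled vocabulary"},
        {"column": "context", "description": "Excavation context (locus) where the specimen was found"},
    ]

    with open("column_descriptions.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["column", "description"])
        writer.writeheader()
        writer.writerows(column_descriptions)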
Web of Data (2011)

Main Contributors:
●  Institutions (esp. government)
●  Thematic collections / projects
Publishing

Entity Reconciliation

(1) With Google Refine
(2) Implemented for EOL and Pleiades (gazetteer)
(3) Use existing mappings to improve future reconciliation (see the sketch below)
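
A simplified sketch of the reconciliation idea: reuse previously confirmed mappings first, then fall back to fuzzy matching against a gazetteer. The place names and Pleiades-style URIs below are placeholders, and this is not Open Context's or Refine's actual reconciliation code.

    # Simplified reconciliation sketch: reuse confirmed mappings, then fuzzy-
    # match against a small gazetteer. URIs and names are placeholders.
    import difflib

    GAZETTEER = {
        "Petra": "https://pleiades.stoa.org/places/000001",    # placeholder URI
        "Ephesus": "https://pleiades.stoa.org/places/000002",  # placeholder URI
    }
    CONFIRMED = {"Efes": "https://pleiades.stoa.org/places/000002"}  # prior mapping

    def reconcile(name):
        if name in CONFIRMED:                      # reuse an existing mapping
            return CONFIRMED[name]
        match = difflib.get_close_matches(name, list(GAZETTEER), n=1, cutoff=0.8)
        return GAZETTEER[match[0]] if match else None

    for raw in ["Petra", "Ephesos", "Efes", "Unknown Site"]:
        print(raw, "->", reconcile(raw))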
●  CDL Archiving Service
●  EZID for persistent identifiers: DOIs (aggregate resources), ARKs (granular resources), and the Merritt repository (see the EZID sketch below)
●  Helps build trust in the community
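
A hedged sketch of how an ARK might be minted through EZID's REST API. The shoulder, credentials, and metadata values below are placeholders; consult the EZID API documentation for the exact, current request format.

    # Hedged sketch of minting an ARK via EZID's REST API. The shoulder,
    # credentials, and metadata are placeholders; check the EZID API docs
    # for the exact, current request format before using this.
    import requests

    SHOULDER_URL = "https://ezid.cdlib.org/shoulder/ark:/99999/fk4"  # placeholder shoulder
    metadata = (
        "erc.who: Example Project Contributors\n"
        "erc.what: Example faunal dataset record\n"
        "erc.when: 2012\n"
    )

    resp = requests.post(
        SHOULDER_URL,
        data=metadata.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=UTF-8"},
        auth=("example_user", "example_password"),   # placeholder credentials
        timeout=30,
    )
    print(resp.status_code, resp.text)   # on success EZID returns the new ARK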
CDL as Infrastructure

●  Platform / services that disciplinary communities can use for “Data Publishing”
●  Different communities work out semantic/interoperability needs, editorial policies, incentives, etc.

(Diagram: University of California system-wide repository, all disciplines, UC-funded via the library and grants.)
CDL as Infrastructure

(Diagram: the same platform, with space for future data publishers alongside the University of California system-wide repository.)
eScholarship: UC’s OA Publishing Platform
Platform for traditional publishing
Also supports new genres
Summary

Outcomes of Publishing Data:
(1) Communicate and set expectations about content and quality
(2) Organize workflows to improve data quality and usability
(3) Make “datasets” first-class citizens in the world of scholarly communications
Final Thoughts

Publication needs to evolve!

(1) Participating in Linked Data is a great goal, but far removed from most everyday practice
(2) Researchers need help
(3) 19th-century publication norms are poorly suited to 21st-century methods, research, and public goals