Humanities Networked Infrastructure (HuNI)



A report on the progress of the Humanities Networked Infrastructure (HuNI) project, presented at the 2013 Digital Humanities conference held in Lincoln, Nebraska.

Published in: Technology, Education
  • Components of the CIDOC-CRM, FOAF and FRBR-OO ontologies have been reused for the integration of the initial datasets. They provide a means to encode people, their existence (birth and death events), their occupations and their associations with organisations. More components have been added to record two further events (creation and production) and to record works and expressions. Work is underway to plug in SKOS and structure the vocabularies, using the data supplied (in EAC-type schemas) to manage the range of terminology, e.g. recreational, vocational, professional and occupational. This draft is based on a portion of the data analysed and a "mud map" (based on an assessment of the data available through web interfaces). See the draft as a line diagram. A view of the ontology generated in the tool Protégé reveals FRBR-OO as an extension of CIDOC-CRM.
    Draft v0.3 using Initial Datasets: Limitations have been found in using FOAF to handle personal names, which are culturally situated. The CIDOC component E41_Appellation and its subclasses will now be used. Collections are being dealt with, and further events are being added, e.g. E87_Curation_Activity to reflect actions of selection and collection development. Under discussion are: the inclusion of E90_Symbolic_Object to deal with citations (which are not feasible to strip apart and process but provide useful contextual information for an entity); the creation of "Floruit" as a time-related entity for E21_Person and E74_Group; categorising the datasets and collections as E89_Propositional_Object; and F3_Manifestation_Product_Type to deal with the disambiguation of portable and web formats of works.
  • This section of the HuNI ontology shows the "joins" and class relationships where the CIDOC-CRM and FRBR-OO ontologies align. The yellow-green bubbles represent the CIDOC entities and the red bubbles represent the FRBR entities. A bidirectional arrow indicates a "sameAs" relationship; a unidirectional arrow indicates a sub-class relationship.
  • The integration of partner data into HuNI requires two technical components:
    1. Live data feeds (at partner sites). Three technology options are available for the partners to publish their data as XML: jOAI, OAIcat and, for those who are not exposing their data via the OAI-PMH harvesting protocol, a custom-built solution that requires very little work to integrate at a provider's site. We are not harvesting all the data; we are only harvesting the primary entity classes (and as much of the uniquely identifying information as possible for each class) that are common "touch points" across many of the partner data sites: people, places, events and objects. Therefore, the lowest common denominator for making the partner data harvestable is a flat XML file per class entity, together with the uniquely identifying information. For example, for the person class entity, uniquely identifying information will include first name, last name, date of birth/death, bio and occupation.
    2. A data gateway component called Corbicula. This technology is being deployed to harvest updated content from the partner XML data feeds and transform the data into forms suitable for ingestion into:
       • A Solr search server: this aggregation of harvested XML records is referred to as 'HuNI Data'.
       • A Jena RDF Triple Store: this aggregation of stored RDF graphs is referred to as 'HuNI Linked Data'.
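To make the "flat XML file per class entity" idea concrete, here is a minimal sketch of reading such a person feed with Python's standard library. The element names (`person`, `firstName`, `birthDate` and so on) and the sample record are assumptions made for illustration; they are not the actual schema of any HuNI partner feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat person feed; element names are illustrative only.
SAMPLE_FEED = """\
<people>
  <person id="example-1">
    <firstName>Jane</firstName>
    <lastName>Citizen</lastName>
    <birthDate>1901</birthDate>
    <deathDate>1980</deathDate>
    <occupation>Playwright</occupation>
    <bio>Illustrative biography text.</bio>
  </person>
</people>
"""

# The uniquely identifying fields listed for the person class entity.
PERSON_FIELDS = ("firstName", "lastName", "birthDate", "deathDate", "occupation", "bio")


def parse_people(xml_text):
    """Return one dict per <person> element, keeping only the identifying fields."""
    root = ET.fromstring(xml_text)
    records = []
    for person in root.iter("person"):
        record = {"id": person.get("id")}
        for field in PERSON_FIELDS:
            # findtext returns None when the element is absent from the record.
            record[field] = person.findtext(field)
        records.append(record)
    return records


people = parse_people(SAMPLE_FEED)
```

Keeping each record as a flat dictionary mirrors the lowest-common-denominator approach: a partner only has to supply a handful of identifying fields per entity.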
  • Based on the data architecture set out in the original RFP, there is a requirement to harvest, transform and ingest data from each of the partner datasets into some sort of Linked Data store. Very early in the technical decision-making process it was agreed that RDF (Resource Description Framework), a metadata modelling specification, would be the lingua franca, and that all the technical components would be developed to work with this Linked Data specification. So we began by:
    • Making some of the partner datasets harvestable to HuNI, by developing a harvest feed for those data providers who were technically able to publish their data in a standard export format/schema (EAC-CPF).
    • Constructing the HuNI ontology and mapping partner data to this common data model. A number of standard cultural heritage ontologies were selected for examination because of their perceived close semantic fit to the nature and types of data in each of the 28 data sources: CIDOC-CRM, FOAF, FRBR-OO and PROV-O.
    • Deploying a data gateway component, called Corbicula, on the NeCTAR Cloud, which harvests and transforms the updated XML data from the partner feeds and ingests it into the RDF Triple Store. Once the mappings for a given data source are known, XSLT scripts are written to interpret the XML records and re-express (transform) them as RDF graphs, which essentially capture the relationships/links between records from all integrated datasets.
But the integration into RDF has proven to be both semantically and technically complex, because:
    • The publishing format necessary to allow us to do the mappings is too high a technical barrier for most data custodians.
    • The data analysis and mapping to a common data model is proving time-consuming and complex.
    • The gateway component that harvests and transforms the data into RDF using XSLT has performance and memory issues.
    • The SPARQL-based search interface development, where people can search and query the graphs, was proving too slow.
As a result, after 10 months of development, only 6 partner data sources have completed their integration journey into the RDF Triple Store, and the search UI does not perform well. So back in May it was flagged that there is a real project risk that we will not be able to fully transform all the partner data into Linked Data, and that only a small subset of partner datasets will be barely discoverable through the lab. This was a real problem, given that the main objective of HuNI is to provide a co… So the decision was made to ex…
You're probably wondering why we have two data aggregates, and why we mixed the data architectures. It was purely a project risk management decision: harvesting, mapping, transforming and ingesting into Linked Data is complex and time-consuming, and there is a real danger that we won't have a sufficient Linked Data layer on which to build the lab. So, in order to deliver some cross-dataset search capability within the project timeframe, we introduced a new development strand which sees the accelerated harvesting and integration of data into the Solr aggregate. The decision has been made to continue populating the RDF store with partner data for the remainder of 2013, and to work on the UI in 2014.
Populating the Solr search server is easy: HuNI periodically harvests the updated XML records from the partner feeds, processes the XML content via a suitable transform, and submits the transformed XML data to the Solr search server. The transformation of partner XML records into HuNI Linked Data is complex and time-consuming, and we've faced a number of technical issues, which isn't surprising since we're using a combination of largely unproven technologies on the scale required for the HuNI deployment. First, the harvested data had to be cleaned and mapped to a core HuNI ontology. A range of cultural heritage ontologies were examined as the starting point for building this core ontology framework. This has been an iterative process, determined by the nature of each data source and by the main types of data found in each source. The following standard ontologies are being aligned to create the HuNI Ontology:
    • People and Organisations (using the CIDOC-CRM and FOAF ontologies)
    • Items, Collections and Resources (using the PROV-O, CIDOC-CRM, FOAF and FRBR-OO ontologies)
    • Events and Relations (using the PROV-O, CIDOC-CRM, FOAF and FRBR-OO ontologies)
    • Place and Subject (using the PROV-O, CIDOC-CRM, FOAF and FRBR-OO ontologies)
Once the mappings to a common data model are known, the data needs to be technically transformed and ingested. This is made possible through the HuNI gateway component called Corbicula, which performs the following steps:
    1. Periodically harvests updated XML records from the source provider feeds.
    2. Uses XSLT to interpret the XML records and re-express (transform) them as RDF graphs.
    3. Stores the RDF graphs.
The search feature needs to be based on the linked data, to take advantage of the semantic integration provided by the RDF aggregation.
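The transform step (record in, RDF graph out) can be sketched roughly as follows. Note that Corbicula uses XSLT for this; the sketch swaps in plain Python purely for illustration, and the `http://example.org/huni/` base URI, the record layout and the choice of triples are assumptions rather than the project's actual mapping. The CIDOC-CRM class and property names are taken from the published CIDOC-CRM vocabulary.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
CRM = "http://www.cidoc-crm.org/cidoc-crm/"
BASE = "http://example.org/huni/"  # hypothetical base URI for illustration


def uri(value):
    """Format an absolute URI as an N-Triples term."""
    return "<%s>" % value


def literal(value):
    """Format a plain string literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"%s"' % escaped


def person_to_ntriples(record):
    """Re-express one harvested person record as a list of N-Triples lines."""
    person = uri(BASE + "person/" + record["id"])
    full_name = record["firstName"] + " " + record["lastName"]
    triples = [
        (person, uri(RDF_TYPE), uri(CRM + "E21_Person")),
        (person, uri(RDFS_LABEL), literal(full_name)),
    ]
    if record.get("birthDate"):
        # Model the birth as an event with its own time-span node.
        birth = uri(BASE + "birth/" + record["id"])
        span = uri(BASE + "timespan/birth/" + record["id"])
        triples += [
            (birth, uri(RDF_TYPE), uri(CRM + "E67_Birth")),
            (person, uri(CRM + "P98i_was_born"), birth),
            (span, uri(RDF_TYPE), uri(CRM + "E52_Time-Span")),
            (birth, uri(CRM + "P4_has_time-span"), span),
            (span, uri(RDFS_LABEL), literal(record["birthDate"])),
        ]
    return ["%s %s %s ." % triple for triple in triples]


sample = {"id": "example-1", "firstName": "Jane", "lastName": "Citizen", "birthDate": "1901"}
ntriples = person_to_ntriples(sample)
```

Even this small example turns one flat record into seven statements, which hints at why mapping every partner dataset by hand is time-consuming.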
  • But of course this is a virtual laboratory (VL) project and not a data integration project.
  • Support the non-linear research methods practiced by humanities researchers. HuNI is about inclusivity and not exclusivity, so it uses third-party authentication for login. For a community to form around HuNI, its user base needs to extend beyond scholarly researchers. It is also worth noting that any member of the general public interested in Australian culture, not just scholarly researchers, can run a search across the related databases (the HuNI Data) and share their search results online. There are discovery limitations: whilst the context is given for each record found, the known relationships between related records across the disparate data sources are not available, so we're currently working on a 'Social Linked Data' feature.
  • Equipped with a full set of known facets and related data fields for each record type, researchers should be able to interact with, and construct complex queries of, the large-scale aggregation of Linked Data.
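As a rough illustration of what a "complex query" over linked records involves, the sketch below matches triple patterns against a tiny in-memory set of triples, which is the idea underlying a SPARQL basic graph pattern. It is not a SPARQL engine, and the identifiers (`person:1`, `performedIn`, `tookPlaceAt`) are made up for the example rather than drawn from the HuNI ontology.

```python
def match(triples, pattern):
    """Yield variable bindings for one (s, p, o) pattern; terms starting with '?' are variables."""
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable repeated within a pattern must bind to the same value.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield binding


def query(triples, patterns):
    """Join the patterns left to right and return every consistent set of bindings."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for solution in solutions:
            # Substitute variables that earlier patterns have already bound.
            bound = tuple(solution.get(term, term) for term in pattern)
            for binding in match(triples, bound):
                merged = dict(solution)
                merged.update(binding)
                next_solutions.append(merged)
        solutions = next_solutions
    return solutions


# Hypothetical records, as if drawn from two different partner datasets.
TRIPLES = [
    ("person:1", "performedIn", "event:9"),
    ("event:9", "tookPlaceAt", "place:3"),
    ("person:2", "performedIn", "event:10"),
    ("event:10", "tookPlaceAt", "place:4"),
]

# "Which people performed in an event that took place at place:3?"
results = query(TRIPLES, [
    ("?person", "performedIn", "?event"),
    ("?event", "tookPlaceAt", "place:3"),
])
```

The join across the shared `?event` variable is what lets a single question span records that originated in separate sources.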
  • Link will be made available on soon
  • The lab is being designed to support the non-linear research methods practiced in the humanities and creative arts, and will support a workflow centred around discovery, analysis and sharing. As part of the discovery interface a researcher will be able to:
    • Run a free text search across the aggregate and display their results.
    • Perform an advanced faceted browse of the aggregate by filtering their results by dataset and entity classes defined in the ontology: people, works, events, organisation, occupation/role, time, place, collections, language, objects.
    • Narrow their search parameters at the start of their search by browsing for information within pre-defined access points. These are likely to be people, works and events since these entity classes are representative across all 28 data sources. Following the initial browse, the user can then filter their search results by dataset and the remaining entity classes.
    • Run a SPARQL query to interrogate the underlying Linked Data.
    The discovery interface is also going to enable serendipitous discovery (i.e. the ability to present information to users before they know what they want to search for):
    • You might also be interested in... (based on the semantic relationships captured in the ontology)
    The notion of a generous interface is being included (based on some pre-defined daily query feeds), to give the researcher a sense of what is discoverable:
    • On this day…
    • Most popular searches
    • Most popular records
    The result sets will be displayed in a number of forms, with list being the default and map and timeline being optional. All search results will be displayed with hyperlinks that allow navigation to the source entity and will show the connections between records as per the ontology mappings.
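A faceted browse of the Solr aggregate might be requested along these lines. The sketch only builds the request URL, using Solr's standard `q`, `fq`, `facet` and `facet.field` parameters; the host, the core name `huni` and the field names `dataset` and `entity_type` are assumptions for illustration, not the lab's actual configuration.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; the core name "huni" is an assumption.
SOLR_SELECT = "http://localhost:8983/solr/huni/select"


def build_search_url(text, filters=None, facet_fields=("dataset", "entity_type"), rows=20):
    """Build a Solr select URL for a free text search with facet counts and filters."""
    params = [("q", text), ("rows", str(rows)), ("wt", "json"), ("facet", "true")]
    for field in facet_fields:
        # Ask Solr to return value counts for each facet field.
        params.append(("facet.field", field))
    for field, value in sorted((filters or {}).items()):
        # Each filter query narrows the result set without affecting scoring.
        params.append(("fq", '%s:"%s"' % (field, value)))
    return SOLR_SELECT + "?" + urlencode(params)


url = build_search_url("circus", {"dataset": "AusStage", "entity_type": "person"})
```

Starting with no filters gives the free text search; adding entries to `filters` corresponds to the user narrowing results by dataset and entity class.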
  • The LORE Tool (developed at UQ) will be made available in the lab, where researchers will be able to:
    • Display existing connections between relevant records held within their virtual collection, and
    • Add further links between particular records, with commentary describing the relationship between them.
  • Researchers will have the option to export their Virtual Collection as a .csv file so they can undertake further computational analysis outside of the HuNI lab and within their preferred tool environment. Whilst the lab will include a Tool Integration Framework specifying how third party tools can integrate within the lab and work with HuNI data, we recognize that tools come and go, and that researchers create their own relationship with their tools of choice. So offering an export function is crucial.
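An export of this kind could be as simple as the following sketch, which writes a collection of records to CSV text with Python's standard `csv` module. The column names and sample records are hypothetical; the actual export layout is not specified here.

```python
import csv
import io


def export_collection(records, fieldnames):
    """Serialise a virtual collection (a list of dicts) as CSV text with a header row."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops keys that are not among the exported columns.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buffer.getvalue()


# Hypothetical collection; the columns are illustrative only.
collection = [
    {"id": "example-1", "dataset": "AusStage", "label": "Citizen, Jane", "note": "not exported"},
    {"id": "example-2", "dataset": "ADB", "label": "Smith"},
]
csv_text = export_collection(collection, ["id", "dataset", "label"])
```

Because CSV opens in spreadsheets and statistical packages alike, it keeps the export independent of any one analysis tool.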
  • Researchers will have the option to share their virtual collection, and their analysis findings, via FB, twitter and email with other researchers
  • The development of HuNI is being managed as a project:
    • Has a collaborative governance structure in place so that all key project decisions are only made as part of a consultative process.
    • Using the PRINCE2 methodology to help manage the project.
    • Question of consortial project management…
    • Need to create best practice exemplars at the project management level…
    • Staff in 4 states. Communication is via Skype or Google Hangout. Issues around discomfort with these communication technologies. Etc.
  • Humanities Networked Infrastructure (HuNI)

    2. NATIONAL E-RESEARCH COLLABORATION TOOLS AND RESOURCES (NeCTAR): NeCTAR is a $47 million Australian Government project, conducted as part of the Super Science initiative and financed by the Education Investment Fund. The University of Melbourne is the lead agent, chosen by the Commonwealth Government. (CRICOS Provider Code: 00113B)
    4. HuNI BROAD BENEFITS • Ensure that Australian cultural datasets and the research associated with them become part of the emerging international Linked Open Data environment. • Enable research enquiries to move easily from: what is? to where is? • Support the role of annotation and metadata in discovery of new knowledge or the means to elucidate new knowledge • Position the idea of data as both a subject and object of analysis in humanities • Contribute to debates around standards for development and implementation
    5. HUNI: SPECIFIC BENEFITS • Enable humanities researchers to work with cultural datasets more efficiently and effectively, and on a larger scale; • Encourage the systematic sharing of research data between humanities researchers (including the cultural dataset curators themselves), the community and cultural institutions; • Encourage a greater level of cross-disciplinary and interdisciplinary research, both within the humanities/creative arts and between the humanities/creative arts and other disciplines, and the wider public; • Support innovative methodologies such as network analysis, game theory and 'virtual history' that rely on large-scale datasets
    6. INTEROPERABILITY 1. Organisational level: the goals and processes of the institutions involved 2. The semantic level: meaning of the exchanged digital resources 3. Technical level: implementing data interoperability requires both data integration and data exchange processes as well as enabling effective use of the data that becomes available (Pasquale Pagano, 'Data Interoperability', GRDI2020) 4. Project level: the advent of more complex 'big humanities' projects requires multiple and multi-disciplinary personnel, which in turn entails the organization of different workflows and expectations: e.g. the challenge of developing a comprehensive or consortial approach, a common definition of project method etc.
    7. 1. A PARTNERSHIP … a Deakin-led consortium • Cultural data providers (10): project co-operators • Humanities software developer (1): project co-developers • eResearch organisations (2): lead development agencies
    8. HUNI PARTNER DATASETS AMHD MAP CAARP Bonza AFIRC Circus Oz AusStage Media: film, cinema, theatre, newspapers, magazines, advertising, music, live performances DAAO AustLit AWR ADB DoS Biographical: artists, designers, writers, significant people, scientists, Sydney demographics EOAS AUSTLANG Mura Indigenous languages
    9. AUSTLIT
    10. ADB
    11. DAAO
    12. AUSTLANG
    13. bonza
    14. AUSSTAGE
    15. EOAS
    16. TUGG
    17. Welcome to the Cinema and Audiences Research Project (CAARP) database: An online encyclopaedia of cinema-going in Australia. Data: This site contains information on film screenings and venues in Australia. 430,137 screenings; 10,256 films; 1,978 cinemas; 1,649 companies; from 1846 to now
    18. A FISCAL COLLABORATION • NeCTAR investment of $1.33M • Partner contributions of $480,000 • Partner in-kind contributions amounting to >$1M
    19. COMMUNITY BUILDING • Collated user-stories (20) • Online showcase events – next one is 4th September 2013 • Live link to the latest alpha prototype on; feedback buttons • Wider beta launch at eResearch Australasia in October 2013 • Stay up to date through our monthly Newsletter and blog feed • Follow us on twitter - @HuNIVL
    20. 2. INTEGRATING MEANING: Information design challenge to build an ontology and use linked data and controlled vocabularies for data to be aligned and related. • Reading the data. Characteristics of the data determine the ontological components selected and the major "entities" (aka "access points"). • Identified early as: people, organisations, events, relationships, places, dates, resources, and subjects. • Components from ontologies already available and being reused or kept in our sights: CIDOC-CRM, FOAF, FRBR, FRBR-OO, BibFrame and PROV-O.
    21. PHASE ONE
    22. HUNI ONTOLOGY March 2013
    23. HUNI ONTOLOGY (all classes and object properties) [Diagram. The direction of the "has subclass" and Domain>Range arrows is not recoverable from the transcript; the labels shown are listed below.]
        Classes (cidoc): E2TemporalEntity, E4Period, E5Event, E7Activity, E12Production, E15IdentifierAssignment, E18PhysicalThing, E19PhysicalObject, E21Person, E22Man-MadeObject, E24PhysicalManMadeThing, E28ConceptualObject, E35Title, E39Actor, E41Appellation, E42Identifier, E44PlaceAppellation, E48PlaceName, E49TimeAppellation, E52Time-Span, E53Place, E55Type, E56Language, E65Creation, E67Birth, E69Death, E71Man-MadeThing, E73InformationObject, E74Group, E89PropositionalObject
        Classes (frbr): F1Work, F2Expression, F6Concept, F7Object, F8Event, F9Place, F10Person, F11Corporate_Body, F13Identifier, F14Individual_Work, F15Complex_Work, f16Container_Work, F17Aggregation_Work, F18Serial_Work, F19Publication_Work, F20Performance_Work, F21Recording_Work, F21Recording_Event, F22Self-Contained_Expression, F24Publication_Expression, F25Performance_Plan, f25Work_Conception, F26Recording, F28Expression_Creation, F30Publication_Event, F31Performance, F40Identifier_Assignment
        Classes (foaf): Person, Group
        Classes (huni): PrimaryTopic, SKOS.Occupation, SKOS.Role, SKOS.Collection, SKOS.Item
        Root class: Thing
        Object properties (cidoc): P1isIdentifiedBy, P2hasType (also written P2HasType), P4hasTimeSpan, P7tookPlaceAt, P11iparticipatedIn, P14iperformed, P53hasCurrentOrFormerLocation, P72hasLanguage, P94hasCreated, P98iwasBorn, P101idiedIn, P102hasTitle, P107iisCurrentOrFormerMemberOf, P108hasProduced, P148hasComponent
        Object properties (frbr): R12isRealisedIn, R16initiated, R17created, R19createdARealisationOf, R23createdARealisationOf, R24created
        Object properties (huni): timeIsIdentifiedBy, placeIsIdentifiedBy, hasOccupation, hasRole, hasCollection, hasItem
    25. 3. HuNI DATA ARCHITECTURE [Diagram labels. Data integration (Partner side): Data update and publish; ADB, DAAO, CAARP, AFIRC, AusStage. Data integration (HuNI side): Corbicula; Data harvest, transform and ingest; Data analysis and mapping; Solr Search Server [HuNI Data]; RDF Triple Store [HuNI Linked Data]. HuNI Virtual Laboratory: Scholarly researcher workflow tasks; Public and citizen researcher workflow tasks; Admin tasks; Data discovery; Data analysis; Data sharing; Simple search; Advanced search; Deep (SPARQL-based) search; Save search results as private collection; Refine / expand collection; Analyse and annotate collection; Export collection; Share collection and analysis; Share search results; Registration and login; Profile management; History recording; Project management.]
    26. DATA INTEGRATION: A total of 28 Australian datasets are being harvested for integration into HuNI. The harvesting process requires: • Live data feeds deployed at the partner sites to publish updated partner data as XML • Data gateway components, called HuNI Corbicula, deployed on the NeCTAR Cloud to harvest the XML feed data and transform it into forms suitable for ingestion into two HuNI data aggregates: a Solr search server [HuNI Data] and a Jena RDF Triple Store [HuNI Linked Data] [Diagram: the data integration portion of the HuNI data architecture, from partner feeds (ADB, DAAO, CAARP, AFIRC, AusStage) through Corbicula to the two aggregates.]
    27. TWO HUNI DATA AGGREGATES? [Chart: number of partner datasets in each aggregate, on an axis from 0 to 28. Values shown: 24 (Solr aggregate) and 6 (RDF aggregate).]
    28. TECHNOLOGY STACK • front-end frameworks: AngularJS and Twitter Bootstrap single page web app • tools hosting framework: OpenSocial via Apache Shindig • back-end framework: Spring MVC via Roo • layer integration: RESTful web services
    29. RESEARCH ACTIVITIES: A researcher with a HuNI account will be able to: • Search the HuNI Data • Save their search results as a private collection • Refine their collection through additional searches • Analyse and annotate their collection with their own assertions and commentary • Export their collection for further analysis • Publish and share their collection and research [Diagram: the HuNI Virtual Laboratory workflow tasks running against the Solr Search Server [HuNI Data].]
    30. RESEARCH ACTIVITIES 2: Scholarly researchers will also be able to perform a "deep search" of the graphs in the RDF Triple Store. The large-scale aggregation of Linked Data makes explicit the relationships and connections between related records across all the partner datasets, enabling the researcher to construct more complex semantic queries. [Diagram: the HuNI Virtual Laboratory with Deep (SPARQL-based) search running against the RDF Triple Store [HuNI Linked Data].]
    38. 4. THE PROJECT • project director/community liaison (20%) • project manager (100%) • technical coordinator (100%) • information services coordinator (90%) • community engagement (30%) • communication coordinator (20%) • administrative support (20%) • software developer(s) [Governance diagram labels: NeCTAR Directorate; HuNI Steering Committee; Team HuNI; Technical Working Group; Expert Advisory Group; Expert Data Group.]
    39. PROJECT WEBSITE:
    40. PROJECT WIKI:
    41. HuNI: a virtual laboratory for the humanities