From SubjectA to ThGenius: the Semantic Web searching
29th ADLUG Annual Meeting 2010
Centro Congressi Panorama – Trento, Provincia Autonoma di Trento
22-24 September 2010
RDF and Open Linked Data, a first approach (part II)
The library catalogue (as traditional catalogue or as OPAC) has been the only context for library data since its inception.
The purpose of the library catalogue:
Identifying the library's holdings
Supporting management of those holdings
Providing entry and discovery points for librarian and non-librarian users
Are the efforts of librarians in creating and maintaining the catalog rewarded by users? For various reasons, users favor the Web as an information platform over the library.
The question for librarians and vendors has to be: how can we strengthen the connection between libraries and users?
The question that we must face, and that we must face sooner rather than later, is how we can best transform our data so that it can become part of the dominant information environment that is the Web.
The Web is increasingly the source of information for searchers and researchers, and the library needs to be interconnected with that web of data.
The library catalog data must be transformed from the current 'textual description' into a set of data elements to which machine processes can be applied.
These data elements must be compatible with the current technology, that is, the World Wide Web.
This process is what we can define as the evolution from 'library catalog' to Semantic Web.
For us as a vendor, this process means the evolution from SubjectA to ThGenius.
“A SPARQL endpoint enables users (human or other) to query a knowledge base via the SPARQL language. Results are typically returned in one or more machine-processable formats. Therefore, a SPARQL endpoint is mostly conceived as a machine-friendly interface towards a knowledge base.”
“Both the formulation of the queries and the human-readable presentation of the results should typically be implemented by the calling software, and not be done manually by human users.”
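As a minimal sketch of what "calling software" might do, the following Python code builds a SPARQL query for a SKOS concept, encodes it as a GET request, and flattens a result document in the standard SPARQL JSON results format. The endpoint URL, the function names, and the sample response are assumptions made for illustration; they do not describe the actual ThGenius interface, and no network call is made.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint URL, used only for illustration.
ENDPOINT = "http://example.org/sparql"


def build_query(label):
    """Return a SPARQL query for SKOS concepts whose preferred label equals `label`."""
    # Escape backslashes and double quotes so the label is a valid string literal.
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
        "SELECT ?concept ?broader WHERE {\n"
        "  ?concept skos:prefLabel ?label .\n"
        f'  FILTER(STR(?label) = "{escaped}")\n'
        "  OPTIONAL { ?concept skos:broader ?broader }\n"
        "}"
    )


def build_request_url(query):
    """Encode the query as a GET request with a `query` parameter."""
    return ENDPOINT + "?" + urlencode({"query": query})


def parse_bindings(payload):
    """Flatten a SPARQL JSON results document into a list of plain dictionaries."""
    data = json.loads(payload)
    return [
        {name: cell["value"] for name, cell in row.items()}
        for row in data["results"]["bindings"]
    ]


# Invented response, shaped like the SPARQL JSON results format.
SAMPLE_RESPONSE = json.dumps({
    "head": {"vars": ["concept", "broader"]},
    "results": {"bindings": [
        {"concept": {"type": "uri", "value": "http://example.org/concept/1"}}
    ]},
})

if __name__ == "__main__":
    print(build_request_url(build_query("Asia Minor Occidentalis")))
    print(parse_bindings(SAMPLE_RESPONSE))
```

The human-readable presentation is then the caller's job: the dictionaries returned by `parse_bindings` can be rendered as a list, a tree, or a search result page.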
Our proposal: ThGenius
ThGenius: the SPARQL endpoint
The 'Asia Minor Occidentalis' in ThGenius (that reads the SKOS concept)
Keywords and keyphrases summarize and describe the content of single documents and provide additional semantic metadata that is useful for a lot of purposes.
The task of assigning keywords and keyphrases to a document is called keyphrase / keyword indexing.
In libraries, professional indexers select keyphrases and keywords from a controlled vocabulary (Subject Headings) according to defined cataloguing rules.
The idea behind the process described in the next slides is to automate the indexing task: a set of keywords / keyphrases is added to our documents automatically, extracted using the semantic relationships within a thesaurus (expressed in SKOS format).
This is another interesting advantage of having a thesaurus in SKOS format.
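The idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the process used by the authors: the tiny in-memory thesaurus is invented, its fields mirror the SKOS properties `skos:prefLabel`, `skos:altLabel` and `skos:broader`, and matching is a simple whole-word search.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Concept:
    pref_label: str                                   # mirrors skos:prefLabel
    alt_labels: list = field(default_factory=list)    # mirrors skos:altLabel
    broader: list = field(default_factory=list)       # preferred labels of skos:broader concepts


# Invented sample thesaurus, for illustration only.
THESAURUS = [
    Concept("Circulation", ["Lending", "Loan"], ["Library services"]),
    Concept("Cataloguing", ["Cataloging"], ["Library services"]),
    Concept("Whaling", ["Whale hunting"], ["Fishing"]),
]


def suggest_keywords(text, thesaurus):
    """Return the preferred labels of concepts mentioned in `text`.

    A concept counts as mentioned when its preferred label or one of its
    alternative labels occurs as a whole word (case-insensitive). The
    preferred labels of its broader concepts are added as well.
    """
    lowered = text.lower()
    keywords = []
    for concept in thesaurus:
        labels = [concept.pref_label, *concept.alt_labels]
        mentioned = any(
            re.search(r"\b" + re.escape(label.lower()) + r"\b", lowered)
            for label in labels
        )
        if mentioned:
            for keyword in [concept.pref_label, *concept.broader]:
                if keyword not in keywords:
                    keywords.append(keyword)
    return keywords


if __name__ == "__main__":
    print(suggest_keywords("The borrower asked for a loan of the book.", THESAURUS))
```

The point of the sketch is that a document containing only the non-preferred term "loan" still receives the controlled keyword "Circulation" and its broader concept, which is what the thesaurus relationships make possible.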
For our example we will use the following set of documents:
Format  File               Description
doc     Circulation.doc    Amicus Circulation Module user manual
pdf     Dubliners.pdf      The Dubliners (by J.Joyce)
pdf     Harry_Potter.pdf   Harry Potter and the Quest of Values (a thesis)
xls     bondvaluation.xls  Bond Calc Spreadsheet
pdf     Moby-Dick.pdf      Moby Dick (by H.Melville)
odt     Searching.odt      Amicus Search Module user manual
ppt     WeLoan.ppt         Amicus Circulation Web Module (by T.Possemato)
As a second step, we will proceed with text extraction.
Regardless of the file format, we will extract the textual content from each document.
Together with the previously extracted metadata, this is an important input for keyword indexing because later, using the extracted text, the system will be able to determine the occurrence, frequency and relevance of terms within the documents.
Keep in mind that the file format is not important from this point of view. That means you can use doc, txt, pdf, rtf, xml, html, OpenOffice documents and, generally speaking, any format that has (direct or indirect) textual content.
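One way to make the indexing step independent of the file format is to dispatch on the file extension and let each extractor return plain text. The sketch below is an assumption about how this could be organised, not the tool used in the presentation. It covers only txt and html, which the Python standard library can handle; formats such as doc or pdf would need a dedicated extraction library registered in the same table.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextCollector(HTMLParser):
    """Collect visible text, skipping the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if text and not self._skip_depth:
            self.parts.append(text)


def html_to_text(markup):
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    return " ".join(collector.parts)


def plain_to_text(content):
    return " ".join(content.split())


# Extension -> extractor. Other formats would be registered here.
EXTRACTORS = {".txt": plain_to_text, ".html": html_to_text, ".htm": html_to_text}


def extract_text(filename, content):
    """Return the textual content of a document, chosen by file extension."""
    suffix = Path(filename).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"no extractor registered for {suffix!r}")
    return EXTRACTORS[suffix](content)


if __name__ == "__main__":
    page = "<html><body><p>Check  in</p><p>the copy</p></body></html>"
    print(extract_text("page.html", page))
```

Whatever the source format, the later steps only ever see the string returned by `extract_text`.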
As an example, the following is a section of the text extracted from “Amicus Circulation Module user manual” (a Microsoft Word document):
... If the item is received in the requesting library, it needs to be checked in to make the item available for circulation. To do this, one has to follow the steps given below: Click on the Check In button on the Circulation Main Menu. Enter the barcode of the copy and press enter. A message appears that the item has arrived from transit. Click on the close button. If one checks the status of the copy in the requesting library, one will find an additional field on the status of copy screen: “Original branch”, showing the owning branch of the transferred item. After the check in of the item, the copy can be charged out to the borrower who requested the copy. The policies of the requesting library are valid as policies for this book. A hold can be placed on an item of another library from the moment the book has been transferred by the owning library. See: charge out policies Note that: The item can be charged out immediately when a borrower is present in the library at that moment. In this case, it is not necessary to check in the copy first before doing a charge out. To return the copy There are two options to return a transferred copy to the owning library. The first option is when the borrower comes to check in the copy: Enter the Barcode number of the copy on the Check In screen and press enter. ...
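Once text like the excerpt above is available, term occurrence and frequency can be computed directly from it. The following Python sketch is an illustration under stated assumptions: the tokenizer, the small stopword list, and the function name are invented here, and relevance weighting across documents is not shown.

```python
import re
from collections import Counter

# Small, hand-picked stopword list (an assumption for this sketch).
STOPWORDS = {"the", "of", "to", "a", "in", "is", "on", "and", "it", "be", "that"}


def term_frequencies(text):
    """Map each term to (occurrences, relative frequency) within `text`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    terms = [token for token in tokens if token not in STOPWORDS]
    total = len(terms)
    if total == 0:
        return {}
    return {term: (count, count / total) for term, count in Counter(terms).items()}


if __name__ == "__main__":
    sentence = "Enter the barcode of the copy and press enter."
    for term, (count, share) in sorted(term_frequencies(sentence).items()):
        print(term, count, round(share, 2))
```

In the sentence used here, "enter" occurs twice among five remaining terms, so it gets a higher frequency than "barcode", "copy" or "press".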