THGenius, rdf and open linked data for thesaurus management
Published in: Technology, Education


Transcript

  • 1. From SubjectA to THGenius: Semantic Web searching. 29th ADLUG Annual Meeting 2010, Centro Congressi Panorama, Trento (Provincia Autonoma di Trento), 22-24 September 2010. RDF and Open Linked Data, a first approach (part II)
  • 2. Library data in a modern context
    • The library catalogue (as traditional catalogue or as OPAC) has been the only context for library data since its inception.
    • The purpose of the library catalogue:
    • Identifying the library’s holdings
    • Supporting management of those holdings
    • Providing entry and discovery points for librarian and non-librarian users
    • Are the efforts librarians put into creating and maintaining the catalogue rewarded by users? For various reasons, users favor the Web over the library as an information platform.
    The question for librarians and vendors has to be: how can we strengthen the relationship between libraries and users? The question that we must face, sooner rather than later, is how we can best transform our data so that it becomes part of the dominant information environment, the Web.
  • 3. The Web as context
    • Current scenario: a change is under way
    • The Web is more and more the source of information for searchers and researchers, and the library needs to be interconnected with that web of data
    The library catalogue data must be transformed from the current ‘textual description’ into a set of data elements to which machine processes can be applied. These data elements must be compatible with the current technology, the World Wide Web. This process is what we can define as the evolution from the ‘library catalogue’ to the Semantic Web. For us as a vendor, this process means the evolution from SubjectA to THGenius.
  • 4. Data in the traditional catalogue
    • =LDR 00688nz a2200265n 4500
    • =001 000000008238
    • =005 20100519190730.0
    • =008 100519nnanoa \ anad
    • =040 aOSZK$bhun$fKöztaurusz
    • =151 aAsia Minor Occidentalis
    • =551 wgnnn$aókori történeti táj
    • =551 whnnn$aBithynia
    • =551 whnnn$aCaria
    • =551 whnnn$aIonia
    • =551 whnnn$aLycaonia
    • =551 whnnn$aLycia
    • =551 whnnn$aLydia
    • =551 whnnn$aMysia
    • =551 whnnn$aPamphylia
    • =551 whnnn$aPhrygia
    • =551 whnnn$aPisidia
    • =551 wjnnn$aAsia Minor
    • =551 wpnnn$aTörökország
    • =551 wmnnn$aAsia Minor Orientalis
    • =751 4$a(392)
    ‘Asia Minor Occidentalis’ as a MARC21 authority record
  • 5. The knowledge base for the Web
    • <skos:Concept rdf:about="http://nektar.oszk.hu/resource/auth/Asia_Minor_Occidentalis">
    • <skos:inScheme rdf:resource="http://www.oszk.hu/thesaurus/location"/>
    • <dc:source>OSZK geotezaurusz</dc:source>
    • <dc:type>location</dc:type>
    • <skos:prefLabel xml:lang="hu">Asia Minor Occidentalis</skos:prefLabel>
    • <skos:broader rdf:resource="http://nektar.oszk.hu/resource/auth/ókori_történeti_táj"/>
    • <skos:narrower rdf:resource="http://nektar.oszk.hu/resource/auth/Bithynia"/>
    • <skos:narrower rdf:resource="http://nektar.oszk.hu/resource/auth/Caria"/>
    • <skos:narrower rdf:resource="http://nektar.oszk.hu/resource/auth/Ionia"/>
    • <skos:narrower rdf:resource="http://nektar.oszk.hu/resource/auth/Lycaonia"/>
    • <skos:narrower rdf:resource="http://nektar.oszk.hu/resource/auth/Lycia"/>
    • <skos:narrower rdf:resource="http://nektar.oszk.hu/resource/auth/Lydia"/>
    • <skos:narrower rdf:resource="http://nektar.oszk.hu/resource/auth/Mysia"/>
    • <skos:narrower rdf:resource="http://nektar.oszk.hu/resource/auth/Pamphylia"/>
    • <skos:narrower rdf:resource="http://nektar.oszk.hu/resource/auth/Phrygia"/>
    • <skos:narrower rdf:resource="http://nektar.oszk.hu/resource/auth/Pisidia"/>
    • <skos:broader rdf:resource="http://nektar.oszk.hu/resource/auth/Asia_Minor"/>
    • <skos:related rdf:resource="http://nektar.oszk.hu/resource/auth/Törökország"/>
    • </skos:Concept>
    ‘Asia Minor Occidentalis’ as a web resource (in RDF/SKOS format)
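The step from the MARC21 record on slide 4 to the SKOS concept on slide 5 can be sketched in a few lines of Python. This is only an illustration: the mapping from the relation code in the 551 field (the letter after `w`) to a SKOS property is read off the two slides, not taken from any THGenius documentation, and the function name `field_to_skos` is hypothetical. The code `m` is left unmapped because slide 5 shows no corresponding element.

```python
# Base URI as it appears in the RDF/SKOS example on slide 5.
BASE = "http://nektar.oszk.hu/resource/auth/"

# Assumption: mapping inferred by comparing slides 4 and 5.
RELATION_BY_CODE = {
    "g": "skos:broader",
    "h": "skos:narrower",
    "j": "skos:broader",
    "p": "skos:related",
}


def field_to_skos(field):
    """Turn a line such as '=551 whnnn$aBithynia' into an RDF/XML element.

    Returns None for fields that are not 551 or whose relation code
    has no entry in RELATION_BY_CODE.
    """
    tag, _, rest = field.partition(" ")
    if tag != "=551":
        return None
    control, sep, label = rest.partition("$a")
    # 'control' looks like 'whnnn': subfield code w, then the relation code.
    if not sep or not control.startswith("w") or len(control) < 2:
        return None
    prop = RELATION_BY_CODE.get(control[1])
    if prop is None:
        return None
    uri = BASE + label.strip().replace(" ", "_")
    return f'<{prop} rdf:resource="{uri}"/>'


if __name__ == "__main__":
    for line in ["=551 wgnnn$aókori történeti táj", "=551 whnnn$aBithynia"]:
        print(field_to_skos(line))
```

A production converter would also need to handle the 151 heading, the 040 source and proper IRI escaping, none of which is attempted here.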
  • 6. The Web as context
    • What we can manage now with THGenius (RDF: Resource Description Framework):
    • RDF/SKOS objects: Simple Knowledge Organization System (for the representation of thesauri, classification schemes, taxonomies, subject-heading systems and so on)
    • RDF/FOAF objects: Friend of a Friend (an ontology describing persons, their activities and their relations with other people and objects)
    • RDF/DC objects: RDF Dublin Core metadata (used to describe information resources, such as documents)
    • The common goal: to publish our data on the web as linked entities
  • 7. Library data in a modern context
    • http://semanticweb.org/wiki/SPARQL_endpoint
    • “A SPARQL endpoint enables users (human or other) to query a knowledge base via the SPARQL language. Results are typically returned in one or more machine-processable formats. Therefore, a SPARQL endpoint is mostly conceived as a machine-friendly interface towards a knowledge base”
    • “Both the formulation of the queries and the human-readable presentation of the results should typically be implemented by the calling software, and not be done manually by human users”
    • Our proposal: THGenius
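To make the idea of "calling software" formulating the query concrete, here is a minimal Python sketch that builds a SPARQL query for the narrower concepts of a SKOS concept and encodes it as a GET request URL, using the `query` parameter defined by the SPARQL protocol. The endpoint address `http://example.org/sparql` and both function names are placeholders; the slides do not give the address of the THGenius endpoint, and no request is actually sent.

```python
from urllib.parse import urlencode

# Hypothetical endpoint address, used only for illustration.
ENDPOINT = "http://example.org/sparql"


def build_narrower_query(concept_uri):
    """Return a SPARQL SELECT query listing the narrower concepts of a concept."""
    return (
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
        "SELECT ?narrower WHERE {\n"
        f"  <{concept_uri}> skos:narrower ?narrower .\n"
        "}"
    )


def build_request_url(endpoint, query):
    """Encode the query in the 'query' parameter of a GET request URL."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    concept = "http://nektar.oszk.hu/resource/auth/Asia_Minor_Occidentalis"
    print(build_request_url(ENDPOINT, build_narrower_query(concept)))
```

The calling software would then fetch this URL and render the machine-readable result for the user, which is the division of labour the quotation describes.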
  • 8. THGenius: the SPARQL endpoint. ‘Asia Minor Occidentalis’ in THGenius (which reads the SKOS concept)
  • 9. THGenius: search the Semantic Web
  • 10. THGenius: search the Semantic Web
  • 11. The open search in THGenius
  • 12. THGenius: different perspectives to see concepts
  • 13. THGenius: different perspectives to see concepts
  • 14. THGenius: also a new thesaurus management system. WeCat: a traditional way to manage thesauri
  • 15. THGenius: also a new thesaurus management system. WeCat: a traditional way to manage thesauri
  • 16. THGenius: also a new thesaurus management system. WeCat: a traditional way to manage thesauri
  • 17. THGenius: also a new thesaurus management system. WeCat: a traditional way to manage thesauri
  • 18. THGenius: also a new thesaurus management system. THGenius: authorised people manage the thesaurus via the web
  • 19. THGenius: also a new thesaurus management system. THGenius: authorised people manage the thesaurus via the web
  • 20. Keyword and Keyphrases Indexing (1/9)
    • Keywords and keyphrases summarize and describe the content of single documents and provide additional semantic metadata that is useful for many purposes.
    • The task of assigning keywords and keyphrases to a document is called keyphrase / keyword indexing.
    • In libraries, professional indexers select keyphrases and keywords from a controlled vocabulary (Subject Headings) according to defined cataloguing rules.
    • The idea behind the process described in the next slides is to automate the indexing task, so that a set of keywords / keyphrases, extracted using the semantic relationships within a thesaurus (expressed in SKOS format), is added to our documents automatically.
    • This is another interesting advantage of having a thesaurus in SKOS format.
  • 21. Keyword and Keyphrases Indexing (2/9)
    • For our example we will use the following set of documents:
    File: Description
      • Circulation.doc: Amicus Circulation Module user manual
      • Dubliners.pdf: The Dubliners (by J. Joyce)
      • Harry_Potter.pdf: Harry Potter and the Quest for Values (a thesis)
      • bondvaluation.xls: Bond Calc Spreadsheet
      • Moby-Dick.pdf: Moby Dick (by H. Melville)
      • Searching.odt: Amicus Search Module user manual
      • WeLoan.ppt: Amicus Circulation Web Module (by T. Possemato)
  • 22. Keyword and Keyphrases Indexing (3/9)
    • First of all we will extract metadata from our documents. Specifically we will get the “title” and “author” metadata attributes.
    • The following is what the process produces:
    Metadata attribute : title=Bond Calculator
    Metadata attribute : author=Robert Jones
    Metadata attribute : title=Amicus Circulation Module - User Manual
    Metadata attribute : author=Anneke
    Metadata attribute : title=Dubliners
    Metadata attribute : author=James Joyce
    Metadata attribute : title=Harry Potter and the Quest for Values
    Metadata attribute : author=Tony Lennard
    Metadata attribute : title=Moby Dick
    Metadata attribute : author=Herman Melville
    Metadata attribute : title=WeLoan - The new circulation module
    Metadata attribute : author=Tiziana Possemato
    ...
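The flat output above has to be grouped per document before it can be indexed. A minimal Python sketch of that grouping follows; it assumes the `Metadata attribute : name=value` line format shown on the slide and treats a repeated attribute name as the start of the next document. The function name `group_metadata` is hypothetical and not part of THGenius.

```python
def group_metadata(lines):
    """Group 'Metadata attribute : name=value' lines into one dict per document."""
    records = []
    current = {}
    for line in lines:
        prefix, sep, pair = line.partition(" : ")
        if not sep or prefix.strip() != "Metadata attribute":
            continue  # skip anything that is not a metadata line, e.g. '...'
        name, eq, value = pair.partition("=")
        if not eq:
            continue
        name, value = name.strip(), value.strip()
        # Assumption: seeing the same attribute again means a new document starts.
        if name in current:
            records.append(current)
            current = {}
        current[name] = value
    if current:
        records.append(current)
    return records


if __name__ == "__main__":
    sample = [
        "Metadata attribute : title=Moby Dick",
        "Metadata attribute : author=Herman Melville",
        "Metadata attribute : title=Dubliners",
        "Metadata attribute : author=James Joyce",
    ]
    for record in group_metadata(sample):
        print(record)
```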
  • 23. Keyword and Keyphrases Indexing (4/9)
    • As a second step, we will proceed with text extraction.
    • Regardless of the file format, we will extract the textual content from each document.
    • Together with the previously extracted metadata, this is an important part of keyword indexing because later, using the extracted text, the system will be able to determine the occurrence, frequency and relevance of terms within the documents.
    • Keep in mind that the file format is not important from this point of view. That means you can use doc, txt, pdf, rtf, xml, html, OpenOffice documents and, generally speaking, all formats that have (direct or indirect) textual content.
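The principle of format-independent extraction can be sketched as a dispatch on the file extension. The Python example below uses only the standard library and therefore covers just two of the listed formats, plain text and HTML; binary formats such as doc or pdf would need a dedicated extraction library, which the slides do not name. All function and class names here are hypothetical.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextCollector(HTMLParser):
    """Collects visible text, ignoring the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(markup):
    """Return the visible text of an HTML string, separated by single spaces."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    return " ".join(collector.parts)


def extract_text(filename, raw):
    """Pick an extractor from the file extension; 'raw' is the decoded content."""
    suffix = Path(filename).suffix.lower()
    if suffix in (".html", ".htm"):
        return html_to_text(raw)
    if suffix == ".txt":
        return raw.strip()
    # Other formats are out of scope for this sketch.
    raise ValueError(f"no extractor for {suffix!r} in this sketch")


if __name__ == "__main__":
    print(extract_text("help.html", "<p>Click on the Check In button.</p>"))
```

Whatever the input format, the result is a plain string, which is all the later indexing steps need.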
  • 24. Keyword and Keyphrases Indexing (5/9)
    • In order to give you an example, the following is a section of the text extracted from “Amicus Circulation Module user manual” (a Microsoft Word document):
    ... If the item is received in the requesting library, it needs to be checked in to make the item available for circulation. To do this, one has to follow the steps given below:
    Click on the Check In button on the Circulation Main Menu.
    Enter the barcode of the copy and press enter.
    A message appears that the item has arrived from transit.
    Click on the close button.
    If one checks the status of the copy in the requesting library, one will find an additional field on the status of copy screen: “Original branch”, showing the owning branch of the transferred item. After the check in of the item, the copy can be charged out to the borrower who requested the copy. The policies of the requesting library are valid as policies for this book. A hold can be placed on an item of another library from the moment the book has been transferred by the owning library. See: charge out policies
    Note that: The item can be charged out immediately when a borrower is present in the library at that moment. In this case, it is not necessary to check in the copy first before doing a charge out.
    To return the copy
    There are two options to return a transferred copy to the owning library. The first option is when the borrower comes to check in the copy: Enter the Barcode number of the copy on the Check In screen and press enter. ...
  • 25. Keyword and Keyphrases Indexing (6/9)
    • After extracting the textual content from our documents, it is time to extract keywords and keyphrases.
    • In order to do that we need:
      • Metadata attributes: see first step;
      • Text: see second step;
      • A controlled vocabulary (thesaurus) in SKOS format.
    • Regarding the last point, for this example we will use the Library of Congress Subject Headings (LCSH), but keep in mind that any thesaurus in SKOS format can be used.
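How text and a controlled vocabulary combine can be illustrated with a deliberately simple Python sketch: it counts how often each thesaurus label occurs in the extracted text and returns the most frequent ones. This is plain label matching, not the THGenius algorithm, which according to the slides also uses the semantic relationships within the thesaurus. The function names and the short label list are assumptions made for the example.

```python
import re
from collections import Counter


def _normalize(value):
    """Lower-case and reduce to words separated by single spaces."""
    return " ".join(re.findall(r"\w+", value.lower()))


def extract_keyphrases(text, labels, top_n=5):
    """Rank thesaurus labels by how often they occur as whole words in the text."""
    haystack = _normalize(text)
    counts = Counter()
    for label in labels:
        needle = _normalize(label)
        if not needle:
            continue
        # Word boundaries keep 'ships' from matching inside 'shipshape'.
        hits = len(re.findall(rf"\b{re.escape(needle)}\b", haystack))
        if hits:
            counts[label] = hits
    return counts.most_common(top_n)


if __name__ == "__main__":
    text = "Whales! The whaling ship chased whales; whaling is hard. Ships and boats."
    labels = ["Whales", "Whaling", "Ships", "Steam engines"]
    print(extract_keyphrases(text, labels))
```

In a real setting the labels would be the `skos:prefLabel` (and possibly alternative label) values read from the SKOS thesaurus, and the raw counts would be weighted for relevance.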
  • 26. Keyword and Keyphrases Indexing (7/9)
    • The following are keyphrases and keywords extracted from Moby Dick by Herman Melville using two different thesauri (Library of Congress Subject Headings and Medical Subject Headings).
    LCSH: Soils; Whaling; Whales; Hand; Boats and boating; History; Ships; Steam engines; Steam engineers; Poultry; Seas; Journalism; History; Emotions; Interest (Psychology); Fat; Steam engineering; Steam-engines; ...
    MeSH: Dogs; Male; Smell; Simian Acquired Immunodeficiency Syndrome; Female; Spermatozoa; Animals; Animation [Publication Type]; Sleep; Leg; Cattle; Mouth; Monsters; Aged; Aging; Mortality; DNA Transposable Elements; Brain; ...
  • 27. Keyword and Keyphrases Indexing (8/9)
    • Finally, after indexing metadata, text, keywords and keyphrases, we can search those documents using our favourite search engine.
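What "indexing metadata, text, keywords and keyphrases" means for a search engine can be sketched with a tiny inverted index in Python. Every field of a document, including the extracted keyphrases, contributes tokens that point back to the document, so a query can match on any of them. The field names and functions are hypothetical; a real deployment would use a full search engine rather than this in-memory structure.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lower-cased token to the set of documents containing it.

    'documents' maps a file name to a dict of fields whose values are
    strings or lists of strings (e.g. the extracted keyphrases).
    """
    index = defaultdict(set)
    for name, fields in documents.items():
        for value in fields.values():
            items = value if isinstance(value, list) else [value]
            for item in items:
                for token in re.findall(r"\w+", item.lower()):
                    index[token].add(name)
    return index


def search(index, query):
    """Return the documents that contain every token of the query."""
    tokens = re.findall(r"\w+", query.lower())
    if not tokens:
        return set()
    result = None
    for token in tokens:
        hits = index.get(token, set())
        result = set(hits) if result is None else result & hits
    return result


if __name__ == "__main__":
    docs = {
        "Moby-Dick.pdf": {
            "title": "Moby Dick",
            "author": "Herman Melville",
            "text": "Call me Ishmael.",
            "keyphrases": ["Whaling", "Whales"],
        },
    }
    print(search(build_index(docs), "whaling"))
```

Note that the query "whaling" finds the document through its keyphrases even though the word does not occur in the short sample text, which is the benefit the slide points to.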
  • 28. Keyword and Keyphrases Indexing (9/9)
  • 29.
    • What THGenius is:
    • The best opportunity for a library to be attractive to modern, smart users
    • The evolution from the traditional library catalogue to the Semantic Web: not only from a vendor's but also from a library's point of view
    • A very powerful and user-friendly way to produce, use and share library data, available for the web
    • A simple way to manage thesaurus and authority data in a very standard and reusable format
    • A powerful and simple way to improve the search functions by adding full text and metadata from different file types
    THGenius in a few concepts