Relevance of classification and indexing in the organization of internet resources
A common opinion is that the digital age is wiping out the centuries-old library system, and that libraries and librarians are obsolete in the present digital era. Two questions LIS professionals generally face are: ‘What will be the future of libraries?’ and ‘Why organize information if you can find it on the internet?’
Will Sherman: 33 Reasons why libraries and librarians are still important (http://www.degreetutor.com/library)
- Not everything is available on the internet
- Digital libraries are not the internet
- The internet complements libraries but does not replace them
- The internet is not free
- Digitization does not mean destruction; in fact, it means survival
- Libraries are not just books
- Like businesses, digital libraries still need human beings
- Eliminating libraries would cut short cultural evolution
- The internet is a mess, while libraries organize knowledge
Librarians have employed three important tools for knowledge organization (K.O.):
- Data element directory (Cataloging Manual)
- Classification scheme, for categorization of the documents
- Thesaurus (vocabulary control tool), for consistent indexing (assigning index terms)
The web has grown without any of these tools, and so is unorganized (Devadason, F.J. Facet analysis and semantic Web: Musings of a student of Ranganathan. http://www.reocities.com/Athens/5041/FASEMWEB.html).
However, the issue is:
- There is an enormous quantity of information outside libraries
- How to collect and organize the world’s knowledge?
TRADITIONAL
- Classification: shelf arrangement
- Catalogue: identification and location of information
- Analysis & consolidation: indexing / abstracting for micro documents
Result:
- improved precision or recall
- provide context for search terms
- enable browsing
- access to related information with meaningful relationships
- serve as a mechanism for switching between languages
WEB BASED
- Search engines
- Subject gateways
- Directories
Result: The web is a sea of all kinds of data
- difficult to find, access & retrieve pertinent information
- extremely unorganized data
- too many false and missing links
E.g. Building and architecture; Travel and hotel
Difference: Use of subject descriptors
Directories
- Could not cope with the scale of Web growth
- Were often built by amateurs in classification and vocabulary management
- Were biased by the commercial use of the Web
Vocabularies
- Open Directory categories
- Wikipedia categories
Metadata in html <head>
- Spammed, not in sync with the content
- Ignored by most search engines now
Bottom line: The Web is not and will never be an organized library (Bernard, V. Porting library vocabularies to the Semantic Web, and back: A win-win round trip. IFLA 2010, Gothenburg)
E.g. Works on M. K. Gandhi
Library
- The art of librarianship has been used for thousands of years to organise knowledge
- catalogue / librarian – class no. – shelf
Search engines
- collections are built by robots; number count
- aim for exhaustive indexing
- offer automatically generated metadata
Subject gateways
- collections are built by humans
- aim to develop catalogues of high quality resources
- offer human generated metadata
Can we apply classification principles?
Can we apply metadata?
Can we apply indexing techniques?
Two distinct ways of finding resources on the Internet emerged (Dodd 1996):
- the use of robot- or spider-based search engines, and
- the production of ‘hotlists’, which encourage users to browse the Web.
The production of hierarchically arranged lists brought in the use of library classification schemes. Subject directories like Yahoo! and other quality-controlled subject gateways started using classification schemes to enhance searching the Net. They maximize the retrievability and visibility of information through clustering and browsing, e.g. LIS education through distance mode.
Electronic versions of classification schemes (WebDewey, UDC Online) made it possible to adopt them on the web. The Web, as an information environment, differs from the controlled setting of a traditional information retrieval system. The question is how, and to what extent, a classification is actually used to support subject access on the web. Many Web sites, like Google and Yahoo, use hierarchical classification trees to organize text resources on the Web. Subject gateways offer hierarchical browse structures based on subject classification schemes.
The DDC was adapted earlier and more quickly to usage in digital systems via the Internet. It is completely and easily available as "WebDewey" for all Web browsers and platforms.
Examples:
- Library and Archives Canada (LAC) has capitalized on the Dewey Decimal Classification (DDC) potential for organizing Web resources in two projects
- ADAM, the Art, Design, Architecture & Media Information Gateway
- Biz/ed, a subject gateway for business education
- BUBL, which uses the Dewey Decimal Classification system as the primary organisation structure for its catalogue of Internet resources
- National Library of Canada's Canadian Information by Subject service
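The hierarchical browse structure that a subject gateway builds on DDC can be sketched in a few lines. This is a minimal illustration, not how BUBL or any other gateway was actually implemented: the class captions are paraphrased for the example, and the function names are hypothetical. It relies only on the fact that DDC notation is hierarchical, so that broader classes can be derived from the leading digits of a class number.

```python
# Illustrative excerpt of a DDC-style hierarchy. The captions are paraphrased
# for this sketch and should be checked against WebDewey before reuse.
DDC_EXCERPT = {
    "000": "Computer science, information and general works",
    "020": "Library and information sciences",
    "025": "Operations of libraries, archives and information centres",
    "025.4": "Subject analysis and control",
}


def format_notation(digits):
    """Pad a digit string to three places and put the decimal point back."""
    padded = digits.ljust(3, "0")
    return padded if len(padded) == 3 else padded[:3] + "." + padded[3:]


def broader_classes(notation):
    """Return the known classes from most general to most specific."""
    digits = notation.replace(".", "")
    chain = []
    for length in range(1, len(digits) + 1):
        candidate = format_notation(digits[:length])
        if candidate in DDC_EXCERPT and candidate not in chain:
            chain.append(candidate)
    return chain


def breadcrumb(notation):
    """Build a browse path such as a subject gateway might display."""
    return " > ".join(f"{n} {DDC_EXCERPT[n]}" for n in broader_classes(notation))


print(breadcrumb("025.4"))
```

A resource classed at 025.4 is then reachable by browsing down from 000 through 020 and 025, which is the clustering effect a classification scheme brings to a web directory.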
UDC has been used in subject gateways (SGs) since 1993 and has become more prevalent in East European SGs, portals and hubs since 2000. UDC in SGs appears to be linked to the following types of applications:
- manual classification of manually collected links in small to medium-size directories (from a few hundred to a few thousand resources)
- manual classification of a large number of automatically harvested resources, using harvesting and metadata creation tools and more advanced technology (quality controlled SGs)
- automatic harvesting and classification (quality controlled SGs)
(Aida Slavic. UDC in subject gateways: experiment or opportunity? Knowledge Organization, 33, 2006)
Examples:
- WAIS (Wide Area Information Server)
- NISS (National Information Services and Systems)
- INTUTE
- FVL (Finnish Virtual Library)
- GERHARD (German Harvest Automated Retrieval and Directory)
- PORT (Maritime Information Gateway)
- OKO (Slovenian catalogue of Web resources) etc.
But they do not display the UDC structure on the interface, or UDC numbers in the metadata. The UDC is probably more "modern" and has made faster progress towards a faceted structure.
Descriptive metadata facilitates the discovery of relevant information. In addition to resource discovery, metadata can help organize electronic resources, facilitate interoperability and legacy resource integration, provide digital identification, and support archiving and preservation. The process is automatic and cost effective. In descriptive metadata, the medium of the resource becomes a non-issue. This enables Dublin Core (DC) metadata to be used by any organization for cataloguing specialized types of mixed-media collections.
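As an illustration of medium-independent descriptive metadata, the sketch below renders a small Dublin Core record as HTML meta elements, using the common `DC.` prefix convention for embedding Dublin Core in a page head. The record itself and the function name `dc_meta_tags` are hypothetical, made up for this example.

```python
from html import escape

# Hypothetical record; the keys are element names from the Dublin Core
# element set, the values are invented for illustration.
record = {
    "title": "Works on M. K. Gandhi: a resource guide",
    "creator": "Example Library",
    "subject": "Gandhi, M. K.",
    "format": "text/html",
}


def dc_meta_tags(record):
    """Render a Dublin Core record as HTML <meta> elements."""
    # Declare the namespace that the "DC." prefix stands for.
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for element, value in record.items():
        # escape() protects against quotes and angle brackets in the values.
        lines.append(f'<meta name="DC.{escape(element)}" content="{escape(value)}">')
    return "\n".join(lines)


print(dc_meta_tags(record))
```

The same fifteen-element vocabulary would describe a book, an image or a web page alike; only the values change.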
Indexing techniques: pre- and post-coordinated; derived and assigned; context based; thesaurus and classaurus (a classaurus is a faceted scheme of terms indicating hierarchy, enriched with synonyms).
Two concepts: semantics and syntax.
Purpose: achieve precision out of recalled information.
- Humans can do it, since it is natural language
- Machines are ignorant and can’t make any sense of it
How to achieve precision out of recalled information on the Web?
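For reference, the two measures named above have standard definitions: precision is the share of retrieved items that are relevant, and recall is the share of relevant items that were retrieved. A minimal sketch, with made-up document identifiers:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search result."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant items that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search: four documents come back, three are truly relevant,
# and two of the relevant ones are among those retrieved.
retrieved = ["doc1", "doc2", "doc3", "doc4"]
relevant = ["doc2", "doc4", "doc7"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here half of what was retrieved is relevant (precision 0.5) and two of the three relevant documents were found (recall about 0.67). A web search typically recalls a great deal, so the practical problem is raising precision.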
Relationships are categorized as:
- Hierarchical (internal): whole – part composition
- Non-hierarchical (external): associative and equivalent
Application in different areas:
- Design of classification (thesaurus)
- Knowledge organization and information retrieval (search strategies)
- Lexical cohesion
- Epistemology etc.
- Design and development of databases
- Web design and development
- Artificial intelligence
- Text analysis and summarization
- Hypermedia
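The three relationship types can be sketched as a tiny thesaurus used for query expansion in a search strategy. The terms below are hypothetical; the relation codes are the conventional thesaurus ones: BT/NT (broader/narrower term, hierarchical), RT (related term, associative) and UF (used for, equivalent).

```python
# Hypothetical mini-thesaurus, for illustration only.
THESAURUS = {
    "architecture": {
        "NT": ["church architecture"],   # hierarchical: narrower term
        "RT": ["building"],              # associative: related term
        "UF": ["architectural design"],  # equivalent: non-preferred synonym
    },
    "church architecture": {"BT": ["architecture"]},
    "building": {"RT": ["architecture"]},
}


def expand_query(term, relations=("UF", "NT", "RT")):
    """Expand a search term with the terms linked by the chosen relations."""
    entry = THESAURUS.get(term, {})
    expanded = [term]
    for relation in relations:
        for linked in entry.get(relation, []):
            if linked not in expanded:
                expanded.append(linked)
    return expanded


print(expand_query("architecture"))
print(expand_query("architecture", relations=("NT",)))
```

Restricting the expansion to hierarchical links keeps the search narrow, while adding associative and equivalent links widens recall.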
Creating representations of Web pages:
- Providing standard identifiers (URIs) associated with an access protocol (HTTP)
- The WWW is based on HTML / XML hierarchies for coding a body of text and images (multimedia) and linking things together via the HTTP protocol, hypertext etc.
- Use of vocabularies as subject descriptors to organize Web content, as in libraries
Taxonomies, subject headings, classifications
- That’s where library heritage is strong and the Web is weak
- Such vocabularies can structure the web of data as they do for libraries
- But it is more than in a library: the process should be automated
Semantic enhancement of scholarly journal articles, by aiding publication of data and metadata and providing ‘lively’ interactive access, is necessary. Such semantic enhancements are already being undertaken by leading STM publishers. The application of structured vocabularies, of course using artificial intelligence, is the ‘semantic Web’.
Tim Berners-Lee: computer scientist at MIT, USA; creator of the WWW; Director of the W3 Consortium; developer of the Semantic Web.
Intention: to enhance the usability and usefulness of the web and its connected resources.
“I have a dream for the Web [in which computers] become capable of analysing all the data on the Web – the content, links, and transactions between people and computers. A ‘Semantic Web’, which should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The ‘intelligent agents’ people have touted for ages will finally materialize.” (Tim Berners-Lee, 1999)
Technologies enabling machines to make more sense of the Web, making the Web more useful for humans. This means radically improving the ability to find, sort, and classify information: an activity that takes up a large part
The Semantic Web is a project that intends to create a universal medium for information exchange by putting documents with computer-processable meaning (semantics) on the World Wide Web.
“The Semantic Web is an extension of the current Web that will allow you to find, share, and combine information more easily. It relies on machine-readable information and metadata expressed in RDF.” (www.noisebetweenstations.com/personal/essays/metadata_glossary/metadata_glossar)
Humans can easily connect the data when browsing the Web: for example, we disregard advertisements, and we know which links are interesting for our purpose (job – resume; air ticket – flights). But machines can’t!
E.g. automatic airline reservation can be done (Ivan Herman, W3C) by combining local knowledge with remote services: airline preferences; dietary requirements; calendaring. Similarly, a computer can find the nearest plastic surgeon and book an appointment that fits a personal schedule.
- XML provides a surface syntax for structured documents, but imposes no semantic constraints on the meaning of these documents.
- XML Schema is a language for restricting the structure of XML documents.
- RDF is a simple data model for referring to objects ("resources") and how they are related. An RDF-based model can be represented in XML syntax.
- RDF Schema is a vocabulary for describing properties and classes of RDF resources, with semantics for generalization hierarchies of such properties and classes.
- OWL adds more vocabulary for describing properties and classes: among others, relations between classes (e.g. disjointness), cardinality (e.g. "exactly one"), equality, richer typing of properties, characteristics of properties (e.g. symmetry), and enumerated classes.
- URI (Uniform Resource Identifier): used as a universal naming tool, including for properties.
- Namespace: a context in which a group of one or more identifiers might exist. An identifier defined in a namespace is associated with that namespace, e.g. Employee ID 123. Many modern computer languages provide support for namespaces.
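How a namespace turns a short local identifier into a globally unique URI can be sketched as follows. The `dc` and `foaf` entries are the published Dublin Core and FOAF namespace URIs; the `hr` and `lib` namespaces and the function name `expand` are hypothetical, invented to mirror the "Employee ID 123" example.

```python
NAMESPACES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    # Hypothetical namespaces: the same local identifier "123" means
    # different things in each of them.
    "hr": "http://example.org/hr/employee/",
    "lib": "http://example.org/library/member/",
}


def expand(qname):
    """Expand a prefixed name such as 'dc:title' into a full URI."""
    prefix, separator, local_name = qname.partition(":")
    if not separator or prefix not in NAMESPACES:
        raise KeyError(f"unknown or missing prefix in {qname!r}")
    return NAMESPACES[prefix] + local_name


print(expand("dc:title"))
print(expand("hr:123"))
print(expand("lib:123"))
```

Because the namespace is itself a URI, employee 123 and library member 123 cannot be confused once the names are expanded.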
All these are based on knowledge representation algorithms, say weak AI. The primary facilitators of this technology are URIs, which identify resources, along with XML and namespaces. These, with a bit of logic, form RDF, which can be used to say anything about anything. FOAF: a popular application of the semantic web is Friend of a Friend (FOAF), which describes relationships among people and other agents in terms of RDF.
The web is changing and offering new possibilities for communication and interaction by combining concepts on the web. This is made possible by XML, which provides an interoperable syntactic foundation on which relationships can be represented and meanings built.
RDF is a standard, commonly serialized in XML, for describing resources that exist on the web. RDF is a model for such relationships and the standard interchange format on the semantic web. Once information is in RDF form, it becomes easy to process, since RDF is a generic format.
- It is a model of (s p o) triples, with p naming the relationship between s and o
- RDF is a graph: a set of RDF statements is a directed, labeled graph
  - the nodes represent the resources that are bound
  - the labeled edges are the relationships with their names
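The triple model and its graph reading can be sketched without any RDF library. In this minimal example each statement is a plain (s, p, o) tuple, and the graph is an adjacency list whose edges carry the predicate as their label. The `ex:` identifiers are hypothetical and the prefixed names stand in for full URIs.

```python
from collections import defaultdict

# Each statement is one (subject, predicate, object) triple.
triples = [
    ("ex:article", "dc:title", "The Semantic Web: An Introduction"),
    ("ex:article", "dc:creator", "ex:author"),
    ("ex:author", "foaf:name", "Sean B. Palmer"),
]


def build_graph(triples):
    """Index triples as a directed, labeled graph: node -> [(label, node)]."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def objects(graph, subject, predicate):
    """Follow every edge labeled `predicate` that leaves `subject`."""
    return [obj for label, obj in graph.get(subject, []) if label == predicate]


graph = build_graph(triples)
# Walk two edges: article --dc:creator--> author --foaf:name--> literal.
author = objects(graph, "ex:article", "dc:creator")[0]
print(objects(graph, author, "foaf:name"))
```

The two-step walk at the end is what makes the graph view useful: the creator's name is not stated about the article directly, but is reached by following named relationships.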
With an RDF application, it is easy to know which bits of data are the semantics of the application, and which bits are just syntactic fluff. RDF statements describe a resource, the resource's properties and the values of those properties. RDF statements are often referred to as “triples”, consisting of a subject, predicate and object, which correspond to a resource (subject), a property (predicate) and a property value (object).
This piece of RDF basically says that this article has the title "The Semantic Web: An Introduction", and was written by someone whose name is "Sean B. Palmer". Here are the triples that this RDF produces:
    <> <http://purl.org/dc/elements/1.1/creator> _:x0 .
    this <http://purl.org/dc/elements/1.1/title> "The Semantic Web: An Introduction" .
    _:x0 <http://xmlns.com/0.1/foaf/name> "Sean B. Palmer" .
Another RDF/XML description, of the resource http://www.ivan-herman.net:
    <rdf:Description rdf:about="http://www.ivan-herman.net">
      <foaf:name>Ivan</foaf:name>
      <abc:myCalendar rdf:resource="http://…/myCalendar"/>
      <foaf:surname>Herman</foaf:surname>
    </rdf:Description>
A URI is simply a web identifier, like the strings starting with “http:” or “ftp:”. Anyone can create a URI, and the ownership of URIs is clearly delegated, so they form an ideal base technology on which to build a global web. Resources on the web are identified by URIs, which use a global naming convention. The W3C maintains a list of URI schemes.
- URIs made the merge possible
- URIs ground RDF into the Web
- URIs make this the Semantic Web
Ontological analysis clarifies the structure of knowledge.
- An ontology is defined as the terms used to describe and represent an area of knowledge; ontologies are explicit specifications of a conceptualization.
- Ontology is the study of the ‘categories of things that exist or may exist in some domain’.
- A common ontology defines the vocabulary with which queries and assertions are exchanged among agents.
- Ontologies are the rules that help integration and operate on a globally shared theory.
- Ontologies are often equated with taxonomic hierarchies of classes, but need not be limited to this form, as an ontology adds knowledge about the world.
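The simplest form of ontology mentioned above, a taxonomic hierarchy of classes, can be sketched as a subclass table with a small inference step in the spirit of the RDF Schema subclass relation. The class names are hypothetical, and the sketch assumes each class has at most one direct superclass, which real ontologies do not require.

```python
# Hypothetical class hierarchy: child class -> its direct superclass.
SUBCLASS_OF = {
    "ex:SubjectGateway": "ex:WebDirectory",
    "ex:WebDirectory": "ex:InformationResource",
}


def superclasses(cls):
    """Return every class that `cls` specializes, nearest first."""
    chain = []
    seen = {cls}
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        if cls in seen:  # guard against accidental cycles in the table
            break
        seen.add(cls)
        chain.append(cls)
    return chain


def is_a(resource_class, target_class):
    """True if something typed `resource_class` also counts as `target_class`."""
    return resource_class == target_class or target_class in superclasses(resource_class)


print(superclasses("ex:SubjectGateway"))
print(is_a("ex:SubjectGateway", "ex:InformationResource"))
```

An agent that only knows how to handle `ex:InformationResource` can therefore still process a subject gateway, because the shared hierarchy tells it that one is a kind of the other. A fuller ontology would add properties, constraints and rules on top of this hierarchy.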
The semantic Web is generally built on syntaxes which use URIs to represent data, usually in triple-based structures: many triples of URI data that can be held in databases, or interchanged on the WWW using a particular syntax developed especially for the task. These syntaxes are called “Resource Description Framework” (RDF) syntaxes. The application of the Semantic Web is to create relations among resources on the Web and to interchange those data, like (hyper)links on the traditional web, except that:
- there is no notion of a “current” document; i.e., a relationship is between any two resources
- a relationship must have a name: a link to my CV should be differentiated from a link to my calendar
- there is no attached user-interface action as there is for a hyperlink
- Map the various data onto an abstract data representation: make the data independent of its internal representation…
- Merge the resulting representations
- Start making queries on the whole: queries that could not have been made on the individual data sets
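The map, merge and query steps above can be sketched with two small data sets already mapped to triples. Everything here is hypothetical (the URIs, the data and the helper names); the point is only that the merge is a plain set union, which works because both sets use the same URI for the same person.

```python
# Data set 1: a library catalogue, mapped to triples.
catalogue = {
    ("http://example.org/book/1", "dc:title", "An Autobiography"),
    ("http://example.org/book/1", "dc:creator", "http://example.org/person/gandhi"),
}
# Data set 2: personal data found elsewhere on the web, mapped to triples.
web_data = {
    ("http://example.org/person/gandhi", "foaf:name", "M. K. Gandhi"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples that agree with every component that is given."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def titles_by_author_name(triples, name):
    """Find titles of works whose creator has the given name."""
    titles = []
    for person, _, _ in match(triples, p="foaf:name", o=name):
        for work, _, _ in match(triples, p="dc:creator", o=person):
            titles.extend(t[2] for t in match(triples, s=work, p="dc:title"))
    return sorted(titles)


merged = catalogue | web_data  # the merge step: shared URIs join the graphs
print(titles_by_author_name(merged, "M. K. Gandhi"))
print(titles_by_author_name(catalogue, "M. K. Gandhi"))
```

The query succeeds only on the merged set: the catalogue alone does not hold the name, and the web data alone does not hold the title.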
The Web lacks the coordination and organization of a traditional library. Practice has shown that traditional library tools and techniques can be a great help in taming the Net. The IFLA Information Technology section, with the support of the Cataloguing section, the Classification and Indexing section, and the Knowledge Management section, proposes the creation of a Semantic Web Special Interest Group (SWSIG) within IFLA. The SWSIG intends to be a platform where interested professionals can gather and undertake whatever tasks are needed to develop, enhance and facilitate the adoption of semantic Web technologies in the library community. Librarians should start research projects to develop better techniques for organizing the web. Modern classification research must find order, especially in the context of the complexities of the Internet.