Subject Access Enhancement: FocusOn Search and CategoryMap: An Integrated Approach for Discovery of University Resources and Library on the Web
 

Subject access refers to finding and locating the ‘Aboutness’ of a named entity (person, family, corporate body) or of a concept, object, event, or place. Subject access enhancement refers to providing integrated subject access to structured, semi-structured, and unstructured data. This presentation compares known-term and unknown-term search in Google, a library OPAC and website, and a university website; introduces, through examples, various subject access enhancement techniques applied to a library OPAC that supports unknown-term search; and points out the challenges in providing integrated subject access across all resources of an enterprise: the university website, library OPAC, library website, and other data service points. FocusOn Search and CategoryMap are considered essential components for enhancing subject access to such data. The presentation also suggests how the two new utilities could be implemented as plug-ins to an existing cataloging environment, allowing catalogers to 1) configure web services capable of consuming metadata in formats other than MARC, and 2) create and maintain categories conforming to an enterprise service bus at the local library level, home-institution level, consortium level, bibliographic utility level, and other data service levels.

Usage Rights

© All Rights Reserved

  • Acknowledgement: Prof. Isael Moskowitz from NYU; Andrew Sankowsi, Cynthia Chambers, and Theresa Maylone from St. John’s Univ. Libraries
  • Definition: Subject access refers to finding and locating the ‘Aboutness’ of a named entity (person, family, corporate body) or of a concept, object, event, or place. To make this happen, on the document processing side we perform subject analysis and other processing as follows:
    • Provide classification for the document;
    • Sometimes, provide categorization for the document;
    • Describe the Aboutness of the document, e.g. identify the named entity, concept, object, event, and place;
    • Tag the named entity, concept, object, event, and place in a controlled manner (e.g. provide authority control for named entities and subjects, including thesauri such as AAT, LCSH, and MeSH);
    • Index them, hoping that the query the searcher enters into the system matches a term that we have in the index;
    • Usually store the tags/metadata in relational database systems and the associated documents in flat file systems. This is what people call ‘semi-structured’ data.
    Unstructured data refers to documents such as .doc files, .txt files, .xls files, email, and telephone transcripts. Structured data refers to data in relational database systems, object-oriented database systems, and other structured systems. By a commonly cited estimate, about 20% of the world's data is in structured systems and about 80% is in unstructured and semi-structured systems.
    Subject access enhancement refers to providing integrated subject access to structured, semi-structured, and unstructured data. FocusOn Search and CategoryMap are considered essential components for enhancing subject access to such data.
  • Google Search example: the result list does not differentiate ‘books written by Henry George himself’ from ‘books or topics about Henry George.’ OPAC search comes in handy because we mark up both books written by Henry George and books or topics about Henry George in the bib records.
  • Google works for known item search, e.g. “ACM Communications in Computer Algebra” by title. Google does not work for unknown item search, e.g. “Algebra-Data Processing-Periodical”: the above title cannot be found on the first page of results.
  • OPAC supports both known and unknown item search. Example of unknown item search in OPAC – WebVoyage by subject keyword “AND” with relevance ranking in simple search mode.
  • Unknown item search in OPAC – WebVoyage by subject Boolean keyword “AND” in advanced search mode.
  • OPAC rendering result for unknown item search by subject keyword in OPAC - WebVoyage
  • Unknown item search in OPAC – WebVoyage by subject browse.
  • Use LC classification QA 150-272 to group items whose “Aboutness” is Algebra; QA 155.7.E4 is Algebra – Electronic Data Processing. Unknown item search in OPAC – WebVoyage by LC classification browse in a bib record is still a challenge. Why?
  • Only the print collection can be collocated using call number browse. E-journal collections cannot be browsed even though an LC classification exists in the 050 field of the bib record: in Voyager, the call number index and browse are built from MARC holdings field 852$h rather than from the classification number in the bib record. Collocating print, electronic, and other types of collections under LC classification is still a challenge. Serials Solutions supplies subject categories for its full-text e-journal A-Z list using LC classification. A workaround can be applied to the Serials Solutions MARC title list by using the same subject category scheme as the one used for the full-text e-journal A-Z list. The note fields in slide number 21 indicate the implications for ILS operations, among others.
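The call number browse problem in the note above can be sketched in code. This is a minimal illustration under stated assumptions, not how Voyager builds its index: the records are simplified dictionaries rather than real MARC data, and the keys `852h` and `050a`, the sample titles and call numbers, and all function names are hypothetical. The sketch falls back to the bib-level 050 classification when a record has no holdings call number, so that print and electronic titles can be browsed together within QA 150-272.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical, simplified records. "852h" stands in for the holdings call
# number and "050a" for the LC classification in the bib record.
SAMPLE_RECORDS = [
    {"title": "Print algebra monograph", "852h": "QA155.7.E4"},
    {"title": "E-journal on computer algebra", "050a": "QA155.7.E4"},
    {"title": "E-journal on combinatorics", "050a": "QA164"},
    {"title": "Print chemistry text", "852h": "QD31.3"},
]


def parse_lc(call_number: str) -> Optional[Tuple[str, float]]:
    """Split an LC call number such as 'QA 155.7.E4' into ('QA', 155.7)."""
    match = re.match(r"\s*([A-Z]{1,3})\s*(\d+(?:\.\d+)?)", call_number.upper())
    if match is None:
        return None
    return match.group(1), float(match.group(2))


def browse_class(record: Dict[str, str]) -> Optional[str]:
    """Prefer the holdings call number; fall back to the bib classification."""
    return record.get("852h") or record.get("050a")


def collocate(records: List[Dict[str, str]], letters: str,
              low: float, high: float) -> List[Dict[str, str]]:
    """Return records whose class falls in the range, in shelf order."""
    hits = []
    for record in records:
        call_number = browse_class(record)
        parsed = parse_lc(call_number) if call_number else None
        if parsed and parsed[0] == letters and low <= parsed[1] <= high:
            hits.append((parsed, record))
    hits.sort(key=lambda pair: pair[0])  # stable sort on (letters, number)
    return [record for _, record in hits]


if __name__ == "__main__":
    for rec in collocate(SAMPLE_RECORDS, "QA", 150, 272):
        print(rec["title"])
```

The design choice here is only the fallback order: a holdings call number wins when present, and the bib classification fills the gap for electronic titles that have no shelf location.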
  • Full-text e-j portal on the library Web supports only known-item search by title and ISSN.
  • Full-text e-j portal on the library Web supports unknown item browse by subject. Two titles have been highlighted: 1) ACM communications in computer algebra; and 2) Annals of combinatorics.
  • The next few slides are examples of a page rendering from the University website in a single search box: Algebra Electronic Data Processing, Combinatorics, Henry George, Charles Wankel, etc. A search of the term “Algebra Electronic Data Processing” retrieves schools and bulletin info on the university website.
  • A search of the term “Combinatorics” retrieves related academic events, faculty info, and school bulletins from 2004. Page rendering at the University website: should we group the result list as bread crumbs or a folder structure for named entity, concept, object, event, place, timeline, etc.? What about lifecycle maintenance of the web page content?
  • A search of the term “Henry George” retrieves related academic events and a library resource info page on the university website. Should we group the “Aboutness” of “Henry George” into a sense-making page regardless of the location of the pages?
  • Could we identify works “BY” or “ABOUT” Charles Wankel on the University website by linking the University website with OCLC WorldCat Identities Services for named entity resolution? More examples of Charles Wankel page lookups from OCLC WorldCat Identities Services appear in the next few slides.
  • More examples of the result list: it supports horizontal and vertical scans, and featured recognition of the named entity “Wankel, Charles” on OCLC WorldCat Identities. Link between University Website and OCLC WorldCat Identities Services for Named Entities Resolution?
  • Link between University Website and OCLC WorldCat Identities Services for Named Entities Resolution?
  • The diagram depicts a snapshot of the information infrastructure for University resources, especially with regard to faculty and the Libraries. FocusOn Search and CategoryMap sit on top of the information discovery layer, building bridges among, e.g., faculty, university resources, and libraries. They enable us to understand who the users are and what processes are involved in information creation and consumption, especially with regard to faculty. Do we need more category types to mark up faculty activities, university resources, and the library? What is considered input, and what is considered output? What processes generate the output? How does information flow between processes? This diagram details the facts to collect and mark up at the contextual level.
  • This diagram indicates the flows of the systems. We aggregate content through the aggregation of technologies and distribute the content to users. Librarians deploy systems such as Collection Development, Cataloging, and LibGuide that are capable of selecting, organizing, accessing, guiding, enhancing, and distributing content to the user through technologies. Yet there are still complaints. Where is the user’s behavior context? We index tons of information and present it to the user without any filtering by, e.g., who the users are and what they are looking for.
  • On the document side, if we have a CategoryMap, it will: look up and consume vocabulary services provided by LC, NLM, OCLC, and Getty in manual and batch modes; process vocabulary and enable the choice of the appropriate form of a named entity with reference to terms clustered by applications, tagged by end users, or structured in a classification scheme; and distribute content to end users through the analysis of existing collections, activities, and users. Will it classify the users’ behavior context better?
  • There are objects to be embedded within the front end of FocusOn Search and CategoryMap. The objects selected for insertion in a Word document are: St. John’s Logo; Login/Create My Account; User preferences; Simple search and advanced search modes; Suggest; Reset; Email; Print; AskUs; Exit. The “Preview” button is expected to show the full text of ‘Search results selected’ when limiting to online only, etc. Search results can be refined by subject, and the subject can then be limited to concept only. By clicking browse CategoryMap, relationships among the highlighted subject terms about the person can be explored from OCLC Named Entities for the person. St. John’s FocusOn Search is also shown as a Google Gadget. ‘TextThis’ is the button that sends a few final result sets to a mobile phone, mocked up from North Carolina State University’s Quick Search: http://www.lib.ncsu.edu/catalog/ The ‘Save’ button means ‘Save To Bag’ for further processing. After ‘Save to Bag’, users have the choice of saving the items into ‘my library.’ The options ‘Add Note’, ‘Edit labels’, ‘Write review’, and ‘Remove’ will appear in the brief item listing display. Two books in the user’s library are selected for such a display, extracted and mocked up from Google Books. The label ‘Add note’ applies to the entire banner of the first book in the brief display. The label ‘Write review’ applies to the entire banner of the second book in the brief display. User-created labels will be indexed by CategoryMap. Two trails of bread crumbs for folder navigation are designed to integrate FocusOn Search with the existing websites of the University and Libraries. The top one sits right above the user actions for Print, Attach/RSS, Libraries, Text This, Reformat, and Gadget. It indicates the user’s path, e.g. Home > Academics & School > Libraries > Resources > Focus On > About Henry George. Clicking on the trail leads users back to the next higher level of the folder-structured trail.
The second trail, at the bottom of the page, indicates available services provided by the University, including feedback, privacy, safety, sitemap, and copyright information. Clicking on the trail leads users into the services provided by the University and the Libraries.
  • Build CategoryMap into the session configuration of the existing cataloging client, whether it is browser-based or Windows-based, for a single user. Content and record structure are validated within CategoryMap. The example shows how record structures such as Atom and Dublin Core can be accommodated and validated in such an environment, including heading types, e.g. category.
  • Client-configurable CategoryMap Connection Options to consume data services from a list of databases, e.g. WorldCat, LC authority files, NLM MeSH, Getty AAT, NLC authority file, dictionaries, and commonly used reference tools.
  • Build CategoryMap into the session configuration for general holdings library, including choice of call no. hierarchies, import and duplicated profiles, etc.
  • Build CategoryMap into the session configuration for format specific holdings library if MARC format is chosen
  • Build CategoryMap into the session configuration for a format-specific item in the cataloging client, where the item-level category is displayed as a category code, e.g. 900. Build CategoryMap into the session configuration for a format-specific item in the circulation client, where the item-level category is displayed as a category name description, e.g. Management - Tobin.
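The two display modes in the note above amount to one category table read in two ways. The following is a small sketch under stated assumptions: the pairing of code 900 with "Management - Tobin" is assumed for illustration (the slide gives them as separate examples), and the table and function names are hypothetical rather than part of any cataloging or circulation client.

```python
# Hypothetical category table; the 900 / "Management - Tobin" pairing is an
# assumption made for this illustration.
CATEGORIES = {"900": "Management - Tobin"}


def display_category(code: str, client: str) -> str:
    """Show the code in the cataloging client and the name in circulation."""
    if client == "cataloging":
        return code
    if client == "circulation":
        # Fall back to the raw code when no description is configured.
        return CATEGORIES.get(code, code)
    raise ValueError(f"unknown client: {client}")


if __name__ == "__main__":
    print(display_category("900", "cataloging"))
    print(display_category("900", "circulation"))
```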
  • Centralized catalog:
    1. Is part of the common service of the discovery layer, sitting on top of existing university information resources and Libraries on the Web, the ILS (Integrated Library System), university resource planning systems (enterprise legacy systems), teaching and learning systems, and discipline-specific research repositories at the institutional and regional levels once the systems are implemented at full scale;
    2. Provides interfaces for human-machine and machine-machine communication, interaction, collaboration, problem solving, and decision support;
    3. Provides an inventory of structured data (XML, RSS, Atom) and unstructured data (email, web pages, .doc, .pdf, and Excel files) via a set of metadata records. A metadata record conforming to institutional and industry standards describes the of-ness and about-ness of an information object and provides links to the object.
    Media Type: All media types in the catalog will be given descriptive metadata for media type identification, discovery, search and retrieval, and linkage.
    1. Like the rest of the collections in the catalog, they are classified for role-based access, arranged alphabetically for browsing, categorized for discovery, filtered, processed through ETL, and indexed for search and retrieval, recommended for reputation, top-ranked for analysis and other processes in the pipeline, and linked for obtaining the media object locally or mashing it up with external applications remotely via publicly available APIs on top of HTTP and an enterprise service bus within the private cloud computing environment.
    2. The administrative and structural metadata for the maintenance and manipulation of each media type (e.g. reformatting images, videos, and audio) as a media object is beyond the scope of this project at the moment.
    NAMED ENTITIES: The named entity for a person, family, or corporate body is considered an information object that comes with the following attributes when appropriate:
    • Zip code, address, country;
    • Area code, phone number, device profiles, etc.;
    • Web page and email in the form of a URI;
    • Language;
    • Timeline that is specific to a named entity. For a person, the timeline refers to dates associated with the person’s birth, death, and period of activity in the Gregorian calendar;
    • Category appropriate to the level of granularity of the information object, e.g. skills and specialty for a person, and correlated with: subject terms clustered by an application; controlled vocabulary such as LCSH and MeSH provided by a lookup; user-tagged terms; classification schemes such as LC classification and Dewey;
    • Association related to the about-ness of a named entity. For a person, the associated attributes include, but are not limited to, title, gender, affiliation, field of activity, occupation, and biographical information. At run time, on a search for the named entity of a person, all resources, works, expressions, manifestations, and items about the named entity will be retrieved and displayed along with the bio info of the person;
    • Association related to the of-ness of a named entity. At run time, on a search for the named entity of a person, all works, expressions, manifestations, and items created by the named entity will be retrieved and displayed based on the content model for rendering;
    • Relationships between named entities for persons, families, and corporate bodies are tagged, mapped, grouped, and visualized according to user-tagged terms, association rules, classification, and user profiles specified in a web form during initial registration. A user can also modify such relationships manually.
    The backend systems will recommend additional relationships by running a recommendation engine on behalf of the user;
    • Top-ranked for other processes in the pipeline, e.g. supporting collection development decisions, user and collection performance analysis, and query expansion;
    • Like media types, a specific named entity, e.g. a person, will be linked and mashed up to obtain the about-ness and of-ness of the person, locally and remotely, via publicly available APIs on top of HTTP and ESBs within the private cloud computing network;
    • Privacy, copyright, and information security, including opt-in and opt-out options for the named entities to be exposed and shared across the enterprise;
    • The output of the focused page can also be rendered for import and export, RSS, preview, citation list generation, sharing, printing, email, and texting in user-defined formats and devices.
    Other entities such as concept term, object name, event name, and geographic name will carry system functionality and capability similar to those of the named entities for persons, families, and corporate bodies. At run time, given a concept term, for instance, the works, expressions, manifestations, and items related to the concept term will be retrieved and displayed regardless of structure, media type, format, repository, etc., according to the classification of the documents, controlled vocabulary, role-based access, and content models for rendering. At run time, the relationships between the concept term and its broader terms, narrower terms, used terms, etc. can be exposed and consumed by other applications, which might take them as input for choosing and validating the form of a name or subject and for assigning classification and subject terms to resources, in addition to developing and maintaining the vocabulary for categories.
    The search facility in FocusOn Search will suggest spelling corrections based on patterns, rules, keywords, phonetics, synonyms, dictionaries, and controlled vocabulary within one dialogue box in a single interface. It will also suggest categories that would facilitate discovery, based on statistical analysis of queries, documents, user profiles and activities, usage, and vocabulary services consumed from other vocabulary service providers. For geographic names, zip code and area code processing will be part of the application where applicable. Ideally, Google Map API lookup should be supported as well where applicable.
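The spelling suggestion described above can be sketched with fuzzy matching against a controlled vocabulary. This is a minimal illustration, assuming a tiny in-memory term list: the `VOCABULARY` contents, the `suggest` function, and the 0.6 similarity cutoff are hypothetical, and the similarity measure from Python's standard `difflib` module stands in for the patterns, rules, phonetics, and synonym handling the notes mention.

```python
import difflib
from typing import List

# Hypothetical controlled vocabulary; a real deployment would consume terms
# from a vocabulary service rather than a hard-coded list.
VOCABULARY = ["Algebra", "Combinatorics", "Single tax", "Land, Nationalization of"]


def suggest(query: str, vocabulary: List[str], limit: int = 3) -> List[str]:
    """Return vocabulary terms that closely resemble the query, best first."""
    # Compare case-insensitively, then map back to the authorized spelling.
    by_lower = {term.lower(): term for term in vocabulary}
    matches = difflib.get_close_matches(
        query.lower(), list(by_lower), n=limit, cutoff=0.6
    )
    return [by_lower[match] for match in matches]


if __name__ == "__main__":
    print(suggest("combinatrics", VOCABULARY))
```

Matching against the controlled vocabulary, rather than a general dictionary, means every suggestion is already a term the index can answer.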
  • Fine-grained taxonomy management is important not only for subject searches but also for mission-critical operations at the University and Libraries. For the Libraries, for example, it is important to make informed decisions about what we are doing and how well we are doing it through baselining and reporting on user services, collection management, circulation, acquisitions, cataloging, etc. The CategoryMap application, along with its program, will link these processes across the units of the Libraries and the University. Therefore, it is our job to maintain such a taxonomy for the reuse and sharing of enterprise-wide information resources among ERP systems, the ILS, institutional repositories, etc., in conformance with institutional and industry standards. The CategoryMap will serve as the backbone of an enterprise’s common data services, in addition to the time of day and locations. The CategoryMap will manage category terms, which can take the form of a concept, object, event, or place, harmonized from subject terms that are:
    • Clustered by an application;
    • Looked up through controlled vocabularies such as LCSH, MeSH, and AAT;
    • Tagged with user-defined terms;
    • Structured by LC and Dewey classification;
    • Referenced directly from the fund expenditure structure in acquisitions;
    • Analyzed based on usage statistics reports aggregated from circulation, content suppliers, etc., and on the number of documents/objects likely to carry the category term;
    • Managed in a knowledge base for vocabulary filtering, mapping, ETL, etc., and in a data warehouse for data mining.
    The search facility will also handle query processing in relational database management systems and ontological database management systems. Relationships between concepts, objects, events, and geographic names are constructed according to controlled vocabularies developed by LC, NLM, and Getty.
    All named entities in a metadata record, such as personal name (PN), family name (FN), corporate name (CN), concept term (CT), object name (ON), event name (EN), geographic name (GN), and timeline (TN), will have their own authority records stored and maintained centrally in a logical/physical name resolver facility distributed globally on the Web by authorized vocabulary service providers such as LC, OCLC, the British Library, and the National Library of Canada. Named headings in the authority records at a name resolver facility such as OCLC WorldCat are:
    • Constructed in conformance with tagging standards and rules;
    • Contributed by a community of users who have defined their roles and responsibilities in service contribution and consumption, and who have registered and exposed their services with major vocabulary service providers;
    • Validated by templates, encoding levels, schemas, name authority files, controlled vocabularies, reference tools, and business rules;
    • Governed for the enforcement of policies, service level agreements (SLAs), operational level agreements (OLAs), service reconciliation, service lifecycle management, compliance, SSO (Single Sign-On), etc.;
    • Monitored, measured, and reported for information quality, fiduciary, and security.
    The CategoryMap application will perform dynamic lookup or batch processing for named entities and subjects in a name resolver facility via Web services for service consumption.
    Terms tagged by users in this manner will be reviewed, card-sorted, and integrated into a master list of commonly used vocabulary before they are contributed to the vocabulary service providers when appropriate. The application will map a user-tagged term for an object to its variant name, preferred form of name, and default form of name, as appropriate to the user’s choice, according to statistical processing, tag-based ranking algorithms, and other methods. See the references for the information criteria defined by the COBIT Conceptual Framework and the ISACA Model Curriculum.
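The mapping of user-tagged terms to a preferred form, with unmatched tags queued for review, can be sketched as follows. This is an illustration under stated assumptions rather than the CategoryMap design: the authority table is a hand-written stand-in for a name resolver service, the variant forms listed for Henry George are invented for the example, and simple frequency counting stands in for the statistical and tag-based ranking the notes refer to.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical authority data: preferred heading -> illustrative variants.
AUTHORITY = {
    "George, Henry, 1839-1897": ["Henry George", "George, Henry"],
}


def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so trivially different tags match."""
    return " ".join(term.lower().split())


def build_index(authority: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every normalized preferred or variant form to its preferred form."""
    index = {}
    for preferred, variants in authority.items():
        for form in [preferred, *variants]:
            index[normalize(form)] = preferred
    return index


def resolve_tags(
    tags: Iterable[str], index: Dict[str, str]
) -> Tuple[Dict[str, str], List[Tuple[str, int]]]:
    """Resolve known tags; rank unknown tags by frequency for human review."""
    resolved = {}
    unresolved = Counter()
    for tag in tags:
        preferred = index.get(normalize(tag))
        if preferred is None:
            unresolved[normalize(tag)] += 1
        else:
            resolved[tag] = preferred
    return resolved, unresolved.most_common()


if __name__ == "__main__":
    user_tags = ["henry  george", "Single Tax", "single tax", "George, Henry"]
    print(resolve_tags(user_tags, build_index(AUTHORITY)))
```

The review queue is the part that matters for the workflow in the notes: frequently used but unmatched tags are the candidates for card-sorting into the master vocabulary.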
  • There are two tiers: 1) Cloud tier: user processes on the internet (OS for Browser); 2) Vocabulary tier: document processes on the intranet (OS for Windows). Sync the desktop application from both tiers.
  • DFD (Data Flow Diagram) Context Level for FocusOnSearch
  • Reference: ER Diagram for RDA Taxonomy: High-Level Relationship Among Entities by IMT (Information Management Team). 1. Uncontrolled access points, explanatory headings, community-generated tags, etc. are excluded from the diagram.
  • Example of the named entity Person: George, Henry, 1839-1897, using the LC Authority File.
  • Example of books about Henry George marked up in MARC 600 field. The personal heading has been established in LC Authority File.
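The “by” versus “about” distinction that the bib records carry can be shown with a short sketch. It assumes simplified dictionary records keyed by MARC tag (100 for the main personal name entry, 600 for a personal name used as a subject, 245 for the title); real records would be parsed with a MARC library, and the function name and the titles marked hypothetical are made up for illustration.

```python
from typing import Dict, List, Tuple

HEADING = "George, Henry, 1839-1897"  # heading form used in the slide

# Simplified stand-ins for bib records; only three MARC fields are modeled.
RECORDS = [
    {"245": "Progress and poverty", "100": HEADING, "600": []},
    {"245": "A study of Henry George (hypothetical title)",
     "100": "Doe, Jane", "600": [HEADING]},
    {"245": "Unrelated work (hypothetical title)",
     "100": "Doe, John", "600": []},
]


def works_by_and_about(
    records: List[Dict], heading: str
) -> Tuple[List[str], List[str]]:
    """Split titles into those by the person (100) and about the person (600)."""
    by = [rec["245"] for rec in records if rec.get("100") == heading]
    about = [rec["245"] for rec in records if heading in rec.get("600", [])]
    return by, about


if __name__ == "__main__":
    by, about = works_by_and_about(RECORDS, HEADING)
    print("By:", by)
    print("About:", about)
```

Because both fields carry the same authorized heading, one exact-match lookup serves both lists, which is what a keyword search engine without field markup cannot do.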
  • This ER Diagram indicates the entity relationships among named entities and subjects (e.g. concept, object, event, place). Reference: ER Diagram for RDA Taxonomy: High-Level Relationship Among Entities by IMT (Information Management Team) (4 of 8)
  • LC subject authority indicates relationship between topical headings – ‘Single tax, Land, Nationalization of, etc.’
  • Here is how it is marked up in the authority file.
  • Reference:ER Diagram for RDA Taxonomy: High-Level Relationship Among Entities by IMT (Information Management Team) (7 of 8)
  • Here is refined search by subject.
  • System Flow Chart for FocusOn Search and CategoryMap: Info Sharing Processes for Enterprise Wide Information Discovery
  • The CategoryMap has to leverage a vocabulary framework such as Topic Map as a formal taxonomy building block that sits on top of commonly used thesauri such as LC LCSH, NLM MeSH, and Getty AAT. In addition, it presents the topic map and other vocabulary processing features to FocusOn Search in the discovery layer. On the one hand, we will leverage existing vocabulary frameworks such as OCLC WorldCat Identities by developing service consumption applications; on the other hand, we have to actively collaborate with others in developing the common vocabulary infrastructure for the Web.
  • 1. Info Sharing Processes for Enterprise Wide Information Discovery; 2. Maintain, Trace, Track, Analyze, Report
  • Continue to collect sample unstructured source data from the St. John’s University website, e.g. the Tobin College of Business faculty page of Dr. Charles Wankel, and integrate the page using the CategoryMap application, which is going to be integrated into the Discovery Layer for FocusOn Search.
    Continue to collect sample unstructured source data composed by a group of librarians as the libraries’ guides to events of current and future interest and published on the St. John’s University website, such as the Topic Guide titled “Focus on Henry George”.
    Continue to collect sample unstructured source data from OCLC WorldCat Identities Services for Named Entities Resolution, using the LCCN as the identifier to locate the personal name page for Dr. Charles Wankel.
    Continue to collect sample structured data to be syndicated from Google Books by FocusOn Search, using Henry George’s “Dreamer or Realist” as a use case for developing the detailed display of a selected item in the front end of FocusOn Search.
    To syndicate data feeds from the university resources and Libraries on the Web, the ‘Attach’ button would allow the system to obtain HTML pages and their associated files (e.g. PDF, Excel, Word) from sites recommended by the discovery layer of FocusOn Search. The file filtering layer prebuilt within FocusOn Search will automatically convert the native pages into format-independent files, ready to be reviewed, processed through ETL (Extracted, Transformed, and Loaded), and integrated with the repositories of FocusOn Search. A plug-in metadata conversion utility will capture the attached metadata and convert it into a centralized metadata repository for the entire discovery layer, ready to be reused by other applications.
    The ‘RSS’ button is going to store dynamic content from the web. Special change management strategies, packages, and techniques have been evaluated, e.g. Rational Asset Manager for SOA services.
    ‘Reformat’ is an export facility that presents users with output options for further processing, e.g. RefWorks.
    All the cataloged resources are expected to have a zip code lookup function, to be interfaced with Google Map, and to be localized in the way the systems behave in OCLC Open WorldCat (http://www.worldcat.org/). Such visualization features are expected to be performed after final refinement.
    Two sample result sets indicate that the discovery layer of FocusOn Search will send open API requests to a list of service providers, dynamically determining the appropriate copy to present if there are multiple choices and the appropriate format template to use for rendering, based on the following criteria: a) criteria predefined by the users; b) pre-processed OpenURL links according to known contracts, service level agreements, and trust management; c) patterns, heuristic rules, statistical analysis, and data mining of resources, users, activities, etc. in the data warehouse and the knowledge base of the discovery layer.
