Indexed items in an electronic collection allow both higher recall and greater precision in search returns. How can this feature be implemented in a SharePoint Collaboration environment?
Our example for this discussion is the electronic collection of in-house documents – meeting minutes, committee proposals, reports to colleagues and to the membership, best practice documents, etc. – of the professional association ASRT. The taxonomy includes a collection of terms – single words or short phrases that represent the concepts included in the documents. Additionally, the taxonomy is organized in a hierarchy that mirrors the organization’s structure.
The terms included in a taxonomy vocabulary (also known as a thesaurus) should each represent a single meaning whenever possible. Some words, such as ‘paper’, have different meanings in different contexts, so ‘white paper’, ‘paper stock’, and ‘newspaper’ work better as concept terms because their meanings are less ambiguous than ‘paper’ by itself. A term’s meaning should be what a reader would offer as the subject (or one of the subjects) of a document when describing its content.
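To make the idea of unambiguous concept terms concrete, here is a minimal Python sketch of a thesaurus with preferred terms, synonyms, and a broader-term hierarchy. It is an illustration only: the `Term` class, the function names, and the sample terms (including the "Publications" parent) are assumptions made for this example, and the longest-phrase matching shown is a simple stand-in, not the method used by M.A.I.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Term:
    """One thesaurus concept. All sample data below is hypothetical."""
    label: str                       # preferred term
    broader: Optional[str] = None    # parent term in the hierarchy
    synonyms: List[str] = field(default_factory=list)


def build_index(terms: List[Term]) -> Dict[str, str]:
    """Map each lower-cased label or synonym to its preferred term."""
    index: Dict[str, str] = {}
    for term in terms:
        index[term.label.lower()] = term.label
        for synonym in term.synonyms:
            index[synonym.lower()] = term.label
    return index


def suggest_terms(text: str, index: Dict[str, str], max_words: int = 3) -> List[str]:
    """Scan left to right, always trying the longest phrase first."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    found: List[str] = []
    i = 0
    while i < len(words):
        for n in range(min(max_words, len(words) - i), 0, -1):
            preferred = index.get(" ".join(words[i:i + n]))
            if preferred is not None:
                if preferred not in found:
                    found.append(preferred)
                i += n
                break
        else:
            i += 1  # no vocabulary term starts at this word
    return found


def hierarchy_path(label: str, terms: List[Term]) -> List[str]:
    """Walk 'broader' links from a term up to the top of the hierarchy."""
    by_label = {term.label: term for term in terms}
    path = [label]
    while by_label[path[0]].broader is not None:
        path.insert(0, by_label[path[0]].broader)
    return path


TERMS = [
    Term("Publications"),
    Term("white paper", broader="Publications", synonyms=["position paper"]),
    Term("newspaper", broader="Publications"),
    Term("paper stock"),
]
INDEX = build_index(TERMS)

print(suggest_terms("A position paper on paper stock costs.", INDEX))
print(hierarchy_path("white paper", TERMS))
```

Because the bare word ‘paper’ is deliberately absent from the vocabulary, the sample sentence yields only the two specific concepts, ‘white paper’ (via its synonym ‘position paper’) and ‘paper stock’.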
The ASRT taxonomy, organized by operational units, provided a structure for file organization and storage and for website navigation.
The underlying requirements for this implementation recognized that documents would be in Microsoft application formats and, for those to be published in journals, in XML format. Documents already included some metadata, such as date created, date modified, and author/creator. Existing metadata needed to be preserved, with additional metadata added: category and subject (indexing) terms to enhance the documents’ “usability” and “findability”.
SharePoint Server 2005 had already been implemented at ASRT. It includes a taxonomy feature which consists of a list of keywords that can include synonyms and weightings. Unfortunately, its implementation is cumbersome and doesn’t achieve the expected results. A solution that enhances SharePoint’s strengths was needed.
The taxonomy design was carefully planned to best suit organizational needs. The configuration of SharePoint and organization of its storage needed to reflect the considerations addressed in the taxonomy design. Additionally, the SharePoint search engine “keyword search” feature needed to be implemented to produce the enhanced search results.
The Data Harmony Machine Aided Indexer (M.A.I.) can suggest keywords. It just needed to be integrated with the SharePoint workflow to quietly “do its stuff”.
The integration had to take into consideration document use, category, format and destination properties.
The services of a Microsoft Solutions Partner, Interlink Group, were employed to produce the required SharePoint code.
Part of the project involved the conversion of various document formats into plain text. Additionally, a SharePoint web part needed to be designed to make search-by-keyword an easily requested option.
The conversion to plain text can now be done by the Sun Open Office Suite server. At the time of this project, an application had to be developed specifically for the Windows platform.
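The conversion step can be pictured as a dispatch on file type that hands each document to a format-specific extractor and returns plain text for indexing. The Python sketch below is an assumption-laden illustration, not the application built for the project: the `CONVERTERS` table and function names are invented for this example, and only XML and plain text are handled, using the standard library. Binary Office and PDF formats would need their own registered converters.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def xml_to_text(raw: bytes) -> str:
    """Drop the markup and keep the text content, with whitespace collapsed."""
    root = ET.fromstring(raw)
    return " ".join(" ".join(root.itertext()).split())


def plain_to_text(raw: bytes) -> str:
    """Decode plain text, replacing any bytes that are not valid UTF-8."""
    return raw.decode("utf-8", errors="replace")


# Hypothetical registry: one extractor per file extension.
CONVERTERS = {
    ".xml": xml_to_text,
    ".txt": plain_to_text,
}


def to_plain_text(filename: str, raw: bytes) -> str:
    """Pick a converter by extension; fail loudly for unregistered formats."""
    suffix = Path(filename).suffix.lower()
    if suffix not in CONVERTERS:
        raise ValueError(f"no converter registered for {suffix!r}")
    return CONVERTERS[suffix](raw)


sample = b"<doc><title>Minutes</title><p>Budget  approved.</p></doc>"
print(to_plain_text("minutes.xml", sample))
```

Keeping the extractors in a table means a new format can be supported by registering one more function, without touching the indexing code that consumes the text.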
Ultimately, M.A.I.’s indexing work was done at the time a document was saved (or uploaded) in SharePoint. The option for the user to review the suggested keywords before they were ‘attached’ to the document as a custom property was implemented selectively. For most users, the keyword attachment was accomplished “behind the scenes”. For editors maintaining the taxonomy, the process is visible and interactive. In that way, the taxonomy elements are continually updated and improved as the language of the field evolves.
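The save-time behaviour described above can be sketched as a small hook: suggest keywords from the document text, optionally pass them through a reviewer, and store the result as a property alongside the existing metadata. Everything in this Python sketch is hypothetical, including the `Document` class, the `index_on_save` function, the `Keywords` property name, and the `stub_suggest` stand-in for the indexer. It does not use the SharePoint or M.A.I. APIs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Suggester = Callable[[str], List[str]]
Reviewer = Callable[[List[str]], List[str]]


@dataclass
class Document:
    name: str
    text: str
    properties: Dict[str, str] = field(default_factory=dict)


def index_on_save(doc: Document, suggest: Suggester,
                  review: Optional[Reviewer] = None,
                  property_name: str = "Keywords") -> Document:
    """Attach suggested keywords as a property, leaving other metadata alone."""
    suggested = suggest(doc.text)
    # Most users have no reviewer, so suggestions are attached silently;
    # editors supply a reviewer and can change the list before it is stored.
    accepted = review(suggested) if review is not None else suggested
    doc.properties[property_name] = "; ".join(accepted)
    return doc


def stub_suggest(text: str) -> List[str]:
    """Stand-in for the indexer: returns the vocabulary terms found in the text."""
    vocabulary = ["white paper", "paper stock"]
    lowered = text.lower()
    return [term for term in vocabulary if term in lowered]


doc = Document("minutes.txt", "Draft white paper; reorder paper stock.",
               {"Author": "J. Doe"})
index_on_save(doc, stub_suggest)
print(doc.properties)
```

Passing the reviewer in as an optional callback mirrors the selective rollout: the same hook serves both the silent path and the interactive editor path.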
Linking a Thesaurus To SharePoint for Content Management
Scott Denning, Tao Liu
Access Innovations, Inc.

ASRT Taxonomy
• American Society of Radiologic Technologists
• Membership organization, more than 100,000 members
• Access Innovations, Inc.
• Taxonomy to encompass
  – Knowledge domain
  – Organizational structure

ASRT Taxonomy
• Intent was to have the taxonomy serve both as a structure for indexing documents, and eventually as a tool which would facilitate keyword suggestion for documents at time of generation.
• Thus, terms needed to be linked to content, as well as descriptive of content

ASRT Taxonomy
• Not just for indexing, but in support of total content management of documents from many different sources

Requirements
• Use metadata from existing documents, as well as providing/suggesting metadata for created documents
• ASRT is a “Microsoft shop”
• Support storage as XML documents
• MS Office 2003, XML support features
• SharePoint™

SharePoint
• Supports taxonomies, but does not provide taxonomies
• SharePoint’s strengths are collaboration, version control, and searching.
• Provides some basic hierarchical structure:
  – Categories
  – Keywords
  – “Best Bets”

The Challenges:
• Integrate ASRT taxonomy with SharePoint, allowing users to exploit familiar features while capitalizing on the hierarchical structure of the taxonomy.
• Use M.A.I.™ (Machine Aided Indexer) to suggest terms from the taxonomy as keywords at the time of document generation.

The Challenges – cont’d
• M.A.I. to run quietly in the background until needed
• Provide/suggest indexing terms as document is versioned or finalized

Requirements
• Encompass full trajectory of documents: creation – search – repurposing – archiving
• Broad range of documents – administrative, accounting, archival, educational, etc.
• Different document formats
• Flexible for content management

Interlink
• Colorado-based group specializing in technology architecture, including SharePoint

M.A.I. Considerations
• M.A.I. is a text-based tool; documents are in many formats
• Should allow familiar SharePoint search features to be used, while also suggesting indexing terms/keywords

Access work
• Programs written to allow M.A.I. to handle documents in different formats:
  – Word (.doc)
  – Excel (.xls)
  – PowerPoint (.ppt)
  – Portable Document Format (.pdf)

The Future?
• SharePoint/M.A.I. used to identify “expert users” within ASRT, based upon congruency of individuals’ keyword usage with taxonomy terms
• M.A.I. embedded within/merged with other programs, using versions of code written for this project
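
One possible reading of the “expert users” idea is to score each person by how much of their keyword usage coincides with taxonomy terms. The Python sketch below is speculative: the slides do not define the measure, so the `congruency` ratio, the function names, and the sample users and terms are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def congruency(user_keywords: List[str], taxonomy_terms: Set[str]) -> float:
    """Fraction of a user's keyword uses that are taxonomy terms (0.0 if none)."""
    uses = [keyword.lower() for keyword in user_keywords]
    if not uses:
        return 0.0
    hits = sum(1 for keyword in uses if keyword in taxonomy_terms)
    return hits / len(uses)


def rank_users(usage: Dict[str, List[str]],
               taxonomy_terms: List[str]) -> List[Tuple[str, float]]:
    """Order users by congruency, highest first; ties broken by name."""
    terms = {term.lower() for term in taxonomy_terms}
    scores = {user: congruency(keywords, terms) for user, keywords in usage.items()}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical keyword usage gathered from each user's documents.
usage = {
    "avery": ["radiography", "dosimetry", "misc"],
    "blake": ["radiography", "budget"],
    "casey": [],
}
print(rank_users(usage, ["Radiography", "Dosimetry"]))
```

A per-term variant (counting how often each user applies one specific term) would point to experts on a given subject rather than overall alignment with the vocabulary.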