
A New Concept In Search



By John Challis, CEO, CTO, Concept Searching

With the exponential increase in unstructured information, enterprises are seeking new ways not only to improve the search and retrieval process but also to identify tools to manage, capitalize on and leverage their information assets to improve organizational performance. Moving beyond keyword identification and traditional taxonomy approaches, the use of compound term processing, or identifying “concepts in context,” effectively addresses the issue of managing unstructured content and enables organizations to more effectively find, organize and manage their information capital.

Compound term processing automatically identifies the word patterns in unstructured text that convey the most meaning and uses these higher order terms to improve precision with no loss of recall. The algorithms adapt to each customer’s content and they work in any language regardless of vocabulary or linguistic style. The technology was originally developed by Concept Searching in 2002 and is similar in many ways to the “phrase-based indexing” techniques detailed in various U.S. patents filed in 2004 and to which Google subsequently acquired the rights.

Keyword Search versus Concept Search

Knowledge workers need to identify content in the context of what they are seeking. The fundamental problem with most enterprise search solutions, and all statistical search solutions, is that they are based on an index of single words. Yet most queries are expressed in short patterns of words and not single words in isolation, which are highly ambiguous.

A concept search engine can isolate the key meaning that is normally expressed as proper nouns, noun phrases and verb phrases. Although linguistic products can do this, their performance is highly variable depending upon the vocabulary and language in use. A statistical-based, language-independent concept search can accept queries in natural language with the user typing words, phrases or whole sentences. The system then analyzes the natural language query to extract the keywords and phrases to identify the main concepts and retrieve content that is highly relevant.

Precision and recall are the two key performance measurements for information retrieval. Precision is the retrieval of only those items that are relevant to the query. Recall is the retrieval of all items that are relevant to the query. Yet most information retrieval technologies are less than 22% accurate for both precision and recall. The ideal goal is to have them balanced. Compound term processing has the ability to increase precision with no loss of recall.

Managing Content

Taxonomy development and maintenance has traditionally been a laborious and on-going challenge, not to mention costly. The most effective approach is to use rules-based categorization, providing enterprises complete control of rules-based descriptors unique to their organization. Since all rules can be defined and managed, error-prone results utilizing “training” algorithms typically found in other approaches are eliminated.

A concept-based automatic classification process identifies, during indexing, the categories each document belongs to. Each category is identified by a unique descriptor and is associated with key descriptive words and/or phrases held in the database. This approach enables a rapid implementation of a corporate taxonomy with all documents classified to multiple nodes at index time. Ideally, the taxonomy can be used to browse the document collection or as a filter when running ad hoc searches.

An easy-to-use taxonomy and automatic classification tool creates the framework to classify content based on concepts to one or more nodes in the taxonomy. Features that enable subject-matter experts to interact with the taxonomy can simplify on-going maintenance. For example, automatically generating compound-term clues from the document corpus, dynamically showing the effect of changes on the taxonomy, and weighting classes according to their parent, child and sibling nodes can reduce taxonomy development and on-going maintenance by 66%-80%.

Semantic Metadata Generation

Metadata generation is a growing concern in large enterprises. A comprehensive approach that requires more than syntactic metadata and that requires end users to add rich metadata is haphazard and subjective at best. Since the suggested approach is no longer restricted to keyword identification, compound-term metadata can be automatically generated either when the content is created or ingested. The generation of metadata based on concepts extracts compound terms and keywords from a document or corpus of documents that are highly correlated to a particular concept. By identifying the most significant patterns in any text, these compound terms can then be used to generate non-subjective metadata based on an understanding of conceptual meaning.

Compound-term processing is a new approach to an old problem. Instead of identifying single keywords, compound-term processing identifies multi-word terms that form a complex entity and identifies them as a concept. By forming these compound terms and placing them in the search engine’s index, the search can be performed with a higher degree of accuracy because the ambiguity inherent in single words is no longer a problem. As a result, a search for “survival rates following a triple heart bypass” will locate documents about this topic even if this precise phrase is not contained in any document. A concept search using compound-term processing can extract the key concepts, in this case “survival rates” and “triple heart bypass,” and use these concepts to select the most relevant documents.

Compound-term processing can address many challenges facing large enterprises and provide many benefits. Identifying concepts within a large corpus of information removes the ambiguity in search and eliminates inconsistent meta-tagging, while automatic classification and taxonomy management based on concept identification simplify development and on-going maintenance.

John Challis has had success with several ventures involving the management of unstructured data. In 1990, he founded Imagesolve International, which became the UK’s leading supplier of document image processing and workflow products. In 1995 he launched ImageFirst Office for BancTec in the US. He was also CTO at Smartlogik, the company behind the world’s first probabilistic search engine.

Providing advanced search, auto-classification, taxonomy and semantic metadata-tagging solutions, Concept Searching is the first and only statistical search and classification company that uses compound-term processing to identify concepts within unstructured content.

KMWorld, October 2008, S17
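
The article's definitions of precision and recall, and its claim that matching on compound terms can raise precision without lowering recall, can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the four-document corpus, the relevance judgments, and the function names are hypothetical, and the naive phrase matching stands in for, and is not, Concept Searching's actual algorithm.

```python
# Toy illustration only: corpus, relevance labels and matching rules are hypothetical.

def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical corpus: d1 and d2 are about the topic, d3 and d4 only share words with it.
DOCS = {
    "d1": "survival rates after triple heart bypass surgery",
    "d2": "triple heart bypass patients and long-term survival rates",
    "d3": "bypass road construction rates in the city",
    "d4": "heart of the city triple jump event",
}
RELEVANT = {"d1", "d2"}

def keyword_search(keywords):
    """Retrieve every document that contains ANY of the single keywords."""
    return {doc_id for doc_id, text in DOCS.items()
            if any(word in text.split() for word in keywords)}

def compound_search(phrases):
    """Retrieve documents that contain ALL phrases as contiguous word sequences."""
    return {doc_id for doc_id, text in DOCS.items()
            if all(f" {phrase} " in f" {text} " for phrase in phrases)}

keyword_hits = keyword_search(["survival", "rates", "triple", "heart", "bypass"])
compound_hits = compound_search(["survival rates", "triple heart bypass"])

print("single keywords:", precision_recall(keyword_hits, RELEVANT))
print("compound terms: ", precision_recall(compound_hits, RELEVANT))
```

On this toy data the single-keyword search also retrieves the two off-topic documents, so its precision is lower while recall is unchanged. A real system would need tokenization, ranking and a way to discover the compound terms automatically, none of which this sketch attempts.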