Research: Developing an Interactive Web Information Retrieval and Visualization System



Developing an Interactive Web Information Retrieval and Visualization System
Research Project, Master Artificial Intelligence, Department of Knowledge Engineering
R. Atachiants, J. Meyer, J. Steinhauer, T. Stocksmeier
June 24, 2009
Abstract

Finding needed information (images, articles or other media) is not always as simple as going to a search engine. This paper aims at developing an interactive presentation system able to cope with the challenges of live presentation: speech recognition, information retrieval, filtering of unstructured data, and its visualization. Speech recognition is achieved using Microsoft SAPI; information is retrieved using various text mining techniques to derive the most relevant keywords, and then a search using expanded Google queries is performed. The information is then filtered: HTML tags are removed and text summarization is performed to extract the most relevant information. The system has been tested and performs satisfactorily, showing that such interactive systems can be built with modern hardware and faster internet connections, although several challenges remain and some improvement is required to create a smooth presentation experience. This paper also presents a novel approach to query expansion using Cyc concepts, based on a knowledge base; it presents various query expansion patterns such as Or-Concept and Or-Most-Relevant-Alias and their respective results. It also presents two word-matching algorithms that try to match a word to the most relevant concept, using WordNet similarity measurement to accomplish this at the semantic level. It shows that the Or-Concept pattern improved query precision by a factor of about 3.
Contents

1 Introduction
2 Web Retrieval
  2.1 Architectural Overview
  2.2 Speech Recognition
  2.3 Keyword Extraction
  2.4 Query Construction
    2.4.1 WordNet Similarity Measurement
    2.4.2 Cyc Concepts Matching
    2.4.3 Query Expansion
    2.4.4 Query for Google Search
  2.5 Results Filtering
    2.5.1 Retrieving Unstructured Web Pages
    2.5.2 Summarizing the Web Content
  2.6 Visualization
3 Tests and Results
  3.1 Speech Recognition
  3.2 Keyword Extraction
  3.3 Query Construction and Expansion
  3.4 Results Filtering
4 Discussion
5 Conclusions
A Appendix: Query Expansion using Cyc
  A.1 WordNet Similarity Measurement
  A.2 REST API
  A.3 Cyc Concepts Matching
    A.3.1 DisplayName Matching Algorithm
    A.3.2 Comment Matching Algorithm
    A.3.3 Experiments with both algorithms
  A.4 Query Expansion Patterns
  A.5 Results and Discussion
    A.5.1 Google Search on Wikipedia KB
    A.5.2 Lemur Search in AP Corpus
1 Introduction

Why is there a need for an Interactive Web Information Retrieval and Visualization System? A person who is teaching at school, at university or somewhere else is often required to present various types of information (his own knowledge, texts written by others, pictures, diagrams, etc.) more or less at once. In order to do this, he will collect his data beforehand, arrange it and set up a presentation using an ordinary presentation tool such as Microsoft PowerPoint. During his lecture, he will start the presentation, showing one slide after another and explaining what can be seen on it.

Compared to what lessons were like some decades ago, this is of course a great increase in the possibilities of how information can be presented and how the listeners' attention can be kept. But there are still a lot of problems with this approach. First, all the work has to be done beforehand, resulting in a very long time one has to spend on the presentation before being able to show something useful. Another problem is that the prepared presentation is very static. Presentations about topics that change over time will have to be updated again and again with new information and media from the web. Further, during the presentation one may learn that the audience's knowledge about some of the covered subjects is different from what one had thought before. If the listeners know more than expected, that is not a real problem: some slides can be skipped. But what if they know less than expected? It is impossible to make new slides during the presentation, so either the problem would have to be ignored or one would have to search for the missing information during the presentation.

Another hardship with the usual presentations is the research itself. Finding information is nowadays very easy. One just goes to Google or another search engine, enters what he is looking for and looks at the first results until he finds what he is looking for. But it is not always so easy, because one does not always know what he is searching for and where he can find it. In addition, one often receives a lot of results whose precision is too low to find the relevant information quickly.

In our project, we want to meet all these problems and difficulties. Our aim is to develop a system that makes the research easier and more efficient and also increases the quality and changeability of the presentation.

To enhance the research itself, the program should increase the precision of the search by searching for concepts instead of keywords. If done properly, the results of the search are all relevant for the topic and can be used in some way. This can reduce the time spent on evaluating each of the search results. In addition, the texts on the resulting pages can be summarized, further reducing the time needed to evaluate the results. Another possibility is to do the usual web search and the image search at once, because e.g. tables can be found in both formats.

To make the system usable during the presentation itself, the search query should be constructed not solely from keyboard input but also directly from spoken text, in a way that lets the user choose the currently wanted way of communication. The results have to be displayed such that the important parts are directly visible, while other things are suppressed. The user of the system has to be able to decide which contents (texts, images, etc.) are shown and which are not.

To meet all these requirements, the system has to perform the following steps:

1. Recognize the spoken text
2. Retrieve keywords from the text
3. Create a powerful search query
4. Filter and summarize the results
5. Visualize the results in a proper way

These steps will be further described in the next chapters.

2 Web Retrieval

2.1 Architectural Overview

The spoken words of the presentation are recognized and translated by a speech recognition engine. The output of the speech recognition is fed to a keyword extractor to create the keywords of the text. These are used by a query construction module to produce an internet search query. The image results of the search are directly transferred to the visualisation module. The text results are summarized by a summarizer module and then also displayed by the visualisation module. The workflow is shown in figure 1.

Figure 1: workflow

2.2 Speech Recognition

The user of the information retrieval system has several possibilities of interaction with the system, but the most important way of communication is speech. It is used to get a starting point as well as to define the wanted content more precisely. Thus, our application needs a well-working speech recognition engine to "understand" as much of the spoken text as possible. It would be interesting to write our own speech recognition engine, but the effort for this is too high for our project. So the main task here is to choose a feasible speech recognition engine and to use it in a proper way.

There are several speech recognition engines available, with different strengths and weaknesses. For example, the program "Dragon Naturally Speaking" is an established commercial software product that works quite well. But being commercial, it is very expensive and thus not the right solution for a research project. Besides the one mentioned, there are several other commercial products that include a speech recognition engine. Another group of engines are open-source freeware tools, found spread on the web. Unfortunately, the tools tested turned out not to have the efficiency and the quality that we would need for the project. So we decided to use a product of the third group: speech recognition engines built into operating systems. These engines are likely to have high quality and good performance, while they do not have to be bought separately, because they are included in an operating system that everyone has to buy anyway.
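Putting the components of sections 2.1 and 2.2 together, the workflow of figure 1 can be made concrete with a short sketch in which every stage is a stub standing in for the real component (the speech engine, SharpNLP, the Cyc-based expansion, Google, and the summarizer). All names and behaviour here are illustrative only:

```python
# Illustrative end-to-end pipeline sketch; each stage is a stub for the real
# component (SAPI dictation, SharpNLP keyword extraction, Cyc query
# expansion, Google search + summarizer). Names are hypothetical.

def recognize_speech(audio_chunk):
    # Stand-in for the SAPI dictation mode: audio in, plain text out.
    return audio_chunk["transcript"]

def extract_keywords(text):
    # Stand-in for the SharpNLP-based extractor (section 2.3).
    stop = {"the", "a", "of", "is", "are", "in"}
    return [w.strip(".,").lower() for w in text.split()
            if w.strip(".,").lower() not in stop]

def build_query(keywords):
    # Stand-in for the Cyc-based Or-Concept expansion (section 2.4).
    return " + ".join(keywords)

def search_and_summarize(query):
    # Stand-in for the Google search, filtering and summarizer steps.
    return {"query": query, "summary": "...", "images": []}

def pipeline(audio_chunk):
    text = recognize_speech(audio_chunk)
    keywords = extract_keywords(text)
    query = build_query(keywords)
    return search_and_summarize(query)

result = pipeline({"transcript": "The human body is a complex system"})
print(result["query"])  # human + body + complex + system
```

In the actual system the stages run asynchronously (see section 2.6); the linear chaining above only shows the data flow.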
After some tests and some problems with the usage of some of the mentioned software products, we chose the Windows Speech Recognition API (SAPI 5.1) [9], [10] as a part of our project. The Windows speech module offers different kinds of functionality, including a dictation mode (besides others such as voice commands and text to speech). In this dictation mode, the spoken words are simply put into plain text, which can be used by other programs via the programming interface [8].

2.3 Keyword Extraction

For the keyword generation from an input text, the open-source natural language processing tool SharpNLP is used [3]. SharpNLP is a collection of natural language processing tools written in C#. It provides the following NLP tools [4]:

• a sentence splitter (to identify sentence boundaries)
• a tokenizer (to find tokens or word segments)
• a part-of-speech tagger
• a chunker (to find non-recursive syntactic annotations such as noun phrase chunks)
• a parser
• a name finder (to find proper names and numeric amounts)
• a coreference tool (to perform coreference resolution)
• an interface to the WordNet lexical database [14]

This toolkit turned out to be tricky to set up, but subsequently offers good performance and useful features. To generate keywords, the following features are used:

• Sentence splitting: splitting the text based on sentence signs and additional rules (it will not split "and Mr. Smith said" into two sentences), using a model trained on English data
• Tokenizer: resolves the sentences into independent tokens, based on maximum entropy modeling [5]
• POS tagging (part-of-speech tagging): associating tokens with corresponding tags denominating what grammatical part of a sentence they constitute, based on a model trained on English data from the Wall Street Journal and the Brown corpus [7]
• POS filtering: filtering relevant word categories (nouns and foreign words) based on the POS tags (NN, NNS, NNP, NNPS, FW)
• Stemming [13]: reducing the different morphological versions of a word to their common word stem, based on the Porter stemmer algorithm [2]

The relevance rating of the keywords is calculated based on word count, actuality and word length.

2.4 Query Construction

In order to construct the query, several steps are performed:

1. each keyword is mapped to its context sentence;
2. using the keyword and its context, the keywords are matched with a particular Cyc concept;
3. the query is expanded using keywords and concepts;
4. the query is submitted to the Google search engine.

A more detailed description of concept matching and expansion can be found in Appendix A.

2.4.1 WordNet Similarity Measurement

In order to be able to compare two words or sentences, a semantic similarity measurement was needed. WordNet similarity measurement was used [11].
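The keyword-extraction steps of section 2.3 can be sketched in a few lines. This is a deliberately simplified stand-in: the real system uses SharpNLP's POS tagger to keep nouns (NN, NNS, NNP, NNPS, FW) and the Porter stemmer, whereas here a stop-word list and naive suffix stripping take their places, and the relevance ranking uses only word count and word length:

```python
from collections import Counter
import re

# Simplified keyword extractor. A stop-word list stands in for POS
# filtering and a crude suffix stripper stands in for the Porter stemmer;
# both substitutions are for illustration only.

STOP = {"the", "a", "an", "of", "is", "are", "and", "to", "in", "that", "it"}

def crude_stem(word):
    # Very rough stand-in for the Porter stemmer: merge plural/singular.
    for suffix in ("ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def extract_keywords(text, top_n=10):
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [crude_stem(t) for t in tokens if t not in STOP]
    counts = Counter(stems)
    # Relevance combines word count and word length, as in section 2.3.
    ranked = sorted(counts, key=lambda w: (counts[w], len(w)), reverse=True)
    return ranked[:top_n]

print(extract_keywords(
    "Humans build presentations. A presentation shows images, "
    "and images need captions."))
```

With the example input, "presentation" ranks first because stemming merges its singular and plural forms into one count; without stemming the two forms would compete as separate keywords, which is exactly the advantage measured in section 3.2.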
2.4.2 Cyc Concepts Matching

Cyc [12] is a very large knowledge base (KB) which tries to gather a formalized representation of a vast quantity of fundamental human knowledge: facts, rules, etc. The Cyc KB contains thousands of different assertions and can be used in many different ways. For the system described in this paper, a particular subset of Cyc has been used: concepts and different relations. In Cyc, every concept is linked to additional knowledge, such as:

• a display name (readable name of the concept)
• a comment (short description)
• general concepts, for example: HumanAdult is linked to AdultAnimal and HomoSapiens
• specific concepts, for example: HumanAdult is linked to Professional-Adult, SoftwareEngineer, ...
• aliases, for example: for HumanAdult there are "adult homo sapiens", "grownup", ...

Unfortunately the whole Cyc KB was not publicly available when our research was done, therefore the REST APIs of [6] have been used in order to interact with Cyc. More about the actual REST API implementation, as well as the explanation of the different algorithms and specific results, can be found in Appendix A.

Since Cyc contains semantic knowledge about concepts rather than words, a word-to-concept matching algorithm was created. The algorithm is built in such a way that it operates only at the semantic level, and therefore uses WordNet similarity measurement [11] (also described in section 2.4.1) to compute a similarity score between two sentences. The algorithm takes a keyword and its corresponding sentence; then, using the CycFoundation APIs, the set of relevant Cyc concepts (aka constants) is retrieved. The retrieval returns only concepts whose name contains the keyword; for example, for the keyword "human", the set of concepts would contain "HumanAdult", "HumanActivity", "HumanBody". Next, for each item in the set of concepts, the similarity score with the keyword's context (sentence) is computed and the best one is used.

The following straightforward pseudo-code implementation illustrates the approach (more about matching types in A.3):

Input: Keyword, its context (sentence) and a matching type
Output: Cyc concept matched (BestConstant)

    Constants ⇐ GetConstants(Keyword)
    BestConstant ⇐ new CycConstant()
    BestDistance ⇐ ∞
    foreach Constant in Constants do
        Distance ⇐ ∞
        if MatchingType = DisplayNameMatching then
            dK ⇐ GetDistance(Keyword, Constant.DisplayName)
            dC ⇐ GetDistance(Context, Constant.DisplayName)
            Distance ⇐ (dK + dC) / 2
        end
        if MatchingType = CommentMatching then
            dCK ⇐ 0
            Comment ⇐ GetComment(Constant)
            Keywords ⇐ GetKeywords(Comment)
            foreach CK in Keywords do
                dCK ⇐ dCK + GetDistance(Keyword, CK)
            end
            dK ⇐ GetDistance(Keyword, Constant.DisplayName)
            dC ⇐ GetDistance(Context, Constant.DisplayName)
            Distance ⇐ (dK + dC + (dCK / Keywords.Count)) / 3
        end
        if Distance < BestDistance then
            BestDistance ⇐ Distance
            BestConstant ⇐ Constant
        end
    end

2.4.3 Query Expansion

After the keywords and their respective contexts are matched to particular Cyc concepts, the actual query expansion can be done. One can think of many different possible query expansions using the additional Cyc knowledge. In our research we have chosen several expansion methods, most of them quite straightforward.
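The DisplayName branch of the pseudo-code above can be turned into a runnable sketch. The CycFoundation REST calls and the WordNet distance are both stubbed (a toy concept dictionary and a word-overlap distance), so the candidate names and scores are illustrative only:

```python
import math

# Sketch of the DisplayName matching algorithm (sections 2.4.2 / A.3.1).
# GetConstants and GetDistance are stubbed: a hypothetical subset of Cyc
# constants replaces the REST API, and 1 - Jaccard word overlap replaces
# the WordNet similarity measurement.

CONCEPTS = ["HumanAdult", "HumanActivity", "HumanBody"]  # toy candidates

def get_distance(a, b):
    # Stand-in for WordNet similarity: 1 - Jaccard overlap of word sets.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(wa & wb) / len(wa | wb)

def split_camel(name):
    # "HumanBody" -> "human body" (Cyc display names are CamelCase).
    out = ""
    for ch in name:
        out += (" " + ch.lower()) if ch.isupper() and out else ch.lower()
    return out

def match_concept(keyword, context):
    best, best_distance = None, math.inf
    for name in CONCEPTS:
        display = split_camel(name)
        d_k = get_distance(keyword, display)   # keyword vs. display name
        d_c = get_distance(context, display)   # context vs. display name
        distance = (d_k + d_c) / 2             # as in the pseudo-code
        if distance < best_distance:
            best, best_distance = name, distance
    return best

print(match_concept("human", "the body of a human"))  # HumanBody
```

The context sentence is what disambiguates: the keyword "human" alone matches all three candidates equally, but the surrounding words pull the score towards HumanBody.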
After some experiments, one particular structured query expansion was chosen to be used in the system: Or-Concept.
The algorithm constructs a structured query, which can be described using the following formula, where K is the set of keywords, k_i ∈ K and c(k) is a function which returns the matched concept for a keyword (using a matching algorithm):

    Q(K, n) = ∧_{i=1..n} ( k_i ∨ c(k_i) )    (1)

The following pseudo-code illustrates the approach used in the system in order to construct the structured query:

Input: Keywords tagged with Cyc concepts
Output: Set of expanded keywords for Google

    foreach TaggedWord in Query do
        Pattern ⇐ "( [0] OR [1] ) + "
        PatternEmpty ⇐ "[0] + "
        Keyword ⇐ TaggedWord.Word
        BestGeneral ⇐ ""
        BestDistance ⇐ ∞
        if TaggedWord.Concept ≠ null then
            GeneralConcepts ⇐ GetGenerals(TaggedWord.Concept)
            foreach General in GeneralConcepts do
                Distance ⇐ GetDistance(TaggedWord.Word, General.DisplayName)
                if Distance < BestDistance then
                    BestGeneral ⇐ General.DisplayName
                    BestDistance ⇐ Distance
                end
            end
        end
        if BestGeneral.Length = 0 then
            Pattern ⇐ PatternEmpty
        end
        Expanded.Add(String.Format(Pattern, Keyword, BestGeneral))
    end

2.4.4 Query for Google Search

In order to retrieve real data, we need to choose a search engine as well as a corpus. For several reasons, we quickly arrived at the decision to use the Google web search for our application. First of all, Google is the most popular search engine on the web. This is a great advantage, because an application programming interface (API) is required in order to make interaction between our program and the search engine possible, and the bigger and more popular a search engine is, the more likely it is that a good API can be found for it. This turned out to be only partially true. Google provides a lot of programming interfaces, most of them specialized for a specific context. So it is not easy to find the best fitting programming interface for an application, in addition to the "usual" problems one may encounter during the implementation.
Another reason to choose the Google web search is the fact that it can easily be restricted to one or several web pages. This can be done by simply adding the tag "site:", followed by the desired web page, to the keywords one wants to search for. As a result, the search only finds results that are somewhere on the indicated pages. As we do not want arbitrary results, but results from specific, serious pages, we really want to use this feature of the Google web search. If one does a web search (either with the Google engine or elsewhere), a lot of results are found on very different pages and types of pages. This is of course important, since that is what most people want when they search the web. But with the information retrieval system, the aim is different. We do not want to display content from any website that contains the keywords, but only from websites with trustable content. So a white list with the sites that will be searched is the best way of avoiding useless results. Due to the fact that the results of the web search contain lots of HTML tags and similar things that are (for our purpose) useless, some knowledge about the structure of the found results is needed. By limiting the search to defined web pages, we can use our information about these pages to parse the results, depending on the web site they were found on. In a first approach, we limited the search to articles on Wikipedia, which is a very well-known and well-accepted page that contains useful information about a lot of domains. Retrieving data from only one page may seem too little, but it is enough in order to set up a working and useful system. In addition, the system can easily be enhanced by adding other sites.
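The Or-Concept expansion of section 2.4.3 combined with the "site:" restriction just described can be sketched as a single query-building step. The keyword-to-concept pairs below are hypothetical; the real system derives the concept from Cyc generalizations:

```python
# Sketch of the Or-Concept query string (pattern "( keyword OR concept ) +")
# with Google's "site:" restriction appended. The tagged keywords are a
# hypothetical input; the real system produces them via Cyc matching.

def or_concept_query(tagged_keywords, site="en.wikipedia.org"):
    terms = []
    for keyword, concept in tagged_keywords:
        if concept:                       # pattern "( [0] OR [1] ) + "
            terms.append("( %s OR %s ) +" % (keyword, concept))
        else:                             # fallback pattern "[0] + "
            terms.append("%s +" % keyword)
    return " ".join(terms) + " site:" + site

query = or_concept_query([("human", "person"), ("body", None)])
print(query)  # ( human OR person ) + body + site:en.wikipedia.org
```

A keyword without a matched concept falls back to the plain pattern, mirroring the BestGeneral.Length = 0 branch of the pseudo-code.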
2.5 Results Filtering

2.5.1 Retrieving Unstructured Web Pages

The search on the web engine gives a list of URLs as its result. Each of the pages can be downloaded as text, but is then an HTML-formatted web page. There is some work left to do to receive only the real content. First, we want to get rid of the HTML tags and any other formatting data. This task has to be performed by applications of very different types; for example, every web browser has to use functionality that distinguishes between content that will be displayed and content that will not. It is usually best to use existing, tested and working packages, such as those found in web browsers. The Microsoft Internet Explorer performs this task using a dynamic link library (DLL) called "mshtml". This library can easily be used to parse a complex HTML file into plain text that contains only those parts that would be displayed when visiting this web page with the Internet Explorer (or another browser). This does the main part of this step.

Many web pages do not only contain the "real" information, but also a lot of other things, such as navigation links, menus, advertisements, and so on. These are displayed by the browser and thus seen by the user. If the user of the information retrieval system chooses to display one of these sites, he may want to see this content. But due to the fact that the texts from the results are summarized in the next step, we have to remove these parts from the text, because they do not contain any information that is relevant to the page itself. The format of these menus etc. depends on the web page where the result was found. The results on Wikipedia are formatted in a different way from the results on other lexicographical web pages. Thus, it is useful to handle the results depending on the web page they were found on.
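The two cleanup steps of this section, converting HTML to displayed text and cutting off the page-specific "waste" described next, can be sketched together. The real system uses the Internet Explorer "mshtml" library for the text extraction; Python's html.parser stands in for it here:

```python
from html.parser import HTMLParser

# Sketch of the section 2.5.1 cleanup: first cut the page at Wikipedia's
# <div class="printfooter"> marker (which separates the article from the
# menus and links that follow), then reduce the remaining HTML to the
# displayed text. html.parser is a stand-in for the mshtml library.

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []
    def handle_data(self, data):
        self.parts.append(data)          # keep only displayable text
    def text(self):
        return "".join(self.parts).strip()

def clean_wikipedia_page(html):
    marker = '<div class="printfooter">'
    cut = html.find(marker)
    if cut != -1:
        html = html[:cut]                # drop menus/links after the article
    extractor = TextExtractor()
    extractor.feed(html)
    return extractor.text()

page = ('<html><body><p>Article text.</p>'
        '<div class="printfooter">Retrieved from...</div></body></html>')
print(clean_wikipedia_page(page))  # Article text.
```

Note the ordering: the truncation runs on the raw HTML before tag stripping, because the printfooter marker is itself an HTML tag and would be lost afterwards.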
As the web search in this approach has been limited to only one page, namely Wikipedia, we have to consider only one page in order to remove "waste" like links and menus from it. An analysis of the structure revealed an easy way of parsing the Wikipedia articles. All articles on Wikipedia start with the article itself (apart from some HTML tags and a lot of scripts); the menus and navigation links follow after it. That means that we can simply work on the first part of the article and skip everything that follows. The edge between article and links is clearly defined by the HTML tag '<div class="printfooter">', which is contained in every article on Wikipedia. So we can simply search the results from Wikipedia for this tag and remove everything that follows it. It should be mentioned that although this parsing is applied to the same document as the parsing done by the Microsoft dynamic link library, it has to be done before that other parsing step, because otherwise one would no longer be able to find the HTML tag used as the edge.

2.5.2 Summarizing the Web Content

A document contains important, but also a lot of irrelevant, information. Additionally, the screen space to present the information is intentionally limited to allow a quick evaluation by the user. Because of this, the documents are summarized in order to present only the important information to the user. For this we use the Open Text Summarizer [1]. This tool does not abstract the document in a natural way, because it does not rephrase the text; it just produces a condensed version of the original by keeping only the important sentences. However, it shows good results for non-fictional text and can be used with unformatted and HTML-formatted text. It has received favourable mention in several academic publications and is at least as good as commercial tools such as Copernic and Subject Search Summarizer. The Open Text Summarizer parses the text and utilizes Porter stemming.
For this, an XML file with the parsing and stemming rules is used. The summarizer calculates the term frequency for each word and stores it in a list. After this, stop word filtering is performed: it removes all redundant common words from the list by using a stop word dictionary, which is also stored in the XML file. After sorting the list by frequency, the keywords are determined, which are the most frequently occurring words. The original sentences are scored based on these keywords: a sentence that holds many important words (the keywords) is given a high grade. The result is a text with only the highest scored sentences. The limiting factor can be set either to a percentage of the original text or to a number of sentences.

2.6 Visualization

Visualization is probably one of the most important parts of the system. It consists of two main parts:

1. The "Presenter-View", the view that the presenter sees on his computer during the presentation. The presenter should be able to see the dataflow, manipulate it and give simple commands such as "show this image" or "show that article". The presenter also needs a preview of what is shown to the public.

2. The "Public-View", the simplified view that shows only the needed information and mirrors the lecturer's interactions with the presenter view's pre-rendering zone.

Figure 2: Presenter's view layout structure

For the graphical user interface (GUI), several crucial specifications were defined, most importantly:

• rendering and screen mirroring should be done in real time;
• the layout of the GUI should be as simple as possible, presenting the retrieved data in the simplest and fastest way for the presenter;
• it should also be interactive: the images and articles should be manipulable in real time.

Figure 3: Real-time parallelization workflow

In order to achieve a smooth experience, several things must be done at the same time. As figure 3 shows, the two main components, rendering and information extraction, were completely separated and run in several threads. Moreover, the speech recognition part and the Google search were also separated, and all searches are done asynchronously. Therefore, the system leverages current multi-core architectures in order to achieve a significant speedup. In order to find the simplest way of presenting the data, several layout prototypes have been tested. Finally, the one shown in figure 2 and screenshots 4 and 6 has proven to be a good way of presenting the information.
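The scoring scheme described in section 2.5.2 amounts to a few lines of code. This is a simplified stand-in for the Open Text Summarizer: stemming is omitted, the stop-word list is a toy, and the most frequent remaining words serve as keywords against which each sentence is graded:

```python
from collections import Counter
import re

# Simplified frequency-based summarizer sketch (section 2.5.2): count word
# frequencies, drop stop words, take the most frequent words as keywords,
# score each sentence by how many keywords it contains, and keep the top
# fraction of sentences in their original order. Stemming is omitted.

STOP = {"the", "a", "an", "of", "is", "and", "to", "in", "it", "this"}

def summarize(text, keep_ratio=0.5, n_keywords=5):
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP]
    keywords = {w for w, _ in Counter(words).most_common(n_keywords)}
    scored = [(sum(w in keywords for w in re.findall(r"[a-z]+", s.lower())), i, s)
              for i, s in enumerate(sentences)]
    keep = max(1, int(len(sentences) * keep_ratio))
    top = sorted(scored, reverse=True)[:keep]
    return " ".join(s for _, i, s in sorted(top, key=lambda t: t[1]))

print(summarize("Cats eat fish. Dogs bark. Cats sleep a lot. Fish swim."))
```

The keep_ratio parameter corresponds to the percentage-based limiting factor mentioned above; a sentence-count limit would replace the `keep` computation with a fixed number.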
The system supports rendering to several screens; the actual mirroring and the development of the user interfaces have been achieved using the Microsoft Windows Presentation Foundation (WPF) technology. WPF is especially designed for rich user interface development and is built on top of the .NET Framework. It uses a markup language known as XAML to provide a clear separation between the code and the design definition, which also greatly helped in the parallelization process. It also features 3D and video/audio rendering capabilities and animation storyboards (similar to Flash). Therefore our system can be extended to perform video search or to do 3D-enhanced presentations. The rendering engine takes advantage of modern hardware (dedicated GPUs, ...).

Figure 4: Screenshot of the GUI; the rendering view is split automatically between several images and an article

Figure 5: Screenshot of the GUI, the public view

One of the features of our system is the dynamic layout management of the render view. As shown in figures 4 and 6, the rendering zone automatically adjusts itself depending on the quantity of content presented. With such capabilities, the system makes use of the whole screen without leaving much empty space.

Figure 6: Screenshot of the GUI; the rendering view shows only images since no articles have been selected by the lecturer

3 Tests and Results

3.1 Speech Recognition

In the context of this project it is important that the speech recognition engine has a high recognition quality and works fast, so these two aspects have been evaluated. After a two-hour training phase, the engine recognized about 70% of the spoken words correctly. The recognition rate can be improved by further training. To achieve a good recognition rate it is necessary to speak loudly and articulately and to pronounce the words always in the same way. Additionally, punctuation marks are not recognized automatically but have to be spoken explicitly. This is, however, not a natural way to speak at a presentation. After speaking some sentences, a break of a few seconds has to be made to initiate the recognition process for these sentences. The recognition process also needs a few seconds, depending on the number of words. The tests showed a recognition rate of about four words per second. Again, these frequent breaks of a few seconds are not a natural way to speak at a presentation.

3.2 Keyword Extraction

The keyword extraction module has been tested with different features and numbers of words. The computing time and the quality of the extracted keywords have been evaluated. The results are shown in figures 7 and 8. The tests have been made for the complete feature set (explained in chapter 2.3) and with the stemmer disabled. Due to the limitations of the speech recognition engine (no punctuation marks if not pronounced explicitly), the keyword extraction has also been evaluated for texts without punctuation marks.

The quality of the keyword extraction without the stemmer is generally lower than with the complete feature set. Especially if the keywords are used in the query for the internet search, the use of the stemmer shows advantages: it avoids, for example, searching for singular and plural of the same keyword, and the quality of the relevancy ranking of the keywords is also improved. The gain in computation time from disregarding the stemmer is minimal: with 1200 words the gain is only 120 ms, and for the relatively short sentences usually provided by the speech recognition engine it is even less. The tests also showed that for text without punctuation marks the quality is nearly the same as for the same text with punctuation marks.

3.3 Query Construction and Expansion

As figure 9 shows, the Or-Concept algorithm of query expansion significantly improves the precision. In that figure, a 10-word query is compared to an extended 10-word query, and at N = 5 one can see an improvement of almost 3 times. Cyc-enhanced queries give better results, but should be evaluated further (more about the different experiments and query enhancements in Appendix A).
Figure 7: Time results for an extraction of ten keywords (with different numbers of words)

Figure 8: Quality of results for an extraction of ten keywords (with different numbers of words)

3.4 Results Filtering

The summarizer module has been tested with and without the stemmer for different summarizing
percentages of the original unformatted text. The quality of the results represents their usability, which has been evaluated from the meaningfulness of the summary and its length; a shorter summary was considered more usable because it allows a quicker evaluation by the user.

Figure 9: Precision at N, 10-word query with no keyword stemming

Figure 10 shows that in these tests the best quality with the stemmer was achieved for a 15% summary. Without the stemmer the quality was usually a little lower. The calculation time with and without the stemmer for different summarizer percentages is shown in figure 11; the difference between the configurations is minimal. When comparing unformatted text and HTML-formatted text (with a similar amount of content), the tests show that the HTML-formatted text can take significantly longer to process.

Figure 10: Quality of results for different summary percentages for a text with 1200 content words

Figure 11: Calculation times for different summary percentages for a text with 1200 content words

4 Discussion

As explained in the preceding chapters, the system works in its main parts. It is clearly possible to enrich web search as well as the presentation of contents with different tools related to text mining, information retrieval, and others. But many problems also turned up that had either to be solved or to be worked around.
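The percentage-based extraction summarization evaluated in section 3.4 can be sketched in a few lines. This is a hedged sketch only: the actual module wraps the Open Text Summarizer and a stemmer, whereas this version scores sentences by plain content-word frequency.

```python
import re
from collections import Counter

def summarize(text, percent=15):
    """Keep the top `percent`% of sentences, scored by average word frequency.

    A minimal frequency-based extraction summarizer; kept sentences are
    emitted in their original document order.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    keep = max(1, round(len(sentences) * percent / 100))
    top = sorted(sentences, key=score, reverse=True)[:keep]
    return " ".join(sorted(top, key=sentences.index))
```

A 15% summary, as in figure 10, would be `summarize(article_text, 15)`; the stemming step tested in the paper could be added by normalizing tokens before counting.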
The speech recognition module did work, but still had some problems, which lie in the implementation of the software used. The speech recognition itself works fine, but the chosen engine has problems e.g. in recognizing accents, breaks and the like, which often result in an incorrectly recognized word or sentence. So either another, better engine has to be found, or another solution has to be found in order to meet the requirements.

The keyword extraction from the spoken text works well and also very efficiently. Still, this step can be enhanced. For example, the user may want to influence the chosen keywords more directly than by speaking. It would therefore be a useful feature if the user could decide manually which keywords are good and important and which keywords are not relevant. The keywords provided by the system could then be viewed as suggestions that the user can accept or decline. Another nice feature, related to this one, is a self-learning algorithm that tries to predict the user's decision.

Another part that works fine but can still be improved is the summarization of the found texts. It produces good summaries of the found web pages without consuming too much computation time. But the result could perhaps be enhanced by using features such as coreference resolution, leading to a better estimation of which referents are important and which are not. Before putting a big effort into this, however, it should be checked whether it is worth the effort.

5 Conclusions

In this paper the design and implementation of an Interactive Web Retrieval and Visualization System have been discussed. The system has proven robust and satisfies the real-time constraints of a live presentation. Tests and results have shown that both keyword extraction and query expansion return satisfying results, keeping good precision.
During the experiments it has also been shown that several parts of such a system still need improvement: speech recognition needs to be enhanced and properly pre-trained, unstructured web data should be properly cleared of all noise, and summarization can be enhanced to create more concise articles. Overall, such systems can be built with modern hardware and parallelization, and hopefully more of them will be seen as commercial products in the near future.
References

[1] Open text summarizer.
[2] Porter stemmer. martin/PorterStemmer/.
[3] SharpNLP - open source natural language processing tools.
[4] The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, page 60. Cambridge University Press, 2006.
[5] The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, page 318. Cambridge University Press, 2006.
[6] Cyc Foundation.
[7] W.N. Francis and H. Kucera. Brown corpus manual.
[8] M. Harrington. Giving computers a voice. 2006/10/31/909044.aspx.
[9] J. Moskowitz. Speech recognition with Windows XP. using/setup/expert/moskowitz 02september23.mspx.
[10] J. Moskowitz. Windows speech recognition. vista/features/speech-recognition.aspx.
[11] T. Ngoc Dao and T. Simpson. Measuring similarity between sentences. CodeProject article.
[12] Open Cyc Project.
[13] J. C. Scholtes. Text mining: preprocessing techniques, part 1.
[14] Princeton University. WordNet, a lexical database for the English language.

A Appendix: Query Expansion using Cyc

A.1 WordNet Similarity Measurement

In order to be able to compare two words or sentences, a semantic similarity measurement was needed; for this, the WordNet similarity measurement was used. The following steps are performed to compute the semantic similarity between two sentences [11]:

• each sentence is partitioned into a list of tokens and the stop words are removed;
• words are stemmed;
• part-of-speech tagging is performed;
• the most appropriate sense for every word in a sentence is found (Word Sense Disambiguation). To find the most appropriate sense of a word, the original Lesk algorithm was used and expanded with the hypernym, hyponym, meronym and troponym relations from WordNet. The possible senses are scored with a new scoring mechanism based on Zipf's law, and the sense with the highest score is chosen.
• the similarity of the sentences is computed based on the similarity of the pairs of words. To do this, a semantic similarity relative matrix is created, consisting of pairs of word senses, giving the semantic similarity between the most appropriate senses of the words. The Hungarian method is used to get the semantic similarity between sentences; its matching results are combined to compute a single similarity value for the two sentences. The matching average is used to compute the semantic similarity between two word senses: the sum of the similarity values of all match candidates of both sentences is divided by the total number of tokens.

A.2 REST API

As mentioned in section 2.4.2, the system uses REST APIs in order to access the CycFoundation Cyc KB. REST stands for Representational State Transfer, a way to build a service-oriented architecture based on HTTP and XML, generally using the GET or POST methods of the HTTP protocol. The CycFoundation web services expose only a subset of Cyc's capabilities, therefore the API implemented in our system is rather small. It contains the following queries:

• GetConstants(Keyword) - performs a search for Cyc concepts for a keyword
• GetComment(Concept) - returns a comment for a particular concept
• GetCanonicalPrettyString(Concept) - returns a simplified name for a concept
• GetDenotation(Concept) - returns a denotation for a particular concept
• GetGenerals(Concept) - returns a set of general concepts for a particular concept
• GetSpecifics(Concept) - returns a set of specific concepts for a particular concept
• GetInstances(Concept) - returns a set of instances (concepts) for a particular concept
• GetIsA(Concept) - returns a set of "is a" concepts for a particular concept
• GetAliases(Concept) - returns a set of aliases (words or phrases) for a particular concept
• GetSiblings(Concept) - returns a set of sibling concepts for a particular concept

A.3 Cyc Concepts Matching

The general algorithm described in section 2.4.2 was derived from two algorithms: display name matching and comment matching. These algorithms have been tested separately.

A.3.1 DisplayName Matching Algorithm

The display name used in Cyc is the shortest concept description. For example, the display name for the "HumanAdult" concept is simply "a human adult".
Therefore the matching algorithm computes the distance between the keyword, its context, and this description, as follows:

Input: Keyword, its context (sentence)
Output: Cyc concept matched (BestConstant)

Constants ⇐ GetConstants(Keyword)
BestConstant ⇐ new CycConstant()
BestDistance ⇐ ∞
foreach Constant in Constants do
    dK ⇐ GetDistance(Keyword, Constant.DisplayName)
    dC ⇐ GetDistance(Context, Constant.DisplayName)
    Distance ⇐ (dK + dC) / 2
    if Distance < BestDistance then
        BestDistance ⇐ Distance
        BestConstant ⇐ Constant
    end
end

This algorithm has proven to be quite efficient, since less computation needs to be done, but on the other hand it tends to provide rather poor results and should be enhanced for proper use.

A.3.2 Comment Matching Algorithm

The comment matching algorithm uses additional knowledge about the Cyc concept, called the comment. It is a long description which tends to be as specific as possible. For example, for the "HumanAdult" concept the comment is:

"A specialization of Person, and an instance of HumanTypeByLifeStageType. Each instance of this collection is a person old enough to participate as an independent, mature member of society. In most modern Western contexts it is assumed that anyone over 18 is an adult. However, in many cultures, adulthood occurs when one reaches puberty. Adulthood is contiguousAfter (q.v.) childhood. Notable specializations of this collection include AdultMaleHuman, AdultFemaleHuman, MiddleAgedHuman and OldHuman."

The pseudo-code of this algorithm, where the GetComment method is an actual REST API call, is:
Input: Keyword, its context (sentence)
Output: Cyc concept matched (BestConstant)

Constants ⇐ GetConstants(Keyword)
BestConstant ⇐ new CycConstant()
BestDistance ⇐ ∞
foreach Constant in Constants do
    dCK ⇐ 0
    Comment ⇐ GetComment(Constant)
    Keywords ⇐ GetKeywords(Comment)
    foreach CK in Keywords do
        dCK ⇐ dCK + GetDistance(Keyword, CK)
    end
    dK ⇐ GetDistance(Keyword, Constant.DisplayName)
    dC ⇐ GetDistance(Context, Constant.DisplayName)
    Distance ⇐ (dK + dC + (dCK / Keywords.Count)) / 3
    if Distance < BestDistance then
        BestDistance ⇐ Distance
        BestConstant ⇐ Constant
    end
end

A.3.3 Experiments with Both Algorithms

After running several experiments (Fig. 13) comparing the Comment Matching and Display Name algorithms, we found that Comment Matching gives significantly better results, but can still be enhanced to get even better matching. This may be done by using general/specific concepts in the distance calculation formula, or by learning weights.

A.4 Query Expansion Patterns

In our research, several different query expansion algorithms have been implemented, all with a quite straightforward implementation as explained in section 2.4.3.
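The Comment Matching algorithm above can be sketched as runnable code. This is a hedged sketch only: the GetConstants/GetComment REST calls are replaced by a tiny local table, and GetDistance uses a simple lexical ratio (difflib) instead of the WordNet-based semantic measure used in the actual system; the knowledge-base entries are illustrative.

```python
from difflib import SequenceMatcher

# Hypothetical miniature KB standing in for the GetConstants/GetComment
# REST calls; entries are (concept name, display name, comment).
KB = {
    "adult": [
        ("HumanAdult", "a human adult",
         "Each instance of this collection is a person old enough to "
         "participate as an independent, mature member of society."),
        ("AdultBeverage", "an adult beverage",
         "A drink containing alcohol, legally restricted to adults."),
    ],
}

def get_distance(a, b):
    """Lexical distance in [0, 1]; the real system used WordNet similarity."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()

def get_keywords(text):
    """Crude keyword extraction: words longer than three characters."""
    return [w.strip(".,") for w in text.lower().split() if len(w) > 3]

def match_concept(keyword, context):
    """Comment Matching: average the display-name, context and comment distances."""
    best, best_dist = None, float("inf")
    for name, display, comment in KB.get(keyword, []):
        ck = get_keywords(comment)
        d_ck = sum(get_distance(keyword, w) for w in ck) / (len(ck) or 1)
        d_k = get_distance(keyword, display)
        d_c = get_distance(context, display)
        dist = (d_k + d_c + d_ck) / 3
        if dist < best_dist:
            best, best_dist = name, dist
    return best

print(match_concept("adult", "a person old enough to be a mature member of society"))
```

Swapping `get_distance` for a WordNet-based measure and `KB.get` for the REST calls recovers the structure of the pseudo-code above.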
The algorithms can be described using the following formulas, where K is the set of keywords and k_i ∈ K; all of them have been implemented and evaluated.

Or-Concept Expansion:

Q(K, n) = \bigwedge_{i=1}^{n} \max(k_i, c(k_i))   (2)

where c(k) is the concept linked with the keyword.

Or-Aliases Expansion:

Q(K, n) = \bigwedge_{i=1}^{n} \max\Big(k_i, \bigvee_{j=1}^{m} a_j(k_i)\Big)   (3)

where a_j(k) is an alias linked with the keyword's concept.

Or-Most-Relevant-Alias Expansion:

Q(K, n) = \bigwedge_{i=1}^{n} \max(k_i, a(k_i))   (4)

where a(k) is the most relevant alias linked with the keyword's concept.

Or-Most-Relevant-General Expansion:

Q(K, n) = \bigwedge_{i=1}^{n} \max(k_i, g(k_i))   (5)

where g(k) is the most relevant general concept linked with the keyword's concept.

Or-Most-Relevant-Specific Expansion:

Q(K, n) = \bigwedge_{i=1}^{n} \max(k_i, s(k_i))   (6)

where s(k) is the most relevant specific concept linked with the keyword's concept.

Or-Is-A-Concept Expansion:

Q(K, n) = \bigwedge_{i=1}^{n} \max(k_i, \mathit{isa}(k_i))   (7)
where isa(k) is the most relevant "Is A ..." concept linked with the keyword's concept.

Or-Most-Relevant-AGS Expansion:

Q(K, n) = \bigwedge_{i=1}^{n} \max(k_i, a(k_i), g(k_i), s(k_i))   (8)

where a(k) is the most relevant alias, g(k) the most relevant general concept, and s(k) the most relevant specific concept linked with the keyword's concept.

A.5 Results and Discussion

A.5.1 Google Search on Wikipedia KB

Our system has been tested and is able to generate queries for the Google search engine; however, it is very difficult to evaluate such results, therefore this section of the paper only gives an example. Using as input an introductory text of the Stanford Encyclopedia for the topic of "evolution", the system derived several keywords:

• "species", matched to "speciesImmunity"
• "theory", matched to "theoryOfBeliefSystem"
• "evolution", matched to "Evolution"
• "change", matched to "changesSlot"
• "term", matched to "termExternalIDString"

Next, several queries have been constructed and restricted to the Wikipedia KB:

• Google Non-Expanded Query: "species theory evolution change term"
• Google Or-Concepts Query: "( species OR species immunity ) + ( theory OR Theory Of Belief System ) + ( evolution OR biological evolution ) + ( change OR Changes Slot ) + ( term OR Term External ID String ) +"

Figure 12: Precision at N, results of Or-Concept expanded query with and without stemming compared

• Google Or-Aliases Query: "species + theory + ( evolution OR (biologically will have evolved OR biologically had evolved OR biologically will evolve OR biologically has evolved OR biologically have evolved OR biologically evolving OR biologically evolves OR biologically evolved OR biologically evolve OR most evolutionary OR more evolutionary OR evolutionary OR evolution) ) + change + term +"
• Google Or-Most-Relevant-General Query: "species +
theory + ( evolution OR "development" ) + change + term +"
• ...

After analyzing the results, we compared several of the results returned by Google. For the normal non-expanded query the results are quite satisfying:

* Evolution
* Punctuated equilibrium
* Evolution as theory and fact
* Macroevolution
* History of evolutionary thought
* On the Origin of Species
* Species
* Hopeful Monster

Figure 13: Precision at N, results of Or-Concept expanded query using different matching

However, the source article mentioned Charles Darwin, and his name was not among the derived keywords, so no direct Wikipedia link to his theory of natural selection appeared in the non-expanded results. After deeper analysis, it was actually found in the results of the extended query (Or-Most-Relevant-General):

* Evolution
* Punctuated equilibrium
* Macroevolution
* Evolution as theory and fact
* Charles Darwin
* On the Origin of Species
* Hopeful Monster
* Natural selection

A.5.2 Lemur Search in AP Corpus

In order to evaluate the query expansion methods, structured query generation for the Lemur search engine has been implemented as well. The queries have been automatically derived from the narratives of topics 101 to 115 of the AP Corpus, and a batch evaluation of those queries has been performed. Several different sets of results have been produced by the IREval function of Lemur, among them:

• 5-10 word queries with no keyword stemming;
• 5-10 word queries with keyword stemming;
• 5-10 word queries using name matching.

Figure 14: Precision at N, expansions of a 5 word query compared (part 1)

Most of the experiments have been performed with the Comment Matching Algorithm and the whole set of query expansion patterns; some experiments compare the Display Name and Comment Matching Algorithms. Keyword stemming was tested especially in order to suppress plurals, since the keywords derived from the AP Corpus topics usually kept them: for example, "changes" was stemmed to "change", which gives a larger set of possible concepts to match. During our experiments (Figure 12) we found that using keyword stemming in those algorithms significantly decreases the precision for larger queries.
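Precision at N, the metric reported in figures 12-17, is simply the fraction of the top N returned results that are relevant. A minimal sketch (the result names below are illustrative, reusing titles from the Wikipedia example above):

```python
def precision_at_n(results, relevant, n):
    """Precision at N: fraction of the top-n ranked results that are relevant."""
    top = results[:n]
    return sum(1 for r in top if r in relevant) / n

results = ["Evolution", "Punctuated equilibrium", "Species", "Hopeful Monster"]
relevant = {"Evolution", "Species"}
print(precision_at_n(results, relevant, 4))  # 0.5
```

An "improvement of almost 3 times at N = 5", as reported for the Or-Concept pattern, means the expanded query's `precision_at_n(..., 5)` is nearly triple that of the non-expanded query.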
One possible explanation is that the matching algorithms make more errors, since the set of possible concepts is larger. In figures 14 and 15 one can see that only the Or-Concept algorithm gives better precision
at N = 5 and above; the Or-Most-Relevant-Alias and Or-Most-Relevant-General patterns give a precision improvement only at N > 20.

Figure 15: Precision at N, expansions of a 5 word query compared (part 2)

Figure 16: Precision at N, expansions of a 10 word query compared (part 1)

Figure 17: Precision at N, expansions of a 10 word query compared (part 2)

Finally, it is in figures 16 and 17 that one can really see the improvement made by stemming, and especially by the Or-Concept query expansion. While the 5-word query expansion (at N = 5) with this pattern gives an improvement of almost 50 percent, with a 10-word query the improvement is of the order of 200 percent (a significant increase of precision). The results speak for themselves: we think that Cyc-enabled query expansion is definitely the way to go, but better patterns still need to be built; of the 7 different patterns in our experiments, only one, the Or-Concept pattern, actually produces a good precision increase.