Performance Evaluation of Relational Implementations of Inverted Text Index


Qi Su, Stanford University, qi@db.stanford.edu
Yu-Shan Fung, Stanford University, yfung@stanford.edu

ABSTRACT

Information retrieval (IR) systems are adept at processing keyword queries over unstructured text. In contrast, relational database management systems (RDBMS) are designed for queries over structured data. Recent work has demonstrated the benefits of implementing the traditional IR inverted index in an RDBMS, such as portability, parallelism, and scalability. We perform an in-depth comparison of alternative relational implementations of the inverted text index against a traditional IR system.

1. INTRODUCTION

Databases and information retrieval (IR) are two rich fields of research that have produced ubiquitous tools such as the relational database management system (RDBMS) and the web search engine. Historically, however, these two fields have developed largely independently, even though they share one overriding objective: the management of data.

Traditional IR systems do not take advantage of the structure of data, or metadata, very well. Conversely, relational database systems tend to have limited support for handling unstructured text. Major database vendors do offer sophisticated IR tools that are closely integrated with their database engines, for example Oracle Text, IBM DB2 Text Information Extender, and Microsoft SQL Server Full-Text Search. These tools offer a full range of options, from Boolean, to ranked, to fuzzy search. However, each text index is defined over a single relational column. Hence, significant storage overhead is incurred, first by storing the plain text in a relational column, and again by the inverted index built by the text search tool. These tools offer powerful extensions to the traditional relational database, but do not address the full range of IR requirements. Their vendor-specific nature also means they are not portable solutions.

There has been research in the past decade investigating the use of relational databases to build inverted index-based information retrieval systems. There are several key advantages to such an approach. A pure relational implementation using standard SQL offers portability across hardware platforms, operating systems, and database vendors. Such a system does not require software modification in order to scale on a parallel machine, as the DBMS takes care of data partitioning and parallel query processing. Use of a relational system enables searching over structured metadata in conjunction with traditional IR queries. The DBMS also provides features such as transactions, concurrent queries, and failure recovery.

Most previous works have picked one relational implementation and compared it with a special-purpose IR system. Some of them have focused on a particular advantage, such as scalability on a parallel cluster.

We propose a comprehensive evaluation of the alternative relational implementations of the inverted text index that have been discussed in the literature, with the special-purpose IR system Lucene as the baseline for comparison. We evaluate the systems on Boolean queries, phrase queries, and relevance-ranked queries, and benchmark their relative performance in terms of query response times.

In section 2, we discuss related work on implementing inverted indexes as relations and on integrating IR and DBMS. In section 3, we present the baseline IR system and the alternative relational implementations of the inverted index. In section 4, we describe how these systems are evaluated, our test dataset, and the queries to be executed. Section 5 presents the relative performance results collected and our observations. Finally, in section 6, we present concluding remarks and future work.

2. RELATED WORK

Several works have picked a single relational implementation and compared its performance with a baseline special-purpose IR system. Kaufmann et al
[KS95] compared an IR system, BASISPlus, an early version of Oracle's text search tool, SQL*TR, and a relational implementation of the inverted list with two relations, <term, docid> and <term, docfreq>. The evaluation dataset is small, with 850,000 tuples in the <term, docid> inverted list. The queries are strictly conjunctive Boolean queries.

More recent works have shown that Boolean, proximity, and vector space ranked model searching can be effectively implemented as standard relations while offering satisfactory performance when compared to a baseline traditional IR system. Grossman et al [GFH97] demonstrate that relational implementations are effective for Boolean, proximity, and ranked queries. Their relational model, implemented using Microsoft SQL Server, consists of a doc_term table <docid, term, term freq>, a doc_term_prox table <docid, term, position>, and an idf table <term, idf>. The baseline IR system is Lotus Notes, a heavyweight system that is not built specifically for IR tasks. The authors also studied parallel performance on an AT&T 4-processor database machine.

Some works have focused on a single advantage of relational implementations over the traditional IR inverted index. Grabs et al [GBS01] evaluated the performance and scalability of a database IR system on a parallel cluster. The system is implemented with BEA middleware over an Oracle database, with significant emphasis on transaction semantics to ensure high levels of search and insert parallelism. The basic data model is <term, docid>. Only Boolean query performance is measured.

Brown et al [BCC94, BR95] demonstrated an efficient inverted index implementation and fast incremental index update using a database system. However, their implementation used a persistent object store manager, which is outside our scope of the traditional relational model and an off-the-shelf RDBMS.

A recent issue of the IEEE Data Engineering Bulletin covered the work by major database vendors to integrate full-text search functionality into the RDBMS. [MS01], [HN01], and [DIX01] presented how IBM DB2, Microsoft SQL Server, and Oracle introduce text extensions that are tightly coupled with the database engine. However, as we discussed earlier, such an approach is limited in that each text index must be defined over a single column, and storing both the full text of the document in the database and the inverted index on the side incurs significant storage overhead.

3. SYSTEM IMPLEMENTATIONS

We evaluate four systems on information retrieval tasks. The baseline system is Lucene, a special-purpose IR search engine. The three relational designs are implemented using IBM DB2 Universal Database. The first relational approach uses the DB2 Text Information Extender to take care of all the indexing and query processing. The two remaining relational approaches implement the inverted index as relations, and transform keyword queries into standard SQL queries.

3.1. Lucene

Lucene is an open-source text search engine under the Apache project. It is written entirely in Java. We chose it as our baseline system because it offers ease of deployment and a full feature set representative of a traditional IR system.

Lucene includes three key APIs: IndexWriter, IndexReader, and IndexSearcher. IndexWriter enables the user to construct an inverted text index over a corpus of documents. Indexing may be customized with parameters such as case folding and stemming. IndexReader allows the user to probe the contents of the inverted index, for example by enumerating all tokens in the index. This important aspect will be discussed in section 4.1. IndexSearcher provides a rich set of search options, including Boolean, phrase, ranked, and fuzzy search queries.

With our corpus, we pass one document at a time to our IndexWriter instance to be tokenized and indexed. The keys associated with each document are the document ID and the URL, which is just the document file name. At execution time, we use the appropriate method of the IndexSearcher instance to retrieve the relevant document URLs from the Lucene index.

By default, all Lucene search hits are returned with a ranked score. In our case, we only care about the ranking when we measure ranked query performance. Lucene keyword queries are structured as a single search string. And queries prefix each keyword with a plus sign. Or queries are a space-delimited list of keywords. Phrase queries are a space-delimited list of keywords enclosed in quotes.

3.2. IBM DB2 Text Information Extender

IBM DB2 Text Information Extender (TIE) is a full-text search engine tightly coupled with the IBM DB2
Universal Database. It supports the creation of full-text indexes on textual DB2 table columns. TIE uses the table primary key to relate inverted index token entries to their original source tuple in the table. TIE is invoked through special function calls over columns, much like a user-defined function.

We create a three-column relation named fulltext <docid, url, text>. The text column is of type Binary Large Object (BLOB) and contains the full text of the document. Our simple parser creates a single large load file. DB2's load utility batch loads the data into the relation. Then we invoke TIE to index the text column.

Boolean and phrase queries use the contains function. Ranked queries also use the score function.

And query:

    SELECT url FROM fulltext
    WHERE contains(text, '"keyword1" & "keyword2" & "keyword3"') = 1

Or query:

    SELECT url FROM fulltext
    WHERE contains(text, '"keyword1" | "keyword2" | "keyword3"') = 1

Phrase query:

    SELECT url FROM fulltext
    WHERE contains(text, '"keyword1 keyword2 keyword3"') = 1

Ranked query:

    WITH temptable (url, score) AS
      (SELECT url, score(text, '"keyword1" & "keyword2" & "keyword3"')
       FROM fulltext)
    SELECT url FROM temptable
    WHERE score > 0
    ORDER BY score DESC

3.3. Term - Document

In this representation, each tuple corresponds to a term/document pair. The relations are:

    tf <term, docid, termfreq, positionlist>
    idf <term, idf>
    url <url, docid>

The bulk load files are created by invoking the Lucene IndexReader to probe the Lucene text index over our corpus. We discuss this aspect further in section 4.

The positionlist attribute of the tf relation is an offset-encoded list of all occurrence positions of the given term in the given document: the first entry is the first position, and each later entry is the gap from the previous position. For example, the term "hello" appearing in document 2 at positions 10, 100, 102 would be encoded as the tuple

    <"hello", 2, "10,90,2">

This representation is more compact than the Term-Document-Position approach and is well suited for Boolean and ranked queries. In the case of phrase or positional queries, we implement application logic to merge position lists.

There are two alternatives for implementing an And query in SQL. The natural choice is to translate an N-word query into an N-way equi-join on the document ID, where each join relation has been filtered to select documents containing one of the words.

    SELECT u.url
    FROM url u, tf t1, tf t2
    WHERE u.docid=t1.docid AND t1.docid=t2.docid AND
          t1.term='keyword1' AND t2.term='keyword2'

[Figure 1. Query plan for equi-join And query. Operators shown: two HashJoins, a Tablescan of URL, and two Indexscans on term over TF.]

An alternative due to Grossman [GFH97] treats the query keywords as an artificial relation and joins it with the term-document relation. The result is grouped by document, and only documents having the correct number of keyword matches are preserved.

    WITH query(term) AS (values ('keyword1'), ('keyword2'))
    SELECT u.url
    FROM url u, tf d, query q
    WHERE u.docid=d.docid AND d.term=q.term
    GROUP BY u.url
    HAVING count(d.term)=2

However, upon further evaluation (results not shown), we discovered that the second implementation never outperforms the first, and in some cases performs over an order of magnitude worse. Hence, from now on, all references to And queries on both the Term-Doc and Term-Doc-Position approaches refer to the first implementation.
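The two And formulations of Section 3.3 can be tried out on a toy corpus. The following is a minimal sketch under stated assumptions, not the authors' embedded SQL application: it uses Python's built-in sqlite3 module in place of DB2, tokenizes on whitespace with case folding, assumes the query keywords are distinct, and names the keyword relation query_terms instead of query. The helper names (delta_encode, build_index, and_query_join, and_query_groupby) and the sample documents are hypothetical.

```python
import sqlite3


def delta_encode(positions):
    """Offset-encode sorted positions: first position, then gaps between neighbours."""
    gaps = [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]
    return ",".join(str(g) for g in gaps)


def build_index(docs):
    """Load a toy corpus {docid: (url, text)} into the url and tf relations."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE url (url TEXT, docid INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE tf (term TEXT, docid INTEGER, termfreq INTEGER, positionlist TEXT)"
    )
    conn.execute("CREATE INDEX tf_term ON tf (term)")
    for docid, (url, text) in docs.items():
        conn.execute("INSERT INTO url VALUES (?, ?)", (url, docid))
        positions = {}
        # Assumption: whitespace tokenization with case folding only.
        for pos, token in enumerate(text.lower().split()):
            positions.setdefault(token, []).append(pos)
        for term, plist in positions.items():
            conn.execute(
                "INSERT INTO tf VALUES (?, ?, ?, ?)",
                (term, docid, len(plist), delta_encode(plist)),
            )
    return conn


def and_query_join(conn, keywords):
    """First formulation: N-way equi-join on docid, one tf alias per keyword."""
    aliases = [f"t{i}" for i in range(len(keywords))]
    tables = ", ".join(f"tf {a}" for a in aliases)
    joins = " AND ".join(f"u.docid = {a}.docid AND {a}.term = ?" for a in aliases)
    sql = f"SELECT u.url FROM url u, {tables} WHERE {joins}"
    return sorted(row[0] for row in conn.execute(sql, list(keywords)))


def and_query_groupby(conn, keywords):
    """Second formulation: join tf with a keyword relation, keep full matches."""
    values = ", ".join("(?)" for _ in keywords)
    sql = (
        f"WITH query_terms(term) AS (VALUES {values}) "
        "SELECT u.url FROM url u, tf d, query_terms q "
        "WHERE u.docid = d.docid AND d.term = q.term "
        "GROUP BY u.url HAVING count(d.term) = ?"
    )
    params = list(keywords) + [len(keywords)]
    return sorted(row[0] for row in conn.execute(sql, params))


# Hypothetical three-document corpus.
docs = {
    1: ("doc1.xml", "hello world hello"),
    2: ("doc2.xml", "hello there"),
    3: ("doc3.xml", "world news"),
}
conn = build_index(docs)
result_join = and_query_join(conn, ["hello", "world"])
result_groupby = and_query_groupby(conn, ["hello", "world"])
print(result_join, result_groupby)
```

Both functions return the same set of URLs on this corpus; any performance difference between the two plans depends on the optimizer and data size, so a toy example like this says nothing about the timings reported in the paper.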
[Figure 2. Query plan for alternative And query. Operators shown: Sort, Tablescan, Groupby, Filter, Hashjoin, NLJoin, Tablescan, and Indexscan on term, over the URL, Query, and TF relations.]

The Or query is a simple selection with multiple Or filters.

    SELECT DISTINCT(u.url)
    FROM url u, tf t
    WHERE u.docid=t.docid AND
          ( t.term='keyword1' OR t.term='keyword2' )

For phrase queries, we retrieve all candidate documents and the position lists for all the keywords in the query. Candidate documents are the results of the And query with the same keywords. The result relation looks like <docid, position list for first keyword, position list for second keyword, … >.

    SELECT u.url, t1.positionlist, t2.positionlist
    FROM url u, tf t1, tf t2
    WHERE u.docid=t1.docid AND t1.docid=t2.docid AND
          t1.term='keyword1' AND t2.term='keyword2'

The application logic then traverses the position lists together to find an instance where the positions in successive lists are consecutive, in query order.

Our ranked query implementation is also due to Grossman [GFH97]. First, we precompute term IDF values as log ( number of documents / document frequency of the term ). The relevance of a given document d is the summation over all terms t occurring in both the query and the document:

    score(d) = ∑ over t of ( query.termfreq for t * t's IDF * d.termfreq for t * t's IDF )

    WITH query(term, tf) AS (values ('keyword1',1),('keyword2',1))
    SELECT u.url, SUM (q.tf * i.idf * d.termfreq * i.idf) as score
    FROM url u, query q, tf d, idf i
    WHERE u.docid=d.docid AND q.term = i.term AND d.term = i.term
    GROUP BY u.url
    ORDER BY score DESC

3.4. Term - Document - Position

The term-document-position approach stores a single tuple for every occurrence of a term in a document. Hence a term t that appears 5 times in document d corresponds to five distinct tuples. The relations are:

    posting <term, docid, position>
    idf <term, idf>
    url <url, docid>

For Boolean and ranked queries, this representation is redundant compared to the term-document approach. At query time, we must insert the distinct operator in our query plan to eliminate duplicates in our join results. However, this representation leads to a straightforward SQL translation of phrase and proximity queries. There is no need for application logic or custom user-defined functions to post-process position lists for positional matches. Positional matches are specified as SQL arithmetic predicates relating the position attributes.

The Boolean queries are very similar to those of the Term-Document approach, except for the addition of distinct operators.

Equi-join And:

    SELECT DISTINCT(u.url)
    FROM url u, posting t1, posting t2
    WHERE u.docid=t1.docid AND t1.docid=t2.docid AND
          t1.term='keyword1' AND t2.term='keyword2'

Or:

    SELECT DISTINCT(u.url)
    FROM url u, posting d
    WHERE u.docid=d.docid AND
          ( d.term='keyword1' OR d.term='keyword2' )

Phrase queries:

    SELECT DISTINCT(u.url)
    FROM url u, posting t1, posting t2
    WHERE t1.docid=u.docid AND t1.docid=t2.docid AND
          t1.term='keyword1' AND t2.term='keyword2' AND
          t2.position=t1.position+1

Ranked queries:

    WITH query(term, tf) AS (values ('keyword1',1),('keyword2',1))
    SELECT u.url, SUM (q.tf * i.idf * i.idf) as score
    FROM url u, query q, posting d, idf i
    WHERE u.docid=d.docid AND q.term = i.term AND d.term = i.term
    GROUP BY u.url
    ORDER BY score DESC

4. SYSTEM EVALUATION

The systems are implemented on a Pentium III 800MHz workstation with 1GB of RAM running Windows 2000. The baseline IR system is Lucene version 1.2 running on JDK 1.3. The relational database is IBM DB2 UDB Enterprise Edition version
7.2. Our first relational approach uses the IBM DB2 Text Information Extender version 7.2.

4.1. Dataset

Our dataset consists of 199,932 Reuters newswire articles from the year 1997. The raw text is 322MB. The corpus has 895,308 distinct tokens after case folding. There are 28,507,457 distinct term-document pairs, which is the cardinality of the term-document relation tf. There are 51,108,145 tokens in the corpus, which is the cardinality of the term-document-position relation posting.

The Reuters documents are wrapped in XML. We strip the XML tags to produce the textual body, which is loaded into the Lucene index and the DB2 TIE relation. DB2 TIE's tokenization process is a black box to us, the end user. However, by default, queries do use case folding. Fancier features such as stemming and thesaurus lookup can be specified in the search function invocation. Using Lucene, we can specify options such as case folding and stemming at index creation and search time. For a uniform comparison of our four alternatives, we want uniform tokenization. Since DB2 TIE has case folding enabled by default, we use it as our common denominator. We build the Lucene index with case folding turned on. To produce uniform tokenization in our two relational implementations, the term-document and term-document-position relations are populated by probing the inverted index built by Lucene. We use the Lucene IndexReader class to enumerate all term-document-position information and produce the appropriate bulk load file for our two relational representations. This approach guarantees uniform tokenization, using only case folding, across our four implementations.

Table 1. Space utilization (relative to raw text size)

                   Table    Index    Total
    Raw text                          100%
    Lucene                   133%     133%
    DB2 TIE                   99%
    Term-Doc        337%     389%     726%
    Term-Doc-Pos    429%     783%    1212%

DB2's table size estimation utility could not estimate the size of the fulltext base relation because the text body is stored in a BLOB attribute, rather than inline as varchar.

4.2. Queries

We test Boolean (And/Or), phrase, and ranked queries. Our queries are 1, 2, or 4 keywords long. We divide each query class into subclasses of 3 different selectivities: approximately 1 document hit (0.0005% of corpus), 10 hits (0.005% of corpus), and 100 hits (0.05% of corpus). For each subclass, we generate three distinct queries and measure the average query execution time.

Table 2. Sample of queries executed and selectivities (expected number of hits)

    Hits  And                   Or                        Phrase                    Rank
    1     Gwil Industries (1)   Photronics Yomazzo (6)    Exceeding Consensus (2)   Photronics Yomazzo (6)
          Gaulle Quebec (1)     Videoserver Tandberg (4)  West life (1)             Videoserver Tandberg (4)
          Zygo Systems (1)      Genhold Femco (2)         Scotland Bancorp (1)      Genhold Femco (2)

Queries are generated by sampling sets of 1, 2, or 4 keywords from the headlines of the first thousand documents in the corpus, then repeatedly probing the Lucene index until the desired selectivity for the query subclass is achieved. For the 4-keyword Or queries, we were unable to generate queries of selectivity 1 or 10, because a disjunctive query returns the union of the hits of its keywords.

We want to measure keyword query response time uniformly across systems. Our standard is the execution time from when the keywords are submitted by the user to when all result document URLs have been returned to the user. For the Lucene implementation, we measure the time from when our Java Lucene IndexSearcher instance receives the command line keyword parameters and search option to when it completes retrieval of hits from the index. For our three database implementations, we build an embedded SQL application that takes in command line keyword parameters and search options, translates the query into appropriate SQL, connects to the database, executes the query, and retrieves document URLs. The execution time of the embedded SQL application is measured.

5. RESULTS & OBSERVATIONS

[The following abbreviations are used in the tables:
TIE: IBM DB2 Text Information Extender
TD: Term-Doc relational model
TDP: Term-Doc-Pos relational model]

Table 3. And queries: class average execution time in seconds

Query class           Lucene     TIE      TD     TDP
1 word - 1 hit         0.359   1.094   0.406   0.448
1 word - 10 hit        0.729   1.432   0.813   0.448
1 word - 100 hit       3.391   4.474   2.781   0.661
2 word - 1 hit         0.375   0.557   1.250   0.651
2 word - 10 hit        0.578   4.343   1.922   0.662
2 word - 100 hit       2.078   4.900   2.078   0.802
4 word - 1 hit         0.526   1.396   1.063   2.083
4 word - 10 hit        0.739   2.380   2.125   2.573
4 word - 100 hit       2.954   3.895   3.870   4.453

Table 4. Or queries: class average execution time in seconds

Query class           Lucene     TIE      TD     TDP
1 word - 1 hit         0.359   1.094   0.406   0.448
1 word - 10 hit        0.729   1.432   0.813   0.448
1 word - 100 hit       3.391   4.474   2.781   0.661
2 word - 1 hit         0.385   0.343   0.442   112.7
2 word - 10 hit        0.719   0.446   0.469   113.5
2 word - 100 hit       2.500   0.501   0.693   115.9
4 word - 1 hit             -       -       -       -
4 word - 10 hit            -       -       -       -
4 word - 100 hit       3.016   0.922   0.805   120.5

Table 5. Phrase queries: class average execution time in seconds

Query class           Lucene     TIE      TD     TDP
1 word - 1 hit         0.359   1.094   0.406   0.448
1 word - 10 hit        0.729   1.432   0.813   0.448
1 word - 100 hit       3.391   4.474   2.781   0.661
2 word - 1 hit         0.391   5.482   0.406   0.818
2 word - 10 hit        0.614   6.774   0.813   0.906
2 word - 100 hit       2.552   4.067   2.781   2.354
4 word - 1 hit         0.578   1.453   1.359   2.541
4 word - 10 hit        0.771   3.104   2.833   12.68
4 word - 100 hit       3.047   3.562   3.458   13.52

Table 6. Rank queries: class average execution time in seconds

Query class           Lucene     TIE      TD     TDP
1 word - 1 hit         0.359   0.635   0.492      ∞*
1 word - 10 hit        0.729   1.099   0.526      ∞*
1 word - 100 hit       3.391   2.934   0.708      ∞*
2 word - 1 hit         0.385   0.339   13.55   257.6
2 word - 10 hit        0.719   0.370   13.47   284.6
2 word - 100 hit       2.500   0.432   16.19   261.4
4 word - 1 hit             -       -       -       -
4 word - 10 hit            -       -       -       -
4 word - 100 hit       3.016   0.523   15.06   307.4

* did not finish in 10 minutes

5.1 Lucene

Lucene, being a specialized IR system, performs consistently well across the board on all four types of queries. It has fast response times and scales well with query size. However, performance deteriorates slightly as the expected result size increases.

5.2 IBM DB2 Text Information Extender (TIE)

TIE produced performance numbers comparable to those of Lucene on And queries, slightly better on Or queries, and slightly worse on Phrase queries. We are not able to make a direct comparison of Rank query performance, as TIE returns only documents containing all query terms whereas the 3 other systems require only 1 keyword match. However, judging from its performance on single-keyword queries, its response time is comparable to that of Lucene. Furthermore, TIE response time varies significantly across different queries from the same query class, a characteristic not seen in the other systems.

5.3 Term-Doc

The system using the Term-Doc relational model performed comparably with Lucene and TIE on And, Or, and Phrase queries. It also appears to scale well (sub-linearly) with query and result size on those same types of queries. It performs competitively on single-keyword ranked queries, but performance degrades significantly on ranked queries with 2 or more keywords. This is apparently due to the optimizer choosing a different query plan for these queries. However, performance still appears to scale gracefully as query and result size increase.

5.4 Term-Doc-Pos

The system based on the Term-Doc-Pos relational model produced reasonably fast response times on And and Phrase queries, though efficiency deteriorates notably as the expected number of hits and the query size increase. The system was much slower than the rest on Or and Rank queries. In fact, with single-keyword ranked queries, the system does not respond within 10 minutes. Upon further investigation, we found that the query optimizer was picking some unreasonably bad plans, involving multiple sorts of large relations. We plan to look into improvements in these areas in the future. Despite this shortcoming, on these query types the approach seems to scale reasonably well and is quite insensitive to the query or result sizes.

5.5 Comparisons

[The following abbreviations are used in the charts:
DB2text: IBM DB2 Text Information Extender
DB2term_doc: Term-Doc relational model
DB2term_doc_pos: Term-Doc-Pos relational model]
All four systems perform well on And queries (Figure 3), with response times no more than 5 seconds. It should be noted that both the Term-Doc and Term-Doc-Pos models produced numbers comparable to those of Lucene. In fact, both appear to scale more gracefully as the size of the result set increases.

[Figure 3. And query with 2 keywords. Response time against number of hits (1, 10, 100) for Lucene, DB2text, DB2term_doc, and DB2term_doc_pos.]

With regard to Phrase queries, all four systems finished within 7 seconds, with all except TIE producing sub-3 second response times (Figure 4). Note that both TIE and Term-Doc exhibit good scaling characteristics compared to the Lucene and Term-Doc-Pos systems.

[Figure 4. Phrase query with 4 keywords. Response time against number of hits (1, 10, 100) for Lucene, DB2text, DB2term_doc, and DB2term_doc_pos.]

With Rank queries, we see where the two specialized systems have their advantages (Figure 5, note the logarithmic scale). Lucene and TIE significantly outperform the two systems built on relational models. One reason behind the disparity is that the relational system performs extensive sorting operations on large relations. We also discovered that the DB2 query optimizer could have picked better plans if more indices/constraints were available, and we plan to investigate this in the future.

[Figure 5. Rank query with 2 keywords. Response time (logarithmic scale) against number of hits (1, 10, 100) for Lucene, DB2text, DB2term_doc, and DB2term_doc_pos.]

6. CONCLUSIONS AND FUTURE WORK

The Term-Doc representation offers competitive performance on Boolean and phrase queries compared with the special-purpose IR system Lucene. Hence one may choose to incur the storage overhead to gain the advantages of a relational implementation, such as portability, parallelism, and the ability to query over both unstructured text and structured metadata. In general, the Term-Doc representation offers better performance than the Term-Doc-Position representation. DB2 TIE provides performance comparable to that of Lucene, though it incurs significantly higher space overhead by storing the base text in a relation, as well as the inverted index. If a workload consists mostly of ranked queries, then Lucene or TIE should be used, as the DB2 optimizer seems to be using sub-optimal plans for the two relational implementations.

There are a number of avenues for future work. Additional query classes to be investigated include proximity queries and wild-card queries. We measured query execution times in isolation; we may want to measure executions of sustained query workloads. Another important question is index update performance. It is conceivable that the RDBMS page layout and B-tree index may be more efficient for inverted index insertion than the traditional IR approach of maintaining a small update list and reorganizing the entire index periodically. Traditional IR search engines are not well suited for high-insert environments. We would like to find out if RDBMS approaches are more attractive in such a setting. On the performance front, a number of our RDBMS queries were clearly being executed with sub-optimal plans, hence some amount of database tuning may result in a significant performance boost. Since most IR
systems are used interactively, and users typically process result hits in batches, it is often useful to optimize for top-K results. Most database vendors provide language constructs to specify this constraint and to utilize it in query processing for better efficiency (e.g. reducing the size of sort results).

7. REFERENCES

[BCC94] E. W. Brown, J. P. Callan, and W. B. Croft. Fast Incremental Indexing for Full-Text Information Retrieval. In Proceedings of the 20th International Conference on Very Large Databases, 1994.

[BR95] E. W. Brown. Execution Performance Issues in Full-Text Information Retrieval. Ph.D. Thesis, University of Massachusetts, Amherst, 1995.

[DDS95] S. DeFazio, A. Daoud, L. Smith, J. Srinivasan, B. Croft, and J. Callan. Integrating IR and RDBMS Using Cooperative Indexing. In Proceedings of the 18th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1995.

[DIX01] P. Dixon. Basics of Oracle Text Retrieval. IEEE Data Engineering Bulletin, December 2001.

[GBS01] T. Grabs, K. Böhm, and H.-J. Schek. PowerDB-IR: Information Retrieval on Top of a Database Cluster. In Proceedings of the 10th ACM International Conference on Information and Knowledge Management, 2001.

[GFH97] D. A. Grossman, O. Frieder, D. O. Holmes, and D. C. Roberts. Integrating Structured Data and Text: A Relational Approach. Journal of the American Society for Information Science, 1997.

[HN01] J. Hamilton and T. Nayak. Microsoft SQL Server Full-Text Search. IEEE Data Engineering Bulletin, December 2001.

[KS95] H. Kaufmann and H.-J. Schek. Text Search Using Database Systems Revisited - Some Experiments. In Proceedings of the 13th British National Conference on Databases, 1995.

[LFH99] C. Lundquist, O. Frieder, D. O. Holmes, and D. A. Grossman. A Parallel Relational Database Management System Approach to Relevance Feedback in Information Retrieval. Journal of the American Society for Information Science, 1999.

[LS88] C. A. Lynch and M. Stonebraker. Extended User-Defined Indexing with Application to Textual Databases. In Proceedings of the 14th International Conference on Very Large Databases, 1988.

[MS01] A. Maier and D. Simmen. DB2 Optimization in Support of Full Text Search. IEEE Data Engineering Bulletin, December 2001.

[RAG01] P. Raghavan. Structured and Unstructured Search in Enterprises. IEEE Data Engineering Bulletin, December 2001.
