بسم الله الرحمن الرحيم
(In the name of Allah, the Most Gracious, the Most Merciful)

A Novel Web Search Engine Model Based on Index-Query Bit-Level Compression

Prepared by: Saif Mahmood Saab
Supervised by: Dr. Hussein Al-Bahadili

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Information Systems

Faculty of Information Systems and Technology
University of Banking and Financial Sciences
Amman - Jordan
May 2011
Authorization

I, the undersigned Saif Mahmood Saab, authorize the Arab Academy for Banking and Financial Sciences to provide copies of this dissertation to libraries, institutions, agencies, and any parties upon their request.

Name: Saif Mahmood Saab
Signature:
Date: 30/05/2011
Dedications

To the pure soul of my father...
To my beloved mother...
To my dear wife...
To my dear children...
I dedicate this humble work.
Acknowledgments

First and foremost, I thank Allah (Subhana Wa Taala) for endowing me with health, patience, and knowledge to complete this work.

I am thankful to everyone who supported me during my study. I would like to thank my esteemed supervisor, Dr. Hussein Al-Bahadili, who accepted me as his Ph.D. student without any hesitation, offered me so much advice, patiently supervised me, and always guided me in the right direction.

Last but not least, I would like to thank my parents for their support over the years, my wife for her understanding and continued encouragement, and my friends, especially Mahmoud Alsiksek and Ali AlKhaledi.

Words are not enough to express my gratitude to all the people who helped me; I would still like to give my many, many thanks to all of them.
List of Figures

1.1    Architecture and main components of the standard search engine model.
3.1    Architecture and main components of the CIQ Web search engine model.
3.2    Lists of IDs for each type of character set, assuming m=6.
3.3-a  Locations of data and parity bits in a 7-bit codeword.
3.3-b  An uncompressed binary sequence of 21-bit length divided into 3 blocks of 7-bit length, where b1 and b3 are valid blocks, and b2 is a non-valid block.
3.3-c  The compressed binary sequence (18-bit length).
3.4    The main steps of the HCDC compressor.
3.5    The main steps of the HCDC decompressor.
3.6    Variation of Cmin and Cmax with p.
3.7    Variation of r1 with p.
3.8    Variations of C with respect to r for various values of p.
3.9    The compressed file header of the HCDC scheme.
4.1    The compression ratio (C) for different sizes of index files.
4.2    The reduction factor (Rs) for different sizes of index files.
4.3    Variation of C and average Sf for different sizes of index files.
4.4    Variation of Rs and Rt for different sizes of index files.
4.5    The CIQ performance triangle.
5.1    The CIQ performance triangle.
List of Tables

1.1    Document ID and its contents.
1.2    Record-level and word-level inverted indexes for the documents in Table (1.1).
3.1    List of most popular stopwords (117 stop-words).
3.2    Type of character sets and equivalent maximum number of IDs.
3.4    Variation of Cmin, Cmax, and r1 with number of parity bits (p).
3.6    Variations of C with respect to r for various values of p.
3.7    Valid 7-bit codewords.
3.8    The HCDC algorithm compressed file header.
4.1    List of visited Websites.
4.2    The sizes of the generated indexes.
4.3    Type and number of characters in each generated inverted index file.
4.4    Type and frequency of characters in each generated inverted index file.
4.5    Values of C and Rs for different sizes of index files.
4.6    Performance analysis and implementation validation.
4.7    List of keywords.
4.8    Values of No, Nc, To, Tc, Sf, and Rt for the 1000 index file.
4.9    Values of No, Nc, To, Tc, Sf, and Rt for the 10000 index file.
4.10   Values of No, Nc, To, Tc, Sf, and Rt for the 25000 index file.
4.11   Values of No, Nc, To, Tc, Sf, and Rt for the 50000 index file.
4.12   Values of No, Nc, To, Tc, Sf, and Rt for the 75000 index file.
4.13   Variation of Sf for different index sizes and keywords.
4.14   Variation of No and Nc for different index sizes and keywords.
4.15   Variation of To and Tc for different index sizes and keywords.
4.16   Values of C, Rs, average Sf, and average Rt for different sizes of index files.
Abbreviations

ACW      Adaptive Character Wordlength
API      Application Programming Interface
ASCII    American Standard Code for Information Interchange
ASF      Apache Software Foundation
BWT      Burrows-Wheeler block sorting transform
CIQ      Compressed Index-Query
CPU      Central Processing Unit
DBA      Database Administrator
FLH      Fixed-Length Hamming
GFS      Google File System
GZIP     GNU zip
HCDC     Hamming Code Data Compression
HTML     Hypertext Mark-up Language
ID3      A metadata container used in conjunction with the MP3 audio file format
JSON     JavaScript Object Notation
LAN      Local Area Network
LANMAN   Microsoft LAN Manager
LDPC     Low-Density Parity Check
LZW      Lempel-Ziv-Welch
MP3      A patented digital audio encoding format
NTLM     Windows NT LAN Manager
PDF      Portable Document Format
RLE      Run Length Encoding
RSS      Really Simple Syndication
RTF      Rich Text Format
SAN      Storage Area Network
SASE     Shrink And Search Engine
SP4      Windows Service Pack 4
UNIX     UNiplexed Information and Computing Service
URL      Uniform Resource Locator
XML      Extensible Markup Language
ZIP      A data compression and archive format (the name zip means "speed")
Table of Contents

Authorization
Dedications
Acknowledgments
List of Figures
List of Tables
Abbreviations
Table of Contents
Abstract

Chapter One: Introduction
  1.1 Web Search Engine Model
    1.1.1 Web crawler
    1.1.2 Document analyzer and indexer
    1.1.3 Searching process
  1.2 Challenges to Web Search Engines
  1.3 Data Compression Techniques
    1.3.1 Definition of data compression
    1.3.2 Data compression models
    1.3.3 Classification of data compression algorithms
    1.3.4 Performance evaluation parameters
  1.4 Current Trends in Building High-Performance Web Search Engine
  1.5 Statement of the Problem
  1.6 Objectives of this Thesis
  1.7 Organization of this Thesis

Chapter Two: Literature Review
  2.1 Trends Towards High-Performance Web Search Engine
    2.1.1 Succinct data structure
    2.1.2 Compressed full-text self-index
    2.1.3 Query optimization
    2.1.4 Efficient architectural design
    2.1.5 Scalability
    2.1.6 Semantic search engine
    2.1.7 Using Social Networks
    2.1.8 Caching
  2.2 Recent Research on Web Search Engine
  2.3 Recent Research on Bit-Level Data Compression Algorithms

Chapter Three: The Novel CIQ Web Search Engine Model
  3.1 The CIQ Web Search Engine Model
  3.2 Implementation of the CIQ Model: CIQ-based Test Tool (CIQTT)
    3.2.1 COLCOR: Collects the testing corpus (documents)
    3.2.2 PROCOR: Processing and analyzing testing corpus (documents)
    3.2.3 INVINX: Building the inverted index and start indexing
    3.2.4 COMINX: Compressing the inverted index
    3.2.5 SRHINX: Searching index (inverted or inverted/compressed index)
    3.2.6 COMRES: Comparing the outcomes of different search processes performed by the SRHINX procedure
  3.3 The Bit-Level Data Compression Algorithm
    3.3.1 The HCDC algorithm
    3.3.2 Derivation and analysis of HCDC algorithm compression ratio
    3.3.3 The Compressed File Header
  3.4 Implementation of the HCDC algorithm in CIQTT
  3.5 Performance Measures

Chapter Four: Results and Discussions
  4.1 Test Procedures
  4.2 Determination of the Compression Ratio (C) & the Storage Reduction Factor (Rs)
    4.2.1 Step 1: Collect the testing corpus using the COLCOR procedure
    4.2.2 Step 2: Process and analyze the corpus to build the inverted index file using the PROCOR and INVINX procedures
    4.2.3 Step 3: Compress the inverted index file using the INXCOM procedure
  4.3 Determination of the Speedup Factor (Sf) and the Time Reduction Factor (Rt)
    4.3.1 Choose a list of keywords
    4.3.2 Perform the search processes
    4.3.3 Determine Sf and Rt
  4.4 Validation of the Accuracy of the CIQ Web Search Model
  4.5 Summary of Results

Chapter Five: Conclusions and Recommendations for Future Work
  5.1 Conclusions
  5.2 Recommendations for Future Work

References
Appendix I
Appendix II
Appendix III
Appendix IV
Abstract

A Web search engine is an information retrieval system designed to help find information stored on the Web. A standard Web search engine consists of three main components: Web crawler, document analyzer and indexer, and search processor. Due to the rapid growth in the size of the Web, Web search engines are facing enormous performance challenges in terms of storage capacity, data retrieval rate, query processing time, and communication overhead. Large search engines, in particular, have to be able to process tens of thousands of queries per second on tens of billions of documents, making query throughput a critical issue. To satisfy this heavy workload, search engines use a variety of performance optimizations, including succinct data structures, compressed text indexing, query optimization, high-speed processing and communication systems, and efficient search engine architectural design. However, it is believed that the performance of current Web search engine models still falls short of meeting user and application needs.

In this work we develop a novel Web search engine model based on index-query compression; therefore, it is referred to as the compressed index-query (CIQ) model. The model incorporates two compression layers, both implemented at the back-end processor (server) side: one layer resides after the indexer, acting as a second compression layer to generate a double compressed index, and the second layer is located after the query parser, compressing the query to enable compressed index-query search. The data compression algorithm used is the novel Hamming code data compression (HCDC) algorithm.

The different components of the CIQ model are implemented in a number of procedures forming what is referred to as the CIQ test tool (CIQTT), which is used as a test bench to validate the accuracy and integrity of the retrieved data and to evaluate the performance of the CIQ model. The results obtained demonstrate that the new CIQ model attains excellent performance compared to the current uncompressed model: the CIQ model achieves 100% agreement with the results of the current uncompressed model.

The new model demands less disk space, as the HCDC algorithm achieves a compression ratio over 1.3 with a compression efficiency of more than 95%, which implies a reduction in storage requirements of over 24%. The new CIQ model also performs faster than the current model, as it achieves a speedup factor over 1.3, providing a reduction in processing time of over 24%.
Chapter One
Introduction

A search engine is an information retrieval system designed to help in finding files stored on a computer, for example, on a public server on the World Wide Web (or simply the Web), on a server on a private network of computers, or on a stand-alone computer [Bri 98]. The search engine allows us to search the storage media for certain content in the form of text meeting specific criteria (typically those containing a given word or phrase) and to retrieve a list of files that match those criteria. In this work, we are concerned with the type of search engine that is designed to help in finding files stored on the Web (a Web search engine).

Webmasters and content providers began optimizing sites for Web search engines in the mid-1990s, as the first search engines were cataloging the early Web. Initially, all a webmaster needed to do was to submit the address of a page, or the uniform resource locator (URL), to the various engines, which would send a spider to crawl that page, extract links to other pages from it, and return information found on the page to be indexed [Bri 98]. The process involves a search engine crawler downloading a page and storing it on the search engine's own server, where a second program, known as an indexer, extracts various information about the page, such as the words it contains and where they are located, as well as any weight for specific words and all the links the page contains, which are then placed into a scheduler for crawling at a later date [Web 4].

A standard search engine consists of the following main components: Web crawler, document analyzer and indexer, and searching process [Bah 10d]. The main purpose of using a certain data structure for searching is to construct an index that allows focusing the search for a given keyword (query). The improvement in query performance is paid for by the additional space necessary to store the index. Therefore, most of the research in this field has been directed at designing data structures which offer a good trade-off between query and update time versus space usage.

For this reason compression always appears as an attractive choice, if not mandatory. However, space overhead is not the only resource to be optimized when managing large
data collections; in fact, data turn out to be useful only when properly indexed to support search operations that efficiently extract the user-requested information. Approaches that combine compression and indexing techniques are nowadays receiving more and more attention. A first step towards the design of a compressed full-text index is achieving guaranteed performance and lossless data [Fer 01].

The significant increase in CPU speed makes it more economical to store data in compressed form than uncompressed. Storing data in a compressed form may introduce significant improvements in space occupancy and also in processing time, because space optimization is closely related to time optimization in disk memory [Fer 01].

A number of trends have been identified in the literature for building high-performance search engines, such as succinct data structures, compressed full-text self-indexes, query optimization, and high-speed processing and communication systems. Starting from these promising trends, many researchers have tried to combine text compression with indexing techniques and searching algorithms. They have mainly investigated and analyzed the compressed matching problem under various compression schemes [Fer 01].

Due to the rapid growth in the size of the Web, Web search engines are facing enormous performance challenges in terms of: (i) storage capacity, (ii) data retrieval rate, (iii) query processing time, and (iv) communication overhead. The large engines, in particular, have to be able to process tens of thousands of queries per second on tens of billions of documents, making query throughput a critical issue. To satisfy this heavy workload, search engines use a variety of performance optimizations, including index compression. With the tremendous increase in user and application needs, we believe that current search engine models need more retrieval performance, and that more compact and cost-effective systems are still required.

In this work we develop a novel Web search engine model that is based on index-query bit-level compression. The model incorporates two bit-level data compression layers, both
implemented at the back-end processor side: one after the indexer, acting as a second compression layer to generate a double compressed index, and the other one after the query parser, compressing the query to enable bit-level compressed index-query search. As a result, less disk space is required to store the compressed index file, disk I/O overheads are reduced, and consequently a higher retrieval rate or performance is achieved.

An important feature required of the bit-level technique, in order to perform the search process at the compressed index-query level, is that it generates the same compressed binary sequence for the same character in both the search queries and the index files. The data compression technique that satisfies this important feature is the HCDC algorithm [Bah 07b, Bah 08a]; therefore, it is used in this work. Recent investigations on using this algorithm for text compression have demonstrated excellent performance in comparison with many widely-used and well-known data compression algorithms and state-of-the-art tools [Bah 07b, Bah 08a].

1.1 Web Search Engine Model

A Web search engine is an information retrieval system designed to help find files stored on a public server on the Web [Bri 98, Mel 00]. A standard Web search engine consists of the following main components:
• Web crawler
• Document analyzer and indexer
• Searching process

In what follows we provide a brief description of each of the above components.

1.1.1 Web crawler

A Web crawler is a computer program that browses the Web in a methodical, automated manner. Other terms for Web crawlers are ants, automatic indexers, bots, worms, Web spiders, and Web robots. Unfortunately, each spider has its own personal agenda as it indexes a site. Some search engines use the META tag; others may use the META description
of a page, and some use the first sentence or paragraph on the site. This means that a page that ranks highly on one Web search engine may not rank as well on another. Given a set of uniform resource locators (URLs), the crawler repeatedly removes one URL from the set, downloads the targeted page, extracts all the URLs contained in it, and adds all previously unknown URLs to the set [Bri 98, Jun 00].

Web search engines work by storing information about many Web pages, which they retrieve from the Web itself. These pages are retrieved by a spider (a sophisticated Web browser) which follows every link extracted or stored in its database. The contents of each page are then analyzed to determine how it should be indexed; for example, words are extracted from the titles, headings, or special fields called META tags.

1.1.2 Document analyzer and indexer

Indexing is the process of creating an index, which is a specialized file containing a compiled version of the documents retrieved by the spider [Bah 10d]. The indexing process collects, parses, and stores data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, mathematics, informatics, physics, and computer science [Web 5].

The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query. Without an index, the search engine would have to scan every (possible) document on the Internet, which would require considerable time and computing power (impossible with the current Internet size). For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in the documents could take hours. The additional computer storage required to store the index, as well as the considerable increase in the time required for an update to take place, are traded off for the time saved during information retrieval [Web 5].
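Before turning to index design, the crawler behavior described in Section 1.1.1 (repeatedly removing a URL from the frontier, downloading the page, extracting its links, and adding previously unknown URLs back to the set) can be sketched in a few lines. The following Python fragment is only an illustration of that loop, not part of the thesis tooling; the helpers fetch_page and extract_urls are hypothetical and must be supplied by the caller.

    from collections import deque

    def crawl(seed_urls, fetch_page, extract_urls, max_pages=100):
        # Minimal crawler frontier loop: remove one URL, download the page,
        # extract its links, and add previously unknown URLs to the frontier.
        frontier = deque(seed_urls)      # URLs waiting to be visited
        seen = set(seed_urls)            # URLs already discovered
        pages = {}                       # url -> downloaded content
        while frontier and len(pages) < max_pages:
            url = frontier.popleft()             # remove one URL from the set
            html = fetch_page(url)               # download the targeted page
            pages[url] = html
            for link in extract_urls(html):      # extract all contained URLs
                if link not in seen:             # keep only unknown URLs
                    seen.add(link)
                    frontier.append(link)
        return pages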
Index design factors

Major factors should be carefully considered when designing a search engine; these include [Bri 98, Web 5]:

• Merge factors: how data enters the index, or how words or subject features are added to the index during text corpus traversal, and whether multiple indexers can work asynchronously. The indexer must first check whether it is updating old content or adding new content. Traversal typically correlates to the data collection policy. Search engine index merging is similar in concept to the SQL MERGE command and other merge algorithms.
• Storage techniques: how to store the index data, that is, whether information should be compressed or filtered.
• Index size: how much computer storage is required to support the index.
• Lookup speed: how quickly a word can be found in the index. The speed of finding an entry in a data structure, compared with how quickly it can be updated or removed, is a central focus of computer science.
• Maintenance: how the index is maintained over time.
• Fault tolerance: how important it is for the service to be robust. Issues include dealing with index corruption, determining whether bad data can be treated in isolation, dealing with bad hardware, partitioning schemes such as hash-based or composite partitioning, as well as replication.

Index data structures

Search engine architectures vary in the way indexing is performed and in methods of index storage to meet the various design factors. There are many architectures for indexes, and the most widely used is the inverted index. An inverted index saves a list of occurrences of every keyword, typically in the form of a hash table or binary tree [Bah 10c].
During indexing, several processes take place; here, only the processes related to our work are discussed. These processes may or may not be used, depending on the search engine configuration [Bah 10d].

• Extract URLs: a process of extracting all URLs from the document being indexed. It is used to guide crawling of the website, do link checking, build a site map, and build a table of internal and external links from the page.
• Code stripping: a process of removing hyper-text markup language (HTML) tags, scripts, and styles, and decoding HTML character references and entities used to embed special characters.
• Language recognition: a process by which a computer program attempts to automatically identify, or categorize, the language or languages in which a document is written.
• Document tokenization: a process of detecting the encoding used for the page; determining the language of the content (some pages use multiple languages); finding word, sentence, and paragraph boundaries; combining multiple adjacent words into one phrase; and changing the case of text.
• Document parsing or syntactic analysis: the process of analyzing a sequence of tokens (for example, words) to determine their grammatical structure with respect to a given (more or less) formal grammar.
• Lemmatization/stemming: the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, generally a written word form. This stage can be done in the indexing and/or searching stage. The stem does not need to be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. The process is useful in search engines for query expansion or indexing and other natural language processing problems.
• Normalization: the process by which text is transformed in some way to make it consistent in a way which it might not have been before. Text normalization is often performed before text is processed in some way, such as generating synthesized speech, automated language translation, storage in a database, or comparison.

Inverted Index

The inverted index structure is widely used in modern, super-fast Web search engines like Google, Yahoo!, Lucene, and other major search engines. An inverted index (also referred to as a postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents. The main purpose of using the inverted index is to allow fast full-text searches, at a cost of increased processing when a document is added to the index [Bri 98, Nag 02, Web 4]. The inverted index is one of the most used data structures in information retrieval systems [Web 4, Bri 98].

There are two main variants of inverted indexes [Bae 99]:
(1) A record-level inverted index (or inverted file index, or just inverted file) contains a list of references to documents for each word; we use this simple type in our search engine.
(2) A word-level inverted index (or full inverted index, or inverted list) additionally contains the positions of each word within a document; these positions can be used to rank the results according to document relevancy to the query.

The latter form offers more functionality (like phrase searches), but needs more time and space to be created. In order to simplify the understanding of the above two inverted indexes, let us consider the following example.
Example

Let us consider a case in which six documents have the text shown in Table (1.1). The contents of a record-level and a word-level inverted index are shown in Table (1.2).

Table (1.1) Document ID and its contents.

    Document ID    Text
    1              Aqaba is a hot city
    2              Amman is a cold city
    3              Aqaba is a port
    4              Amman is a modern city
    5              Aqaba in the south
    6              Amman in Jordan

Table (1.2) Record-level and word-level inverted indexes for the documents in Table (1.1).

    Text      Record-level (documents)    Word-level (document:location)
    Aqaba     1, 3, 5                     1:1, 3:1, 5:1
    is        1, 2, 3, 4                  1:2, 2:2, 3:2, 4:2
    a         1, 2, 3, 4                  1:3, 2:3, 3:3, 4:3
    hot       1                           1:4
    city      1, 2, 4                     1:5, 2:5, 4:5
    Amman     2, 4, 6                     2:1, 4:1, 6:1
    cold      2                           2:4
    the       5                           5:3
    modern    4                           4:4
    south     5                           5:4
    in        5, 6                        5:2, 6:2
    Jordan    6                           6:3
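As a quick illustration of how the record-level and word-level indexes of Table (1.2) can be produced, the following Python sketch indexes the six documents of Table (1.1) and looks up the word "Amman". It is an illustrative aid only and is not the CIQTT implementation described later in this thesis.

    from collections import defaultdict

    documents = {
        1: "Aqaba is a hot city",
        2: "Amman is a cold city",
        3: "Aqaba is a port",
        4: "Amman is a modern city",
        5: "Aqaba in the south",
        6: "Amman in Jordan",
    }

    record_index = defaultdict(set)    # word -> set of document IDs
    word_index = defaultdict(list)     # word -> list of (document ID, position)

    for doc_id, text in documents.items():
        for position, word in enumerate(text.split(), start=1):
            record_index[word].add(doc_id)
            word_index[word].append((doc_id, position))

    print(sorted(record_index["Amman"]))   # [2, 4, 6]
    print(word_index["Amman"])             # [(2, 1), (4, 1), (6, 1)]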
When we search for the word "Amman", we get three results, which are documents 2, 4, and 6 if a record-level inverted index is used, and 2:1, 4:1, 6:1 if a word-level inverted index is used. In this work, the record-level inverted index is used for its simplicity and because we do not need to rank our results.

1.1.3 Searching process

When the index is ready, searching can be performed through the query interface: a user enters a query into the search engine (typically by using keywords), and the engine examines its index and provides a listing of the best-matching Web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text [Bah 10d].

In this stage the results are ranked, where ranking is a relationship between a set of items such that, for any two items, the first is either "ranked higher than", "ranked lower than", or "ranked equal" to the second. In mathematics, this is known as a weak order or total pre-order of objects. It is not necessarily a total order of documents, because two different documents can have the same ranking. Ranking is done according to document relevancy to the query, freshness, and popularity [Bri 98]. Figure (1.1) outlines the architecture and main components of the standard search engine model.
Figure (1.1). Architecture and main components of the standard search engine model.

1.2 Challenges to Web Search Engines

Building and operating a large-scale Web search engine used by hundreds of millions of people around the world presents a number of interesting challenges [Hen 03, Hui 09, Ois 10, Ami 05]. Designing such systems requires making complex design trade-offs in a number of dimensions, and the main challenges to designing an efficient, effective, and reliable Web search engine are:

• The Web is growing much faster than any present-technology search engine can possibly index.
• The cost of index storage, which includes the data storage cost, electricity, and cooling of the data center.
• The real-time Web, which is updated in real time, requires a fast and reliable crawler that then indexes this content to make it searchable.
• Many Web pages are updated frequently, which forces the search engine to revisit them periodically.
• Query time (latency): the need to keep up with the increase in index size and to perform the query and show the results in less time.
• Most search engines use keywords for searching, and this limits the results to text pages only.
• Dynamically generated sites, which may be slow or difficult to index, or may result in excessive results from a single site.
• Many dynamically generated sites are not indexable by search engines; this phenomenon is known as the invisible Web.
• Several content types are not crawlable and indexable by search engines, such as multimedia and Flash content.
• Some sites use tricks to manipulate the search engine into displaying them as the first result returned for some keywords; this is known as spamming. It can lead to some search results being polluted, with more relevant links being pushed down in the result list.
• Duplicate hosts: Web search engines try to avoid having duplicate and near-duplicate pages in their collection, since such pages increase the time it takes to add useful content to the collection.
• Web graph modeling: the open problem is to come up with a random graph model that models the behavior of the Web graph at the page and host level.
• Scalability: search engine technology should scale dramatically to keep up with the growth of the Web.
• Reliability: the search engine requires reliable technology to support its 24-hour operation and meet users' needs.
1.3 Data Compression Techniques

This section presents the definition, models, classification methodologies and classes, and performance evaluation measures of data compression algorithms. Further details on data compression can be found in [Say 00].

1.3.1 Definition of data compression

Data compression algorithms are designed to reduce the size of data so that it requires less disk space for storage and less memory [Say 00]. Data compression is usually obtained by substituting a shorter symbol for an original symbol in the source data, containing the same information but with a smaller representation in length. The symbols may be characters, words, phrases, or any other unit that may be stored in a dictionary of symbols and processed by a computing system.

A data compression algorithm usually utilizes an efficient algorithmic transformation of the data representation to produce a more compact representation. Such an algorithm is also known as an encoding algorithm. It is important to be able to restore the original data back, either in an exact or an approximate form; this is done by a data decompression algorithm, also known as a decoding algorithm.

1.3.2 Data compression models

A number of data compression algorithms have been developed throughout the years. These algorithms can be categorized into four major categories of data compression models [Rab 08, Hay 08, Say 00]:

1. Substitution data compression model
2. Statistical data compression model
3. Dictionary-based data compression model
4. Bit-level data compression model
In substitution compression techniques, a shorter representation is used to replace a sequence of repeating characters. Examples of substitution data compression techniques include: null suppression [Pan 00], run length encoding [Smi 97], bit mapping, and half-byte packing [Pan 00].

In statistical techniques, the characters in the source file are converted to a binary code, where the most common characters in the file have the shortest binary codes and the least common have the longest; the binary codes are generated based on the estimated probability of the character within the file. Then, the binary coded file is compressed using an 8-bit character wordlength, or by applying the adaptive character wordlength (ACW) algorithm [Bah 08b, Bah 10a], or its variation, the ACW(n,s) scheme [Bah 10a]. Examples of statistical data compression techniques include: Shannon-Fano coding [Rue 06], static/adaptive/semi-adaptive Huffman coding [Huf 52, Knu 85, Vit 89], and arithmetic coding [How 94, Wit 87].

Dictionary-based data compression techniques involve the substitution of sub-strings of text by indices or pointer codes, relative to a dictionary of the sub-strings, such as Lempel-Ziv-Welch (LZW) [Ziv 78, Ziv 77, Nel 89]. Many compression algorithms use a combination of different data compression techniques to improve compression ratios.

Finally, since data files can be represented in binary digits, bit-level processing can be performed to reduce the size of the data. A data file can be represented in binary digits by concatenating the binary sequences of the characters within the file using a specific mapping or coding format, such as ASCII codes, Huffman codes, adaptive codes, etc. The coding format has a huge influence on the entropy of the generated binary sequence and consequently on the compression ratio (C) or the coding rate (Cr) that can be achieved.

The entropy is a measure of the information content of a message and the smallest number of bits per character needed, on average, to represent the message. Therefore, the entropy of a complete message is the sum of the individual characters' entropy. The entropy of a character (symbol) is represented as the negative logarithm of its probability, expressed using base two.
When the probability of each symbol of the alphabet is constant, the entropy is calculated as [Bel 89, Bel 90]:

    E = − Σ_{i=1}^{n} p_i log2(p_i)                                  (1.1)

where E is the entropy in bits, p_i is the estimated probability of occurrence of character (symbol) i, and n is the number of characters. In bit-level processing, n is equal to 2, as we have only two characters (0 and 1).

In bit-level data compression algorithms, the binary sequence is usually divided into groups of bits that are called minterms, blocks, subsequences, etc. In this work we use the term "blocks" to refer to each group of bits. These blocks might be considered as representing a Boolean function. Then, algebraic simplifications are performed on these Boolean functions to reduce the size or the number of blocks, and hence the number of bits representing the output (compressed) data is reduced as well. Examples of such algorithms include: the Hamming code data compression (HCDC) algorithm [Bah 07b, Bah 08a], the adaptive HCDC(k) scheme [Bah 07a, Bah 10b, Rab 08], the adaptive character wordlength (ACW) algorithm [Bah 08b, Bah 10a], the ACW(n,s) scheme [Bah 10a], the Boolean functions algebraic simplifications algorithm [Nof 07], the fixed-length Hamming (FLH) algorithm [Sha 04], and the neural network based algorithm [Mah 00].
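As a concrete illustration of Eqn. (1.1), the short Python function below estimates the symbol probabilities of a string from its character frequencies and computes the zero-order entropy. It is a hypothetical helper for illustration only and is not part of the thesis tooling.

    import math
    from collections import Counter

    def entropy(data):
        # Zero-order entropy E = -sum_i p_i * log2(p_i), as in Eqn. (1.1),
        # with p_i estimated from the symbol frequencies in the input.
        counts = Counter(data)
        total = len(data)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    print(round(entropy("Aqaba is a hot city"), 3))  # bits per character
    # For a binary sequence the alphabet is {0, 1}, i.e. n = 2 in Eqn. (1.1):
    print(entropy("0100100111"))                     # 1.0 (0s and 1s equally likely)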
1.3.3 Classification of data compression algorithms

Data compression algorithms are categorized by several characteristics, such as:
• Data compression fidelity
• Length of data compression symbols
• Data compression symbol table
• Data compression processing time

In what follows, a brief definition is given for each of the above classification criteria.

Data compression fidelity

Basically, data compression can be classified into two fundamentally different styles, depending on the fidelity of the restored data:

(1) Lossless data compression algorithms

In lossless data compression, a transformation of the representation of the original data set is performed such that it is possible to reproduce exactly the original data set by performing a decompression transformation. This type of compression is usually used for compressing text files, executable code, word processing files, database files, tabulation files, and whenever the original needs to be exactly restored from the compressed data.

Many popular data compression applications have been developed utilizing lossless compression algorithms; for example, lossless compression algorithms are used in the popular ZIP file format and in the UNIX tool gzip. Lossless compression is mainly used for text and executable files, as in such files the data must be exactly retrieved, otherwise it is useless. It is also used as a component within lossy data compression technologies. It can usually achieve a compression ratio in the range of 2:1 to 8:1.

(2) Lossy data compression algorithms

In lossy data compression, a transformation of the representation of the original data set is performed such that an exact representation of the original data set cannot be reproduced; instead, an approximate representation is reproduced by performing a decompression transformation.
Lossy data compression is used in applications wherein an exact representation of the original data is not necessary, such as streaming multimedia on the Internet, telephony and voice applications, and some image file formats. Lossy compression can provide higher compression ratios, of 100:1 to 200:1, depending on the type of information being compressed. In addition, a higher compression ratio can be achieved if more errors are allowed to be introduced into the original data [Lel 87].

Length of data compression symbols

Data compression algorithms can be classified, depending on the length of the symbols the algorithm can process, into fixed and variable length, regardless of whether the algorithm uses variable-length symbols in the original data, in the compressed data, or both. For example, run-length encoding (RLE) uses fixed-length symbols in both the original and the compressed data. Huffman encoding uses variable-length compressed symbols to represent fixed-length original symbols. Other methods compress variable-length original symbols into fixed-length or variable-length compressed data.

Data compression symbol table

Data compression algorithms can be classified as either static, adaptive, or semi-adaptive data compression algorithms [Rue 06, Pla 06, Smi 97]. In static compression algorithms, the encoding process is fixed regardless of the data content, while in adaptive algorithms the encoding process is data dependent. In semi-adaptive algorithms, the data to be compressed are first analyzed in their entirety, an appropriate model is then built, and afterwards the data is encoded. The model is stored as part of the compressed data, as it is required by the decompressor to reverse the compression.

Data compression/decompression processing time

Data compression algorithms can be classified according to the compression/decompression processing time as symmetric or asymmetric algorithms. In symmetric
algorithms, the compression and decompression processing times are almost the same, while for asymmetric algorithms the compression time is normally much greater than the decompression processing time [Pla 06].

1.3.4 Performance evaluation parameters

In order to be able to compare the efficiency of the different compression techniques reliably, without allowing extreme cases to cloud or bias a technique unfairly, certain issues need to be considered. The most important issues that need to be taken into account in evaluating the performance of the various algorithms include [Say 00]:
(1) Measuring the amount of compression
(2) Compression/decompression time (algorithm complexity)

These issues need to be carefully considered in the context in which the compression algorithm is used. Practically, things like finite memory, error control, type of data, and compression style (adaptive/dynamic, semi-adaptive, or static) are also factors that should be considered when comparing the different data compression algorithms.

(1) Measuring the amount of compression

Several parameters are used to measure the amount of compression that can be achieved by a particular data compression algorithm, such as:
(i) Compression ratio (C)
(ii) Reduction ratio (Rs)
(iii) Coding rate (Cr)
Definitions of these parameters are given below.

(i) Compression ratio (C)

The compression ratio (C) is defined as the ratio between the size of the data before compression and the size of the data after compression. It is expressed as:

    C = So / Sc                                                      (1.1)

where So is the size of the original (uncompressed) data and Sc is the size of the compressed data.

(ii) Reduction ratio (Rs)

The reduction ratio (Rs) represents the ratio of the difference between the size of the original data (So) and the size of the compressed data (Sc) to the size of the original data. It is usually given as a percentage and is mathematically expressed as:

    Rs = (So − Sc) / So                                              (1.2)

    Rs = (1 − Sc/So) × 100%                                          (1.3)

(iii) Coding rate (Cr)

The coding rate (Cr) expresses the same concept as the compression ratio, but it relates the ratio to a more tangible quantity. For example, for a text file, the coding rate may be expressed in "bits/character" (bpc), where an uncompressed text file has a coding rate of 7 or 8 bpc. In addition, the coding rate of an audio stream may be expressed in "bits/analogue sample", and for still image compression the coding rate is expressed in "bits/pixel". In general, the coding rate can be expressed mathematically as:
    Cr = (q · Sc) / So                                               (1.4)

where q is the number of bits representing each symbol in the uncompressed file. The relationship between the coding rate (Cr) and the compression ratio (C), for example for a text file originally using 7 bpc, is given by:

    Cr = 7 / C                                                       (1.5)

It can be clearly seen from Eqn. (1.5) that a lower coding rate indicates a higher compression ratio.

(2) Compression/decompression time (algorithm complexity)

The compression/decompression time (which is an indication of the algorithm complexity) is defined as the processing time required to compress or decompress the data. The compression and decompression times have to be evaluated separately. As discussed in Section 1.3.3, data compression algorithms are classified according to the compression/decompression time into either symmetric or asymmetric algorithms.

In this context, data storage applications are mainly concerned with the amount of compression that can be achieved and with the decompression processing time required to retrieve the data (asymmetric algorithms), since in such applications the compression is performed only once or infrequently. Data transmission applications focus predominantly on reducing the amount of data to be transmitted over communication channels, and both compression and decompression processing times matter at the respective junctions or nodes (symmetric algorithms) [Liu 05].

For a fair comparison between the different available algorithms, it is important to consider both the amount of compression and the processing time. Therefore, it would be useful to be able to parameterize the algorithm such that the compression ratio and processing time could be optimized for a particular application.
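The three measures above are straightforward to compute. The small Python helper below evaluates the compression ratio, reduction ratio, and coding rate for a file compressed from So to Sc bytes; it is an illustrative sketch, and the example sizes are arbitrary.

    def compression_metrics(so, sc, q=8):
        # Compression ratio C = So/Sc, reduction ratio Rs = (So - Sc)/So,
        # and coding rate Cr = q*Sc/So (bits per symbol), as defined above.
        c = so / sc
        rs = (so - sc) / so          # equivalently 1 - 1/C
        cr = q * sc / so             # equivalently q / C
        return c, rs, cr

    c, rs, cr = compression_metrics(so=130_000, sc=100_000)   # arbitrary sizes
    print(f"C  = {c:.2f}")       # 1.30
    print(f"Rs = {rs:.1%}")      # 23.1%
    print(f"Cr = {cr:.2f} bpc")  # 6.15 bits per character for q = 8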
There are extreme cases where data compression works very well and other conditions where it is inefficient; the type of data the original file contains and the upper limits on processing time have an appreciable effect on the efficiency of the technique selected. Therefore, it is important to select the most appropriate technique for a particular data profile in terms of both data compression and processing time [Rue 06].

1.4 Current Trends in Building High-Performance Web Search Engine

There are several major trends that can be identified in the literature for building high-performance Web search engines. A list of these trends is given below, and further discussion is given in Chapter 2; these trends include:
(1) Succinct data structure
(2) Compressed full-text self-index
(3) Query optimization
(4) Efficient architectural design
(5) Scalability
(6) Semantic search engine
(7) Using social networks
(8) Caching

1.5 Statement of the Problem

Due to the rapid growth in the size of the Web, Web search engines are facing enormous performance challenges in terms of storage capacity, data retrieval rate, query processing time, and communication overhead. Large search engines, in particular, have to be able to process tens of thousands of queries per second on tens of billions of documents, making query throughput a critical issue. To satisfy this heavy workload, search engines use a variety of performance optimization techniques, including index compression; and some
obvious solutions to these issues are to develop more succinct data structures, compressed indexes, query optimization, and higher-speed processing and communication systems.

We believe that the current search engine model cannot meet user and application needs, and that more retrieval performance and more compact and cost-effective systems are still required. The main contribution of this thesis is to develop a novel Web search engine model that is based on index-query compression; therefore, it is referred to as the CIQ Web search engine model, or simply the CIQ model. The model incorporates two bit-level compression layers, both implemented at the back-end processor side: one after the indexer, acting as a second compression layer to generate a double compressed index, and the other one after the query parser, compressing the query to enable bit-level compressed index-query search. As a result, less disk space is required for storing the index file, disk I/O overheads are reduced, and consequently a higher retrieval rate or performance is achieved.

1.6 Objectives of this Thesis

The main objectives of this thesis can be summarized as follows:
• Develop a new Web search engine model that is as accurate as the current Web search engine model, requires less disk space for storing index files, performs the search process faster than current models, reduces disk I/O overheads, and consequently provides a higher retrieval rate or performance.
• Modify the HCDC algorithm to meet the requirements of the new CIQ model.
• Study and optimize the statistics of the inverted index files to achieve the maximum possible performance (compression ratio and minimum searching time).
• Validate the searching accuracy of the new CIQ Web search engine model.
• Evaluate and compare the performance of the new Web search engine model in terms of disk space requirements and query processing time (searching time) for different search scenarios.
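As a toy illustration of the compressed index-query idea stated in Section 1.5 (matching a compressed query directly against a compressed index, with no decompression step), the following Python sketch encodes both sides with the same fixed-length 5-bit character code. This is only a conceptual sketch under that simplifying assumption; the HCDC algorithm actually used in this thesis is block-based and is described in Chapter 3.

    # Toy fixed-length character code (5 bits per symbol); NOT the HCDC algorithm.
    ALPHABET = "abcdefghijklmnopqrstuvwxyz "
    CODE = {ch: format(i, "05b") for i, ch in enumerate(ALPHABET)}

    def compress(text):
        # The same code is applied to index entries and to queries, so identical
        # characters always produce identical bit patterns on both sides.
        return "".join(CODE[ch] for ch in text.lower())

    index_entry = "amman aqaba jordan"        # an uncompressed index line
    query = "aqaba"

    compressed_index = compress(index_entry)
    compressed_query = compress(query)

    # Search at the compressed level: compare bit patterns on 5-bit boundaries.
    hit = any(
        compressed_index[i:i + len(compressed_query)] == compressed_query
        for i in range(0, len(compressed_index) - len(compressed_query) + 1, 5)
    )
    print(hit)  # True: "aqaba" is found without decompressing the index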
1.7 Organization of this Thesis

This thesis is organized into five chapters. Chapter 1 provides an introduction to the general domain of this thesis. The rest of the thesis is organized as follows: Chapter 2 presents a literature review and summarizes some of the previous work related to Web search engines, in particular work related to enhancing the performance of Web search engines through data compression at different levels.

Chapter 3 describes the concept, methodology, and implementation of the novel CIQ Web search engine model. It also includes a detailed description of the HCDC algorithm and the modifications implemented to meet the needs of the new application.

Chapter 4 presents a description of a number of scenarios simulated to evaluate the performance of the new Web search engine model. The effect of index file size on the performance of the new model is investigated and discussed. Finally, in Chapter 5, based on the results obtained from the different simulations, conclusions are drawn and recommendations for future work are pointed out.
Chapter Two
Literature Review

This work is concerned with the development of a novel high-performance Web search engine model that is based on compressing the index files and search queries using a bit-level data compression technique, namely, the novel Hamming codes based data compression (HCDC) algorithm [Bah 07b, Bah 08a]. In this model the search process is performed at the compressed index-query level. The model produces a double compressed index file, which consequently requires less disk space to store the index files and reduces communication time; compressing the search query, on the other hand, reduces I/O overheads and increases the retrieval rate.

This chapter presents a literature review, which is divided into three sections. Section 2.1 presents a brief definition of the current trends towards enhancing the performance of Web search engines. Then, in Sections 2.2 and 2.3, we present a review of some of the most recent and related work on Web search engines and on bit-level data compression algorithms, respectively.

2.1 Trends Towards High-Performance Web Search Engine

Chapter 1 lists several major trends that can be identified in the literature for building high-performance Web search engines. In what follows, we provide a brief definition of each of these trends.

2.1.1 Succinct data structure

Recent years have witnessed an increasing interest in succinct data structures. Their aim is to represent the data using as little space as possible, while still answering queries on the represented data efficiently. Several results exist on the representation of sequences [Fer 07, Ram 02], trees [Far 05], graphs [Mun 97], permutations and functions [Mun 03], and texts [Far 05, Nav 04].

One of the most basic structures, which lies at the heart of the representation of more complex ones, is the binary sequence with rank and select queries. Given a binary sequence
S = s1 s2 … sn, Rankc(S, q) denotes the number of times the bit c appears in the prefix S[1, q] = s1 s2 … sq, and Selectc(S, q) denotes the position in S of the q-th occurrence of bit c. The best results answer these queries in constant time, retrieve any sq in constant time, and occupy nH0(S)+O(n) bits of storage, where H0(S) is the zero-order empirical entropy of S. This space bound includes that for representing S itself, so the binary sequence is represented in compressed form while still allowing these queries to be answered optimally [Ram 02].

For the general case of sequences over an arbitrary alphabet of size r, the only known result is the one in [Gro 03], which still achieves nH0(S)+O(n) space occupancy. The data structure in [Gro 03] is the elegant wavelet tree; it takes O(log r) time to answer Rankc(S, q) and Selectc(S, q) queries and to retrieve any character sq.
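To make the Rank and Select operations concrete, the following minimal Python sketch implements them by a direct linear scan over a binary sequence; succinct data structures answer the same queries in constant time within nH0(S)+O(n) bits, as noted above, so this naive version is for illustration only.

    # Naive reference implementation of Rank and Select over a binary sequence.
    def rank(S, c, q):
        # Number of occurrences of bit c in the prefix S[1..q] (1-based).
        return S[:q].count(c)

    def select(S, c, q):
        # Position (1-based) of the q-th occurrence of bit c in S, or -1 if none.
        count = 0
        for i, bit in enumerate(S, start=1):
            if bit == c:
                count += 1
                if count == q:
                    return i
        return -1

    S = [1, 0, 1, 1, 0, 1, 0, 0, 1]
    print(rank(S, 1, 5))    # 3: three 1-bits appear in S[1..5]
    print(select(S, 0, 2))  # 5: the second 0-bit is at position 5

A wavelet tree generalizes exactly these two operations to sequences over an alphabet of size r at O(log r) cost per query.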
2.1.2 Compressed full-text self-index

A compressed full-text self-index [Nav 07] represents a text in compressed form and still answers queries efficiently. This represents a significant advancement over the full-text indexing techniques of the previous decade, whose indexes required several times the size of the text. Although it is relatively new, this algorithmic technology has matured to a point where theoretical research is giving way to practical developments. Nonetheless, digging into the research results requires significant programming skills, a deep engineering effort, and a strong algorithmic background. To date, only isolated implementations and focused comparisons of compressed indexes have been reported, and they lacked a common API, which prevented their re-use or deployment within other applications.

2.1.3 Query optimization

Query optimization is an important skill for search engine developers and database administrators (DBAs). In order to improve the performance of search queries, developers and DBAs need to understand the query optimizer and the techniques it uses to select an access path and prepare a query execution plan. Query tuning involves knowledge of techniques such as cost-based and heuristic-based optimizers, plus the tools a search platform provides for explaining a query execution plan [Che 01].

2.1.4 Efficient architectural design

Answering a large number of queries per second on a huge collection of data requires the equivalent of a small supercomputer, and all current major engines are based on large clusters of servers connected by high-speed local area networks (LANs) or storage area networks (SANs). There are two basic ways to partition an inverted index structure over the nodes, as illustrated in the sketch below:

   • A local index organization, where each node builds a complete index on its own subset of documents (used by AltaVista and Inktomi).
   • A global index organization, where each node contains complete inverted lists for a subset of the words.

Each scheme has advantages and disadvantages that we do not have space to discuss here; further discussion can be found in [Bad 02, Mel 00].
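The following toy Python sketch contrasts the two organizations on a tiny in-memory inverted index; the documents, terms, and the two-node cluster are invented purely for illustration and do not correspond to any of the systems mentioned above.

    # Toy illustration of local (document-partitioned) versus global
    # (term-partitioned) inverted index organizations over two nodes.
    from collections import defaultdict

    docs = {1: "amman is the capital", 2: "web search engine", 3: "search the web"}
    NODES = 2

    def build_index(doc_subset):
        index = defaultdict(list)
        for doc_id, text in doc_subset.items():
            for term in set(text.split()):
                index[term].append(doc_id)
        return dict(index)

    # Local organization: each node indexes only its own documents,
    # so a query must be broadcast to every node.
    local = [build_index({d: t for d, t in docs.items() if d % NODES == n})
             for n in range(NODES)]

    # Global organization: one full index is built, and each node stores the
    # complete inverted lists for a subset of the terms, so a query term is
    # routed to the single node responsible for it.
    full = build_index(docs)
    global_partition = [{t: plist for t, plist in full.items() if hash(t) % NODES == n}
                        for n in range(NODES)]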
2.1.5 Scalability

Search engine technology should scale in a dramatic way to keep up with the growth of the Web [Bri 98]. In 1994, one of the first Web search engines, the World Wide Web Worm (WWWW), had an index of 110,000 pages [Mcb 94]. At the end of 1997, the top search engines claimed to index from 2 million (WebCrawler) to 100 million Web documents [Bri 98]. In 2005, Google claimed to index 1.2 billion pages (as shown on the Google home page), and in July 2008 Google announced that it had hit a new milestone: 1 trillion (as in 1,000,000,000,000) unique URLs on the Web at once [Web 2].

At the same time, the number of queries search engines handle has grown rapidly too. In March and April 1994, the WWWW received an average of about 1500 queries per day. In November 1997, AltaVista claimed it handled roughly 20 million queries per day. With the increasing number of users on the Web, and automated systems which query search engines, Google handled hundreds of millions of queries per day in 2000 and about 3 billion queries per day in 2009, while Twitter handled about 635 million queries per day [Web 1].

Creating a Web search engine which scales even to today's Web presents many challenges. Fast crawling technology is needed to gather the Web documents and keep them up to date. Storage space must be used efficiently to store indexes and, optionally, the documents themselves as cached pages. The indexing system must process hundreds of gigabytes of data efficiently. Queries must be handled quickly, at a rate of hundreds to thousands per second.

2.1.6 Semantic search engine

The semantic Web is an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work together in cooperation [Guh 03]. It is the idea of having data on the Web defined and linked in a way that allows it to be used for more effective discovery, automation, integration, and reuse across various applications.

In particular, the semantic Web will contain resources corresponding not just to media objects (such as Web pages, images, audio clips, etc.), as the current Web does, but also to objects such as people, places, organizations and events. Further, the semantic Web will contain not just a single kind of relation (the hyperlink) between resources, but many different kinds of relations between the different types of resources mentioned above [Guh 03].

Semantic search attempts to augment and improve traditional search results (based on information retrieval technology) by using data from the semantic Web to produce precise answers to user queries. This can be done by taking advantage of the availability of explicit semantics of information in the context of the semantic Web search engine [Lei 06].

2.1.7 Using Social Networks

There is an increasing interest in social networks. In general, recent studies suggest that a person's social network has a significant impact on his or her information acquisition [Kir 08].
It is an ongoing trend that people increasingly reveal very personal information on social network sites in particular and on the Web in general. As this information becomes more and more publicly available from these various social network sites and the Web in general, the social relationships between people can be identified, which in turn enables the automatic extraction of social networks. This trend is furthermore driven and reinforced by recent initiatives such as Facebook Connect, MySpace Data Availability, and Google Friend Connect, which make their social network data available to anyone [Kir 08].

Combining social network data with search engine technology, to improve the relevancy of the results to the users and to increase the sociality of the results, is one of the trends currently pursued by search engines such as Google and Bing. Microsoft and Facebook have announced a new partnership that brings Facebook data and profile search to Bing. The deal marks a big leap forward in social search and also represents a new advantage for Bing [Web 3].

2.1.8 Caching

Popular Web search engines receive around hundreds of millions of queries per day and, for each search query, return one or more result pages to the user who submitted the query. The user may request additional result pages for the same query, submit a new query, or quit the search process altogether. An efficient scheme for caching query result pages may enable search engines to lower their response time and reduce their hardware requirements [Lem 04].

Studies have shown that a small set of popular queries accounts for a significant fraction of the query stream. These statistical properties of the query stream seem to call for the caching of search results [Sar 01].
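As a concrete illustration of this trend, the short Python sketch below places a small least-recently-used (LRU) cache in front of the result-page computation; the cache capacity and the backend_search() stub are assumptions made only for this example and are not taken from any of the systems cited above.

    # Minimal query-result cache: popular queries are answered from memory,
    # so the inverted index is consulted only on a cache miss.
    from collections import OrderedDict

    def backend_search(query):
        # Stand-in for the real query processor that accesses the index.
        return "result page for '%s'" % query

    class QueryResultCache:
        def __init__(self, capacity=1000):
            self.capacity = capacity
            self.cache = OrderedDict()

        def get_results(self, query):
            if query in self.cache:
                self.cache.move_to_end(query)     # mark as recently used
                return self.cache[query]          # cache hit
            results = backend_search(query)       # cache miss
            self.cache[query] = results
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)    # evict least recently used entry
            return results

    cache = QueryResultCache(capacity=2)
    cache.get_results("amman hotels")
    cache.get_results("web search")
    cache.get_results("amman hotels")             # now served from the cache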
2.2 Recent Research on Web Search Engine

E. Moura et al. [Mou 97] presented a technique to build an index based on suffix arrays for compressed texts. They developed a compression scheme for textual databases, based on words, that generates a compression code preserving the lexicographical ordering of the text words. As a consequence, it permits sorting the compressed strings to generate the suffix array without decompressing. Their results demonstrated that, since the compressed text is under 30% of the size of the original text, they were able to build the suffix array twice as fast on the compressed text. The compressed text plus index is 55-60% of the size of the original text plus index, and search times were reduced to approximately half. They presented analytical and experimental results for different variations of the word-oriented compression paradigm.

S. Varadarajan and T. Chiueh [Var 97] described a text search engine called the shrink and search engine (SASE), which operates in the compressed domain. It provides an exact search mechanism using an inverted index and an approximate search mechanism using a vantage point tree. SASE allows a flexible trade-off between search time and the storage space required to maintain the search indexes. The experimental results showed that the compression efficiency is within 7-17% of GZIP, which is one of the best lossless compression utilities. The sum of the compressed file size and the inverted indexes is only between 55-76% of the original database, while the search performance is comparable to a fully inverted index.

S. Brin and L. Page [Bri 98] presented the Google search engine, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. They provided an in-depth description of this large-scale Web search engine. Apart from the problems of scaling traditional search techniques to data of such magnitude, there are many other technical challenges, such as the use of the additional information present in hypertext to produce better search results. In their work they addressed the question of how to build a practical large-scale system that can exploit the additional information present in hypertext.

E. Moura et al. [Mou 00] presented a fast compression and decompression technique for natural language texts. The novelties are that (i) decompression of arbitrary portions of the text can be done very efficiently, (ii) exact search for words and phrases can be done on the compressed text directly by using any known sequential pattern matching
algorithm, and (iii) word-based approximate and extended search can be done efficiently without any decoding. The compression scheme uses a semi-static word-based model and a Huffman code where the coding alphabet is byte-oriented rather than bit-oriented.

N. Fuhr and N. Govert [Fuh 02] investigated two different approaches for reducing the index space of inverted files for XML documents. First, they considered methods for compressing index entries. Second, they developed the new XS tree data structure, which contains the structural description of a document in a rather compact form, such that these descriptions can be kept in main memory. Experimental results on two large XML document collections show that very high compression rates for indexes can be achieved, but any compression increases retrieval time.

A. Nagarajarao et al. [Nag 02] implemented an inverted index as part of a mass collaboration system. It provides the facility to search for documents that satisfy a given query. It also supports incremental updates, whereby documents can be added without re-indexing, and the index can be queried even while updates are being applied to it. Further, querying can be done in two modes: a normal mode that can be used when an immediate response is required, and a batched mode that can provide better throughput at the cost of increased response time for some requests. The batched mode may be useful in an alert system where some of the queries can be scheduled. They implemented generators to produce large data sets that they used as benchmarks, and they tested their inverted index with data sets on the order of gigabytes to ensure scalability.

R. Grossi et al. [Gro 03] presented a novel implementation of compressed suffix arrays exhibiting new tradeoffs between search time and space occupancy for a given text (or sequence) of n symbols over an alphabet α, where each symbol is encoded by log|α| bits. They showed that compressed suffix arrays use just nHh + O(n log log n / log_|α| n) bits, while retaining full-text indexing functionalities, such as searching any pattern sequence of length m in O(m log|α| + polylog(n)) time. The term Hh ≤ log|α| denotes the hth-order empirical entropy of the text, which means that the index is nearly optimal in space apart from lower-order terms, achieving asymptotically the empirical entropy of the text (with a multiplicative constant of 1). If the text is highly compressible, so that Hh = O(1)
and the alphabet size is small, they obtained a text index with O(m) search time that requires only O(n) bits.

X. Long and T. Suel [Lon 03] studied pruning techniques that can significantly improve query throughput and response times for query execution in large engines in the case where there is a global ranking of pages, as provided by PageRank or any other method, in addition to the standard term-based approach. They described pruning schemes for this case and evaluated their efficiency on an experimental cluster-based search engine with millions of Web pages. Their results showed that there is significant potential benefit in such techniques.

V. N. Anh and A. Moffat [Anh 04] described a scheme for compressing lists of integers as sequences of fixed binary codewords that has the twin benefits of being both effective and efficient. Because Web search engines index large quantities of text, the static costs associated with storing the index can be traded against the dynamic costs associated with using it during query evaluation. Typically, index representations that are effective and obtain good compression tend not to be efficient, in that they require more operations during query processing. The approach described by Anh and Moffat results in a reduction in index storage costs compared to their previous word-aligned version, with no cost in terms of query throughput.

Udayan Khurana and Anirudh Koul [Khu 05] presented a new compression scheme for text that is efficient in giving high compression ratios and enables very fast searching within the compressed text. Typical compression ratios of 70-80% and a reduction of search time by 80-85% are the features of this work. Until then, a trade-off between high compression ratios and searchability within compressed text had been assumed; in this paper, they showed that the greater the compression, the faster the search.

Stefan Buttcher and Charles L. A. Clarke [But 06] examined index compression techniques for schema-independent inverted files used in text retrieval systems. Schema-independent inverted files contain full positional information for all index terms and allow the structural unit of retrieval to be specified dynamically at query time, rather than statically during index construction. Schema-independent indexes have different characteristics
than document-oriented indexes, and this difference can greatly affect the effectiveness of index compression algorithms. Their experimental results show that unaligned binary codes that take into account the special properties of schema-independent indexes achieve better compression rates than methods designed for compressing document indexes, and that they can reduce the size of the index by around 15% compared to byte-aligned index compression.

P. Ferragina et al. [Fer 07] proposed two new compressed representations for general sequences, which produce an index that improves over the one in [Gro 03] by removing from the query times the dependence on the alphabet size and the polylogarithmic terms.

R. Gonzalez and G. Navarro [Gon 07a] introduced a new compression scheme for suffix arrays which permits locating the occurrences extremely fast, while still being much smaller than classical indexes. In addition, their index permits a very efficient secondary-memory implementation, where compression reduces the amount of I/O needed to answer queries. Compressed text self-indexes had matured to a point where they can replace a text by a data structure that requires less space and, in addition to giving access to arbitrary text passages, supports indexed text searches. At this point those indexes are competitive with traditional text indexes (which are very large) for counting the number of occurrences of a pattern in the text; yet, they are still hundreds to thousands of times slower when it comes to locating those occurrences in the text.

R. Gonzalez and G. Navarro [Gon 07b] introduced a disk-based compressed text index that, when the text is compressible, takes little more than the plain text size (and replaces it). It provides very good I/O times for searching, which in particular improve when the text is compressible. In this respect the index is unique, as compressed indexes have been slower than their classical counterparts on secondary memory. They analyzed their index and showed experimentally that it is extremely competitive on compressible texts.

A. Moffat and J. S. Culpepper [Mof 07] showed that a relatively simple combination of techniques allows fast calculation of Boolean conjunctions within a surprisingly small amount of transferred data. This approach exploits the observation that queries tend to contain common words, and that representing common words via a bitvector allows
random access testing of candidates and, if necessary, fast intersection operations prior to the list of candidates being developed. Using bitvectors for a very small number of terms that occur frequently (in both documents and queries), and byte-coded inverted lists for the balance, can reduce both querying time and query-time data-transfer volumes. The techniques described in [Mof 07] are not applicable to other powerful forms of querying. For example, index structures that support phrase and proximity queries have a much more complex structure, and are not amenable to storage (in their full form) using bitvectors. Nevertheless, there may be scope for evaluation regimes that make use of preliminary conjunctive filtering before a more detailed index is consulted, in which case the structures described in [Mof 07] would still be relevant.

Due to the rapid growth in the size of the Web, Web search engines are facing enormous performance challenges. The larger engines in particular have to be able to process tens of thousands of queries per second on tens of billions of documents, making query throughput a critical issue. To satisfy this heavy workload, search engines use a variety of performance optimizations including index compression, caching, and early termination. J. Zhang et al. [Zha 08] focused on two techniques, inverted index compression and index caching, which play a crucial role in Web search engines as well as in other high-performance information retrieval systems. They performed a comparison and evaluation of several inverted list compression algorithms, including new variants of existing algorithms that had not been studied before. They then evaluated different inverted list caching policies on large query traces, and finally studied the possible performance benefits of combining compression and caching. The overall goal of their paper is to provide an updated discussion and evaluation of these two techniques, and to show how to select the best set of approaches and settings depending on parameters such as disk speed and main memory cache size.

P. Ferragina et al. [Fer 09] presented an article to fill the gap between implementations and focused comparisons of compressed indexes. They presented the existing implementations of compressed indexes from a practitioner's point of view; introduced the Pizza&Chili site, which offers tuned implementations and a standardized API for the
most successful compressed full-text self-indexes, together with effective test beds and scripts for their automatic validation and testing; and, finally, they showed the results of extensive experiments on a number of codes, with the aim of demonstrating the practical relevance of this novel algorithmic technology.

H. Yan et al. [Yan 09] studied index compression and query processing techniques for reordered document indexes, in which document IDs are reassigned to improve index compressibility. Previous work had focused on determining the best possible ordering of documents; in contrast, they assumed that such an ordering is already given, and focused on how to optimize compression methods and query processing for this case. They performed an extensive study of compression techniques for document IDs and presented new optimizations of existing techniques which can achieve significant improvements in both compression and decompression performance. They also proposed and evaluated techniques for compressing frequency values for this case. Finally, they studied the effect of this approach on query processing performance. Their experiments showed very significant improvements in index size and query processing speed on the TREC GOV2 collection of 25.2 million Web pages.

2.3 Recent Research on Bit-Level Data Compression Algorithms

This section presents a review of some of the most recent research on developing efficient bit-level data compression algorithms, as the algorithm used in this thesis is a bit-level technique.

A. Jaradat and M. Irshid [Jar 01] proposed a very simple and efficient binary run-length compression technique. The technique is based on mapping the non-binary information
source into an equivalent binary source using a new fixed-length code instead of the ASCII code. The codes are chosen such that the probability of one of the two binary symbols, say zero, at the output of the mapper is made as small as possible. Moreover, the "all ones" code is excluded from the code assignment table to ensure the presence of at least one "zero" in each of the output codewords. Compression is achieved by encoding the number of "ones" between two consecutive "zeros" using either a fixed-length code or a variable-length code. When applying this simple encoding technique to English text files, they achieved a compression of 5.44 bpc (bits per character) and 4.6 bpc for the fixed-length code and the variable-length (Huffman) code, respectively.

Caire et al. [Cai 04] presented a new approach to universal noiseless compression based on error-correcting codes. The scheme is based on the concatenation of the Burrows-Wheeler block sorting transform (BWT) with the syndrome former of a low-density parity-check (LDPC) code. Their scheme has linear encoding and decoding times and uses a new closed-loop iterative doping algorithm that works in conjunction with belief-propagation decoding. Unlike the leading data compression methods, their method is resilient against errors and lends itself to joint source-channel encoding/decoding; furthermore, it offers very competitive data compression performance.

A. A. Sharieh [Sha 04] introduced a fixed-length Hamming (FLH) algorithm as an enhancement to Huffman coding (HU) to compress text and multimedia files. He investigated and tested these algorithms on different text and multimedia files. His results indicated that the HU-FLH and FLH-HU combinations enhanced the compression ratio.

K. Barr and K. Asanović [Bar 06] presented a study of the energy savings possible by losslessly compressing data prior to transmission. Because the wireless transmission of a single bit can require over 1000 times more energy than a single 32-bit computation, it can be beneficial to perform additional computation to reduce the number of bits transmitted. If the energy required to compress data is less than the energy required to send it, there is
a net energy saving and an increase in battery life for portable computers. This work demonstrated that, with several typical compression algorithms, there was actually a net energy increase when compression was applied before transmission. Reasons for this increase were explained, and suggestions were made to avoid it. One such energy-aware suggestion was asymmetric compression, the use of one compression algorithm on the transmit side and a different algorithm on the receive path. By choosing the lowest-energy compressor and decompressor on the test platform, the overall energy to send and receive data can be reduced by 11% compared with a well-chosen symmetric pair, or by up to 57% over the default symmetric scheme.

The value of this research is not merely to show that one can optimize a given algorithm to achieve a certain reduction in energy, but to show that the choice of how and whether to compress is not obvious. It depends on hardware factors such as the relative energy of the central processing unit (CPU), memory, and network, as well as software factors including compression ratio and memory access patterns. These factors can change, so techniques for lossless compression prior to transmission/reception of data must be re-evaluated with each new generation of hardware and software.

A. Jaradat et al. [Jar 06] proposed a file splitting technique for reducing the nth-order entropy of text files. The technique is based on mapping the original text file into a non-ASCII binary file using a new codeword assignment method; the resulting binary file is then split into several sub-files, each containing one or more bits from each codeword of the mapped binary file. The statistical properties of the sub-files were studied, and it was found that they reflect the statistical properties of the original text file, which is not the case when the ASCII code is used as a mapper. The nth-order entropy of these sub-files was determined, and it was found that the sum of their entropies was less than that of the original text file for the same values of extensions. These interesting statistical properties of the resulting sub-files can be used to achieve better compression ratios when conventional compression techniques are applied to these sub-files individually and on a bit-wise rather than character-wise basis.

H. Al-Bahadili [Bah 07b, Bah 08a] developed a lossless binary data compression scheme
that is based on the error-correcting Hamming codes; it was referred to as the HCDC algorithm. In this algorithm, the binary sequence to be compressed is divided into blocks of n bits each. To utilize the Hamming codes, each block is considered as a Hamming codeword that consists of p parity bits and d data bits (n = d + p). Each block is then tested to find whether it is a valid or a non-valid Hamming codeword. For a valid block, only the d data bits preceded by a 1 are written to the compressed file, while for a non-valid block all n bits preceded by a 0 are written to the compressed file. These additional 1 and 0 bits are used to distinguish the valid and non-valid blocks during the decompression process.

An analytical formula was derived for computing the compression ratio as a function of the block size and the fraction of valid data blocks in the sequence. The performance of the HCDC algorithm was analyzed, and the results obtained were presented in tables and graphs. The author concluded that the maximum compression ratio that can be achieved by this algorithm is n/(d+1), attained when all blocks are valid Hamming codewords.
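To illustrate the block test just described, the following Python sketch shows the valid/non-valid decision and the per-block output for the Hamming(7,4) case (n = 7, d = 4, p = 3); the bit ordering, the choice of data-bit positions, and the omission of the compressed-file header are simplifying assumptions of this sketch, and the full algorithm is described in [Bah 07b, Bah 08a] and in Chapter 3.

    # Simplified HCDC-style block handling for Hamming(7,4): a block is a valid
    # codeword iff its syndrome is zero; valid blocks shrink to 1+d bits, while
    # non-valid blocks grow to 1+n bits.
    def is_valid_codeword(block):
        syndrome = 0
        for pos, bit in enumerate(block, start=1):   # positions 1..7
            if bit:
                syndrome ^= pos                      # XOR of positions holding a 1
        return syndrome == 0

    DATA_POSITIONS = [3, 5, 6, 7]                    # non-parity positions in Hamming(7,4)

    def compress_block(block):
        if is_valid_codeword(block):
            # valid block: flag bit 1 followed by the d = 4 data bits (7 -> 5 bits)
            return [1] + [block[p - 1] for p in DATA_POSITIONS]
        # non-valid block: flag bit 0 followed by all n = 7 bits (7 -> 8 bits)
        return [0] + list(block)

    print(compress_block([0, 0, 0, 0, 0, 0, 0]))   # valid -> 5 bits
    print(compress_block([1, 0, 0, 0, 0, 0, 0]))   # non-valid -> 8 bits

If every block were valid, each 7-bit block would be stored in d + 1 = 5 bits, reproducing the n/(d+1) = 7/5 upper bound on the compression ratio quoted above.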
S. Nofal [Nof 07] proposed a bit-level file compression algorithm. In this algorithm, the binary sequence is divided into a set of groups of bits, which are considered as minterms representing Boolean functions. Applying algebraic simplifications to these functions reduces the number of minterms and, hence, the number of bits in the file. To make decompression possible, one should solve the problem of dropped Boolean variables in the simplified functions. He investigated one possible solution, and his evaluation showed that future work should find other solutions to render this technique useful, as the maximum compression ratio achieved was not more than 10%.

H. Al-Bahadili and S. Hussain [Bah 08b] proposed and investigated the performance of a bit-level data compression algorithm in which the binary sequence is divided into blocks of n bits each, giving each block a possible decimal value between 0 and 2^n-1. If the number of different decimal values (d) is equal to or less than 256, then the binary sequence can be compressed using an n-bit character wordlength, and a compression ratio of approximately n/8 can be achieved. They referred to this algorithm as the adaptive character wordlength (ACW) algorithm; since the compression ratio of the algorithm is a function of n, it was referred to as the ACW(n) algorithm.

Implementation of the ACW(n) algorithm highlights a number of issues that may degrade its performance and need to be carefully resolved, such as: (i) if d is greater than 256, the binary sequence cannot be compressed using an n-bit character wordlength; (ii) the probability of being able to compress a binary sequence using an n-bit character wordlength is inversely proportional to n; and (iii) finding the optimum value of n that provides the maximum compression ratio is a time-consuming process, especially for large binary sequences. In addition, for text compression, converting text to binary using the equivalent ASCII code of the characters gives a high-entropy binary sequence, so only a small compression ratio, or sometimes no compression at all, can be achieved.

To overcome the drawbacks of the ACW(n) algorithm, Al-Bahadili and Hussain [Bah 10a] developed an efficient implementation scheme to enhance its performance. In this scheme the binary sequence is divided into a number of subsequences (s), each of which satisfies the condition that d is less than 256; it is therefore referred to as the ACW(n,s) scheme. The scheme achieved compression ratios of more than 2 on most text files from the most widely used corpora.

H. Al-Bahadili and A. Rababa'a [Bah 07a, Rab 08, Bah 10b] developed a new scheme consisting of six steps, some of which are applied repetitively to enhance the compression ratio of the HCDC algorithm [Bah 07b, Bah 08a]; the new scheme was therefore referred to as the HCDC(k) scheme, where k refers to the number of repetition loops. The repetition loops continue until inflation is detected, and the overall (accumulated) compression ratio is the product of the compression ratios of the individual loops. The results obtained for the HCDC(k) scheme demonstrated that it has a higher compression ratio than most well-known text compression algorithms and exhibits competitive performance with respect to many widely used state-of-the-art software tools. The HCDC algorithm and the HCDC(k) scheme will be discussed in detail in the next chapter.
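The repetitive idea behind the HCDC(k) scheme can be sketched as follows in Python; hcdc_compress() is a placeholder for a single HCDC pass, and the stopping rule shown (stop as soon as a pass no longer shrinks the data) is a simplified reading of the inflation test described above.

    # Sketch of the HCDC(k) repetition loop: keep re-applying the compressor to
    # its own output until inflation is detected, and accumulate the overall
    # compression ratio as the product of the per-loop ratios.
    def hcdc_k(data, hcdc_compress):
        overall_ratio, k = 1.0, 0
        while True:
            out = hcdc_compress(data)
            if len(out) >= len(data):               # inflation detected: stop looping
                break
            overall_ratio *= len(data) / len(out)   # accumulate this loop's ratio
            data, k = out, k + 1
        return data, k, overall_ratio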
S. Ogg and B. Al-Hashimi [Ogg 06] proposed a simple yet effective real-time compression technique that reduces the number of bits sent over serial links. The proposed technique reduces both the number of bits and the number of transitions compared to the original uncompressed data. Results of compression on two MPEG-1 coded pictures showed average bit reductions of approximately 17% to 47% and average transition reductions of approximately 15% to 24% over a serial link. The technique can be employed with network-on-chip (NoC) technology to alleviate the bandwidth bottleneck issue. Fixed and dynamic block sizing were considered, and general guidelines for determining a suitable fixed block length, as well as an algorithm for dynamic block sizing, were presented. The technique exploits the fact that unused significant bits do not need to be transmitted. The authors also outlined a possible implementation of the proposed compression technique, and presented the area overhead costs and the potential power and bandwidth savings within a NoC environment.

J. Zhang and X. Ni [Zha 10] presented a new implementation of bit-level arithmetic coding using integer additions and shifts. The algorithm has lower computational complexity and more flexibility, and thus is very suitable for hardware design. They showed that their implementation has the lowest complexity and the highest speed compared with Zhao's algorithm [Zha 98], the Rissanen and Mohiuddin (RM) algorithm [Ris 89], the Langdon and Rissanen (LR) algorithm [Lan 82], and the basic arithmetic coding algorithm, and it sometimes achieves a higher compression rate than the basic arithmetic coding algorithm. Therefore, it provides an excellent compromise between good performance and low complexity.
Chapter Three
The Novel CIQ Web Search Engine Model

This chapter presents a description of the proposed Web search engine model. The model incorporates two bit-level data compression layers, both installed at the back-end processor: one for index compression (the index compressor) and one for query compression (the query or keyword compressor), so that the search process can be performed at the compressed index-query level and any decompression activity during the searching process is avoided. Therefore, it is referred to as the compressed index-query (CIQ) model. In order to be able to perform the search process at the compressed index-query level, it is important to have a data compression technique that is capable of producing the same pattern for the same character in both the query and the index.

The algorithm that meets this main requirement is the novel Hamming code data compression (HCDC) algorithm [Bah 07b, Bah 08a]. The HCDC algorithm creates a compressed-file header (compression header) to store some parameters that are relevant to the compression process, mainly the character-binary coding pattern. This header should be stored separately so that it can be accessed by the query compressor and the index decompressor. Introducing the new compression layers should reduce the disk space needed for storing index files and increase query throughput, and consequently the retrieval rate. On the other hand, compressing the search query reduces I/O overheads and query processing time, as well as the system response time.

This section outlines the main theme of this chapter. The rest of this chapter is organized as follows: a detailed description of the new CIQ Web search engine model is given in Section 3.2. Section 3.3 presents the implementation of the new model and its main procedures. The data compression algorithm, namely the HCDC algorithm, is described in Section 3.4; in addition, Section 3.4 gives the derivation and analysis of the HCDC compression ratio. The performance measures that are used to evaluate and compare the performance of the new model are introduced in Section 3.5.
3.1 The CIQ Web Search Engine Model

In this section, a description of the proposed Web search engine model is presented. The new model incorporates two bit-level data compression layers, both installed at the back-end processor, one for index compression (index compressor) and one for query compression (query compressor or keyword compressor), so that the search process can be performed at the compressed index-query level without any decompression activities; we therefore refer to it as the compressed index-query (CIQ) Web search engine model, or simply the CIQ model. In order to be able to perform the search process at the CIQ level, it is important to have a data compression technique that is capable of producing the same pattern for the same character in both the index and the query. The HCDC algorithm [Bah 07b, Bah 08a], which will be described in the next section, satisfies this important requirement, and it is used in the compression layers of the new model. Figure (3.1) outlines the main components of the new CIQ model and where the compression layers are located.

It is believed that introducing the new compression layers reduces the disk space needed for storing index files and increases query throughput, and consequently the retrieval rate. On the other hand, compressing the search query reduces I/O overheads and query processing time, as well as the system response time.

The CIQ model works as follows. At the back-end processor, after the indexer generates the index and before sending it to the index storage device, the index is kept in temporary memory, where a lossless bit-level compression is applied using the HCDC algorithm; the compressed index file is then sent to the storage device. As a result, it requires less disk space, enabling more documents to be indexed and accessed in comparatively less CPU time. The HCDC algorithm creates a compressed-file header (compression header) to store some parameters that are relevant to the compression process, mainly the character-to-binary coding pattern. This header should be stored separately so that it can be accessed by the query compression layer (query compressor).
On the other hand, the query parser, instead of passing the query directly to the index file, passes it to the query compressor before the index file is accessed. In order to produce the same binary pattern for the same compressed characters in the index and the query, the character-to-binary codes used in converting the index file are passed to the query compressor. If a match is found, the retrieved data is decompressed using the index decompressor and passed through the ranker and the search engine interface to the end user.

Figure (3.1). Architecture and main components of the CIQ Web search engine model.
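The data flow just described, and summarized in Figure (3.1), can be sketched at a high level as follows; hcdc_compress(), hcdc_decompress(), and the shared compression header are placeholders standing in for the actual HCDC layers described in Section 3.4, so this is an illustration of the flow rather than the thesis implementation.

    # High-level sketch of the CIQ search flow: the index and the query are
    # compressed with the SAME character-to-binary coding (taken from the
    # compression header), so matching is done directly on compressed patterns
    # and the index is decompressed only for the entries that actually match.
    def build_compressed_index(index_entries, hcdc_compress, header):
        # Second compression layer after the indexer: terms and their postings
        # are stored in compressed form.
        return {hcdc_compress(term, header): hcdc_compress(postings, header)
                for term, postings in index_entries.items()}

    def ciq_search(query_terms, compressed_index, hcdc_compress, hcdc_decompress, header):
        results = []
        for term in query_terms:
            key = hcdc_compress(term, header)            # query compressor layer
            if key in compressed_index:                  # match at the compressed level
                postings = hcdc_decompress(compressed_index[key], header)
                results.append((term, postings))         # handed on to the ranker
        return results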
