▪ Implemented text processing for web pages to extract tokens, n-grams and anagrams in Python
▪ Designed a spider to crawl ics.uci.edu domain and accumulate crawled pages into a MySQL database
▪ Constructed an indexer for crawled pages and a page ranking mechanism based on Page Rank and collection frequency
PROJECT REPORT
Search Engine for ics.uci.edu
“Illuminati”
Submitted in partial fulfillment of the requirements of
COMPSCI 221 WINTER 2017
By
ABHIDNYA PATIL 91882839
SOHAM KULKARNI 20005264
MADHUR J. BAJAJ 36562594
PROJECT GUIDE:
PROF. CRISTINA LOPES
UNIVERSITY OF CALIFORNIA, IRVINE
Table of Contents
1. Introduction
1.1. Problem Statement
1.2. Purpose
2. Implementation
2.1. Naïve Implementation
2.1.1. Term Frequency – Inverse Document Frequency
2.2. Performance Improvement
2.2.1. Term Frequency – Inverse Document Frequency with Inflation
2.2.2. Hyperlink Induced Topic Search (HITS)
2.2.3. PageRank
2.2.4. Link Analysis
2.2.5. Stemming
2.2.6. 2-gram
2.2.7. All Caps Analysis
3. NDCG Comparative Analysis
4. Future Scope
1. Introduction
Search engines are applications that search a collection of records for specified keywords and return a list of the documents in which those keywords were found. Although "search engine" is really a general class of applications, the term is most often used to describe systems like Google, Bing, and Yahoo! Search that enable users to search for documents on the World Wide Web. The results are generally presented as a ranked list, commonly referred to as search engine results pages, and may be a mix of web pages, images, and other types of documents. Some search engines also mine data available in databases or open directories.
1.1 Problem Statement
In this project, we present a search engine for ics.uci.edu, named "Illuminati", that searches the ICS domain corpus. The search engine is built on the Information Retrieval techniques covered in CS 221. Various performance improvement techniques are implemented incrementally to evaluate the efficiency of the proposed implementation. The naïve implementation is a generic solution based on term frequency – inverse document frequency; several enhancement methodologies are then incorporated to increase the precision of the search engine.
1.2 Purpose
The aim of the project is to build a search engine and investigate its efficiency with respect to the results generated by Google for ics.uci.edu. The task is to develop a scalable, high-performance search engine, where the focus is on the algorithmic challenges of efficiently representing a large dataset while supporting fast searches. The project is based on the description posted at www.ics.uci.edu/~lopes/teaching/cs221W16/index.html
2. Implementation
2.1 Naïve Implementation
Using the pages stored by crawling the ics.uci.edu domain as input, the indexer constructs an inverted index that maps words to documents (pages). As a payload we store the term frequency – inverse document frequency (tf-idf) and the position of each word in each document. We employ the tf-idf weighting scheme since it facilitates listing relevant documents: a term's weight increases with its number of occurrences within a document and with its rarity in the collection. We use the cosine similarity measure to score every document with respect to the search query.
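The naïve pipeline above can be sketched as follows. This is an illustrative sketch, not the project's actual code: the function names, the log-scaled tf variant, and whitespace tokenization are our assumptions.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """Inverted index: term -> {doc_id: raw term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index

def tfidf_weight(tf, df, n_docs):
    """Log-scaled term frequency times inverse document frequency."""
    return (1 + math.log(tf)) * math.log(n_docs / df)

def search(query, index, n_docs):
    """Rank documents by cosine similarity between the query and
    document tf-idf vectors; returns (score, doc_id) pairs, best first."""
    # Document vector norms over the full vocabulary.
    norm = defaultdict(float)
    for term, postings in index.items():
        df = len(postings)
        for doc_id, tf in postings.items():
            norm[doc_id] += tfidf_weight(tf, df, n_docs) ** 2
    # Accumulate dot products for the query terms only.
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        df = len(postings)
        q_weight = math.log(n_docs / df)  # idf-weighted query term
        for doc_id, tf in postings.items():
            scores[doc_id] += q_weight * tfidf_weight(tf, df, n_docs)
    return sorted(((s / math.sqrt(norm[d]), d) for d, s in scores.items()),
                  reverse=True)
```

A document that repeats the query terms and contains little else scores highest, since both the tf-idf dot product and the short vector norm work in its favor.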
2.2 Performance Improvement Parameters
To improve the efficiency of the search engine, we utilized the following practices that we came across while researching search engine implementations.
2.2.1 Term Frequency – Inverse Document Frequency with Frequency Inflation
The term frequency is inflated based on the tags the term is nested in. The nesting level and the tags enclosing the data are used as a measure of the term's importance. For instance, a term embedded in the title tag of a webpage has a higher weight than the same term embedded in a paragraph tag. Likewise, a term embedded within a nested structure of title and bold tags has even more importance.
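A minimal sketch of this inflation, assuming hypothetical tag weights (the actual multipliers used in the project may differ); nested tags multiply, so a bold term inside a title outweighs either tag alone:

```python
# Illustrative tag weights -- not the project's actual values.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.5, "b": 1.5, "p": 1.0}

def inflated_frequency(term_occurrences):
    """term_occurrences: one tag path per occurrence of the term,
    e.g. [("title",), ("p", "b")] for a term seen once in the title
    and once inside a bold tag within a paragraph."""
    total = 0.0
    for tag_path in term_occurrences:
        weight = 1.0
        for tag in tag_path:
            weight *= TAG_WEIGHTS.get(tag, 1.0)  # unknown tags are neutral
        total += weight
    return total
```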
2.2.2 Hyperlink Induced Topic Search
Graph analysis is conducted to compute the inter-relation of pages, where a good hub page points to multiple authoritative pages on a topic, and a good authority page for a topic is pointed to by multiple good hub pages. Hub and authority scores are computed iteratively, and we prevent the values from growing too large by scaling them with a normalizing factor: the square root of the sum of squares of all hub and authority values, respectively.
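The iterative computation with this normalization can be sketched as below; the link-graph representation (a dict of outgoing links) and the fixed iteration count are our assumptions.

```python
import math

def hits(links, iterations=50):
    """links: dict page -> list of pages it points to.
    Returns (hub, authority) score dicts, normalized each iteration
    so the values do not grow without bound."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of pages linking in.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ()))
                for p in pages}
        # Hub: sum of authority scores of pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize by the root of the sum of squares, as described above.
        a_norm = math.sqrt(sum(v * v for v in auth.values()))
        h_norm = math.sqrt(sum(v * v for v in hub.values()))
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth
```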
2.2.3 PageRank
The PageRank algorithm is used to rank websites in search engine results. It works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.
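A power-iteration sketch of PageRank; the damping factor 0.85 is the conventional choice, not necessarily the value used in the project, and the dangling-page handling is one of several common options.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict page -> list of outgoing links.
    Returns a dict page -> rank; ranks sum to 1."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Teleport component: every page gets a base share.
        new_rank = {p: (1 - damping) / n for p in pages}
        for p in pages:
            out = links.get(p, [])
            if out:
                share = damping * rank[p] / len(out)
                for q in out:
                    new_rank[q] += share
            else:
                # Dangling page: spread its rank uniformly.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank
```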
2.2.4 Link Analysis
Link analysis is taken into account while indexing: for a document S containing a hyperlink to document D, we record the term frequency of the anchor text in the source S and an inflated term frequency in the destination document D. The inflation can also be based on the authority value of a page, which is a measure of its importance.
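One way to realize this anchor-text inflation; the boost factor and the function name are hypothetical illustrations, not the project's actual values.

```python
def index_anchor_text(index, anchor_text, source_id, dest_id, boost=2.0):
    """Count each anchor term once for the source document S and with an
    inflated weight for the destination document D (boost is illustrative;
    it could instead be derived from the destination's authority value)."""
    for term in anchor_text.lower().split():
        postings = index.setdefault(term, {})
        postings[source_id] = postings.get(source_id, 0) + 1
        postings[dest_id] = postings.get(dest_id, 0) + boost
    return index
```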
2.2.5 Stemming
The linguistic morphological stemmer "Snowball" was used to reduce inflected words to their word stems. Since all words with the same stem are treated as equivalent, fetching relevant documents becomes easier irrespective of the derived form of the word used in the search query.
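The project used the Snowball stemmer; purely as a self-contained illustration of suffix stripping (this is a toy rule list, not the actual Snowball/Porter2 algorithm, which applies ordered rule steps with region checks), consider:

```python
def simple_stem(word):
    """Toy suffix-stripping stemmer for illustration only.
    Strips the first matching suffix if a stem of >= 3 letters remains."""
    for suffix in ("ization", "ational", "fulness", "iveness",
                   "ation", "ness", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

With such a mapping, "searching", "searches", and "search" all collapse to the same index term.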
2.2.6 2-gram
2-gram is an instance of the n-gram computational linguistic model, in which the probability of a series of tokens being related to one another is considered. For instance, "UC" and "Irvine" individually fetch different sets of search results, since each term on its own is open to multiple interpretations; put together, the phrase needs to be treated as one unit to fetch documents with maximum precision. Thus n-grams help preserve the semantics of the search query in the results listed.
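Bigram extraction itself is straightforward; a sketch, assuming whitespace tokenization and treating each 2-gram as an indexable "phrase term":

```python
def ngrams(tokens, n=2):
    """All contiguous n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bigram_terms(text):
    """Lower-cased 2-gram phrase terms that can be indexed
    alongside ordinary single-word terms."""
    tokens = text.lower().split()
    return [" ".join(pair) for pair in ngrams(tokens, 2)]
```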
2.2.7 All Caps Analysis
All caps analysis helped us distinguish between query terms that have two different interpretations depending on the case in which they are searched. For instance, "rest" and "REST" are alphabetically identical but differ in meaning.
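One possible way to keep this distinction at index time (a sketch, not the project's exact scheme): index an all-caps token under both its literal and lower-cased forms, so the acronym query "REST" matches a posting list that the common-word query "rest" does not.

```python
def index_term_variants(term):
    """Return the set of index terms under which a token is posted.
    All-caps tokens keep their acronym form in addition to the
    lower-cased form; everything else is lower-cased only."""
    variants = {term.lower()}
    if term.isupper() and len(term) > 1:
        variants.add(term)  # preserve the acronym form, e.g. 'REST'
    return variants
```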
Implementing the above techniques helped us see how each of them improves the efficiency of the search engine results. At the same time, the results showed that there must be a right balance between the parameters that determine the overall ranking; integrating all of them without letting them contradict each other is the real task. In attempting to improve performance, we realized that the factors used to scale the impact of each parameter dominate the ranking of documents in the search results. The NDCG values for each query before and after the performance improvements are summarized below.
3. NDCG Comparative Analysis

Query                   Before Performance Improvement   After Performance Improvement
Mondego                 0.7075                           0.7075
Machine learning        0.0                              0.0
REST                    0.4057                           0.6452
Security                0.0                              0.0
Student affairs         0.3957                           0.3957
Graduate courses        0.0                              0.1745
Crista Lopes            0.6492                           0.6492
Software engineering    0.0                              0.0
Computer games          0.6727                           0.6727
Information retrieval   0.3157                           0.3157
AVERAGE NDCG@5          0.3146                           0.3561
[Bar chart: NDCG of the three search engine implementations A, B, and C]
A: Before Performance Improvement, NDCG 0.3146
B: After Performance Improvement, NDCG 0.3561
C: Implementation based on tf-idf, NDCG 0.4670
As represented in the above diagram, we speculate that the NDCG values of our search engine implementations follow a similar trajectory, but we are missing the peak of that trajectory: the factors used to scale the individual parameters need to be fine-tuned to obtain optimal results. In our implementation, we prioritized the factors in the order tf-idf, authority value, PageRank, and hub value, considering that tf-idf is the primary measure of term relevance, followed by the authority value, which indicates that a page is informative about the query term, then PageRank, which models the probability of a user opening a given document, and finally the hub value, which points to relevant documents. To reduce the execution time, we tried implementing the same in multithreading and map-reduce environments, but the overhead of context switching dominated the indexing time.
4. Future Scope
• Document Clustering
• Machine Learning
• Spell Correction
• Multi-threading Implementation
• Acronym Analysis