PROJECT REPORT
Search Engine for ics.uci.edu
“Illuminati”
Submitted in partial fulfillment of the requirements of
COMPSCI 221 WINTER 2017
By
ABHIDNYA PATIL 91882839
SOHAM KULKARNI 20005264
MADHUR J. BAJAJ 36562594
PROJECT GUIDE:
PROF. CRISTINA LOPES
UNIVERSITY OF CALIFORNIA, IRVINE
Table of Contents
1. Introduction
1.1. Problem Statement
1.2. Purpose
2. Implementation
2.1. Naïve Implementation
2.1.1. Term frequency - inverse document frequency
2.2. Performance Improvement
2.2.1. Term frequency - inverse document frequency with inflation
2.2.2. Hyperlink Induced Topic Search (HITS)
2.2.3. PageRank
2.2.4. Link Analysis
2.2.5. Stemming
2.2.6. 2-gram
2.2.7. All Caps Analysis
3. NDCG Comparative Analysis
4. Future Scope
1. Introduction
Search engines are applications that search a collection of documents for specified keywords and
return a list of the documents in which the keywords were found. Although "search engine" names a
general class of applications, the term is most often used to describe systems such as Google,
Bing, and Yahoo! Search that let users search for documents on the World Wide Web. The search
results are generally presented as a list, often referred to as search engine results pages. The
results may be a mix of web pages, images, and other types of files. Some search engines also
mine data available in databases or open directories.
1.1 Problem Statement
In this project, we present “Illuminati”, a search engine for ics.uci.edu that searches the
ICS domain corpus. The search engine is built on Information Retrieval techniques
learned in CS 221. Starting from a naïve implementation, a generic solution based on term
frequency – inverse document frequency (tf-idf), we incrementally added performance
improvement techniques and evaluated the effect of each on the results. These enhancements
are intended to increase the precision of the search engine.
1.2 Purpose
The aim of the project is to build a search engine and evaluate its results against
the results returned by Google for ics.uci.edu. The task is to develop a scalable,
high-performance search engine, with a focus on the algorithmic challenges of efficiently
representing a large dataset while supporting fast searches. The project is based on the
description posted at www.ics.uci.edu/~lopes/teaching/cs221W16/index.html
2. Implementation
2.1 Naïve Implementation
Taking the pages stored by crawling the ics.uci.edu domain as input, the Indexer constructs
an inverted index that maps words to documents (pages). As the payload of each entry we store
the term frequency – inverse document frequency (tf-idf) weight and the positions of the word
in each document. We employ the tf-idf weighting scheme because it helps list relevant
documents first: the weight increases with the number of occurrences of a term within a
document and with the rarity of the term in the collection. We use the cosine similarity
measure to score every document against the search query.
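The indexing and scoring steps described above can be sketched as follows. This is a minimal illustration and not the project's actual code: the whitespace tokenizer, the log-scaled tf-idf variant, and the function names build_index, tfidf, and search are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lower-casing and whitespace splitting stand in for the real tokenizer.
    return text.lower().split()


def build_index(docs):
    """Build an inverted index from a dict of doc_id -> text.

    Returns (index, idf), where index maps
    term -> {doc_id: {"tf": count, "positions": [token offsets]}}.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            posting = index[term].setdefault(doc_id, {"tf": 0, "positions": []})
            posting["tf"] += 1
            posting["positions"].append(pos)
    n = len(docs)
    # Rare terms (few postings) receive a higher idf.
    idf = {term: math.log10(n / len(postings)) for term, postings in index.items()}
    return index, idf


def tfidf(tf, idf_value):
    # One common tf-idf variant (log-scaled tf); the report does not say which it used.
    return (1 + math.log10(tf)) * idf_value if tf > 0 else 0.0


def search(query, index, idf):
    """Rank documents by cosine similarity between query and document tf-idf vectors."""
    # Document vectors are rebuilt per query for clarity; a real indexer would cache them.
    doc_vectors = defaultdict(dict)
    for term, postings in index.items():
        for doc_id, posting in postings.items():
            doc_vectors[doc_id][term] = tfidf(posting["tf"], idf[term])

    q_tf = Counter(t for t in tokenize(query) if t in index)
    q_vec = {t: tfidf(count, idf[t]) for t, count in q_tf.items()}
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))

    scores = {}
    for doc_id, vec in doc_vectors.items():
        dot = sum(w * vec.get(t, 0.0) for t, w in q_vec.items())
        d_norm = math.sqrt(sum(w * w for w in vec.values()))
        if dot > 0 and q_norm > 0 and d_norm > 0:
            scores[doc_id] = dot / (q_norm * d_norm)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Calling build_index on a small dictionary of page texts and then search("machine learning", index, idf) returns (doc_id, score) pairs in descending order of cosine similarity; documents sharing no term with the query are omitted.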
2.2 Performance Improvement Parameters
To improve the effectiveness of the search engine, we applied the following techniques, which we
came across while researching search engine implementation.
2.2.1 Term Frequency – Inverse Document Frequency with frequency inflation
The term frequency is inflated based on the tags the term is nested in. The tags in which
a term is nested, and the depth of that nesting, are used as a measure of the term's importance.
For instance, a term embedded in the title tag of a webpage has a higher
weight than the same term embedded in a paragraph tag. Likewise, a term embedded
within a nested structure of title and bold tags has even more importance.
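A rough sketch of tag-based frequency inflation, using Python's standard html.parser module. The boost values in TAG_BOOST, the choice to multiply the boosts of nested tags, and the names InflatedTermCounter and inflated_tf are assumptions for illustration; the report does not state the factors it used.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical boost factors; the report does not give the values it used.
TAG_BOOST = {"title": 5.0, "h1": 3.0, "b": 2.0, "strong": 2.0}


class InflatedTermCounter(HTMLParser):
    """Counts terms, weighting each occurrence by the boosted tags enclosing it."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open boosted tags
        self.tf = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in TAG_BOOST:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recent matching tag, tolerating unbalanced markup.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i:]
                break

    def handle_data(self, data):
        # Assumption: boosts of nested tags multiply, so deeper nesting weighs more.
        weight = 1.0
        for tag in self.open_tags:
            weight *= TAG_BOOST[tag]
        for term in data.lower().split():
            self.tf[term] += weight


def inflated_tf(html):
    parser = InflatedTermCounter()
    parser.feed(html)
    parser.close()
    return parser.tf
```

The inflated counts can then replace raw term frequencies when tf-idf weights are computed. Note that this sketch also counts text inside script and style elements, which a real indexer would skip.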
2.2.2 Hyperlink Induced Topic Search
Graph analysis is conducted to compute the inter-relation of pages: a good hub page
points to multiple authoritative pages on a topic, and a good authority page for a
topic is pointed to by multiple good hub pages for that topic. Hub and authority values
are computed iteratively, and we prevent them from growing too large by
scaling them down with a normalizing factor: the square root of the sum of the
squares of all hub values and of all authority values, respectively.
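The iterative hub and authority computation with normalization might look like the sketch below. The graph representation (a dict from page to the list of pages it links to), the fixed iteration count, and the function name hits are assumptions for this example.

```python
import math


def hits(graph, iterations=50):
    """Compute hub and authority scores for a link graph.

    graph: dict mapping each page to the list of pages it links to.
    Returns (hub, authority) dicts, each normalized to unit Euclidean length.
    """
    pages = set(graph) | {dst for targets in graph.values() for dst in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority of a page: sum of the hub scores of the pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in graph.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Normalize by the root of the sum of squares so values do not grow unbounded.
        norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {p: v / norm for p, v in new_auth.items()}

        # Hub score of a page: sum of the authority scores of the pages it links to.
        new_hub = {p: sum(auth[dst] for dst in graph.get(p, ())) for p in pages}
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {p: v / norm for p, v in new_hub.items()}

    return hub, auth
```

A fixed number of iterations keeps the sketch short; checking the change between successive iterations against a tolerance is a common alternative stopping rule.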
2.2.3 PageRank
The PageRank algorithm is used to rank pages in the search engine results. It works
by counting the number and quality of links to a page to determine a rough estimate
of how important the page is. The underlying assumption is that more important
pages are likely to receive more links from other pages.
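A minimal power-iteration sketch of PageRank over the crawled link graph. The damping factor of 0.85, the uniform redistribution of rank from pages without outgoing links, and the function name pagerank are assumptions; the report does not describe these details.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Compute PageRank by power iteration.

    graph: dict mapping each page to the list of pages it links to.
    Returns a dict page -> rank; the ranks sum to 1.
    """
    pages = set(graph) | {dst for targets in graph.values() for dst in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not graph.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        # Every page passes an equal share of its rank along each outgoing link.
        for src, targets in graph.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank

    return rank
```

Under this formulation a page's rank can be read as the probability that a random surfer is on that page, which matches the interpretation used later in the report.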
2.2.4 Link Analysis
Link analysis is taken into account while indexing: when a document S has a hyperlink to
a document D, the anchor text contributes its normal term frequency to the source document S
and an inflated term frequency to the destination document D. The inflation in frequency can
also be based on the authority value of a page, which is a measure of its importance.
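The anchor-text crediting described above could be sketched like this. The boost constant ANCHOR_BOOST, the input shapes, and the function name apply_anchor_text are hypothetical; the sketch also assumes the source document's counts already include its own anchor text, so only the destination is adjusted.

```python
from collections import Counter, defaultdict

# Hypothetical inflation factor for anchor-text terms credited to the link target.
ANCHOR_BOOST = 2.0


def apply_anchor_text(term_freqs, links, boost=ANCHOR_BOOST):
    """Credit anchor-text terms to the destination document of each link.

    term_freqs: dict doc_id -> Counter of term frequencies computed from page text
                (assumed to already include the page's own anchor text).
    links: iterable of (source_id, destination_id, anchor_text) tuples.
    Returns a new dict doc_id -> Counter with inflated destination frequencies.
    """
    result = defaultdict(Counter)
    for doc_id, tf in term_freqs.items():
        result[doc_id].update(tf)
    for _source, destination, anchor_text in links:
        for term in anchor_text.lower().split():
            # The destination gains the term even if it never appears on that page.
            result[destination][term] += boost
    return result
```

To base the inflation on authority instead, the boost argument could be replaced by a per-link value derived from the authority score of the page concerned.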
2.2.5 Stemming
The Snowball stemmer, a linguistic morphological stemmer, was used to reduce
inflected words to their word stems. Since all words with the same stem are treated as
equivalent, fetching relevant documents becomes easier irrespective of the derived form of a
word used in the search query.
2.2.6 2-gram
A 2-gram is an instance of the n-gram model from computational linguistics, which considers
the probability of a series of tokens being related to one another. For instance, UC and
Irvine individually fetch different sets of search results, since each term
on its own is open to multiple interpretations; when they appear together, the phrase
needs to be treated as one unit to fetch documents with maximum precision. N-grams thus
help preserve the semantics of the search query in the results listed.
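A small sketch of 2-gram indexing for the "UC Irvine" example. The whitespace tokenization and the function names bigrams and build_bigram_index are assumptions for illustration.

```python
from collections import defaultdict


def bigrams(tokens):
    """Return the adjacent token pairs of a token list, joined by a space."""
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]


def build_bigram_index(docs):
    """Map each 2-gram to the set of documents that contain it.

    docs: dict doc_id -> text.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in bigrams(text.lower().split()):
            index[gram].add(doc_id)
    return index
```

Looking up "uc irvine" in such an index returns only the documents where the two words are adjacent, whereas looking up the words separately would also match pages that mention UC Berkeley and the city of Irvine in unrelated places.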
2.2.7 All Caps Analysis
All Caps Analysis helped us distinguish between query terms that have two different
interpretations depending on the case in which they are searched. For instance, ‘rest’ and
‘REST’ are spelled identically but differ in meaning.
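One way to keep all-caps terms distinct is to skip lower-casing for tokens written entirely in capitals. The rule below (alphabetic tokens of at least two letters) and the function name normalize are assumptions; the report does not specify how it detected such terms.

```python
def normalize(token):
    """Lower-case a token unless it is written entirely in capital letters.

    Assumption: an all-caps alphabetic token of two or more letters (e.g. 'REST')
    is treated as an acronym and indexed separately from its lower-case form.
    """
    if len(token) > 1 and token.isalpha() and token.isupper():
        return token
    return token.lower()
```

Applying the same normalize function to both indexed text and query terms keeps ‘REST’ and ‘rest’ as separate index entries, while ‘Rest’ at the start of a sentence still folds to ‘rest’.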
Implementing the above techniques helped us realize how tuning the search engine with each of
them improves the quality of the search results. At the same time, the results
showed that there must be a right balance between the parameters that determine the
overall ranking. Integrating all of them and striking that balance, so that they do not
contradict each other, is a task in itself.
3. NDCG Comparative Analysis
In our attempt to improve performance, we realized that the factors used to scale the
impact of each parameter dominate the ranking of documents in the search results. The
NDCG values for each query before and after performance improvement are summarized
below.
Query                    Before Performance Improvement    After Performance Improvement
Mondego                  0.7075                            0.7075
Machine learning         0.0                               0.0
REST                     0.4057                            0.6452
Security                 0.0                               0.0
Student affairs          0.3957                            0.3957
Graduate courses         0.0                               0.1745
Crista Lopes             0.6492                            0.6492
Software engineering     0.0                               0.0
Computer games           0.6727                            0.6727
Information retrieval    0.3157                            0.3157
AVERAGE NDCG@5           0.3146                            0.3561
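For reference, NDCG@5 for a single query can be computed as in the sketch below. The graded relevance scheme (a dict from URL to a relevance grade derived from the reference ranking), the log2 discount, and the function names dcg and ndcg_at_k are assumptions; the report does not state how relevance grades were assigned.

```python
import math


def dcg(gains):
    """Discounted cumulative gain of a list of relevance grades in rank order."""
    return sum(gain / math.log2(rank + 1) for rank, gain in enumerate(gains, start=1))


def ndcg_at_k(ranked, relevance, k=5):
    """NDCG@k of a ranked result list against graded reference relevance.

    ranked: list of result URLs in the order returned by the search engine.
    relevance: dict URL -> relevance grade (hypothetical, e.g. 5 for the top
               reference result down to 1 for the fifth); unlisted URLs count as 0.
    """
    gains = [relevance.get(url, 0) for url in ranked[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A query whose top five results share no URL with the reference ranking scores 0.0 under this definition, and a result list identical to the reference ranking scores 1.0.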
[Figure: NDCG versus Search Engine Implementations, marking points A, B, and C]
A: Before Performance Improvement NDCG 0.3146
B: After Performance Improvement NDCG 0.3561
C: Implementation based on tfidf NDCG 0.4670
As represented in the above diagram, we speculate that the NDCG values of the search engine
implementations follow a similar trajectory, but that we are missing the peak of this
trajectory. The factors used for scaling the individual parameters need to be fine-tuned to get
optimal results. In our search implementation, we have prioritized the factors in the order
tf-idf, authority value, PageRank, and hub value: tf-idf is the primary measure of term
relevance, followed by the authority value, which indicates that the page is informative about
the query term, followed by PageRank, which is the probability of a user opening a given
document, and finally the hub value, which indicates that the page points to relevant
documents. In order to reduce the execution time, we tried implementing the same in
multithreaded and MapReduce environments, but the overhead of context switching dominated the
indexing time.
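The prioritization of tf-idf, authority value, PageRank, and hub value could be expressed as a weighted sum, as in the sketch below. The linear form, the weight values in WEIGHTS, and the function names combined_score and rank_documents are all hypothetical; the report does not give its scaling factors or say how the signals were combined.

```python
# Hypothetical scaling factors, ordered as in the report:
# tf-idf > authority value > PageRank > hub value.
WEIGHTS = {"tfidf": 1.0, "authority": 0.5, "pagerank": 0.25, "hub": 0.1}


def combined_score(signals, weights=WEIGHTS):
    """Weighted sum of a document's ranking signals; missing signals count as 0."""
    return sum(weight * signals.get(name, 0.0) for name, weight in weights.items())


def rank_documents(candidates, weights=WEIGHTS):
    """Order candidate documents by combined score, best first.

    candidates: dict doc_id -> dict of signal name -> value.
    """
    return sorted(
        candidates,
        key=lambda doc_id: combined_score(candidates[doc_id], weights),
        reverse=True,
    )
```

Changing a single weight can reorder the results, which illustrates why the scaling factors need careful tuning: raising the authority weight in this sketch lets a page with a weak tf-idf match overtake a page that matches the query terms well.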
4. Future Scope
• Document Clustering
• Machine Learning
• Spell Correction
• Multi-threading Implementation
• Acronym Analysis

More Related Content

Similar to CompSci: 221 Winter 2017 Search Engine for UCI

IRJET- Automatic Text Summarization using Text Rank
IRJET- Automatic Text Summarization using Text RankIRJET- Automatic Text Summarization using Text Rank
IRJET- Automatic Text Summarization using Text RankIRJET Journal
 
2017 IEEE Projects 2017 For Cse ( Trichy, Chennai )
2017 IEEE Projects 2017 For Cse ( Trichy, Chennai )2017 IEEE Projects 2017 For Cse ( Trichy, Chennai )
2017 IEEE Projects 2017 For Cse ( Trichy, Chennai )SBGC
 
Research Report on Document Indexing-Nithish Kumar
Research Report on Document Indexing-Nithish KumarResearch Report on Document Indexing-Nithish Kumar
Research Report on Document Indexing-Nithish KumarNithish Kumar
 
Research report nithish
Research report nithishResearch report nithish
Research report nithishNithish Kumar
 
IRJET-Deep Web Crawling Efficiently using Dynamic Focused Web Crawler
IRJET-Deep Web Crawling Efficiently using Dynamic Focused Web CrawlerIRJET-Deep Web Crawling Efficiently using Dynamic Focused Web Crawler
IRJET-Deep Web Crawling Efficiently using Dynamic Focused Web CrawlerIRJET Journal
 
Mining Query Log to Suggest Competitive Keyphrases for Sponsored Search Via I...
Mining Query Log to Suggest Competitive Keyphrases for Sponsored Search Via I...Mining Query Log to Suggest Competitive Keyphrases for Sponsored Search Via I...
Mining Query Log to Suggest Competitive Keyphrases for Sponsored Search Via I...IRJET Journal
 
Exploring the Efficiency of the Program using OOAD Metrics
Exploring the Efficiency of the Program using OOAD MetricsExploring the Efficiency of the Program using OOAD Metrics
Exploring the Efficiency of the Program using OOAD MetricsIRJET Journal
 
ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...
ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...
ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...ijnlc
 
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCESFINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCESkevig
 
Developing project objectives and Execution plan in Economy management
Developing project objectives and Execution plan in Economy management Developing project objectives and Execution plan in Economy management
Developing project objectives and Execution plan in Economy management Nzar Braim
 
Context Based Classification of Reviews Using Association Rule Mining, Fuzzy ...
Context Based Classification of Reviews Using Association Rule Mining, Fuzzy ...Context Based Classification of Reviews Using Association Rule Mining, Fuzzy ...
Context Based Classification of Reviews Using Association Rule Mining, Fuzzy ...journalBEEI
 
Document Analyser Using Deep Learning
Document Analyser Using Deep LearningDocument Analyser Using Deep Learning
Document Analyser Using Deep LearningIRJET Journal
 
MineDB Mineral Resource Evaluation White Paper
MineDB Mineral Resource Evaluation White PaperMineDB Mineral Resource Evaluation White Paper
MineDB Mineral Resource Evaluation White PaperDerek Diamond
 
MongoDB What's new in 3.2 version
MongoDB What's new in 3.2 versionMongoDB What's new in 3.2 version
MongoDB What's new in 3.2 versionHéliot PERROQUIN
 
Context sensitive indexes for performance optimization of sql queries in mult...
Context sensitive indexes for performance optimization of sql queries in mult...Context sensitive indexes for performance optimization of sql queries in mult...
Context sensitive indexes for performance optimization of sql queries in mult...avinash varma sagi
 
Preliminry report
 Preliminry report Preliminry report
Preliminry reportJiten Ahuja
 

Similar to CompSci: 221 Winter 2017 Search Engine for UCI (20)

IRJET- Automatic Text Summarization using Text Rank
IRJET- Automatic Text Summarization using Text RankIRJET- Automatic Text Summarization using Text Rank
IRJET- Automatic Text Summarization using Text Rank
 
2017 IEEE Projects 2017 For Cse ( Trichy, Chennai )
2017 IEEE Projects 2017 For Cse ( Trichy, Chennai )2017 IEEE Projects 2017 For Cse ( Trichy, Chennai )
2017 IEEE Projects 2017 For Cse ( Trichy, Chennai )
 
Research Report on Document Indexing-Nithish Kumar
Research Report on Document Indexing-Nithish KumarResearch Report on Document Indexing-Nithish Kumar
Research Report on Document Indexing-Nithish Kumar
 
Research report nithish
Research report nithishResearch report nithish
Research report nithish
 
IRJET-Deep Web Crawling Efficiently using Dynamic Focused Web Crawler
IRJET-Deep Web Crawling Efficiently using Dynamic Focused Web CrawlerIRJET-Deep Web Crawling Efficiently using Dynamic Focused Web Crawler
IRJET-Deep Web Crawling Efficiently using Dynamic Focused Web Crawler
 
How Google Works
How Google WorksHow Google Works
How Google Works
 
Mining Query Log to Suggest Competitive Keyphrases for Sponsored Search Via I...
Mining Query Log to Suggest Competitive Keyphrases for Sponsored Search Via I...Mining Query Log to Suggest Competitive Keyphrases for Sponsored Search Via I...
Mining Query Log to Suggest Competitive Keyphrases for Sponsored Search Via I...
 
Apache lucene
Apache luceneApache lucene
Apache lucene
 
Exploring the Efficiency of the Program using OOAD Metrics
Exploring the Efficiency of the Program using OOAD MetricsExploring the Efficiency of the Program using OOAD Metrics
Exploring the Efficiency of the Program using OOAD Metrics
 
ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...
ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...
ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...
 
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCESFINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
 
Dt35682686
Dt35682686Dt35682686
Dt35682686
 
Developing project objectives and Execution plan in Economy management
Developing project objectives and Execution plan in Economy management Developing project objectives and Execution plan in Economy management
Developing project objectives and Execution plan in Economy management
 
Context Based Classification of Reviews Using Association Rule Mining, Fuzzy ...
Context Based Classification of Reviews Using Association Rule Mining, Fuzzy ...Context Based Classification of Reviews Using Association Rule Mining, Fuzzy ...
Context Based Classification of Reviews Using Association Rule Mining, Fuzzy ...
 
Document Analyser Using Deep Learning
Document Analyser Using Deep LearningDocument Analyser Using Deep Learning
Document Analyser Using Deep Learning
 
MineDB Mineral Resource Evaluation White Paper
MineDB Mineral Resource Evaluation White PaperMineDB Mineral Resource Evaluation White Paper
MineDB Mineral Resource Evaluation White Paper
 
Software Task Estimation
Software Task EstimationSoftware Task Estimation
Software Task Estimation
 
MongoDB What's new in 3.2 version
MongoDB What's new in 3.2 versionMongoDB What's new in 3.2 version
MongoDB What's new in 3.2 version
 
Context sensitive indexes for performance optimization of sql queries in mult...
Context sensitive indexes for performance optimization of sql queries in mult...Context sensitive indexes for performance optimization of sql queries in mult...
Context sensitive indexes for performance optimization of sql queries in mult...
 
Preliminry report
 Preliminry report Preliminry report
Preliminry report
 

Recently uploaded

UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduitsrknatarajan
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Christo Ananth
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...RajaP95
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).pptssuser5c9d4b1
 

Recently uploaded (20)

Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
 

CompSci: 221 Winter 2017 Search Engine for UCI

  • 1. PROJECT REPORT Search Engine for ics.uci.edu “Illuminati” Submitted in partial fulfillment of the requirements of COMPSCI 221 WINTER 2017 By ABHIDNYA PATIL 91882839 SOHAM KULKARNI 20005264 MADHUR J. BAJAJ 36562594 PROJECT GUIDE: PROF. CRISTINA LOPES UNIVERSITY OF CALIFORNIA, IRVINE
  • 2. Table of Contents 1. Introduction................................................................................................................................ 1 1.1. Problem Statement.................................................................................................................. 1 1.2. Purpose .................................................................................................................................... 1 2. Implementation.......................................................................................................................... 2 2.1. Naïve Implementation ............................................................................................................. 2 2.1.1. Term frequency - inverted document frequency............................................................. 2 2.2. Performance Improvement ..................................................................................................... 2 2.2.1. Term frequency - inverted document frequency with inflation ..................................... 2 2.2.2. Hyperlink Induced Topic Search (HITS) ............................................................................ 2 2.2.3. Page Rank ......................................................................................................................... 2 2.2.4. Link Analysis ..................................................................................................................... 3 2.2.5. Stemming.......................................................................................................................... 3 2.2.4. 2-gram .............................................................................................................................. 3 2.2.4. ALL Caps Analysis.............................................................................................................. 3 3. 
NDCG Comparitive Analysis....................................................................................................... 4 4. Future Scope............................................................................................................................... 5
  • 3. 1. Introduction Search engines are applications that search records for specified phrases and returns a rundown of the reports where the keywords were found. A search engine is truly a general class of applications, notwithstanding, the term is regularly used to explicitly depict frameworks like Google, Bing, and Yahoo! Search that empower clients to search for records on the World Wide Web. The search results are for the most part introduced in a line of results regularly alluded to as search engine results pages. The information might be a blend of web pages, pictures, and different sorts of documents. Some search engines likewise mine information accessible in databases or open registries. 1.1 Problem Statement In this project, we present a Search Engine for ics.uci.edu named as “Illuminati” to search the ICS domain corpus. The search engine is formulated based on Information Retrieval techniques imbibed in CS 221. The Search Engine making use of various performance improvement techniques are then implemented incrementally to evaluate the efficiency of the proposed implementation. The naïve implementation is a generic solution based on term document frequency. Various performance enhancement methodologies are incorporated to increase the precision of the search engine. 1.2 Purpose The aim of the project is to build and investigate the efficiency of Search Engine with respect to the results generated by Google for ics.uci.edu. The task is to develop a scalable and high performance search engine, where the focus is on the algorithms challenges in efficiently representing large dataset while supporting fast searches. The project is based on the description posted on www.ics.uci.edu/~lopes/teaching/cs221W16/index.html
2. Implementation

2.1 Naïve Implementation

Using the pages stored by crawling the ics.uci.edu domain as input, the Indexer constructs an inverted index that maps words to documents (pages). As the payload of each entry we store the term frequency – inverse document frequency (tf-idf) weight and the positions of the word in each document. We employ the tf-idf weighting scheme because it helps list relevant documents: the weight increases with the number of occurrences of a term within a document and with the rarity of the term in the collection. We use the cosine similarity measure to score every document with respect to the search query.

2.2 Performance Improvement Parameters

To improve the effectiveness of the search engine, we used the following practices, which we came across while researching search engine implementation.

2.2.1 Term Frequency – Inverse Document Frequency with frequency inflation

The term frequency is inflated based on the tags in which the term is nested. The tags and the nesting level are used as a measure of the importance of the term. For instance, a term embedded in the title tag of a web page has a higher weight than the same term embedded in a paragraph tag. Likewise, a term nested within both a title and a bold tag has even more importance.

2.2.2 Hyperlink Induced Topic Search

Graph analysis is conducted to compute the inter-relation of pages: a good hub page points to multiple authoritative pages on a topic, and a good authority page for a topic is pointed to by multiple good hub pages for that topic. Hub and authority values are computed iteratively, and we prevent them from growing too large by scaling them down with a normalizing factor, which is the square root of the sum of squares of all hub values and of all authority values, respectively.

2.2.3 PageRank

The PageRank algorithm is used to rank websites in the search engine results. It works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.
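The two link-based scores described in sections 2.2.2 and 2.2.3 can be illustrated with a minimal sketch. It is not the code used in the project: the function names, the dictionary-based graph representation, the fixed iteration counts, the damping factor of 0.85, and the even redistribution of rank from pages without outgoing links are all assumptions made for illustration.

```python
import math


def all_pages(links):
    """Collect every page that appears as a source or a target."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    return pages


def normalize(scores):
    """Divide by the root of the sum of squares, as described in 2.2.2."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """Return (hub, authority) scores. `links` maps a page to its out-links."""
    pages = all_pages(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of hub scores of the pages pointing to it.
        new_auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                new_auth[target] += hub[source]
        # Hub of a page: sum of authority scores of the pages it points to.
        new_hub = {p: sum(new_auth[t] for t in links.get(p, [])) for p in pages}
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


def pagerank(links, damping=0.85, iterations=50):
    """Return PageRank scores that sum to 1 (damping value is an assumption)."""
    pages = all_pages(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no out-links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank
```

Both functions take the same link graph, for example `{"a": ["b", "c"], "b": ["c"], "c": ["a"]}`, so the resulting scores could be stored alongside the tf-idf payload in the index and combined at query time.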
2.2.4 Link Analysis

Link analysis is taken into account while indexing a document S that has a hyperlink to a document D: the term frequency of the anchor text is recorded in the source document S, and an inflated term frequency is recorded in the destination document D. The inflation in frequency can also be based on the authority value of a page, which is a measure of its importance.

2.2.5 Stemming

The morphological stemmer “Snowball” was used to reduce inflected words to their word stem. Since words with the same stem are treated as equivalent, fetching relevant documents becomes easier irrespective of the derived form of the word used in the search query.

2.2.6 2-gram

The 2-gram is an instance of the n-gram model from computational linguistics, in which the probability of a series of tokens being related to one another is considered. For instance, “UC” and “Irvine” searched individually fetch different sets of results, since each term on its own is open to multiple interpretations; when they are put together, the phrase needs to be considered as one unit to fetch documents with maximum precision. The n-gram thus helps preserve the semantics of the search query in the results listed.

2.2.7 All Caps Analysis

All Caps Analysis helped us distinguish between query terms that have different interpretations depending on the case in which they are searched. For instance, ‘rest’ and ‘REST’ are spelled identically but differ in meaning.

Implementing the above techniques helped us see how tuning the search engine with each of them improves the quality of the search results. At the same time, the results showed that there must be a right balance between the parameters that determine the overall ranking. Integrating all of them and striking that balance, so that they do not contradict each other, is a task in itself. In our attempt to improve the performance, we realized that the factors used to scale the impact of each parameter dominate the ranking of documents in the search results. The NDCG values for each query before and after performance improvement are summarized below.
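As a point of reference for the NDCG@5 figures reported in this section, the metric can be computed along the lines of the following minimal sketch. The report does not state how relevance grades were assigned or which gain formula was used, so the function names, the linear gain, and the idea of grading results against a reference ranking are assumptions made for illustration.

```python
import math


def dcg_at_k(relevances, k=5):
    """Discounted cumulative gain of the first k relevance grades.

    Assumes linear gain: grade / log2(rank + 1), with ranks starting at 1.
    """
    return sum(
        grade / math.log2(rank + 1)
        for rank, grade in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, reference_relevances, k=5):
    """Normalize DCG by the DCG of the best possible ordering.

    `relevances` are the grades of the engine's results in ranked order;
    `reference_relevances` are the grades of the reference results
    (hypothetically, grades derived from the reference ranking).
    """
    ideal = dcg_at_k(sorted(reference_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(relevances, k) / ideal
```

Under this definition, a query for which none of the top five results carries any relevance grade scores 0.0, and a query whose top five results match the ideal ordering scores 1.0.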
Query                    Before Performance Improvement    After Performance Improvement
Mondego                  0.7075                            0.7075
Machine learning         0.0                               0.0
REST                     0.4057                            0.6452
Security                 0.0                               0.0
Student affairs          0.3957                            0.3957
Graduate courses         0.0                               0.1745
Crista Lopes             0.6492                            0.6492
Software engineering     0.0                               0.0
Computer games           0.6727                            0.6727
Information retrieval    0.3157                            0.3157
AVERAGE NDCG@5           0.3146                            0.3561

[Figure: NDCG (vertical axis) across Search Engine Implementations (horizontal axis), with points A, B and C]
A: Before Performance Improvement, NDCG 0.3146
B: After Performance Improvement, NDCG 0.3561
C: Implementation based on tf-idf, NDCG 0.4670

As represented in the above diagram, we speculate that the NDCG values of the search engine implementations follow a similar trajectory, but that we are missing the peak of this trajectory. The factors used for scaling the individual parameters need to be fine-tuned to get optimal results. In our search implementation, we prioritized the factors in the order tf-idf, authority value, PageRank, and hub value: tf-idf is the primary measure of term relevance; the authority value indicates that the page is informative about the query term; PageRank is the probability of a user opening a given document; and the hub value indicates that the page points to a relevant document. In order to reduce the execution time, we tried implementing the same in multithreaded and MapReduce environments, but the overhead of context switching dominated the indexing time.

3. Future Scope

• Document Clustering
• Machine Learning
• Spell Correction
• Multi-threading Implementation
• Acronym Analysis