JAVA 2013 IEEE DATAMINING PROJECT Ginix generalized inverted index for keyword search

Ginix: Generalized Inverted Index for Keyword Search
ABSTRACT
Keyword search has become a ubiquitous method for users to access text data
in the face of information explosion. Inverted lists are usually used to index
underlying documents to retrieve documents according to a set of keywords
efficiently. Since inverted lists are usually large, many compression
techniques have been proposed to reduce the storage space and disk I/O time.
However, these techniques usually perform decompression operations on the
fly, which increases the CPU time. This paper presents a more efficient index
structure, the Generalized INverted IndeX (Ginix), which merges consecutive
IDs in inverted lists into intervals to save storage space. With this index
structure, more efficient algorithms can be devised to perform the basic keyword
search operations, i.e., union and intersection, by taking advantage of
intervals. Specifically, these algorithms do not require
conversions from interval lists back to ID lists. As a result, keyword search
using Ginix can be more efficient than search using traditional inverted indices.
The performance of Ginix is also improved by reordering the documents in
datasets using two scalable algorithms. Experiments on the performance and
scalability of Ginix on real datasets show that Ginix not only requires less
storage space, but also improves the keyword search performance, compared
with traditional inverted indexes.
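
As a minimal sketch of the core idea above, the following Java snippet merges runs of consecutive document IDs into closed intervals. The class and method names (IntervalListBuilder, toIntervals) and the int[] {lo, hi} representation are illustrative assumptions, not the data layout used in the paper.

```java
import java.util.ArrayList;
import java.util.List;

class IntervalListBuilder {
    /**
     * Converts a sorted array of distinct document IDs into a list of
     * closed intervals {lo, hi}; a run of consecutive IDs becomes one interval.
     */
    static List<int[]> toIntervals(int[] sortedIds) {
        List<int[]> intervals = new ArrayList<>();
        for (int id : sortedIds) {
            int last = intervals.size() - 1;
            if (last >= 0 && intervals.get(last)[1] + 1 == id) {
                intervals.get(last)[1] = id;       // extend the current run
            } else {
                intervals.add(new int[] {id, id}); // start a new interval
            }
        }
        return intervals;
    }
}
```

For the ID list 1, 2, 3, 5, 8, 9 this yields the three intervals [1, 3], [5, 5], and [8, 9], so six stored IDs shrink to three interval entries.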
SYSTEM ANALYSIS
Existing System:
Beyond asking for explicit user input, earlier work focused on handling
recency queries, which are queries that seek recent events or breaking
news. The time-sensitive approach processes a recency query by computing
traditional topic similarity scores for each document and then “boosting” the
scores of the most recent documents, to privilege recent articles over older
ones. In contrast to traditional models, which assume a uniform prior
probability of relevance for each document d in a collection, Li and Croft
define the prior to be a function of document d’s creation date. The prior
probability decreases exponentially with time, and hence recent documents are
ranked higher than older documents. Li and Croft’s strategy is designed for
queries that seek recent documents, but it does not handle other types of
time-sensitive queries, such as [Madrid bombing] or [Google IPO], that
implicitly target one or more past time periods.
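
The exponentially decreasing prior described above can be sketched as follows. This assumes the prior takes the form of an exponential density over document age; the class name RecencyPrior, the rate parameter, and the use of days as the time unit are illustrative assumptions.

```java
class RecencyPrior {
    /** Exponential prior over document age: rate * exp(-rate * ageInDays). */
    static double prior(double ageInDays, double rate) {
        return rate * Math.exp(-rate * ageInDays);
    }

    /** Log-domain ranking score: topical log-likelihood plus the log of the prior. */
    static double score(double topicalLogLikelihood, double ageInDays, double rate) {
        return topicalLogLikelihood + Math.log(prior(ageInDays, rate));
    }
}
```

With equal topical scores, a newer document always receives the higher combined score, which is the "boost" for recent articles that the paragraph describes.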

Proposed System:
Many compression techniques have been proposed to reduce the storage space and disk I/O
time. However, these techniques usually perform decompression operations on the fly, which
increases the CPU time. This paper presents a more efficient index structure, the Generalized
INverted IndeX (Ginix), which merges consecutive IDs in inverted lists into intervals to save
storage space. The problem of document reordering is equivalent to placing similar
documents close to each other. Silvestri[5] proposed a simple method that sorts web
pages in lexicographical order based on their URLs as an acceptable solution to the problem.
This method is reasonable because URLs are usually good indicators of web page
content. The performance of Ginix is also improved by reordering the documents in datasets
using two scalable algorithms. Experiments on the performance and scalability of Ginix on real
datasets show that Ginix not only requires less storage space, but also improves the keyword
search performance, compared with traditional inverted indexes.
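
The URL-based reordering attributed to Silvestri can be sketched as below: documents are sorted lexicographically by URL and each receives its rank as a new ID, so pages from the same site end up with neighboring IDs. The class name UrlReordering and the array-based interface are assumptions made for illustration.

```java
import java.util.Arrays;

class UrlReordering {
    /** Returns newId[oldId]: the rank of each document after sorting by URL. */
    static int[] reorderByUrl(String[] urls) {
        Integer[] order = new Integer[urls.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // Sort old IDs by the lexicographic order of their URLs.
        Arrays.sort(order, (a, b) -> urls[a].compareTo(urls[b]));
        int[] newId = new int[urls.length];
        for (int rank = 0; rank < order.length; rank++) newId[order[rank]] = rank;
        return newId;
    }
}
```

After such a remapping, documents sharing a keyword are more likely to have consecutive IDs, which tends to produce longer and fewer intervals.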
Advantages:
1. Efficient algorithms are given to support basic operations on interval lists, such as union
and intersection, without decompression.
2. The problem of enhancing the performance of Ginix by document reordering is investigated,
and two scalable and effective algorithms, based on signature sorting and on a greedy heuristic
for the Traveling Salesman Problem (TSP)[3], are proposed.
3. Extensive experiments that evaluate the performance of Ginix are conducted. Results show
that Ginix not only reduces the index size but also improves the search performance on real
datasets.
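
To illustrate the first advantage, here is a hedged sketch of intersection and union computed directly on interval lists, without expanding them back into ID lists. It is a plain two-pointer merge written for this summary, not the algorithms from the paper; the names IntervalOps, intersect, and union are assumptions, and both inputs are assumed to be sorted lists of disjoint, non-adjacent intervals {lo, hi}.

```java
import java.util.ArrayList;
import java.util.List;

class IntervalOps {
    /** Intersects two sorted lists of disjoint closed intervals {lo, hi}. */
    static List<int[]> intersect(List<int[]> a, List<int[]> b) {
        List<int[]> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int lo = Math.max(a.get(i)[0], b.get(j)[0]);
            int hi = Math.min(a.get(i)[1], b.get(j)[1]);
            if (lo <= hi) out.add(new int[] {lo, hi}); // overlapping part
            // Advance the list whose current interval ends first.
            if (a.get(i)[1] < b.get(j)[1]) i++; else j++;
        }
        return out;
    }

    /** Unions two sorted interval lists, merging overlapping or adjacent intervals. */
    static List<int[]> union(List<int[]> a, List<int[]> b) {
        List<int[]> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() || j < b.size()) {
            int[] next;
            if (j >= b.size() || (i < a.size() && a.get(i)[0] <= b.get(j)[0])) {
                next = a.get(i++);
            } else {
                next = b.get(j++);
            }
            int last = out.size() - 1;
            if (last >= 0 && next[0] <= out.get(last)[1] + 1) {
                // Overlaps or touches the previous output interval: extend it.
                out.get(last)[1] = Math.max(out.get(last)[1], next[1]);
            } else {
                out.add(new int[] {next[0], next[1]}); // copy so inputs stay unchanged
            }
        }
        return out;
    }
}
```

For a = [1, 3], [7, 9] and b = [2, 8], the intersection is [2, 3], [7, 8] and the union is the single interval [1, 9]. Both run in time linear in the number of intervals rather than the number of IDs.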

Module Description:
1. Search over Blogs
2. Time interval feedback
3. Temporal relevance feedback (time-sensitive results)
4. Overall ranking document identification
5. Blogs Growth Charts
Search over Blogs:
Time is an important dimension of relevance for a large number of searches, such as those
over blogs and news archives. So far, research
on searching over such collections has largely focused on retrieving topically similar
documents for a query. Unfortunately, ignoring or not fully exploiting the time dimension can
be detrimental for a large family of queries for which we should consider not only the
topical relevance of a document but also its publication time.
Time Interval Feedback:
Given a time-sensitive query over a news archive, our approach automatically identifies
important time intervals for the query. These intervals are then used to adjust the document
relevance scores by boosting the scores of documents published within the important
intervals. We have implemented our system on top of Indri, a state-of-the-art search engine
that combines language models and inference networks for retrieval, as well as over Lemur.
Our system provides a web interface for searching the Newsblaster
archive, an operational news archive and summarization system, and for experimenting with
variations of our approach.
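
A minimal sketch of the score adjustment described above, assuming a fixed additive boost and day-number timestamps. The text does not specify the boosting function, so the class IntervalBoost, the additive form, and the parameters are all illustrative assumptions.

```java
class IntervalBoost {
    /**
     * Returns the score plus a fixed boost if the document's publication day
     * falls inside any important interval {startDay, endDay}; otherwise the
     * score is returned unchanged.
     */
    static double adjust(double score, long publishedDay,
                         long[][] importantIntervals, double boost) {
        for (long[] interval : importantIntervals) {
            if (publishedDay >= interval[0] && publishedDay <= interval[1]) {
                return score + boost;
            }
        }
        return score;
    }
}
```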

Temporal Relevance Feedback:
We discuss several techniques to estimate the temporal relevance of a day to a query at
hand. These estimation techniques use the temporal distribution of matching articles for the
query to compute the probability that a day in the archive has a relevant document for the
query.
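
One simple estimator of this kind, shown only as a sketch, is the normalized histogram of publication days among the articles that match the query. The class TemporalRelevance and the day-number representation are assumptions; the text does not say which estimation techniques the system actually uses.

```java
import java.util.HashMap;
import java.util.Map;

class TemporalRelevance {
    /**
     * Estimates the temporal relevance of each day as the fraction of
     * matching articles that were published on that day.
     */
    static Map<Long, Double> dayDistribution(long[] publishDaysOfMatches) {
        Map<Long, Double> dist = new HashMap<>();
        if (publishDaysOfMatches.length == 0) return dist;
        double weight = 1.0 / publishDaysOfMatches.length;
        for (long day : publishDaysOfMatches) {
            dist.merge(day, weight, Double::sum); // accumulate per-day mass
        }
        return dist;
    }
}
```

Days with many matching articles receive a high estimate; days with no matches are simply absent from the map.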
Overall ranking document identification:
We integrate temporal relevance with state-of-the-art retrieval models, including a
query likelihood model, a relevance model, a probabilistic relevance model, and a query
expansion with pseudo relevance feedback model, to naturally process time-sensitive queries.
In these models, we combine topical relevance and temporal relevance to determine the
overall relevance of a document.
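
As an illustration of combining the two signals, the sketch below ranks documents by the product of a topical score and a temporal score. The product form assumes the two components are independent, which is an assumption of this example rather than something the text states; the class name CombinedRanking is likewise hypothetical.

```java
import java.util.Arrays;

class CombinedRanking {
    /** Returns document indices ordered by topical[i] * temporal[i], highest first. */
    static int[] rank(double[] topical, double[] temporal) {
        Integer[] idx = new Integer[topical.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort indices by descending combined score.
        Arrays.sort(idx, (x, y) ->
            Double.compare(topical[y] * temporal[y], topical[x] * temporal[x]));
        int[] out = new int[idx.length];
        for (int i = 0; i < out.length; i++) out[i] = idx[i];
        return out;
    }
}
```

A document with a slightly lower topical score can outrank a topically stronger one if it was published on a day that is far more relevant to the query.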
Blogs Growth Charts:
The scalability of Ginix was evaluated using different numbers of records in the DBLP dataset.
Search time: Since the algorithms take advantage of the intervals, search with
Ginix is nearly 2x faster than with InvIndex.

Algorithm:
SYSTEM SPECIFICATION
Hardware Requirements:
• System : Pentium IV 2.4 GHz
• Hard Disk : 80 GB
• Floppy Drive : 1.44 MB
• Monitor : 15" VGA Colour
• Mouse : Optical Mouse
• RAM : 512 MB
Software Requirements:
• Operating system : Windows 7 32 Bit.
• Coding Language : ASP.Net 4.0 with C#
• Data Base : SQL Server 2008
