The document provides instructions for building a search engine application in three parts. It discusses requirements for each part, including designing user interfaces, implementing persistent data storage, and completing the application by implementing indexing and search functions. Suggestions are provided for data structures to represent the file list and inverted index, and algorithms for performing Boolean searches. The overall goal is to create an application that can index local text files and allow searching by word or phrase through a graphical user interface.
Build search engine with GUI for AND, OR, PHRASE searches
Must be similar to screenshots
I must be able to run the projects in Eclipse so that I can upload
the code to my GitHub account
The projects must say that they were created by
Juliet Mercado
Zachary Willis
Ihor Panchenko
Craig Anderson
Building a Search Engine, Part I: Governance, Workflow, and
UI
(This is the first project in this series)
You are going to design, build, and test a scaled-down version
of “Google Search”. Rather than searching the Internet's files,
you will only search local files added to your search engine's
index. Your search engine will allow an administrator to add,
update, and remove files from the index. Users will be able to
enter search terms, and select between Boolean AND, OR, or
PHRASE search. The matching file names (if any) are then
displayed in a list.
You also need to design the system architecture (the high-level
design), so you can plan each part.
Search Engine Project Proposal:
Build a search engine with a simple GUI that can do AND, OR,
and PHRASE Boolean searches on a small set of text files. The
user should be able to choose the type of search to do, and enter
some search terms. The results should be a list of file
pathnames that match the search. This should be a stand-alone
application.
User Interfaces
In addition to the main user interface (for doing searching), you
will need a separate administrator or maintenance interface to
manage your application. It should be easy to add and remove
files (from the set of indexed files), and to regenerate the index
anytime. When starting, your application should check if any of
the files have been changed or deleted since the application last
saved the index. If so, the administrator should be able to have
the index updated with the modified file(s).
Note that with HTML, Word, or other types of documents, you
would need to extract a plain text version before indexing. That
isn't hard, but the search engine is complex enough already. For
these projects, limit your search engine to only plain text files
(including .txt, .html, and other text files).
The index must be stored on disk, so next time your application
starts it can reload its data. The index, list of files, and other
data, can be stored in one or more file(s) or in a database. The
saved data should be read whenever your application starts. The
saved data should be updated (or recreated) when you add,
update, or remove documents from your set (of indexed
documents), or perhaps just when your application exits. If you
use files, the file formats are up to you; have a format that is
fast and simple to load and store.
To keep things as simple as possible, in this project you can
assume that only a small set of documents will be indexed, and
thus the whole index can be kept in memory at once. (That's
probably not the case for Google's data!) All you need to do is
be able to read the index data from disk at startup into memory,
and write it back either when updating the index, or when your
application shuts down. Note, the names (pathnames) of the
added files as well as their last modification time must be
stored in addition to the index.
If using an XML file, you can define an XML schema for it and
have some tool such as Notepad++ validate your file format for
you. XML may have other benefits, but it isn't as simple as
using plain text files. JSON might be the easiest format for
storing and reading the index data. In any case, don't forget to
include the list of file pathnames and other data you decide is
needed, along with the index itself.
Requirements:
In this project, we will follow the model-view-controller design
pattern for the project organization. This allows one to develop
each part mostly independently from the other parts.
Develop Stub User Interfaces:
In this part of the project, you must implement a non-functional
(that is, it looks good but doesn't do a thing) graphical user
interface for the application. (The “view”.) The main (default)
user interface must support searching and displaying results. It
should have various other features, such as an “About...” menu
or button, a way to quit the application (if a stand-alone
application; if your group creates a web application, there is no
need to quit), and a way to get to the administrator/maintenance
view.
The maintenance/administrator view must allow the user to
perform various administration operations: view the list of
indexed file names, add files to the index, remove files from
the index, and update the index (when files have been modified
since they were indexed).
The user interface should be complete, but none of the
functionality needs to be implemented at this time. You should
implement stub methods for the functionality not yet
implemented, and invoke them from your event handlers. The
stub methods can either return “canned” (fake but realistic)
data, or throw an exception such as Java's
UnsupportedOperationException. The only
button that needs to do anything is the one used to switch to the
maintenance view.
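A stub along these lines could back the event handlers until the real model exists. This is only a sketch: the class name SearchModelStub, the SearchType enum, and both method names are assumptions made for illustration, and the pathnames are invented canned data.

```java
import java.util.List;

// Hypothetical stand-in for the model while only the view exists.
// The class, enum, and method names are illustrative assumptions.
class SearchModelStub {
    public enum SearchType { AND, OR, PHRASE }

    // Canned (fake but realistic) data so the results list has something to show.
    public List<String> search(String terms, SearchType type) {
        return List.of("/home/user/docs/notes.txt", "/home/user/docs/todo.txt");
    }

    // Not implemented yet; an event handler can catch this and show a message.
    public void addFile(String pathname) {
        throw new UnsupportedOperationException("addFile is not implemented yet");
    }
}
```

An event handler for the search button would call search(...) and display the returned list; the handlers for the maintenance buttons can catch UnsupportedOperationException and report that the feature is not available yet.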
Since the user interfaces don't do anything, there is nothing to
test yet. However, you must create a test class with at least one
test method (it can just return success if you wish). I suggest
you agree to use JUnit 4 style tests for now.
Building a Search Engine, Part II: Persistent Data
Please read the background information and full project
description from Search Engine Project, Part I. In this project,
you will implement the persistent data (the “model”) part of the
project: the saving of data and the loading of data at the next
start. The persistent data contains the list of files used in the
index, and the index itself.
First discuss which persistence solution you will use: text files,
XML or JSON files, or a database. If using a database, choose
between embedded (my suggestion) or server, and choose between
the JDBC and JPA database APIs (I suggest JPA). You can make
this decision before knowing the details of the data structures
used.
Before working on actual code, you need to decide on the data
structures to be used for the file list and the inverted index. Try
to read the Java collections material before deciding.
It should be easy to add and remove files (from the set of
indexed files). When starting, your application should check if
any of the files used have been changed or deleted since the
application last saved the index. If so, the “admin” user should
be able to have the inverted index file(s) updated, from the
maintenance interface.
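The startup check described above might look like the following sketch. It assumes the saved data can be presented as a map from pathname to the modification time (in milliseconds) recorded at indexing; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: compares saved modification times with the files on disk.
class StaleFileCheck {
    // Returns the pathnames whose file is missing, or whose last modification
    // time differs from the one saved with the index.
    public static List<String> findStale(Map<String, Long> savedTimes) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Long> e : savedTimes.entrySet()) {
            Path p = Path.of(e.getKey());
            try {
                if (!Files.isRegularFile(p)
                        || Files.getLastModifiedTime(p).toMillis() != e.getValue()) {
                    stale.add(e.getKey());
                }
            } catch (IOException ex) {
                // The time could not be read; report the file so the admin can decide.
                stale.add(e.getKey());
            }
        }
        return stale;
    }
}
```

The maintenance view could show the returned list and offer to re-index or remove each entry.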
(Note that with HTML or Word documents, you would need to
extract a plain text version before indexing.) In this project, all
the “indexable” files are plain text. You are free to assume the
system-default text file encoding, or assume UTF-8 encoding,
for all files.
The inverted index can be stored in one or more file(s), and that
should be read whenever your application starts. The file(s)
should be updated (or recreated) when you add, update, or
remove documents from your set (of indexed documents). The
file format is up to you, but it should be fast
and simple to search. However, to keep things simpler, in this
project you can assume that only a small set of documents will
be indexed, and thus the whole index can be kept in memory.
All you need to do is be able to read the index data from a file
at startup into memory, and write it back when updating the
index. Don't forget that the names (pathnames) of the files and
their last modification times must be stored as well. It is your
choice to use a single file or multiple files, in plain text, JSON,
XML, or any format your group chooses, to hold the persistent
data. If you want, you can use any DBMS. (In that case, I
suggest using the JavaDB included with the JDK, as an
embedded database.) In any case, your file format(s) or database
schema must be documented completely, so that someone else,
without access to your source code could use your file(s) or
database correctly.
If using an XML format, you can define an XML schema for your
file and have some tool such as Notepad++ validate your file
format for you. XML may have other benefits, but it isn't as
simple as plain text files or even JSON files. In any case, don't
forget to include the list of file (path) names, along with the
index itself, in your persistent data store.
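As one illustration of a plain-text option, the sketch below converts the file list and index to and from lines of text. The format (tab-separated "F" and "W" lines), the class name, and the field names are all assumptions made for this example, not the model solution's format. It also assumes words contain no whitespace and pathnames contain no line breaks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical line-oriented format (an assumption for illustration):
//   F <TAB> fileId <TAB> lastModifiedMillis <TAB> pathname
//   W <TAB> word <TAB> fileId:position fileId:position ...
class IndexTextFormat {
    public final Map<Integer, String> paths = new TreeMap<>();     // fileId -> pathname
    public final Map<Integer, Long> modified = new TreeMap<>();    // fileId -> mtime (ms)
    public final Map<String, List<int[]>> index = new TreeMap<>(); // word -> {fileId, position}

    // Produces one line per indexed file, then one line per indexed word.
    public List<String> toLines() {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<Integer, String> e : paths.entrySet()) {
            lines.add("F\t" + e.getKey() + "\t" + modified.get(e.getKey()) + "\t" + e.getValue());
        }
        for (Map.Entry<String, List<int[]>> e : index.entrySet()) {
            List<String> pairs = new ArrayList<>();
            for (int[] p : e.getValue()) pairs.add(p[0] + ":" + p[1]);
            lines.add("W\t" + e.getKey() + "\t" + String.join(" ", pairs));
        }
        return lines;
    }

    // Rebuilds the in-memory data from lines produced by toLines().
    public static IndexTextFormat fromLines(List<String> lines) {
        IndexTextFormat s = new IndexTextFormat();
        for (String line : lines) {
            String[] f = line.split("\t", 4);
            if (f[0].equals("F")) {
                int id = Integer.parseInt(f[1]);
                s.modified.put(id, Long.parseLong(f[2]));
                s.paths.put(id, f[3]);
            } else if (f[0].equals("W")) {
                List<int[]> postings = new ArrayList<>();
                if (!f[2].isEmpty()) {
                    for (String pair : f[2].split(" ")) {
                        String[] ip = pair.split(":");
                        postings.add(new int[] { Integer.parseInt(ip[0]), Integer.parseInt(ip[1]) });
                    }
                }
                s.index.put(f[1], postings);
            }
        }
        return s;
    }
}
```

The lines can be written with Files.write(path, lines) and read back with Files.readAllLines(path). Documenting the two line types, as in the comment at the top, is the kind of format description the requirement asks for.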
Part II Requirements:
In this part, you must implement the file operations of your
search engine application (the model). That includes reading
and updating your persistent data (that is, the inverted index as
well as any other information you need to store between runs of
your application, such as the list of files (their pathnames) that
have been indexed). The main file operation is reading each
file to be indexed a “word” at a time; you also need to check
whether the previously indexed files still exist or have been
modified since they were last indexed.
The maintenance part of the user interface should allow users to
select files for indexing, and to keep track of which files have
been added to the index. For each file, you need to keep the full
pathname of the file as well as the file's last modification time.
Your code should correctly handle the user entering non-existent
files and unreadable files. How you handle such errors
is up to you.
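One possible way to check a file before indexing it, and then read it a “word” at a time, is sketched below. The class and method names are hypothetical, UTF-8 is assumed, and a “word” here is simply a whitespace-separated token; returning a message string for problems is just one of many ways to handle the errors.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper for the maintenance view's "add file" operation.
class FileWords {
    // Returns null if the file looks indexable, otherwise a message for the user.
    public static String problemWith(Path p) {
        if (!Files.exists(p)) return "File does not exist: " + p;
        if (!Files.isRegularFile(p)) return "Not a regular file: " + p;
        if (!Files.isReadable(p)) return "File is not readable: " + p;
        return null;
    }

    // Reads the file one whitespace-separated token at a time.
    public static List<String> readWords(Path p) throws IOException {
        String problem = problemWith(p);
        if (problem != null) throw new IOException(problem);
        List<String> words = new ArrayList<>();
        try (Scanner in = new Scanner(Files.newBufferedReader(p, StandardCharsets.UTF_8))) {
            while (in.hasNext()) words.add(in.next());
        }
        return words;
    }
}
```

The index of each token in the returned list can serve as the word's position in the document.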
You can download a Search Engine model solution, to play with
it and inspect its user interface. My solution keeps all persistent
data in a single text file in the user's home directory, but you
can certainly use a different persistence solution.
Possible Data Structures:
In Part III, you will implement the index operations, including
Boolean searching, adding to the index, and removing files from
the index. (The index is a complex collection of collections.)
Because the
format of the index and file list will affect the code used to read
and write them to and from storage, you must decide on the in-
memory data structures to be used early. In the model solution,
I used a List of FileItem objects for the list of indexed files;
each FileItem contained a file's pathname and date it was read
for the index. The index data itself is stored in a Map, with the
indexed words as keys, and a Set of IndexData objects
as the values. Each IndexData object holds the id of the file
containing the word and the position of the word in that
document. (The classes FileItem and IndexData were trivial to
write.)
This is NOT the only, or the best, way to represent the index or
file list! (For example, a List of int[2] arrays might be simpler
than a Set of IndexData objects.) You should decide on the
types of collections used. Only then can you implement the
methods to read and write the data.
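The List, Map, and Set arrangement described above could be declared roughly as follows. This is a sketch, not the model solution's code: the field names, the wrapper class InvertedIndexSketch, and the use of Java records (Java 16 or later) are assumptions, and how file ids are assigned is left open.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// One indexed file: its pathname and a timestamp in milliseconds (field names assumed).
record FileItem(String pathname, long timestampMillis) {}

// One occurrence of a word: which file, and where in that file.
record IndexData(int fileId, int position) {}

// Hypothetical container for the two in-memory structures.
class InvertedIndexSketch {
    public final List<FileItem> files = new ArrayList<>();
    public final Map<String, Set<IndexData>> index = new HashMap<>();

    // Records that 'word' occurs in file 'fileId' at 'position'.
    public void addOccurrence(String word, int fileId, int position) {
        index.computeIfAbsent(word, k -> new HashSet<>())
             .add(new IndexData(fileId, position));
    }
}
```

Records supply equals and hashCode automatically, which a Set of IndexData objects relies on; with ordinary classes you would need to write both methods yourself.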
Building a Search Engine, Part III:
Collections
Please read the background information and full project
description from Search Engine Project, Part I.
In this final part of the project, you will complete the
application by implementing the index functions. These include
adding a file to the index, and removing a file from the index,
and reading and writing the index from/to a file. (Updating the
index when a file has been changed, can then be done by
removing and then re-adding a file.) Other operations include
searching the index for a given word, and returning a Set of
pairs (document ID and position) for that word.
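Assuming the index is a Map from words to Sets of (document ID, position) pairs, the add, remove, and lookup operations could be sketched like this. The names Pair and IndexOps are hypothetical, positions are counted from 0 as an arbitrary choice, and the words are assumed to be already normalized.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// A (document ID, position) pair; requires Java 16 or later for records.
record Pair(int docId, int position) {}

// Hypothetical helper holding the basic index operations.
class IndexOps {
    // Adds every word of one document, using its list index as the position.
    public static void addDocument(Map<String, Set<Pair>> index, int docId, List<String> words) {
        for (int pos = 0; pos < words.size(); pos++) {
            index.computeIfAbsent(words.get(pos), k -> new HashSet<>())
                 .add(new Pair(docId, pos));
        }
    }

    // Removes all pairs for one document, then drops words left with no pairs.
    public static void removeDocument(Map<String, Set<Pair>> index, int docId) {
        for (Set<Pair> pairs : index.values()) {
            pairs.removeIf(p -> p.docId() == docId);
        }
        index.values().removeIf(Set::isEmpty);
    }

    // Returns the pairs for a word, or an empty set if the word is not indexed.
    public static Set<Pair> lookup(Map<String, Set<Pair>> index, String word) {
        return index.getOrDefault(word, Set.of());
    }
}
```

Updating a changed file is then removeDocument followed by addDocument with the freshly read words.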
Finally, you will have to implement the Boolean search
functions of the main user interface. (This is complex enough,
that it should have been another project!) I suggest you start
with an “OR” search, then worry about implementing the
“AND” and “PHRASE” search functions.
When building the index, keep in mind you will need to define
what you mean by “word”. One possibility is to strip out any
characters that are not letters or digits, and convert the result
to all lowercase,
both when you build the inverted index and when you read the
search terms entered by the user. Ideally, you can use the I18N
methods discussed in class to normalize the words.
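The “strip and lowercase” possibility could be sketched as below. The class and method names are hypothetical, and this version does not apply Unicode normalization or any of the I18N methods from class; it only keeps letters and digits and lowercases the result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: one possible definition of "word".
class WordNormalizer {
    // Keeps only letters and digits, then converts to lowercase.
    public static String normalize(String token) {
        StringBuilder sb = new StringBuilder();
        token.codePoints()
             .filter(Character::isLetterOrDigit)
             .forEach(sb::appendCodePoint);
        return sb.toString().toLowerCase(Locale.ROOT);
    }

    // Splits text on whitespace and normalizes each token,
    // dropping tokens that end up empty (such as ",.").
    public static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            String w = normalize(token);
            if (!w.isEmpty()) out.add(w);
        }
        return out;
    }
}
```

Using the same words(...) method for both the indexed files and the user's search terms keeps the two consistent. Note that with this definition “don't” becomes “dont”.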
Implementing Boolean Search:
The exact method depends in part on how you implement the
inverted index. In the suggested implementation (a Map with
words as the keys, and a List or Set of (document ID, position)
pairs as the values), you could implement the Boolean searches
using algorithms similar to the following (you can come up with
your own if you wish):
OR Search
This is the easiest one to implement. The general idea is to start
with an empty Set of matching files. Then add to that Set the
files containing each search term: just look up each word in the
Map, and add each document found (if any). The result is the
OR search results: the files that contain any word in the search
list. (If the user enters no search words, say “ ,.”, then no files
are considered as matching.)
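A minimal sketch of the OR search just described, assuming the index is a Map from words to Sets of (document ID, position) pairs and that the search terms are already normalized. The names Pair and OrSearch are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// A (document ID, position) pair; requires Java 16 or later for records.
record Pair(int docId, int position) {}

class OrSearch {
    // Returns the IDs of documents containing at least one of the terms.
    public static Set<Integer> search(Map<String, Set<Pair>> index, List<String> terms) {
        Set<Integer> matches = new TreeSet<>();   // start with no matching files
        for (String term : terms) {
            for (Pair p : index.getOrDefault(term, Set.of())) {
                matches.add(p.docId());           // add each document found, if any
            }
        }
        return matches;
    }
}
```

With an empty list of terms the loop never runs, so the result is empty, which matches the behavior described above.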
AND Search
This is done the opposite way from an OR search, and is only a
little harder to implement. The idea is to start with a set of all
files in the index. Then for each search term, for each file in the
Set, make sure that file is contained in the index for that search
term. Remove any files from the set that don't contain that
word. The resulting final set is the documents matching all
search terms. (If the user enters no search words, say “ ,.”, then all
files are considered as matching. If that isn't the behavior you
want, you need to treat that as a special case.)
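The AND search can be sketched in the same style. Again, Pair and AndSearch are hypothetical names, the index is assumed to map words to Sets of (document ID, position) pairs, and the caller is assumed to supply the set of all indexed document IDs.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// A (document ID, position) pair; requires Java 16 or later for records.
record Pair(int docId, int position) {}

class AndSearch {
    // Returns the IDs of documents containing every one of the terms.
    public static Set<Integer> search(Map<String, Set<Pair>> index,
                                      Set<Integer> allDocIds, List<String> terms) {
        Set<Integer> matches = new TreeSet<>(allDocIds);  // start with all files
        for (String term : terms) {
            Set<Integer> docsWithTerm = new HashSet<>();
            for (Pair p : index.getOrDefault(term, Set.of())) {
                docsWithTerm.add(p.docId());
            }
            matches.retainAll(docsWithTerm);  // drop files that lack this term
        }
        return matches;
    }
}
```

As noted above, an empty list of terms returns every document; add an explicit check at the top of the method if you want an empty result instead.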
PHRASE Search
This is the hardest search to implement. Unlike the OR and the
AND searches, with PHRASE searching, the position of the
search terms in the files matters. The algorithm I came up with
is:
Create an initially empty Set of Pair objects.
Add to the set the Pair objects for the files that contain the first
word of the phrase. This is the easy part: just look up that word
in the Map, and add all Pair objects found to a set.
The Set now contains Pair objects for just the files that might
contain the phrase. Next, loop over the remaining words of the
phrase, removing any Pairs from the set that are no longer
possible phrase continuations. (Actually, I just build a new Set.)
For each remaining word in the phrase:
Create a new, empty set of Pairs.
For each Pair in the previous set, see if the word appears in the
same file, but in the next position. If so, add the Pair object for
the word to the new set.
An example may help clarify this. Suppose the search phrase is
“big top now”. The set initially contains all the Pair objects for
the word “big”. Let's say for example, that set looks like:
(file1,position7), (file1,position22), (file3,position4)
For each Pair object in that set, you need to see if “top” is in
that same file, but the next position. If so, you add the Pair
object for that to the new Set. The (inner) loop for this example
checks each of the following:
Is a (file1,position8) Pair object in the Map for the word "top"?
Is a (file1,position23) Pair object in the Map for the word
"top"?
Is a (file3,position5) Pair object in the Map for the word "top"?
If the answer is “yes”, then add that Pair object to the new set.
When this loop ends, the new set will contain the Pair objects
for the phrase “big top” (pointing to the position of the word
“top”).
For example, suppose “top” is only found in (file1,position8)
and (file3,position5). You replace the first set with this new set:
(file1,position8), (file3,position5)
Repeat for the next word in the phrase, using the set built in the
previous loop.
Continue until the set is empty (so the phrase was not found), or until
the last word of the phrase has been processed. The Pair objects
remaining in the final set are the ones that contain the phrase;
the position will be that of the last word of the phrase. (We only
need to display the file name; in this project, the position of the
phrase doesn't matter.)
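The PHRASE algorithm above can be sketched as follows, under the same assumptions as before: the index maps words to Sets of (document ID, position) pairs, the phrase words are already normalized, and Pair and PhraseSearch are hypothetical names. Like the description, it builds a new Set for each word rather than removing from the old one.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// A (document ID, position) pair; requires Java 16 or later for records.
record Pair(int docId, int position) {}

class PhraseSearch {
    // Returns the IDs of documents containing the words of 'phrase' consecutively.
    public static Set<Integer> search(Map<String, Set<Pair>> index, List<String> phrase) {
        Set<Integer> result = new TreeSet<>();
        if (phrase.isEmpty()) return result;

        // Start with every occurrence of the first word.
        Set<Pair> current = new HashSet<>(index.getOrDefault(phrase.get(0), Set.of()));

        // For each remaining word, keep only occurrences that continue the phrase.
        for (int i = 1; i < phrase.size() && !current.isEmpty(); i++) {
            Set<Pair> occurrences = index.getOrDefault(phrase.get(i), Set.of());
            Set<Pair> next = new HashSet<>();
            for (Pair p : current) {
                Pair candidate = new Pair(p.docId(), p.position() + 1);
                if (occurrences.contains(candidate)) next.add(candidate);
            }
            current = next;
        }

        // Only the file matters for display, not the position.
        for (Pair p : current) result.add(p.docId());
        return result;
    }
}
```

With the example data above (“big” at (file1,7), (file1,22), (file3,4) and “top” at (file1,8), (file3,5)), searching for “big top” yields files 1 and 3.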
Part III Requirements:
This project has been split into three parts. Each part counts as
a separate project. In the first two parts, you designed and
implemented a graphical user interface for the application, and
added all required file operations.
In this part, you must implement the remaining operations of
your search engine application: the index operations, and the
searching.
You can download a Search Engine model solution, to play with
it and inspect its user interface, but please keep in mind you
should not copy that user interface; instead, invent a better,
nicer-looking one.
Hints:
Keep your code as simple as possible.
The inverted index is naturally a Map, from words (the keys) to
a Set of objects (the values). Each of the objects represents a
document and a location within that document, where the word
was found. I called these objects Pairs, since they are a pair of
numbers, but you can use any name for your classes. Note, you
will need to be able to go from a document number to a file
name, when you display the search results.
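Going from document numbers back to file names could be as simple as the sketch below. It assumes a Map from document ID to pathname is kept alongside the index; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for turning search results into displayable pathnames.
class ResultNames {
    // Looks up each matching document ID and returns the pathnames in sorted order.
    public static List<String> toPathnames(Set<Integer> docIds, Map<Integer, String> pathById) {
        List<String> names = new ArrayList<>();
        for (int id : docIds) {
            String path = pathById.get(id);
            if (path != null) names.add(path);  // skip IDs with no known file
        }
        Collections.sort(names);
        return names;
    }
}
```

The returned list can be handed directly to the results list in the main user interface.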