Degree project

Independent thesis, 15 HE, for degree of Bachelor in
Computer Science
Spring term 2016
Improving search engines in client based
applications
A research and comparison of different
application search engines
Christian Sahlström Anton Mauritzson
School of Health and Society

Authors
Christian Sahlström
Anton Mauritzson
Title
Improving search engines in client based applications
Supervisor
Kamilla Klonowska
Examiner
Eric Chen
Abstract
For decades, relational databases have dominated the database market and have been the primary
choice for developers. In recent years new technologies have been introduced, where the biggest
contender is the document-based solution called NoSQL. This thesis provides research into current
technologies together with a constructed hybrid inspired by known search algorithms, search
techniques and web search engines. The hybrid presented in the thesis is 24% better than the
relational database and the NoSQL solution at finding the correct word when faced with malformed
user input. Our hybrid solution can be incorporated in a relational database when used as a search
engine. The algorithms can also be used together with a NoSQL search engine to enable better word
discovery. The hybrid showed no improvement in time or memory usage and had no positive effect on
the search engine when the database was significantly scaled up.
Keywords
Relational databases, NoSQL, search algorithms

Table of contents
Document page
Abstract
1 Introduction
1.1 Aim and purpose
1.2 Thesis questions
1.3 Methodology
1.4 Report outline
2 Literature study
2.1 Systematic literature review
2.2 Related work
2.3 Literature result
3 Experiment
3.1 Implementations
3.2 Evaluation Criteria
4 Hybrid algorithm implementation
4.1 Using distance to build queries
4.2 Finding relation between keywords
4.3 Using search history to increase performance
4.4 Algorithm overview
5 Experiment results
5.1 Performance
5.2 Relevance
5.3 Scalability
5.4 Experiment evaluation
5.5 Experiment summary
6 Discussion and ethical aspects
7 Conclusion and further work
8 References

1 Introduction
The need for search engines that present data to users, in applications both on the web and
locally, is ever growing [1, 2]. Options for developers are plentiful, with an abundance of
solutions and techniques. Popular implementations based on SQL (Structured Query Language),
such as RDBMS (relational database management systems), have severe limitations when
processing keywords that do not exactly match the targeted attribute in the database. Other
alternatives may be inaccessible due to lack of knowledge or software costs.
The difference in search methods between web search engines (e.g. Google, Yahoo) and
search engines built into applications is in most cases stark. Google and Yahoo depend on
providing relevant data to their users to prevail over the competition; their product is the
information, so they have effective means of providing their users with relevant results.
Many smaller applications have another primary focus, so their search engines can get
overlooked.
As SQL is the most frequently used means of retrieving information from RDBMS [3],
there is a natural demand for higher precision when dealing with keywords, relevance
and similar issues. Reducing the number of empty search results caused by deviations from
the expected input would greatly reduce the loss of information. For example, take the user
input “computer mouse”, with the keywords “computer” and “mouse”. Using the SQL query
SELECT * FROM TABLE_NAME WHERE COLUMN_NAME = 'computer mouse'
the results are limited to rows matching the entire phrase “computer mouse”. For
results relevant to the separate keywords “computer” and “mouse” the developer would
have to perform an OR query; these are time consuming and the result could be
irrelevant to the entire phrase. This need to retrieve data related to the original input is
especially important for large databases containing multiple rows with similar data (e.g.
online stores). Besides the demand for correct results based on relevance, there
is a need to handle other human errors such as misspellings. Many RDBMS providers
support a function called Soundex, which matches strings by how they sound and can
therefore handle some misspellings. In MySQL, for example:
SELECT * FROM computers WHERE component SOUNDS LIKE 'keybroad'
should match the same rows as:
SELECT * FROM computers WHERE component = 'keyboard'
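The keyword split in the “computer mouse” example can be sketched in Java. This is a minimal illustration, not the thesis implementation: the class and method names (KeywordQueryBuilder, keywords, buildOrQuery) and the table and column names in the usage note are assumptions made for the example. The sketch only builds a parameterized query string; the keywords themselves would be bound as parameters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits user input into keywords and builds an OR query.
class KeywordQueryBuilder {

    // Splits the raw user input on whitespace into individual keywords.
    static List<String> keywords(String input) {
        List<String> result = new ArrayList<>();
        for (String token : input.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // Builds "SELECT * FROM <table> WHERE <column> = ? OR <column> = ? ..."
    // with one placeholder per keyword. Table and column names are assumed
    // to come from the developer, never from user input.
    static String buildOrQuery(String table, String column, int keywordCount) {
        if (keywordCount < 1) {
            throw new IllegalArgumentException("at least one keyword is required");
        }
        StringBuilder sql = new StringBuilder();
        sql.append("SELECT * FROM ").append(table)
           .append(" WHERE ").append(column).append(" = ?");
        for (int i = 1; i < keywordCount; i++) {
            sql.append(" OR ").append(column).append(" = ?");
        }
        return sql.toString();
    }
}
```

For the input “computer mouse”, keywords returns the two keywords and buildOrQuery("products", "name", 2) yields a query with two placeholders, one per keyword. Such a query matches rows for either keyword, which illustrates why the results can be irrelevant to the phrase as a whole.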
1.1 Aim and purpose
In this thesis, the focus is on enabling search engines in applications that handle user
input to work more efficiently in terms of relevance, performance, scalability and output
when retrieving information. The focus lies specifically on RDBMS and NoSQL databases,
as they are the most frequently used in applications both on the web and on local servers.
The thesis aims to provide clarity for future developers on when and where to implement
different search engine solutions based on the application's search requirements. It also
aims to narrow the gaps between web search engines, NoSQL and RDBMS (see Figure 1),
while providing algorithms and solutions to existing flaws in frequently used search
engines.

Figure 1 – Overview of thesis focus.
1.2 Thesis questions
In this thesis, the following questions have to be taken into consideration:
Thesis question 1: Which search engine suits which type of application based on the
requirements of the application?
Thesis question 2: How do the algorithms in the search engines differ from each other?
Thesis question 3: How can current application search engines be improved?
1.3 Methodology
To answer the thesis questions, a systematic literature review is conducted along with
experiments.
Thesis question 1 is answered using the systematic literature review and the
experiments.
Thesis question 2 is answered using the systematic literature review.
Thesis question 3 is answered using both the systematic literature review and the
experiment.
Several sub questions are constructed in the literature review to enable more detailed
research into the general thesis questions. These are presented in detail after the
research has been conducted.
1.4 Report outline
The report is organized as follows: Section 2 provides details on the systematic
literature review, work related to the topic and a literature summary. Section 3 presents
how the experiment was conducted and how the RDBMS and NoSQL solutions were
implemented. Section 4 provides details about the implementation of the hybrid solution
and the algorithms used. Section 5 presents the results of the experiments and an
evaluation of each engine and its strengths. Section 6 is a discussion about the
outcomes from the theoretical studies and the experiment. Section 7 provides a
conclusion of the whole thesis with focus on the results gathered, and presents further
work.

2 Literature study
The literature review contains outlined steps followed in the process of collecting and
evaluating information. The section also contains related work to the topic.
2.1 Systematic literature review
The literature review was conducted by following appropriate steps to ensure the quality
and relevance of the information acquired.
2.1.1 Review protocol
The first step consisted of outlining a protocol for the systematic literature review. To
maintain authenticity and keep the research unbiased, the literature review had to be
conducted in such a manner that articles were not included or excluded based on results
not being favorable. The literature review was conducted to achieve transparency and
replicability to enable further work based on our research.
2.1.2 Literature research sub questions
Below is a list of questions to be answered during the literature review. The
questions are based on the thesis questions; the related thesis question is listed in
parentheses after each sub question. The questions are answered in
detail in the last part of the section.
Sub question 1 (Thesis question 1): What existing research has been done to improve
search engines based on RDBMS?
Sub question 2 (Thesis question 1): Have comparative tests between different engines
been conducted, if so what selection criteria were used?
Sub question 3 (Thesis question 2, 3): Which features are missing in NoSQL and
RDBMS?
Sub question 4 (Thesis question 2, 3): Can RDBMS be influenced by NoSQL and vice
versa?
Sub question 5 (Thesis question 2, 3): What techniques do web search engines use?
Can they be incorporated in a smaller engine?
2.1.3 Literature search
Keywords and operators used in the literature search are presented in Table 1. As the
table shows, search terms were constructed to start with broad results and were then
gradually narrowed to get more detailed and accurate results. The databases used in
searching for literature were ACM, HKR Summon, Google Scholar, Science Direct and
International Journal of Computer Applications. To arrive at the number of articles given
in Table 1, only peer-reviewed articles available in full text were included. Each row in
the table has four columns: the first shows the keywords used in the search, the second
the field(s) the search relates to, the third the search operator(s) used to combine the
keywords and the last the number of articles returned. The search engines used present
search results ordered by relevance, and thus it was decided to only include articles from
the first two pages of results, as they are the most relevant to the searches.

The first filtering step was to briefly scan the title and related keywords; if they were
related to the initial search, the abstract was read. If the article passed the abstract
reading, it was saved in the log. The articles remaining after the first filtering are
displayed in Table 2.
Table 1 – Keywords, search operators and articles found
Keywords used | Field | Search operator(s) | Articles found
Keyword search, relational databases | RDBMS | AND | 4,006
Keyword search, relational databases, Optimization | RDBMS | AND | 1,544
Relational databases, problems, known problems | RDBMS | AND | 1,364
Relational databases, Fuzzy query implementation | RDBMS | AND | 576
Improving, search, ranking, relational databases | RDBMS | AND | 397
Ranking relevance, indexing, search engines | Web engines and search algorithms | AND | 816
Google, error correction | Web engines and search algorithms | AND | 560
Levenshtein, distance, search engines | Search algorithms | AND | 156
Lucene | NoSQL databases and Lucene | * | 310
Lucene, implementation | NoSQL databases and Lucene | AND | 216
Search algorithms, search engines, word relevance, Lucene | NoSQL databases and Lucene | AND | 92
NoSQL, SQL | RDBMS vs. NOSQL | AND | 106
Relational databases, Comparisons, NoSQL | RDBMS vs. NOSQL | AND | 77
NoSQL, SQL, Comparisons | RDBMS vs. NOSQL | AND | 56

Table 2 – Number of articles saved after the first selection in each field
Field | Number of articles
Relational databases | 68
NoSQL databases and Lucene | 29
RDBMS vs. NOSQL | 11
Search algorithms | 23
Web engines | 7
2.1.4 Selection of literature
Certain evaluation criteria had to be fulfilled for an article to be included in the thesis.
Articles have to:
-be unbiased.
-present results not based on specific software or hardware.
-include methods that are replicable.
-be in the field of computer science.
A second filtering based on the above mentioned criteria had to be conducted. In this
step articles were reviewed more thoroughly. If the above mentioned criteria were
fulfilled, the articles stayed in the log. Articles not achieving all criteria were discarded.
The result from this filtering is displayed in Table 3.
Table 3 – Number of articles saved after the second selection in each field
RDBMS vs. NOSQL | 5
Search algorithms | 7
Web engines | 5
2.1.5 Assessment of literature
After the selection process the literature was assessed for credibility. Peer-reviewed
articles from journals with high credibility were accepted. Articles from conferences
were included if they were published as peer-reviewed full papers. Background checks on
authors were also conducted to find out whether they were well reputed in the field of
computer science. Table 4 shows the final number of articles that passed all steps and
are used in the thesis.

Table 4 – Number of articles saved after the final selection in each field
RDBMS vs. NOSQL | 3
Search algorithms | 5
Web engines | 2
2.2 Related work
This section presents the related work. It is divided into sub topics to give the
material a clearer structure.
2.2.1 Keyword search algorithms
Many search algorithms applied in search engines have been tested and compared.
Keyword search algorithms aim to improve the discovery of data by handling the
keywords provided by the user more efficiently. Keyword searching is a commonly used
search method, although without set standards of implementation. It gives a user without
knowledge of the underlying data structure or query language the ability to search using
keywords [1, 4, 5, 6, 7]. These studies have concluded that search efficiency can be
greatly improved when targeting databases using keyword search algorithms. In most cases
the focus lies on keyword search algorithms such as Top-k algorithms and vector space
model algorithms. Bruno et al. [4] illustrate Top-k with a real-estate database that
maintains data such as the price and number of bedrooms of each house that is for sale.
If a user, for example, searches for a house with 4 bedrooms and a price tag of around
$300,000, the database system should take the user preferences into consideration and
return the results ranked by relevance: houses that have 4 bedrooms, or close to it, and
a price tag around $300,000 should be the top results, as they are closest to the
keywords. Luo et al. [6] performed Top-k keyword queries on large-scale databases with
over ten million tuples and found that their ranking patterns greatly increased both
performance and result relevance.
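The real-estate example can be made concrete with a small Java sketch. The scoring function below (bedroom difference plus relative price difference) is an assumption chosen for illustration and is not taken from Bruno et al. or Luo et al.; the names TopKDemo, House, score and topK are likewise hypothetical. Real Top-k algorithms avoid scoring and sorting every tuple, which this sketch does for the sake of clarity.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative Top-k ranking over a small in-memory list of houses.
class TopKDemo {

    static final class House {
        final String id;
        final int bedrooms;
        final double price;

        House(String id, int bedrooms, double price) {
            this.id = id;
            this.bedrooms = bedrooms;
            this.price = price;
        }
    }

    // Assumed scoring function: lower is better. The bedroom gap counts in
    // whole units, the price gap as a fraction of the wanted price.
    static double score(House h, int wantedBedrooms, double wantedPrice) {
        double bedroomGap = Math.abs(h.bedrooms - wantedBedrooms);
        double priceGap = Math.abs(h.price - wantedPrice) / wantedPrice;
        return bedroomGap + priceGap;
    }

    // Returns the k houses closest to the user's preferences, best first.
    static List<House> topK(List<House> houses, int wantedBedrooms,
                            double wantedPrice, int k) {
        List<House> sorted = new ArrayList<>(houses);
        sorted.sort(Comparator.comparingDouble(
                (House h) -> score(h, wantedBedrooms, wantedPrice)));
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }
}
```

With a preference of 4 bedrooms and $300,000, a house that matches both values exactly scores 0 and is ranked first, while houses that deviate in either attribute are pushed further down the list.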
2.2.2 Ranking algorithms
Ranking systems have also been a large focus point. Ranking algorithms focus on
presenting the most relevant data to the user based on the user's previous activity.
Their main function is to use collected activity data, including time spent on websites
and search history, to propose related searches and display advertisements related to
the user's patterns [1, 2, 8, 9]. These systems are mainly used in web search engines
(e.g. the Google search engine). Research is lacking on incorporating relevance and
ranking into search engines for smaller applications.
2.2.3 NoSQL and RDBMS
Research shows that large, public and content-centered applications are often best suited
to a NoSQL database, while applications that support business operations are often best
suited to an RDBMS. There is however a large grey area containing many applications that
could choose either of the technologies. For optimization purposes, picking the most
suitable solution is essential.
The advantages of NoSQL databases are that they are highly scalable, reliable and
use a plain query language. One of the biggest advantages that NoSQL provides
compared to RDBMS is that it can handle unstructured data. Unstructured data is
everything from video files to social media network data. However, NoSQL databases
are not without limitations. Some common issues with NoSQL are the lack of
encryption support for data files, weak authentication between the client and the servers
and lack of client communication encryption [10, 11]. As Bhuvan et al. conclude [10],
there is no standard practice for choosing the most suitable database for an application.
It depends on the application's requirements, and if the requirements are not as simple as
just picking between structured or unstructured data, the choice can be troublesome.
This is where pure performance data is needed.
2.2.4 RDBMS drawbacks
There have been numerous studies of performance issues and optimization
techniques for RDBMS. If used correctly, RDBMS is a reliable technique, but
there are many things to keep in mind, e.g. using optimized queries, correct indexing
and sound schema design [12]. Corlatan et al. [13] highlighted common mistakes such as
missing indexes. Missing indexes are, according to Corlatan et al., the factor that
affects the performance of SQL databases the most, both in time and in CPU and memory
usage. If a table is missing an index, the search engine has to go row by row through
the table until it finds the desired row. Even if the database has some indexing
structure, index efficiency is very dependent on how the queries are written. Corlatan et
al. conclude that performance optimization is an ongoing process and a complex subject
with a wide range of fields that need to be taken into consideration. Much of this
optimization does not target RDBMS as a search engine, but rather helps developers
build better databases and construct more efficient queries. Chandra et al. [14] state
that students and developers with little experience in most cases write queries that are
prone to errors. Mistakes caused by lack of knowledge can cause issues in the form of
data loss and incorrect results.
On the topic of misspellings and other user input errors, Patman et al. [15] provide an
evaluation of the hidden risks of Soundex searching. Soundex is a feature in many of
the biggest RDBMS that uses the way a string sounds when pronounced to find the
correct entry in a database. Patman et al. highlighted what Soundex is unable
to do and its drawbacks. Below is a summary of their work.
• Dependence on initial letter
Soundex uses the first letter as a key component in the code it generates to represent
the string. This means that Soundex is unable to find a match if the first letter does not
match, e.g. “vomputer” will not match “computer”.
• Silent consonants
Soundex is unable to capture silent consonants in strings that sound the same but
are spelled differently. An example of this is words such as “Phil” and
“fill”.
• Unranked, unordered returns
Results from a Soundex query are not ordered by relevance. The results are ordered by
their position in the database table instead of by how relevant they are
to the query.

Patman et al. conclude that Soundex is easy to understand and simple to implement.
However, application developers should be careful when considering Soundex, because of
the risk of bad search results and excessive matches.
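The dependence on the initial letter can be reproduced with a short Java sketch of the classic four-character American Soundex code. This is an illustration under stated assumptions, not the implementation used by any particular RDBMS; MySQL's SOUNDEX, for instance, may differ in details such as code length. The class name SoundexDemo is hypothetical.

```java
// Minimal sketch of classic American Soundex (first letter + three digits).
class SoundexDemo {

    // Digit for each letter A..Z; '0' marks letters that are not coded
    // (vowels, H, W and Y).
    private static final String CODES = "01230120022455012623010202";

    static String soundex(String word) {
        String s = word.toUpperCase();
        StringBuilder out = new StringBuilder();
        char prev = '0'; // code of the previously coded letter
        for (int i = 0; i < s.length() && out.length() < 4; i++) {
            char c = s.charAt(i);
            if (c < 'A' || c > 'Z') {
                continue; // skip anything that is not a basic Latin letter
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c);      // the first letter is kept verbatim
                prev = code;
            } else if (c == 'H' || c == 'W') {
                // H and W are skipped and do not separate equal codes
            } else if (code == '0') {
                prev = '0';         // vowels separate equal codes
            } else {
                if (code != prev) {
                    out.append(code);
                }
                prev = code;
            }
        }
        while (out.length() < 4) {
            out.append('0');        // pad short codes with zeros
        }
        return out.toString();
    }
}
```

Because the first letter is copied into the code unchanged, “vomputer” and “computer” receive different codes even though the rest of the code is identical, whereas a transposition later in the word, as in “keybroad” and “keyboard”, still produces the same code.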
2.2.5 Levenshtein distance algorithm
The Levenshtein distance algorithm is a simple yet powerful string comparison
algorithm which measures the distance between two strings. The Lucene search engine
uses the Levenshtein distance algorithm for its fuzzy query function [16]. A fuzzy query
does what the name suggests: it tries to match strings that are close to, but not
identical with, the search term. Distance is measured in such a way that every
single-character edit (insertion, deletion or substitution) needed to transform string
one into string two increases the distance by one. The string with the smallest distance
to the measured string is the most relevant. For example, the two strings “flaw” and
“lawn” have a Levenshtein distance of two. First the character ‘f’ is deleted, which adds
one to the distance. The last step to make the strings identical is inserting ‘n’, which
also increases the distance by one, resulting in two identical strings with a Levenshtein
distance of two [17]. The Levenshtein distance algorithm is often used together with a
dictionary. Greenhill [18] studied the algorithm and concluded that it performs poorly,
mostly because it is naive and cannot distinguish relevance between strings at the same
distance, meaning that “cat”, “rat” and “sat” will return the same distance and will not
differ in relevance.
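The distance calculation described above can be sketched in Java using the standard dynamic-programming formulation with two rows. This is a generic textbook version given for illustration; it is not the optimized implementation inside Lucene, and the class name LevenshteinDemo is hypothetical.

```java
// Minimal sketch of the Levenshtein distance using two rows of the
// dynamic-programming table.
class LevenshteinDemo {

    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Transforming the empty prefix of a into b[0..j) takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1,        // insertion
                                 prev[j] + 1),           // deletion
                        prev[j - 1] + substitution);     // match or substitution
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

For “flaw” and “lawn” the method returns two, matching the example in the text. For a search term such as “bat”, the candidates “cat”, “rat” and “sat” all lie at distance one, which illustrates why the distance alone cannot rank them.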
2.2.6 Auto-complete
Since it emerged as Google-Suggest in 2004, Auto-complete has been used in search
engines in both large and smaller applications. In a paper by Ward et al. [19],
experiments with human participants using Google-Suggest were conducted to find out how
search suggestions affect users. Their findings show that suggestions help, in
particular with spelling. Speed and confidence also improved for the user. A
problem with Auto-complete on a smaller scale is that unless developers create their
own Auto-complete library, many of the search suggestions would return
empty results, as the suggestions are based on another service's search history. Bassil
et al. [20] use the Google-Suggest API to handle errors in OCR scanning. The
Google-Suggest API is capable of finding the relevance between words and sentences, and
it can also handle words and sentences that sound like the correct choice; for example,
“flaj mi home” will give the result “fly me home”. This powerful feature is made
possible by Google's huge amount of collected data in the form of misspellings from
their web search engine.
2.3 Literature result
This section provides a detailed assessment of the literature gathered in comparison to
the initial problem statement and answers to the research questions specified in 2.1.2.
2.3.1 Assessment of gathered information
After assessing and researching the topic, the problem statement became clearer.
The area of keyword searches is heavily documented, with plenty of alternatives and
different interpretations. A much less studied area is the handling of malformed user
input (e.g. grammatical errors and misspellings) and its effect on query-based search
engines. There is little to no information about incorporating web search engine
techniques into an application search engine aside from using Auto-complete.
2.3.2 Research question walkthrough
Sub question 1: What existing research has been done to improve search engines based
on RDBMS?
Optimizing keyword search algorithms has been heavily researched. Keyword
search algorithms such as Top-k keyword algorithms are intended to guide the
engine faster to the right index within the database. The keyword search algorithms do
not handle malformed user input and only have a positive effect as long as the
user provides queries that are constructed properly. Most research done on RDBMS
performance is best suited for static queries, even though RDBMS are still the
most frequently used database type.
Sub question 2: Have comparative tests between different engines been conducted, if so
what selection criteria were used?
There is a lot of information about both NoSQL performance and RDBMS performance.
However, there is still a gap when it comes to comparative tests performed in the
same environment. The only sources found that actually compared NoSQL and RDBMS
based their conclusions on the structure and components of the two, not on actual data
from performance tests in different environments.
Sub question 3: Which features are missing in NoSQL and RDBMS?
NoSQL has many features for retrieving data; as it is often required to handle
unstructured data, it relies on complex indexing to be efficient. RDBMS has less
sophisticated means of retrieving data and relies heavily on the developer's knowledge
of queries to acquire the desired information. The lack of support for fuzzy querying
is also a big drawback when using an RDBMS as a search engine. Features such as
security are not considered here, as they are not directly related to the topic.
Sub question 4: Can RDBMS be influenced by NoSQL and vice versa?
As NoSQL is built to handle large amounts of complex, unstructured data, its retrieval
methods are much better than the RDBMS methods of retrieval. Most NoSQL systems
come with built-in fuzzy query searching, something that RDBMS does not have proper
support for. The closest thing RDBMS has to a fuzzy query feature is the Soundex
function. Soundex is not reliable as a fuzzy query method, and therefore there is room
for a more developed fuzzy query method for RDBMS.
Sub question 5: What techniques do web search engines use? Can they be incorporated
in a smaller engine?
The web search engine features focused upon in this study are ranking and
Auto-complete. Saving information about patterns and history can be utilized in the
same way in a smaller engine as in a large web search engine, although with different
capabilities. Just as a large web search engine uses search history to guide the user
towards the information it assumes the user wants, a smaller engine could use the
history to guide itself towards the right information. By using search history, the
search engine can rank entries in a database based on popularity and also use it to
create more relevant search results for the user. The Auto-complete design can be
incorporated in a smaller-scale engine with its own suggestions library. It can be used
for more than presenting the user with a relevant word: in combination with other
algorithms, it can be used to build spellchecks using a large pool of previous searches
such as the one provided by Google-Suggest.
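A suggestions library built from an application's own search history could look roughly like the following Java sketch. It is a hypothetical illustration of the idea in this section, not the hybrid solution described later in the thesis; the class name SearchHistorySuggester and its methods are assumptions. Suggestions are ranked by how often a query has been searched, with alphabetical order as a tie-breaker.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory suggestions library based on search history.
class SearchHistorySuggester {

    // Number of times each (lower-cased) query has been searched.
    private final Map<String, Integer> counts = new HashMap<>();

    // Records one search for the given query.
    void record(String query) {
        counts.merge(query.toLowerCase(), 1, Integer::sum);
    }

    // Returns up to 'limit' previously searched queries that start with the
    // prefix, most popular first.
    List<String> suggest(String prefix, int limit) {
        String p = prefix.toLowerCase();
        List<String> matches = new ArrayList<>();
        for (String query : counts.keySet()) {
            if (query.startsWith(p)) {
                matches.add(query);
            }
        }
        matches.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(matches.subList(0, Math.min(limit, matches.size())));
    }
}
```

Because every suggestion comes from a query that users of the same application have already issued, the library avoids suggesting terms that only exist in another service's search history.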

3 Experiment
In an effort to answer the questions that emerged during the work, several tests were
performed. These experiments were conducted according to the process model found in
[21]. Höst et al. outline important features required to conduct experiments, and these
guidelines have been followed thoroughly. All experiments were held in controlled
environments on the exact same computer with the same preconditions.
3.1 Implementations
Implementation of the existing search engines has followed set standards from the
providers of the respective engines. To keep all tests as neutral and unbiased as
possible, code was reviewed and tested to get the best possible performance. All
implementations were done in NetBeans and all program code was written in Java.
3.1.1 MySQL
The RDBMS used was MySQL, which is an RDBMS platform used frequently around
the world and is provided by Oracle. Implementation techniques and technologies
provided by the provider were used [22], such as:
• JDBC driver
A JDBC driver is a software component which enables software written in Java to
communicate with databases.
• Simple queries
Select queries are the main queries for retrieving information in an RDBMS. SQL
Soundex is used, which provides a rough form of misspelling correction by comparing how
the input string sounds with how the entries in the database sound. The Soundex query
returns results that sound similar to the original input. Soundex is not standard in
every RDBMS; however, most big suppliers (e.g. Microsoft SQL Server, MySQL) do have the
function in their systems.
• Database structure
The database used is a sample database provided by Oracle called World.
Figure 2 shows the structure in the form of an EER diagram. The database contains three
tables. The tables hold different amounts of data for testing purposes.

Figure 2 – Database structure in form of EER-Diagram [22]
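As a rough sketch of how the JDBC driver and a Soundex query from this section could be combined, the Java code below runs a MySQL SOUNDS LIKE query against the country table of the World sample database. It is an illustration under stated assumptions rather than the code used in the experiment: the class name SoundexQueryDemo and the method findCountries are hypothetical, and the caller is assumed to supply an open connection.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical example of a Soundex lookup through JDBC.
class SoundexQueryDemo {

    // MySQL-specific syntax: "a SOUNDS LIKE b" is shorthand for
    // SOUNDEX(a) = SOUNDEX(b). The table and column come from the World
    // sample database.
    static final String SOUNDS_LIKE_SQL =
            "SELECT Name FROM country WHERE Name SOUNDS LIKE ?";

    // Returns the names of all countries whose Soundex code matches the
    // Soundex code of the user input. The connection is supplied by the caller.
    static List<String> findCountries(Connection conn, String userInput)
            throws SQLException {
        List<String> names = new ArrayList<>();
        try (PreparedStatement stmt = conn.prepareStatement(SOUNDS_LIKE_SQL)) {
            stmt.setString(1, userInput); // bound as a parameter, not concatenated
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    names.add(rs.getString("Name"));
                }
            }
        }
        return names;
    }
}
```

A connection could be obtained with DriverManager.getConnection using a URL such as jdbc:mysql://localhost:3306/world together with a user name and password; the host, port and credentials here are placeholders. As noted in the related work, the rows come back in table order rather than ranked by relevance.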
3.1.2 Lucene
Apache Lucene is a document-based full-text search engine library, used in this thesis
as the NoSQL solution. When implementing the Apache Lucene engine, guides provided by
the creators were followed [16]. Only the bare minimum of code was written, to optimize
the performance of the engine. As Lucene is document based, the same data used to
populate the RDBMS is also stored in the Lucene documents. Lucene is developed to be
suitable for almost any application that requires full-text search and highly flexible
data structures and queries. The engine comes with powerful retrieval and management
tools such as:
• Ranked search – returns the most relevant result first
• Multiple query types – e.g. fuzzy queries, range queries and phrase queries
• Sorting algorithms
• Fast and memory-efficient suggestion algorithms
• Fielded search – e.g. title, ID, contents
One of Lucene's biggest features is its easy indexing. Lucene has built-in indexing
that creates indexes of all information stored without any need for developers to do it
themselves unless they have special indexing needs. Figure 3 shows the
indexing structure in Lucene.

Figure 3 – Lucene indexing architecture [16]
 Document
The document contains all the fields that are inserted for that specific index. The
documents are stored in the index. Below shows how to add a document with two
fields.
Document doc = new Document();
doc.add(new TextField(“Country”, country, Field.Store.YES);
doc.add(new TextField(“Language”, language, Field.Store.YES);
indexWriter.addDocument(doc);
 Analyzer
Lucene provides analyzers – which is a sophisticated feature for analyzing the inputs.
The analyzers are processing pipelines that breaks up the text into indexed tokens,
filtering out unwanted tokens. There are different analyzer packages for supporting
multiple languages and data sources, e.g. Russian, Arabic and Wikipedia syntax
analyzer.
 Index writer
The index writer is used to create an index and to add new index entries. The index
writer is initiated with knowledge about which analyzer is used, which directory it is
inserting to and configurations such as Lucene-version. Below shows an initialization of
index writer.
// Open (or create) the on-disk index directory named "Countries"
FSDirectory directory = FSDirectory.open(new File("Countries").toPath());
StandardAnalyzer analyzer = new StandardAnalyzer();
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter indexWriter = new IndexWriter(directory, config);
• Directory
The directory is where the indexed documents are finally stored. The Directory object
must be initialized with the path to where the directory is stored (see the index writer
initialization above).

3.1.3 Hybrid solution
The hybrid solution is implemented in Java. It targets the same data
source as the RDBMS. Details about the implementation and the algorithms used are
presented in chapter 4.
3.2 Evaluation Criteria
Certain criteria had to be outlined for the testing of the different engines. The criteria
were created to get a broad evaluation of the different engines in as many important
aspects as possible.
3.2.1 Performance
When evaluating performance, time was the primary measurement. The tool used to
measure time was the built-in NetBeans monitor. The other performance-related factor was
the utilization of computer resources, specifically memory usage, which was also
monitored using the NetBeans monitor.
3.2.2 Search relevance
To keep relevance as data-centered as possible, with little personal judgment,
the primary way of measuring relevance was the number of results relatable to
the keyword(s) processed.
3.2.3 Scalability
Scalability was a large factor as we intended to provide data relevant to many different
application sizes. Scalability was measured by scaling up the database size significantly.
As Lucene is document based it was important that the same information stored in the
RDBMS was also stored in the Lucene documents.

4 Hybrid algorithm implementation
In this chapter the implementation of the hybrid solution is presented in detail. The
implemented algorithms are carefully explained.
4.1 Using distance to build queries
When a misspelled or otherwise malformed user search has to be interpreted as a
query without requiring the user to correct the mistake, the Levenshtein distance
algorithm is well suited (see 2.2.5). The algorithm is applied to a search engine by
letting the user input be string one and each word of a dictionary of a relevant
language be string two, which makes it easy to rank words close to the original
input. This method of finding correlations to assist users is nothing new. By default
the algorithm cycles through all words in the dictionary and calculates the distance
between the static string one and the changing string two taken from the dictionary.
For performance optimization purposes a threshold is introduced. This threshold can
be seen as a maximum distance: the algorithm can skip a string as soon as its
distance is above the maximum, and every such string can be regarded as irrelevant.
To further reduce the operation time, the maximum distance is continually lowered to
the lowest distance found so far. In the average case this reduces the operation time
significantly. It does not improve the worst case, in which the dictionary words are
ordered by decreasing distance and the threshold is therefore lowered as slowly as
possible. The effectiveness of the algorithm is also improved by creating the
dictionary from the entries in the respective database. This removes redundant words
that would yield no result.
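The threshold idea can be illustrated with a minimal Java sketch. It is not the code used in the experiments; the class name ThresholdLevenshtein, the method names and the convention of returning -1 for a skipped word are assumptions made for this illustration. The sketch abandons a word as soon as a whole row of the distance table exceeds the current maximum distance and lowers the maximum whenever a closer word is found.

```java
import java.util.List;

class ThresholdLevenshtein {

    /**
     * Returns the Levenshtein distance between a and b, or -1 as soon as it is
     * certain that the distance is above maxDistance (hypothetical convention).
     */
    static int boundedDistance(String a, String b, int maxDistance) {
        // The difference in length is a lower bound on the distance.
        if (Math.abs(a.length() - b.length()) > maxDistance) {
            return -1;
        }
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i;
            int rowMinimum = current[0];
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(previous[j - 1] + cost,
                        Math.min(previous[j], current[j - 1]) + 1);
                rowMinimum = Math.min(rowMinimum, current[j]);
            }
            // Row minima never decrease, so the final distance cannot be lower.
            if (rowMinimum > maxDistance) {
                return -1;
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        int distance = previous[b.length()];
        return distance <= maxDistance ? distance : -1;
    }

    /** Returns the dictionary word closest to the input, or null if none is within the threshold. */
    static String closestWord(String input, List<String> dictionary, int initialMaxDistance) {
        String best = null;
        int maxDistance = initialMaxDistance;
        for (String word : dictionary) {
            int distance = boundedDistance(input, word, maxDistance);
            if (distance < 0) {
                continue; // skipped: above the current maximum distance
            }
            if (best == null || distance < maxDistance) {
                best = word;
                maxDistance = distance; // lower the threshold to the best distance so far
            }
        }
        return best;
    }
}
```

With a dictionary built from the country names in the database, a call such as closestWord("Swedn", dictionary, 3) would be expected to return the stored spelling of Sweden, and every later word is then compared against a maximum distance of one.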
4.2 Finding relation between keywords
The Levenshtein distance algorithm is effective at finding the most relevant
matching word. It has, however, problems seeing the correlation between two or
more connected words, as in a sentence, because it handles a sentence as multiple
substrings and finds the best match for each substring individually. The results of
this operation can be confusing, and sentences can be transformed into unreadable
states. The solution to this problem was to implement an algorithm that uses Google
Suggest on strings that contain more than one word. To utilize this feature, Google
provides an API that takes a string as a parameter and returns an XML document of
suggestions for the string submitted. In the Google search engine, the suggestions are
dynamically added as the user inputs characters. However, merely finding a suitable
sentence or collection of words relating to an input is not a necessary function for
this engine. Presenting suggestions dynamically built on Google Suggest would confuse
the user, as they would be based on Google and not on the individual application,
resulting in suggested search terms unrelated to the application's database. The
usefulness lies instead in the ability to connect separate words and form real
sentences based on popularity in massive amounts of collected data. This allows the
engine to form search queries containing sentences that are spelled correctly and
grammatically checked, for more accurate full-text searches inside the database.
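How such an XML answer could be turned into a list of candidate phrases is sketched below in Java. The sketch makes no network request. The element and attribute names (a suggestion element carrying a data attribute) are an assumed layout chosen for illustration and should be checked against an actual response. The class name SuggestionXmlParser is likewise hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class SuggestionXmlParser {

    /**
     * Extracts the "data" attribute of every suggestion element, in document order.
     * The element and attribute names are assumptions for this sketch.
     */
    static List<String> parseSuggestions(String xml) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = document.getElementsByTagName("suggestion");
            List<String> suggestions = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                suggestions.add(((Element) nodes.item(i)).getAttribute("data"));
            }
            return suggestions;
        } catch (Exception e) {
            // Malformed or unexpected XML is reported as an invalid argument.
            throw new IllegalArgumentException("Could not parse suggestion XML", e);
        }
    }
}
```

The returned phrases would then be ranked against the user input in the same way as the dictionary words, so that only phrases close to the original input are offered.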

4.3 Using search history to increase performance
Any engine used frequently will build up a large history of searches. Inspired by the
big data wave and the data mining of large web search engines, collecting and using
search history data to increase performance is in line with modern-day search
engines. Search history can be used not only to display the most relevant data to the
user, but also to guide the search engine in the right direction when searching for the
information queried by the user. To complement the previously mentioned solutions
while making use of all search history, a document of previously malformed searches
was created. The document is constructed so that it saves every misspelling and links
it to the correct search. This document is accessed first when handling a search from a
user; if no result is found, the other methods of retrieving the correct search are
used. When the misspelling is already defined in the document, the engine can process
the search much faster. As the number of searches and errors in the user input
increases, the document grows and becomes more useful. As shown in Figure 4, the
request processing time decreases significantly when the desired search term can be
found in the document without having to use the remaining methods. To further
increase the use of search history, the entries in the document are ranked by how
often the individual entries are used. An entry that is often misspelled and
frequently visited by the search engine is ranked higher than other entries. This
allows the search engine to access entries in the document in order of how often they
are accessed rather than in an arbitrary order. This method is applicable both for
ranking an individual user's search history and for general search history. For this
engine a general approach is applied, where rankings are based not on individual users
but on all users.
Figure 4 – Sample test for 200 iterations for some misspelled countries using search history to find the
correct spelling.
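A minimal Java sketch of such a ranked history document is given below. It keeps the entries in memory rather than in a file, and the class name SearchHistory and its methods are hypothetical names introduced for this illustration. Entries are scanned from the top, and an entry moves up whenever it has been used more often than the entry above it, so frequently used misspellings are found first.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class SearchHistory {

    /** One misspelling, the search it was corrected to and its usage count. */
    private static final class Entry {
        final String misspelling;
        final String correction;
        int hits;

        Entry(String misspelling, String correction) {
            this.misspelling = misspelling;
            this.correction = correction;
        }
    }

    // Kept ordered by descending usage count, so popular entries are scanned first.
    private final List<Entry> entries = new ArrayList<>();

    /** Stores a misspelling with its correction unless it is already known. */
    void remember(String misspelling, String correction) {
        if (find(misspelling) < 0) {
            entries.add(new Entry(misspelling.toLowerCase(Locale.ROOT), correction));
        }
    }

    /** Returns the stored correction or null; a hit raises the rank of the entry. */
    String lookup(String input) {
        int index = find(input);
        if (index < 0) {
            return null;
        }
        Entry entry = entries.get(index);
        entry.hits++;
        // Move the entry up while it has more hits than the entry above it.
        while (index > 0 && entries.get(index - 1).hits < entry.hits) {
            Collections.swap(entries, index - 1, index);
            index--;
        }
        return entry.correction;
    }

    /** Returns the stored misspellings in their current ranking order. */
    List<String> rankedMisspellings() {
        List<String> ranked = new ArrayList<>();
        for (Entry entry : entries) {
            ranked.add(entry.misspelling);
        }
        return ranked;
    }

    // Linear scan from the top of the ranking; returns -1 when the input is unknown.
    private int find(String input) {
        String key = input.toLowerCase(Locale.ROOT);
        for (int i = 0; i < entries.size(); i++) {
            if (entries.get(i).misspelling.equals(key)) {
                return i;
            }
        }
        return -1;
    }
}
```

Because the ranking here counts hits across all lookups, it corresponds to the general approach described above; a per-user ranking would need one such structure per user.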

4.4 Algorithm overview
The system works as described in the flowchart in Figure 5. The process starts with a
user input that is checked against the document of previously misspelled words and
their correct counterparts. If the word exists in the document, the request is sent to
the database with the user error corrected. The reason for not querying the database
first is to keep the number of database requests at one and to make sure that every
database request returns results. If the input does not exist in the document, the
algorithm processes the input by both sending a request to the Google Suggest API and
matching it against the dictionary. Both processes return a large number of results.
These results are sorted using the Levenshtein distance algorithm, resulting in a list
with the most relevant entries on top. The user is then presented with the list of
suggestions to choose from. After the user has chosen, the chosen word or phrase is
stored in the document along with the original input. Finally, a request is made to
the database using the chosen word or phrase.
Figure 5 – Flowchart of the algorithm working with user input.
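The flow in Figure 5 can be summarized in a short Java sketch. It is a simplified illustration under stated assumptions, not the implementation used in the experiments: the search history is represented by a plain map, the suggestion service by a function supplied by the caller, and the names HybridSearchFlow, candidates and recordChoice are hypothetical. The database request itself is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class HybridSearchFlow {

    /** Plain Levenshtein distance computed with two rows of the distance table. */
    static int distance(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(previous[j - 1] + cost,
                        Math.min(previous[j], current[j - 1]) + 1);
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    /**
     * Returns the terms to offer for an input. A known misspelling yields exactly
     * its stored correction; otherwise dictionary words and external suggestions
     * are merged and sorted with the closest candidate first.
     */
    static List<String> candidates(String input,
                                   Map<String, String> history,
                                   List<String> dictionary,
                                   Function<String, List<String>> suggestionService) {
        String known = history.get(input);
        if (known != null) {
            return Collections.singletonList(known);
        }
        List<String> results = new ArrayList<>(dictionary);
        results.addAll(suggestionService.apply(input));
        results.sort(Comparator.comparingInt((String candidate) -> distance(input, candidate)));
        return results;
    }

    /** Stores the user's choice so that the same input is resolved directly next time. */
    static void recordChoice(String input, String chosen, Map<String, String> history) {
        history.put(input, chosen);
    }
}
```

After recordChoice has been called for an input, the next call to candidates with the same input returns only the stored correction, which corresponds to the short path through the flowchart.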

5 Experiment results
This section describes the results from the conducted experiment. The criteria and
measurements are described in section 3.2.
5.1 Performance
Performance was measured as the time and memory used by the process of each search
engine. The test was conducted by issuing requests that iterated 500 times through the
World sample database with misspelled countries. The tests were conducted 10 times to
get an average result. Figure 6 shows the results for memory usage. The red area
represents the heap size and the blue area represents the actual memory usage of the
process. MySQL used the smallest heap and the least memory for the entire process,
averaging around 10 MB per iteration with a heap size just above 60 MB. Lucene and
the hybrid used significantly more heap and memory, with Lucene spiking up
towards a 150 MB heap size at the end.
Figure 6 – Used heap size for each engine iterating 500 times, measured using NetBeans profiling tool.
As for time, shown in Figures 7, 8 and 9, MySQL was the fastest with an average
operation time of 1.49 ms. Lucene was second with an average operation time of
5.6 ms and the hybrid was the slowest with an average time of 7.36 ms. The best case for all
engines is very similar with times around 20 ms. As the results are purely based on time
and memory usage, null results from requests without results are not taken into account.

Figure 7 – Time for each iteration over 500 misspelled countries for MySQL solution.
Figure 8 – Time for each iteration over 500 misspelled countries for Lucene solution.
Figure 9 – Time for each iteration over 500 misspelled countries for Hybrid solution.

5.2 Relevance
Relevance was measured in much the same way as performance. A program was
built that sent 500 requests to the World sample database. All requests were
constructed as misspelled names of countries existing in the database, using the same
misspellings as in the performance test. The misspellings ranged from one to five in
Levenshtein distance. The test was conducted ten times to get an average result. As
shown in Table 5, the hybrid implementation stood out in terms of matching words,
with an average of only about 15 words unmatched in 500 iterations. Lucene came second
with an average of 365 matches and MySQL was the worst with an average of 311.3.
Table 5 – Statistics for average retrieved results, based on match and relevance.
Search engine    Average matches for 500 iterations    Percentage
MySQL            311.3                                 62.26%
Lucene           365                                   73%
Hybrid           485.3                                 97.06%
5.3 Scalability
To measure scalability, the initial country table in the World sample database was
scaled up 50 times, from 900 entries to 45000 entries. To keep the results comparable,
the database contained the 900 countries, and the remaining 44100 entries were random
words not related to countries, simulating additional unique data. Aside from the
increased size of the database, the test was performed in exactly the same way, with
500 iterations with misspelled countries. Table 6 shows a significant increase in time
from the initial performance test for MySQL and the hybrid, which both had large
problems handling the increased data sample size. Lucene was significantly faster than
both the hybrid and MySQL, and was even faster than in the first test on the smaller
data sample.
Table 6 – Statistics for average time over 500 iterations on the up-scaled database.
Search engine    Average time over 500 iterations (ms)
MySQL            183.3
Lucene           4.956
Hybrid           168.94

24
5.4 Experiment evaluation
The results gathered in the previous sections are evaluated for each engine in the
following sections.
5.4.1 MySQL
The results from the experiments show that the RDBMS is the most efficient in both
time performance and memory usage when tested against the smaller database. However,
it had great difficulty finding the correct word in the relevance test, and when the
database was scaled up by a factor of 50 its time performance dropped significantly.
In sum, an RDBMS is a good choice for any system that has no direct interaction
between a user and the database layer, or where the input is purely static, because an
RDBMS loses much of its power when it has to work with malformed input. Structured
data and static queries can be highlighted as the keys for an RDBMS to be efficient as
a whole. Any application built with those requirements will have great use for an
RDBMS.
5.4.2 Lucene
The NoSQL implementation performed well in almost every test except the
relevance test, where it struggled to find the word related to the search. The most
impressive aspect of the Lucene engine was the way it handled the scalability test.
While time increased significantly for the other two engines, Lucene was even faster
when working with a 50 times bigger dataset. This is due to Lucene's indexing and
reflects one of the key strengths of NoSQL technology, where scalability has been a
primary focus. Any application working with large amounts of mainly unstructured data
will have great use for a NoSQL-based search engine. The integrated fuzzy query method
is also a great tool for any developer building an application with search functions.
5.4.3 Hybrid
The hybrid solution was not as efficient as the RDBMS or Lucene in terms of pure
time performance and memory usage. The hybrid's performance does, however, increase
significantly over time, because the search history document cannot be utilized
before it is filled with relevant information. The hybrid stood out in terms of
relevance and finding the correct word. In the relevance test, the hybrid was superior to
the two other engines and found the correct word almost every time (97.06%). Just like
the RDBMS, the hybrid struggled when the database was scaled up significantly. The
hybrid would suit any application that is, or is being, built on an RDBMS foundation, as an
extra layer that handles user input a lot better than the standard SQL functions
can, at the cost of memory usage and time consumed per operation.
5.5 Experiment summary
The results from the experiment are close to what the related work suggested. There are
great differences between the search engines. Every engine had at least one test where
it was the top-performing engine, so the choice comes down to outlining the
application's requirements. Table 7 presents a collection of requirements together
with the search engines suggested for each requirement. For developers with many
requirements for their application, combinations of the three engines can be
implemented.

Table 7 – Collection of requirements and their suggested implementation.
Application requirement                       Suggested implementation
Scalable database                             NoSQL
Static queries                                RDBMS
Queries built on client input                 Hybrid, NoSQL
Searches interacting with direct user input   Hybrid
Small amount of data                          RDBMS, Hybrid
Large amount of data                          NoSQL
Memory efficient database layer               RDBMS
Structured data                               RDBMS, Hybrid
Unstructured data                             NoSQL
Automatic indexing                            NoSQL

6 Discussion and ethical aspects
As the number of applications on the Internet and on networks around the world
increases, the need for better search engines inside the applications also increases. The
goal of the thesis was to provide a solution that bridges the gap between NoSQL, RDBMS
and web search engines. In terms of processing user input and providing the correct
result, the hybrid created from the findings in the literature study was far better than
the existing technologies. The hybrid incorporated web search engine features such as
search history and autocomplete to make use of previous searches and to create
relevance between words. The hybrid did not cover every gap found in the literature
study with regard to NoSQL and RDBMS. NoSQL still performed much better when the
database was scaled up, due to its automatic indexing functions. The results of the
performance experiment were thus much in line with the results of the literature
study, which suggested that NoSQL databases are far superior in systems containing
large amounts of data. The literature study also suggested that RDBMS are more
suitable for structured data, which is likewise in line with the results of the
experiments.
The ethical aspects of the thesis are unclear. One could argue that saving user data
could be unethical if the saved data were used for unethical purposes. In the hybrid,
however, the data is not saved per individual; it is saved in combination with all
other user data to increase efficiency. As no saved data is directly linked to an
individual user, no ethical principles have been breached. The hybrid could be
improved further to save individual data and use it to present individualized
searches. In that case, the ethical aspects of saving data would have to be considered
more carefully.

7 Conclusion and further work
In this thesis, we introduced algorithms influenced by web search engines, NoSQL
and RDBMS into an RDBMS environment. The thesis raised three important questions
that have been answered. The first question was: which search engine suits which type of
application, based on the requirements of the application? The results from both the
literature study and the experiment provide suggestions for which search engine a future
developer should choose. The results from the experiments are compiled into a table in
5.5, which highlights the strengths of each technology based on the experiment conducted.
The results presented in the table provide a solid foundation for future developers to
choose their search engine from. This is important to make sure that developers
select the right database based not only on preference but also on what they actually
need, what their application will be using it for and how the database will interact with
the end user. The second question was: how do the algorithms in the search engines
differ from each other? NoSQL and RDBMS differ a lot. NoSQL has complex automatic
indexing algorithms that enable it to handle large amounts of unstructured and
structured data and remain efficient when queried. The RDBMS requires good
indexing by the developer to remain efficient when scaling up. Without it, the RDBMS has
to search linearly through the entire table. NoSQL has support for a fuzzy querying
method that relies on the Levenshtein distance algorithm to find correlations between input
and entities in the database. The closest thing the RDBMS has to a fuzzy query method is
Soundex; as concluded, this is an unreliable fuzzy querying method and should be used
carefully. The final question was: how can current application search engines be
improved? In the tests, our hybrid solution was better at handling misspellings and
grammatical errors than the NoSQL and RDBMS engines. This shows that both NoSQL and
RDBMS have weaknesses in that area when used directly together with an
application as a search engine.
The hybrid presented by us had no positive effect on performance in terms of time and
memory usage compared to the standard RDBMS when querying a large
(45 000 rows) database table, so there is room for research in optimizing RDBMS
when handling large database tables. Further research can also be done on implementing
similar algorithms on a NoSQL engine, constructing a similar layer to act as the
interpreter for the NoSQL database.

8 References
[1] Liu, Fang et al. "Effective Keyword Search In Relational Databases". SIGMOD '06
Proceedings Of The 2006 ACM SIGMOD International Conference On Management Of
Data. New York, NY, USA: ACM, 2006. 563-574.
[2] Rawat, Rakesh, Richi Nayak, and Yuefeng Li. "Improving Web Database Search
Incorporating Users Query Information". WIMS '11 Proceedings Of The International
Conference On Web Intelligence, Mining And Semantics. New York, NY, USA: ACM,
2011.
[3] "RDBMS Dominate The Database Market, But Nosql Systems Are Catching Up".
Db-engines.com. N.p., 2016. Web. 23 Mar. 2016.
[4] Bruno N, Chaudhuri S, Gravano L. “Top-k selection queries over relational
databases: Mapping strategies and performance evaluation”. ACM Transactions on
Database Systems. 2002;27(2):153-187.
[5] Park, Jaehui, and Sang-goo Lee. "Keyword Search In Relational Databases".
Knowledge and Information Systems 26.2 (2010): 175-193. Web. 23 Mar. 2016.
[6] Luo, Yi et al. "Spark: Top-K Keyword Query In Relational Databases". SIGMOD
'07 Proceedings Of The 2007 ACM SIGMOD International Conference On
Management Of Data. New York, NY, USA: ACM, 2007. 115-126.
[7] Hristidis, Vagelis, and Yannis Papakonstantinou. "Discover: Keyword Search In
Relational Databases". VLDB '02 Proceedings Of The 28Th International Conference
On Very Large Data Bases. Hong Kong, China: VLDB Endowment, 2002. 670-681.
[8] Agichtein, Eugene, Eric Brill, and Susan Dumais. "Improving Web Search Ranking
By Incorporating User Behavior Information". SIGIR '06 Proceedings Of The 29Th
Annual International ACM SIGIR Conference On Research And Development In
Information Retrieval. New York, NY, USA: ACM, 2006. 19-26.
[9] Nath Singh, Jitendra, and Sanjay Kumar Dwivedi. "A Comparative Study On
Approaches Of Vector Space Model In Information Retrieval". IJCA Special Issue On
International Conference On Reliability, Infocom Technology And Optimization. IJCA
Journal, 2013. 36-40.
[10] TBhuvan N, Sudheep Elayidom M. “A Technical Insight on the New Generation
Databases: NoSQL”. International Journal of Computer Applications. 2015;121(7):24-26.
[11] Nance, Cory et al. "Nosql Vs RDBMS – Why There Is Room For Both". 16Th
Southern Association For Information Systems Conference. Savannah, Georgia, USA:
N.p., 2013.
[12] Mercioiu N, Vladucu V. “Improving SQL Server Performance”. Informatica
Economică. 2010;14(2):55-60.
[13] Corlatan C, Lazar M, Luca V, Petricica O. “Query Optimization Techniques in
Microsoft SQL Server”. Database systems journal. 2014;2(2069-3230):33-48
[14] Chandra B, Chawda B, Kar B, Reddy K, Shah S, Sudarshan S. “Data generation
for testing and grading SQL queries”. The VLDB Journal. 2015;24(6):731-755.

[15] Patman F, Shaefer L. “Is Soundex Good Enough for You? The Hidden Risks of
Soundex-Based Name Searching” [Internet]. http://www.ibm.com/. 2001 [cited 19 April
2016]. Available from: ftp://public.dhe.ibm.com/software/data/mdm/soundex.pdf
[16] Apache Lucene - [Internet]. Lucene.apache.org. 2016 [cited 23 March 2016].
Available from: https://lucene.apache.org/core/documentation.html
[17] Climent J, Hexsel R. “Iris recognition using Adaboost and Levenshtein distances”.
International Journal of Pattern Recognition and Artificial Intelligence.
2012;26(02):1266001.
[18] Greenhill S. “Levenshtein Distances Fail to Identify Language Relationships
Accurately”. Computational Linguistics. 2011;37(4):689-698.
[19] Ward D, Hahn J, Feist K. “Autocomplete as Research Tool: A Study on Providing
Search Suggestions”. ITAL. 2012;31(4).
[20] Bassil Y, Alwani M. “OCR Post-Processing Error Correction Algorithm Using
Google's Online Spelling Suggestion”. Journal of emerging trends in computing and
information sciences (Islamabad). 2012;3(2079-8407):90-99.
[21] Höst, Martin, Björn Regnell, and Per Runeson. ”Att Genomföra Examensarbete”.
Lund: Studentlitteratur, 2006.
[22] MySQL :: MySQL Documentation [Internet]. Dev.mysql.com. 2016 [cited 14 April
2016]. Available from: https://dev.mysql.com/doc/
