NL Interface for Database - EJSR 20(4)

European Journal of Scientific Research
ISSN 1450-216X Vol.20 No.4 (2008), pp.844-851
© EuroJournals Publishing, Inc. 2008
http://www.eurojournals.com/ejsr.htm

Database Interfacing using Natural Language Processing

Imran Sarwar Bajwa
Department of Computer Science and IT, The Islamia University of Bahawalpur
E-mail: imransbajwa@gmail.com

Shahzad Mumtaz
E-mail: shahzadz22@hotmail.com

M. Shahid Naweed
E-mail: shahid_naweed@hotmail.com

Abstract

Writing technically correct SQL queries is a complex task that demands skill, especially for a
novice user. The situation becomes harder when a low-skilled person has to use a database
management system for a specific business purpose: he or she has to write queries unaided and
perform various tasks, which requires expertise in understanding and writing accurate,
functional queries. The task of the novice user can be simplified by providing an easy interface
that is already familiar to that user. To resolve these issues, automated software is needed
that helps both users and software engineers. The user writes the requirements in a few
statements of simple English, and the designed system analyzes the given script. After composite
analysis and mining of the associated information, the system generates the intended SQL
queries, which can be run directly. This paper describes a system that creates SQL queries
automatically. The designed system provides a quick and reliable way to generate SQL queries,
saving time and budget for both the user and the system analyst.

Keywords: Information extraction, Automatic Query Generation, Knowledge Retrieval,
Natural language processing.

1.0. Introduction
Relational databases are the premier way of storing common data repositories. Once the data is
stored in a database, an interfacing mechanism is required to communicate with that repository.
The conventional way of communicating with a database is to first build a connection stream and
then add, delete or update the data contents through a standardized interfacing mechanism [1].
Simple command shells are typically used, and most database products ship with one. These
command shells are typically simple filters which help a user to log on to the database, execute
particular commands and receive output. They provide access to the database from the machine on
which the database is actually running [2]. After connecting to a particular database, a user or
a programmer requires an interface, and typically that interface is provided by a technical
language. These languages are called query languages; they consist of the database commands used
to put questions to a database and get the intended response. SQL [3] (Structured Query
Language) is the most popular query language and is the de facto language of databases today.
SQL is the standard tool of database querying. Different database management systems implement
this standardized language with minor alterations and adjustments. However, in spite of these
proprietary extensions by the vendors, the core of the language is the same in all environments.
From an application programmer's point of view, the major novelty of the relational database is
that one uses a declarative query language, SQL. Most computer languages are procedural: the
programmer tells the computer what to do, step by step, by specifying a procedure. Using the SQL
interface, the programmer states his requirements and questions, and the RDBMS query planner
figures out how to answer them [5]. There are two advantages of using a declarative language.
The first is that the queries no longer depend on the data representation; the RDBMS is free to
store data according to its own design requirements [6]. The second is improved software
reliability. For various web-based and stand-alone applications, generic SQL is used to keep
things simple and straightforward. Despite these advantages, SQL's technical and intricate
interface makes the language tedious and difficult to learn and use. It is quite hard to
remember the SQL commands and to use them accurately and precisely.
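The contrast between the procedural and declarative styles can be illustrated with a minimal sketch. It assumes Python's standard sqlite3 module and a hypothetical Students table with made-up rows; neither is part of the designed system.

```python
import sqlite3

# Hypothetical in-memory table and rows, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (RollNo INTEGER, Name TEXT, Age INTEGER)")
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?)",
    [(1, "Ali", 22), (2, "Sara", 27), (3, "Omar", 30)],
)

# Procedural style: the program spells out every step of the search.
procedural = []
for roll_no, name, age in conn.execute("SELECT RollNo, Name, Age FROM Students"):
    if age > 25:
        procedural.append(name)
procedural.sort()

# Declarative style: the query states what is wanted; the RDBMS decides how.
declarative = [
    row[0]
    for row in conn.execute("SELECT Name FROM Students WHERE Age > 25 ORDER BY Name")
]

print(procedural == declarative)
```

Both styles return the same names; only the declarative version leaves the search strategy to the query planner.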
To resolve these issues, automated software is needed that helps both users and software
engineers. As for the time this software takes to explore all of its facilities and services, it
should be well under a minute, which is quite useful for the users.

2.0. Problem Description
Modern software engineering requires quick, automated solutions that can create accurate and
precise SQL queries automatically. For complex queries, even an expert programmer benefits from
automatic query generation: he can adjust and alter the automatically generated queries with
less effort and in less time than the traditional approaches require.
The task of the novice user can be simplified by providing an easy interface that is already
familiar to that user. To resolve these issues, automated software is needed that helps both
users and software engineers. The user writes the requirements in a few statements of simple
English, and the designed system analyzes the given script. After composite analysis and mining
of the associated information, the designed system generates the intended SQL queries, which can
be run directly. The designed system creates this code automatically, without relying on an
external environment. It provides a quick and reliable way to generate SQL queries, saving time
and budget for both the user and the system analyst.

3.0. Used Methodology
The understanding and multi-aspect processing of natural languages, also termed "speech
languages", is one of the topics of greatest interest in the field of artificial intelligence
[8]. Natural languages are irregular and asymmetrical, and they are traditionally based on
informal grammars. Geographical, psychological and sociological factors influence the behaviour
of natural languages [12]. Their vocabularies are open-ended and vary from area to area and from
time to time. Because of these variations and inconsistencies, natural languages come in
different flavours; English, for example, has more than half a dozen well-known flavours around
the world [14]. These flavours differ in accent, vocabulary and phonology. Such discrepancies
and inconsistencies make natural languages more difficult to process than formal languages [13].


The English language statements are converted into SQL queries by using the newly designed
rule-based algorithm. SELECT is the most common query, used to choose a set of values from a
table [4]. An example college database has been used in this research. Student data is
retrieved, inserted and deleted by queries generated automatically from simple English text.

3.1. SELECT Query
The ‘SELECT’ query is processed first. A ‘SELECT’ query has four parts, as follows:

SELECT     *               FROM       Students
Keyword    Required Set    Keyword    Table Name

A ‘SELECT’ query can easily be generated from the provided input string, as it has only two
keywords, ‘SELECT’ and ‘FROM’. The other two required values are ‘Required Set’ and ‘Table
Name’. To process the speech language text and find ‘Required Set’ and ‘Table Name’, the
conventional norms and grammatical rules of the English language are used. The conventional
structure of a simple English sentence is the key to comprehending and analyzing natural
language text [13], as in the following example:
“I need names of all students.”
Following is the complete analysis of this simple sentence.

Table 01: Generating SELECT Query from text

Lexicons    Phase-I        Phase-II
I           Noun           ----------
need        Verb           ----------
names       Noun           Field Name
of          Preposition    ----------
all         Noun           *
students    Noun           Table Name

In this example the ‘Required Set’ field is filled by the ‘Field Name’ attribute and the ‘Table
Name’ field is filled by the ‘Table Name’ attribute, as follows:
SELECT * FROM Students
Here the table name is searched for in the array of all tables available in the database. From
these, the nearest table name is picked, which is ‘Students’ in this example.
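The analysis above can be sketched in a few lines of Python. This is an illustrative approximation, not the authors' implementation: the table list is a hypothetical schema, "nearest" is interpreted as closest string match using the standard difflib module, and the sketch always selects ‘*’, as in the worked example.

```python
import difflib

# Hypothetical schema; the paper only says that the table name is looked up
# in an array of the tables available in the database.
TABLES = ["Students", "Teachers", "Courses"]


def nearest_table(word, tables=TABLES):
    """Return the table whose name is closest to the given word, or None."""
    by_lower = {t.lower(): t for t in tables}
    match = difflib.get_close_matches(word.lower(), list(by_lower), n=1, cutoff=0.6)
    return by_lower[match[0]] if match else None


def generate_select(sentence, tables=TABLES):
    """Build a SELECT query from a simple English request."""
    words = sentence.lower().strip(" .?!").split()
    # The worked example maps 'all' to '*'; choosing individual
    # fields is not attempted in this sketch.
    required_set = "*"
    for word in words:
        table = nearest_table(word, tables)
        if table:
            return "SELECT {} FROM {}".format(required_set, table)
    raise ValueError("no table name recognised in: " + sentence)


print(generate_select("I need names of all students."))
```

The 0.6 cutoff is an arbitrary choice for this sketch; a stricter value reduces the chance of an ordinary word being mistaken for a table name.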

3.2. INSERT Query
After the ‘SELECT’ query, the ‘INSERT’ query is processed. An ‘INSERT’ query has five parts, as
follows:

INSERT     INTO       Students      VALUES     (5, 'Ali')
Keyword    Keyword    Table Name    Keyword    Record

An ‘INSERT’ query can also be produced from the given statement, as it has three keywords:
‘INSERT’, ‘INTO’ and ‘VALUES’ [6]. The other two required parameters are ‘Table Name’ and
‘Record’. These are extracted using the same rule-based algorithm, as in the following example:
“I want to insert a student whose Roll No. is 5 and Name is Ali.”

Table 02: Generating INSERT Query from text

Lexicons    Phase-I         Phase-II
I           Noun            ----------
want        Verb            ----------
to          Preposition     ----------
insert      Verb            Action
a           Article         ----------
student     Noun            Table Name
whose       Conjunction     ----------
Roll No     Noun            Attribute
is          Helping Verb    ----------
5           Noun            Value
and         Conjunction     ----------
Name        Noun            Attribute
is          Helping Verb    ----------
Ali         Noun            Value

In this example the ‘Record’ field is filled by the ‘Value’ entries that follow each
‘Attribute’, and the ‘Table Name’ field is filled by the ‘Table Name’ attribute. Here the table
name is searched for in the array of all tables available in the database. From these, the
nearest table name is picked, which is ‘Students’ in this example.
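A minimal sketch of this extraction is given below. It is an assumption-laden illustration rather than the designed algorithm: it expects the sentence pattern "... whose <Attribute> is <Value> and ...", uses a hypothetical table list, and assumes the values appear in the same order as the table's columns.

```python
import re

TABLES = ["Students", "Teachers", "Courses"]  # hypothetical schema


def find_table(words, tables=TABLES):
    """Match a word to a table name, ignoring case and a trailing plural 's'."""
    for word in words:
        for table in tables:
            if word.lower().rstrip("s") == table.lower().rstrip("s"):
                return table
    return None


def sql_literal(value):
    """Numbers stay bare; everything else becomes a quoted SQL string."""
    if re.fullmatch(r"-?\d+(\.\d+)?", value):
        return value
    return "'" + value.replace("'", "''") + "'"


def generate_insert(sentence, tables=TABLES):
    """Build an INSERT query from 'Attribute is Value' clauses."""
    text = sentence.strip().rstrip(".")
    head, _, tail = text.partition(" whose ")
    table = find_table(head.split(), tables)
    if table is None or not tail:
        raise ValueError("could not analyse: " + sentence)
    values = []
    for clause in tail.split(" and "):
        _attribute, _, value = clause.partition(" is ")
        if not value:
            raise ValueError("no value found in clause: " + clause)
        values.append(sql_literal(value.strip()))
    return "INSERT INTO {} VALUES ({})".format(table, ", ".join(values))


print(generate_insert("I want to insert a student whose Roll No. is 5 and Name is Ali."))
```

The attribute names are parsed but not emitted, because the query form shown in this section carries no column list.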

3.3. DELETE Query
Like the ‘SELECT’ and ‘INSERT’ queries, the ‘DELETE’ query can also be processed easily. A
‘DELETE’ query has five parts, as follows:

DELETE     FROM       Students      WHERE      Age > 25
Keyword    Keyword    Table Name    Keyword    Condition

The ‘DELETE’ query typically consists of three keywords: ‘DELETE’, ‘FROM’ and ‘WHERE’. The other
two required values are ‘Table Name’ and ‘Condition’. To find the ‘Table Name’ and ‘Condition’
parameters, the grammatical rules of the English language are used, as in the following
example:
“I want to delete the students more than 25 years age.”

Table 03: Generating DELETE Query from text

Lexicons    Phase-I        Phase-II
I           Noun           ----------
want        Verb           ----------
to          Preposition    ----------
delete      Verb           Action
the         Article        ----------
students    Noun           Table Name
more        Preposition    Condition
than        Noun           ----------
25          Noun           Value
years       Noun           ----------
age         Noun           Parameter

For the ‘DELETE’ query, the condition is defined first. In this example the Parameter (‘age’)
and the Value (‘25’) are combined with the Condition (‘more’, which corresponds to ‘>’) to give
‘Age > 25’. The table name is again retrieved from the array of all tables available in the
database.
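The combination of Condition, Value and Parameter can be sketched as follows. The table list and the phrase-to-operator map are hypothetical (the paper shows only "more than"), and the sketch assumes that the parameter is the last word of the sentence, as in the example.

```python
import re

TABLES = ["Students", "Teachers", "Courses"]         # hypothetical schema
COMPARATORS = {"more than": ">", "less than": "<"}   # hypothetical phrase map


def generate_delete(sentence, tables=TABLES):
    """Build a DELETE query from a request shaped like the paper's example."""
    text = sentence.lower().strip().rstrip(".")
    table = next(
        (t for t in tables if re.search(r"\b" + re.escape(t.lower()) + r"\b", text)),
        None,
    )
    if table is None:
        raise ValueError("no table name recognised in: " + sentence)
    for phrase, operator in COMPARATORS.items():
        # Condition phrase, then the Value, then the Parameter as the last word.
        match = re.search(re.escape(phrase) + r"\s+(\d+)\s+(?:\w+\s+)*?(\w+)$", text)
        if match:
            value, parameter = match.groups()
            return "DELETE FROM {} WHERE {} {} {}".format(
                table, parameter.capitalize(), operator, value
            )
    raise ValueError("no condition recognised in: " + sentence)


print(generate_delete("I want to delete the students more than 25 years age."))
```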


4.0. Work Flow of Designed System
The designed system, “Computational Linguistics based System for Automatic Database Query
Generation”, generates queries automatically. It performs its function in a multi-phase
procedure. There are four modules in total: text input acquisition, text comprehension,
information retrieval and, ultimately, generation of SQL queries. A brief description of each
phase follows.

4.1. Text input Acquisition
This module acquires the input text scenario. The user provides the business scenario in the
form of text strings. The module reads the input text character by character and generates words
by concatenating the input characters. It implements the lexical phase: lexicons and tokens are
generated here. Once the lexicons have been generated, further processing can be performed on
the input text.

Figure 01: Lexical analysis of input text string
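The character-by-character word building described in this module can be sketched as below. It is a minimal illustration that treats any non-alphanumeric character as a separator, which is an assumption; the paper does not state how punctuation is handled.

```python
def tokenize(text):
    """Build words by concatenating input characters one at a time."""
    tokens, current = [], ""
    for ch in text:
        if ch.isalnum():
            current += ch            # extend the word being built
        elif current:
            tokens.append(current)   # a separator closes the current word
            current = ""
    if current:
        tokens.append(current)       # flush the final word
    return tokens


print(tokenize("I need names of all students."))
```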

4.2. Text Comprehension
This module reads its input from the first module in the form of words or lexicons. These words
are categorized into classes such as verbs, helping verbs, nouns, pronouns, adjectives,
prepositions and conjunctions. These classes are then used to understand the various parts of
the given sentence.

Figure 02: Parts of speech tagging of input text
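One simple way to picture this categorization is a dictionary lookup. The mini-lexicon below is hypothetical and covers only the words of the SELECT example, with the tags copied from the Phase-I column of Table 01; the paper does not say how the designed system stores its word classes.

```python
# Hypothetical mini-lexicon reproducing the Phase-I tags of Table 01.
LEXICON = {
    "i": "Noun",
    "need": "Verb",
    "names": "Noun",
    "of": "Preposition",
    "all": "Noun",
    "students": "Noun",
}


def tag(tokens, lexicon=LEXICON):
    """Pair each token with its word class, or 'Unknown' if it is not listed."""
    return [(token, lexicon.get(token.lower(), "Unknown")) for token in tokens]


for token, word_class in tag(["I", "need", "names", "of", "all", "students"]):
    print(token, word_class)
```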

4.3. Information Retrieval
This module extracts the keywords of the SQL queries, such as SELECT, INSERT, DELETE, FROM, INTO
and WHERE. Keywords are found by matching the tokens against a given array of all possible
keywords. These keywords are then used to generate the respective queries. Information such as
the table name, field names, number of attributes and logical conditions is also extracted in
this phase.

Figure 03: Query information extraction
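The keyword matching can be sketched as a lookup from tokens to query keywords. Both tables below are assumptions made for illustration: the paper does not list its keyword array, and the mapping of "need" to SELECT is inferred from the example sentence in Section 3.1.

```python
# Hypothetical trigger words; only 'insert' and 'delete' appear literally
# in the paper's example sentences.
TRIGGERS = {"need": "SELECT", "select": "SELECT", "insert": "INSERT", "delete": "DELETE"}

# Keywords that accompany each action, as listed in Sections 3.1 to 3.3.
COMPANIONS = {
    "SELECT": ["FROM"],
    "INSERT": ["INTO", "VALUES"],
    "DELETE": ["FROM", "WHERE"],
}


def extract_keywords(tokens):
    """Return the SQL keywords for the first action word found, else []."""
    for token in tokens:
        action = TRIGGERS.get(token.lower())
        if action:
            return [action] + COMPANIONS[action]
    return []


print(extract_keywords(["I", "want", "to", "delete", "the", "students"]))
```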

4.4. SQL Queries generation
This module combines the keywords and the other required parameters for a particular query. The
SQL query is ultimately generated here according to the rules of the designed algorithm. Because
each type of query is given as a separate scenario, a separate function has been implemented for
each query type.

Figure 04: Generation of SQL Query
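The idea of one function per query type can be sketched as a small dispatch table. The function names and the dictionary keys (table, required_set, record, condition) are hypothetical; they stand for the values extracted in the previous phase.

```python
def build_select(info):
    return "SELECT {required_set} FROM {table}".format(**info)


def build_insert(info):
    return "INSERT INTO {table} VALUES ({record})".format(**info)


def build_delete(info):
    return "DELETE FROM {table} WHERE {condition}".format(**info)


# One builder per query type, selected by the action keyword.
BUILDERS = {"SELECT": build_select, "INSERT": build_insert, "DELETE": build_delete}


def generate_query(action, info):
    """Combine the keywords with the extracted parameters for one query type."""
    builder = BUILDERS.get(action.upper())
    if builder is None:
        raise ValueError("unsupported query type: " + action)
    return builder(info)


print(generate_query("delete", {"table": "Students", "condition": "Age > 25"}))
```

Supporting a further query type in this arrangement would mean adding one builder and one dictionary entry.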

5.0. Results and Analysis
After the query generating system was designed and coded, its accuracy and efficiency were
tested. For testing, both simple and complex queries were generated, and queries from each
category (SELECT, INSERT and DELETE) were checked.
15 sample queries were generated, and the results are shown in the following table.

Table 04: Accuracy ratio of various types of queries

Type      Simple    Complex    Accuracy
SELECT    14        13         90%
INSERT    13        11         80%
DELETE    14        12         87%
Total accuracy = 86%

A matrix representing the accuracy (%) of the query generation test for simple and complex
queries has been constructed. The overall accuracy for all types of queries is determined by
averaging the total accuracy of the three categories, which is 86% in this case.
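The figures in Table 04 can be reproduced with a short calculation. It assumes that each cell of the table counts correct queries out of 15, which the text implies but does not state outright.

```python
QUERIES_PER_CELL = 15  # assumption: 15 test queries per type and difficulty level

# Correct queries per type as (simple, complex), taken from Table 04.
results = {"SELECT": (14, 13), "INSERT": (13, 11), "DELETE": (14, 12)}


def accuracy(simple, complex_, per_cell=QUERIES_PER_CELL):
    """Percentage of correct queries over both difficulty levels."""
    return 100 * (simple + complex_) / (2 * per_cell)


per_type = {name: accuracy(*counts) for name, counts in results.items()}
overall = sum(per_type.values()) / len(per_type)

for name, value in per_type.items():
    print(name, round(value))
print("Overall", round(overall))
```

Under that assumption the per-type values round to 90, 80 and 87, and their average rounds to 86, matching the table.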

Figure 05: Graphical representation of the results

[Bar chart; vertical axis from 0 to 14; series: Simple, Complex; categories: SELECT, INSERT,
DELETE]

The graph above shows the accuracy of SELECT, INSERT and DELETE query generation for simple and
complex queries.

6.0. Conclusion
The designed system, “Computational Linguistics based System for Automatic Database Query
Generation”, helps both users and software engineers by generating SQL queries automatically.
The task of the novice user is simplified by providing an easy interface that is already
familiar to that user. The user writes the requirements in a few statements of simple English,
and the designed system analyzes the given script. After composite analysis and mining of the
associated information, the designed system generates the intended SQL queries, which can be run
directly. The designed system creates this code automatically, without relying on an external
environment. It provides a quick and reliable way to generate SQL queries, saving time and
budget for both the user and the system analyst. A graphical user interface has also been
provided for entering the input scenario properly and generating the SQL queries.

7.0. Future Work
There is still room for improvement in the algorithms for generating the intended SQL queries.
The current accuracy of query generation is about 80% to 85%. It could be raised to as much as
95% by improving the algorithms and adding a learning capability to the system. In this research
only three types of queries have been addressed: SELECT, INSERT and DELETE. Other types of
queries still require an adequate solution.


References
[1] Allen, J. (1994). Natural Language Understanding. Benjamin-Cummings Publishing Company,
New York.
[2] Biber, D., Conrad, S., & Reppen, R. (1998). Corpus Linguistics: Investigating Language
Structure and Use. Cambridge Univ. Press, Cambridge, U.K.
[3] D. DeHaan, D. Toman, M. P. Consens, and T. Ozsu. (2003) A Comprehensive XQuery to SQL
Translation using Dynamic Interval Encoding. In SIGMOD.
[4] C. A. Thompson, R. J. Mooney and L. R. Tang, Learning to parse natural language database
queries into logical form, in: Workshop on Automata Induction, Grammatical Inference and
Language Acquisition (1997).
[5] Salton, G., & McGill, M. (1983). Introduction to Modern Information Retrieval. McGraw-Hill,
New York.
[6] A. Rosenthal and D. Reiner, Extending the Algebraic Framework of Query Processing to Handle
Outer joins, Proc. VLDB, Singapore, 1984, pp. 334-343.
[7] Fagan, J. L. (1989). The effectiveness of a non-syntactic approach to automatic phrase indexing
for document retrieval. Journal of the American Society for Information Science, 40(2), 115–132.
[8] J. M. Zelle and R. J. Mooney, Learning semantic grammars with constructive inductive logic
programming, in: Proceedings of the 11th National Conference on Artificial Intelligence
(AAAI Press/MIT Press, Washington, D.C., 1993), pp. 817–822.
[9] Kowalski, G. (1998). Information Retrieval Systems: Theory and Implementation. Kluwer,
Boston.
[10] Krovetz, R., & Croft, W. B. (1992). Lexical ambiguity and information retrieval. ACM
Transactions on Information Systems, 10, 115–141.
[11] Losee, R. M. (1988). Parameter estimation for probabilistic document retrieval models. Journal
of the American Society for Information Science, 39(1), 8–16.
[12] Losee, R. M. (1996a). Learning syntactic rules and tags with genetic algorithms for information
retrieval and filtering: An empirical basis for grammatical rules. Information Processing and
Management, 32(2), 185–197.
[13] Manning, C. D., & Schutze, H. (1999). Foundations of Statistical Natural Language
Processing. MIT Press, Cambridge, Mass.
[14] Partee, B. H., Meulen, A. t., & Wall, R. E. (1990). Mathematical Methods in Linguistics.
Kluwer, Dordrecht, The Netherlands.
