NL Interface for Database - EJSR 20(4)


Imran Sarwar Bajwa, Shahzad Mumtaz, M. Shahid Naweed (2008), "Database Interfacing using Natural Language Processing", European Journal of Scientific Research, Vol. 20, No. 4, pp. 844-851, July 2008.


European Journal of Scientific Research
ISSN 1450-216X Vol.20 No.4 (2008), pp.844-851
© EuroJournals Publishing, Inc. 2008
http://www.eurojournals.com/ejsr.htm

Database Interfacing using Natural Language Processing

Imran Sarwar Bajwa
Department of Computer Science and IT, The Islamia University of Bahawalpur
E-mail: imransbajwa@gmail.com

Shahzad Mumtaz
Department of Computer Science and IT, The Islamia University of Bahawalpur
E-mail: shahzadz22@hotmail.com

M. Shahid Naweed
Department of Computer Science and IT, The Islamia University of Bahawalpur
E-mail: shahid_naweed@hotmail.com

Abstract

Writing technically correct SQL queries is a complex, skill-intensive task, especially for a novice user. The situation becomes harder when a low-skilled person has to use a database management system for a specific business purpose. He or she has to write queries unaided and perform various tasks, which demands expertise in understanding and writing accurate, functional queries. The task of the novice user can be simplified by providing an easy interface that is already familiar to that user. To resolve these issues, automated software is needed that helps both users and software engineers. The user writes the requirements in a few statements of simple English, and the designed system analyzes the given script. After composite analysis and mining of the associated information, the designed system generates the intended SQL queries, which can be run directly. This paper describes a system that creates SQL queries automatically. The designed system provides a quick and reliable way to generate SQL queries, saving the time and budget of both the user and the system analyst.

Keywords: Information extraction, Automatic Query Generation, Knowledge Retrieval, Natural language processing.

1.0. Introduction

Relational databases are the premier way of storing common data repositories.
After the data contents are stored in a database, an interfacing mechanism is required to talk to the prearranged repository of the confined data. The conventional way of communicating with a database is to first build a connection stream and then add, delete or update the data contents in the database by using a standardized interfacing mechanism [1]. Simple command shells are typically used, and they are often incorporated within every distinct database product. These command shells are typically simple filters which help a user to log on to the database, execute particular commands and receive output. They provide access to the database from the machine on which the database is actually running [2]. After connecting to a particular database, a user or a programmer requires an interface, and typically that
interface is provided by some technical languages. These languages are called query languages, and they consist of the database commands typically used for asking questions of a particular database and getting the intended response. SQL [3] (Structured Query Language) is the most popular query language and is the de facto language of databases today. SQL is the conventional tool of database querying. Different database management systems implement this standardized language with minor alterations and adjustments. However, in spite of these proprietary extensions by the vendors, the core of this querying language is the same in all environments.

From an application programmer's point of view, the major novelty of the relational database is that one uses a declarative query language, SQL. Most computer languages are procedural: the programmer tells the computer what to do, step by step, specifying a procedure. Using the SQL interface, the programmer defines his requirements and questions, and the RDBMS query planner figures out how to answer them [5]. There are two advantages of using a declarative language. The first is that the queries no longer depend on the data representation; the RDBMS is free to store data according to its own design requirements [6]. The second is improved software reliability. For various web-based and stand-alone applications, generic SQL is used to keep things simple and straightforward. Despite these advantages, the technical interface of SQL makes the language tedious and difficult to learn and use. It is quite hard to remember SQL commands and to use them accurately and precisely.

In order to resolve all such issues, automated software is needed that helps both users and software engineers.
For such software, the time it takes to explore all of its facilities and services should be well under a minute, which is quite useful for the users.

2.0. Problem Description

Modern software engineering requires quick, automated solutions that can create accurate and precise SQL queries automatically. For complex queries, even an expert programmer needs assistance in the form of automatic query generation. He can use the automatically generated queries after making appropriate adjustments and alterations, with less effort and in less time than with the traditional approaches.

The task of the novice user can be simplified by providing an easy interface that is familiar and well known to that user. In order to resolve all such issues, automated software is needed that helps both users and software engineers. The user writes the requirements in a few statements of simple English, and the designed system is able to analyze the given script. After composite analysis and mining of the associated information, the designed system generates the intended SQL queries, which can be run directly. The designed system can create code automatically without an external environment. It provides a quick and reliable way to generate SQL queries, saving the time and budget of both the user and the system analyst.

3.0. Used Methodology

The understanding and multi-aspect processing of natural languages, which are also termed "speech languages", is one of the topics of greatest interest in the field of artificial intelligence [8]. Natural languages are irregular and asymmetrical. Traditionally, natural languages are based on informal grammars. Geographical, psychological and sociological factors influence the behaviour of natural languages [12].
Their vocabulary is an open set of words that varies from area to area and over time. Due to these variations and inconsistencies, natural languages have different flavours; English, for example, has more than half a dozen well-known flavours all over the world [14]. These flavours differ in accent, vocabulary and phonological aspects. These discrepancies and inconsistencies make natural languages more difficult to process than formal languages [13].

English-language statements are readily converted into SQL queries by using the newly designed rule-based algorithm. The SELECT query is the common query used to choose a set of values from a table [4]. An example college database has been used in this research. Student data is retrieved, inserted and deleted by queries generated automatically from simple English text.

3.1. SELECT Query

First of all, the ‘SELECT’ query was processed. A ‘SELECT’ query has four parts, as follows:

    SELECT     *               FROM       Students
    Keyword    Required Set    Keyword    Table Name

A ‘SELECT’ query can easily be generated from the provided input string, as it has two keywords, ‘SELECT’ and ‘FROM’. The other two required values are ‘Required Set’ and ‘Table Name’. To process the speech language text and find ‘Required Set’ and ‘Table Name’, the conventional norms and grammatical rules of the English language are used. The conventional structure of a simple English sentence is the key rule for comprehending and analyzing natural language text [13], as in the following example:

    “I need names of all students.”

Following is the complete analysis of this simple sentence.

Table 01: Generating SELECT Query from text

    Lexicons    Phase-I        Phase-II
    I           Noun           ----------
    need        Verb           ----------
    names       Noun           Field Name
    of          Preposition    ----------
    all         Noun           *
    students    Noun           Table Name

In this example the ‘Required Set’ field is filled by the ‘Field Name’ attribute and the ‘Table Name’ field is filled by the ‘Table Name’ attribute, as follows:

    Select * from Students

Here the table name is searched for in the array of all available tables in the database. From all available tables, the nearest table name is picked, which is ‘students’ in this example.

3.2. INSERT Query

After the ‘SELECT’ query, the ‘INSERT’ query was processed.
The ‘INSERT’ query has five fragments, as follows:

    INSERT     INTO       Students      VALUES     (5, 'Ali')
    Keyword    Keyword    Table Name    Keyword    Record

An ‘INSERT’ query can also be produced from the given statement, as it has three keywords, ‘INSERT’, ‘INTO’ and ‘VALUES’ [6]. The other two required parameters are ‘Table Name’ and ‘Record’. ‘Table Name’ and ‘Record’ are extracted using the same rule-based algorithm, as in the following example:

    “I want to insert a student whose Roll No. is 5 and Name is Ali.”

Following is the complete analysis of this simple sentence.

Table 02: Generating INSERT Query from text

    Lexicons    Phase-I         Phase-II
    I           Noun            ----------
    want        Verb            ----------
    to          Preposition     ----------
    insert      Verb            Action
    a           Article         ----------
    student     Noun            Table Name
    whose       Conjunction     ----------
    Roll No     Noun            Attribute
    is          Helping Verb    ----------
    5           Noun            Value
    and         Conjunction     ----------
    Name        Noun            Attribute
    is          Helping Verb    ----------
    Ali         Noun            Value

In this example the ‘Table Name’ field is filled by the ‘Table Name’ attribute, and the ‘Record’ is built from the extracted ‘Value’ attributes. Here the table name is searched for in the array of all available tables in the database. From all available tables, the nearest table name is picked, which is ‘students’ in this example.

3.3. DELETE Query

Like the ‘SELECT’ and ‘INSERT’ queries, the ‘DELETE’ query can also be processed easily. A ‘DELETE’ query has five parts, as follows:

    DELETE     FROM       Students      WHERE      Age > 25
    Keyword    Keyword    Table Name    Keyword    Condition

The ‘DELETE’ query typically consists of three keywords, ‘DELETE’, ‘FROM’ and ‘WHERE’. The other two required values are ‘Table Name’ and ‘Condition’. To find the ‘Table Name’ and ‘Condition’ parameters, the grammatical rules of the English language are used, as in the following example:

    “I want to delete the students more than 25 years age.”

Following is the complete analysis of this simple sentence.

Table 03: Generating DELETE Query from text

    Lexicons    Phase-I        Phase-II
    I           Noun           ----------
    want        Verb           ----------
    to          Preposition    ----------
    delete      Verb           Action
    the         Article        ----------
    students    Noun           Table Name
    more        Preposition    Condition
    than        Noun           ----------
    25          Noun           Value
    years       Noun           ----------
    age         Noun           Parameter

For the ‘DELETE’ query, the condition is defined first. In this example, Parameter and Value are combined with the Condition parameter. The table name is again retrieved from the array of all available tables in the database.
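
The rules of Sections 3.1 to 3.3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the table list `TABLES`, the helper names, the use of `difflib` for picking the "nearest" table name, and the handling of only the three example sentence patterns are all assumptions made for illustration.

```python
import difflib
import re

TABLES = ["Students", "Teachers", "Courses"]  # hypothetical schema (assumption)
UNITS = {"year", "years"}                     # words skipped inside a condition
COMPARATORS = {"more": ">", "less": "<"}      # "more than" / "less than"


def tokenize(sentence):
    """Lexical phase: split the sentence into word and number tokens."""
    return re.findall(r"[A-Za-z0-9]+", sentence)


def find_table(tokens):
    """Return the known table whose name is nearest to one of the tokens."""
    by_lower = {name.lower(): name for name in TABLES}
    for token in tokens:
        hit = difflib.get_close_matches(token.lower(), list(by_lower), n=1, cutoff=0.8)
        if hit:
            return by_lower[hit[0]]
    raise ValueError("no known table name found in the sentence")


def sql_literal(value):
    """Numbers stay bare; any other value becomes a quoted string."""
    return value if value.isdigit() else "'" + value + "'"


def generate_query(sentence):
    tokens = tokenize(sentence)
    lowered = [t.lower() for t in tokens]
    table = find_table(tokens)

    if "insert" in lowered:
        # Each token that follows "is" is treated as a Value (Table 02).
        values = [tokens[i + 1] for i, t in enumerate(lowered[:-1]) if t == "is"]
        record = ", ".join(sql_literal(v) for v in values)
        return "INSERT INTO " + table + " VALUES (" + record + ")"

    if "delete" in lowered:
        # "more than 25 years age": Condition, Value, then Parameter (Table 03).
        at = lowered.index("than")
        operator = COMPARATORS[lowered[at - 1]]
        value = tokens[at + 1]
        parameter = next(t for t in tokens[at + 2:] if t.lower() not in UNITS)
        return ("DELETE FROM " + table + " WHERE " + parameter.capitalize()
                + " " + operator + " " + sql_literal(value))

    # Otherwise SELECT; only the "all" -> "*" mapping of Table 01 is sketched.
    return "SELECT * FROM " + table


print(generate_query("I need names of all students."))
print(generate_query("I want to insert a student whose Roll No. is 5 and Name is Ali."))
print(generate_query("I want to delete the students more than 25 years age."))
```

For the three example sentences of this section, the sketch prints the three queries shown above. It builds query strings directly for readability; a production system would more likely pass extracted values to the database as bound parameters rather than splicing them into the SQL text.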

4.0. Work Flow of Designed System

The designed system, “Computational Linguistics based System for Automatic Database Query Generation”, is capable of generating queries automatically. It performs its function in a multi-phase procedure. There are four modules in total: text input acquisition, text comprehension, information retrieval and, ultimately, generation of SQL queries. Following is a brief description of each of these phases.

4.1. Text input Acquisition

This module acquires the input text scenario. The user provides the business scenario in the form of strings of text. The module reads the input text character by character and generates the words by concatenating the input characters. This module is the implementation of the lexical phase. Lexicons and tokens are generated in this module. After lexicon generation, further processing can be performed on the input text.

Figure 01: Lexical analysis of input text string

4.2. Text Comprehension

This module reads the output of module one in the form of words or lexicons. These words are categorized into various classes such as verbs, helping verbs, nouns, pronouns, adjectives, prepositions, conjunctions, etc. These classes are then used to understand the various parts of the given sentence.

Figure 02: Parts of speech tagging of input text

4.3. Information Retrieval

This module extracts the keywords of the SQL queries, such as Select, Insert, Delete, From, Into, Where, etc. Keywords are found by matching the tokens against the given array of all possible keywords. These keywords are then used to generate the respective queries. Information such as the table name, field name, number of attributes and logical conditions is also extracted in this phase.

Figure 03: Query information extraction

4.4. SQL Queries generation

This module combines the keywords and the other required parameters for a particular query. The SQL query is ultimately generated here according to the rules given in the designed algorithm. As a separate scenario is provided for each type of query, separate functions have been implemented for each particular query.

Figure 04: Generation of SQL Query

5.0. Results and Analysis

After the query generating system was designed and coded, its accuracy and efficiency were tested. For testing purposes, simple and complex queries were generated by the designed system. Queries from each category (Select, Insert, Delete) were checked.

15 sample queries were generated, and the results are shown in the following table.

Table 04: Accuracy ratio of various types of queries

    Types     Simple    Complex    Total
    SELECT    14        13         90%
    INSERT    13        11         80%
    DELETE    14        12         87%

    Total Accuracy = 86%

A matrix representing the accuracy (%) of the query generation test for simple and complex queries has been constructed. The overall accuracy for all types of queries is determined by adding the total accuracy of all categories and calculating the average, which is 86% in this case.

Figure 05: Graphical representation of the results (bar chart of Simple and Complex query counts for SELECT, INSERT and DELETE; vertical axis from 0 to 14)

The graph above shows the accuracy ratio of the SELECT, INSERT and DELETE queries for simple and complex queries.

6.0. Conclusion

The designed system, “Computational Linguistics based System for Automatic Database Query Generation”, helps both users and software engineers by generating SQL queries automatically. The task of the novice user can be simplified by providing an easy interface that is familiar and well known to that user. In order to resolve all such issues, automated software is needed that helps both users and software engineers. The user writes the requirements in a few statements of simple English, and the designed system is able to analyze the given script. After composite analysis and mining of the associated information, the designed system generates the intended SQL queries, which can be run directly. The designed system can create code automatically without an external environment. It provides a quick and reliable way to generate SQL queries, saving the time and budget of both the user and the system analyst. A graphical user interface has also been provided for entering the input scenario in a proper way and generating the queries.

7.0. Future Work

There is still some margin for improvement in the algorithms for generating the intended SQL queries. The current accuracy of query generation is about 80% to 85%. It can be raised to 95% by improving the algorithms and giving the system the ability to learn. In this research only three types of queries have been addressed: SELECT, INSERT and DELETE. Other types of queries still require an adequate solution.
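
As a check on the accuracy figures in Table 04, the following short Python sketch recomputes them. It assumes that 15 sample queries were tested per query type and per difficulty level (30 per type); the source states only that 15 sample queries were generated, so this reading is an inference from the reported percentages. The names `SAMPLES_PER_LEVEL` and `accuracy_percent` are illustrative.

```python
SAMPLES_PER_LEVEL = 15  # assumption: 15 test queries per type and per level

# (simple, complex) counts of correctly generated queries, taken from Table 04
correct = {
    "SELECT": (14, 13),
    "INSERT": (13, 11),
    "DELETE": (14, 12),
}


def accuracy_percent(simple, complex_):
    """Share of correct queries over both levels, rounded to a whole percent."""
    return round(100 * (simple + complex_) / (2 * SAMPLES_PER_LEVEL))


per_type = {name: accuracy_percent(*counts) for name, counts in correct.items()}
overall = round(sum(per_type.values()) / len(per_type))

print(per_type)  # {'SELECT': 90, 'INSERT': 80, 'DELETE': 87}
print(overall)   # 86
```

Under this assumption the per-type values match the Total column of Table 04, and averaging the three rounded percentages gives the reported total accuracy of 86%.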

References

[1] Allen, J. (1994). Natural Language Understanding. Benjamin-Cummings Publishing Company, New York.
[2] Biber, D., Conrad, S., & Reppen, R. (1998). Corpus Linguistics: Investigating Language Structure and Use. Cambridge Univ. Press, Cambridge, U.K.
[3] D. DeHaan, D. Toman, M. P. Consens, and T. Ozsu. (2003) A Comprehensive XQuery to SQL Translation using Dynamic Interval Encoding. In SIGMOD.
[4] C. A. Thompson, R. J. Mooney and L. R. Tang, Learning to parse natural language database queries into logical form, in: Workshop on Automata Induction, Grammatical Inference and Language Acquisition (1997).
[5] Salton, G., & McGill, M. (1983). Introduction to Modern Information Retrieval. McGraw-Hill, New York.
[6] A. Rosenthal, D. Reiner, Extending the Algebraic Framework of Query Processing to Handle Outer joins, Proc. VLDB Singapore 1984, pp. 334-343.
[7] Fagan, J. L. (1989). The effectiveness of a non-syntactic approach to automatic phrase indexing for document retrieval. Journal of the American Society for Information Science, 40(2), 115–132.
[8] J. M. Zelle and R. J. Mooney, Learning semantic grammars with constructive inductive logic programming, in: Proceedings of the 11th National Conference on Artificial Intelligence (AAAI Press/MIT Press, Washington, D.C., 1993), pp. 817–822.
[9] Kowalski, G. (1998). Information Retrieval Systems: Theory and Implementation. Kluwer, Boston.
[10] Krovetz, R., & Croft, W. B. (1992). Lexical ambiguity and information retrieval. ACM Transactions on Information Systems, 10, 115–141.
[11] Losee, R. M. (1988). Parameter estimation for probabilistic document retrieval models. Journal of the American Society for Information Science, 39(1), 8–16.
[12] Losee, R. M. (1996a). Learning syntactic rules and tags with genetic algorithms for information retrieval and filtering: An empirical basis for grammatical rules. Information Processing and Management, 32(2), 185–197.
[13] Manning, C. D., & Schutze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, Mass.
[14] Partee, B. H., Meulen, A. t., & Wall, R. E. (1990). Mathematical Methods in Linguistics. Kluwer, Dordrecht, The Netherlands.
