report.doc
Table of Contents

1 INTRODUCTION
2 HISTORY
  2.1 Early systems
3 NATURAL LANGUAGE PARSING
  3.1 Rule-Based Syntactic Parsing
  3.2 Terminal Symbols
  3.3 Non-terminal Symbols
  3.4 Production Rules
    3.4.1 Grammar
    3.4.2 Parse Tree
      3.4.2.1 Top Down
      3.4.2.2 Bottom Up
  3.5 Probabilistic Parsing
    3.5.1 Disambiguation
    3.5.2 Training
      3.5.2.1 Treebank
      3.5.2.2 Incremental Learning
  3.6 Semantic Parsing
    3.6.1 Semantic Data Models
    3.6.2 Case-Based Reasoning
    3.6.3 Semantic Representation
    3.6.4 Actions of the Parser
4 NLIDB ARCHITECTURE
  4.1 Pattern-Matching Systems
  4.2 Parsing-Based Systems
    4.2.1 Semantic Grammar-Based Parsing
    4.2.2 Translation
5 MARKET TEST
  5.1 Goals
  5.2 Tests
  5.3 Results
    5.3.1 Impressions
      5.3.1.1 Microsoft English Query
      5.3.1.2 Elfsoft
    5.3.2 Query Results
6 FUTURE
  6.1 Language Challenges
  6.2 Portability Challenges
  6.3 Competing Systems
  6.4 Possible Avenues
    6.4.1 Adaptation Techniques
    6.4.2 Speech-Based Techniques
    6.4.3 Learning Algorithms
      6.4.3.1 User Dialogue
      6.4.3.2 Neural Networks
      6.4.3.3 Genetic Algorithms
7 CONCLUSIONS
8 BIBLIOGRAPHY
9 CONTRIBUTIONS
APPENDIX A: Evaluating Systems
  Introduction
  Why is there a need?
  Current Marketing
  Problems
  Black Box Metrics
  Proposed Black Box Evaluation Scheme
  Overall Characteristics
  Vocabulary
  Ease of Interaction
  Accuracy Based on Input Complexity
APPENDIX B: Test Protocol
1 INTRODUCTION

The ability to use language to convey different thoughts and feelings differentiates human beings from animals. Natural Language Processing can be defined as the capability of a machine to understand the full context of human language about a particular topic, so that unstated assumptions and general knowledge can also be understood. "Thus if the machine is able to achieve this, it has come close to the notion of artificial intelligence itself" [1].

One may find interacting with a foreigner who speaks no English intricate and frustrating; a translator has to come into the picture to allow the two to communicate. Companies have related this problem to extracting data from a database management system (DBMS) such as MS Access or Oracle. A person with no knowledge of Structured Query Language (SQL) may find himself or herself handicapped in communicating with the database. Companies like Microsoft and Elfsoft (English Language Frontend Software) have therefore applied Natural Language Processing to develop products that let people interact with the database in simple English: the user simply enters queries in English at the natural language database interface. This kind of application is known as a Natural Language Interface to a DataBase (NLIDB).

The system works by combining syntactic knowledge with the knowledge it has been provided about the relevant database [2]. It maps the natural language input onto the structure, scope and contents of the database, and translates the whole query into the standard query language to extract the relevant information. These products have created a revolution in extracting information from databases: they discard the fuss of learning SQL, and save the time that would be spent learning that query language.

This report will look at the performance of each database interface connected to a standard database; the Northwind database has been chosen as the default database to work on. Several companies offer such products in the market. Our group found several of them, including English Query, Elfsoft, EasyAsk and NLBean (created by Mr Mark Watson), and asked these companies for permission to test their products for our research. We received positive responses from Elfsoft and NLBean, but had to settle for tests on Microsoft English Query and Elfsoft only. We also contacted EasyAsk via email, but the company provided minimal assistance in our research.

1 Manas Tungare
2 Manas Tungare
In order to draw accurate conclusions about the different interpretations made by each piece of software, we listed over thirty questions with which to test the products. Each product is asked the same questions in the same order. The questions have been carefully planned to probe the strengths and weaknesses of each product. They cover:

• Listing specific columns and rows
• Counting
• Calculations
• Cross-referencing across more than one table
• Ordinal positions
• Follow-ups
• Conclusions
• Semantics
• Grammar mistakes
• Spelling mistakes
• Out-of-context questions

There are three components in a natural language dialog system: analysis, evaluation and generation [3]. The analysis component translates the query as entered by the user into a semantic representation expressed in the knowledge representation language. There may be several communication sessions between the natural language access system, the user interface system and the user in order to carry out the action and derive the result. The evaluation component allows information to be absorbed by the dialog system when queries have to be satisfied or when the system needs to alert the user about major state changes. The generation component gathers the information that the user asked for in the query and produces text, graphs, queries or other responses according to the situational context of the query [4].

The knowledge-based database assistant (KDA) is a practical development of an intelligent database front-end that assists novice users in retrieving the information they want from an unfamiliar database system [5]. This component exists in both Microsoft English Query and Elfsoft. It directs the novice user towards the relevant results by helping them enter an accurate query, or by prompting the user when the information entered is insufficient to produce the appropriate answer. This component can be seen at work in both programs later in this report.

3 Dialog-Oriented Use of Natural Language
4 Dialog-Oriented Use of Natural Language
5 Manas Tungare
In addition, "the KDA's responding functionality, which could change the user's knowledge state, is called query guidance" [6]. The KDA can gauge a user's knowledge of the relevant database by studying the query the user enters. If it senses that the user has limited awareness of the database and could not retrieve the desired answer, query guidance jumps into action: it offers similar queries that allow the user to gather the appropriate facts, or presents the most relevant query based on the user's perceived intention. Such a component lets a novice become familiar with the database quickly, learning its scope from the prompt messages and the queries generated by the KDA, without the expense of studying the massive databases stored in most organizations.

6 Manas Tungare
2 HISTORY

As the use of databases for data storage spread during the 1970s, the user interface to these systems represented a burden for designers worldwide. At this point, both the relational database model and the SQL interface language were yet to be developed, which meant that the task of inserting and querying data was tedious and difficult. It was therefore a logical step for programmers to attempt to develop more user-friendly and "human" interfaces to databases. One of these approaches was the use of natural language processing, where the user would be allowed to interrogate the stored information interactively.

2.1 Early systems

The best-known historical natural language database interface systems are:

• LUNAR, interfacing a database with information on rocks collected during American moon expeditions. It was originally published in 1972. When evaluated in 1977, it answered 78% of questions correctly. Based on syntactic parsing, it tended to build several parse trees for the same query, and was deemed inefficient [7] as well as too domain-specific and inflexible.
• LADDER, the first semantic grammar-based system, interfacing a database with information on US Navy ships.
• CHAT-80, probably the most famous example. It interfaced a database of world geography facts. The entire application (both the database and the user interface) was developed in Prolog. As the source code was freely distributed, it is still used and cited. An online version can be found at [8].

7 Hafner, C. D. and Gooden, K. pp 141-164
8 ECL I Vertiefung: natürlichsprachliche Zugangssysteme: chat80
3 NATURAL LANGUAGE PARSING

3.1 Rule-Based Syntactic Parsing

Syntax describes the ways words can fit together to form higher-level units such as phrases, clauses and sentences. Syntactically driven parsing therefore means that interpretations of larger groups of words are built up out of the interpretations of their syntactic constituent words or phrases. In a way this is the opposite of pattern matching, where the interpretation of the input is done as a whole. Syntactic analyses are obtained by applying a grammar that determines which sentences are legal in the language being parsed. Syntactic parsing operates by translating the natural language query into a parse tree, which is then converted to an SQL query. There are a number of fundamental concepts in the theory of syntactic parsing.

3.2 Terminal Symbols

A terminal symbol is a basic building block of the language, i.e. a word or delimiter. Together, the set of terminal symbols forms the "dictionary of words" [9] recognised by the system, i.e. the range of the vocabulary that it can read and interpret.

3.3 Non-terminal Symbols

Non-terminal symbols are higher-level language terms describing concepts and connections in the syntax of the language. Examples of non-terminal symbols include sentence, noun phrase, verb phrase, noun and verb.

3.4 Production Rules

As the query is analysed, a number of production rules fire to identify and classify the context of the word being read. In analogy with a production system (such as the one used in PROLOG), a production rule in a context-free grammar [10] converts a left-hand non-terminal symbol into a sequence of symbols, which can be either terminal or non-terminal. Examples of production rules:

• Sentence := Noun phrase Verb phrase
• Verb phrase := Verb

These rules are also commonly referred to as rewrite rules.

3.4.1 Grammar

The combination of the set of terminal symbols, the set of non-terminal symbols, the production rules and an assigned start symbol (the highest-level construct in the system, usually sentence) forms the grammar of the syntax. The role of the grammar is to define:

• What category each word belongs to;

9 Luger, G.F. and Stubblefield, W.A.
10 This paper will be restricted to the treatment of context-free grammars and will not deal with the more complex set of syntaxes known as context-sensitive.
• What expressions are legal and syntactically correct;
• How sentences are generated.

3.4.2 Parse Tree

The system analyses the sentence by reading the terminal symbols in order and identifying which production rule to fire. As it does so, it gradually builds a representation of the sentence referred to as a parse tree. The term has been coined from the tree-like graph that is produced, where the root is the top-level symbol (e.g. sentence), the children of each node are the symbols on the right-hand side of the production applied there, and the leaves are the terminal symbols (the words). The parse tree can be built in two fundamentally different ways.

3.4.2.1 Top Down

A top-down parser starts at the root and gradually builds the tree downwards by matching the read terminal symbols against symbols on the right-hand side of candidate production rules. Terminal or non-terminal symbols on the right-hand side are added at the level below the current symbol. This is similar to the goal-driven approach of a production system. The basic architecture of a top-down parser is illustrated in figure 1.
Figure 1 Top-down parsing of the sentence "the girl forgot the boy" [11]

In many situations, the first token alone does not provide enough information to decide which production rule should be fired. There are two basic methods of overcoming this.

3.4.2.1.1 Recursive Descent

The system fires the first candidate production rule that the given terminal symbol could fit and builds the initial subtree from this information. If building further down the tree then results in an inconsistency or syntactic error, the parser reverts to the point where the decision was made, removes all the nodes on the way back up, and selects another of the possible productions. This procedure is very similar to depth-first searching and backtracking in production systems.

3.4.2.1.2 Look Ahead

A look-ahead system is not content with reading just one token. Rather, it reads as many tokens as are necessary to identify the given right-hand side beyond any ambiguity before firing a production rule. Grammars are characterised by the maximum number of terminal symbols that must be read before all possible conflicts in the choice of production rule can be resolved: if this number is k, the grammar is referred to as an LL(k) grammar [12]. The look-ahead procedure is more in analogy with a breadth-first search technique.

3.4.2.2 Bottom Up

A bottom-up parser, on the other hand, works from the leaves upward by "tagging" the tokens, i.e. starting from the right-hand sides of the production rules and associating each read word with its category. When a full right-hand side has been identified, the production rule fires and the left-hand non-terminal symbol is added as a branch in the level above. This methodology corresponds to the data-driven technique of production systems. The bottom-up parsing technique is illustrated in figure 2.

11 Dougherty, R.C.
12 Eriksson, G.
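The grammar components and top-down recursive-descent procedure described above can be sketched in code. This is an illustrative toy implementation, not from the report: the grammar covers only the example sentence from the figures, and all names are invented for the sketch.

```python
# Hypothetical recursive-descent parser for the toy grammar behind
# "the girl forgot the boy". On a dead end it backtracks and tries the
# next candidate production, as described in 3.4.2.1.1.

PRODUCTIONS = {
    "S":   [["NP", "VP"]],           # Sentence := Noun phrase Verb phrase
    "NP":  [["Det", "N"]],
    "VP":  [["V", "NP"], ["V"]],     # Verb phrase := Verb Noun phrase | Verb
    "Det": [["the"]],
    "N":   [["girl"], ["boy"]],      # terminal symbols: the "dictionary of words"
    "V":   [["forgot"]],
}

def parse(symbol, tokens, pos):
    """Try to derive `symbol` from tokens[pos:]; return (tree, new_pos) or None."""
    for rhs in PRODUCTIONS.get(symbol, []):
        children, p, ok = [], pos, True
        for sym in rhs:
            if sym in PRODUCTIONS:                      # non-terminal: recurse
                result = parse(sym, tokens, p)
                if result is None:
                    ok = False
                    break
                subtree, p = result
                children.append(subtree)
            elif p < len(tokens) and tokens[p] == sym:  # terminal: match a word
                children.append(sym)
                p += 1
            else:
                ok = False
                break
        if ok:
            return (symbol, children), p                # first rule that fits
    return None                                         # backtrack: no rule fits

tokens = "the girl forgot the boy".split()
tree, end = parse("S", tokens, 0)
print(tree)
# A full parse succeeds only when `end` equals len(tokens).
```

Note the depth-first flavour: each failed production is abandoned (the partial subtree is discarded) before the next candidate is tried, mirroring backtracking in a production system.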
Figure 2 Bottom-up parsing of the sentence "the girl forgot the boy" [13]

In some cases the sentence is ambiguous in itself and multiple production rules match it, in which case the parser has to choose between the potential interpretations. One strategy for dealing with these situations is referred to as probabilistic parsing.

3.5 Probabilistic Parsing

Probabilistic parsing takes an empirical approach to the difficult task of disambiguation, i.e. identifying which of several mutually exclusive alternative syntactic parse trees should be generated. For example, consider the sentence "One morning I shot an elephant in my pyjamas" [14]. There are two possible syntactic parses for this sentence [15]. One implies that the person was wearing the pyjamas, while the other claims that the elephant was in the garment (hence the joke). Although the selection between these two interpretations is obvious to a human, how can this knowledge be automated in a computer?

One option, used in so-called attribute grammars, is to encode information for each verb as a parameter to each production rule. However, as the dictionary grows, this approach may be too selective, requiring every different case to be specifically added to the production rules. Probabilistic parsing, on the other hand, works by augmenting the rules with assigned probabilities representing the chance of the particular expansion (production rule) being the correct one. For example, a probabilistic grammar would introduce the following enhancements to the possible regular syntactic production rules for the expansion of the non-terminal symbol sentence [15]:

• Sentence := Noun phrase Verb phrase, P = 0.8
• Sentence := Auxiliary Noun phrase Verb phrase, P = 0.15
• Sentence := Verb phrase, P = 0.05

Note that the probabilities for the expansions of any given non-terminal symbol always add up to 1.

3.5.1 Disambiguation

How does probabilistic parsing choose a parse tree from two possible interpretations? Most systems simply compare the products of the probabilities of all the productions required by the competing parses and select the parse with the highest product.

3.5.2 Training

One important task concerns how to set the probabilities. There are two fundamentally different techniques for this [15].

13 Dougherty, R.C.
14 Groucho Marx
15 Jurafsky, D. & Martin, J.
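The product-of-probabilities selection described in 3.5.1 can be sketched as follows. The two candidate parses and their rule probabilities are invented for the elephant example; only the selection mechanism is the point.

```python
# Illustrative only: choose between two competing parses by comparing the
# products of the probabilities of the production rules each parse used.
from math import prod

# Each candidate parse is represented by the list of P values of the
# production rules it fired (numbers are made up for the sketch).
parses = {
    "speaker wore the pyjamas":  [0.8, 0.4, 0.3, 0.6],    # product = 0.0576
    "elephant wore the pyjamas": [0.8, 0.4, 0.05, 0.6],   # product = 0.0096
}

def best_parse(candidates):
    """Return the parse whose rule probabilities have the largest product."""
    return max(candidates, key=lambda name: prod(candidates[name]))

print(best_parse(parses))
```

Because the score is a plain product, one low-probability rule (here 0.05 for the attachment that puts the elephant in the pyjamas) is enough to sink an otherwise plausible parse.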
3.5.2.1 Treebank

A large database of sentences with their correct parses (produced by knowledgeable humans) is entered into the system. The respective probabilities are then calculated as the relative frequencies of each possible parse; for more details, see [15]. The largest known treebank is the Penn Treebank [16]. The latest version, Treebank 3, contains parses of [17]:

• One million words of 1989 Wall Street Journal material;
• A small sample of ATIS-3 transcripts. The Air Travel Information Service is a joint project of DARPA (Defence Advanced Research Projects Agency) and SRI International, handling voice-based queries and requests about flights; more information can be found at [18];
• A fully parsed, tagged version of the Brown Corpus, consisting of one million words from 500 different sources (novels, academic books, newspapers, non-fiction books etc. [15]);
• Parsed and tagged text from a set of 560 transcripts of telephone conversations (a.k.a. the Switchboard-1 corpus).

This is a widely used "training set" (in analogy with an artificial neural network) enabling the parser to learn which classes of speech a given word can belong to and how frequently a particular expression is to be interpreted in each of its possible ways.

3.5.2.2 Incremental Learning

The other technique is a "trial and error" method, in which the parsing system, much like an artificial neural network, learns as it is used. The initial probabilities can be assigned randomly or by the user. After that, the system adjusts these probabilities according to the following rules [15]:

• If the sentence was unambiguous, its parse count is increased by 1, i.e. pi := pi + 1;
• If the sentence was ambiguous, each of the possible parses has its count incremented by its respective probability, i.e. pi := pi + P(pi).

The algorithm for this computation is referred to as the Inside-Outside Algorithm. It was originally proposed in [19] and is described in detail in [20].

16 Penn Treebank Project.
17 Quoted by the LDC office of the University of Pennsylvania in an email dated 10/7-2001.
18 Language Reference
19 Baker, J.K. pp. 547-550.
20 Manning, C.D. and Schutze, H.
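The two update rules above can be sketched in a few lines. This is a simplified reading of the rules, not the Inside-Outside Algorithm itself; the starting counts are invented, and P(pi) is taken here as the current relative frequency of parse i among the observed candidates.

```python
# Sketch of the incremental update rules of 3.5.2.2 (assumed interpretation):
# an unambiguous sentence adds 1 to the count of its single parse; an
# ambiguous one adds each candidate parse's current probability to its count.

counts = {"parse_A": 3.0, "parse_B": 1.0}   # illustrative starting counts

def update(counts, observed_parses):
    total = sum(counts[p] for p in observed_parses)
    if len(observed_parses) == 1:             # unambiguous: p_i := p_i + 1
        counts[observed_parses[0]] += 1.0
    else:                                     # ambiguous: p_i := p_i + P(p_i)
        for p in observed_parses:
            counts[p] += counts[p] / total    # relative frequency as P(p_i)
    return counts

update(counts, ["parse_A"])                   # an unambiguous sentence
update(counts, ["parse_A", "parse_B"])        # an ambiguous sentence
print(counts)
```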
3.6 Semantic Parsing

The syntactic structure of a sentence is not enough to express its meaning. For instance, the noun phrase the catch can have different meanings depending on whether one is talking about a baseball game or a fishing expedition. To talk about the different possible readings of the phrase the catch, one therefore has to define each specific sense of the phrase. The representation of the context-independent meaning of a sentence is called its logical form [21].

Natural language analysis based on a semantic grammar is similar to syntactically driven parsing, except that in a semantic grammar the categories used are defined semantically. Database items can be ambiguous when the same item is listed under more than one attribute. For example, the term "Mississippi" is ambiguous between being a river name and a state name; in other words, it has two different logical forms. The two different meanings have to be represented distinctly for an interpretation of a user query.

3.6.1 Semantic Data Models

Semantic data models (SDMs) are widely researched in the database community. They are closely related to the semantic networks used in artificial intelligence, which were originally developed to support natural language processing. Hence, as database management systems they are capable of supporting large amounts of information, while still offering the potential of advanced inferencing capabilities including NLP, machine learning and query processing. "SDMs can be seen as formalising many of the relationships expressed in an ad hoc manner in conventional hypermedia systems." [22] SDMs support a variety of formalised links and relationships. An example of a small network on insects is shown in figure 3. The links in this graph express generalisation or "ISA" relationships (beneficial insect IS-A insect), part/whole (abdomen is part of an insect), association (ladybugs eat aphids), and class/instance (Ladybug is an instance of Beneficial Insect) [23].

21 Tang, R. L. p5
22 Beck, H., Mobini, A., Kadambari, V.
23 Beck, H., Mobini, A., Kadambari, V.
Figure 3 Semantic Data Model describing insects [24]

In figure 3, solid lines are ISA relationships, diamonds are part/whole, circles are associations, and instances are underlined. Since concepts in SDMs are described by structured graphs expressing the relationships among symbols, rather than connections between text files as in conventional hypertext, SDMs can be manipulated to produce a number of desirable functions. Foremost among these is search, or query processing. [8] suggests query processing based on graph matching techniques, by which the query is expressed as a small semantic network. This query graph is then matched against the larger database graph to find connections. This gives a much more precise search capability than is possible with Boolean keyword searches over text files.

3.6.2 Case-Based Reasoning

In order to construct an NLP system, one must construct a large dictionary, and much of the recent progress in text understanding systems can be attributed to advances in the design and construction of large lexicons. But that presupposes that word meaning is easily represented; here, a case-based reasoning approach to meaning is used. Words obtain meaning from how they are used. A particular word occurs in many different situations and contexts, and each occurrence is treated as one case. Similarities among cases can be observed, and cases with similar usage can be clustered together into categories. When a word is used in a new situation, similar cases are retrieved from the case-based memory in order to apply what happened before to the new context. The meaning of a particular word is established by a large case base, and thus a single word may be "worth 1,000 cases" [25].

24 http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/beck/fig1.gif
25 Beck, H., Mobini, A., Kadambari, V.
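The link types of figure 3 can be sketched as a tiny semantic network. The relation triples below are reconstructed from the figure's description in the text (ISA, part/whole, association, class/instance); the code and traversal are illustrative, not the SDM query processor of [8].

```python
# Illustrative semantic network in the spirit of figure 3.
# Each link is a (source, relation, destination) triple.
LINKS = {
    ("beneficial insect", "ISA", "insect"),          # generalisation
    ("ladybug", "INSTANCE_OF", "beneficial insect"), # class/instance
    ("abdomen", "PART_OF", "insect"),                # part/whole
    ("ladybug", "EATS", "aphid"),                    # association
}

def related(entity, relation):
    """All destinations reachable from `entity` over one link of `relation`."""
    return {dst for (src, rel, dst) in LINKS if src == entity and rel == relation}

def isa_closure(entity):
    """Follow INSTANCE_OF and ISA links to every class the entity belongs to."""
    classes, frontier = set(), {entity}
    while frontier:
        node = frontier.pop()
        for parent in related(node, "INSTANCE_OF") | related(node, "ISA"):
            if parent not in classes:
                classes.add(parent)
                frontier.add(parent)
    return classes

print(isa_closure("ladybug"))   # climbs to beneficial insect, then insect
```

Matching a small query graph against such a link set, rather than scanning text, is what gives the graph-based search its precision over Boolean keyword search.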
3.6.3 Semantic Representation

The most basic constructs of the representation language are the terms used to describe objects in the database and the basic relations between them. Database objects bear relationships to each other, or can be related to other objects of interest to a user who is requesting information. For instance, in a user query like "What is the capital of Texas?", the data of interest is a city with a certain relationship, capital, to a state called Texas. The capital/2 relation, or predicate, is therefore defined to handle questions that require it.

Predicate         Description
city(C)           C is a city
capital(S,C)      C is the capital of S
density(S,D)      D is the population density of state S
loc(X,Y)          X is located in Y
len(R,L)          L is the length of river R
next_to(S1,S2)    State S1 borders S2
traverse(R,S)     River R traverses state S

Table 1 Sample of predicates [26]

3.6.4 Actions of the Parser

We will discuss the working of the parser using the parser actions of CHILL [8], known as shift-reduce parsing. The parser actions are generated from templates given by a logical query; an action template is instantiated to form a specific parsing action. Recall that the parser also requires a lexicon to interpret the meanings of phrases as specific logical forms. Consider the following example [27]:

Sentence: What is the capital of Texas?
Logical Query: answer(C,(capital(C,S),const(S,stateid(texas)))).

A very simple lexicon will map 'capital' to 'capital(_,_)' and 'Texas' to 'const(_,stateid(texas))'. The parser begins with an initial stack and a buffer holding the input sentence; this is the initial parse state. Each predicate on the parse stack has an attached buffer to hold the context in which it was introduced. Words from the input sentence are shifted onto the stack buffers during parsing. The initial parse state is as follows:

Parse Stack: [answer(_,_):[]]
Input Buffer: [what,is,the,capital,of,texas,?]

26 Lappoon R. T. p6
27 Tang, R.L.
Since the first three words in the input buffer do not map to any logical forms, the next sequence of steps pushes these three words from the input buffer onto the parse stack buffer, with the following result:

Parse Stack: [answer(_,_):[the,is,what]]
Input Buffer: [capital,of,texas,?]

Now 'capital' is at the head of the input buffer and is mapped to 'capital(_,_)' in the lexicon. The next action pushes this logical form onto the parse stack. The resulting parse state is as follows:

Parse Stack: [capital(_,_):[],answer(_,_):[the,is,what]]
Input Buffer: [capital,of,texas,?]

The parser then binds two arguments of the two different logical forms to the same variable, resulting in the following parse state:

Parse Stack: [capital(C,_):[],answer(C,_):[the,is,what]]
Input Buffer: [capital,of,texas,?]

The sequence repeats itself, producing the parse state:

Parse Stack: [const(S,stateid(texas)):[?,texas],capital(C,S):[of,capital],answer(C,_):[the,is,what]]
Input Buffer: []

The final step is to take the logical forms on the parse stack and put them into the arguments of the meta-predicate, resulting in:

Parse Stack: [answer(C,(capital(C,S),const(S,stateid(texas)))):[?,texas,of,capital,the,is,what]]
Input Buffer: []

As this is the final parse state, the logical query is then constructed from the parse stack.
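The shift actions of the trace above can be re-enacted in a few lines. This is an assumed simplification, not CHILL itself: variable binding and the final reduce step are omitted, and the meta-predicate name and lexicon are taken from the example.

```python
# Simplified re-enactment of the shift actions in the 3.6.4 trace:
# words without a logical form are shifted into the buffer attached to the
# top of the parse stack; words with a lexicon entry push a new logical
# form (with an empty buffer) onto the stack.

LEXICON = {
    "capital": "capital(C,S)",
    "texas": "const(S,stateid(texas))",
}

def parse_actions(sentence):
    stack = [("answer(C,Q)", [])]              # (predicate, attached buffer)
    for word in sentence:
        if word in LEXICON:
            stack.append((LEXICON[word], []))  # introduce a logical form
        else:
            stack[-1][1].insert(0, word)       # shift word onto the top buffer
    return stack

final = parse_actions(["what", "is", "the", "capital", "of", "texas", "?"])
for predicate, buf in final:
    print(predicate, buf)
```

Running this reproduces the attached buffers of the final parse state in the trace: [the,is,what] under answer, [of] under capital, and [?] under the constant.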
• 17. 4 NLIDB ARCHITECTURE

4.1 Pattern-matching systems

The first NLIDBs were based on pattern-matching techniques. As a simple illustration of the pattern-matching technique, consider the following database:

Countries_Table
Country   Capital   Language
France    Paris     French
Italy     Rome      Italian
…         …         …

Table 2 Sample Database Table28

A primitive pattern-matching system, according to [8], may use rules such as:

Pattern: … "capital" … <country>
Action: Report CAPITAL of row where COUNTRY = <country>

Pattern: … "capital" … "country"
Action: Report CAPITAL and COUNTRY of each row

If the user asked "What is the capital of France?", using the first pattern rule the system would report "Paris". The system would also use the same rule to handle questions such as "Print the capital of Italy", "Could you please tell me what is the capital of France?" etc. The main advantage of this approach is its simplicity: it requires no complicated parsing or interpretation modules, and it is easy to implement. However, the shallowness of the approach often leads to failures. For example, when a pattern-matching NLIDB was asked "TITLES OF EMPLOYEES IN LOS ANGELES.", the system reported the state where each employee worked, interpreting "IN" as the postal abbreviation for Indiana and assuming that the question was about employees and states.29

4.2 Parsing-based systems

In general, as [8] suggests, the system architectures of some NLIDBs can be seen as being made of two major modules. The first module handles the natural language: a question is submitted and successively transformed, and at the end of this process one or more intermediate logical query expressions are obtained. Given the dimension of the domain and the flexibility of natural language, there usually exist several interpretations of the same question. The

28 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P. p.14
29 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P. pp.14-15
17
• 18. second component is in charge of the connection with the database, translating the expressions into structured query language (SQL) expressions (using mapping) and sending them to the Database Management System (DBMS) to produce the answers.30 For a graphical explanation of the structure, examine Figure 4.

Figure 4 NLIDB Architecture31

As described in the previous section, the source language sentence is first parsed, producing a parse tree. The two methods of parsing most often found are syntax-based and semantic-grammar-based.

4.2.1 Semantic grammar based parsing

Using this technique, the grammar's categories do not necessarily correspond to syntactic concepts. Examine the following figure:

30 Reis, P., Matias, J. and Mamede N. p.3-4
31 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P. p.18
18
• 19. Figure 5 Semantic base parsing tree32

Notice that some categories of the grammar (e.g. Substance, Magnesium, Specimen_question) do not correspond to syntactic constituents (e.g. Noun-Phrase, Noun, Sentence). This is because semantic information about the knowledge domain (e.g. that a question may refer either to specimens or to spacecraft) is hard-wired into the semantic grammar.33 Because the semantic grammar approach contains hard-wired knowledge about a specific knowledge domain, it is very difficult to transfer it to other knowledge domains. A new semantic grammar has to be written whenever the NLIDB is configured for a new knowledge domain.34

4.2.2 Translation

The translation is usually based on several mapping tables. Figure 6 illustrates this process for both the addition of new information based on an input sentence and the processing of a related query. The query is represented by a small graph, which initiates the mapping to the semantic hierarchy. The small graph is mapped to the semantic network by creating a link from each node in the smaller graph to the corresponding nodes in the network, starting with the most general concept (the root) and ending with the most specific. This creates a unique instance, which is the intersection of all of the nodes involved in the query and may be used to narrow down a neighbourhood based on the requested information.35

The mapping process is bounded by rules and completely based on the information of the parse tree. As an example of mapping rules, consider the previous query "which rock contains magnesium", taken from [1]:

• The mapping of "which" is for_every X.

32 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P. p.17
33 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P. p.17
34 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P. p.17
35 Beck, H., Mobini, A., Kadambari, V. [online]
19
• 20. • The mapping of "rock" is (is_rock X).
• The mapping of an NP is Det' N', where Det' and N' are the mappings of the determiner and the noun respectively. This results in for_every X (is_rock X).
• The mapping of "contains" is contains.
• The mapping of "magnesium" is magnesium.
• The mapping of a VP is (V' X N'). This results in (contains X magnesium).
20
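The compositional rules above can be sketched directly in code. The following Python fragment is a hypothetical illustration of rule application; the rule tables and the string representation of the logic expressions are assumptions, not taken from [1].

```python
# A sketch of the compositional mapping rules for
# "which rock contains magnesium". Logic expressions are plain
# strings here; a real system would build structured terms.
DET = {"which": "for_every X"}
N   = {"rock": "(is_rock X)", "magnesium": "magnesium"}
V   = {"contains": "contains"}

def map_np(det, noun):
    # NP -> Det' N'
    return DET[det] + " " + N[noun]

def map_vp(verb, noun):
    # VP -> (V' X N')
    return "(" + V[verb] + " X " + N[noun] + ")"

def map_query(det, subj, verb, obj):
    # Sentence -> NP' VP'
    return map_np(det, subj) + " " + map_vp(verb, obj)

print(map_query("which", "rock", "contains", "magnesium"))
# for_every X (is_rock X) (contains X magnesium)
```

The result string matches the expression derived step by step in the bullet list above.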
• 21. Figure 6 Mapping and Query Processing Model36

Figure 7 demonstrates the case when the user asks a query about how John spent his leisure time, and shows how the answer to the query is produced by exploiting the relationship between "spending leisure time" and "having a chance to go fishing" (both are "doing").

Figure 7 Query processing model37

In many systems the syntax rules linking non-leaf nodes and the semantic rules are domain-independent and can be used in any application domain. The information describing the possible words (leaf nodes) and the logic expressions is domain-dependent and has to be declared in the lexicon.38 As an example, consider the lexicon used in MASQUE [8], listing the possible words "capital", "capitals", "border", "borders", "bordering", "bordered":

• The logic expression of "capital", "capitals" could be capital_of(Capital,Country).
• The logic expression of "border", "borders", "bordering", "bordered" could be borders(Country1,Country2).
• The logic expression of "country" could be is_country(Country).

36 http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/beck/fig2.gif
37 http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/beck/fig3.gif
38 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P. p.19
21
• 22. Then the question "What is the capital of each country bordering Greece?" would be mapped to this query:

answer([Capital, Country]):-
    is_country(Country),
    borders(Country, Greece),
    capital_of(Capital, Country).

The meaning of the logic query above is to find all pairs [Capital, Country] such that Country is a country, Country borders Greece, and Capital is the capital of Country. The interpreter also needs to consult a world model that describes the structure of the surrounding world, as shown by the figure below. Typically, the model contains a hierarchy of classes of world objects and constraints on the types of arguments each logic predicate may have.39

Figure 8 Hierarchy in world model40

39 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P. p.18-19
40 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P. p.19
22
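To make the procedural reading of the logic query concrete, here is a toy evaluation of it in Python. The fact base is invented for illustration; an actual system such as MASQUE would evaluate the query with a Prolog engine over the real database.

```python
# Invented facts standing in for database content; note that
# Bulgaria is deliberately missing from IS_COUNTRY to show that
# every goal in the query body must succeed.
IS_COUNTRY = {"Italy", "Albania", "Greece", "France"}
BORDERS    = {("Italy", "France"), ("Albania", "Greece"), ("Bulgaria", "Greece")}
CAPITAL_OF = {"Italy": "Rome", "Albania": "Tirana", "Bulgaria": "Sofia"}

def answer():
    # Find all [Capital, Country] such that Country is a country,
    # Country borders Greece, and Capital is the capital of Country.
    return [
        [CAPITAL_OF[c], c]
        for c in sorted(IS_COUNTRY)
        if (c, "Greece") in BORDERS and c in CAPITAL_OF
    ]

print(answer())  # [['Tirana', 'Albania']]
```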
• 23. 5 MARKET TEST

In order to get a good estimate of the current state of the technology, the applications presented in the previous chapter were subjected to a neutral test.

5.1 Goals

The goals of the tests were:
• To get a thorough understanding of contemporary market applications;
• To get an estimate of the relevance and importance of these types of systems;
• To get some insight into which features are more and less important.

5.2 Tests

The tests were carried out on the Northwind database, a sample database with information on a shipping company. The database comes as a demo with all distributed copies of Microsoft Access. A number of queries of different types were posed to the respective natural language front ends. The questions were classified as simple (S), average (A), or complex (C). For a more comprehensive explanation of the considerations behind the testing procedures, see Appendix A.

5.3 Results

5.3.1 Impressions

5.3.1.1 Microsoft English Query

English Query is a development environment that enables programmers to produce natural language front ends for SQL 2000 databases. The product is included with SQL 2000. The tests were performed on a demo of English Query, developed by Microsoft to interface with the Northwind database. The user interface has five fields, with the following functionalities:
• Query (user input)
• Interpretation of the query
• Required operations
• Produced SQL statement
• Results

A screen shot from one of the queries is presented in Figure 9.
23
  • 24. Figure 9 Microsoft English Query. 5.3.1.2 Elfsoft Elfsoft works together with either VB or Access. Queries are entered in a query window (see Figure 10) and can be output either as database tables (see Figure 11) or in a graphical format. Figure 10 Elfsoft query window. 24
• 25. Figure 11 Elfsoft answer output.

Elfsoft also includes several other options for enhanced portability, including:
• An automatic analyser of any Access database
• Enabling the user to teach the program meanings of phrases
• Allowing the user to explain why a query failed (what was missing and/or wrong)
• Permitting the user to edit the dictionary
• Logging of queries for statistics

5.3.2 Query results

The results are summarised in Table 3. A full record of the questions asked is presented in Appendix B.

Table 3 Accuracy percentages.

Type of query    English Query    Elfsoft
Simple           71               23
Average          50               40
25
• 26. Complex          67               100
26
• 27. 6 FUTURE

During the mid-eighties it was believed that natural language processing systems would become a universal interface to databases worldwide41. However, due to the emergence of graphical interfaces to databases, the relative simplicity of SQL and the inherent problems of natural language processing, they have never really caught on commercially42. The current position of NLIDBs is probably best described by "it's a great idea, but…". Although their usefulness is appreciated, they are still at a research stage. There are several reasons why their usage is not taking off on a broader scale.

6.1 Language challenges

It is still very hard to encode the vast scope, complexity and ambiguity of a human language in a computer. The formalisms for representing language patterns are still not comprehensive enough to capture all the different ways that expressions and terms can be constructed and given meaning depending on the context.

6.2 Portability challenges

Although several systems for communication with individual databases have been successfully implemented and used, a general technique that would allow the user to specify the database and use the system with any database management system (whether Access, SQL 2000, Oracle or any other) is still rather elusive. This would require the system to be able to recognize the fields and attributes of the new storage source seamlessly. An even bigger hurdle to portability is the nature and scope of language understanding. Language use in different domains is very dissimilar, which means that any portable system has to have a huge vocabulary, with terms from many different application domains, and be able to recognize expressions from users of a wide variety of professions.

6.3 Competing systems

Graphical and form-based interfaces have become the de facto standard for database front ends.
Because of the challenges presented above, these other types of systems can generally be developed in less time and at lower cost.

6.4 Possible avenues

There is still a lot of research going on in this area. Having explored the application of natural language processing as database interfaces, the authors can see a number of different scenarios.

41 Johnson, T.
42 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P. pp.29-81
27
• 28. 6.4.1 Adaptation techniques

There is a need for methodologies that would enable the user to specify the data source in a general descriptive language and to supply a given set of terms used within the domain. This would make the application portable from database to database. This need has been recognised in [8], where a solution based on the general Resource Description Framework (RDF) is proposed. The system outlined in [8] learns the patterns and domain vocabulary of any given database automatically and also contains an interface that allows the user to change the database model (classes, properties, tables etc.).

6.4.2 Speech-based techniques

Certain authors [8] believe that natural language keyboard interfaces will be superseded by speech recognition systems. However, as such systems are of an even more complex nature, some of the linguistic challenges will have to be solved first. Research on NLIDBs can therefore be a base for the development of voice-based systems [8].

6.4.3 Learning algorithms

Every person has their own vocabulary and way of using language. There is absolutely no way that a program can contain all the words in a language or all the different meanings that a term may take on. Further, the use of language changes over time, which means that the semantics and vocabulary of a system may become obsolete after a certain time of use. An important challenge for a natural language database front end (or any natural language processing system in general) is to possess the ability to learn as it is used, evolve with the user and adapt to new users. This ability is, after all, one of the definitions of artificial intelligence. There are several ways in which this could potentially be achieved. Note that these are suggestions and not based on in-depth research.

6.4.3.1 User Dialogue

One way to achieve learning would be to include a lexical editor, where the user could enter language terms and link them to their synonyms.
They should also be able to specify the different forms of the word, e.g. noun plurals, adjective comparative forms, verb tenses etc. This ability is present in Elfsoft. 28
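A minimal sketch of such a lexical editor in Python, assuming a simple normalisation scheme in which every synonym or word form is mapped to one canonical term (this structure is invented for illustration, not Elfsoft's actual implementation):

```python
# A user-extensible lexicon: the user "teaches" synonyms and
# inflected forms, which are then normalised before parsing.
class Lexicon:
    def __init__(self):
        self.canonical = {}  # surface form -> canonical term

    def teach(self, term, *forms):
        """Link a canonical term to its synonyms and word forms."""
        for form in (term,) + forms:
            self.canonical[form.lower()] = term

    def normalise(self, word):
        # Unknown words pass through unchanged.
        return self.canonical.get(word.lower(), word)

lex = Lexicon()
lex.teach("employee", "employees", "staff", "workers")
lex.teach("salary", "salaries", "pay", "wages")

print([lex.normalise(w) for w in "show the pay of all workers".split()])
# ['show', 'the', 'salary', 'of', 'all', 'employee']
```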
  • 29. 6.4.3.2 Neural Networks By use of probabilistic techniques, a system might be able to adjust probabilities of different parses based on training texts and test texts, which have been parsed and tagged by the user or obtained from linguists. By continuously retraining the network with parsed texts from the database-specific domain, the neural network would be able to pick up language patterns and learn incrementally. 6.4.3.3 Genetic Algorithms Another way would be for the system to obtain feedback from the user on the accuracy (e.g. ask the user whether queries were answered correctly) and adjust its language processing structure (production rules) by the use of genetic algorithms. 29
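As a rough illustration of the feedback idea common to both suggestions above, a system could keep counts for competing production rules and nudge them whenever the user judges an answer correct or incorrect. The rules and the update scheme below are invented for illustration only.

```python
# Adjust the relative probabilities of competing production rules
# from user feedback on query correctness (a simple count-based
# scheme; a real system might use a trained network or a GA).
def retrain(rule_counts, rule, correct, step=1):
    """Reward or penalise a rule; counts never drop below 1."""
    delta = step if correct else -step
    rule_counts[rule] = max(1, rule_counts.get(rule, 1) + delta)

def probabilities(rule_counts):
    total = sum(rule_counts.values())
    return {r: c / total for r, c in rule_counts.items()}

counts = {"NP -> Det N": 3, "NP -> N": 1}
retrain(counts, "NP -> N", correct=True)
print(probabilities(counts))  # {'NP -> Det N': 0.6, 'NP -> N': 0.4}
```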
• 30. 7 CONCLUSIONS

The project has focused on two main topics:
• The techniques of translating a question in natural language into a database query, extracting the results that the user is looking for;
• The leading contemporary applications on the market.

The underlying methods belong in the general natural language processing area, and any system has to select among several different techniques involving different degrees of syntactic analysis, semantic processing, or a combination of the two. A general feature seems to be the translation of the query in two steps: first to an intermediate language and then to a database query language, e.g. SQL. The topic integrates approaches from several other facets of artificial intelligence, e.g. production systems, neural networks, expert systems, and machine learning.

Two of the leading commercial software packages were tested, with mixed results. Some rather complex queries were handled well, while the systems tended to have problems handling rather easy tasks. The sample sizes involved are too small to base any general conclusions on, however, because the configuration of the university computers at our disposal could not be used for testing the programs.

Many companies have overestimated the use of natural language processing in the database interface. Their interpretation of the system is that it is able to understand the significance of the query accurately. However, the system is not able to fully comprehend human language and jargon unless it has been given the definitions of the terms relating to the relevant database.43 This mainly involves the semantic analysis: a well-structured sentence may still lead to various meanings, which may not even be similar to one another. As a result, this will produce undesirable conclusions in the database queries.
This is one main reason why many systems tend to fail, and explains why most companies would still rather rely on SQL programmers for their database processing. Although these kinds of applications are rather unpopular, the authors enjoyed using them and encourage their further development. From the experience of the performed tests, such systems have the potential to make the task of searching for information a lot less tedious and time-consuming. The eventual success of natural language front ends will depend on how well they can adapt to new environments, both regarding databases and users' ways of using language. Two proposed benchmarks for these types of systems could be:

43 Timo Honkela
30
• 31. • It has to be able to learn and understand the database faster than the user;
• It has to learn natural language faster and more easily than the user can learn a programming language.
31
• 32. ACKNOWLEDGEMENTS

The authors wish to extend their appreciation to the following people for their support during the course of the project:
• Jon Greenblatt, President of English Language Frontend Software Co.
• Girish Mohata, Teaching Fellow, IT School, Bond University
32
• 33. 8 BIBLIOGRAPHY

1. Androutsopoulos, I., Ritchie, G.D., and Thanisch, P.: Natural Language Interfaces to Databases - An Introduction. Journal of Natural Language Engineering, vol. 1, no. 1. Cambridge University Press 1995.
2. Baker, J.K.: Trainable grammars for speech recognition. Speech Communication Papers for the 97th Meeting of the Acoustical Society of America, Acoustical Society of America 1979.
3. Beck, H., Mobini, A., Kadambari, V.: A Word is Worth 1000 Pictures: Natural Language Access to Digital Libraries. University of Florida. http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/beck/beckmain.html
4. Dialog-Oriented Use of Natural Language. http://www.dfki.uni-sb.de/vitra/papers/ro-man94/node5.html. Accessed 31/7-2001.
5. Dougherty, R.C.: Natural Language Computing: An English Generative Grammar in Prolog. Lawrence Erlbaum Associates 1994.
6. EasyAsk - Applications Overview. http://www.englishwizard.com/applications/index.cfm. Accessed 19/7-2001.
7. ECL I Vertiefung: natürlichsprachliche Zugangssysteme: chat80. http://www.ifi.unizh.ch/cl/broder/chat/chat80.htm. Accessed 12/7-2001.
8. Eriksson, G.: Översättarteknik. KFS AB 1984.
9. Groucho Marx in the movie Animal Crackers.
10. Hafner, C.D. and Gooden, K.: Portability of Syntax and Semantics in Datalog. ACM Transactions on Information Systems, vol. 3. Association for Computing Machinery 1985.
11. Honkela, T.: The WWW Version of Self-Organizing Maps in Natural Language Processing. Helsinki University of Technology. http://www.cis.hut.fi/~tho/thesis/. Viewed 22/7-2001.
33
• 34. 12. Johnson, T.: Natural Language Computing: The Commercial Applications. Ovum 1985.
13. Jurafsky, D. and Martin, J.H.: Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall 2000.
14. Language Reference. http://www.darpa.mil/ito/psum2000/h165-0.html. Accessed 14/7-2001.
15. Luger, G.F. and Stubblefield, W.A.: Artificial Intelligence: Structures and Strategies for Complex Problem Solving. Third Edition. Addison-Wesley 1999.
16. Manas Tungare – Natural Language Processing. http://www.manastungare.com/articles/nlp/natural-language-processing.asp. Accessed 30/7-2001.
17. Manning, C.D. and Schutze, H.: Foundations of Statistical Natural Language Processing. MIT Press 1999.
18. Natural-Language Database Interfaces from ELF Software Co. http://www.elfsoft.com/ns/FAQ.htm. Accessed 19/7-2001.
19. Palmer, M. and Finin, T.: Workshop on the Evaluation of Natural Language Processing Systems. Computational Linguistics, vol. 16, pp. 175-181. MIT Press 1990.
20. Penn Treebank Project. http://www.cis.upenn.edu/~treebank/. Accessed 10/7-2001.
21. Reis, P., Matias, J., Mamede, N.: Edite – A Natural Language Interface to Databases: A New Dimension for an Old Approach. http://digitais.ist.utl.pt/cstc/le/Papers/CSTCLE-12.PDF
22. Sharoff, S. and Zhigalov, V.: Register-Domain Separation as a Methodology for Development of Natural Language Interfaces to Databases. Proceedings of the IFIP TC.13 International Conference on Human-Computer Interaction. International Federation for Information Processing 1999.
34
• 35. 23. Tang, R.L.: Integrating Statistical and Relational Learning for Semantic Parsing: Applications to Learning Natural Language Interfaces for Databases. University of Texas, May 2000.
35
  • 36. 9 CONTRIBUTIONS The respective chapters were produced by the following group members: Chapter 1: Jun Chapter 2: Hakan Chapter 3: Aris and Hakan Chapter 4: Aris Chapter 5: All Chapter 6: Hakan Chapter 7: Hakan and Jun Bibliography and report compilation: Aris Appendices: Hakan 36
• 37. APPENDIX A

Evaluating Systems

Introduction

How good is a natural language database interface? The answer to this question is hard to define. A survey conducted during the course of this project revealed no formal evaluation techniques. As long as this situation remains, an unambiguous answer to the question will elude all stakeholders in this area.

Why is there a need?

The need for formal evaluation schemes in this field, as in any other, arises out of several stakeholders' desires:
• Users want a guide for choosing between systems;
• Companies want benchmarks for product development and improvement;
• Companies need metrics for proving the capabilities of their products.

Current Marketing

The companies behind contemporary techniques market their products with some of the following arguments:
• Ease of set-up and integration with new databases. It is often mentioned [6,18] that end users will be relieved of the task of having to learn and understand the internal workings of the Database Management System (DBMS)
• Money saved on searching
• Price
• Ease of integration across different DBMSs (Access, SQL Server, Oracle etc.)
• Accuracy
• The possibility to perform searches on several data stores simultaneously

Problems

There have been some attempts to define general formal metrics for natural language processing systems [19]. In [19], it was concluded that this is a difficult task for a number of reasons:
• Systems are built using a variety of techniques;
• They are used in many different domains, where users' needs vary;
• There is a lack of funding for research in this area.

However, it was also concluded that database front ends constitute one of the types of systems for which metrics could potentially be developed and adopted.
• 38. Black box metrics

In [19], a strong distinction is made between black box and glass box metrics. A black box approach only looks at the output generated by a certain input and does not take into account the architecture of the system or the efficiency of individual components.

Advantages:
• It takes the user's view;
• It can be applied across platforms, on systems with different implementation details;
• It is not tied to a specific implementation technique;
• It can be used over time, regardless of trends in database and programming methodologies.

Disadvantages:
• It doesn't give a good indication to programmers of what is actually wrong;
• It is badly suited for testing individual components of a system.

Proposed black box evaluation scheme

The proposed evaluation scheme takes into account several different aspects of the program in question. Evaluation can be based on the following characteristics:

Overall Characteristics
• User Friendliness: Is the application easy to understand and use? Are help files accessible and explanatory? Are error messages clear?
• Portability: Can it be used in conjunction with only a specific database? If not, how easy is it to integrate it with other databases?
• Speed: How fast are answers extracted?
• Fault Tolerance: Can the system recognize off-topic questions (queries on information that is not in the database) and give an informative response within a reasonable time frame?
• Accessibility: Can it be used over the web?

Vocabulary

Can the system accurately understand the following expressions44:
• What?
• Which?
• How many?
• How much?
• Show
• List

44 This list is arbitrary and may have to be expanded/contracted.
  • 39. • Tell • Count Ease of Interaction • Linguistic Flexibility: How many spelling errors in a word can the system tolerate and understand? Can it suggest alternative spellings45? • Probing questions: Are “follow-up” questions (questions referring to the previous answer) allowed? • Can the system adjust for bad grammar and still understand the question? Accuracy based on input complexity The system is asked a number of different questions. These questions are ranked as simple, average or complex. The accuracy (percentage of questions answered correctly) in each of the three categories is noted. The evaluation scheme formed the basis of the market tests of chapter 5. However, because of the small sample size of tested applications, no attempt to formalize the scheme or develop a metric based on it was made. 45 For an example of this capability, please try a search on http://www.google.com with a word containing a slight spelling error, e.g. elpheants.
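The accuracy measure described above is straightforward to compute mechanically. A small Python sketch, using the outcome labels of Appendix B; the sample protocol here is invented, and any outcome other than "Correct" is counted as a failure:

```python
# Accuracy by question complexity: the percentage of questions in
# each class (S, A, C) that the system answered correctly.
def accuracy_by_class(protocol):
    totals, correct = {}, {}
    for klass, outcome in protocol:
        totals[klass] = totals.get(klass, 0) + 1
        if outcome == "Correct":
            correct[klass] = correct.get(klass, 0) + 1
    return {k: round(100 * correct.get(k, 0) / totals[k]) for k in totals}

# A few invented sample results in Appendix B's format.
protocol = [("S", "Correct"), ("S", "No answer"), ("S", "Correct"),
            ("A", "Wrong"), ("A", "Correct"), ("C", "Correct")]
print(accuracy_by_class(protocol))  # {'S': 67, 'A': 50, 'C': 100}
```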
• 40. APPENDIX B

Test Protocol

The questions asked, their respective classifications, and the outcomes for the tested programs are presented in Table 4. In the classification column, S stands for Simple, A for Average, and C for Complex.

Table 4. Test Protocol.

Question | Class | Microsoft English Query outcome | Elfsoft outcome | Comments
Who is the oldest employee? | S | Correct | Correct | English Query gave the oldest person, Elfsoft the one who had worked the longest at Northwind.
Which supplier (currently) supplies the most products (which are not discontinued)? | C | Correct | Correct |
Which employee has handled the most orders? | A | No answer | Correct | Elfsoft gave too much information
What product is the most frequently ordered? | S | Correct | No answer |
List the country that has a supplier that ships tofu. | A | No answer | Correct |
Name the third most ordered product. | S | No answer | No answer |
What is the least ordered product? | S | Wrong | No answer |
How much is 1kg of Queso Cabrales? | S | Correct | No answer |
• 41. Question | Class | Microsoft English Query outcome | Elfsoft outcome | Comments
How much tofu have been ordered? | A | No answer | Correct | Elfsoft gave too much information
Show the phone number of united package. | S | Correct | Correct |
Tell me the names of the sales representatives | S | Correct | No answer |
Tell me the age of these people. | A | Correct | No answer |
And their phone numbers? | A | Correct | Correct |
Count the customers in Germany. | S | Correct | Correct |
What is the average age of the employees? | A | Correct | Wrong |
Name the employees that are older than average | A | Correct | No answer |
Give the name of the sales manager. | S | Correct | No answer |
Where is Around the Horn from? | S | Correct | No answer |
What is the median of the age of the employees? | A | No answer | Wrong |
List the names of the people working currently in the company. | S | No answer | Wrong |
Who is older than Janet? | S | Correct | No answer |
• 42. Question | Class | Microsoft English Query outcome | Elfsoft outcome | Comments
What can you tell me about Ernst Handel? | S | Too little information | No answer |
Which supplier supplies tofu but not longlife tofu? | C | Correct | Correct |
What are the contact names and phone numbers of customers that have received products sent with Federal Shipping? | C | No answer | Wrong |
What are the products that federal shipping ships? | A | Correct | Correct | Microsoft English Query had the wrong interpretation.
What customers received these shipments? | A | No answer | Wrong |
