Domain Specific Terminology Extraction (ICICT 2006)


Published on

Imran Sarwar Bajwa, M. Imran Siddique, M. Abbas Choudhary (2006), "Automatic Domain Specific Terminology Extraction using a Decision Support System", in IEEE 4th International Conference on Information and Communication Technology (ICICT 2006), Cairo, Egypt, pp. 651-659.



AUTOMATIC DOMAIN SPECIFIC TERMINOLOGY EXTRACTION USING A DECISION SUPPORT SYSTEM

Imran Sarwar Bajwa 1, Imran Siddique 2 and M. Abbas Choudhary 3

1 Department of Computer Science and Information Technology, The Islamia University of Bahawalpur, Pakistan. Phone: +92 (62) 9255466. E-mail:
2 Faculty of Computer and Emerging Sciences
3 Balochistan University of Information Technology and Management Sciences, Quetta, Pakistan. Phone: +92 (81) 9202463. Fax: +92 (81) 9201064. E-mail:

Speech languages, or natural languages, are the major tools of communication. This research paper presents a natural language processing based automated system for understanding speech language text. A new rule-based model is presented for analyzing natural language and extracting the relevant meanings from a given text. The user writes a natural language scenario in simple English in a few paragraphs, and the designed system is capable of analyzing the given script. After composite analysis and extraction of the associated information, the designed system assigns particular meanings to an assortment of speech language text on the basis of its context. The designed system uses standard speech language rules that are clearly defined for all speech languages such as
English, Urdu, Chinese, Arabic, French, etc. The designed system provides a quick and reliable way to comprehend speech language context and generate the respective meanings. An application with such abilities can be more intelligent and pertinent, specifically in saving the user's time.

Keywords: Terminology Extraction, Automatic text understanding, Speech language processing, Information extraction

1. INTRODUCTION

In daily life, natural languages or speech languages are the main means of communication. Particular and typical sets of words and vocabulary collections are used in each field of life, and they are quite exclusive to their use. In computer science, Natural Language Processing (NLP) is the field concerned with the analysis of speech language text [4]. NLP introduces many new concepts and terminologies to describe its models, techniques and processes. Some of these terms come directly from linguistics and the science of perception, while others were invented to describe discoveries that did not fit into any previous category [3]. Natural language understanding is a hefty field to grasp. Understanding a speech language means knowing what concept a word or a particular phrase stands for under a particular perspective. To give meaning to a particular sentence, a system should know how to link the concepts together in a meaningful way. It is ironic that speech languages are the easiest for human beings to learn, understand and use, yet the hardest for a computer to understand.

In the modern era, computers have attained the ability to solve complex mathematical and statistical problems with speed and grace, but they are still inefficient at comprehending the basics of spoken and written languages. The designed automated system can be useful in various business and technical software by simply acquiring the requirements from the user.
The designed system extracts the required information from the given text and provides it to the computer for further automated processing. Applications include automated generation of UML diagrams, query processing, web mining, web template designing, user interface designing, etc. The time this software takes to
explore all the facilities and services is far less than a minute, and this information is adequately useful for engineers and computer users.

2. RELATED WORK

Natural languages have been an area of interest for the last century. In the late nineteen-sixties and seventies, many researchers such as Noam Chomsky (1965) [10], Maron, M. E. and Kuhns, J. L. (1960) [9], and Chow, C. and Liu, C. (1968) [11] contributed to the area of information retrieval from natural languages. They contributed to the analysis and understanding of natural languages, but a lot of effort was still required for better understanding and analysis. Some authors concentrated on this area in the eighties and nineties, such as Losee, R. M. (1988) [6] and Salton, G. and McGill, M. [8]. These authors worked on lexical ambiguity [5] and information retrieval [7], probabilistic indexing [9], database handling [3] and many other related areas.

In this research paper, a newly designed rule-based framework is proposed that is able to read English language text and extract its meanings after analyzing and extracting the related information. The conducted research provides a robust solution to the addressed problem. The functionality of the conducted research is domain specific, but it can easily be enhanced in the future according to requirements. The current designed system incorporates the capability of mapping user requirements after reading the given requirements in plain text and drawing the set of speech language contents.

3. SPEECH LANGUAGE ANALYSIS

The understanding and multi-aspect processing of natural languages, also termed "speech languages", is one of the topics of greatest interest in the field of artificial intelligence [6]. Natural languages are irregular and asymmetrical. Traditionally, natural languages are based on informal grammars.
Geographical, psychological and sociological factors influence the behaviour of natural languages [17]. Their sets of words are undefined, and they change and vary from area to area and from time to time. Due to these variations and inconsistencies, natural languages have different flavours; for example, the English
language has more than half a dozen renowned flavours all over the world. These flavours have different accents, sets of vocabulary and phonological aspects. These discrepancies and inconsistencies in natural languages make processing them a difficult task compared to formal languages [13].

In the process of analyzing and understanding natural languages, various problems are faced by researchers. The problems connected to the greater complexity of natural language include verb conjugation, verbal inflexion and paradigms, active and passive voice, lexical amplitude, the problem of ambiguity, etc. All these problems are significant to handle, but the problem of ambiguity is the most difficult. Ambiguity can be resolved at the syntax and semantic levels by using a sound and robust rule-based system. During the interpretation of English language text, the analysis is divided into four classes to handle ambiguity: lexical analysis, syntactic analysis, semantic analysis and pragmatic analysis. This distinction is not exhaustive, but it is useful for focusing on the problem and for practical purposes [7].

3.1. Lexical Analysis:

The problem of incorrectly written or pronounced words is omitted from this scenario; such errors are simply solvable through comparison with the expressions contained in a dictionary. Lexical ambiguity arises when the same word assumes various meanings [3]. In this case the ambiguity lies in deciding which meaning applies in which scenario. As an example, consider the adjective "cold" in the sentences "That room is cold." and "That person is cold." Obviously, the same adjective "cold" assumes different meanings in the two phrases: in the first sentence it indicates a temperature, in the second a particular character of a person.

3.2. Syntactic Analysis:

Syntactic analysis is performed at the word level to recognize the word category.
The syntactic analysis must be able to isolate subjects, verbs, objects and various complements. It is a somewhat complex procedure. For example, consider "Imran eats the apple." In this
example, the actual meaning is that "Imran eats the apple", but ambiguity can be asserted if the sentence is conceived as "the apple eats Imran".

3.3. Semantic Analysis:

To analyze a phrase from the semantic point of view means to give it a meaning. This shows that we have arrived at a crucial point. Semantic ambiguities are most common because a computer is generally not in a position to distinguish between logical situations, as in "The bus hit the pole while it was moving." All of us would surely interpret the phrase as "The bus, while moving, hit the pole.", while nobody would dream of attributing to the sentence the meaning "the bus hit the pole while the pole was moving".

3.4. Pragmatic Analysis:

Pragmatic ambiguities arise when communication happens between two persons who do not share the same context, as in "I will arrive at the airport at 8 o'clock." In this example, if the speaker belongs to a different continent, the meaning can be totally changed.

4. DESIGNED SYSTEM ARCHITECTURE

The designed system has the ability to extract speech language contents after reading the text scenario provided by the user. The system works in the following modules: text input acquisition, contents interpretation, knowledge retrieval and terminology extraction, as shown in the following figure.

4.1. Text Input Acquisition

This module acquires the input text scenario. The user provides the business scenario in the form of paragraphs of text. The module reads the input text as characters and generates words by concatenating the input characters. This module is the implementation of the lexical phase; lexicons and tokens are generated here.
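The lexical phase described above, reading characters and concatenating them into word tokens, can be sketched as follows. This is a minimal illustrative sketch, not the authors' actual implementation; the function name and tokenization rules are assumptions for the example.

```python
def tokenize(text):
    """Lexical phase sketch: concatenate input characters into word tokens."""
    tokens = []
    word = []
    for ch in text:
        if ch.isalnum() or ch == "'":
            word.append(ch)               # build up the current word
        else:
            if word:                      # a word boundary was reached
                tokens.append("".join(word))
                word = []
            if not ch.isspace():          # keep punctuation as its own token
                tokens.append(ch)
    if word:
        tokens.append("".join(word))
    return tokens

print(tokenize("Ahmed hit a ball with a racket."))
# ['Ahmed', 'hit', 'a', 'ball', 'with', 'a', 'racket', '.']
```

The resulting token list is exactly the word-level input that the next module expects.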
4.2. Contents Interpretation

This module reads the input from module 1 in the form of words. These words are categorized into various classes such as verbs, helping verbs, nouns, pronouns, adjectives, prepositions, conjunctions, etc.

[Figure: a pipeline in which the text document passes through Text Input Acquisition, Contents Interpretation, Knowledge Retrieval (consulting domain knowledge) and Terminology Extraction, producing the required terms as output.]

Figure 1: Architecture of the designed Decision Support System for Domain Specific Terminology Extraction

4.3. Knowledge Retrieval

This module extracts the different objects and classes and their respective attributes on the basis of the input provided by the preceding module. Nouns are symbolized as classes and objects, and their associated properties are termed attributes.
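The categorization step of the Contents Interpretation module and the noun-to-class mapping of the Knowledge Retrieval module can be sketched together as below. This is a hedged sketch under stated assumptions: the tiny category lexicon is an illustrative stand-in for the system's real dictionaries, and the function names are hypothetical.

```python
# Illustrative category lexicon; the real system would use full dictionaries.
LEXICON = {
    "verb":        {"eats", "hit", "bought", "played"},
    "noun":        {"Imran", "apple", "ball", "racket", "Ahmed", "Ali"},
    "pronoun":     {"he", "she", "it", "they"},
    "preposition": {"with", "for", "by", "in", "at"},
    "determiner":  {"a", "an", "the"},
}

def categorize(words):
    """Contents interpretation: tag each word with its category."""
    tagged = []
    for w in words:
        category = next((c for c, vocab in LEXICON.items() if w in vocab),
                        "unknown")
        tagged.append((w, category))
    return tagged

def candidate_classes(tagged):
    """Knowledge retrieval: nouns become candidate classes/objects."""
    return [w for w, c in tagged if c == "noun"]

tags = categorize(["Imran", "eats", "the", "apple"])
print(tags)
# [('Imran', 'noun'), ('eats', 'verb'), ('the', 'determiner'), ('apple', 'noun')]
print(candidate_classes(tags))
# ['Imran', 'apple']
```

Any word absent from the lexicon is tagged "unknown" rather than guessed, mirroring the dictionary-comparison approach described in Section 3.1.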
4.4. Terminology Extraction

This module finally uses speech language content symbols to constitute the various speech language contents by combining the available symbols according to the information extracted by the previous module.

5. USED METHODOLOGY

Conventional natural language processing systems use rule-based approaches; agents are another way to address this problem [8]. In this research, a rule-based algorithm has been used which has a robust ability to read, understand and extract the desired information. First, the basic elements of the language grammar are extracted, such as verbs, nouns, adjectives, etc.; then further processing is performed on the basis of this extracted information. In linguistic terms, verbs often specify actions, and noun phrases the objects that participate in the action [16]. Each noun phrase's role then specifies how the object participates in the action, as in the following example:

"Ahmed hit a ball with a racket."

A procedure that understands such a sentence must discover that Ahmed is the agent because he performs the action of hitting, that the ball is the thematic object because it is the object hit, and that the racket is an instrument because it is the tool with which the hitting is done. Thus sentence analysis requires, in part, answering such questions. The number of thematic roles embraced by various theories varies considerably: some use about half a dozen thematic roles [9], others several times as many. The exact number does not matter much, as long as it is large enough to expose the natural constraints on how verbs and thematic instances form sentences.

Agent: The agent causes the action to occur, as in "Ahmed hit the ball," where Ahmed is the agent who performs the task. In a passive sentence, however, the agent may appear in a prepositional phrase: "The ball was hit by Ahmed."

Co-agent: The word "with" may introduce a noun phrase that serves as a partner to the principal agent. The two carry out the action together: "Ahmed played tennis with Ali."
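The rule-based role assignment described so far, where the pre-verb noun is the agent and a "with" phrase signals the co-agent, can be sketched as below. This is an assumption-laden illustration, not the paper's implementation; the verb and noun sets are supplied by hand here, whereas the real system obtains them from its grammar-extraction step.

```python
# Prepositions that signal a role; the table can be extended for further roles.
PREP_ROLES = {"with": "co-agent"}

def assign_roles(words, verbs, nouns):
    """Very small rule-based thematic role assigner over a tokenized sentence."""
    roles = {}
    pending_prep = None
    seen_verb = False
    for w in words:
        if w in verbs:
            seen_verb = True
        elif w in PREP_ROLES:
            pending_prep = w
        elif w in nouns:
            if pending_prep:              # role signalled by a preposition
                roles[PREP_ROLES[pending_prep]] = w
                pending_prep = None
            elif not seen_verb:           # pre-verb noun: the agent
                roles["agent"] = w
            else:                         # post-verb noun: the thematic object
                roles["thematic object"] = w
    return roles

sentence = "Ahmed played tennis with Ali".split()
print(assign_roles(sentence, verbs={"played"}, nouns={"Ahmed", "tennis", "Ali"}))
# {'agent': 'Ahmed', 'thematic object': 'tennis', 'co-agent': 'Ali'}
```

The remaining roles listed below can be handled by adding their signalling prepositions (for, by, at, etc.) to the PREP_ROLES table.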
Beneficiary: The beneficiary is the person for whom an action has been performed: "Ahmed bought the balls for Ali." In this sentence, Ali is the beneficiary.

Thematic object: The thematic object is the object the sentence is really all about, typically the object undergoing a change. Often the thematic object is the same as the syntactic direct object, as in "Ahmed hit the ball." Here the ball is the thematic object.

Conveyance: The conveyance is something in which or on which one travels: "Aslam always goes by train."

Trajectory: Motion from source to destination takes place over a trajectory. In contrast to the other role possibilities, several prepositions can serve to introduce trajectory noun phrases: "Ahmed and Aslam went in through the front door."

Location: The location is where an action occurs. As with the trajectory role, several prepositions are possible, which convey meanings in addition to serving as a signal that a location noun phrase follows: "Ahmed and Ali studied in the library, at a desk, by the wall, near a picture, near the door."

Time: Time specifies when an action occurs. Prepositions such as at, before and after introduce noun phrases serving as time role fillers: "Ahmed and Ali left before noon."

Duration: Duration specifies how long an action takes. Prepositions such as for indicate duration: "Ali and Zia jogged for an hour."

6. CONCLUSION

The designed system for Automatic Domain Specific Terminology Extraction using a Decision Support System is a robust framework for inferring the appropriate meanings from a given text. The accomplished research relates to the understanding of human languages. Human beings need specific linguistic knowledge for generating and understanding speech language contents, and it is difficult for computers to perform this task. The speech language context understanding framework has the ability to read user-provided text, extract the related information and ultimately give meanings to the extracted contents. The designed system is
very effective and has high accuracy, up to 90%. The designed system depicts the meanings of the given sentence or paragraph efficiently. An elegant graphical user interface has also been provided for entering the input scenario in a proper way and generating the speech language contents.

7. FUTURE WORK

The designed system analyzes and extracts the contents of a speech language such as English. The current designed system only works for active-voice sentences; passive-voice sentences remain to be handled to make the system more efficient and effective for various business applications. There is also room for improvement in the algorithms, for better extraction of language parts and understanding of the meanings of the speech language context. The current accuracy of generating meanings from a text is about 85% to 90%. It can be enhanced to 95% or more by improving the algorithms and inducing the ability of learning in the system.

8. REFERENCES

[1] Andrade, M. A. and Bork, P., "Automated Extraction of Information in Molecular Biology", FEBS Letters, pp. 12-17, 2000.
[2] Blaschke, C., Andrade, M. A., Ouzounis, C. and Valencia, A., "Automatic extraction of biological information from scientific text: protein-protein interactions", ISMB, pp. 60-67, 1999.
[3] Thompson, C. A., Mooney, R. J. and Tang, L. R., "Learning to parse natural language database queries into logical form", Workshop on Automata Induction, Grammatical Inference and Language Acquisition, 1997.
[4] Pustejovsky, J., Castaño, J., Zhang, J., Kotecki, M. and Cochran, B., "Robust relational parsing over biomedical literature: Extracting inhibit relations", Pac. Symp. Biocomput., pp. 362-373, 2002.
[5] Krovetz, R. and Croft, W. B., "Lexical ambiguity and information retrieval", ACM Transactions on Information Systems, 10, pp. 115-141, 1992.
[6] Losee, R. M., "Parameter estimation for probabilistic document retrieval models", Journal of the American Society for Information Science, 39(1), pp. 8-16, 1988.
[7] Partee, B. H., ter Meulen, A. and Wall, R. E., "Mathematical Methods in Linguistics", Kluwer, Dordrecht, The Netherlands, 1990.
[8] Salton, G. and McGill, M., "Introduction to Modern Information Retrieval", McGraw-Hill, New York, 1983.
[9] Maron, M. E. and Kuhns, J. L., "On relevance, probabilistic indexing, and information retrieval", Journal of the ACM, 7, pp. 216-244, 1960.
[10] Chomsky, N., "Aspects of the Theory of Syntax", MIT Press, Cambridge, Mass., 1965.
[11] Chow, C. and Liu, C., "Approximating discrete probability distributions with dependence trees", IEEE Transactions on Information Theory, IT-14(3), pp. 462-467, 1968.
[12] Thompson, C. A., Mooney, R. J. and Tang, L. R., "Learning to parse natural language database queries into logical form", Workshop on Automata Induction, Grammatical Inference and Language Acquisition, 1997.
[13] Taffet, Mary D., "GENTECH 2001 Scholarship Proposal: Automatic Tagging of Genealogical Data to Enhance Web-based Retrieval", 2001. Available at: <>.
[14] Nerbonne, John, "Natural language processing in computer-assisted language learning", in Mitkov, R. (ed.): The Oxford Handbook of Computational Linguistics, Oxford, pp. 670-698, 2003.
[15] Menzel, Wolfgang and Schröder, Ingo, "Error diagnosis for language learning systems", Proceedings of NLP+IA98, 1, pp. 526-530, 1998.
[16] Etchegoyhen, Thierry and Wehrle, Thomas, "Overview of GBGen: A large-scale, domain independent syntactic generator", in Hovy, Eduard (ed.): Proceedings of the 9th International Workshop on Natural Language Generation, Niagara-on-the-Lake, pp. 288-291, 1998.
[17] Granger, Sylviane, "Error-tagged Learner Corpora and CALL: A Promising Synergy", CALICO Journal, 20(3), pp. 465-480, 2003.
[18] Menzel, Wolfgang and Schröder, Ingo, "Error diagnosis for language learning systems", Proceedings of NLP+IA98, 1, pp. 526-530, 1998.
[19] Nerbonne, John, "Natural language processing in computer-assisted language learning", in Mitkov, R. (ed.): The Oxford Handbook of Computational Linguistics, Oxford, pp. 670-698, 2003.