Domain Specific Terminology Extraction (ICICT 2006)

AUTOMATIC DOMAIN SPECIFIC
TERMINOLOGY EXTRACTION USING A
DECISION SUPPORT SYSTEM

Imran Sarwar Bajwa1, Imran Siddique2 and M. Abbas Choudhary3
1 Department of Computer Science and Information Technology
The Islamia University of Bahawalpur, Pakistan
Phone: +92 (62) 9255466
E-mail: imransbajwa@yahoo.com

2 Faculty of Computer and Emerging Sciences
3 Balochistan University of Information Technology and Management Sciences
Quetta, Pakistan.
Phone: +92 (81) 9202463 Fax: +92 (81) 9201064
E-mail: imran@buitms.edu.pk
E-mail: abbas@buitms.edu.pk

Abstract: Natural languages, also called speech languages, are the major
tools of communication. This research paper presents a
natural language processing based automated system for
understanding speech language text. A new rule-based model
is presented for analyzing natural languages and
extracting the relevant meanings from a given text. The user
writes a natural language scenario in simple English in a
few paragraphs, and the designed system analyzes the
script given by the user. After
composite analysis and extraction of the associated information,
the designed system assigns particular meanings to an
assortment of speech language text on the basis of its
context. The designed system uses standard speech language
rules that are clearly defined for all speech languages, such as
English, Urdu, Chinese, Arabic and French. The designed
system provides a quick and reliable way to comprehend
speech language context and generate the respective meanings.
An application with such abilities can be more intelligent
and pertinent, and in particular it saves the user time.
Keywords: Terminology Extraction, Automatic text understanding,
Speech language processing, Information extraction

1. INTRODUCTION
In daily life, natural languages, or speech languages, are the main means of
communication. Each field of life uses its own typical set of words and
vocabulary, which is quite exclusive to that field. In computer science,
Natural Language Processing (NLP) is the field concerned with the analysis
of speech language text [4]. NLP introduces
many new concepts and terminologies to describe its models,
techniques and processes. Some of these terms come directly from
linguistics and the science of perception, while others were invented to
describe discoveries that did not fit into any previous category [3]. Natural
language understanding is a vast field. Understanding
a speech language means knowing what concept a
word or a particular phrase stands for in a particular context. To give
meaning to a sentence, a system should know how to link the
concepts together in a meaningful way. It is ironic that speech
languages are the easiest for human beings to learn, understand
and use, yet the hardest for a computer to understand.
In the modern era, computers have attained the ability to solve
complex mathematical and statistical problems with speed and grace, but
they are still unable to comprehend the basics of spoken and written
languages. The designed automated system can be useful in various
business and technical software that only needs to acquire the requirements
from the user. The designed system extracts the required information from
the given text and provides it to the computer for further automated
processing. Applications include automated generation of UML diagrams,
query processing, web mining, web template designing and user interface
designing. As far as time is concerned, the software takes far less than a
minute to explore all the facilities and services, and
this information is useful for engineers and computer users.

2. RELATED WORK
Natural languages have been an area of interest for the last century. In
the late nineteen sixties and seventies, researchers such as Noam
Chomsky (1965) [10], Maron, M. E. and Kuhns, J. L. (1960) [9] and Chow, C.
& Liu, C. (1968) [11] contributed to the area of information retrieval from
natural languages. They contributed to the analysis and understanding of
natural languages, but a lot of effort was still required for better
understanding and analysis. Some authors concentrated on this area in the
eighties and nineties, such as Losee, R. M. (1988) [6], Salton, G. & McGill, M.
(1995) [8] and Maron, M. E. & Kuhns, J. (1997) [9]. These authors worked on
lexical ambiguity [5] and information retrieval [7], probabilistic indexing
[9], database handling [3] and many other related areas.
In this research paper, a newly designed rule-based framework is
proposed that is able to read English language text and extract its
meaning after analyzing the text and extracting related information. The
conducted research provides a robust solution to the addressed problem. The
functionality of the current system is domain specific, but it can be
extended easily in the future according to the requirements. The current
designed system incorporates the capability of mapping user requirements
after reading the given requirements in plain text and drawing the set of
speech language contents.

3. SPEECH LANGUAGE ANALYSIS
The understanding and multi-aspect processing of natural languages,
which are also termed "speech languages", is one of the topics
of greatest interest in the field of artificial intelligence [6]. Natural
languages are irregular and asymmetrical. Traditionally, natural languages
are based on informal grammars. Geographical,
psychological and sociological factors influence the behaviour of
natural languages [17]. Their sets of words are not fixed, and they
change and vary from area to area and from time to time. Due to these
variations and inconsistencies, natural languages have different flavours;
for example, the English
language has more than half a dozen renowned flavours all over the world.
These flavours have different accents, vocabularies and phonological
aspects. These discrepancies and inconsistencies in
natural languages make them more difficult to process than
formal languages [13].
In the process of analyzing and understanding natural languages,
researchers usually face various problems. The problems
connected to the greater complexity of natural language include verb
conjugation, verbal inflexion and paradigms, active and passive voice, lexical
amplitude and the problem of ambiguity. All these problems are significant,
but the problem of ambiguity is the most difficult to handle. Ambiguity can be
addressed at the syntactic and semantic levels by using a sound and robust
rule-based system. During the interpretation of English
language text, the analysis is divided into four levels to handle
ambiguity: lexical analysis, syntactic analysis, semantic analysis and
pragmatic analysis. This distinction is not exhaustive, but it is useful for
focusing on the problem and for practical purposes [7].
3.1. Lexical Analysis:
Misspelled or mispronounced words are not considered in this
problem scenario. Errors of this kind can be resolved simply by
comparison with the expressions contained in a dictionary. Lexical
ambiguity arises when the same word takes on various meanings [3]. In
this case, the ambiguity lies in deciding which meaning
applies in which scenario. As an example, consider the adjective "cold"
in the sentences "That room is cold." and "That
person is cold." It is obvious that the same adjective takes on
different meanings in the two sentences. In the first sentence it indicates a
temperature; in the second, a particular character of a person.
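The "cold" example can be sketched as a small rule lookup. This is only an illustrative sketch and not the implementation of the designed system: the tables SENSES and NOUN_KINDS and the function resolve_sense are hypothetical names, and the sketch assumes that the noun described by the adjective has already been identified.

```python
# Hypothetical sense table: adjective -> kind of noun described -> meaning.
SENSES = {
    "cold": {
        "place": "temperature",
        "person": "character of a person",
    },
}

# Hypothetical table giving the kind of each known noun.
NOUN_KINDS = {"room": "place", "person": "person"}


def resolve_sense(adjective, noun):
    """Return the meaning of an adjective in the context of the noun it describes."""
    kind = NOUN_KINDS.get(noun.lower())
    meanings = SENSES.get(adjective.lower(), {})
    # None signals that the rules cannot resolve the ambiguity.
    return meanings.get(kind)


print(resolve_sense("cold", "room"))    # temperature
print(resolve_sense("cold", "person"))  # character of a person
```

A lookup of this kind only covers words and contexts that are listed in the tables; anything else is reported as unresolved rather than guessed.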
3.2. Syntactic Analysis:
Syntactic analysis is performed at the word level to recognize the word
category. The syntactic analysis must be able to
isolate subjects, verbs, objects and various complements. It is
a somewhat complex procedure. For example, consider "Imran eats the apple."
In this example, the actual meaning is that “Imran eats the apple”, but
ambiguity arises if the sentence is read as “the apple eats Imran”.
3.3. Semantic Analysis:
To analyze a phrase from the semantic point of view means to give it a
meaning, which is a crucial step.
Semantic ambiguities are the most common because a
computer is generally not in a position to distinguish logical situations
such as "The bus hit the pole while it was moving." All of us would surely
interpret the sentence as "The bus, while moving, hit the pole", while nobody
would dream of interpreting it as "the bus hit the pole while
the pole was moving".
3.4. Pragmatic Analysis:
Pragmatic ambiguities arise when communication happens between two
persons who do not share the same context. Consider the example "I
will arrive at the airport at 8 o'clock". In this example, if the other
person is on a different continent, the meaning can change completely.

4. DESIGNED SYSTEM ARCHITECTURE
The designed system has the ability to derive speech
language contents after reading the text scenario provided by the user. The
system works in four modules: text input acquisition, contents interpretation,
knowledge retrieval and terminology extraction, as shown in Figure 1.
4.1. Text input acquisition
This module acquires the input text scenario. The user provides the
business scenario in the form of paragraphs of text. The module reads the
input text character by character and generates words by concatenating
the input characters. This module is the implementation of the lexical
phase. Lexicons and tokens are generated in this module.
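The character-by-character word building described for this module can be sketched as follows. This is a minimal sketch under assumptions, not the code of the designed system: the function name tokenize is hypothetical, and treating each punctuation mark as its own token is an assumption that the paper does not state.

```python
def tokenize(text):
    """Build word tokens by concatenating input characters one at a time."""
    tokens = []
    current = []  # characters of the word being built
    for ch in text:
        if ch.isalnum() or ch == "'":
            current.append(ch)
        else:
            # A non-word character ends the current word, if any.
            if current:
                tokens.append("".join(current))
                current = []
            # Assumption: punctuation is kept as a separate token.
            if not ch.isspace():
                tokens.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


print(tokenize("Ahmed hit a ball with a racket."))
# ['Ahmed', 'hit', 'a', 'ball', 'with', 'a', 'racket', '.']
```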

4.2. Contents Interpretation
This module reads the words produced by the text input acquisition module.
These words are categorized into various classes such as verbs, helping
verbs, nouns, pronouns, adjectives, prepositions and conjunctions.
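One simple way to illustrate this categorization step is a dictionary lookup. The sketch below is hypothetical: the LEXICON table, the function categorize and the fallback of treating unknown words as nouns are assumptions made for illustration, and the "article" class is added only so that the example sentence can be tagged completely.

```python
# Hypothetical mini-lexicon mapping known words to their class.
LEXICON = {
    "hit": "verb",
    "eats": "verb",
    "is": "helping verb",
    "was": "helping verb",
    "he": "pronoun",
    "cold": "adjective",
    "with": "preposition",
    "by": "preposition",
    "and": "conjunction",
    "a": "article",
    "the": "article",
}


def categorize(words):
    """Pair every word with its class from the lexicon."""
    tagged = []
    for word in words:
        category = LEXICON.get(word.lower())
        if category is None:
            # Assumption: a word missing from the lexicon is treated as a noun.
            category = "noun"
        tagged.append((word, category))
    return tagged


print(categorize(["Imran", "eats", "the", "apple"]))
# [('Imran', 'noun'), ('eats', 'verb'), ('the', 'article'), ('apple', 'noun')]
```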

[Figure: modules Text Input Acquisition, Contents Interpretation, Knowledge
Retrieval and Terminology Extraction, with the labels Text Required, Text
Terms, Document Terms, Understanding Text, Analyzing Terms, Matching Terms,
Domain Knowledge, Required Terms and Output]

Figure 1: Architecture of the designed Decision Support System for Domain Specific Terminology
Extraction

4.3. Knowledge Retrieval
This module extracts the different objects and classes and their respective
attributes on the basis of the input provided by the preceding module.
Nouns are treated as classes and objects, and the properties associated
with them are treated as their attributes.
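The mapping from nouns to objects and attributes can be sketched as below. The paper does not state how attributes are associated with nouns, so the rule used here (adjectives that immediately precede a noun become its attributes) is an assumption, and extract_objects is a hypothetical name. The input is a list of (word, class) pairs as produced by a categorization step.

```python
def extract_objects(tagged):
    """Collect nouns as objects, each with the adjectives that precede it."""
    objects = {}
    pending = []  # adjectives seen since the last non-adjective word
    for word, category in tagged:
        if category == "adjective":
            pending.append(word)
        elif category == "noun":
            objects.setdefault(word, []).extend(pending)
            pending = []
        else:
            # Any other word class breaks the adjective-noun link.
            pending = []
    return objects


tagged_words = [
    ("the", "article"), ("red", "adjective"), ("ball", "noun"),
    ("hit", "verb"),
    ("the", "article"), ("old", "adjective"), ("racket", "noun"),
]
print(extract_objects(tagged_words))
# {'ball': ['red'], 'racket': ['old']}
```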

4.4. Terminology Extraction
This module finally uses speech language content symbols to constitute
the various speech language contents by combining the available symbols
according to the information extracted by the previous module.

5. USED METHODOLOGY
Conventional natural language processing systems use rule-based
approaches. Agents are another way to address this problem [8]. In this
research, a rule-based algorithm has been used, which has a robust ability to
read, understand and extract the desired information. First, the basic
elements of the language grammar, such as verbs, nouns and adjectives, are
extracted; further processing is then performed on the basis of this
extracted information. In linguistic terms, verbs often specify actions, and
noun phrases the objects that participate in the action [16]. Each noun
phrase's role then specifies how the object participates in the action, as in
the following example:
"Ahmed hit a ball with a racket."
A procedure that understands such a sentence must discover that Ahmed is the
agent because he performs the action of hitting, that the ball is the thematic
object because it is the object hit, and that the racket is an instrument
because it is the tool with which the hitting is done. Thus, sentence analysis
requires, in part, identifying these roles. The number of thematic roles
embraced by various theories varies considerably. Some people use about half a
dozen thematic roles [9]. Others use several times as many. The exact
number does not matter much, as long as it is great enough to expose
natural constraints on how verbs and thematic-role instances form sentences.
Agent: The agent causes the action to occur, as in "Ahmed hit the ball",
where Ahmed is the agent who performs the task. In a passive
sentence, the agent may also appear in a prepositional phrase: "The ball was
hit by Ahmed."
Co-agent: The word with may introduce a noun phrase that serves as a partner
to the principal agent. The two carry out the action together: "Ahmed
played tennis with Ali."
Beneficiary: The beneficiary is the person for whom an action has been
performed: "Ahmed bought the balls for Ali." In this sentence Ali is the
beneficiary.
Thematic object: The thematic object is the object the sentence is really
all about, typically the object undergoing a change. Often the thematic
object is the same as the syntactic direct object, as in "Ahmed hit the ball."
Here the ball is the thematic object.
Conveyance: The conveyance is something in which or on which one travels:
"Aslam always goes by train."
Trajectory: Motion from source to destination takes place over a
trajectory. In contrast to the other role possibilities, several prepositions
can serve to introduce trajectory noun phrases: "Ahmed and Aslam went in
through the front door."
Location: The location is where an action occurs. As in the trajectory role,
several prepositions are possible, each of which conveys meaning in addition
to serving as a signal that a location noun phrase is coming: "Ahmed and Ali
studied in the library, at a desk, by the wall, under a picture, near the
door."
Time: Time specifies when an action occurs. Prepositions such as at, before
and after introduce noun phrases serving as time role fillers: "Ahmed and Ali
left before noon."
Duration: Duration specifies how long an action takes. Prepositions such as
for indicate duration: "Ali and Zia jogged for an hour."
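The role definitions above can be turned into a rough rule-based labeller for simple active-voice sentences. The sketch below is an assumption-laden illustration, not the algorithm of the designed system: assign_roles and PREPOSITION_ROLES are hypothetical names, the verb is supplied by the caller, and each preposition is mapped to a single role even though, as the list above notes, the same preposition can signal different roles (for example, with can mark a co-agent as well as an instrument, and for can mark duration as well as a beneficiary).

```python
# Hypothetical one-to-one mapping from preposition to thematic role.
PREPOSITION_ROLES = {
    "with": "instrument",
    "for": "beneficiary",
    "by": "conveyance",
    "through": "trajectory",
    "in": "location",
    "near": "location",
    "at": "time",
    "before": "time",
    "after": "time",
}
ARTICLES = {"a", "an", "the"}


def assign_roles(words, verb):
    """Label the noun phrases of a simple active-voice sentence."""
    verb_index = words.index(verb)
    # Rule: everything before the verb is the agent.
    roles = {"agent": " ".join(words[:verb_index])}
    # Rule: the phrase right after the verb is the thematic object.
    role = "thematic object"
    phrase = []
    for word in words[verb_index + 1:]:
        lower = word.lower()
        if lower in PREPOSITION_ROLES:
            # A preposition closes the current phrase and opens a new role.
            if phrase:
                roles[role] = " ".join(phrase)
            role = PREPOSITION_ROLES[lower]
            phrase = []
        elif lower not in ARTICLES:
            phrase.append(word)
    if phrase:
        roles[role] = " ".join(phrase)
    return roles


print(assign_roles(["Ahmed", "hit", "a", "ball", "with", "a", "racket"], "hit"))
# {'agent': 'Ahmed', 'thematic object': 'ball', 'instrument': 'racket'}
```

Resolving the ambiguous prepositions would need further rules, for instance checking whether the noun phrase after with names a person.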

6. CONCLUSION
The designed system for Automatic Domain Specific Terminology
Extraction using a Decision Support System is a robust framework for
inferring the appropriate meanings from a given text. The accomplished
research is related to the understanding of human languages. Human
beings need specific linguistic knowledge for generating and understanding
speech language contents, and it is difficult for computers to perform this
task. The rule-based framework for speech language context understanding
is able to read user-provided text, extract the related information and
ultimately give meanings to the extracted contents. The designed system is
very effective, with an accuracy of up to 90%. The designed system
depicts the meanings of the given sentence or paragraph efficiently. A
graphical user interface has also been provided to the user for
entering the input scenario in a proper way and generating speech language
contents.

7. FUTURE WORK
The designed system analyzes and extracts the contents of a speech
language, namely English. The current system only works for active-voice
sentences. Passive-voice sentences still need to be handled to make the
system more efficient and effective for various business applications. There
is also room for improvement in the algorithms for better extraction
of language parts and understanding of the meanings of the speech language
context. The current accuracy of generating meanings from a text is about 85%
to 90%. It can be raised to 95% or even more by improving the
algorithms and adding a learning ability to the system.

8. REFERENCES
[1] - Andrade M. A. and Bork P. “Automated Extraction of Information in
Molecular Biology” FEBS Letters, pp. 12-17, 2000.
[2] - Blaschke C., Andrade M. A., Ouzounis C. and Valencia A. “Automatic
extraction of biological information from scientific text: protein–protein
interactions”. ISMB, pp. 60–67, 1999.
[3] - C. A. Thompson, R. J. Mooney and L. R. Tang, “Learning to parse natural
language database queries into logical form”, in: Workshop on Automata
Induction, Grammatical Inference and Language Acquisition, 1997.
[4] - Pustejovsky J, Castaño J., Zhang J, Kotecki M, Cochran B. “Robust relational
parsing over biomedical literature: Extracting inhibit relations”. Pac. Symp.
Biocomput, 362-73, 2002.
[5] - Krovetz, R., & Croft, W. B. “Lexical ambiguity and information retrieval”,
ACM Transactions on Information Systems, 10, pp. 115–141, 1992.
[6] - Losee, R. M. “Parameter estimation for probabilistic document retrieval
models”. Journal of the American Society for Information Science, 39(1), pp. 8–
16, 1988.

[7] - Partee, B. H., Meulen, A. t., & Wall, R. E. “Mathematical Methods in
Linguistics”. Kluwer, Dordrecht, The Netherlands. 1999.
[8] - Salton, G., & McGill, M. “Introduction to Modern Information Retrieval”
McGraw-Hill, New York., 1995
[9] - Maron, M. E. & Kuhns, J. L. “On relevance, probabilistic indexing, and
information retrieval” Journal of the ACM, 7, 216–244, 1997.
[10] - Chomsky, N. “Aspects of the Theory of Syntax”. MIT Press, Cambridge,
Mass., 1965.
[11] - Chow, C., & Liu, C. “Approximating discrete probability distributions with
dependence trees”. IEEE Transactions on Information Theory, IT-14(3), 462–
467, 1968.
[12] - C. A. Thompson, R. J. Mooney and L. R. Tang, Learning to parse natural language
database queries into logical form, Workshop on Automata Induction, Grammatical
Inference and Language Acquisition (1997).
[13] - Taffet, Mary D. (2001). GENTECH 2001 Scholarship Proposal: Automatic Tagging
of Genealogical Data to Enhance Web-based Retrieval. Available at:
<http://web.syr.edu/~mdtaffet/GENTECH_Scholarship_Proposal.htm>.
[14] - Nerbonne, John (2003): "Natural language processing in computer-assisted language
learning". In: Mitkov, R. (ed.): The Oxford Handbook of Computational Linguistics.
Oxford: 670-698.
[15] - Menzel, Wolfgang/Schröder, Ingo (1998): "Error diagnosis for language learning
systems". Proceedings of NLP+IA'98 1: 526-530.
[16] - Etchegoyhen, Thierry/ Wehrle, Thomas (1998): "Overview of GBGen: A large-scale,
domain independent syntactic generator". In: Hovy, Eduard (ed.): Proceedings of the 9th
International Workshop on Natural Language Generation. Niagara-on-the-Lake: 288-291.
[17] - Granger, Sylviane (2003): "Error-tagged Learner Corpora and CALL: A Promising
Synergy". CALICO Journal 20/3: 465-480.
[18] - Menzel, Wolfgang/Schröder, Ingo (1998): "Error diagnosis for language learning
systems". Proceedings of NLP+IA'98 1: 526-530.
[19] - Nerbonne, John (2003): "Natural language processing in computer-assisted language
learning". In: Mitkov, R. (ed.): The Oxford Handbook of Computational Linguistics.
Oxford: 670-698.
