This paper introduces a Named Entity Recognition (NER) approach for textual corpora. The system is developed with supervised statistical methods and can categorize NEs belonging to whatever domain it is trained on. Since Named Entities appear in text surrounded by contexts (the words to the left and right of the NE), we focus on extracting NE contexts from text and then performing statistical computations on them, using n-gram modeling for context extraction. Our methodology first extracts the left and right tri-grams surrounding NE instances in the training corpus and calculates their probabilities; all extracted tri-grams are then stored in a file along with their probabilities. During testing, the system detects unrecognized NEs in the testing corpus and categorizes them using the tri-gram probabilities calculated at training time. The proposed system consists of two modules: Knowledge Acquisition, which extracts the tri-grams surrounding NEs in the training corpus, and NE Recognition, which categorizes Named Entities in the testing corpus.
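The tri-gram extraction and probability step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the tokenised example sentence and the span format are invented:

```python
from collections import Counter

def extract_trigram_contexts(tokens, ne_spans):
    """Collect the left and right tri-grams surrounding each tagged NE.

    tokens   -- list of word tokens
    ne_spans -- list of (start, end, category) index spans marking NEs
    """
    contexts = []
    for start, end, category in ne_spans:
        left = tuple(tokens[max(0, start - 3):start])
        right = tuple(tokens[end:end + 3])
        contexts.append((left, right, category))
    return contexts

def context_probabilities(contexts):
    """Estimate P(category | context) from raw counts."""
    pair_counts = Counter((ctx, cat) for left, right, cat in contexts
                          for ctx in (left, right))
    ctx_counts = Counter(ctx for left, right, _ in contexts
                         for ctx in (left, right))
    return {(ctx, cat): n / ctx_counts[ctx]
            for (ctx, cat), n in pair_counts.items()}

tokens = "the president of France Emmanuel Macron visited Berlin".split()
spans = [(4, 6, "PERSON")]  # "Emmanuel Macron"
probs = context_probabilities(extract_trigram_contexts(tokens, spans))
```

At test time, an unrecognized NE would be assigned the category whose surrounding tri-grams carry the highest stored probability.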
Text mining seeks to uncover new, previously unknown, or hidden information by automatically extracting it from a variety of written resources. Applying knowledge-discovery methods to unstructured text is known as Knowledge Discovery in Text or Text Data Mining, also called simply Text Mining. Most text-mining techniques are founded on the statistical study of a term, either a word or a phrase. Several algorithms have been used in earlier work: the Single-Link algorithm and Self-Organizing Maps (SOM), which offer an approach to visualizing high-dimensional data and are useful projection-based tools for processing textual data; and genetic and sequential algorithms, which provide the capability for multiscale representation of datasets and are fast to compute, with low CPU time, on reduced subsets such as Isolet in unsupervised feature selection. We propose a Vector Space Model with a concept-based analysis algorithm, which improves text-clustering quality so that better clustering results may be achieved. We also expect the proposed algorithm to behave well in terms of robustness and stability with respect to the formation of the neural network.
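A minimal sketch of the Vector Space Model with tf-idf weighting and cosine similarity, the representation underlying the clustering the abstract proposes; the toy documents are invented:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Vector Space Model: weight each term by tf * idf."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc.split()))
    vectors = []
    for doc in docs:
        tf = Counter(doc.split())
        vectors.append({t: f * math.log(n / df[t]) for t, f in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

docs = ["data mining text", "text mining methods", "neural network training"]
vecs = tfidf_vectors(docs)
```

A clustering step would then group documents whose pairwise cosine similarity exceeds a threshold.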
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE (Journal For Research)
Natural Language Processing (NLP) techniques are among the most widely used techniques in the field of computer applications, and the area has become vast and advanced. Language is the means of communication among humans, and in the present scenario, when everything depends on machines and is computerized, communication between computers and humans has become a necessity. To fulfill this necessity, NLP emerged as the means of interaction that narrows the gap between machines (computers) and humans. It evolved from the study of linguistics and was assessed through the Turing test, which checked similarity between data but was limited to small datasets. Later, various algorithms were developed, along with concepts from AI (Artificial Intelligence), for the successful execution of NLP. In this paper, the main emphasis is on the different NLP techniques developed so far, their applications, and a comparison of those techniques on different parameters.
Presentation of the main IR models
Presentation of our submission to TREC KBA 2014 (Entity oriented information retrieval), in partnership with Kware company (V. Bouvier, M. Benoit)
This paper proposes a natural-language discourse-analysis method for extracting information from news articles of different domains. The discourse analysis uses Rhetorical Structure Theory (RST), which identifies coherent groups of text that are most prominent for extracting information from text. RST uses the nucleus-satellite concept to find the most prominent text in the document. After discourse analysis, text analysis is performed to extract domain-related objects and relate them. For the extraction, a knowledge-based system is used which consists of a domain dictionary; the domain dictionary holds a bag of words for each domain. The system is evaluated against a gold-standard analysis and human judgment of the extracted information.
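The domain-dictionary lookup can be illustrated with a toy bag-of-words scorer; the dictionaries and sentence below are invented for illustration:

```python
def domain_score(sentence, domain_dictionary):
    """Score a sentence against each domain's bag of words by counting
    how many dictionary terms appear in it."""
    tokens = set(sentence.lower().split())
    return {domain: len(tokens & bag) for domain, bag in domain_dictionary.items()}

domain_dictionary = {
    "sports": {"match", "goal", "team", "player"},
    "finance": {"stock", "market", "shares", "profit"},
}
scores = domain_score("The team scored a late goal in the match", domain_dictionary)
best = max(scores, key=scores.get)  # domain with the most dictionary hits
```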
Named Entity Recognition Using Web Document Corpus (IJMIT JOURNAL)
This paper introduces a named entity recognition approach for textual corpora. A Named Entity (NE) can be the name of a location, person, organization, date, time, etc., characterized by its instances. An NE is found in texts accompanied by contexts: the words to the left or right of the NE. The work mainly aims at identifying contexts that induce the NE's nature. For instance, the occurrence of the word "President" in a text means that this context may be followed by the name of a president, as in President "Obama". Likewise, a word preceded by the string "footballer" is likely the name of a footballer. NE recognition may thus be viewed as a classification task, where every word is assigned to an NE class according to its context.
The aim of this study is then to identify and classify the contexts that are most relevant for recognizing an NE: those frequently found with it. A learning approach using a training corpus of web documents, constructed from learning examples, is suggested. Frequency representations and modified tf-idf representations are used to calculate context weights that combine context frequency, learning-example frequency, and document frequency in the corpus.
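One plausible form of a modified tf-idf context weight combining the three frequencies the abstract names is sketched below; the exact formula is an assumption for illustration, not the paper's:

```python
import math

def context_weight(ctx_freq, example_freq, doc_freq, n_docs):
    """Hypothetical modified tf-idf: reward contexts seen often with the NE
    (ctx_freq) and across many learning examples (example_freq), and
    penalise contexts that are common across the corpus (doc_freq)."""
    return ctx_freq * math.log(1 + example_freq) * math.log(n_docs / doc_freq)

# A context seen in 2 of 100 documents outweighs one seen in 50 of 100,
# all else being equal.
rare = context_weight(5, 3, 2, 100)
common = context_weight(5, 3, 50, 100)
```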
Architecture of an ontology based domain-specific natural language question a... (IJwest)
A question answering (QA) system aims at retrieving precise information from a large collection of documents in response to a query. This paper describes the architecture of a Natural Language Question Answering (NLQA) system for a specific domain based on ontological information, a step towards semantic-web question answering. The proposed architecture defines four basic modules suitable for enhancing current QA capabilities with the ability to process complex questions. The first module is question processing, which analyses and classifies the question and reformulates the user query. The second module retrieves the relevant documents. The next module processes the retrieved documents, and the last module performs the extraction and generation of a response. Natural language processing techniques are used for processing the question and documents and also for answer extraction. Ontology and domain knowledge are used for reformulating queries and identifying relations. The aim of the system is to generate a short, specific answer to a question asked in natural language in a specific domain. We achieved 94% accuracy of natural language question answering in our implementation.
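The four-module architecture can be sketched as a chain of stub functions; every function body here is a placeholder for illustration, not the paper's ontology-driven implementation, and the corpus and question are invented:

```python
def process_question(question):
    """Module 1: classify the question and reformulate the query (stub)."""
    qtype = "definition" if question.lower().startswith("what is") else "other"
    keywords = [w.strip("?") for w in question.split()[2:]]
    return qtype, " ".join(keywords)

def retrieve_documents(query, corpus):
    """Module 2: keep documents sharing at least one term with the query."""
    terms = set(query.lower().split())
    return [d for d in corpus if terms & set(d.lower().split())]

def process_documents(docs, query):
    """Module 3: rank the retrieved documents by term overlap."""
    terms = set(query.lower().split())
    return sorted(docs, key=lambda d: len(terms & set(d.lower().split())),
                  reverse=True)

def extract_answer(ranked_docs):
    """Module 4: return the best candidate sentence as the short answer."""
    return ranked_docs[0] if ranked_docs else None

corpus = ["An ontology is a formal specification of a conceptualisation.",
          "Fuzzy logic handles vague terms."]
qtype, query = process_question("What is an ontology?")
answer = extract_answer(process_documents(retrieve_documents(query, corpus), query))
```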
Ontologies have been applied in many areas in recent years, especially the Semantic Web, Information Retrieval, Information Extraction, and Question Answering. The purpose of a domain-specific ontology is to eliminate conceptual and terminological confusion. It accomplishes this by specifying a set of generic concepts that characterizes the domain, together with their definitions and interrelationships. This paper describes algorithms for identifying semantic relations and constructing an Information Technology ontology while extracting concepts and objects from different sources. The ontology is constructed from three main resources: ACM, Wikipedia, and unstructured files from the ACM Digital Library. Our algorithms combine Natural Language Processing and Machine Learning. We use NLP tools such as OpenNLP and the Stanford lexical dependency parser to explore sentences, and then extract sentences matching English patterns in order to build the training set. We use a random sample from the 245 ACM categories to evaluate our results. The results show that our system yields superior performance.
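A relation-identification step of the kind described might match English patterns such as "X is a Y"; the regular expression below is an invented illustration of one such pattern, not the paper's pattern set:

```python
import re

def extract_isa_relations(sentences):
    """Find 'X is a/an Y' hypernym patterns in a list of sentences."""
    pattern = re.compile(r"^(\w[\w ]*?) is an? ([\w ]+?)\.?$", re.IGNORECASE)
    relations = []
    for s in sentences:
        m = pattern.match(s.strip())
        if m:
            relations.append((m.group(1), "is-a", m.group(2)))
    return relations

rels = extract_isa_relations(["Java is a programming language.",
                              "The parser explores sentences."])
```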
The Statement of Conjunctive and Disjunctive Queries in Object Oriented Datab... (Editor IJCATR)
The entrance of object-oriented concepts into databases has caused relational databases to be gradually replaced by object-oriented databases in various fields. Meanwhile, several methods have been presented for handling the uncertain data of the real world. One of these methods for database modeling is an approach which couples object-oriented database modeling with fuzzy logic. Many queries that users pose are expressed in terms of linguistic variables, and because classical databases cannot support such variables, fuzzy approaches are considered. In this study we investigate database queries in both simple and complex forms; in the complex form, we use conjunctive and disjunctive queries. We then use XML labels to express the queries in fuzzy form. Entering the XML world, as the most reliable option, also lets us communicate with other parts of the software. We further refine conjunctive and disjunctive queries on a fuzzy object-oriented database using the concepts of dependency measure and weight, with weights assigned to different phrases of a query based on user emphasis. Another aim of this research is mapping fuzzy queries to fuzzy-XML. The queries are expected to be simple to implement, and the output of their execution to be much closer to users' needs and expectations. The results show that the proposed method expresses the possible conjunctive and disjunctive queries on the database in the form of fuzzy-XML.
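In fuzzy logic, a conjunctive query is commonly evaluated with the minimum of the membership degrees and a disjunctive query with the maximum. A minimal sketch, with invented membership functions and record:

```python
def fuzzy_and(*degrees):
    """Conjunctive query: minimum of the membership degrees."""
    return min(degrees)

def fuzzy_or(*degrees):
    """Disjunctive query: maximum of the membership degrees."""
    return max(degrees)

def young(age):
    """Membership of 'young': 1 below 25, falling linearly to 0 at 45."""
    return max(0.0, min(1.0, (45 - age) / 20))

def tall(height_cm):
    """Membership of 'tall': 0 below 160 cm, rising linearly to 1 at 190 cm."""
    return max(0.0, min(1.0, (height_cm - 160) / 30))

# "young AND tall" versus "young OR tall" for a 30-year-old of 175 cm
person = {"age": 30, "height": 175}
conj = fuzzy_and(young(person["age"]), tall(person["height"]))
disj = fuzzy_or(young(person["age"]), tall(person["height"]))
```

Phrase weights based on user emphasis, as the abstract describes, would scale the individual degrees before the min/max is taken.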
A classical or traditional information system provides an answer only after a user submits a complete query; indeed, almost all relational database systems today rely on queries whose syntax and semantics are completely defined before data is accessed. But often we are willing to use vague terms in a query. The main objective of a database management system is to provide an environment that is both convenient and efficient for storing and retrieving information, and the recent trend of supporting auto-complete is a first step towards coping with this problem. We can design both classical and fuzzy databases and use fuzzy queries effectively on them. Fuzzy databases are developed to manipulate incomplete, unclear, and vague data such as "low", "fast", "very high", and "about". The primary focus of fuzzy logic is on natural language. This paper gives users the flexibility to query a database using natural language by implementing "interactive fuzzy search": a framework that permits the user to explore the data as they type, even in the presence of minor errors. The paper applies fuzzy queries to a relational database so that precise results can be obtained, as well as output for the uncertain terms we generally use, based on a membership function.
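A toy version of such an interactive fuzzy search can be built on approximate string matching. This sketch uses Python's difflib rather than the paper's framework, and the records are invented:

```python
import difflib

def interactive_fuzzy_search(prefix, records, cutoff=0.6):
    """Return records whose name approximately matches the typed prefix,
    tolerating minor spelling errors ('search as you type' sketch)."""
    # compare against name prefixes of the same length as the typed text
    prefixes = [r["name"][:len(prefix)].lower() for r in records]
    matches = set(difflib.get_close_matches(prefix.lower(), prefixes,
                                            n=5, cutoff=cutoff))
    return [r for r in records
            if r["name"][:len(prefix)].lower() in matches]

records = [{"name": "Schwarzenegger"}, {"name": "Schwartz"}, {"name": "Miller"}]
hits = interactive_fuzzy_search("schwarz", records)
```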
Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ... (IJERA Editor)
Although publicly accessible databases containing speech documents exist, a great deal of time and effort is required to keep them up to date, which is often burdensome. To help identify the speaker of a speech when its text is available, text-mining tools from the machine-learning discipline can be applied. Here we describe and evaluate document-classification algorithms, a combination of text mining and classification. The task asked participants to design classifiers for identifying documents containing speech-related information in the main literature and evaluated them against one another. The proposed system uses a k-nearest-neighbour classification approach, and we compare its performance for different values of k.
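A plain (unweighted) k-nearest-neighbour document classifier, the baseline the dynamic and attribute-weighted variant builds on, can be sketched as follows; the training snippets are invented:

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts for one document."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two term-count vectors."""
    dot = sum(c * v.get(t, 0) for t, c in u.items())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def knn_classify(text, training, k=3):
    """Label a document by majority vote among its k nearest neighbours."""
    query = vectorize(text)
    neighbours = sorted(training,
                        key=lambda ex: cosine(query, vectorize(ex[0])),
                        reverse=True)[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

training = [("formant pitch phoneme analysis", "speech"),
            ("pitch and phoneme recognition", "speech"),
            ("crop yield soil survey", "other"),
            ("soil irrigation report", "other")]
label = knn_classify("phoneme and pitch features", training, k=3)
```

The weighted variants described in the title would scale each neighbour's vote by its similarity and each attribute by a learned weight.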
Ontologies are used to organize information in many domains, such as artificial intelligence, information science, the semantic web, and library science. Ontologies of an entity holding different information can be merged to create more knowledge about that particular entity. Ontologies today power more accurate search and retrieval on websites such as Wikipedia, and as we move towards Web 3.0, also termed the semantic web, they will play an even more important role.
Ontologies are represented in various forms such as RDF, RDFS, XML, and OWL, and querying them can yield basic information about an entity. This paper proposes an automated method for ontology creation using concepts from NLP (Natural Language Processing), Information Retrieval, and Machine Learning; concepts drawn from these domains help in designing more accurate ontologies, represented here in XML format. The paper uses classification algorithms to assign labels to documents, document similarity to cluster documents similar to the input document, and summarization to shorten the text while keeping the important terms essential to building the ontology. The module is implemented in the Python programming language with NLTK (the Natural Language Toolkit). The ontologies created in XML convey to a lay person the definitions of the important terms and their lexical relationships.
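The XML output step can be illustrated with a minimal generator; the element names, relation types, and example entry are invented for illustration, not the paper's schema:

```python
import xml.etree.ElementTree as ET

def build_ontology(term, definition, relations):
    """Emit a minimal XML ontology entry for one important term.

    relations -- mapping of relation name (e.g. 'hypernym') to related terms
    """
    root = ET.Element("ontology")
    concept = ET.SubElement(root, "concept", name=term)
    ET.SubElement(concept, "definition").text = definition
    for relation, targets in relations.items():
        for target in targets:
            ET.SubElement(concept, "relation", type=relation, target=target)
    return ET.tostring(root, encoding="unicode")

xml_doc = build_ontology(
    "neural network",
    "A computing system inspired by biological neurons.",
    {"hypernym": ["machine learning model"], "meronym": ["neuron", "layer"]})
```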
SIMILAR THESAURUS BASED ON ARABIC DOCUMENT: AN OVERVIEW AND COMPARISON (IJCSEA Journal)
With the massive growth of modern information retrieval systems (IRS), search in natural languages becomes more difficult, and search in Arabic, as a natural language, is not yet good enough. This paper builds a similarity thesaurus for Arabic using two mechanisms, one based on full words and the other on stemmed words, and then compares them. The comparison made by this study shows that the similarity thesaurus built with the stemmed mechanism obtains better results than the full-word mechanism, and that the similarity thesaurus improves recall and precision over a traditional information retrieval system at the measured recall and precision levels.
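The full-word versus stemmed comparison can be illustrated with a toy suffix stripper. Note this only mimics the mechanism with English suffixes; the paper's stemmer operates on Arabic morphology:

```python
def light_stem(word):
    """Toy light stemmer: strip a few common English suffixes
    (illustration of the mechanism only)."""
    for suffix in ("ing", "ed", "es", "e", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def matches(query, document, stemmed=False):
    """True if every query term occurs in the document, optionally
    comparing stemmed forms instead of full words."""
    norm = light_stem if stemmed else (lambda w: w)
    doc_terms = {norm(w) for w in document.lower().split()}
    return all(norm(w) in doc_terms for w in query.lower().split())

doc = "retrieving indexed documents improves recall"
full = matches("retrieve document", doc, stemmed=False)  # full-word: miss
stem = matches("retrieve document", doc, stemmed=True)   # stemmed: hit
```

This is exactly why the stemmed mechanism recovers matches (and hence recall) that the full-word mechanism misses.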
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
A study on the approaches of developing a named entity recognition tool (eSAT Publishing House)
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching, and research in the fields of Engineering and Technology. We bring together Scientists, Academicians, Field Engineers, Scholars, and Students of related fields of Engineering and Technology.
Novel Database-Centric Framework for Incremental Information Extraction (ijsrd.com)
Information extraction (IE) has been an active research area that seeks techniques to uncover information from large collections of text. IE is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents; in most cases this means processing human-language texts by means of natural language processing (NLP). Recent activities in document processing, such as automatic annotation and content extraction, can be seen as information extraction. Many applications call for methods that enable automatic extraction of structured information from unstructured natural-language text, and due to the inherent challenges of natural language processing, most existing methods tend to be domain specific. This project presents a new paradigm for information extraction: an extraction framework in which the intermediate output of each text-processing component is stored, so that only an improved component has to be redeployed over the entire corpus. Extraction is then performed on both the previously processed data from the unchanged components and the updated data generated by the improved component. Performing such incremental extraction can yield a tremendous reduction in processing time. The framework also provides a mechanism to generate extraction queries from both labeled and unlabeled data; query generation is critical so that casual users can specify their information needs without learning the query language.
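The caching idea behind incremental extraction can be sketched as follows; the component/version scheme and toy components are an invented illustration of the framework's intermediate-output store:

```python
class IncrementalPipeline:
    """Store each component's intermediate output keyed by a version stamp,
    so that only an improved (re-versioned) component is re-run."""

    def __init__(self):
        self.cache = {}  # (doc_id, component_name, version) -> output

    def run(self, doc_id, text, components):
        """components -- list of (name, version, function), applied in order.
        Returns the final output and the names of recomputed components."""
        data, recomputed = text, []
        for name, version, func in components:
            key = (doc_id, name, version)
            if key not in self.cache:
                self.cache[key] = func(data)
                recomputed.append(name)
            data = self.cache[key]
        return data, recomputed

pipeline = IncrementalPipeline()
tokenize = ("tokenize", 1, lambda t: t.split())
tag_v1 = ("tag", 1, lambda toks: [(t, "NOUN") for t in toks])
tag_v2 = ("tag", 2, lambda toks: [(t, "NN") for t in toks])

pipeline.run("doc1", "incremental extraction", [tokenize, tag_v1])
# Improving only the tagger re-runs only the tagger; tokenizer output is reused.
_, rerun = pipeline.run("doc1", "incremental extraction", [tokenize, tag_v2])
```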
J48 and JRIP Rules for E-Governance Data (CSCJournals)
Data are any facts, numbers, or text that can be processed by a computer. Data mining is an analytic process designed to explore data, usually large amounts of it, and is often considered to be "a blend of statistics". In this paper we use two data-mining techniques, J48 and JRIP, for discovering classification rules and generating a decision tree. The data-mining tool WEKA is used throughout.
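J48 is WEKA's implementation of the C4.5 decision-tree learner, which chooses split attributes by information gain. That computation can be sketched as follows; the toy e-governance records are invented:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Gain used by C4.5-style learners (J48) to pick the split attribute."""
    total = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return total - remainder

rows = [{"online": "yes", "adopted": "yes"},
        {"online": "yes", "adopted": "yes"},
        {"online": "no",  "adopted": "no"},
        {"online": "no",  "adopted": "yes"}]
gain = information_gain(rows, "online", "adopted")
```

JRIP (WEKA's RIPPER implementation) instead grows and prunes if-then rules directly, but it evaluates candidate conditions with similar frequency-based criteria.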
A Design of fuzzy controller for Autonomous Navigation of Unmanned Vehicle (Waqas Tariq)
A design approach is proposed for a fuzzy logic controller for autonomous navigation of a vehicle in an obstacle-filled environment. The proposed fuzzy controller is composed of an obstacle-avoidance layer, an orientation-control layer, and a passage-detection module. Here the fuzzy controller for obstacle avoidance is presented: it provides a model for fusing multiple sensor inputs and is composed of eight individual controllers, each calculating a collision possibility in a different direction of movement. Using these collision-possibility values, a main controller performs real-time collision avoidance. The operating frequency and logic-cell requirements for different implementation techniques are determined. The designs have been carried out in the digital domain with VHDL using Altera Quartus-II software.
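The per-direction collision-possibility idea can be sketched with a simple fuzzy membership function; the membership breakpoints and sensor readings below are invented illustration values, not the paper's design:

```python
def collision_possibility(distance_cm):
    """Fuzzy membership of 'collision likely': 1 when closer than 20 cm,
    falling linearly to 0 beyond 100 cm."""
    return max(0.0, min(1.0, (100 - distance_cm) / 80))

def steer(sensor_readings):
    """Eight directional controllers each report a collision possibility;
    the main controller steers toward the safest direction."""
    possibilities = {d: collision_possibility(r)
                     for d, r in sensor_readings.items()}
    return min(possibilities, key=possibilities.get), possibilities

readings = {"N": 30, "NE": 90, "E": 150, "SE": 120,
            "S": 60, "SW": 45, "W": 25, "NW": 80}
direction, poss = steer(readings)
```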
A Proposed Web Accessibility Framework for the Arab Disabled (Waqas Tariq)
The Web is providing unprecedented access to information and interaction for people with disabilities. This paper presents a Web accessibility framework which offers the ease of the Web accessing for the disabled Arab users and facilitates their lifelong learning as well. The proposed framework system provides the disabled Arab user with an easy means of access using their mother language so they don't have to overcome the barrier of learning the target-spoken language. This framework is based on analyzing the web page meta-language, extracting its content and reformulating it in a suitable format for the disabled users. The basic objective of this framework is supporting the equal rights of the Arab disabled people for their access to the education and training with non disabled people. Keywords: Arabic Moon code, Arabic Sign Language, Deaf, Deaf-blind, E-learning Interactivity, Moon code, Web accessibility, Web framework, Web System, WWW.
New Approach of Prediction of Sidoarjo Hot Mudflow Disastered Area Based on P... (Waqas Tariq)
A new approach to predicting the Sidoarjo hot-mudflow disaster area, based on cellular automata with probabilistic adjustment for minimizing prediction errors, is proposed. The Sidoarjo hot mudflow has specific characteristics, such as a flat and complex area, huge mud plumes, high viscosity, and surface-temperature changes, so it requires combined approaches covering slow debris flow and material changes caused by viscous fluid and thermal change. Deterministic approaches cannot capture the frequent state changes. This paper presents a new cellular-automaton approach that uses probabilistic state changes to simulate the spreading of the hot mudflow. The model was calibrated with a time series of topological maps. The experimental results show newly inundated areas that are identified as high-risk areas covered by mud, and show that the proposed probabilistic cellular-automata approach predicts the hot-mudflow spreading areas much more accurately than existing conventional methods.
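A probabilistic cellular-automaton update of the kind described can be sketched as follows; the grid size, 4-neighbourhood, and spread probability are invented illustration values, not the calibrated model:

```python
import random

def step(grid, spread_prob, rng):
    """One probabilistic CA update: a dry cell (0) becomes inundated (1)
    with probability spread_prob per inundated 4-neighbour."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(rows):
        for j in range(cols):
            if grid[i][j] == 0:
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    if (0 <= ni < rows and 0 <= nj < cols
                            and grid[ni][nj] == 1
                            and rng.random() < spread_prob):
                        new[i][j] = 1
                        break
    return new

rng = random.Random(42)      # seeded for reproducibility
grid = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]  # mud source at the centre
for _ in range(10):
    grid = step(grid, 0.5, rng)
flooded = sum(map(sum, grid))
```

Calibration against the time series of maps would tune spread_prob (possibly per cell, from viscosity and temperature) so that simulated inundation matches the observed one.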
Named Entity Recognition Using Web Document CorpusIJMIT JOURNAL
This paper introduces a named entity recognition approach in textual corpus. This Named Entity (NE) can be a named: location, person, organization, date, time, etc., characterized by instances. A NE is found in texts accompanied by contexts: words that are left or right of the NE. The work mainly aims at identifying contexts inducing the NE’s nature. As such, The occurrence of the word "President" in a text, means that this word or context may be followed by the name of a president as President "Obama". Likewise, a word preceded by the string "footballer" induces that this is the name of a footballer. NE recognition may be viewed as a classification method, where every word is assigned to a NE class, regarding the context.
The aim of this study is then to identify and classify the contexts that are most relevant to recognize a NE, those which are frequently found with the NE. A learning approach using training corpus: web documents, constructed from learning examples is then suggested. Frequency representations and modified tf-idf representations are used to calculate the context weights associated to context frequency, learning example frequency, and document frequency in the corpus.
Named entity recognition using web document corpusIJMIT JOURNAL
This paper introduces a named entity recognition approach in textual corpus. This Named Entity (NE)
can be a named: location, person, organization, date, time, etc., characterized by instances. A NE is
found in texts accompanied by contexts: words that are left or right of the NE. The work mainly aims at identifying contexts inducing the NE’s nature. As such, The occurrence of the word "President" in a text, means that this word or context may be followed by the name of a president as President "Obama". Likewise, a word preceded by the string "footballer" induces that this is the name of a
footballer. NE recognition may be viewed as a classification method, where every word is assigned to
a NE class, regarding the context. The aim of this study is then to identify and classify the contexts that are most relevant to recognize a NE, those which are frequently found with the NE. A learning approach using training corpus: web documents, constructed from learning examples is then suggested. Frequency representations and modified tf-idf representations are used to calculate the context weights associated to context frequency, learning example frequency, and document frequency in the corpus.
Architecture of an ontology based domain-specific natural language question a...IJwest
Question answering (QA) system aims at retrieving precise information from a large collection of
documents against a query. This paper describes the architecture of a Natural Language Question
Answering (NLQA) system for a specifi
c domain based on the ontological information, a step towards
semantic web question answering. The proposed architecture defines four basic modules suitable for
enhancing current QA capabilities with the ability of processing complex questions. The first m
odule was
the question processing, which analyses and classifies the question and also reformulates the user query.
The second module allows the process of retrieving the relevant documents. The next module processes the
retrieved documents, and the last m
odule performs the extraction and generation of a response. Natural
language processing techniques are used for processing the question and documents and also for answer
extraction. Ontology and domain knowledge are used for reformulating queries and ident
ifying the
relations. The aim of the system is to generate short and specific answer to the question that is asked in the
natural language in a specific domain. We have achieved 94 % accuracy of natural language question
answering in our implementation
Ontologisms have been applied to many applications in recent years, especially on Sematic Web, Information
Retrieval, Information Extraction, and Question and Answer. The purpose of domain-specific ontology
is to get rid of conceptual and terminological confusion. It accomplishes this by specifying a set of generic
concepts that characterizes the domain as well as their definitions and interrelationships. This paper will
describe some algorithms for identifying semantic relations and constructing an Information Technology
Ontology, while extracting the concepts and objects from different sources. The Ontology is constructed
based on three main resources: ACM, Wikipedia and unstructured files from ACM Digital Library. Our
algorithms are combined of Natural Language Processing and Machine Learning. We use Natural Language
Processing tools, such as OpenNLP, Stanford Lexical Dependency Parser in order to explore sentences.
We then extract these sentences based on English pattern in order to build training set. We use a
random sample among 245 categories of ACM to evaluate our results. Results generated show that our
system yields superior performance.
The Statement of Conjunctive and Disjunctive Queries in Object Oriented Datab...Editor IJCATR
The entrance of the object-oriented concept into databases has caused relational databases to be gradually replaced by object-oriented
databases in various fields. On the other hand, several methods have been presented to handle the uncertain data of the real world.
One of these methods for modeling databases is an approach which couples object-oriented database modeling with fuzzy logic. Many
queries that users pose are expressed in terms of linguistic variables. Because classical databases are not able to support these
variables, fuzzy approaches are considered. In this study we investigate database queries in both simple and complex forms. In
the complex form, we use conjunctive and disjunctive queries. We then use XML labels to express queries in fuzzy form.
Entering the XML world, as the most reliable option, also lets us communicate with other parts of the software. We also want
to correct conjunctive and disjunctive queries on a fuzzy object-oriented database using the concepts of dependency measure and
weight, with weights assigned to different phrases of a query based on user emphasis. The other aim of this research is mapping fuzzy
queries to fuzzy-XML. The queries are expected to be simple to implement, and the output of query execution to be much closer to users'
needs and expectations. The results show that the proposed method expresses the possible conjunctive and disjunctive queries on the
database in the form of Fuzzy-XML.
The classical or traditional information system provides an answer only after a user submits a complete query. It can even
be noticed that, at present, almost all relational database systems rely on queries whose syntax and semantics
are completely defined to access data. But often we are willing to use vague terms in our queries. The main
objective of a database management system is to provide an environment that is both convenient and efficient for people
to use in storing and retrieving information. The recent trend of supporting auto-complete is a first step towards coping with
this problem. We can design both classical and fuzzy databases and effectively use fuzzy queries on these
databases. Fuzzy databases are developed to manipulate incomplete, unclear and vague data such as low, fast, very
high, about, etc. The primary focus of fuzzy logic is on natural language. This paper provides users the flexibility,
or freedom, to query a database using natural language. Here the paper implements "interactive fuzzy search". This
framework for interactive fuzzy search permits users to explore the data as they type, even in the presence of some
minor errors. The paper applies fuzzy queries to a relational database so that it is possible to obtain a precise result as
well as output for the uncertain terms we generally use, based on some membership function.
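To make the idea of querying with vague terms concrete, here is a minimal sketch (not the paper's implementation) of trapezoidal membership functions applied as a fuzzy selection predicate; the `LOW`/`HIGH` sets and the salary attribute are invented for illustration.

```python
# Sketch: trapezoidal membership functions for vague query terms
# such as "low" or "very high". Fuzzy sets and data are hypothetical.

def trapezoid(x, a, b, c, d):
    """Membership degree of x in a trapezoidal fuzzy set (a <= b <= c <= d)."""
    if b <= x <= c:
        return 1.0          # plateau: full membership
    if x <= a or x >= d:
        return 0.0          # outside the support
    if x < b:
        return (x - a) / (b - a)   # rising edge
    return (d - x) / (d - c)       # falling edge

# Hypothetical fuzzy sets over a "salary" attribute (values in $1000s).
LOW = (0, 0, 30, 50)    # fully "low" up to 30, fades out by 50
HIGH = (60, 80, 200, 200)

def fuzzy_select(rows, attr, fuzzy_set, threshold=0.5):
    """Return rows whose membership in the fuzzy set exceeds the threshold."""
    return [r for r in rows if trapezoid(r[attr], *fuzzy_set) > threshold]

employees = [{"name": "A", "salary": 25},
             {"name": "B", "salary": 45},
             {"name": "C", "salary": 90}]
print(fuzzy_select(employees, "salary", LOW))  # only A: degree 1.0; B is 0.25
```

A real fuzzy database layer would attach such membership functions to linguistic terms in the query language rather than to hard-coded tuples.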
Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ...
Publicly accessible databases contain many speech documents, but the time and effort required to keep
them up to date is often burdensome. To help identify the speaker of a speech when its text is
available, text-mining tools from the machine learning discipline can be applied to assist in this process.
Here, we describe and evaluate document classification algorithms, i.e. a combined package of text mining and
classification. This task asked participants to design classifiers for identifying documents containing speech-related
information in the mainstream literature, and evaluated them against one another. The proposed system utilizes a
novel approach to k-nearest-neighbour classification and compares its performance for different values of
k.
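As background, a plain (unweighted) k-nearest-neighbour text classifier can be sketched in a few lines; this illustrates the baseline the abstract's weighted variant builds on, with term-frequency vectors and cosine similarity. The tiny training set is invented for illustration.

```python
# Minimal k-NN document classification sketch: term-frequency vectors
# plus cosine similarity and majority vote. The paper's dynamic/attribute
# weighting and bootstrap step are not reproduced here.
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(train, doc, k=3):
    """train: list of (text, label) pairs; doc: text to classify."""
    q = Counter(doc.lower().split())
    ranked = sorted(train,
                    key=lambda tl: cosine(Counter(tl[0].lower().split()), q),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

train = [("the spoken audio was clear", "speech"),
         ("speech recognition of audio", "speech"),
         ("stock prices fell today", "finance"),
         ("market prices and stocks", "finance")]
print(knn_classify(train, "audio speech quality", k=3))  # speech
```

Varying `k`, as the abstract describes, just means changing the `k` argument and re-evaluating accuracy on a held-out set.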
Ontologies are being used to organize information in many domains like artificial intelligence,
information science, semantic web, library science. Ontologies of an entity having different information
can be merged to create more knowledge of that particular entity. Ontologies today are powering more
accurate search and retrieval in websites like Wikipedia etc. As we move towards the future to Web 3.0,
also termed as the semantic web, ontologies will play a more important role.
Ontologies are represented in various forms like RDF, RDFS, XML, OWL etc. Querying ontologies can
yield basic information about an entity. This paper proposes an automated method for ontology creation,
using concepts from NLP (Natural Language Processing), Information Retrieval and Machine Learning.
Concepts drawn from these domains help in designing more accurate ontologies represented using the
XML format. This paper uses classification algorithms to assign labels to documents, document
similarity to cluster documents similar to the input document, and summarization to shorten
the text while keeping the important terms essential to making the ontology. The module
is constructed using the Python programming language and NLTK (the Natural Language Toolkit). The
ontologies created in XML convey to a lay person the definitions of the important terms and their
lexical relationships.
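As an illustrative sketch only, here is how a term, its definition and a few lexical relations could be serialised as an XML ontology fragment using Python's standard library; the tag names and example content are assumptions, not the paper's actual schema.

```python
# Sketch: serialising one concept with lexical relations as XML.
# Element/attribute names are illustrative, not the paper's schema.
import xml.etree.ElementTree as ET

def build_ontology(term, definition, relations):
    """relations: list of (relation_type, target_term) pairs."""
    root = ET.Element("ontology")
    concept = ET.SubElement(root, "concept", name=term)
    ET.SubElement(concept, "definition").text = definition
    for rel_type, target in relations:
        ET.SubElement(concept, "relation", type=rel_type, target=target)
    return ET.tostring(root, encoding="unicode")

xml_doc = build_ontology(
    "neural network",
    "A computing system inspired by biological neurons.",
    [("hypernym", "machine learning model"),
     ("related", "deep learning")])
print(xml_doc)
```

In the described module, the definition would come from the summarization step and the relations from the clustering and classification steps.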
SIMILAR THESAURUS BASED ON ARABIC DOCUMENT: AN OVERVIEW AND COMPARISON
With the massive growth of modern information retrieval systems (IRS), search, especially in natural
languages, becomes more difficult. Search in Arabic, as a natural language, is not yet good enough. This
paper tries to build a similar thesaurus based on the Arabic language using two mechanisms: the first is a full-word
mechanism and the other is a stemmed mechanism; the two are then compared.
The comparison made by this study shows that the similar thesaurus using the stemmed mechanism gets
better results than the traditional approach under the same mechanism, and that the similar thesaurus improves
recall and precision over a traditional information retrieval system at the measured recall and precision levels.
A study on the approaches of developing a named entity recognition tool
Novel Database-Centric Framework for Incremental Information Extraction
Information extraction (IE) has been an active research area that seeks techniques to uncover information from a large collection of text. IE is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. In most cases this activity concerns processing human language texts by means of natural language processing (NLP). Recent activities in document processing, like automatic annotation and content extraction, can be seen as information extraction. Many applications call for methods to enable automatic extraction of structured information from unstructured natural language text. Due to the inherent challenges of natural language processing, most existing methods for information extraction from text tend to be domain-specific. In this project, a new paradigm for information extraction is proposed. In this extraction framework, the intermediate output of each text processing component is stored, so that only the improved component has to be deployed to the entire corpus. Extraction is then performed on both the previously processed data from the unchanged components and the updated data generated by the improved component. Performing such incremental extraction can result in a tremendous reduction in processing time, and there is a mechanism to generate extraction queries from both labeled and unlabeled data. Query generation is critical so that casual users can specify their information needs without learning the query language.
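The incremental idea can be sketched as follows: each pipeline stage's output is cached per document, so after improving one component only that stage (and those after it) is recomputed. This is a toy illustration, not the paper's framework; the component functions are invented.

```python
# Toy sketch of incremental extraction: cache each stage's intermediate
# output per document, recompute only the improved (changed) stages.
cache = {}  # (doc_id, stage_name) -> stored intermediate output

def run_stage(doc_id, stage_name, func, inp, changed_stages):
    key = (doc_id, stage_name)
    if key in cache and stage_name not in changed_stages:
        return cache[key]      # reuse stored intermediate output
    cache[key] = func(inp)     # redeploy only the improved component
    return cache[key]

# Invented pipeline components for illustration.
def tokenize(text):        return text.split()
def tag_entities(tokens):  return [t for t in tokens if t[0].isupper()]

def extract(doc_id, text, changed_stages=()):
    tokens = run_stage(doc_id, "tokenize", tokenize, text, changed_stages)
    return run_stage(doc_id, "entities", tag_entities, tokens, changed_stages)

print(extract(1, "Alice met Bob in Paris"))       # first full run
print(extract(1, "Alice met Bob in Paris",
              changed_stages={"entities"}))       # tokenizer output reused
```

In a database-centric framework, the cache would live in DBMS tables, gaining consistency and recoverability for free.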
J48 and JRIP Rules for E-Governance Data
Data are any facts, numbers, or text that can be processed by a computer. Data mining is an analytic process designed to explore data, usually large amounts of data. Data mining is often considered to be a blend of statistics. In this paper we have used two data mining techniques for discovering classification rules and generating a decision tree: J48 and JRIP. The data mining tool WEKA is used in this paper.
A Design of fuzzy controller for Autonomous Navigation of Unmanned Vehicle
A design approach is proposed for a fuzzy logic controller for autonomous navigation of a vehicle in an obstacle-filled environment. The proposed fuzzy controller is composed of an obstacle avoidance layer, an orientation control layer and a passage detection module. Here the fuzzy controller for obstacle avoidance is presented. It provides a model for fusing multiple sensor inputs and is composed of eight individual controllers, each calculating a collision possibility in a different direction of movement. Using these collision possibility values, a main controller performs real-time collision avoidance. The operating frequency and logic cell requirements for different implementation techniques are determined. The designs have been carried out in the digital domain in VHDL using Altera Quartus-II software.
A Proposed Web Accessibility Framework for the Arab Disabled
The Web is providing unprecedented access to information and interaction for people with disabilities. This paper presents a Web accessibility framework which eases Web access for disabled Arab users and facilitates their lifelong learning as well. The proposed framework provides the disabled Arab user with an easy means of access using their mother tongue, so they do not have to overcome the barrier of learning the target spoken language. The framework is based on analyzing a web page's meta-language, extracting its content and reformulating it in a format suitable for disabled users. The basic objective of this framework is to support the equal right of disabled Arab people to access education and training alongside non-disabled people. Key Words: Arabic Moon code, Arabic Sign Language, Deaf, Deaf-blind, E-learning Interactivity, Moon code, Web accessibility, Web framework, Web System, WWW.
New Approach of Prediction of Sidoarjo Hot Mudflow Disastered Area Based on P...
A new approach to prediction of the Sidoarjo hot mudflow disaster area, based on cellular automata with probabilistic adjustment for minimizing prediction errors, is proposed. The Sidoarjo hot mudflow has specific characteristics such as a flat but complex area, huge mud plumes, high viscosity and surface temperature changes, so it needs combined approaches covering slow debris flow and material changes caused by viscous fluid and thermal changes. Deterministic approaches cannot capture the large state changes. This paper presents a new cellular automata approach using probabilistic state changes to simulate hot mudflow spreading. The model was calibrated against a time series of topological maps. The experimental results show new inundated areas that are identified as high-risk areas covered by mud. They also show that the proposed probabilistic cellular automata approach predicts hot mudflow spreading areas much more accurately than existing conventional methods.
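A generic probabilistic cellular automaton of this kind can be sketched briefly; this is not the calibrated Sidoarjo model, just the core mechanism of probabilistic state change: a wet cell spreads to each dry neighbour with some probability per step.

```python
# Generic probabilistic cellular automaton sketch: each mud-covered cell
# spreads to each dry 4-neighbour with probability p_spread per step.
import random

def step(grid, p_spread, rng):
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 1:  # mud-covered cell
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and new[nr][nc] == 0
                            and rng.random() < p_spread):
                        new[nr][nc] = 1  # probabilistic state change
    return new

rng = random.Random(42)               # seeded for reproducibility
grid = [[0] * 5 for _ in range(5)]
grid[2][2] = 1                        # eruption point in the centre
for _ in range(3):
    grid = step(grid, p_spread=0.5, rng=rng)
print(sum(map(sum, grid)), "cells inundated after 3 steps")
```

In the real model, `p_spread` would depend on local topography, viscosity and temperature, calibrated against the time series of maps.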
Institutional Investors Heterogeneity And Earnings Management: The R&D Invest...
This study examines the association between different institutional investors' ownership and earnings management practice through R&D expenditures. It investigates this relationship for a sample of 123 US firms. We also examine the effect of institutional ownership on the earnings management of firms having different information environments (S&P 500 versus non-S&P 500). Results show that while investment funds exacerbate earnings management by encouraging managers to limit R&D expenditures, pension funds and banks follow passive behaviors. Moreover, the hypothesis that the information environment is relevant in explaining institutional investors' behavior appears to be important in our case.
Using Meta-modeling Graph Grammars and R-Maude to Process and Simulate LRN Models
Nowadays, code mobility technology is one of the most attractive research domains. Numerous domains are concerned, many platforms have been developed and interesting applications have been realized. However, the poverty of modeling languages for dealing with code mobility at the requirements phase has prompted the suggestion of new formalisms. Among these we find Labeled Reconfigurable Nets (LRN) [9]. This new formalism allows explicit modeling of computational environments and of process mobility between them. It allows, in a simple and intuitive way, modeling of mobile code paradigms (mobile agent, code on demand, remote evaluation). In this paper, we propose an approach based on the combined use of meta-modeling and graph grammars to automatically generate a visual modeling tool for LRN for analysis and simulation purposes. In our approach, the UML class diagram formalism is used to define a meta-model of LRN. The meta-modeling tool ATOM3 is used to generate a visual modeling tool according to the proposed LRN meta-model. We have also proposed a graph grammar to generate R-Maude [22] specifications of the graphically specified LRN models. The reconfigurable rewriting logic language R-Maude is then used to simulate the resulting R-Maude specification. Our approach is illustrated through examples.
This study presents an improvement to Brent's method by reconstruction. Brent's method determines the next iteration interval from two subsections, whereas the new method determines the next iteration interval from three subsections constructed by four given points, and can thus greatly reduce the iteration interval length. The new method is not only more readable but also converges faster. An experiment was made to investigate its performance. Results show that, after simplification, the computational efficiency can be greatly improved.
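For context, here is a compact sketch of the classic bracket-shrinking idea that Brent-style methods refine: take a fast (secant) step when it stays inside the bracket and fall back to bisection otherwise. The paper's three-subsection variant is not reproduced here; this is a simplified hybrid, not Brent's full algorithm.

```python
# Simplified secant/bisection hybrid root finder: keeps a bracketing
# interval [a, b] with f(a)*f(b) < 0 and shrinks it each iteration.
def hybrid_root(f, a, b, tol=1e-12, max_iter=100):
    fa, fb = f(a), f(b)
    assert fa * fb < 0, "root must be bracketed"
    m = (a + b) / 2.0
    for _ in range(max_iter):
        # secant estimate from the current endpoints
        m = b - fb * (b - a) / (fb - fa)
        if not (min(a, b) < m < max(a, b)):
            m = (a + b) / 2.0          # fall back to bisection
        fm = f(m)
        if abs(fm) < tol or abs(b - a) < tol:
            return m
        if fa * fm < 0:
            b, fb = m, fm              # root lies in [a, m]
        else:
            a, fa = m, fm              # root lies in [m, b]
    return m

print(hybrid_root(lambda x: x * x - 2.0, 0.0, 2.0))  # ≈ 1.41421356...
```

The improvement the paper describes shrinks the interval faster by choosing among three subsections built from four points rather than two subsections from three.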
Language Combinatorics: A Sentence Pattern Extraction Architecture Based on C...
A "sentence pattern" in modern Natural Language Processing is often considered as a contiguous string of words (an n-gram). However, in many branches of linguistics, like Pragmatics or Corpus Linguistics, it has been noticed that simple n-gram patterns are not sufficient to reveal the whole sophistication of grammar patterns. We present a language-independent architecture for extracting more sophisticated patterns than n-grams from sentences. In this architecture a "sentence pattern" is considered as an n-element ordered combination of sentence elements. Experiments showed that the method extracts significantly more frequent patterns than the usual n-gram approach.
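The core idea can be sketched directly: instead of contiguous n-grams, extract every n-element ordered combination of a sentence's words (order preserved, gaps allowed), then count pattern frequencies across a corpus. The toy corpus and the `*` gap marker are illustrative, not the paper's notation.

```python
# Sketch: n-element ordered combinations of sentence elements,
# a generalisation of contiguous n-grams (gaps allowed).
from itertools import combinations
from collections import Counter

def combination_patterns(sentence, n):
    """All n-element ordered combinations of the sentence's words."""
    words = sentence.lower().split()
    return [" * ".join(p) for p in combinations(words, n)]

sentences = ["I really like tea", "I like green tea"]
counts = Counter(p for s in sentences for p in combination_patterns(s, 2))
print(counts["i * tea"])  # 2: frequent, yet never a contiguous bigram
```

This shows why such patterns can be "significantly more frequent" than n-grams: "i ... tea" recurs even though the intervening words differ.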
Application of Mobile Technology in Waste Collection
One of the stages in waste management is waste collection, and as global waste generation continues to increase year after year, the need for better and more efficient waste disposal, collection and management methods becomes more evident and urgent. Automated forms of waste collection are very expensive and far from affordable in many low-income communities, especially in the so-called developing countries. To solve this dilemma, mobile technologies are considered for use in waste collection as a prospective means of improving waste management. This paper attempts to proffer a generic yet concrete and efficient solution to the problems associated with waste collection via the application of mobile technologies: first, by tackling the problems individually in the form of subsystems, and then by integrating the subsystems together.
Logistic Loglogistic With Long Term Survivors For Split Population Model
Split population models are also known as mixture models. The data used in this paper are the Stanford Heart Transplant data: survival times of potential heart transplant recipients from their date of acceptance into the Stanford Heart Transplant program [3]. This set consists of the survival times, in days, uncensored and censored, for 103 patients; 3 covariates are considered: patient age in years, surgery and transplant, with failure for these individuals being death. Covariate methods have been examined quite extensively in the context of parametric survival models, for which the distribution of the survival times depends on the vector of covariates associated with each individual. See [6] for approaches which accommodate censoring and covariates in the ordinary exponential model for survival. Currently, such mixture models with immunes and covariates are in use in many areas such as medicine and criminology; see [4][5][7] for examples. In our formulation, the covariates are incorporated into a split loglogistic model by allowing the proportion of ultimate failures and the rate of failure to depend on the covariates and the unknown parameter vectors via a logistic model. Within this setup, we provide simple sufficient conditions for the existence, consistency, and asymptotic normality of a maximum likelihood estimator for the parameters involved. As an application of this theory, the likelihood ratio test for a difference in immune proportions is shown to have an asymptotic chi-square distribution. These results allow immediate practical applications involving the covariates and also provide some insight into the assumptions on the covariates and the censoring mechanism that are likely to be needed in practice. Our models and analysis are described in section 5.
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
In recent decades speech-interactive systems have gained increasing importance. The performance of an ASR system mainly depends on the availability of a large corpus of speech. The conventional method of building a large-vocabulary speech recognizer for any language uses a top-down approach to speech. This approach requires a large speech corpus with sentence- or phoneme-level transcription of the speech utterances. The transcriptions must also cover the different speech sounds so that the recognizer can build models for all the sounds present. But for the Telugu language, because of its complex nature, a very large, well-annotated speech database is very difficult to build. It is very difficult, if not impossible, to cover all the words of any Indian language, where each word may have thousands or millions of word forms. A significant part of the grammar that is handled by syntax in English (and other similar languages) is handled within morphology in Telugu. Phrases comprising several words (that is, tokens) in English would be mapped onto a single word in Telugu. The Telugu language is phonetic in nature in addition to being rich in morphology. That is why speech technology developed for English cannot be applied directly to the Telugu language. This paper highlights the work carried out in an attempt to build a voice-enabled text editor with automatic term suggestion. The main claim of the paper is the recognition enhancement process we developed for highly inflecting, morphologically rich languages. This method results in increased speech recognition accuracy with a substantial reduction in corpus size. It also adds Telugu words to the database dynamically, resulting in growth of the corpus.
Cloud-Based Environmental Impact Assessment Expert System – A Case Study of Fiji
Environmental impact assessments [EIA] involve identifying, measuring, and assessing impacts. This complex process deals with considerable amount of information and requires processing and analysis of quantitative data, qualitative information as well as expert human judgements. Often, available information is incomplete, subjective, and inconsistent. This challenge of collecting, processing, analyzing, and reporting EIA information can be met by computer systems. A Cloud-based Environmental Impact Assessment [EIA] system is proposed in this paper to overcome the many challenges faced by practitioners. Fiji’s EIA process is used as a case study. The steps involved in the process are automated as a sequence of computer executable programs with Expert System. Based on the information provided about projects, the EIA system is expected to compute environmental impacts and produce Environment Impact Statements. With the system, a user enters information about the environmental settings in which the development project is expected to take place as well as the proposed development project activities. Based on the input, an expert system with an inference engine uses rules to check the knowledge base and report on possible impacts and mitigation actions. The knowledge base is connected to databases on domain experts, GIS and simulation models.
Real Time Blinking Detection Based on Gabor Filter
A new method of blinking detection is proposed. The most important requirement for a blinking detection method is robustness against different users, noise, and changes of eye shape. In this paper, we propose a blinking detection method that measures the distance between the two arcs of the eye (upper part and lower part). We detect the eye arcs by applying a Gabor filter to the eye image. As we know, the Gabor filter has an advantage in image processing applications since it is able to extract spatially localized spectral features, so shapes such as lines and arches are more easily detected. After the two eye arcs are detected, we measure the distance between them using a connected labeling method. An open eye is marked when the distance between the two arcs is more than a threshold; otherwise, a closed eye is marked when the distance is less than the threshold. The experimental results show that our proposed method is sufficiently robust against different users, noise, and eye shape changes, with perfect accuracy.
Impact of Defence Offsets On The Companies of The Participating Industry - A ...
Knowledge about the connection between equipment purchases and offset obligations is almost nonexistent in many areas of the economy. Requests for such offsets occur primarily in the area of arms imports and cover the full range of benefits that firms provide to buying governments as inducements for the purchase of military equipment. For companies participating for the first time in such offset programs, information on the effects of offsets is very limited. It is therefore necessary to provide information about the impact of offsets on the companies of the participating industry. This examination was triggered by an overall research project on the impact of offsets on the business processes of SMEs. During the pre-study necessary for this research project, first indications appeared that the impact of offsets is often not known by the affected companies. The purpose of this paper is to analyze the generic impact of offsets on the affected companies with the help of a case study examination. The data for this examination were obtained from secondary sources. After data collection, an analysis was performed on the chosen case studies: Switzerland and Malaysia. This analysis shows that offsets have a wide-ranging impact on the companies.
General Principles of User Interface Design and Websites
User interfaces have undergone a major transformation since the 1970s; all this was possible because of advances in HCI and related technologies. The principles of user interface design have contributed much to the change that we see in present-day user interfaces, and predominantly the web interfaces of various websites. This paper presents the various general principles of user interface design and their relevance for present-day web interfaces with a full-length analysis. Each principle is investigated over five different types of web interfaces with 30 different websites per type. The various properties that contribute to the principles have been investigated thoroughly and their statistical values are reported in their entirety.
Design Auto Adjust Sliding Surface Slope: Applied to Robot Manipulator
The main goal of this paper is to present nonlinear methods for controlling robot manipulators, together with the related results. The important role of the sliding surface slope in sliding mode fuzzy control of a robot manipulator is also considered. The sliding mode controller (SMC) is a significant nonlinear controller for systems with certain and uncertain dynamic parameters. To solve the chattering phenomenon, this paper combines two methods: the boundary layer method and applying fuzzy logic within the sliding mode methodology. Since the sliding surface slope also plays an important role in removing chattering, this paper focuses on auto-tuning this important coefficient to obtain the best results, using a model-free methodology. The auto-tuning methodology shows acceptable performance in the presence of uncertainty (e.g., overshoot = 0%, rise time = 0.8 s, steady-state error = 1e-9 and RMS error = 0.0001632).
V/F Control of Squirrel Cage Induction Motor Drives Without Flux or Torque Me...
Based on the popular constant volts-per-hertz principle, two improvement techniques are presented: keeping the maximum torque constant or keeping the magnetic flux constant. An open-loop inverter/three-phase squirrel-cage induction motor drive system that provides constant maximum torque, or increased maximum torque and reduced slip speed, at frequencies below the nominal frequency has been modeled, simulated and tested. A load performance analysis of the proposed system under different operating conditions is provided. These principles of operation are extended to the case of operation under variable-frequency or variable-voltage control. Finally, the effects of non-sinusoidal voltage and/or current wave shapes are covered. The results show that both suggested improvement techniques (constant torque or constant flux) improve the steady-state performance of an A.C. drive system with squirrel-cage induction motors. The slip speed is decreased and the starting torque and maximum torque are increased, which means that the suggested control techniques can be used in drive systems with a short-time operating mode under light loads.
Anonymization techniques are used to ensure the privacy preservation of the data owners, especially for personal and sensitive data. While in most cases, data reside inside the database management system; most of the proposed anonymization techniques operate on and anonymize isolated datasets stored outside the DBMS. Hence, most of the desired functionalities of the DBMS are lost, e.g., consistency, recoverability, and efficient querying. In this paper, we address the challenges involved in enforcing the data privacy inside the DBMS. We implement the k-anonymity algorithm as a relational operator that interacts with other query operators to apply the privacy requirements while querying the data. We study anonymizing a single table, multiple tables, and complex queries that involve multiple predicates. We propose several algorithms to implement the anonymization operator that allow efficient non-blocking and pipelined execution of the query plan. We introduce the concept of k-anonymity view as an abstraction to treat k-anonymity (possibly, with multiple k preferences) as a relational view over the base table(s). For non-static datasets, we introduce the materialized k-anonymity views to ensure preserving the privacy under incremental updates. A prototype system is realized based on PostgreSQL with extended SQL and new relational operators to support anonymity views. The prototype system demonstrates how anonymity views integrate with other privacy-preserving components, e.g., limited retention, limited disclosure, and privacy policy management. Our experiments, on both synthetic and real datasets, illustrate the performance gain from the anonymity views as well as the proposed query optimization techniques under various scenarios.
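A minimal k-anonymity sketch (not the paper's relational operator) shows the underlying guarantee: generalise a quasi-identifier until every combination of its values occurs at least k times. The age-banding hierarchy and toy rows are invented for illustration.

```python
# Sketch: generalise the "age" quasi-identifier level by level until
# every quasi-identifier value occurs at least k times.
from collections import Counter

def generalize_age(age, level):
    """Level 0: exact age; level 1: 10-year band; level 2: suppressed."""
    if level == 0:
        return age
    if level == 1:
        lo = age // 10 * 10
        return f"{lo}-{lo + 9}"
    return "*"

def k_anonymize(rows, k):
    """Return (generalised rows, level used) at the lowest level meeting k."""
    for level in range(3):
        gen = [dict(r, age=generalize_age(r["age"], level)) for r in rows]
        counts = Counter(r["age"] for r in gen)
        if all(counts[r["age"]] >= k for r in gen):
            return gen, level
    return gen, 2

rows = [{"age": 23, "disease": "flu"},
        {"age": 27, "disease": "cold"},
        {"age": 29, "disease": "flu"},
        {"age": 45, "disease": "asthma"},
        {"age": 41, "disease": "flu"}]
anon, level = k_anonymize(rows, k=2)
print(level, sorted({r["age"] for r in anon}))  # 1 ['20-29', '40-49']
```

The paper's contribution is running this kind of logic inside the DBMS, as a pipelined query operator over views, rather than over an exported dataset as here.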
The Process of Information extraction through Natural Language Processing
Information Retrieval (IR) is the discipline that deals with retrieval of unstructured data, especially textual documents, in response to a query or topic statement, which may itself be unstructured, e.g., a sentence or even another document, or which may be structured, e.g., a boolean expression. The need for effective methods of automated IR has grown in importance because of the tremendous explosion in the amount of unstructured data, both internal, corporate document collections, and the immense and growing number of document sources on the Internet. The topics covered include: formulation of structured and unstructured queries and topic statements, indexing (including term weighting) of document collections, methods for computing the similarity of queries and documents, classification and routing of documents in an incoming stream to users on the basis of topic or need statements, clustering of document collections on the basis of language or topic, and statistical, probabilistic, and semantic methods of analyzing and retrieving documents. Information extraction from text has therefore been pursued actively as an attempt to present knowledge from published material in a computer-readable format. An automated extraction tool would not only save time and effort, but also pave the way to discovering hitherto unknown information implicitly conveyed in the text. Work in this area has focused on extracting a wide range of information such as chromosomal location of genes, protein functional information, associating genes by functional relevance and relationships between entities of interest. While clinical records provide a semi-structured, technically rich data source for mining information, the publications, in their unstructured format, pose a greater challenge, addressed by many approaches.
Named Entity Recognition for Telugu Using Conditional Random Field
Named Entity (NE) recognition is a task in which proper nouns and numerical information are extracted from documents and classified into predefined categories such as person names, organization names, location names and miscellaneous (dates and others). It is a key technology for Information Extraction, Question Answering systems, Machine Translation, Information Retrieval, etc. This paper reports on the development of an NER system for Telugu using Conditional Random Fields (CRF). Though this state-of-the-art machine learning technique has been widely applied to NER in several well-studied languages, its use for the Telugu language is very new. The system makes use of different contextual information about the words along with a variety of features that are helpful in predicting the four different named entity (NE) classes: person name, location name, organization name and miscellaneous (dates and others). Keywords: Named entity, Conditional Random Field, NE, CRF, NER, named entity recognition
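The CRF model itself is beyond a short sketch, but the kind of contextual feature extraction such a system relies on can be illustrated directly; the feature names below are conventional examples, not the paper's actual feature set, and the sentence is invented.

```python
# Sketch: per-token contextual features of the kind fed to a CRF for NER
# (surface form, affixes, capitalisation, neighbouring words).
def word_features(tokens, i):
    w = tokens[i]
    return {
        "word": w.lower(),
        "is_capitalized": w[0].isupper(),
        "is_digit": w.isdigit(),
        "prefix3": w[:3].lower(),
        "suffix3": w[-3:].lower(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

tokens = "Ravi lives in Hyderabad".split()
f = word_features(tokens, 3)
print(f["is_capitalized"], f["prev_word"])  # True in
```

A CRF learns weights over such features jointly with label transitions (e.g. that a capitalised word after "in" is likely a location), which is what distinguishes it from per-token classifiers.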
Eat it, Review it: A New Approach for Review Prediction
Deep learning has achieved significant improvements in various machine learning tasks. Nowadays,
the Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) have been increasing in popularity for
text sequence tasks, i.e. word prediction. The ability to abstract information from images or text is being widely
adopted by organizations around the world. A basic task in deep learning is classification, be it of images or text.
Current trending techniques such as RNN and CNN have proven that such techniques open the door for data analysis.
Emerging technologies such as Region CNN and Recurrent CNN have been under consideration for this analysis;
Recurrent CNN is still under active development. The proposed system uses a Recurrent Neural
Network for review prediction. LSTM is also used along with the RNN so as to predict long sentences. This system
focuses on context-based review prediction and will provide full-length sentences. This will help users write proper
reviews by understanding their context.
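The RNN/LSTM model itself is too large to sketch here; as a point of contrast, here is the classic bigram-count baseline for context-based next-word prediction that such models improve upon. The toy review corpus is invented.

```python
# Baseline for context-based next-word prediction: bigram counts.
# This is NOT the proposed RNN/LSTM model, just the classic baseline.
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Map each word to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            model[prev][nxt] += 1
    return model

def predict_next(model, word):
    nxt = model.get(word.lower())
    return nxt.most_common(1)[0][0] if nxt else None

corpus = ["the food was great",
          "the food was cold",
          "the service was great"]
model = train_bigrams(corpus)
print(predict_next(model, "was"))  # great (2 of 3 continuations)
```

An LSTM improves on this by conditioning on the whole preceding sentence rather than a single previous word, which is what enables full-length sentence generation.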
Rule-based Information Extraction from Disease Outbreak Reports
Information extraction (IE) systems serve as the front end and core stage in different natural language processing tasks. As IE has proved its efficiency in domain-specific tasks, this project focused on one domain: disease outbreak reports. Several reports from the World Health Organization were carefully examined to formulate the extraction tasks: named entities, such as disease name, date and location; the location of the reporting authority; and the outbreak incident. Extraction rules were then designed, based on a study of the textual expressions and elements found in the text appearing before and after the target text.
The experiment resulted in very high performance scores for all the tasks in general. The training corpora and the testing corpora were tested separately. The system performed with higher accuracy with entities and events extraction than with relationship extraction.
It can be concluded that the rule-based approach has been proven capable of delivering reliable IE, with extremely high accuracy and coverage results. However, this approach requires an extensive, time-consuming, manual study of word classes and phrases.
Text Mining is the technique that helps users to find out useful information from a large amount of text documents on the web or database. Most popular text mining and classification methods have adopted term-based approaches. The term based approaches and the pattern-based method describing user preferences. This review paper analyse how the text mining work on the three level i.e sentence level, document level and feature level. In this paper we review the related work which is previously done. This paper also demonstrated that what are the problems arise while doing text mining done at the feature level. This paper presents the technique to text mining for the compound sentences.
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...ijtsrd
Natural Language Processing NLP is the one of the major filed of Natural Language Generation NLG . NLG can generate natural language from a machine representation. Generating suggestions for a sentence especially for Indian languages is much difficult. One of the major reason is that it is morphologically rich and the format is just reverse of English language. By using deep learning approach with the help of Long Short Term Memory LSTM layers we can generate a possible set of solutions for erroneous part in a sentence. To effectively generate a bunch of sentences having equivalent meaning as the original sentence using Deep Learning DL approach is to train a model on this task, e.g. we need thousands of examples of inputs and outputs with which to train a model. Veena S Nair | Amina Beevi A ""Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Learning"" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3 | Issue-4 , June 2019, URL: https://www.ijtsrd.com/papers/ijtsrd23842.pdf
Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/23842/suggestion-generation-for-specific-erroneous-part-in-a-sentence-using-deep-learning/veena-s-nair
SEMI-SUPERVISED BOOTSTRAPPING APPROACH FOR NAMED ENTITY RECOGNITIONkevig
The aim of Named Entity Recognition (NER) is to identify references of named entities in unstructured documents, and to classify them into pre-defined semantic categories. NER often aids from added background knowledge in the form of gazetteers. However using such a collection does not deal with name variants and cannot resolve ambiguities associated in identifying the entities in context and associating them with predefined categories. We present a semi-supervised NER approach that starts with identifying named entities with a small set of training data. Using the identified named entities, the word and the context features are used to define the pattern. This pattern of each named entity category is used as a seed pattern to identify the named entities in the test set. Pattern scoring and tuple value score enables the generation of the new patterns to identify the named entity categories. We have evaluated the proposed system for English language with the dataset of tagged (IEER) and untagged (CoNLL 2003) named entity corpus and for Tamil language with the documents from the FIRE corpus and yield an average f-measure of 75% for both the languages.
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATAijistjournal
Ontologisms have been applied to many applications in recent years, especially on Sematic Web, Information Retrieval, Information Extraction, and Question and Answer. The purpose of domain-specific ontology is to get rid of conceptual and terminological confusion. It accomplishes this by specifying a set of generic concepts that characterizes the domain as well as their definitions and interrelationships. This paper will describe some algorithms for identifying semantic relations and constructing an Information Technology Ontology, while extracting the concepts and objects from different sources. The Ontology is constructed based on three main resources: ACM, Wikipedia and unstructured files from ACM Digital Library. Our algorithms are combined of Natural Language Processing and Machine Learning. We use Natural Language Processing tools, such as OpenNLP, Stanford Lexical Dependency Parser in order to explore sentences. We then extract these sentences based on English pattern in order to build training set. We use a random sample among 245 categories of ACM to evaluate our results. Results generated show that our system yields superior performance.
leewayhertz.com-Named Entity Recognition NER Unveiling the value in unstructu...KristiLBurns
NER is a process used in Natural Language Processing (NLP) where a computer program analyzes text to identify and extract important pieces of information, such as names of people, places, organizations, dates, and more. Employing NER allows a computer program to automatically recognize and categorize these specific pieces of information within the text.
STUDY OF NAMED ENTITY RECOGNITION FOR INDIAN LANGUAGESijistjournal
Named Entity Recognition is a prior task in Natural Language Processing. Named Entity Recognition is a sub task of information extraction and it identifies and classifies proper nouns in to its predefined categories such as person, location, organization, time, date etc. In this document the major focus is given on NER approaches and the work done till now for various languages to identify Named Entities is been discussed. Author have done comparative study to recognize named entity and identified that CRF approach proven best for Indian languages to identify named entity.
Similar to Domain Specific Named Entity Recognition Using Supervised Approach (20)
Domain Specific Named Entity Recognition Using Supervised Approach
Ashwini A. Shende, Avinash J. Agrawal & Dr. O. G. Kakde
International Journal of Computational Linguistics (IJCL), Volume (3) : Issue (1) : 2012 67
Domain Specific Named Entity Recognition Using
Supervised Approach
Ashwini A. Shende zashwini@rediffmail.com
Department of Computer Science & Engineering, RCOEM,
Rashtrasant Tukdoji Maharaj, Nagpur University
Nagpur, 440013, India
Avinash J. Agrawal avinashjagrawal@gmail.com
Department of Computer Science & Engineering, RCOEM,
Rashtrasant Tukdoji Maharaj, Nagpur University
Nagpur, 440013, India
Dr. O. G. Kakde ogkakde@vnit.ac.in
Visvesvaraya National Institute of Technology
Nagpur, 440010, India
Abstract
This paper introduces a Named Entity Recognition approach for a text corpus. Supervised
statistical methods are used to develop our system. Our system can be used to categorize NEs
belonging to the particular domain for which it is trained. As Named Entities appear in text
surrounded by contexts (words to the left or right of the NE), we focus on extracting NE contexts
from text and then performing statistical computation on them. We use an n-gram model for
extracting contexts from text. Our methodology first extracts the left and right tri-grams
surrounding NE instances in the training corpus and calculates their probabilities. All the
extracted tri-grams, along with their calculated probabilities, are then stored in a file. During
testing, the system detects unrecognized NEs in the testing corpus and categorizes them using
the tri-gram probabilities calculated at training time. The proposed system is made up of two
modules, Knowledge Acquisition and NE Recognition. The Knowledge Acquisition module
extracts the tri-grams surrounding NEs in the training corpus, and the NE Recognition module
performs the categorization of unrecognized NEs in the testing corpus.
Keywords: Named Entity, Supervised Machine learning, N-gram, Context Extraction, NE
Recognition
1. INTRODUCTION
The term “Named Entity” (NE) is frequently used in Information Extraction (IE) applications. It
was coined at the sixth Message Understanding Conference (MUC-6) which influenced IE
research in the 1990s. In defining IE tasks, people noticed that it is essential to recognize
information units such as names including person, organization, and location names, and numeric
expressions including time, date, money, and percentages. Identifying references to these entities
in text was acknowledged as one of IE’s important sub-tasks and was called “Named Entity
Recognition (NER).” Named Entity Recognition is involved in various areas of automatic Natural
Language Processing (NLP): document indexing, document annotation, translation, etc. It is a
fundamental step in various Information Extraction (IE) tasks.
1.1 Named Entity Recognition
The NER task consists of identifying the occurrences of some predefined phrase types in a text.
In the expression “Named Entity,” the word “Named” aims to restrict the task to only those
entities for which one or many rigid designators stand for the referent. Some tasks related to
NER (David Nadeau et al. [1]) can be listed as follows.
• Personal Name Disambiguation :
It is the task of identifying the correct referent of a given designator. In a given context, it may
consist of identifying whether Jim Clark is the race driver, the film editor, or the Netscape founder.
Corpus-wide disambiguation of personal names has applications in document clustering for
information retrieval.
• NE Descriptions Identification :
It is the identification of textual passages that describe a given NE. For instance, Bill Clinton is
described as “the President of the U.S.,” “the democratic presidential candidate” or “an Arkansas
native,” depending on the document. Description identification can be used as a clue in personal
name disambiguation.
• Named Entity Translation :
It is the task of translating NEs from one language to another. For instance, the French translation
of “National Research Council Canada” is “Conseil national de recherches Canada.” NE
translation is acknowledged as a major issue in machine translation.
• Analysis of Name Structure :
It is the identification of the parts in a person name. For example, the name “Doctor Paul R.
Smith” is composed of a person title, a first name, a middle name, and a surname. It is presented
as a preprocessing step for NER and for the resolution of co-references, to help determine, for
instance, that “John F. Kennedy” and “President Kennedy” are the same person, while “John F.
Kennedy” and “Caroline Kennedy” are two distinct persons.
• Entity Anaphora Resolution :
It mainly consists of resolving pronominal co-reference when the antecedent is an NE. For
example, in the sentence “Rabi finished reading the book and he replaced it in the library,” the
pronoun “he” refers to “Rabi.” Anaphora resolution can be useful in solving the NER problem
itself by enabling the use of extended co-reference networks. Meanwhile it has many applications
of its own, such as in “question answering” (e.g., answering “Who put the book in the library?”).
• Acronym Identification :
It is described as the identification of an acronym’s definition (e.g., “IBM” stands for “International
Business Machines”) in a given document. The problem is related to NER because many
organization names are acronyms (GE, NRC, etc.). Resolving acronyms is useful, again, to build
co-reference networks aimed at solving NER. On its own, it can improve the recall of information
retrieval by expanding queries containing an acronym with the corresponding definition.
• Record linkage :
It is the task of matching named entities across databases. It involves the use of clustering and
string matching techniques in order to map database entries having slight variations. It is used in
database cleaning and in data mining on multiple databases.
• Case Restoration :
It consists of restoring expected word casing in a sentence. Given a lower case sentence, the
goal is to restore the capital letters usually appearing on the first word of the sentence and on
NEs. This task is useful in machine translation, where a sentence is usually translated without
capitalization information.
Computational research aiming at automatically identifying NEs in texts forms a vast and
heterogeneous pool of strategies, methods, and representations. In its canonical form, the input
of an NER system is a text and the output is information on boundaries and types of NEs found in
the text. The majority of NER systems fall into two categories: rule-based systems and
statistical systems. While early studies were mostly based on handcrafted rules, most recent
systems prefer statistical methods. In both approaches, large collections of documents are
analyzed by hand to obtain sufficient knowledge for designing rules or for feeding machine
learning algorithms. This substantial amount of work must be carried out by expert linguists,
which in turn limits the building and maintenance of large-scale NER systems.
The ability to recognize previously unknown entities is an essential part of NER systems. Such
ability hinges upon recognition and classification rules triggered by distinctive modeling features
associated with positive and negative examples. When training examples are not available,
handcrafted rules systems remain the preferred technique. The statistical methods collect
statistical knowledge from corpus and determine NE categories based on the statistical
knowledge. The statistical methods use supervised machine learning algorithms. The idea of
supervised learning is to study the features of positive and negative examples of NE over a large
collection of annotated documents and design rules that capture instances of a given type. The
main shortcoming of Supervised Learning is the requirement of a large annotated corpus. The
unavailability of such resources and the prohibitive cost of creating them lead to two alternative
learning methods: semi-supervised learning (SSL); and unsupervised learning (UL).
The term “semi-supervised” or “weakly supervised” is relatively recent. The main technique
for SSL is called “bootstrapping” and involves a small degree of supervision, such as a set of
seeds, for starting the learning process. For example, a system aimed at “disease names” might
ask the user to provide a small number of example names. Then, the system searches for
sentences that contain these names and tries to identify some contextual clues common to the
five examples. Then, the system tries to find other instances of disease names appearing in
similar contexts. The learning process is then reapplied to the newly found examples, so as to
discover new relevant contexts. By repeating this process, a large number of disease names and
a large number of contexts will eventually be gathered.
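The bootstrapping loop described above can be sketched as follows. This is a minimal illustration, not the method of any system cited here; the corpus, the seed list, and the left-context-only heuristic are our own assumptions.

```python
def bootstrap(corpus, seeds, rounds=2):
    """Semi-supervised bootstrapping: start from seed names, learn the
    contexts they appear in, then harvest new names from those contexts."""
    names, contexts = set(seeds), set()
    for _ in range(rounds):
        # step 1: collect contextual clues (here: the word preceding a known name)
        for i, tok in enumerate(corpus):
            if tok in names and i > 0:
                contexts.add(corpus[i - 1])
        # step 2: harvest new names appearing in a known context
        for i, tok in enumerate(corpus):
            if i > 0 and corpus[i - 1] in contexts:
                names.add(tok)
    return names

corpus = "patients with malaria and patients with cholera were treated".split()
found = bootstrap(corpus, {"malaria"})  # harvests "cholera" via the shared "with" context
```

Real bootstrappers use richer contexts (windows on both sides, pattern scoring) to avoid semantic drift; this sketch shows only the reapply-and-grow loop.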
The typical approach in unsupervised learning is clustering. For example, one can try to gather
NEs from clustered groups based on context similarity. There are other unsupervised methods
also. Basically, the techniques rely on lexical resources (e.g., WordNet), on lexical patterns, and
on statistics computed on a large unannotated corpus.
This paper discusses the use of a supervised machine learning approach for the problem of NE
recognition. The aim of our study is to reveal contextual NEs in a document corpus using n-gram
modeling. A context consists of the words surrounding the NE in the sentence in which it appears;
it is a sequence of words to the left or right of the NE. In this work, we use supervised learning
techniques, combined with statistical models, to extract contexts from a text document corpus
and to identify the most pertinent contexts for the recognition of an NE.
1.2 n-gram Modeling
A useful part of the knowledge needed for word prediction can be captured using simple
statistical techniques like the notion of the probability of a sequence (a phrase, a sentence). An
n-gram model is a type of probabilistic model for predicting the next item in a sequence. n-gram
probabilities can be used to estimate the likelihood
• of a word occurring in a context of the (n-1) preceding words
• of a sentence occurring at all
n-gram models are used in various areas of statistical natural language processing and genetic
sequence analysis. An n-gram language model uses the previous n-1 words in a sequence to
predict the next word. These models are trained using very large corpora. n-gram probabilities
come from a training corpus:
• an overly narrow corpus: probabilities don't generalize
• an overly general corpus: probabilities don't reflect the task or domain
A separate test corpus is used to evaluate the model, typically using standard metrics:
• held-out test set; development test set
• cross validation
• results tested for statistical significance
An n-gram is a subsequence of n items from a given sequence. The items can be phonemes,
syllables, letters, words or base pairs according to the application. An n-gram of size 1 is referred
to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram"; size
4 is a "four-gram"; and size 5 or more is simply called an "n-gram".
E.g. for the sequence “the big red ball”:
unigram P(ball)
bigram P(ball | red)
trigram P(ball | big red)
four-gram P(ball | the big red)
In general, an n-gram model estimates P(word | some fixed prefix). As we increase the value of n,
the accuracy of the n-gram model increases, since the choice of the next word becomes
increasingly constrained.
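These conditional probabilities are maximum-likelihood count ratios, which can be sketched directly (a minimal illustration using the toy sequence above; not the authors' implementation):

```python
from collections import Counter

def ngram_prob(tokens, context, word):
    """Maximum-likelihood estimate of P(word | context):
    count(context + word) / count(context)."""
    n = len(context) + 1
    # count all n-grams of the required order
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if n == 1:
        return grams[(word,)] / len(tokens)        # unigram: relative frequency
    ctxs = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    return grams[tuple(context) + (word,)] / ctxs[tuple(context)]

tokens = "the big red ball".split()
p_uni = ngram_prob(tokens, [], "ball")             # P(ball) = 1/4
p_bi  = ngram_prob(tokens, ["red"], "ball")        # P(ball | red)
p_tri = ngram_prob(tokens, ["big", "red"], "ball") # P(ball | big red)
```

On real corpora these raw ratios are smoothed to handle unseen n-grams; the sketch omits that.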
An n-gram is a sequence of n words in a text document, and one can obtain the set of n-grams by
moving a sliding window from the beginning to the end of the document. During n-gram extraction
from a text document, duplicate n-grams must be removed and the frequency of each n-gram type
calculated. Additionally, other values can be stored with the n-gram type and frequency,
e.g. an n-gram unique number, but this is document and query model dependent.
FIGURE 1, shows a common architecture of an n-gram extraction framework. This framework
usually includes:
1. Document parsing – it parses terms from input documents.
2. Term pre-processing – in this phase, various techniques like stemming and stop-list are applied
for the reduction of terms.
3. n-gram building and pre-processing – it creates an n-gram as a sequence of n terms.
Sometimes, n-grams are not shared across text units (sentences or paragraphs); this means the
last term of a sentence is the last term of an n-gram, and the next n-gram begins with the first
term of the next sentence.
4. n-gram extraction – the main goal of this phase is to remove duplicate n-grams. The result of
this phase is a collection of n-gram types with a frequency attached to each type. This
collection can be cleaned after this phase; for example, n-gram types with a low frequency can
be removed. However, this post-processing is not appropriate in every application; it can be
used only when we do not need low frequency n-gram types. A common part of such a
framework is n-gram indexing. A data structure is applied to speed up access to the tuple
⟨ngram, id, frequency⟩, where ngram is the key; that is, the ngram is the input of the query, and
the id and frequency form the output. Although it may be necessary to create other data
structures for specific document and query models, one must always consider this global
storage of the tuples.
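The parsing, pre-processing, and extraction phases of this framework can be sketched as follows. This is an illustrative sketch, not the authors' code; the regex tokenizer and the stop-list handling are our own assumptions.

```python
import re
from collections import Counter

def extract_ngrams(text, n=3, stoplist=()):
    """Parse terms, drop stop-list words, and collect n-gram types with
    frequencies, without letting n-grams cross sentence boundaries."""
    ngrams = Counter()
    for sentence in re.split(r"[.!?]+", text.lower()):
        # document parsing + term pre-processing (lowercasing, stop-list)
        terms = [t for t in re.findall(r"[a-z]+", sentence) if t not in stoplist]
        # n-gram building: slide a window of n terms over each sentence
        for i in range(len(terms) - n + 1):
            ngrams[tuple(terms[i:i + n])] += 1   # duplicates collapse into counts
    return ngrams

counts = extract_ngrams("The big red ball. The big red box.", n=3, stoplist={"the"})
# each sentence yields one trigram type after removing "the"
```

Low-frequency pruning and indexing by ⟨ngram, id, frequency⟩ would be applied on top of the returned `Counter`.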
FIGURE 1: n-gram Extraction Framework
The remainder of the paper is organized as follows: Section 2 reviews the various methods used
for Named Entity Recognition. Section 3 describes the methodology, and Section 4 gives the test
results of our approach. Section 5 gives the conclusion of this work, and Section 6 outlines the
recommended future work.
2. RELATED WORK
Named entity recognition can be used to perform numerous processing tasks in various areas:
Information Extraction systems, text mining, Automatic Speech Recognition (ASR), etc. Several
works have been particularly interested in the recognition of named entities.
Mikheev et al. [2] have built a system for recognizing named entities which combines a model
based on grammar rules with statistical models, without resorting to named entity lists.
Collins et al. [3] suggest an algorithm for named entity classification based on word sense
disambiguation, which exploits the redundancy in the contextual characteristics. This system
uses a large corpus to produce a generic list of proper nouns. The names are collected by
searching for a syntactic pattern with specific properties; for example, a proper name is a
sequence of consecutive words in a noun phrase, etc.
Petasis et al. [4] presented a method that helps to build a rule-based system for the recognition
and classification of named entities. They used machine learning to monitor system performance
and avoid manual marking.
Mann et al. [5] explore the idea of a fine-grained proper noun ontology and its use in question
answering. The ontology is built from unrestricted text using simple textual co-occurrence
patterns. This ontology is then used on a question answering task to provide preliminary results
on the utility of this information. However, this method has low coverage.
The Nemesis system presented by Fourour et al. [6] is founded on heuristics allowing the
identification of named entities and their classification by detecting the boundaries of the entity's
"context" to the left or right, and by studying the syntactic or morphological nature of these
entities. For example, acronyms are named entities consisting of a single lexical unit comprising
several capital letters, etc.
Krstev et al. [7] suggested the basic structure of a relational model of a multilingual dictionary of
proper names based on a four-level ontology.
Etzioni et al. [8] proposed the KNOWITALL system, which aims at automating the process of
extracting named entities from the Web in an unsupervised and scalable manner. This system is
not intended for recognizing a named entity in context, but is used to create long lists of named
entities. However, it is not designed to resolve the ambiguity in some documents.
Friburger et al. [9] recommend a rule-based method for finding a large proportion of person
names. However, this method has some limitations, such as errors and missing responses.
Nadeau et al. [10] have suggested a system for recognizing named entities. Their work is based
on those of Collins, and Etzioni. The system exploits human-generated HTML markup in Web
pages to generate gazetteers, then it uses simple heuristics for the entity disambiguation in the
context of a given document.
Kono Kim et al. [11] proposed an NE (Named Entity) recognition system using a semi-supervised
statistical method. At training time, the NE recognition system builds error-prone training data
using only a conventional POS (Part-Of-Speech) tagger and an NE dictionary that is
semi-automatically constructed. Then the NE recognition system generates a co-occurrence
similarity matrix from the error-prone training corpus. At running time, the NE recognition system
detects NE candidates and assigns categories to the NE candidates using Viterbi search on the
AWDs.
In view of the works addressing the recognition of named entities, we perceive that most of them
are based on a set of rules relating to predefined categories (morphological, grammatical, etc.)
or on predefined lists or dictionaries. The n-gram modeling domain is still under exploration. We
adopted the idea of Nemesis based on the left and right contexts of the named entity.
However, our approach does not mark contexts derived from syntactic or morphological rules,
but identifies contexts learned during a training phase. The objective is thus to build a system
able to induce the nature of a named entity without requiring dictionaries or lists of named
entities.
3. METHODOLOGY
This paper discusses the use of supervised machine learning approach for the problem of NE
recognition. The aim of our study is to reveal contextual NE in a document corpus using n-gram
modeling. A context consists of the words surrounding the NE in the sentence in which it appears;
it is a sequence of words to the left or right of the NE. In this work, we use supervised learning
techniques, combined with statistical methods, to extract contexts from the text document and to
identify the most pertinent contexts for the recognition of an NE.
Our work mainly focuses on context extraction, i.e. extracting the left and the right contexts of the
Named Entity. Two or more words that tend to occur in similar linguistic contexts (i.e. to have
similar co-occurrence patterns) tend to be positioned closer together in semantic space and tend
to resemble each other in meaning. Our objective is to build a system able to induce the nature
of a named entity from the contexts in which it is met.
FIGURE 2: Block diagram of Proposed System
FIGURE 2 shows the block diagram of the proposed system. The proposed system consists of
two modules. The first module is the Knowledge Acquisition module, which detects NE instances
in the training corpus. Then it extracts the left and right tri-grams surrounding those NE instances
and calculates their probability of occurrence in the training corpus. After calculating all the
probabilities, the extracted tri-grams along with their probabilities are stored in a text file for
reference. When the testing corpus is given for testing, the NE Recognition module finds all
unrecognized NE instances in it using the same method used in the Knowledge Acquisition
module. Then it classifies each
unrecognized NE instance in the testing corpus into one of the domain specific categories using
the tri-gram probabilities already stored in a file.
3.1 Knowledge Acquisition
The main function of this module is to extract the tri-grams surrounding NEs from a given
domain specific text document. The document acts as the training corpus for learning. Our
system's input is a tagged text document; in our corpus, all NEs have numerical tags.
Some sentences from our corpus are given below.
• when [q] vidarbha [1] express [n] reaches [v] wardha [2]
• what [q] is [x] the [d] status [n] of [p] mumbai [1] mail [n]
• what [q] is [x] departure [n] time [n] of [p] vidarbha [1] express [n]
• when [q] mumbai [1] mail [n] reaches [v] mumbai [2]
• what [q] is [x] the [d] position [n] of [p] the [d] gitanjali [1] express [n]
ALGORITHM:
• Locate all NEs in the training corpus.
• Extract the left and right trigrams surrounding each NE.
o If a trigram does not exist, extract a bigram.
o If a bigram does not exist, extract a unigram.
• Remove duplicate trigrams / bigrams / unigrams and calculate the probability of each in
the corpus.
• Store the unique trigrams / bigrams / unigrams along with their probabilities in a file.
The first step of our algorithm is to locate the Named Entities in each sentence by reading the
text corpus. NEs are the words that are followed by a numerical tag, e.g. “vidarbha”, “mumbai”,
and “wardha” are NE instances in the above examples. After locating the NEs, the surrounding
trigrams are extracted from the text corpus. A trigram is the 3 consecutive words to the left or
right of an NE. For efficiency, we extract both the left and right trigrams for each NE. The
following structure is used to store a trigram.
public class TriGramElement
{
    public String[] LeftElements = new String[3];   // words to the left of the NE
    public String[] RightElements = new String[3];  // words to the right of the NE
    public String CentreElement;                    // the NE itself
    public String[] LeftValue = new String[3];
    public String[] RightValue = new String[3];
    public String CentreValue;
}
Not every NE occurrence can guarantee the presence of a trigram surrounding it, especially if
the NE occurs as the first or last word of a sentence in the corpus. In such cases our system is
flexible enough to consider either a bigram or a unigram. E.g. for the NE “vidarbha” the left
context is a unigram and the right context consists of a trigram; for “mumbai”, the left context is a
trigram and the right context is a unigram; and for “hawrah” both the left and right contexts
consist of trigrams. Some sample extracted trigrams from the corpus are given below.
when vidarbha express reaches wardha
the status of mumbai mail
by what time hawrah mail will come
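The context extraction with its trigram-to-bigram-to-unigram fallback can be sketched as follows (a minimal illustration in Python mirroring TriGramElement's left/right arrays; not the authors' implementation):

```python
def extract_context(tokens, i, max_n=3):
    """Return (left, right) contexts for the NE at position i, falling back
    from trigram to bigram to unigram when the NE is too close to a
    sentence boundary."""
    for n in range(max_n, 0, -1):          # try trigram, then bigram, then unigram
        if i - n >= 0:
            left = tokens[i - n:i]
            break
    else:
        left = []                          # NE is the first word: no left context
    for n in range(max_n, 0, -1):
        if i + n < len(tokens):
            right = tokens[i + 1:i + 1 + n]
            break
    else:
        right = []                         # NE is the last word: no right context
    return left, right

sent = "when vidarbha express reaches wardha".split()
# "vidarbha" (index 1): left falls back to a unigram, right is a full trigram
ctx = extract_context(sent, 1)   # -> (['when'], ['express', 'reaches', 'wardha'])
```

This reproduces the behavior described above: for “vidarbha” the left context is a unigram while the right context is a trigram.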
The next step of the algorithm is to remove duplicate n-grams. Removal of duplicate trigrams is
necessary before applying statistical methods. For the probability calculation we need the
occurrence count of each trigram. Our system generates a list of unique trigrams and stores
them in a text file along with their probabilities. The sample trigrams stored in a text file are
shown below.
1 ,,,when : wardha,reaches,express --> 0.02
1 ,the,status,of : is,what,mail --> 0.02
1 ,position,of,the : is,what,express --> 0.02
3 ,the,fare,from : what,gondia,to --> 0.02
FIGURE 3: List of sample trigrams stored in a file
3.2 NE Categorization
After detecting unrecognized NEs, the NE recognition module assigns categories to them using
the trigram probabilities calculated by Knowledge acquisition module.
ALGORITHM:
• Detect an unrecognized NE instance in the testing corpus.
• Extract the left and right trigrams for it.
o If a trigram does not exist, extract a bigram.
o If a bigram does not exist, extract a unigram.
• For every unrecognized NE instance in the testing corpus, search for the left trigram /
bigram / unigram in the list stored in the file (generated from the training corpus) using
linear search.
• If no match is found, search for the right trigram / bigram / unigram in the list.
• If no match is found for either the left or the right trigram / bigram / unigram, mark the
corresponding NE as unrecognized.
• Find the category of the maximum probability trigram / bigram / unigram match.
• Assign the maximum probability category to the unrecognized NE.
• Repeat the above steps for all unrecognized NE instances in the testing corpus.
• Store the NE categorization results in a file.
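The matching step of this algorithm, with its left-context preference and maximum-probability tie-breaking, can be sketched as follows (an illustrative Python sketch; the stored-entry format and the example row are our own assumptions, not the paper's actual data):

```python
def categorize(left, right, entries):
    """Assign a category to an unrecognized NE.
    entries: list of (context_tuple, probability, category) rows, as stored
    in the training file. The left context is preferred over the right."""
    for ctx in (tuple(left), tuple(right)):
        # linear search over the stored rows, as in the algorithm above
        matched = [(p, cat) for c, p, cat in entries if c == ctx]
        if matched:
            return max(matched)[1]       # category with maximum probability
    return None                          # no match: NE remains unrecognized

# hypothetical stored row; probability and category are illustrative
entries = [(("are", "of", "type"), 0.02, "train name")]
category = categorize(["are", "of", "type"], [], entries)
```

A fuller version would also retry with bigram and unigram contexts before giving up, following the fallback in the algorithm.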
The NE categorization module first extracts all NE instances from the testing corpus by applying
the same method used in the Knowledge Acquisition module. We assume that the testing corpus
is a tagged corpus in which all unrecognized NEs are marked with the tag [0]. After detecting the
NEs, the NE categorization module creates a list of unrecognized NE instances. For each NE
stored in the list, the left and right context words are extracted from the testing corpus in the form
of trigrams.
To categorize an NE, our system compares its left context words with the tri-gram entries
(generated from the training corpus) stored in the file, using a linear search algorithm. Our
system prefers left context words over right context words, as the left context is more relevant for
recognition than the right context. If a match is found, its probability value is extracted. After
checking all the entries, the NE categorization module compares the probabilities of all matched
entries and finds the maximum among them. The unrecognized NE is then classified into the
matched category for which the probability is maximum. In the absence of left trigrams, right
trigrams are considered for matching. Our system is flexible enough to use bigrams as well as
unigrams in the absence of trigrams for categorizing NEs. After categorizing all NEs, the
categorization results are stored in a text file.
Consider the following sentence from the testing corpus:
how [q] many [u] trains [n] are [x] of [p] type [n] doronto [0]
In the above sentence, the word with tag [0], i.e. “doronto”, is detected as an unrecognized NE.
The next step is to find the context words to the left and right of the NE. The extracted tri-gram
for “doronto” is
are of type doronto --- ---- -----
In this case the left trigram consists of 3 words, whereas the right tri-gram is null, as “doronto” is
the last word of the sentence. Our algorithm gives precedence to the left context words, so it
searches the tri-gram entries stored in the file for a match for the tri-gram “are of type”. The
match is found with probability 0.02 and the category type is train name, so “doronto” is
categorized as a train name and the result is stored in a text file.
4. EXPERIMENTAL RESULTS
4.1 Test Collections
To evaluate the performance of the proposed system, we used a test collection from the Railway
Reservation domain. The testing corpus is a collection of routine railway enquiries containing
domain specific NE categories like train names, source and destination names, reservation
classes, etc. The categories are labeled with numerical tags in the testing corpus. We believe
that the preliminary experiments are meaningful, as our goal is to recognize NE categories with
supervised statistical methods.
4.2 Performance Evaluation
Since any NER system or method must produce a single, unambiguous output for any Named
Entity in the text, the evaluation is not based on a system architecture in which Named Entity
Recognition would be completely handled as a preprocess to sentence and discourse analysis.
The task requires that the system recognize what a NE represents, not just its superficial
appearance and the answer may have to be obtained using techniques that draw information
from a larger context or from reference lists.
A scoring model developed for the MUC and Named Entity Task evaluations measures both
precision (P) and recall (R), terms borrowed from the information-retrieval community. These two
measures of performance combine to form one measure of performance, the F-measure, which
is computed as the uniformly weighted harmonic mean of precision and recall.
To evaluate performance of the proposed system, we used the performance measures like
precision, recall and the F-score. Precision (p) is the proportion of correct responses out of
10. Ashwini A. Shende, Avinash J. Agrawal
International Journal of Computational Linguistics (IJCL)
returned NE categories, and Recall (r) is the proportion of returned NE categories out of
classification targets. Following graph shows the performance measure results of our system
FIGURE
5. CONCLUSION
We proposed a NE recognition system using Supervised Statistical methods. Our goal is to
uncover Named Entity in a document corpus. NE occurs frequently accompanied by contexts: i.e.
sequence of words, that are left or right of the NE. In training time, th
all NE instances from a given domain specific text document. Then, the proposed system
generates a list of unique tri
probability occurrence for each. This information is
During testing, this information is referred to identify most pertinent contexts for the categorization
of unrecognized NEs from the testing corpus. This enables to derive a model for NE recognition.
In the preliminary experiments on Railway Reservation domain the proposed system showed
90.04% average F-score measure
Recall and precision are usually admitted parameters for measuring system performance in the
NER field.
Precision = ( No. Of correct respo
Recall = (No. Of correct responses )
F- measure = Precision × Recall
For the NER task it is observed that, though the hand-made rule-based approach can achieve a high rate of results in a specific domain, it has problems with broad and new domains. As hand-made rule-based methods are domain dependent, machine learning-based methods are the best domain-independent solution for NER. Machine learning methods can achieve good precision and recall with high portability, so they can be the best independent and portable solution for text mining and especially NER. However, the high performance of this kind of method depends on the amount of training data: such an approach can achieve high precision in recognition when the amount of training data is huge, but the results drop sharply when training data is scarce or the algorithm malfunctions. Hybrid methods gave good results, but the portability of this type of approach is reduced when they improve recognition precision by using a huge number of fixed rules. Though traditionally rule-based systems were more popular, nowadays the machine learning approach is preferred for developing NER systems. TABLE 1 shows the comparison of the results obtained from the proposed system with the existing systems.
Ashwini A. Shende, Avinash J. Agrawal & Dr. O. G. Kakde, International Journal of Computational Linguistics (IJCL), Volume (3) : Issue (1) : 2012
    System                                            Precision   Recall   F-Score
 1  Proposed System                                      93.68     86.40     90.04
 2  NYU System (Rule based)                              90        86        88.19
 3  IsoQuest, Inc (Rule based)                           93        90        91.60
 4  MENE (Machine learning based)                        96        89        92.20
 5  Association Rule Mining (Machine learning based)     83.43     66.34     70.16
 6  IdentiFinder (Machine learning based)                92        89        90.44
 7  LTG (hybrid)                                         95        92        93.39
 8  NYU Hybrid (hybrid)                                  93        85        88.80

TABLE 1: Comparison of proposed system results with the existing systems
All the methods and models developed for the NER task have tried to improve precision in the recognition module and portability across recognition domains, since one of the major problems and difficulties in NER systems is changing and switching over to a new domain, known as portability. The most distinguishing feature of the proposed system is that it is easily portable to a new domain, as it is based on a supervised machine learning approach. The proposed system uses an n-gram model for extracting NE contexts, which also contributes to its portability across multiple domains. It is not required to maintain large gazetteer lists, as NE recognition in our system is context based: context is extracted from the corpus itself (training as well as testing) and does not depend on gazetteer lists. Based on the experimental results, it can be said that the proposed system is a good solution to the NER problem, as it is capable of recognizing NEs from the given domain corpora dynamically without maintaining huge NE dictionaries or gazetteer lists.
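To make the context-based categorization concrete, the test-time lookup might be sketched as follows. The scoring rule (summing the stored tri-gram probabilities per category) is our simplification of the statistical computation described above, and the trained probabilities in the example are hypothetical.

```python
def categorize_ne(left_trigram, right_trigram, probabilities):
    """Pick the NE category whose stored context probabilities best
    match the tri-grams observed around an unrecognized NE.

    probabilities -- dict mapping (side, trigram, category) to the
                     probability computed at training time.
    """
    scores = {}
    for (side, trigram, category), prob in probabilities.items():
        observed = left_trigram if side == "L" else right_trigram
        if tuple(observed) == tuple(trigram):
            scores[category] = scores.get(category, 0.0) + prob
    # Highest-scoring category wins; None signals no matching context.
    return max(scores, key=scores.get) if scores else None

# Hypothetical trained probabilities for a railway-domain corpus.
trained = {
    ("L", ("a", "ticket", "from"), "SOURCE"): 0.4,
    ("L", ("a", "ticket", "from"), "DESTINATION"): 0.1,
    ("R", ("to", "Mumbai", "tomorrow"), "SOURCE"): 0.2,
}
```

Because the lookup uses only contexts drawn from the corpus itself, no gazetteer list is consulted at any point.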
6. FUTURE WORK
Though primarily we have applied the proposed approach to the NER problem, it is not restricted to that problem only. The proposed approach can be applied to solve many problems in the Natural Language Processing domain; it can be used in various research areas like machine translation, question answering systems, etc.
As we have stated, the proposed system is portable in nature, so we need to use our system across diverse domains and analyze its performance on each of them.
Our future work recommendations are as follows.
• To test the system on different domain corpora,
• To discern and to measure similarity between contexts. We can use this measurement to
cluster similar contexts.
• Though we have primarily applied our approach to the NER problem, we can also attempt some additional concepts.