Automatically finding domain-specific key terms in a given set of research papers is a challenging task, and locating the papers that belong to a particular area of research is a concern for many people, including students, professors and researchers. A domain classification of papers facilitates that search process: given a list of domains in a research field, we try to find out to which domain(s) a given paper is most related. Besides, reading a whole paper takes a long time. Using domain knowledge usually requires much human effort, e.g., manually composing a seed set or labeling a large corpus. In this paper, we instead use the abstract and keywords of a research paper as the seed terms to identify similar terms from a domain corpus, which are then filtered by checking their appearance in the research papers. Experiments show that the TF-IDF measure and the classification step make this method assign terms to domains more precisely. The results show that our approach can extract the terms effectively while remaining domain independent.
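The TF-IDF weighting the abstract relies on can be sketched as follows. This is a minimal sketch over tokenized documents, not the authors' exact pipeline; the toy corpus and the unsmoothed IDF formula are illustrative assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for each term in each tokenized document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [
    "neural networks learn representations".split(),
    "networks of documents".split(),
    "documents contain terms".split(),
]
w = tf_idf(docs)
# "neural" occurs in only one document, so it outweighs "networks",
# which occurs in two.
assert w[0]["neural"] > w[0]["networks"]
```

Terms concentrated in few documents get high weights, which is what lets TF-IDF surface domain-specific seed terms.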
SIMILAR THESAURUS BASED ON ARABIC DOCUMENT: AN OVERVIEW AND COMPARISON (IJCSEA Journal)
The massive growth of modern information retrieval systems (IRS), especially for natural languages, makes search more difficult, and search in Arabic, as a natural language, is not yet good enough. This paper tries to build a similarity thesaurus based on the Arabic language using two mechanisms, the first a full-word mechanism and the other a stemmed mechanism, and then compares them. The comparison made by this study shows that the similarity thesaurus using the stemmed mechanism gets better results than the traditional approach under the same mechanism, and that the similarity thesaurus improves recall and precision over a traditional information retrieval system at all recall and precision levels.
The enormous amount of information stored in unstructured texts cannot simply be used for further processing by computers, which typically handle text as simple sequences of character strings. Therefore, specific (pre-)processing methods and algorithms are required in order to extract useful patterns. Text mining is the discovery of valuable, yet hidden, information from text documents. Text classification (also called text categorization), one of the important research issues in the field of text mining, assigns a text document to one of a set of predefined classes; it is necessary for organizing large collections of texts (documents) into specific classes. This paper covers different text classification techniques and also discusses classifier architecture and text classification applications.
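As an illustration of the task this survey covers, here is a minimal multinomial naive Bayes classifier with Laplace smoothing, one common baseline among text classification techniques. The labels and toy documents are invented for the example.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """Count class and term frequencies from (tokens, label) pairs."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, term_counts, vocab

def classify_nb(model, tokens):
    """Pick the class with the highest log-posterior under Laplace smoothing."""
    class_counts, term_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label in class_counts:
        lp = math.log(class_counts[label] / total_docs)  # log prior
        denom = sum(term_counts[label].values()) + len(vocab)
        for t in tokens:
            lp += math.log((term_counts[label][t] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train_nb([
    ("the team won the match".split(), "sports"),
    ("election results announced".split(), "politics"),
])
assert classify_nb(model, "match won".split()) == "sports"
```

The smoothing (+1 in the numerator, vocabulary size in the denominator) keeps unseen terms from zeroing out a class's probability.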
A template based algorithm for automatic summarization and dialogue management (eSAT Journals)
Abstract: This paper describes an automated approach for extracting significant and useful events from unstructured text. The goal of this research is to come up with a methodology that helps in extracting important events such as dates, places, and subjects of interest. It would also be convenient if the methodology helped in presenting users with a shorter version of the text that contains all non-trivial information. We also discuss our implementation of algorithms that do exactly this task. Key Words: Cosine Similarity, Information, Natural Language, Summarization, Text Mining
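The cosine similarity listed in the keywords above, the standard measure for comparing sentences in extractive summarization, can be computed over bag-of-words vectors as follows; the sample sentences are invented for the example.

```python
import math
from collections import Counter

def cosine_similarity(a_tokens, b_tokens):
    """Cosine similarity between two bag-of-words term-frequency vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

s1 = "the meeting is on monday".split()
s2 = "the meeting was moved to tuesday".split()
s3 = "rainfall totals broke records".split()
# Sentences sharing vocabulary score higher than unrelated ones.
assert cosine_similarity(s1, s2) > cosine_similarity(s1, s3)
assert abs(cosine_similarity(s1, s1) - 1.0) < 1e-9
```

A summarizer can rank sentences by their similarity to the document as a whole and keep the top-scoring ones.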
Answer extraction and passage retrieval for... (Waheeb Ahmed)
Question Answering systems (QASs) perform the task of retrieving text portions, from a collection of documents, that contain the answer to the user's questions. These QASs use a variety of linguistic tools that are able to deal with small fragments of text. Therefore, to retrieve the documents that contain the answer from a large document collection, QASs employ Information Retrieval (IR) techniques to reduce the collection to a tractable amount of relevant text. In this paper, we propose a passage retrieval model that performs this task with better performance for the purpose of Arabic QASs. We first segment each of the top five ranked documents returned by the IR module into passages. Then, we compute the similarity score between the user's question terms and each passage. The top five passages (those with the highest similarity scores) are retrieved. Finally, answer extraction techniques are applied to extract the final answer. Our method achieved an average precision of 87.25%, recall of 86.2% and F1-measure of 87%.
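The pipeline described above (segment top-ranked documents into passages, score each against the question, keep the best) can be sketched as follows. The fixed passage size and the simple term-overlap score are assumptions for illustration, not the paper's actual similarity measure.

```python
def segment(doc, size=2):
    """Split a document (a list of sentences) into fixed-size passages."""
    return [doc[i:i + size] for i in range(0, len(doc), size)]

def passage_score(question_terms, passage):
    """Fraction of question terms that appear anywhere in the passage."""
    terms = set(w for sent in passage for w in sent.split())
    q = set(question_terms)
    return len(q & terms) / len(q) if q else 0.0

def top_passages(question, docs, k=5):
    """Score every passage of every document and return the k best."""
    question_terms = question.lower().split()
    scored = []
    for doc in docs:
        for p in segment(doc):
            scored.append((passage_score(question_terms, p), p))
    scored.sort(key=lambda x: x[0], reverse=True)
    return scored[:k]

docs = [["the nile is in egypt", "it is a long river",
         "cairo is the capital", "pyramids are nearby"]]
best = top_passages("where is the nile", docs, k=1)
assert "the nile is in egypt" in best[0][1]
```

In a real system the overlap score would be replaced by a weighted similarity (e.g. TF-IDF cosine), but the control flow is the same.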
This paper proposes a natural-language discourse analysis method for extracting information from news articles of different domains. The discourse analysis uses Rhetorical Structure Theory (RST), which finds the coherent groups of text that are most prominent for extracting information. RST uses the nucleus-satellite concept to find the most prominent text in a document. After the discourse analysis, text analysis is performed to extract domain-related objects and relate these objects. For extracting the information, a knowledge-based system is used that consists of a domain dictionary; the domain dictionary holds a bag of words for each domain. The system is evaluated against a gold-standard analysis and human judgment of the extracted information.
Information retrieval is an effective process that helps a user trace relevant information using Natural Language Processing (NLP). In this research paper, we present an algorithmic Bengali Information Retrieval System (BIRS) that is grounded mathematically and statistically. The paper demonstrates two algorithms for finding the lemmatization of Bengali words, Trie and Dictionary Based Search by Removing Affix (DBSRA), and compares them with Edit Distance for exact lemmatization. We present a Bengali anaphora resolution system using Hobbs' algorithm to obtain the correct expression of information. For the question answering stage, TF-IDF and Cosine Similarity are used to find the most accurate answer from the documents. In this study, we introduce a Bengali Language Toolkit (BLTK) and Bengali Language Expression (BRE) that ease the implementation of our task. We have also developed a Bengali root-word corpus, a synonym corpus and a stop-word corpus, and gathered 672 articles from the popular Bengali newspaper 'The Daily Prothom Alo' as our document collection. For testing the system, we created 19,335 questions over this collection and obtained 97.22% accurate answers.
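The Edit Distance comparison mentioned above is the classic Levenshtein dynamic program; a compact version, with a toy root-matching helper in the spirit of the lemmatization task (the example words and candidate roots are invented, and real Bengali lemmatization would of course operate on Bengali strings), looks like this:

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def best_root(word, roots):
    """Pick the candidate root with the smallest edit distance to the word."""
    return min(roots, key=lambda r: edit_distance(word, r))

assert edit_distance("kitten", "sitting") == 3
assert best_root("walked", ["walk", "weld", "wild"]) == "walk"
```

Dictionary-based methods like DBSRA avoid this quadratic comparison by stripping affixes directly, falling back to edit distance only for words the dictionary cannot resolve.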
Text mining is a technique that helps users find useful information in large numbers of text documents on the web or in databases. Most popular text mining and classification methods have adopted term-based approaches, while pattern-based methods describe user preferences instead. This review paper analyses how text mining works at three levels, i.e., the sentence level, the document level and the feature level, and reviews the related work done previously. The paper also discusses the problems that arise when text mining is done at the feature level, and presents a technique for text mining over compound sentences.
An Efficient Approach for Keyword Selection; Improving Accessibility of Web ... (dannyijwest)
General search engines often provide imprecise results even for detailed queries, so there is a vital need to elicit useful information, such as keywords, for search engines to provide acceptable results for users' search queries. Although many methods have been proposed for extracting keywords automatically, they all attempt to improve recall, precision and other criteria that describe how well the method has done its job as an author. This paper presents a new automatic keyword extraction method that improves the accessibility of web content to search engines. The proposed method defines coefficients determining feature efficiency and tries to optimize them using a genetic algorithm. Furthermore, it evaluates candidate keywords with a function that utilizes the results of search engines. Compared to other methods, experiments demonstrate that the proposed method achieves a higher score from search engines without noticeable loss of recall or precision.
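The genetic optimization of feature coefficients described above can be sketched as a simple evolutionary loop. The fitness function here is a hypothetical stand-in (in the paper it would be derived from search-engine scores), and population size, mutation rate and selection scheme are illustrative choices.

```python
import random

def genetic_optimize(fitness, n_coeffs, pop_size=30, generations=60, seed=0):
    """Evolve a vector of coefficients in [0, 1] that maximizes `fitness`."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(n_coeffs)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]              # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_coeffs)
            child = a[:cut] + b[cut:]                # one-point crossover
            if rng.random() < 0.2:                   # occasional mutation
                child[rng.randrange(n_coeffs)] = rng.random()
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

# Toy fitness: prefer coefficients close to a hypothetical ideal profile.
ideal = [0.9, 0.1, 0.5]
fitness = lambda c: -sum((x - y) ** 2 for x, y in zip(c, ideal))
best = genetic_optimize(fitness, 3)
assert fitness(best) > -0.1
```

Seeding the generator makes the run reproducible; in practice each fitness evaluation would be an expensive query against the search engine, so the population and generation counts matter.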
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academicians, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
Elevating forensic investigation system for file clustering (eSAT Journals)
Abstract: In a computer forensic investigation, thousands of files are usually examined. Much of the data in those files consists of unstructured text, which is very hard for computer examiners to analyse. Clustering is the unsupervised organization of patterns (data items, observations, or feature vectors) into groups (clusters), and finding a good solution for this automated method of analysis is of great interest. In particular, algorithms such as K-means, K-medoids, Single Link, Complete Link and Average Link can simplify the discovery of new and valuable information in the documents under investigation. This paper presents an approach that applies text clustering algorithms, using a multithreading technique, to the forensic examination of computers seized in police investigations. Keywords: clustering, forensic computing, text mining, multithreading.
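Of the algorithms the abstract lists, K-means is the simplest to sketch. Below is a plain Lloyd-style iteration on small 2-d points standing in for document feature vectors; the deterministic "first k points" initialization is a simplification (real implementations seed randomly or with k-means++).

```python
import math

def kmeans(points, k, iters=20):
    """Plain k-means: assign each point to its nearest centroid, recompute."""
    centroids = list(points[:k])  # deterministic init: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        new_centroids = []
        for i, c in enumerate(clusters):
            if c:  # mean of the cluster's points, coordinate-wise
                new_centroids.append(tuple(sum(x) / len(c) for x in zip(*c)))
            else:  # keep an empty cluster's old centroid
                new_centroids.append(centroids[i])
        centroids = new_centroids
    return clusters, centroids

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
clusters, centroids = kmeans(points, 2)
# The two tight groups of points end up in separate clusters.
assert sorted(len(c) for c in clusters) == [2, 2]
```

K-medoids and the three linkage methods differ only in how cluster representatives and inter-cluster distances are defined; the assign-and-update loop is the shared core.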
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Dictionary based concept mining: an application for Turkish (csandit)
In this study, a dictionary-based method is used to extract expressive concepts from documents. So far there have been many studies concerning concept mining in English, but this area of study for Turkish, an agglutinative language, is still immature. We used a dictionary instead of WordNet, the lexical database grouping words into synsets that is widely used for concept extraction. Dictionaries are rarely used in the domain of concept mining, but taking into account that dictionary entries have synonyms, hypernyms, hyponyms and other relationships in their definition texts, the success rate for determining concepts has been high. This concept extraction method is applied to documents collected from different corpora.
The complexity of Arabic morphology is a core challenge in Arabic information retrieval. It makes Arabic a difficult environment for information retrieval specialists because the language is highly inflected. This paper explains the importance of stemming in information retrieval by developing a method to search for information needs in Arabic text. The newly developed method is based on a light stemmer to increase the matching of keywords between the query and the related documents in the test collection.
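Light stemming, as opposed to root extraction, just strips a small set of common affixes. A toy sketch using a few affixes in the spirit of the well-known Light10 list follows; the exact affix set and minimum-length rule used by the paper's stemmer are not given here, so these are illustrative.

```python
# A few common Arabic prefixes/suffixes (illustrative subset, longest first).
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ات", "ون", "ين", "ية", "ة", "ه", "ي"]

def light_stem(word, min_len=3):
    """Strip at most one prefix and one suffix, keeping a minimum stem length."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= min_len:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= min_len:
            word = word[:-len(s)]
            break
    return word

# "the book" and "her book" reduce to the same stem "book".
assert light_stem("الكتاب") == "كتاب"
assert light_stem("كتابها") == "كتاب"
```

Stemming both the query and the indexed documents this way is what increases keyword matching despite Arabic's rich inflection.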
Architecture of an ontology based domain-specific natural language question answering system (IJwest)
Question answering (QA) systems aim at retrieving precise information from a large collection of documents against a query. This paper describes the architecture of a Natural Language Question Answering (NLQA) system for a specific domain based on ontological information, a step towards semantic-web question answering. The proposed architecture defines four basic modules suitable for enhancing current QA capabilities with the ability to process complex questions. The first module is question processing, which analyses and classifies the question and also reformulates the user query. The second module retrieves the relevant documents. The next module processes the retrieved documents, and the last module performs the extraction and generation of a response. Natural language processing techniques are used for processing the question and documents and also for answer extraction. Ontology and domain knowledge are used for reformulating queries and identifying the relations. The aim of the system is to generate a short and specific answer to a question asked in natural language in a specific domain. We achieved 94% accuracy of natural language question answering in our implementation.
Semantic tagging for documents using 'short text' information (csandit)
Tagging documents with relevant and comprehensive keywords offers invaluable assistance to readers, letting them quickly overview any document. With the ever-increasing volume and variety of documents published on the internet, interest in developing newer and more successful techniques for annotating (tagging) documents is also increasing. However, an interesting challenge in document tagging occurs when the full content of the document is not readily accessible. In such a scenario, techniques which use "short text", e.g., a document title or a news article headline, to annotate the entire article are particularly useful. In this paper, we propose a novel approach to automatically tag documents with relevant tags or key-phrases using only "short text" information from the documents. We employ crowd-sourced knowledge from Wikipedia, DBpedia, Freebase, Yago and similar open-source knowledge bases to generate semantically relevant tags for the document. Using the intelligence from the open web, we prune out tags that create ambiguity in, or "topic drift" from, the main topic of our query document. We have used a real-world dataset from a corpus of research articles to annotate 50 research articles. As a baseline, we used the full-text information from the documents to generate tags. The proposed and baseline approaches were compared using the author-assigned keywords for the documents as the ground-truth information. We found that the tags generated using the proposed approach are better than the baseline's in terms of overlap with the ground-truth tags measured via the Jaccard index (0.058 vs. 0.044). In terms of computational efficiency, the proposed approach is at least 3 times faster than the baseline approach. Finally, we qualitatively analyse the quality of the predicted tags for a few samples in the test corpus. The evaluation shows the effectiveness of the proposed approach both in terms of the quality of the tags generated and the computational time.
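The Jaccard index used for the 0.058 vs. 0.044 comparison above is just set overlap over set union; the sample tag sets below are invented for the example.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| between two tag sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

predicted = {"text mining", "tagging", "wikipedia"}
ground_truth = {"tagging", "wikipedia", "semantic web", "annotation"}
score = jaccard(predicted, ground_truth)
# 2 shared tags out of 5 distinct tags overall.
assert abs(score - 2 / 5) < 1e-9
```

Averaging this score over all documents against the author-assigned keywords gives exactly the kind of corpus-level figure reported in the abstract.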
Extraction of Data Using Comparable Entity Mining (iosrjce)
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Natural Language Processing is a programmed approach to analyzing text that is based on both a set of theories and a set of technologies. This forum aims to bring together researchers who have designed and built software that will analyze, understand, and generate the languages that humans use naturally to address computers.
A novel approach for text extraction using effective pattern matching technique (eSAT Journals)
Abstract
Many data mining techniques have been proposed for mining useful patterns from documents. Still, how to effectively use and update discovered patterns remains an open research question, especially in the field of text mining. As most existing text mining methods have adopted term-based approaches, they all suffer from the problems of polysemy (a word has multiple meanings) and synonymy (multiple words have the same meaning). People have long hypothesized that pattern-based approaches should perform better than term-based ones, but many experiments do not support this hypothesis. This paper presents an innovative and effective pattern discovery technique which includes the processes of pattern deploying and pattern matching to improve the effective use of discovered patterns.
Keywords: Pattern Mining, Pattern Taxonomy Model, Inner Pattern Evolving, TF-IDF, NLP etc.
Arabic text categorization algorithm using vector evaluation method (ijcsit)
Text categorization is the process of grouping documents into categories based on their contents. This process is important for making information retrieval easier, and it has become more important due to the huge amount of textual information available online. The main problem in text categorization is how to improve classification accuracy. Although Arabic text categorization is a new and promising field, there is little research in it. This paper proposes a new method for Arabic text categorization using vector evaluation. The proposed method uses a categorized corpus of Arabic documents; the weights of a tested document's words are then calculated to determine the document's keywords, which are compared with the keywords of the corpus categories to determine the tested document's best category.
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGES (ijnlc)
A large amount of information lies dormant in historical documents and manuscripts, and it would go to waste if not stored in digital form. Searching for relevant information in these scanned images would ideally require converting the document images to text by optical character recognition (OCR). For the indigenous scripts of India, there are very few OCRs that can successfully recognize printed text images of varying quality, size, style and font. An alternative approach using word spotting can be effective for accessing large collections of document images. We propose a word spotting technique based on codes for matching the word images of Devanagari script. Shape information is utilised to generate integer codes for the words in a document image, and these codes are matched for the final retrieval of relevant documents. The technique is illustrated using Marathi document images.
A Novel Approach for Keyword extraction in learning objects using text mining (IJSRD)
Keyword extraction and concept finding in learning objects are very important subjects in today's eLearning environment. Keywords are the subset of words that carry useful information about the content of a document, and keyword extraction is the process used to obtain the important keywords from documents. In the proposed system, a decision tree algorithm is used for the feature selection process together with the WordNet dictionary. WordNet is a lexical database of English which is used to measure similarity between the candidate words; the words having the highest similarity are taken as keywords.
In recent years the growth of digital data has increased dramatically, and knowledge discovery and data mining have attracted immense attention given the need to turn such data into useful information and knowledge. Keyword extraction is considered an essential task in natural language processing (NLP) that facilitates mapping documents to a concise set of representative single- and multi-word phrases. This paper investigates the use of Word2Vec and a Decision Tree for keyword extraction from textual documents. The SemEval (2010) dataset is used as the main input for the proposed study. After applying pre-processing operations to the dataset, the words are represented as vectors with the Word2Vec technique. The method is based on the word similarity between candidate keywords, combining the keywords collected for each label with one sample from the same label. An appropriate threshold is determined, and the percentages that exceed this threshold are passed to the Decision Tree in order to assign an appropriate classification to the text document.
Some similarity measures were used for the classification process. The efficiency and accuracy of the algorithm were measured in the classification process using precision, recall and F-score. The obtained results indicate that using a vector representation for each keyword is an effective way to identify the most similar words, so that the chance of recognizing the correct classification of a document increases. When using Word2Vec CBOW, the F-score was 64% with the Gini method and the WordNet lemmatizer. Meanwhile, when using Word2Vec SG, the F-score was 82% with the Gini index and the English Porter stemmer, which was the highest ratio in all our experiments.
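The threshold step described above, keeping only candidates whose embedding is similar enough to a label's representative vector, can be sketched with plain cosine similarity. The 3-d vectors below are hypothetical stand-ins for real Word2Vec embeddings (which are typically 100-300 dimensions), and the 0.7 threshold is an illustrative choice.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similar_keywords(candidate_vecs, label_vec, threshold=0.7):
    """Keep candidates whose similarity to the label vector beats the threshold."""
    return [w for w, v in candidate_vecs.items() if cosine(v, label_vec) > threshold]

# Hypothetical embeddings: two sports-like words, one unrelated word.
vecs = {"football": [0.9, 0.1, 0.0],
        "soccer":   [0.8, 0.2, 0.1],
        "economy":  [0.0, 0.1, 0.9]}
assert similar_keywords(vecs, [1.0, 0.0, 0.0]) == ["football", "soccer"]
```

The surviving candidates (or their similarity scores) are what would then be fed to the Decision Tree as features.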
DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC... (cscpconf)
On-line text documents rapidly increase in number with the growth of the World Wide Web. To manage such a huge amount of text, several text mining applications have come into existence. Applications such as search engines, text categorization, summarization and topic detection are based on feature extraction; since it is an extremely time-consuming and difficult task to extract keywords or features manually, an automated process that extracts keywords or features needs to be established. This paper proposes a new domain keyword extraction technique that includes a new weighting method on the basis of the conventional TF-IDF. Term frequency-inverse document frequency is widely used to express a document's feature weights, but it cannot reflect the distribution of terms across documents, and therefore cannot reflect the degree of significance or the differences between categories. This paper proposes a new weighting method in which a new weight is added to express the differences between domains on the basis of the original TF-IDF. The extracted features represent the content of the text better and have better distinguishing power.
Information Retrieval System is an effective process that helps a user to trace relevant information by Natural Language Processing (NLP). In this research paper, we have presented present an algorithmic Information Retrieval System(BIRS) based on information and the system is significant mathematically and statistically. This paper is demonstrated by two algorithms for finding out the lemmatization of Bengali words such as Trie and Dictionary Based Search by Removing Affix (DBSRA) as well as compared with Edit Distance for the exact lemmatization. We have presented the Bengali Anaphora resolution system using the Hobbs’ algorithm to get the correct expression of information. As the actions of questions answering algorithms, the TF-IDF and Cosine Similarity are developed to find out the accurate answer from the documents. In this study, we have introduced a Bengali Language Toolkit (BLTK) and Bengali Language Expression (BRE) that make the easiest implication of our task. We have also developed Bengali root word’s corpus, synonym word’s corpus, stop word’s corpus and gathered 672 articles from the popular Bengali newspapers ‘The Daily Prothom Alo’ which is our inserted information. For testing this system, we have created 19335 questions from the introduced information and got 97.22% accurate answer.
Text Mining is the technique that helps users to find out useful information from a large amount of text documents on the web or database. Most popular text mining and classification methods have adopted term-based approaches. The term based approaches and the pattern-based method describing user preferences. This review paper analyse how the text mining work on the three level i.e sentence level, document level and feature level. In this paper we review the related work which is previously done. This paper also demonstrated that what are the problems arise while doing text mining done at the feature level. This paper presents the technique to text mining for the compound sentences.
An Efficient Approach for Keyword Selection ; Improving Accessibility of Web ...dannyijwest
General search engines often provide low precise results even for detailed queries. So there is a vital need
to elicit useful information like keywords for search engines to provide acceptable results for user’s search
queries. Although many methods have been proposed to show how to extract keywords automatically, all
attempt to get a better recall, precision and other criteria which describe how the method has done its job
as an author. This paper presents a new automatic keyword extraction method which improves accessibility
of web content by search engines. The proposed method defines some coefficients determining features
efficiency and tries to optimize them by using a genetic algorithm. Furthermore, it evaluates candidate
keywords by a function that utilizes the result of search engines. When comparing to the other methods,
experiments demonstrate that by using the proposed method, a higher score is achieved from search
engines without losing noticeable recall or precision.
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
Elevating forensic investigation system for file clusteringeSAT Journals
Abstract In computer forensic investigation, thousands of files are usually surveyed. Much of the data in those files consists of formless manuscript, whose investigation by computer examiners is very tough to accomplish. Clustering is the unverified organization of designs that is data items, remarks, or feature vectors into groups (clusters). To find a noble clarification for this automated method of analysis are of great interest. In particular, algorithms such as K-means, K-medoids, Single Link, Complete Link and Average Link can simplify the detection of new and valuable information from the documents under investigation. This paper is going to present an tactic that applies text clustering algorithms to forensic examination of computers seized in police investigations using multithreading technique for data clustering. Keywords- Clustering, forensic computing, text mining, multithreading.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Dictionary based concept mining an application for turkishcsandit
In this study, a dictionary-based method is used to extract expressive concepts from documents.
So far, there have been many studies concerning concept mining in English, but this area of
study for Turkish, an agglutinative language, is still immature. We used dictionary instead of
WordNet, a lexical database grouping words into synsets that is widely used for concept
extraction. The dictionaries are rarely used in the domain of concept mining, but taking into
account that dictionary entries have synonyms, hypernyms, hyponyms and other relationships in
their meaning texts, the success rate has been high for determining concepts. This concept
extraction method is implemented on documents that are collected from different corpora.
The complex morphology of Arabic is a core challenge in Arabic information retrieval. Because Arabic is a highly inflected language, it is a difficult environment for information retrieval specialists. This paper explains the importance of stemming in information retrieval by developing a method to search for information needs in Arabic text. The newly developed method is based on a light stemmer to increase keyword matching between the query and the related documents in the test collection.
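The light-stemming idea described above can be sketched as stripping a few common Arabic affixes before matching query and document keywords. The affix lists below are a small illustrative subset, not a complete light-stemmer inventory, and the matching function is our own minimal stand-in:

```python
# A few common Arabic prefixes/suffixes (illustrative subset only).
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ات", "ان", "ون", "ين", "ها", "ية", "ه", "ة", "ي"]

def light_stem(word, min_len=3):
    """Strip at most one prefix and one suffix, keeping a minimum stem length."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= min_len:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= min_len:
            word = word[:-len(s)]
            break
    return word

def match(query, document):
    """True if any stemmed query keyword appears among the stemmed document words."""
    q = {light_stem(w) for w in query.split()}
    d = {light_stem(w) for w in document.split()}
    return bool(q & d)
```

With this, الكتاب ("the book") and كتابي ("my book") reduce to the same stem كتاب, so a query on one matches the other even though the surface forms differ.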
Architecture of an ontology based domain-specific natural language question a...IJwest
Question answering (QA) systems aim at retrieving precise information from a large collection of documents against a query. This paper describes the architecture of a Natural Language Question Answering (NLQA) system for a specific domain based on ontological information, a step towards semantic web question answering. The proposed architecture defines four basic modules suitable for enhancing current QA capabilities with the ability to process complex questions. The first module is question processing, which analyses and classifies the question and also reformulates the user query. The second module retrieves the relevant documents. The next module processes the retrieved documents, and the last module performs the extraction and generation of a response. Natural language processing techniques are used for processing the question and documents and also for answer extraction. Ontology and domain knowledge are used for reformulating queries and identifying the relations. The aim of the system is to generate a short and specific answer to a question asked in natural language in a specific domain. We have achieved 94% accuracy of natural language question answering in our implementation.
Semantic tagging for documents using 'short text' informationcsandit
Tagging documents with relevant and comprehensive keywords offers invaluable assistance to readers to quickly overview any document. With the ever-increasing volume and variety of documents published on the internet, the interest in developing newer and successful techniques for annotating (tagging) documents is also increasing. However, an interesting challenge in document tagging occurs when the full content of the document is not readily accessible. In such a scenario, techniques which use "short text", e.g., a document title or a news article headline, to annotate the entire article are particularly useful. In this paper, we propose a novel approach to automatically tag documents with relevant tags or key-phrases using only "short text" information from the documents. We employ crowd-sourced knowledge from Wikipedia, DBpedia, Freebase, Yago and similar open source knowledge bases to generate semantically relevant tags for the document. Using the intelligence from the open web, we prune out tags that create ambiguity in or "topic drift" from the main topic of our query document. We have used a real-world dataset from a corpus of research articles to annotate 50 research articles. As a baseline, we used the full-text information from the document to generate tags. The proposed and the baseline approaches were compared using the author-assigned keywords for the documents as the ground-truth information. We found that the tags generated using the proposed approach are better than those using the baseline in terms of overlap with the ground-truth tags measured via the Jaccard index (0.058 vs. 0.044). In terms of computational efficiency, the proposed approach is at least 3 times faster than the baseline approach. Finally, we qualitatively analyse the quality of the predicted tags for a few samples in the test corpus. The evaluation shows the effectiveness of the proposed approach both in terms of the quality of the tags generated and the computational time.
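The Jaccard index used above to compare predicted tags with author-assigned keywords is simply the size of the intersection over the size of the union of the two tag sets. A minimal sketch (the tag sets below are made up for illustration):

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| between two tag sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

ground_truth = {"text mining", "tagging", "wikipedia", "short text"}
predicted = {"tagging", "wikipedia", "knowledge base"}
score = jaccard(predicted, ground_truth)  # 2 shared tags / 5 distinct tags = 0.4
```

Scores like the paper's 0.058 vs. 0.044 look small in absolute terms because tag vocabularies are large, but the relative difference is what the comparison rests on.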
Extraction of Data Using Comparable Entity Miningiosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Natural Language Processing is a programmed approach to analyzing text that is based on both a set of theories and a set of technologies. This forum aims to bring together researchers who have designed and built software that will analyze, understand, and generate the languages that humans use naturally to address computers.
A novel approach for text extraction using effective pattern matching techniqueeSAT Journals
Abstract
Many data mining techniques have been proposed for mining useful patterns from documents. Still, how to effectively use and update discovered patterns remains open for future research, especially in the field of text mining. As most existing text mining methods adopt term-based approaches, they all suffer from the problems of polysemy (words having multiple meanings) and synonymy (multiple words having the same meaning). People have long held the hypothesis that pattern-based approaches should perform better than term-based ones, but many experiments do not support this hypothesis. This paper presents an innovative and effective pattern discovery technique which includes the processes of pattern deploying and pattern matching to improve the effective use of discovered patterns.
Keywords: Pattern Mining, Pattern Taxonomy Model, Inner Pattern Evolving, TF-IDF, NLP etc.
Arabic text categorization algorithm using vector evaluation methodijcsit
Text categorization is the process of grouping documents into categories based on their contents. This
process is important to make information retrieval easier, and it became more important due to the huge
textual information available online. The main problem in text categorization is how to improve the classification accuracy. Although Arabic text categorization is a new and promising field, there is little research in it. This paper proposes a new method for Arabic text categorization using vector evaluation. The proposed method uses a corpus of categorized Arabic documents; the weights of the tested document's words are then calculated to determine the document keywords, which are compared with the keywords of the corpus categories to determine the tested document's best category.
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGESijnlc
Large amounts of information lie dormant in historical documents and manuscripts. This information would go to waste if not stored in digital form. Searching for relevant information in these scanned images would ideally require converting the document images to text by optical character recognition (OCR). For the indigenous scripts of India, there are very few OCRs that can successfully recognize printed text images of varying quality, size, style and font. An alternative approach using word spotting can be effective for accessing large collections of document images. We propose a word spotting technique based on codes for matching the word images of the Devanagari script. Shape information is used to generate integer codes for the words in the document image, and these codes are matched for the final retrieval of relevant documents. The technique is illustrated using Marathi document images.
A Novel Approach for Keyword extraction in learning objects using text miningIJSRD
Keyword extraction and concept finding in learning objects are very important subjects in today's eLearning environment. Keywords are a subset of words that carry useful information about the content of the document. Keyword extraction is the process used to obtain the important keywords from documents. In the proposed system, a decision tree algorithm is used for the feature selection process with the WordNet dictionary. WordNet is a lexical database of English which is used to find the similarity of the candidate words; the words with the highest similarity are taken as keywords.
In recent years the growth of digital data has been increasing dramatically, and knowledge discovery and data mining have attracted immense attention with the emerging need to turn such data into useful information and knowledge. Keyword extraction is considered an essential task in natural language processing (NLP) that facilitates mapping documents to a concise set of representative single- and multi-word phrases. This paper investigates the use of Word2Vec and a Decision Tree for keyword extraction from textual documents. The SemEval (2010) dataset is used as the main input for the proposed study. After applying pre-processing operations on the dataset, the words are represented by vectors with the Word2Vec technique. The method is based on word similarity between candidate keywords, drawing both on the keywords collected for each label and on one sample from the same label. An appropriate threshold is determined, and the candidates whose similarity exceeds this threshold are passed to the Decision Tree so that an appropriate classification can be assigned to the text document.
Several similarity measures were used for the classification process. The efficiency and accuracy of the algorithm were measured using precision, recall and F-score. The obtained results indicate that using a vector representation for each keyword is an effective way to identify the most similar words, so that the chance of recognizing the correct classification of the document increases. When using Word2Vec CBOW, the F-score was 64% with the Gini method and the WordNet Lemmatizer. Meanwhile, when using Word2Vec SG, the F-score was 82% with the Gini index and the English Porter stemmer, the highest ratio across all our experiments.
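The "similarity above a threshold" step described above is typically cosine similarity between word vectors. The sketch below uses made-up two-dimensional stand-ins for real Word2Vec embeddings, and the function and variable names are our own:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def candidates_above(threshold, candidate_vecs, label_vec):
    """Keep candidate keywords whose similarity to the label's
    representative vector exceeds the chosen threshold."""
    return [w for w, v in candidate_vecs.items() if cosine(v, label_vec) > threshold]

# Made-up 2-D stand-ins for real (e.g. 100+ dimensional) Word2Vec embeddings:
vecs = {"network": [0.9, 0.1], "protocol": [0.8, 0.3], "banana": [0.0, 1.0]}
label_vec = [1.0, 0.2]
selected = candidates_above(0.5, vecs, label_vec)
```

In the paper's pipeline, the surviving candidates (here "network" and "protocol") are what gets handed to the Decision Tree classifier.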
DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC...cscpconf
On-line text documents rapidly increase in size with the growth of the World Wide Web. To manage such a huge amount of text, several text mining applications have come into existence. Applications such as search engines, text categorization, summarization, and topic detection are based on feature extraction. It is an extremely time-consuming and difficult task to extract keywords or features manually, so an automated process that extracts keywords or features needs to be established. This paper proposes a new domain keyword extraction technique that includes a new weighting method built on conventional TF-IDF. Term frequency-inverse document frequency is widely used to express a document's feature weights, but it cannot reflect the distribution of terms across the documents, and therefore cannot reflect the significance of a term or the differences between categories. This paper proposes a new weighting method in which a new weight is added, on top of the original TF-IDF, to express the differences between domains. The extracted features represent the content of the text better and have better discriminating power.
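To make the idea concrete, here is conventional TF-IDF with a hypothetical domain-difference factor bolted on. The paper's actual weighting formula is not given in this summary, so the `domain_weight` definition below (an add-one-smoothed in-domain/out-of-domain frequency ratio) is purely illustrative:

```python
import math

def tf_idf(term, doc, corpus):
    """Conventional TF-IDF over tokenised documents (smoothed IDF)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / (1 + df)) + 1
    return tf * idf

def domain_weight(term, domain_docs, other_docs):
    """Hypothetical domain factor: how much more frequent the term is in the
    target domain than outside it (add-one smoothed). Not the paper's formula."""
    inside = sum(d.count(term) for d in domain_docs) + 1
    outside = sum(d.count(term) for d in other_docs) + 1
    return inside / outside

chem = [["alcohol", "compound", "reaction"], ["alcohol", "hydroxyl", "carbon"]]
news = [["election", "vote", "turnout"]]
score = tf_idf("alcohol", chem[0], chem + news) * domain_weight("alcohol", chem, news)
```

A term like "alcohol" that is common in the chemistry documents but absent from the news documents gets boosted, which is the kind of between-domain discrimination plain TF-IDF misses.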
Using data mining methods knowledge discovery for text miningeSAT Journals
Abstract Many data mining techniques have been proposed for mining useful patterns in text documents. However, how to effectively use and update discovered patterns is still an open research issue, especially in the domain of text mining. Since most existing text mining methods adopted term-based approaches, they all suffer from the problems of polysemy and synonymy. Over the years, people have often held the hypothesis that pattern (or phrase)-based approaches should perform better than the term-based ones, but many experiments do not support this hypothesis. Proposed work presents an innovative and effective pattern discovery technique which includes the processes of pattern deploying and pattern evolving, to improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information. Keywords:-Text mining, text classification, pattern mining, pattern evolving, information filtering.
Feature selection, optimization and clustering strategies of text documentsIJECEIAES
Clustering is one of the most researched areas of data mining applications in the contemporary literature. The need for efficient clustering is observed across wide sectors including consumer segmentation, categorization, collaborative filtering, document management, and indexing. Research into the clustering task must be performed before adapting it to the text environment. Conventional approaches typically emphasized quantitative information, where the selected features are numbers. Efforts have also been made towards efficient clustering in the context of categorical information, where the selected features can assume nominal values. This manuscript presents an in-depth analysis of the challenges of clustering in the text environment. Further, this paper also details prominent models proposed for clustering, along with the pros and cons of each model. In addition, it focuses on various recent developments in the clustering task in social networks and associated environments.
Novel Database-Centric Framework for Incremental Information Extractionijsrd.com
Information extraction (IE) has been an active research area that seeks techniques to uncover information from a large collection of text. IE is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. In most cases this activity concerns processing human language texts by means of natural language processing (NLP). Recent activities in document processing, like automatic annotation and content extraction, can be seen as information extraction. Many applications call for methods to enable automatic extraction of structured information from unstructured natural language text. Due to the inherent challenges of natural language processing, most existing methods for information extraction from text tend to be domain specific. This project proposes a new paradigm for information extraction. In this extraction framework, the intermediate output of each text processing component is stored, so that only an improved component has to be re-deployed to the entire corpus. Extraction is then performed on both the previously processed data from the unchanged components and the updated data generated by the improved component. Performing such incremental extraction can result in a tremendous reduction of processing time, and there is a mechanism to generate extraction queries from both labeled and unlabeled data. Query generation is critical so that casual users can specify their information needs without learning the query language.
A Novel Method for Keyword Retrieval using Weighted Standard Deviation: “D4 A...idescitation
The Genetic Algorithm (GA) has been a successful method for extracting keywords. This paper presents a full method by which keywords can be derived from various corpora. We have built equations that exploit the structure of the documents from which the keywords need to be extracted. The procedure is broken into two distinct profiles: one weighs the words in the whole document content, and the other explores the possibilities of the occurrence of key terms using a genetic algorithm. The basic equations of the heuristic mechanism are varied to allow the complete exploitation of the document. The genetic algorithm and the enhanced standard deviation method are used to full potential to enable the generation of the key terms that describe the given text document. The new technique has enhanced performance and better time complexity.
Text categorization is a term that has intrigued researchers for quite some time. It is the concept of sorting news articles into specific groups, cutting down the effort of manually categorizing news articles into particular groups. A growing number of statistical classification and machine learning techniques have been applied to text categorization. This paper is based on the automatic categorization of news articles by clustering with the k-means algorithm. The goal of this paper is to automatically categorize news articles into groups. Our paper mostly concentrates on k-means for clustering, and for term frequency a TF-IDF dictionary is applied for categorization. This is done using Mahout as the platform.
Survey on Key Phrase Extraction using Machine Learning ApproachesYogeshIJTSRD
The automated keyword extraction task is to define a collection of representative terms for a text. Extracting keywords yields a small collection of terms, key phrases and keywords that define the document's context. Keyword search allows large document collections to be searched effectively. To allocate suitable key phrases to new documents, text categorization techniques can be applied. A predefined collection of key phrases, from which all key phrases for new documents are selected, is given in the training documents. The training data for each key phrase comprise a collection of documents associated with it. Standard machine learning techniques are used for each key phrase to construct a classifier from the training materials, using the documents relevant to it as positive examples and the rest as negative examples. Given a new text, it is processed by the classifier of each key phrase. Preeti Sondhi | Aakib Jabbar "Survey on Key Phrase Extraction using Machine Learning Approaches" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-5 | Issue-3, April 2021, URL: https://www.ijtsrd.com/papers/ijtsrd39890.pdf Paper URL: https://www.ijtsrd.com/other-scientific-research-area/other/39890/survey-on-key-phrase-extraction-using-machine-learning-approaches/preeti-sondhi
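The one-classifier-per-key-phrase scheme described above can be sketched with a nearest-centroid stand-in for the "standard machine learning techniques" the survey leaves unspecified; the training data and function names below are our own toy choices:

```python
from collections import Counter

def bow(text):
    return Counter(text.lower().split())

def centroid(docs):
    """Normalised word-frequency profile of a document collection."""
    c = Counter()
    for d in docs:
        c.update(bow(d))
    total = sum(c.values())
    return {w: n / total for w, n in c.items()}

def score(text, cent):
    return sum(cent.get(w, 0.0) for w in bow(text))

def assign_keyphrases(text, training, margin=0.0):
    """One classifier per key phrase: documents tagged with the phrase are the
    positive class, all remaining training documents the negative class."""
    phrases = []
    all_docs = [d for docs in training.values() for d in docs]
    for phrase, pos_docs in training.items():
        neg_docs = [d for d in all_docs if d not in pos_docs]
        if score(text, centroid(pos_docs)) > score(text, centroid(neg_docs)) + margin:
            phrases.append(phrase)
    return phrases

training = {
    "databases": ["sql query index table", "relational database index"],
    "networks": ["tcp packet router", "router network packet loss"],
}
result = assign_keyphrases("sql index tuning", training)
```

Each key phrase gets an independent yes/no decision, so a document can legitimately receive several phrases, or none.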
Current multi-document summarization systems can successfully extract summary sentences, but with many limitations, including low coverage, inaccurate extraction of important sentences, redundancy, and poor coherence among the selected sentences. The present study introduces a new concept of the centroid approach and reports new techniques for extracting summary sentences from multiple documents. In both techniques, keyphrases are used to weigh sentences and documents. The first summarization technique (Sen-Rich) prefers maximum-richness sentences, while the second (Doc-Rich) prefers sentences from the centroid document. To demonstrate the application of the new summarization system to extracting summaries of Arabic documents, we performed two experiments. First, we applied the ROUGE measure to compare the new techniques with the systems presented at TAC2011. The results show that Sen-Rich outperformed all systems in ROUGE-S. Second, the system was applied to summarize multi-topic documents. Using human evaluators, the results show that Doc-Rich is superior, with summary sentences characterized by extra coverage and more cohesion.
Clustering the results of a search helps the user get an overview of the information returned. In this paper, we look upon the clustering task as cataloguing the search results. By catalogue we mean a structured label list that can help the user understand the labels and search results. Cluster labelling is crucial because meaningless or confusing labels may mislead users into checking the wrong clusters for the query and losing extra time. Additionally, labels should accurately reflect the contents of the documents within the cluster. To label clusters effectively, a new cluster labelling method is introduced. Emphasis was placed on producing comprehensible and accurate cluster labels in addition to discovering document clusters. We also present a new metric employed to assess the success of cluster labelling. We adopt a comparative evaluation strategy to derive the relative performance of the proposed method with respect to two prominent search result clustering methods: Suffix Tree Clustering and Lingo. We perform the experiments using the publicly available datasets Ambient and ODP-239.
Similar to 6.domain extraction from research papers (20)
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Dr.Costas Sachpazis
Terzaghi's soil bearing capacity theory, developed by Karl Terzaghi, is a fundamental principle in geotechnical engineering used to determine the bearing capacity of shallow foundations. This theory provides a method to calculate the ultimate bearing capacity of soil, which is the maximum load per unit area that the soil can support without undergoing shear failure. The calculation HTML code is included.
Immunizing Image Classifiers Against Localized Adversary Attacksgerogepatton
This paper addresses the vulnerability of deep learning models, particularly convolutional neural networks (CNNs), to adversarial attacks and presents a proactive training technique designed to counter them. We introduce a novel volumization algorithm, which transforms 2D images into 3D volumetric representations. When combined with 3D convolution and deep curriculum learning optimization (CLO), it significantly improves the immunity of models against localized universal attacks by up to 40%. We evaluate our proposed approach using contemporary CNN architectures and the modified Canadian Institute for Advanced Research (CIFAR-10 and CIFAR-100) and ImageNet Large Scale Visual Recognition Challenge (ILSVRC12) datasets, showcasing accuracy improvements over previous techniques. The results indicate that the combination of volumetric input and curriculum learning holds significant promise for mitigating adversarial attacks without necessitating adversarial training.
Hierarchical Digital Twin of a Naval Power SystemKerry Sado
A hierarchical digital twin of a Naval DC power system has been developed and experimentally verified. Similar to other state-of-the-art digital twins, this technology creates a digital replica of the physical system executed in real-time or faster, which can modify hardware controls. However, its advantage stems from distributing computational efforts by utilizing a hierarchical structure composed of lower-level digital twin blocks and a higher-level system digital twin. Each digital twin block is associated with a physical subsystem of the hardware and communicates with a singular system digital twin, which creates a system-level response. By extracting information from each level of the hierarchy, power system controls of the hardware were reconfigured autonomously. This hierarchical digital twin development offers several advantages over other digital twins, particularly in the field of naval power systems. The hierarchical structure allows for greater computational efficiency and scalability while the ability to autonomously reconfigure hardware controls offers increased flexibility and responsiveness. The hierarchical decomposition and models utilized were well aligned with the physical twin, as indicated by the maximum deviations between the developed digital twin hierarchy and the hardware.
About
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
• Remote control: Parallel or serial interface.
• Compatible with MAFI CCR system.
• Compatible with IDM8000 CCR.
• Compatible with Backplane mount serial communication.
• Compatible with commercial and Defence aviation CCR system.
• Remote control system for accessing CCR and allied system over serial or TCP.
• Indigenized local Support/presence in India.
• Easy configuration using DIP switches.
Overview of the fundamental roles in Hydropower generation and the components involved in wider Electrical Engineering.
This paper presents the design and construction of hydroelectric dams, from the hydrologist's survey of the valley before construction through all involved disciplines (fluid dynamics, structural engineering, generation and mains frequency regulation) to the transmission of power through the network in the United Kingdom.
Author: Robbie Edward Sayers
Collaborators and co editors: Charlie Sims and Connor Healey.
(C) 2024 Robbie E. Sayers
Using recycled concrete aggregates (RCA) for pavements is crucial to achieving sustainability. Implementing RCA for new pavement can minimize carbon footprint, conserve natural resources, reduce harmful emissions, and lower life cycle costs. Compared to natural aggregate (NA), RCA pavement has fewer comprehensive studies and sustainability assessments.
6th International Conference on Machine Learning & Applications (CMLA 2024)ClaraZara1
6th International Conference on Machine Learning & Applications (CMLA 2024) will provide an excellent international forum for sharing knowledge and results in the theory, methodology and applications of Machine Learning.
Journal of Science and Technology (JST)
Volume 2, Issue 4, April 2017, PP 42-50
www.jst.org.in ISSN: 2456 - 5660
Domain Extraction From Research Papers
Dr. R. Jayanthi 1, S. Sheela 2
1 Dr. Mrs. R. Jayanthi, Assistant Professor, PG and Research Department of Computer Science, Quaid E – Millath College for Women, Chennai, India.
2 Ms. S. Sheela, M.Phil Research Scholar, Computer Science, Quaid E – Millath College for Women, Chennai, India.
Abstract: Automatically finding domain-specific key terms from a given set of research papers is a challenging task, and relating research papers to a particular area of research is a concern for many people, including students, professors and researchers. A domain classification of papers facilitates that search process. That is, given a list of domains in a research field, we try to find out to which domain(s) a given paper is most related. Besides, reading and processing a whole paper takes a long time, and using domain knowledge requires much human effort, e.g., manually labeling a large corpus. In this paper, we use the abstract and keywords of a research paper as the seed terms to identify similar terms from a domain corpus, which are then filtered by checking their appearance in the research papers. Experiments show that the TF-IDF measure and the classification step make this method assign papers to their domains more precisely. The results show that our approach can extract the terms effectively, while being domain independent.
Keywords - Text Mining, Information Extraction, Domain keyword extraction, Term Frequency and Inverse
Document Frequency (TF – IDF)
I. INTRODUCTION
Domain-specific terms are terms that have significant meaning(s) in a specific domain [1]. We extract these terms from research papers. Here a “term” refers to a word or compound word representing a concept of a specific domain; e.g., in chemistry, “alcohol” is a term that refers to an organic compound in which the hydroxyl functional group is bound to a saturated carbon atom [2]. Terminology extraction approaches use rule-based techniques, supervised learning techniques, or a combination of the two, all of which rely on some domain knowledge. Acquiring such domain knowledge requires much human effort (e.g., manually labeling a large corpus) [3].
To overcome this, our approach uses semantic extraction, building knowledge from a large corpus. Semantic extraction refers to a range of processing techniques that identify and extract entities, facts, attributes, concepts, and events to populate metadata fields. Its purpose is to enable the analysis of semi-structured or unstructured content. Semantic extraction is usually based on three approaches:
1. Rule based: matching similar to entity extraction; this approach requires the support of one or more vocabularies.
2. Machine learning: a statistical analysis of the content; a potentially compute-intensive application that can benefit from patterns within the document corpus.
3. Hybrid solution: statistically driven, but enhanced by a vocabulary. This is typically the best approach if the content set is focused on a specific subject area.
The extracted information should be machine understandable as well as human understandable, in terms of the research paper and the set of domain corpora [4]. The statistical method proposed in this paper is based on three steps.
First, extract the abstract and keywords from the collection of research papers using a PDF IFilter. Second, find terms similar to a certain domain that occur frequently in a research paper. Third, count the words using a weighted scoring method; the domain with the highest count is the domain name of the particular research paper. TF-IDF is introduced as the weighting method to measure the value of the domain corpus. Users can easily find the domain name of a research paper, and papers can be saved in a separate folder for each domain, so users no longer waste time searching for research papers: the domain name makes that search fast.
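The three steps above can be sketched in a few lines of Python; the mini domain corpora and the seed-term splitting below are hypothetical stand-ins for the manually built corpora and the PDF extraction step described later in the paper:

```python
# Minimal sketch of the proposed pipeline, under the assumption that
# the abstract text and keywords are already extracted from the PDF.

DOMAIN_CORPUS = {
    "Text Mining": {"term", "corpus", "keyword", "extraction"},
    "Cryptography": {"cipher", "key", "encryption", "hash"},
}

def extract_seed_terms(abstract: str, keywords: list[str]) -> set[str]:
    """Step 1: use the abstract words and keywords as the seed terms."""
    words = {w.strip(".,").lower() for w in abstract.split()}
    return words | {k.lower() for k in keywords}

def domain_of(abstract: str, keywords: list[str]) -> str:
    """Steps 2-3: count seed-term overlap with each domain corpus and
    return the domain with the highest count (weighted scoring)."""
    seeds = extract_seed_terms(abstract, keywords)
    counts = {d: len(seeds & corpus) for d, corpus in DOMAIN_CORPUS.items()}
    return max(counts, key=counts.get)

print(domain_of("Keyword extraction from a text corpus.", ["term"]))  # Text Mining
```

In the paper the scoring is additionally weighted by TF-IDF; the plain overlap count here only illustrates the control flow.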
II. LITERATURE REVIEW
For the text mining task of domain extraction we have studied a few related papers. In this section we describe the different techniques, by different authors, that relate to domain extraction.
The authors of [2] note that existing terminology extraction approaches are mostly domain dependent: they use domain-specific linguistic rules or supervised machine learning techniques. In particular, they use the title words and the keywords in research papers as the seed terms and word2vec to identify similar terms from an open-domain corpus as the candidate terms, which are then filtered by checking their occurrence in research papers.
Rakhi Chakraborty explains in [5] that it is an extremely time-consuming and difficult task to extract keywords or features manually, so an automated process that extracts keywords or features needs to be established. That paper proposes a new domain keyword extraction technique that includes a new weighting method based on the conventional TF-IDF. Term frequency-inverse document frequency is widely used to express document feature weights, but it cannot reflect the distribution of terms within the document, and hence cannot reflect their degree of significance or the differences between categories.
Hospice Houngbo and Robert E. Mercer, in Method Mention Extraction from Scientific Research Papers, explain [6] that scientific publications contain many references to method terminologies used during scientific experiments. In that study they report their attempt to automatically extract such method terminologies from scientific research papers, using rule-based and machine learning techniques. They first used some linguistic features to extract fine-grained method sentences from a large biomedical corpus and then applied well-established methodologies to extract the method terminologies.
The authors of [7] observe that the computational linguistics community and its sub-fields have changed over the years with respect to their foci, methods used, and domain problems. They extract these characteristics by matching semantic extraction patterns, learned using bootstrapping, to the dependency trees of sentences in an article’s abstract.
In [8], the dynamics of a research community are studied by extracting information from its publications. Such information cannot be extracted using approaches that assume words are independent of each other in a document. The authors use dependency trees, which give rich information about the structure of a sentence, and extract relevant information from them by matching semantic patterns.
III. DOMAIN KEYWORD EXTRACTION TECHNIQUES
Domain Corpus
In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed). Corpora are used for statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory. This paper uses seven domain names, each with a large corpus.
Figure 1: Domain Corpus in Research Paper
Figure 1 shows the number of keywords manually set for each domain: Text Mining 51, Data Mining 66, Cryptography 56, Software Engineering 53, Big Data 57, Network Security 78, and Image Processing 73.
PDF IFILTER
An IFilter is a plugin that allows Microsoft's search engines to index various file formats (documents, email attachments, database records, audio metadata, etc.) so that they become searchable. Without an appropriate IFilter, the contents of a file cannot be parsed and indexed by the search engine. An IFilter acts as a plug-in for extracting full text and metadata for search engines.
A search engine usually works in two steps. First, the search engine goes through a designated place, e.g. a file folder or a database, and indexes all documents, or newly modified documents, of various types in the background, creating internal data to store the indexing result. Second, a user specifies some keywords to search for, and the search engine answers the query immediately by looking up the indexing result and responding with the documents that contain the keywords. During the first step, the search engine itself doesn't understand the format of a document.
3. www.jst.org.in 44 | Page
Journal of Science and Technology
Therefore, it looks in the Windows registry for an appropriate IFilter to extract the data from the document format, filtering out embedded formatting and any other non-textual data. Figure 2 shows one full research paper (in PDF format) being read with this tool to extract the abstract and keywords.
Figure 2: Extract abstract and Index Terms from research paper
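The extraction in Figure 2 can be approximated with simple patterns once the IFilter (or any PDF text extractor) has produced plain text; the section labels and regular expressions below are assumptions, since real paper layouts vary:

```python
import re

# Pull the Abstract and Keywords/Index Terms sections out of the
# plain text of a paper. The label spellings are assumptions.

def extract_sections(text: str) -> dict[str, str]:
    abstract = re.search(r"Abstract[:\s]+(.*?)(?=Keywords|Index Terms|$)",
                         text, re.S | re.I)
    keywords = re.search(r"(?:Keywords|Index Terms)[\s:-]+(.*?)(?=\n\s*\n|$)",
                         text, re.S | re.I)
    return {
        "abstract": abstract.group(1).strip() if abstract else "",
        "keywords": keywords.group(1).strip() if keywords else "",
    }

sample = "Abstract: We extract domain terms.\nKeywords - Text Mining, TF-IDF\n\n1. Introduction"
print(extract_sections(sample))
```

A production pipeline would need more robust heuristics for multi-column layouts and varying section headers.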
Text Similarity Mining
Text vectors must be generated before calculating the text similarity between the domain corpus and the extracted research paper. A Chinese corpus covering economy, military, education, culture, and other fields contains approximately 100,000 documents [9]. Figure 3 shows the generation of word sets from the research paper and the domain corpus. The process is relatively simple: feature words are extracted from the research paper and intersected with the domain corpus.
Figure 3: Intersection of Domain Corpus and Extracted Research Paper
As shown in Figure 4, the total word count is calculated for each domain: for a given research paper, t is the maximum number of appearances of the paper's words in a domain corpus.
t = count(maximum number of word appearances in the domain corpus for each domain)
Figure 4: Count of word appearances in the domain corpus for each domain
Finally, Figure 5 shows the domain with the maximum count as the domain result of one research paper. The calculation is repeated for each of the 200 research papers.
Figure 5: Domain Result
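Once every paper has a domain result, the per-domain totals of Figure 5 (and the Total column of Table 1) follow from a simple tally; the result list below is a hypothetical stand-in for the 200-paper collection:

```python
from collections import Counter

# Tally the per-paper domain results into per-domain totals.
paper_domains = [
    "Text mining", "Data mining", "Text mining",
    "Cryptography", "Text mining", "Big data",
]

totals = Counter(paper_domains)
for domain, count in totals.most_common():
    print(domain, count)
```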
EVALUATION MEASURE
Precision
In the field of information retrieval, precision is the fraction of retrieved documents that are relevant to the query:
Precision = (relevant items retrieved) / (retrieved items) = P(relevant|retrieved)
Precision takes all retrieved documents into account, but it can also be evaluated at a given cut-off rank, considering only the topmost results returned by the system; this measure is called precision at n. For example, for a text search on a set of documents, precision is the number of correct results divided by the number of all returned results. Precision is also used with recall, the percentage of all relevant documents that is returned by the search. The two measures are sometimes combined in the F1 score (or F-measure) to provide a single measurement for a system. Note that the meaning and usage of "precision" in the field of information retrieval differs from the definition of accuracy and precision within other branches of science and technology.
Recall
Recall in information retrieval is the fraction of the documents that are relevant to the query that are
successfully retrieved.
Recall = (relevant items retrieved) / (relevant items) = P(retrieved|relevant)
For example, for a text search on a set of documents, recall is the number of correct results divided by the number of results that should have been returned. In binary classification, recall is called sensitivity. It can be viewed as the probability that a relevant document is retrieved by the query. It is trivial to achieve a recall of 100% by returning all documents in response to any query. Therefore, recall alone is not enough; one also needs to measure the number of non-relevant documents, for example by computing the precision.
F-Measure
The F1 measure is a derived effectiveness measurement. The resultant value is interpreted as a weighted average of precision and recall. The best value is 1 and the worst is 0.
F-Measure = 2 * ((precision * recall) / (precision + recall))
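The three measures can be computed directly from sets of retrieved and relevant document ids; the id values below are hypothetical:

```python
# Precision, recall and F-measure as defined above.

def evaluate(retrieved: set, relevant: set) -> tuple[float, float, float]:
    tp = len(retrieved & relevant)      # relevant items retrieved
    precision = tp / len(retrieved)
    recall = tp / len(relevant)
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1

p, r, f = evaluate(retrieved={1, 2, 3, 4}, relevant={2, 3, 5})
print(round(p, 2), round(r, 2), round(f, 2))  # 0.5 0.67 0.57
```

The sketch omits the zero-division guard needed when nothing is retrieved or both measures are zero.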
TF-IDF
TF: Term Frequency, which measures how frequently a term occurs in a document. Since every document is different in length, it is possible that a term would appear many more times in long documents than in shorter ones. Thus, the term frequency is often divided by the document length (the total number of terms in the document) as a way of normalization [10]:
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
In Table 1, if the domain result of a paper is the given domain, the value is 1, otherwise 0.
Total number of papers: 200
TF(t) = 40/56 = 0.7142857142
where the text mining paper count is 40 and the domain corpus count is 56. The TF calculation divides the number of times t appears in the document by the total number of terms in the document. Table 1 shows the number of papers that appear in each particular domain, the number of times they appear, and the total calculation for each domain.
Table 1: TF calculation

Domain name            Paper 1   Paper 2   Paper 3   Paper 4   ...   Paper 200   Total
Data mining               1         0         0         1      ...       1         20
Text mining               0         1         1         1      ...       1         40
Network security          1         0         1         1      ...       0         20
Cryptography              1         1         0         0      ...       1         35
Software engineering      1         0         1         0      ...       0         25
Image processing          1         0         1         1      ...       0         30
Big data                  1         1         0         1      ...       1         30

IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, it is known that certain terms, such as "is", "of", and "that", may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following [10]:
IDF(t) = log(Total number of documents / Number of documents with term t in it)
Total number of papers = 200
Number of papers in which the term occurs = 20
IDF(t) = log(200 / 20) = log(10) = 1
Finally,
tf-idf = tf * idf = 0.7142857142 * 1 = 0.7142857142
Example: Consider a document containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log(10,000,000 / 1,000) = 4. Thus, the tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12.
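The worked example with the word cat can be checked in a few lines (using base-10 logarithms, as in the example):

```python
import math

# tf-idf for "cat": a 100-word document with 3 occurrences, in a
# collection of 10,000,000 documents of which 1,000 contain the term.

def tf(term_count: int, doc_length: int) -> float:
    return term_count / doc_length

def idf(total_docs: int, docs_with_term: int) -> float:
    return math.log10(total_docs / docs_with_term)

tf_cat = tf(3, 100)               # 0.03
idf_cat = idf(10_000_000, 1_000)  # log10(10,000) = 4.0
print(tf_cat * idf_cat)           # 0.12
```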
IV. EXPERIMENTAL RESULT IN DOMAIN EXTRACTION
Step 1: Create Domain Corpus in Word document or notepad
Step 2: Read the research paper (PDF format) using the PDF IFilter tool
Step 3: Extract Abstract and Index Terms from Research Paper
Step 4: Count the term frequency of the collected terms and rank them from highest to lowest (using term frequency similarity)
Step 5: Domain Extraction Result from Research Paper
Step 6: Precision, Recall and F – Measure calculation from Research Paper
Step 7: TF – IDF calculation
Step 8: Domain extraction from the research papers, assigning a domain name to each paper, as shown in the following image
Step 9: Graph Representation from Research paper for each Domain
V. CONCLUSION
The extracted abstract and index terms can be used to derive the domain name of a specified research paper, and this can be applied to document classification. This paper uses the PDF IFilter, one of the best-known and most commonly used tools for research paper extraction currently in use. The term similarity mining approach counts term frequency values from the research paper. We use different performance measurements, including precision, recall and F-measure, with respect to individual human annotations and a weighted measure. Finally, the TF-IDF measurement is calculated for each domain over all research papers. This paper experimented with the identification of individual research papers and their domain names.
REFERENCES
[1] Su Nam Kim and Lawrence Cavedon, Classifying Domain-Specific Terms Using a Dictionary, In Proceedings of Australasian Language
Technology Association Workshop, 2011, 57−65.
[2] Birong Jiang, Endong Xun and Jianzhong Qi, A Domain Independent Approach for Extracting Terms from Research Papers, Springer International Publishing Switzerland, 2015, 155-166.
[3] P. Velardi, M. Missikoff, R. Basili, Identification of Relevant Terms to Support the Construction of Domain Ontologies, ACL workshop
on Human Language Technology and Knowledge Management, 2001, 5:1 – 5:8.
[4] Rishabh Upadhyay, Akihiro Fujii, Semantic Knowledge Extraction from Research Documents, Computer Science and Information System
(Fed CSIS) 2016.
[5] Rakhi Chakraborty, Domain Keyword Extraction Technique: A new weighting method based on frequency analysis, National Conference
on Advancement of Computing in Engineering Research, ACER 2013.
[6] Hospice Houngbo, Robert E. Mercer, Method Mention Extraction from Scientific Research Papers, Proceedings of COLING 2012, 1211-1222.
[7] Sonal Gupta, Christopher D.Manning, Analyzing the dynamics of Research by Extracting key Aspects of Scientific Papers, International
Joint Conference on Natural Language Processing (IJCNLP), 2011.
[8] Sonal Gupta, Christopher D.Manning, Identifying Focus, Techniques and Domain of Scientific Papers, NIPS workshop on
Computational Social Science and the Wisdom of Crowds (NIPS-CSS), 2010.
[9] Gang Chen, Feng Liu, Mohammad Shojafar, Fuzzy Systems and Data Mining, Proceedings of FSDM 2015, Frontiers in Artificial Intelligence and Applications 281, IOS Press, 2016, ISBN 978-1-61499-618-7.
[10] H. Wu and R. Luk and K. Wong and K. Kwok, Interpreting TF-IDF term weights as making relevance decisions, ACM Transactions on
Information Systems, 2008.
BIBLIOGRAPHY OF AUTHORS
Dr. R. Jayanthi, MCA, M.Phil., Ph.D., is working as an Assistant Professor in the PG & Research Department of Computer Science at Quaid-E-Millath Govt. College for Women (Autonomous), Chennai. Her areas of interest are Data Mining, Text Mining, Natural Language Processing, Information Extraction and Business Intelligence. She has published articles in more than 10 International Journals.
S. Sheela received her M.Sc. in Computer Science in 2015 from Queen Mary’s College for Women. She is pursuing her M.Phil. in Computer Science under the supervision of Dr. Mrs. R. Jayanthi, MCA, M.Phil., Ph.D., Assistant Professor in the PG & Research Department of Computer Science at Quaid-E-Millath Government College for Women, affiliated to the University of Madras. She has presented papers in International conferences and published papers in International Journals. Her areas of interest are Text Mining, Cryptography and Network Security.