This document provides a literature review and bibliometric analysis of natural language processing (NLP) applications in library and information science. It identifies 6,607 relevant publications on topics like information retrieval, machine translation, and text summarization. The document analyzes the historical trends, core journals, and prominent publications in the field. It also describes how NLP can enhance bibliometric studies by automating tasks like information extraction from texts. The bibliometric analysis reveals that library and information science is the most prominent subject category for NLP publications, followed by computer science and engineering.
The document discusses text classification and summarization techniques for complex domain-specific documents like research papers. It reviews various preprocessing approaches like stopword removal, lemmatizing, tokenization, and stemming. It also compares different machine learning algorithms for text classification, including Naive Bayes, decision trees, SVM, KNN, and neural networks. The document surveys works analyzing domain-specific documents using these techniques, such as biomedical document relation extraction and research paper topic classification.
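As a concrete illustration of the term-based classification pipelines such surveys compare, here is a minimal scikit-learn sketch; the corpus, labels, and the choice of Naive Bayes are illustrative assumptions only, and the classifier could be swapped for an SVM, k-NN, or neural model.

# A minimal sketch of a term-based text-classification pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

docs = ["protein binding sites in biomedical text",
        "deep learning for paper topic classification",
        "gene expression relation extraction",
        "neural models for research article tagging"]
labels = ["biomedical", "nlp", "biomedical", "nlp"]

# Preprocessing (tokenization, stopword removal) and term weighting are
# folded into the vectorizer; MultinomialNB is one of the algorithms the
# review compares.
model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", MultinomialNB()),
])
model.fit(docs, labels)
print(model.predict(["relation extraction from gene abstracts"]))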
This paper proposes a natural-language discourse analysis method for extracting information from news articles in different domains. The discourse analysis applies Rhetorical Structure Theory (RST) to find the coherent groups of text that are most prominent for information extraction; RST's nucleus-satellite concept is used to locate the most prominent text in a document. After the discourse analysis, text analysis extracts domain-related objects and relates them to one another. The extraction uses a knowledge-based system built on a domain dictionary, which holds a bag of words for the domain. The system is evaluated against a gold-standard analysis and human judgment of the extracted information.
Text mining is a technique that helps users find useful information in large collections of text documents on the web or in databases. Most popular text mining and classification methods have adopted term-based approaches; pattern-based methods, in contrast, describe user preferences. This review paper analyzes how text mining works at three levels: the sentence level, the document level, and the feature level. It reviews related prior work, demonstrates the problems that arise when text mining is done at the feature level, and presents a technique for text mining of compound sentences.
Opinion Mining Techniques for Non-English Languages: An Overview (CSCJournals)
The amount of user-generated data on the web is increasing day by day, giving rise to the need for automatic tools that analyze this huge volume of data and extract useful information from it. Opinion mining is an emerging area of research concerned with extracting and analyzing opinions expressed in texts. It is a language- and domain-dependent task with a number of applications, such as recommender systems, review analysis, and marketing systems. Early research in opinion mining concentrated on English, and many opinion mining tools and linguistic resources have been built for it. The availability of information in regional languages has motivated researchers to develop tools and resources for non-English languages. In this paper we present a survey of opinion mining research for non-English languages.
This document summarizes an article that proposes an automatic text summarization technique using feature terms to calculate sentence relevance. The technique uses both statistical and linguistic methods to identify semantically important sentences for creating a generic summary. It determines the relevance of sentences based on feature term ranks and performs semantic analysis of sentences with the highest ranks to select those most important for the summary. The performance is evaluated by comparing summaries to those created by human evaluators.
Keyword Extraction Based Summarization of Categorized Kannada Text Documents (ijsc)
The internet has caused humongous growth in the number of documents available online. Summaries of documents can help users find the right information and are particularly effective when the document base is very large. Keywords are closely associated with a document: they reflect its content and act as indices for it. In this work, we present a method to produce extractive summaries of documents in the Kannada language, with the number of sentences given as a limit. The algorithm extracts keywords from pre-categorized Kannada documents collected from online resources. We use two feature selection techniques to obtain features from documents, combining the scores of GSS (Galavotti, Sebastiani, Simi) coefficients and IDF (Inverse Document Frequency) with TF (Term Frequency) to extract keywords, and later use these for summarization based on sentence rank. In the current implementation, a document from a given category is selected from our database and, depending on the number of sentences requested by the user, a summary is generated.
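A rough sketch of the combined keyword score described above (GSS coefficient times IDF times TF); the helper names and the toy contingency counts are assumptions, not the authors' exact formulation.

import math

def gss(n_tc, n_t_notc, n_nott_c, n_nott_notc):
    # GSS(t,c) = P(t,c)P(not t,not c) - P(t,not c)P(not t,c)
    n = n_tc + n_t_notc + n_nott_c + n_nott_notc
    return (n_tc / n) * (n_nott_notc / n) - (n_t_notc / n) * (n_nott_c / n)

def keyword_score(tf, df, n_docs, n_tc, n_t_notc, n_nott_c, n_nott_notc):
    idf = math.log(n_docs / (1 + df))
    return tf * idf * gss(n_tc, n_t_notc, n_nott_c, n_nott_notc)

# Toy example: term occurs 7 times in this document and in 12 of 200
# documents, 10 of which belong to the target category (40 docs total).
print(keyword_score(tf=7, df=12, n_docs=200,
                    n_tc=10, n_t_notc=2, n_nott_c=30, n_nott_notc=158))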
A domain specific automatic text summarization using fuzzy logic (IAEME Publication)
This paper proposes a domain-specific automatic text summarization technique using fuzzy logic. It extracts important sentences from documents using both statistical and linguistic features. The technique first preprocesses documents by segmenting sentences, removing stop words, and stemming words. It then extracts eight features from each sentence, including title words, sentence length, position, and similarity. Fuzzy logic is used to assign an importance score to each sentence based on these features. Sentences are ranked and the top sentences are selected for the summary based on the compression rate. The technique was evaluated on news articles and shown to generate good quality summaries compared to other summarizers based on precision, recall, and F-measure.
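The following is a deliberately simplified sketch of fuzzy sentence scoring with triangular membership functions; the paper's actual rule base covers eight features, while only three normalized stand-ins appear here.

def tri(x, a, b, c):
    # Triangular membership: rises from a to a peak at b, falls to c.
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def sentence_importance(title_overlap, length_ratio, position):
    # Fuzzify each feature (normalized to 0..1) into a "high" degree.
    high_title = tri(title_overlap, 0.2, 1.0, 1.8)   # saturates near 1
    good_length = tri(length_ratio, 0.1, 0.5, 0.9)   # medium length favored
    early = tri(1.0 - position, 0.2, 1.0, 1.8)       # earlier is better
    # One averaged rule stands in for the full fuzzy inference engine.
    return (high_title + good_length + early) / 3.0

print(sentence_importance(title_overlap=0.6, length_ratio=0.4, position=0.1))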
The document discusses the impact of standardized terminologies and domain ontologies in multilingual information processing. It outlines how natural language processing (NLP) techniques can be used to semi-automatically populate ontologies by extracting information from text. Integrating knowledge from ontologies, NLP tools, and subject experts allows for more effective information access and management in an organization.
A template based algorithm for automatic summarization and dialogue managemen... (eSAT Journals)
Abstract: This paper describes an automated approach for extracting significant and useful events from unstructured text. The goal of this research is to arrive at a methodology that helps extract important events such as dates, places, and subjects of interest, and that can also present users with a shorter version of the text containing all non-trivial information. We also discuss our implementation of algorithms that perform exactly this task. Key words: Cosine Similarity, Information, Natural Language, Summarization, Text Mining
Novelty detection via topic modeling in research articles (csandit)
In today's world, redundancy is a vital problem faced in almost all domains. Novelty detection is the identification of new or unknown data or signals that a machine learning system was not aware of during training. The problem becomes more intense for research articles: a method of identifying novelty in each section of an article is highly desirable for determining the novel idea a paper proposes. Since research articles are semi-structured, detecting novel information in them requires more accurate systems. Topic models provide a useful means to process and analyze them. This work compares the most predominantly used topic model, Latent Dirichlet Allocation, with the hierarchical Pachinko Allocation Model. The results favor the hierarchical Pachinko Allocation Model when used for document retrieval.
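A minimal LDA sketch with gensim showing one side of that comparison; the hierarchical Pachinko Allocation Model is not in gensim (MALLET implements it), so only the LDA baseline is shown, and the tokenized "sections" are a toy stand-in.

from gensim import corpora, models

sections = [
    ["novelty", "detection", "topic", "model"],
    ["document", "retrieval", "topic", "model"],
    ["training", "data", "machine", "learning"],
]
dictionary = corpora.Dictionary(sections)
bows = [dictionary.doc2bow(s) for s in sections]

lda = models.LdaModel(bows, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)

# Per-section topic mixtures could then be compared across a paper's
# sections to flag the one whose distribution departs from the rest.
print(lda.get_document_topics(bows[0]))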
Dictionary based concept mining: an application for Turkish (csandit)
In this study, a dictionary-based method is used to extract expressive concepts from documents. There have been many studies of concept mining in English, but the area is still immature for Turkish, an agglutinative language. We used a dictionary instead of WordNet, the lexical database that groups words into synsets and is widely used for concept extraction. Dictionaries are rarely used in concept mining, but since dictionary entries carry synonyms, hypernyms, hyponyms, and other relationships in their definition texts, the success rate for determining concepts has been high. This concept extraction method is applied to documents collected from different corpora.
Semantic Based Model for Text Document Clustering with Idioms (Waqas Tariq)
Text document clustering has become an increasingly important problem in recent years because of the tremendous amount of unstructured data available in various forms in online forums such as the web, social networks, and other information networks. Clustering is a very powerful data mining technique for organizing the large amount of information on the web, yet traditional document clustering methods do not consider the semantic structure of documents. This paper develops an effective and efficient method to improve on this: it tags documents for parsing, replaces idioms with their literal meaning, calculates semantic weights for document words, and applies a semantic grammar. A similarity measure is then computed between documents, and the documents are clustered using a hierarchical clustering algorithm (sketched below). The method is evaluated on different data sets with standard performance measures, and its effectiveness in producing meaningful clusters is demonstrated.
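A sketch of the clustering stage only, assuming tf-idf vectors, cosine distances, and average-linkage hierarchical clustering; the paper's semantic steps (idiom replacement, semantic weights) would happen before this point.

from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

docs = ["the market crashed and stocks fell",
        "share prices dropped on the exchange",
        "the team won the championship game",
        "players celebrated the league title"]

vectors = TfidfVectorizer().fit_transform(docs).toarray()
distances = pdist(vectors, metric="cosine")       # condensed distance matrix
tree = linkage(distances, method="average")       # hierarchical merge tree
print(fcluster(tree, t=2, criterion="maxclust"))  # e.g. [1 1 2 2]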
A statistical model for gist generation: a case study on Hindi news articles (IJDKP)
Every day, a huge number of news articles is reported and disseminated on the Internet. By generating the gist of an article, a reader can go through its main topics instead of reading the whole article, which takes much more time. An ideal system would understand the document and generate the appropriate theme(s) directly from the results of that understanding; in the absence of a natural language understanding system, an appropriate substitute must be designed. Gist generation is a difficult task because it requires both maximizing the text content carried by a short summary and maintaining the grammaticality of the text. In this paper we present a statistical approach to generating the gist of a Hindi news article. The experimental results are evaluated using standard measures such as precision, recall, and F1 for different statistical models and their combination, on articles both before and after pre-processing.
An Efficient Approach for Keyword Selection; Improving Accessibility of Web ... (dannyijwest)
General search engines often return low-precision results even for detailed queries, so there is a vital need to elicit useful information, such as keywords, that helps search engines provide acceptable results for users' queries. Although many methods have been proposed for extracting keywords automatically, all attempt to improve the recall, precision, and other criteria that describe how well the method has done its job as an author would. This paper presents a new automatic keyword extraction method that improves the accessibility of web content to search engines. The proposed method defines coefficients that determine feature efficiency and optimizes them using a genetic algorithm (a toy version appears below); furthermore, it evaluates candidate keywords with a function that uses search engine results. Experiments demonstrate that, compared with other methods, the proposed method achieves a higher score from search engines without noticeable loss of recall or precision.
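The toy genetic algorithm referenced above, evolving feature coefficients; the real fitness function queries search engines, so a stand-in scoring function and fabricated feature tuples are used here purely for illustration.

import random

def fitness(weights, candidates):
    # Stand-in fitness: score candidates by weighted features and reward
    # separating known-good keywords (label 1) from bad ones (label 0).
    score = lambda feats: sum(w * f for w, f in zip(weights, feats))
    return sum(score(f) if y else -score(f) for f, y in candidates)

def evolve(candidates, n_features=3, pop=20, gens=50):
    population = [[random.random() for _ in range(n_features)]
                  for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=lambda w: fitness(w, candidates), reverse=True)
        parents = population[:pop // 2]
        children = []
        for _ in range(pop - len(parents)):
            a, b = random.sample(parents, 2)
            child = [random.choice(p) for p in zip(a, b)]   # crossover
            i = random.randrange(n_features)                # mutation
            child[i] = min(1.0, max(0.0, child[i] + random.gauss(0, 0.1)))
            children.append(child)
        population = parents + children
    return max(population, key=lambda w: fitness(w, candidates))

data = [((0.9, 0.2, 0.7), 1), ((0.1, 0.8, 0.2), 0), ((0.8, 0.3, 0.6), 1)]
print(evolve(data))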
Presentation of the main IR models
Presentation of our submission to TREC KBA 2014 (Entity oriented information retrieval), in partnership with Kware company (V. Bouvier, M. Benoit)
Data mining is knowledge discovery in databases; the goal is to extract patterns and knowledge from large amounts of data. An important branch of data mining is text mining, which extracts high-quality information from text, typically through statistical pattern learning. High quality in text mining means a combination of relevance, novelty, and interestingness. Tasks in text mining include text categorization, text clustering, entity extraction, and sentiment analysis. Applications of natural language processing and analytical methods are highly preferred to turn
The document provides information on review papers and metagenomics. It begins with defining a review paper as a critical analysis of previously published literature that summarizes, classifies, analyzes and compares existing works. It describes the different types of review papers. The document then discusses metagenomics, providing a history and overview of the field. It explains that metagenomics involves directly extracting and sequencing DNA from environmental samples without culturing to study microbial communities. Examples of applications in various fields like medicine, engineering and agriculture are provided. The concluding remarks summarize key points about the importance of review papers and references are listed.
Semi Automated Text Categorization Using Demonstration Based Term Set (IJCSEA Journal)
Manual analysis of huge amounts of textual data requires a tremendous amount of processing time and effort to read the text and organize it in the required format. In the current scenario, the major problem in text categorization is the high dimensionality of the feature space, and many methods are now available for text feature selection. This paper presents a semi-automated text categorization and feature selection methodology for dealing with massive data using one of the phases of David Merrill's First Principles of Instruction (FPI): it uses pre-defined category groups, providing them with a proper training set based on the demonstration phase of FPI. The methodology involves text tokenization, text categorization, and text analysis.
Language Models for Information Retrieval (Dustin Smith)
The document provides background on Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze, authors of the chapter "Language models for information retrieval" in the book Introduction to Information Retrieval. It then outlines the presentation, which discusses language models for information retrieval, including query likelihood models, estimating query generation probabilities, and experiments comparing language modeling approaches to other IR techniques.
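A compact query-likelihood sketch with Dirichlet smoothing, the core scoring idea in the language-modeling approach the presentation covers; mu=2000 is a commonly cited default, and the two "documents" are toy data.

import math
from collections import Counter

docs = {"d1": "language models for information retrieval".split(),
        "d2": "probabilistic retrieval and ranking models".split()}
collection = Counter(w for d in docs.values() for w in d)
c_len = sum(collection.values())

def log_p_query(query, doc, mu=2000):
    tf, d_len = Counter(doc), len(doc)
    score = 0.0
    for w in query.split():
        p_coll = collection[w] / c_len          # background collection model
        # P(w|d) = (tf(w,d) + mu * P(w|C)) / (|d| + mu)
        score += math.log((tf[w] + mu * p_coll) / (d_len + mu))
    return score

for name, doc in docs.items():
    print(name, log_p_query("retrieval models", doc))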
Improved method for pattern discovery in text mining (eSAT Journals)
This document summarizes an improved method for pattern discovery in text mining proposed by Bharate Laxman and D. Sujatha. The method implements a novel pattern discovery technique proposed by Zhong et al. that discovers patterns from text documents and computes pattern specificities to evaluate term weights. The authors built a prototype application to test the technique. Experimental results showed the solution is useful for text mining as it avoids problems of misinterpretation and low frequency compared to previous methods.
Mining Opinion Features in Customer Reviews (IJCERT JOURNAL)
Nowadays, e-commerce systems have become extremely important. Large numbers of customers choose online shopping for its convenience, reliability, and cost. Customer-generated information, especially item reviews, is a significant source of data both for consumers making informed purchase choices and for manufacturers keeping track of customer opinions; it is difficult for customers to make purchasing decisions based only on pictures and short product descriptions. Mining product reviews has therefore become a hot research topic, and prior research is mostly based on pre-specified product features for analyzing opinions. Natural Language Processing (NLP) techniques, such as NLTK for Python, can be applied to raw customer reviews to extract keywords. This paper presents a survey of the techniques used for designing software that mines opinion features in reviews. Eleven IEEE papers are selected and compared; they are representative of the significant improvements in opinion mining over the past decade.
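A small illustration of the NLTK step the survey mentions: tokenize a raw review, POS-tag it, and keep nouns as candidate product features and adjectives as opinion words. The tagger choice and the filtering rule are assumptions, and NLTK resource names can vary slightly across versions.

import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

review = "The battery life is excellent but the camera quality is poor."
tokens = nltk.word_tokenize(review)
tagged = nltk.pos_tag(tokens)

features = [w for w, t in tagged if t.startswith("NN")]  # candidate features
opinions = [w for w, t in tagged if t.startswith("JJ")]  # opinion words
print(features)  # e.g. ['battery', 'life', 'camera', 'quality']
print(opinions)  # e.g. ['excellent', 'poor']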
Text mining aims to discover new, previously unknown or hidden information by automatically extracting it from various written resources. Applying knowledge discovery methods to unstructured text is known as Knowledge Discovery in Text, or text data mining, and is also called text mining. Most text mining techniques are founded on the statistical analysis of a term, either a word or a phrase. Several algorithms have been used in earlier work: the single-link algorithm and self-organizing maps (SOM) offer a projection-based approach for visualizing high-dimensional data and are useful tools for processing textual data, while genetic and sequential algorithms provide multiscale representations of datasets and compute quickly, with less CPU time, on the Isolet-reduced subsets used in unsupervised feature selection. We propose a vector space model and a concept-based analysis algorithm that improve text clustering quality and can achieve better clustering results; the proposed algorithm also behaves well in terms of robustness and stability with respect to the formation of the neural network.
Navigation through citation network based on content similarity using cosine ... (Salam Shah)
The volume of scientific literature has increased over the past few decades; new topics and information are added in the form of articles, papers, text documents, web logs, and patents. This rapid growth has added tremendously to current and past knowledge; in the process, new topics have emerged, some topics have split into many sub-topics, and other topics have merged into single topics. Manually selecting and searching for a topic in such a huge amount of information is an expensive and labor-intensive task. To meet the emerging need for an automatic process that locates, organizes, connects, and makes associations among these sources, researchers have proposed techniques that automatically extract components of the information presented in various formats, whether text, video, or audio, and organize or structure them. Various algorithms structure the information, group similar items into clusters, and weight them by importance; the organized, structured, and weighted data is then compared with other structures to find similarities. Semantic patterns can be exposed with visualization techniques that show the similarity or relation between topics over time or around a specific event. In this paper, we propose a model based on the cosine similarity algorithm for citation networks that answers questions such as how to connect documents through citation and content similarity, and how to visualize and navigate through the documents.
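The core similarity computation in such a model, sketched with plain vectors; in practice the vectors would be tf-idf representations of the papers' content, combined with citation links.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

paper_a = [0.2, 0.0, 0.7, 0.1]   # toy tf-idf vectors
paper_b = [0.1, 0.0, 0.8, 0.0]
print(cosine(paper_a, paper_b))  # close to 1.0 -> strong content link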
IRJET: Automatic Recapitulation of Text Document (IRJET Journal)
The document describes an approach for automatically generating summaries of text documents. It involves preprocessing the input text by tokenizing, removing stop words, and stemming words. It then extracts features from the preprocessed text, such as term frequency-inverse document frequency (tf-idf) values. Sentences are scored based on these features and sentences with higher scores are selected to form the summary. Keywords from the text are also identified using WordNet to help select the most relevant sentences for the summary. The proposed approach aims to generate concise yet meaningful summaries using natural language processing techniques.
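A bare-bones version of the scoring step described above: rank sentences by their mean tf-idf weight and keep the top ones. The WordNet keyword step is omitted here, and the compression choice (top 2 sentences) is an assumption.

from sklearn.feature_extraction.text import TfidfVectorizer

text = ["The storm closed major roads across the region.",
        "Officials said repairs would begin immediately.",
        "A local bakery reopened after renovations.",
        "Power was restored to most homes by evening."]

matrix = TfidfVectorizer(stop_words="english").fit_transform(text)
scores = matrix.mean(axis=1).A1                 # mean tf-idf per sentence
top = sorted(range(len(text)), key=lambda i: scores[i], reverse=True)[:2]
print(" ".join(text[i] for i in sorted(top)))   # summary in original order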
The document presents an overview of probabilistic models for information retrieval. It discusses how probability theory can be applied to model the uncertain nature of retrieval, where queries only vaguely represent user needs and relevance is uncertain. The document outlines different probabilistic IR models including the classical probabilistic retrieval model, probability ranking principle, binary independence model, Bayesian networks, and language modeling approaches. It also describes datasets used to evaluate these models, including collections from TREC, Cranfield, and others. Basic probability theory concepts are reviewed, including joint probability, conditional probability, and rules relating probabilities.
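As a worked instance of this family of models, here is a sketch of BM25, the best-known scoring function descended from the binary independence model; k1=1.2 and b=0.75 are the usual defaults, and the corpus is a toy.

import math
from collections import Counter

docs = ["probabilistic models of retrieval".split(),
        "bayesian networks for ranking documents".split(),
        "language models and retrieval".split()]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N
df = Counter(w for d in docs for w in set(d))   # document frequencies

def bm25(query, doc, k1=1.2, b=0.75):
    tf = Counter(doc)
    score = 0.0
    for w in query.split():
        if w not in tf:
            continue
        idf = math.log(1 + (N - df[w] + 0.5) / (df[w] + 0.5))
        norm = tf[w] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[w] * (k1 + 1) / norm
    return score

for d in docs:
    print(bm25("retrieval models", d))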
This document summarizes a survey on string similarity matching search techniques. It discusses how string similarity matching is used to find relevant information in text collections. The document reviews different algorithms for string matching, including edit distance, NR-grep, n-grams, and approaches based on hashing and locality-sensitive hashing. It analyzes techniques like pattern matching, threshold-based joins, and vector representations. The goal is to present an overview of the field and compare algorithm performance for similarity searches.
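The classic dynamic-programming edit distance, the baseline that most of the surveyed similarity-search techniques build on or approximate.

def edit_distance(s, t):
    # prev[j] holds the distance between s[:i-1] and t[:j] as rows advance.
    prev = list(range(len(t) + 1))
    for i, sc in enumerate(s, 1):
        curr = [i]
        for j, tc in enumerate(t, 1):
            cost = 0 if sc == tc else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # 3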
The Process of Information extraction through Natural Language Processing (Waqas Tariq)
Information Retrieval (IR) is the discipline that deals with retrieval of unstructured data, especially textual documents, in response to a query or topic statement, which may itself be unstructured (e.g., a sentence or even another document) or structured (e.g., a boolean expression). The need for effective methods of automated IR has grown in importance because of the tremendous explosion in the amount of unstructured data, both in internal corporate document collections and in the immense and growing number of document sources on the Internet. The topics covered include: formulation of structured and unstructured queries and topic statements; indexing (including term weighting) of document collections; methods for computing the similarity of queries and documents; classification and routing of documents in an incoming stream to users on the basis of topic or need statements; clustering of document collections on the basis of language or topic; and statistical, probabilistic, and semantic methods of analyzing and retrieving documents. Information extraction from text has been pursued actively as an attempt to present knowledge from published material in a computer-readable format. An automated extraction tool would not only save time and effort but also pave the way to discovering hitherto unknown information implicitly conveyed in text. Work in this area has focused on extracting a wide range of information, such as the chromosomal location of genes, protein functional information, the association of genes by functional relevance, and relationships between entities of interest. While clinical records provide a semi-structured, technically rich data source for mining information, publications, in their unstructured format, pose a greater challenge, addressed by many approaches.
A Review Of Text Mining Techniques And Applications (Lisa Graves)
This document provides a review of various text mining techniques and applications. It discusses techniques used for text classification and summarization, including Naive Bayes classification, backpropagation neural networks, keyword matching, and information extraction. It also covers applications of text mining in areas like sentiment analysis of social media posts and hotel reviews. Finally, it discusses the need for organizational text mining to extract useful information and insights from large amounts of unstructured text data.
A Comprehensive Study On Natural Language Processing And Natural Language Int... (Scott Bou)
The document provides a comprehensive overview of natural language processing (NLP) and natural language interfaces to databases (NLIDBs). It discusses the different levels of NLP including morphological, lexical, syntactic, semantic and pragmatic analysis. It also describes various approaches used to develop NLIDBs, including symbolic, empirical, connectionist and maximum entropy approaches. Additionally, it outlines the history of NLP and NLIDBs, covering early work in machine translation and historically developed systems like LUNAR.
Syracuse University SURFACE
School of Information Studies (iSchool) Faculty Scholarship, 2001
Natural Language Processing
Elizabeth D. Liddy, Syracuse University, [email protected]
This book chapter is brought to you for free and open access by the School of Information Studies (iSchool) at SURFACE. It has been accepted for inclusion in The School of Information Studies Faculty Scholarship by an authorized administrator of SURFACE.
Recommended citation: Liddy, E.D. 2001. Natural Language Processing. In Encyclopedia of Library and Information Science, 2nd Ed. NY. Marcel Dekker, Inc.
Natural Language Processing
INTRODUCTION
Natural Language Processing (NLP) is the computerized approach to analyzing text that is based on both a set of theories and a set of technologies. And, being a very active area of research and development, there is not a single agreed-upon definition that would satisfy everyone, but there are some aspects which would be part of any knowledgeable person's definition. The definition I offer is:

Definition: Natural Language Processing is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications.

Several elements of this definition can be further detailed. Firstly, the imprecise notion of 'range of computational techniques' is necessary because there are multiple methods or techniques from which to choose to accomplish a particular type of language analysis. 'Naturally occurring texts' can be of any language, mode, genre, etc. The texts can be oral or written. The only requirement is that they be in a language used by humans to communicate to one another. Also, the text being analyzed should not be specifically constru.
This document provides an agenda for research on knowledge discovery from web search. It begins with an introduction on knowledge discovery and how search engines can help extract information. It then outlines the goals and objectives, provides a literature review on related work, and discusses some common limitations observed, such as models achieving low accuracy and WSD approaches not being efficient enough. The document serves to provide background and planning for a research study on improving knowledge discovery through web search.
Natural language processing for requirements engineering: ICSE 2021 Technical... (alessio_ferrari)
These are the slides for the technical briefing at ICSE 2021, given by Alessio Ferrari, Liping Zhao, and Waad Alhoshan.
It covers RE tasks to which NLP is applied, an overview of a recent systematic mapping study on the topic, and a hands-on tutorial on using transfer learning for requirements classification.
Please find the links to the colab notebooks here:
https://colab.research.google.com/drive/158H-lEJE1pc-xHc1ISBAKGDHMt_eg4Gn?usp=sharing
https://colab.research.google.com/drive/1B_5ow3rvS0Qz1y-KyJtlMNnmgmx9w3kJ?usp=sharing
https://colab.research.google.com/drive/1Xrm0gNaa41YwlM5g2CRYYXcRvpbDnTRT?usp=sharing
This doctoral research explores how users make sense of and use information when seeking information from web-based resources. The research examines user behaviors and strategies as they interact with information sources to identify examples of how users collect, evaluate, understand, interpret, and integrate new information. Three empirical studies were conducted involving participants completing information seeking tasks. The results provide insights into users' information interaction strategies and sensemaking activities. Implications for the design of technologies to better support sensemaking are discussed.
Automatize Document Topic And Subtopic Detection With Support Of A Corpus (Richard Hogue)
This document discusses previous research on automatic topic and subtopic detection from documents. It provides an overview of various approaches that have been used, including text segmentation, text clustering, frequent word sequences, word clustering, concept-based classification, probabilistic topic modeling, and agglomerative clustering. The document then proposes a new method called paragraph extension, which treats a document as a set of paragraphs and uses a paragraph merging technique and corpus of related words to detect topics and subtopics.
Dialectal Arabic sentiment analysis based on tree-based pipeline optimizatio... (IJECEIAES)
This document summarizes a research paper that proposes using a tree-based pipeline optimization tool (TPOT) to improve sentiment classification of dialectal Arabic texts. The paper provides background on sentiment analysis and challenges in analyzing informal Arabic texts. It then discusses related work applying TPOT and AutoML techniques to optimize machine learning for various tasks. The proposed approach uses TPOT for sentiment analysis of three Arabic dialect datasets to automatically optimize hyperparameters and improve over similar prior work.
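A skeletal view of how TPOT might be applied to a vectorized sentiment dataset in the way the paper describes; the tf-idf feature extraction, the tiny toy corpus, and all parameter values are placeholder assumptions (classic TPOT API).

from sklearn.feature_extraction.text import TfidfVectorizer
from tpot import TPOTClassifier

texts = ["service was great", "terrible experience", "really loved it",
         "would not recommend", "fantastic food", "awful and slow"]
labels = [1, 0, 1, 0, 1, 0]   # toy review sentiment labels

X = TfidfVectorizer().fit_transform(texts).toarray()

# TPOT searches over preprocessing + model pipelines with a genetic
# program; cv=2 only because the toy corpus is so small.
tpot = TPOTClassifier(generations=5, population_size=20, cv=2,
                      random_state=42)
tpot.fit(X, labels)
tpot.export("best_sentiment_pipeline.py")  # emits the winning pipeline code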
French machine reading for question answering (Ali Kabbadj)
This paper proposes to unlock the main barrier to machine reading and comprehension of French natural language texts, opening the way for machines to find a precise answer to a question buried in a mass of unstructured French text, or to create a universal French chatbot. Deep learning has produced extremely promising results for various tasks in natural language understanding, particularly topic classification, sentiment analysis, question answering, and language translation. But to be effective, deep learning methods need very large training datasets. Until now, these techniques could not actually be used for French question answering (Q&A) applications because no large French Q&A training dataset existed. We produced a large (100,000+) French training dataset for Q&A by translating and adapting the English SQuAD v1.1 dataset, along with GloVe French word and character embedding vectors from the French Wikipedia dump. We trained and evaluated three different Q&A neural network architectures in French and obtained French Q&A models with F1 scores around 70%.
Implementation of a Knowledge Management Methodology based on Ontologies: Cas... (rahulmonikasharma)
This document proposes a knowledge management methodology based on ontologies for the tourism domain. The methodology covers the key stages of knowledge management: identifying critical knowledge, preserving knowledge through ontology development, enhancing knowledge by validating and exploiting the ontology, and updating knowledge. The methodology uses heterogeneous sources like text documents and thesauri to build an initial ontology, which is then enriched using natural language processing tools. It provides guidelines for each stage, including performing a feasibility study, choosing scenarios, and developing the ontology while ensuring consistency. The goal is to develop a reusable ontology that represents distributed and heterogeneous knowledge from multiple perspectives.
ANALYSIS OF RHETORICAL MOVES OF JOURNAL ARTICLES AND ITS IMPLICATION TO THE T... (April Knyff)
This document analyzes the rhetorical moves used in journal articles and implications for teaching academic writing. It analyzed 8 articles from Indonesian and English journals using frameworks for introduction and discussion sections. The analysis found that both native English and Indonesian writers generally follow the frameworks, though there is variation. Rhetorical moves were most frequently used in introductions to claim centrality or counter-claim gaps. This suggests rhetorical moves should be taught to undergraduate students to strengthen academic writing skills.
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATA (ijnlc)
Due to the dramatic growth of internet use, the amount of unstructured Bengali text data has increased
enormously, so it is essential to extract events from it intelligently. Advances in natural language
processing (NLP) technologies for information extraction make it possible to locate and classify content in
news data according to predefined categories such as person name, place name, organization name, date,
time, etc. Named entity recognition (NER), a subtask of NLP, plays a vital role in
achieving human-level performance on specific documents such as newspapers by effectively identifying entities.
The purpose of this research is to introduce a NER system for Bengali news data that identifies events of specified
things in running text based on regular expressions and Bengali grammar. In so doing, I have designed and
evaluated part-of-speech (POS) tags to recognize proper nouns. In this thesis, I explain a Hidden
Markov Model (HMM) based approach for developing a NER system from Bengali news data.
Future of Natural Language Processing - Potential Lists of Topics for PhD Students (PhD Assistance)
Natural language processing is an important area for future PhD research. Recent trends in NLP research include using techniques like opinion mining and content analysis to analyze user reviews and news articles. The document outlines many potential topics in NLP that PhD students could explore like developing personalized educational materials using medical records or predicting clinical outcomes with deep learning models.
Natural Language Processing Theory, Applications and Difficulties (ijtsrd)
The promise of a powerful computing device to help people in productivity as well as in recreation can only be realized with proper human-machine communication. Automatic recognition and understanding of spoken language is the first step toward natural human-machine interaction. Research in this field has produced remarkable results, leading to many exciting expectations and new challenges. This field is known as natural language processing. In this paper, natural language generation and natural language understanding are discussed. Difficulties in NLU, applications, and a comparison with structured programming languages are also discussed. Mrs. Anjali Gharat | Mrs. Helina Tandel | Mr. Ketan Bagade "Natural Language Processing Theory, Applications and Difficulties" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3 | Issue-6, October 2019, URL: https://www.ijtsrd.com/papers/ijtsrd28092.pdf Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/28092/natural-language-processing-theory-applications-and-difficulties/mrs-anjali-gharat
The document provides a survey of word sense disambiguation (WSD) research. It discusses the history and applications of WSD, and categorizes the main WSD approaches as knowledge-based, supervised, and unsupervised. For each category, it outlines several common algorithms used, such as Lesk algorithm, decision trees, Naive Bayes, and support vector machines. The document surveys the state-of-the-art in WSD performance and compares different algorithm types. It also provides an overview of WSD research in Indian languages.
DICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISH (cscpconf)
In this study, a dictionary-based method is used to extract expressive concepts from documents.
So far, there have been many studies concerning concept mining in English, but this area of
study for Turkish, an agglutinative language, is still immature. We used a dictionary instead of
WordNet, a lexical database grouping words into synsets that is widely used for concept
extraction. Dictionaries are rarely used in the domain of concept mining, but taking into
account that dictionary entries have synonyms, hypernyms, hyponyms, and other relationships in
their meaning texts, the success rate for determining concepts has been high. This concept
extraction method is applied to documents collected from different corpora.
A Comparative Analysis Of The Entropy And Transition Point Approach In Repres... (Kim Daniels)
This document summarizes a study that compares different term reduction methods for representing the vocabulary of literary texts. Specifically, it examines the effectiveness of the entropy method, transition point method, and a hybrid method in reducing the size of the vocabulary from a collection of Quran texts, while preserving important terms. The results indicate that the transition point method was most effective at reducing the vocabulary size without losing important terms, compared to the other methods.
Natural Language Processing Applications in Library and Information
Science
Zehra Taşkın* & Umut Al**
Abstract
Purpose: With the recent developments in information technologies, natural language
processing (NLP) practices have made tasks in many areas easier and more practical.
Nowadays, especially when big data are used in most research, NLP provides fast and
easy methods for processing these data. The main objective of this paper is to identify
subfields of library and information science (LIS) where NLP can be used and to
provide a guide based on bibliometrics and social network analyses for researchers
who intend to study this subject.
Design/methodology/approach: Within the scope of this study, 6,607 publications,
including NLP methods published in the field of LIS, are examined and visualized by
social network analysis methods.
Findings: After evaluating the obtained results, the subject categories of publications,
frequently used keywords in these publications, and the relationships between these
words are revealed. Finally, the core journals and articles are classified thematically
for researchers working in the field of LIS and planning to apply NLP in their research.
Originality/value: The results of this study draw a general framework for the LIS field and
guide researchers on new techniques that may be useful in the field.
Introduction
Natural language processing (NLP) is a process of understanding how texts, speeches,
and similar materials are used by computerized systems and how they are operated
on computers (Chowdhury, 2003, p. 51). The Oxford Dictionary defines NLP as the
application of computational techniques to the analysis and synthesis of natural
language and speech (Natural Language Processing, 2017). The main goal of these
applications is to realize human-like language processing for several tasks or
applications and to analyze the generated texts with computational techniques (Liddy,
2010, p. 3864).
2010, p. 3864). With similar applications, detailed linguistic analyses are possible, and
large texts can easily be analyzed.
Although NLP approaches were initially applied for endangered languages to prevent
their extinction, these approaches have been recently used in many studies to organize
and make sense of big data. Currently, it would be more difficult and time-consuming
to work without NLP in many fields, including marketing, information validation, and
information retrieval or visualization. The main aim of this study is to provide
* Hacettepe University, Department of Information Management (iSchool), Turkey.
https://orcid.org/0000-0001-7102-493X | http://www.bby.hacettepe.edu.tr/akademik/zehrataskin/
** Hacettepe University, Department of Information Management (iSchool), Turkey.
http://yunus.hacettepe.edu.tr/~umutal/
information about the main stages of NLP applications and to provide a detailed
bibliometric approach on the literature about NLP published in the field of library and
information science (LIS). In this respect, potential subjects, which can employ different
NLP methods, can be revealed, and ideas about future trends can be obtained. The
main research questions are as follows:
- What are the main NLP applications used in LIS literature? What is the historical
evolution of NLP subjects?
- Which journals should be followed by researchers new to NLP, and how is the
distribution of these journals according to the subtopics of NLP applications?
- What are the prominent publications on NLP for LIS?
In this context, basic features of NLP applications are first presented, and then works of
the literature are gathered and evaluated to reveal the main characteristics of the field.
Natural Language Processing
Levels of NLP Tasks
Natural language processing is divided into seven basic levels (Fig.1), from the
simplest to the most difficult (Feldman, 1999, p. 62-64; Liddy, 2010, p. 3867-3868). In
this manner, NLP tasks are carried out for many studies, and large texts in any
language can easily be analyzed.
Figure 1. Seven levels of NLP (Feldman, 1999, p. 62-64; Liddy, 2010, p. 3867-3868).
Studies carried out at a phonetic level are designed to understand spoken languages
or pronunciations. In the literature, various studies have been conducted on languages
to prevent their extinction (e.g., Çarkı, Geutner, & Schultz, 2000; Schultz & Waibel, 2001;
Shen, Wu, & Tsai, 2015). The next level, morphology, deals with the smallest parts of
words, morphemes. Using morphology, the main features of languages can be
revealed, and the roots and suffixes of words can be discriminated. For example,
similar applications developed for the Turkish language, which is one of the agglutinative
languages in the world, are included in the literature (e.g., Eryiğit, 2014; Zemberek NLP,
2015). The main aim of the lexical level is to understand the meanings of words. The
fourth level (syntactic) uncovers the grammatical structures of sentences, and several
studies have been conducted on it (e.g., Arısoy, Saraçlar, Roark, & Shafran, 2010),
while the semantic level aims to reveal the meanings of words. These two levels
(semantic and syntactic) are combined for discourse level studies. The most
complicated level is the pragmatic level, which aims to understand the purposeful use
of languages in situations. To achieve NLP tasks successfully, it is necessary to
investigate all or a few of these levels. Each level is a continuation of the previous level;
that is, while the simplest NLP task is performed at the phonetic level, the most
complicated/detailed task is performed at the pragmatic level.
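To make these levels concrete, the following minimal Python sketch illustrates how the
lexical, morphological, and syntactic levels might be approximated with the NLTK library.
It is an illustration only, not a reproduction of any system cited in this paper; the
example sentence is invented, and the NLTK models named in the download calls are
assumed to be available.

# Minimal sketch of three NLP levels using NLTK (illustrative only).
import nltk
from nltk.stem import PorterStemmer

nltk.download("punkt")                       # tokenizer model
nltk.download("averaged_perceptron_tagger")  # part-of-speech tagger model

text = "Libraries are digitizing their collections rapidly."

# Lexical level: split the running text into word tokens.
tokens = nltk.word_tokenize(text)

# Morphological level: strip suffixes to approximate word roots.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

# Syntactic level: tag each token with its part of speech.
pos_tags = nltk.pos_tag(tokens)

print(tokens)    # ['Libraries', 'are', 'digitizing', ...]
print(stems)     # ['librari', 'are', 'digit', ...]
print(pos_tags)  # [('Libraries', 'NNS'), ('are', 'VBP'), ...]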
NLP Applications
Since there are various purposes for processing natural language texts, many applications
have been developed for each purpose. The most common NLP applications in the
literature are as follows:
Information Retrieval: With the recent increase in the amount of information, the
application of NLP to access meaningful information has become important. The
aim of information retrieval applications is to provide access to any section of a
paragraph, book, or large-scale text (Lewis & Jones, 1996, p. 92).
Information Extraction: The application of information extraction is based on the
identification, labeling, and extraction of key elements, such as people names,
institutions, locations, and countries from large texts (Liddy, 2010, p. 3871).
Information extraction may be used with other NLP applications because extracted
data may form the basis of complex NLP tasks (Blake, 2013, p. 129).
Machine Translation: This application is based on the automatic translation of
texts or speeches from one language to another. This practice aims to integrate
different cultures, remove language barriers, and preserve ancient languages
(Manning & Schütze, 1999, p. 463).
Summarization: Summarization systems are based on using some linguistic or
statistical methods to choose the most important words or phrases from sentences
or paragraphs in large texts, and creating a meaningful summary to represent the
text (Chowdhury, 2003, p. 60).
Text Categorization: These applications are predictive models used for various
purposes such as image or pattern recognition, weather forecasting, and disease
diagnosis. Text categorization systems assign texts to various classes depending on the
basic text features (Blake, 2013, p. 136). A minimal sketch of this idea follows.
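The sketch below is a minimal, hypothetical example of this idea using scikit-learn:
short invented texts are vectorized into basic term features and assigned to classes by a
Naive Bayes model. It illustrates the general technique only, not any specific system
cited above.

# Hypothetical text categorization sketch with scikit-learn (invented toy data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "query expansion improves document retrieval effectiveness",
    "relevance feedback in search systems",
    "neural machine translation of low-resource languages",
    "statistical translation models for parallel corpora",
]
train_labels = ["information retrieval", "information retrieval",
                "machine translation", "machine translation"]

# Turn texts into term-based features, then fit a Naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Predict the class of a new, unseen text.
print(model.predict(["cross-language retrieval with translated queries"]))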
NLP and Bibliometrics
Studies in which NLP methods and bibliometric analyses are performed together have
been very common in recent years. These studies can be categorized into two groups:
studies in which NLP applications are used to increase performance of bibliometric studies
and studies in which NLP articles are analyzed by bibliometric methods.
In the literature, the abovementioned NLP applications have all been used to improve the
performance of bibliometric studies. Information extraction, which is the most frequently
used application, is used to access the names of authors, institutions, and countries of
publications, especially when there are standardization problems in citation indexes
(Galvez & Moya-Anegón, 2007; Hooper, Neves, & Bordea, 2015; Taşkın &
Al, 2014). In these cases, automated information extraction processes can be
carried out with NLP software such as Nooj, Saffron, and Intex. In addition, information
extraction algorithms have been developed for detecting citation sentences from full texts
(Kim, Le, & Thoma, 2014; Taşkın, Al, & Sezen, 2017). In a recent study, it was also stated
that an NLP-based extraction system has been used to analyze the interpretation of a
specific term (Web of Science) in a dataset (Li, Rollins, & Yan, 2018).
The approach of pennant diagrams, developed in the center of relevance and
bibliometrics and based on information retrieval, provides a convergence between
information retrieval applications and bibliometrics. Currently, these diagrams
developed by Howard White (2007) are frequently used for bibliometric studies (e.g.,
Akbulut, 2016; Carevic & Mayr, 2014; White, 2018).
There is a strong relationship between bibliometrics and information retrieval when
citation indexes are the main data source for bibliometric studies. The quality of data
in citation indexes can increase the success of these studies. This strong connection
between the concepts, bibliometrics, and information retrieval has facilitated the
organization of new events that address this connection. The intersections of
bibliometrics and information retrieval were presented in detail at the International
Society for Scientometrics and Informetrics conference in Vienna, 2013, with the theme
of combining bibliometrics and information retrieval. Its selected papers were published
in Scientometrics in 2015 (Mayr & Scharnhorst, 2015). Similarly, the Joint Workshop
on Bibliometric-enhanced IR and NLP for Digital Libraries (BIRNDL) event, which has
been held regularly for three years as part of the Association for Computing
Machinery's Special Interest Group on Information Retrieval (ACM SIGIR)
conferences, also provides platforms to present works on bibliometric-enhanced
information retrieval for authors working in the field (BIRNDL2018, 2018).
Summarization is one of the subjects of some bibliometric studies in the literature. For
instance, one study focused on creating a summary of a field by using different
publications in the dataset (Nanba & Okumura, 1999). In a similar study, text
summarization was performed using co-citation clusters, and as a result, the automatic
summarization system significantly outperformed the baseline in both metrics
(Fiszman, Demner-Fushman, Kilicoglu, & Rindflesch, 2009). In another study, domain
summarization was successfully performed as a result of co-citation analyses (Chen,
Ibekwe-SanJuan, & Hou, 2010).
Semantic applications are often used for analyzing research results or determining the
meaning of citations. For instance, the bibliometric and semantic nature of negative results
publications was examined in a study, wherein negative expressions were determined
using Nooj software (Gumpenberger, Gorraiz, Wieland, Roche, Schiebel, Besagni, &
François, 2013). In another study, a citation-weighting proposal was introduced using
semantic analysis applications for computational bibliometrics domain (Avram, Velter, &
Dumitrache, 2014). Semantic and syntactic analyses are often preferred for content-based
citation analyses, which may replace the traditional citation counting. The main aim of
studies using semantic and syntactic features of citations is to develop new-generation
citation indicators using NLP techniques (e.g., Adedayo, 2015; Catalini, Lacetera, & Oettl,
2015; Jha, Jbara, Qazvinian, & Radev, 2016; Maričić, Spaventi, Pavičić, & Pifat-Mrzljak,
1998). Lexical analyses of publications have also been used in
bibliometric studies in the literature (e.g., Glänzel, Heeffler, & Thijs, 2017).
All the above studies are applications of NLP meant to enhance the quality of
bibliometric studies. However, in the literature, papers on NLP have been analyzed
using bibliometric methods. One of these works is the bibliometric analysis of NLP
studies in the health field (Chen, Xie, Wang, Liu, Xu, & Hao, 2018). In this study, sub-
themes affecting the field were determined as computational biology, terminology
mining, information extraction, text classification, and information retrieval. In another
study that examined clinical NLP studies, it was found that the most frequently used
text types are patient-authored texts, transcribed audios, and online encyclopedic
resources (Névéol & Zweigenbaum, 2016). In another study, papers on data mining
published in a conference proceedings book were analyzed (Mikova, 2016). In this
study, the emerging words were identified as web scraping, ontology modeling,
advanced bibliometrics, semantics, and sentiment analysis. In a review, publications
on NLP were examined in depth, and it was found that the future of NLP lies in the
biologically and linguistically motivated computational paradigms that enable narrative
understanding and hence sense making (Cambria & White, 2014).
Although there are a large number of bibliometric and information science studies using
NLP applications in the literature, no bibliometric analysis has been conducted on these
publications. This work aims to fill this gap and provide a roadmap for new researchers.
Data and Methods
In this study, LIS studies on NLP and its various applications were examined in depth
through a multi-step data collection and analysis process. The first step
was to create the dataset using the query below:
TS=("natural language processing" OR "text categorization" OR "information
retrieval" OR "machine translation" OR "information extraction" OR
summarization) OR TI=("natural language processing" OR "text categorization"
OR "information retrieval" OR "machine translation" OR "information extraction"
OR summarization)) AND WC="Information science & Library Science"
Timespan: All years. Indexes: SCI-EXPANDED, SSCI, A&HCI, CPCI-S, CPCI-
SSH, BKCI-S, BKCI-SSH, ESCI
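To make the query's logic concrete, the hedged Python sketch below reproduces the same
boolean condition over a single already-downloaded record. The term list mirrors the
query; the record dictionary and the assumption that the topic search spans the title,
abstract, and keyword fields are illustrative, not Web of Science internals.

# Hedged sketch: the boolean logic of the search query, applied to one record.
TERMS = ["natural language processing", "text categorization",
         "information retrieval", "machine translation",
         "information extraction", "summarization"]

def matches_query(record):
    # Assumption: a topic search covers title (TI), abstract (AB),
    # and keyword (DE, ID) fields of the exported record.
    topic = " ".join(record.get(tag, "") for tag in ("TI", "AB", "DE", "ID")).lower()
    title = record.get("TI", "").lower()
    term_hit = any(t in topic for t in TERMS) or any(t in title for t in TERMS)
    return term_hit and record.get("WC", "") == "Information Science & Library Science"

# Invented example record:
record = {"TI": "Machine translation for libraries", "AB": "", "DE": "", "ID": "",
          "WC": "Information Science & Library Science"}
print(matches_query(record))  # True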
The query to create the dataset was performed on the Web of Science on April 19,
2018, and 6,607 publications were retrieved. These publications were downloaded in tab
delimited (.txt) format and corrections/unifications were made on author (AU), citation
(CR), source (SO), keyword (DE & ID), and abstract (AB) fields (Clarivate Analytics,
2018) to ensure data quality and standardization. The steps carried out in the data
standardization process were as follows:
- Authors with two names/surnames and different authors with the same
names in the datasets cause consistency problems for social network analysis
studies. Therefore, we conducted a name standardization process for author
names. In this context, the names of the authors were listed, and problematic
author names were unified.
- In some cases, identifying and analyzing citations of publications that do not
have a DOI number can create consistency problems for co-citation analyses. In
order to ensure consistency in the most cited publications in our dataset, the entire
list of cited papers was obtained and repetitive publications were unified.
- Journal name changes may also affect social network analyses negatively because
journals that changed their names can be represented as different journals
on the maps. In this context, even if a journal was renamed, it was standardized
under its most recent name to ensure consistency.
- The most important consistency problem in keyword or co-occurrence maps is
conducting analyses without parsing the keywords in titles or abstracts. If the
standardization process for keywords is not conducted before analysis, it is
possible to encounter many duplicated words in keyword maps for reasons such
as plural suffixes, word phrases, and verb inflections. In this context, all the
keywords were simplified to their most basic forms for analysis. In addition,
word phrases were identified to provide consistency. A minimal sketch of this
kind of unification follows this list.
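The sketch below illustrates, under stated assumptions, what such unifications can look
like in Python: the author-name variants are invented examples, and NLTK's WordNet
lemmatizer stands in as one possible way to reduce plural keyword forms; the actual
unification lists used in this study were compiled manually.

# Hedged sketch of the standardization steps (illustrative mappings only).
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")  # lexical database required by the lemmatizer

# 1) Author-name unification: map known variants to one canonical form.
AUTHOR_VARIANTS = {              # invented examples
    "Smith, J.A.": "Smith, J.",
    "Smith, John A.": "Smith, J.",
}

def unify_author(name):
    return AUTHOR_VARIANTS.get(name, name)

# 2) Keyword simplification: lowercase and reduce plurals to base forms.
lemmatizer = WordNetLemmatizer()

def simplify_keyword(keyword):
    return " ".join(lemmatizer.lemmatize(word) for word in keyword.lower().split())

print(unify_author("Smith, John A."))              # Smith, J.
print(simplify_keyword("Information Retrievals"))  # information retrieval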
These standardization steps made the data ready for analysis. Regarding the features of
the publications in our dataset, 65% of them are articles, 18% are proceedings papers,
and 9% are book reviews. The publications spread across 17 different document types.
The distribution of publications and citations by year is shown in Fig. 2.
Two different kinds of visualization software were used to visualize the evolution of the
NLP literature and its changes over time. While VOSviewer1 software is used for drawing
a general framework of the field, CiteSpace2 software is used for visualizing trends over
the years. VOSviewer works with tab-delimited text files and CiteSpace with
field-delimited text file formats. Therefore, the standardized data files were converted
into the formats that each software requires. The two programs are used differently,
so their manuals may be examined for detailed information (Chen,
2018; van Eck & Waltman, 2018).
1 http://www.vosviewer.com/
2 http://cluster.cis.drexel.edu/~cchen/citespace/
Figure 2. Distribution of articles and their citations by year (logarithmic scale).
Findings
Convergence of Categories
The subject categories of the Web of Science were used to find the distribution of
studies in the field of NLP and to show their convergence. Regarding the number of
publications, the most prominent category is information science and library science,
represented here by LIS. According to the
centralities, LIS is also the most central category. Computer science has the
second highest frequency of publications, and it is followed by
engineering. In Fig. 3, the
relationships of the subject categories over the years can be observed.3
Five-year intervals were used to draw the timeline, and all the items for each slice are
shown in Fig. 3. The distribution of the categories can be evaluated in four basic clusters,
which are NLP, decision support system, dictionary based systems, and recommender
systems. Cluster names were defined by CiteSpace using the most common words in
publications. According to the map, studies in the field of computer science and
information science began in 1954. In the 1960s, it converged to only the fields of law
and social sciences. Until the 1980s, no other subject areas were seen in the map. By
the 1980s, it became clear that the applications of NLP began in education, health
sciences, management, and linguistics. In addition, NLP techniques have been used in
recent years for environmental studies such as environmental risk assessment. It can
also be traced from the network map that, in recent years, NLP techniques have been
3 The extended version of the network, which includes divisions of subject clusters, is available at
https://goo.gl/6FaVRk
applied in history, psychology, and art. This shows that NLP, which is often thought to be
related only to computer science, actually has an interdisciplinary structure and can be
used in many fields, from law to history. This also indicates that the transdisciplinary
structure of the field will continue to grow in the following years, as studies on NLP
have been carried out in 58 of the 252 Web of Science subject categories.
Figure 3. Distribution of publications to the subject categories and their relations.
Title and Keyword Analysis
By analyzing various words used in the different sections of scientific articles such as
keywords, abstracts, titles, and full texts, the word maps of the fields can be revealed,
and thematic links between studies can be obtained. Keywords are considered as the
main elements for analyses in some studies (Ding, Chowdhury, & Foo, 2001; Sue &
Lee, 2010), while the words in the abstracts and titles are used to expand the scope of
some studies (Rotto & Morgan, 1997; Sedighi, 2016). Rotto and Morgan (1997, p. 101)
stated that co-word analyses should include abstracts because specific research can
only be revealed in this way. In this study, co-word analysis was carried out using titles
and abstracts of NLP publications. For this analysis, singular/plural words and
synonyms were parsed to standardize the words. Then, the abbreviations were unified,
to ensure that each word is included only once in the co-word map. The network map
created after these steps is shown in Fig. 4.4
Figure 4. Co-word analysis of title and abstracts (for interactive map, please visit https://goo.gl/GspS9C).
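What a co-word analysis computes can be illustrated with the short hedged sketch below:
it simply counts how often pairs of already-standardized terms occur together in the
same title or abstract. The documents are invented; tools such as VOSviewer perform the
same kind of pair counting at scale before laying out the map.

# Hedged sketch: counting co-word (co-occurrence) links between terms.
from itertools import combinations
from collections import Counter

# Invented, already-standardized term sets extracted from titles/abstracts.
documents = [
    {"information retrieval", "query", "relevance"},
    {"information retrieval", "relevance", "user study"},
    {"machine translation", "query", "information retrieval"},
]

cooccurrence = Counter()
for terms in documents:
    for pair in combinations(sorted(terms), 2):  # each unordered term pair once
        cooccurrence[pair] += 1

for pair, count in cooccurrence.most_common(3):
    print(pair, count)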
VosViewer determined four main clusters5 in the co-word analysis, and 289 nodes were
determined for the first cluster (red), 269 for the second (green), 201 for the third (blue),
and 185 for the fourth (yellow). To name these clusters in terms of thematic distribution,
the first cluster can be considered to contain the basic/traditional subjects of LIS. The most
used and strongest term in the cluster has strong connections with the
other clusters (total link strength: 4,928; links: 808) and is followed by terms such as
student. The most used words in bibliometric studies, such as citation
analysis, bibliometrics, and citation data are also in this cluster. In addition, concepts
such as bibliography, documentation, and knowledge management, which are traditional
LIS subjects, can be traced in this cluster. The use of NLP techniques or applications in
traditional subjects is important as it shows that all the areas of the LIS field are
harmonized to technological innovations along with their subfields.
Considering the second (green) cluster, its strongest term is the most powerful node not only
of its own cluster but of the whole map (total link strength: 5,995; links: 829). It also has
strong connections with the other clusters.
The third cluster includes keywords on human information behaviors and information
retrieval models.
4 Co-word map is also created by CiteSpace to visualize changes over time. The map is accessible on
the link: https://goo.gl/ENep6B
5 Five clusters are shown on the map, however, pink cluster includes only three keywords and these
keywords do not have strong links to others. Therefore, it is excluded from the analysis.
Many elements in this cluster are often related to end user experiences,
such as user interaction and usability. This cluster converges to the red cluster because
information behaviors are one of the fundamental subtopics of the LIS domain.
The last cluster contains key applications, algorithms, and techniques used for
performance evaluations in NLP studies. Among the most powerful keywords of this cluster
is NLP itself. It can be traced from the map that this cluster is close to
the green cluster containing the performance evaluation terms. Therefore, the green and
yellow clusters can be evaluated together and classified as applications
and performance evaluations.
Similar results were obtained in the map produced by CiteSpace6, which shows the
changes of keywords and their relationships with each other over the years. Studies
that started at the end of the 1950s focused only on computer
technology until the 1990s; however, after the 90s, NLP studies have covered almost all areas
of information science. CiteSpace identified seven main clusters. The first cluster is
named eSpace, and it comprises studies covering
information retrieval on the web. This cluster is close to the second cluster named
text documents, and thus, these two clusters are considered as one. The works in
these clusters continue from the 50s and are expected to continue in the future. The
most important observation about this cluster is social media, which started in 2010,
and the more recent information literacy issues. It can be inferred that while the
previous problem was accessing information on the web, the more recent one is
evaluating that information.
Keywords that refer to human behaviors affecting information retrieval are shown
in another cluster. Recent words there are generally
focused on social media, just as in the previous cluster.
It is possible to track keywords on online catalogs and databases subjects, which are
the results of the use of online systems in libraries, from the information
retrieval system cluster. This cluster converges to the LIS domain, and it includes
issues focused on information retrieval in libraries.
Works on health information retrieval help health subjects to create a cluster of health
issues on the map. The keywords related to the methods and techniques for accessing
health information can be seen in this cluster. Perhaps the main
reason for the formation of a separate cluster for the health field in this map is the journals
(e.g., Qualitative Health Research, Journal of Health Communication, and Health
Information and Libraries Journal) that are indexed in the Web of Science under the
LIS topic and are focused on health issues.
In the cluster covering the words that are used in bibliometric studies,
there are words such as information representation, citation analysis, and
bibliometrics. The last cluster, which spans a 10-year
period, seems to focus on the end user experience and information retrieval processes.
6 https://goo.gl/MRK9Rd
Journal Co-Citation Analysis
Co-citation analyses aim to reveal similarities to present general structures of research
domains. Journal co-citation analyses help to develop core journal lists that are specific
to the domains, reveal the expertise of the areas through published literature, and
develop effective collections for users (McCain, 1991, p. 291). The aim of this study is
to suggest a core journal collection for researchers and librarians working on the field
of LIS and using NLP techniques in their works. In this context, the most frequently
cited journals7 and the links between these journals are shown in Fig. 5. All journal
titles are standardized so that journal title changes do not affect the consistency of the
journal co-citation map.
As seen in Fig. 5, CiteSpace determined six clusters for journal co-citation map, which
are "information retrieval," "seeking context," "information retrieval system design,"
"information retrieval: Health," "authorship attribution, ." The
cluster starts with Information Retrieval Journal. Association for
Computing Machinery has a significant impact in this cluster. Several ACM journals
can be traced in the cluster. It is important for researchers conducting investigations
on information retrieval to follow ACM publications and ACM SIGIR conferences to be
informed about the latest developments in their fields. In cluster #3 (information
retrieval: health), there are journals that focus on information retrieval in the health sciences.
Researchers working in the field of health information may benefit from these journals.
The second cluster containing journals on information behaviors starts with The
Journal of the Association for Information Science and Technology.8 This journal is
also connected with the information retrieval cluster. Other journals in this cluster are
Online Information Review, Aslib Proceedings, Library Quarterly, and the Annual
Review of Information Science and Technology, which cover key topics in LIS.
Considering cluster #2, researchers seeking publications on information retrieval
system design for information centers may follow LIS journals such as Electronic
Library, Library Trends, Journal of Academic Libraries, and Cataloging and
Classification Quarterly.
The next cluster includes journals on the subject of bibliometrics,
such as Scientometrics and Journal of Informetrics as well as interdisciplinary journals
such as Science and Nature. It seems that cluster #4 is a cluster that includes
bibliometrics studies using NLP techniques. Researchers conducting bibliometric
studies may follow the journals listed in this cluster.
Although the last cluster is identified by CiteSpace as authorship attribution, it
includes journals on computer science and management indexed in the LIS subject in the
Web of Science. Interdisciplinary researchers who work across computer science and LIS
may benefit from the journals listed in the cluster.
7 Top 1% of cited journals are selected for visualization.
8 The name of the journal was American Documentation between 1956 and 1969.
Document Co-Citation Analysis
Document co-citation analysis method, which is primarily based on identifying pairs of
highly cited papers (Garfield, 2001, p. 2), may be useful for expediting knowledge and
building consilience across disciplines (Trujillo & Long, 2018). The most co-cited
articles by NLP papers on the LIS subject are shown in Fig. 6.
Figure 6. Document co-citation map of mostly cited papers (for interactive map, please visit https://goo.gl/yiw5ix).
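The counting behind such a map is simple: two references are co-cited once for every
paper whose reference list contains both, and a document's links and total link strength
follow from those pair counts. The hedged sketch below shows this with invented
reference lists; clustering and layout are left to tools such as VOSviewer.

# Hedged sketch: co-citation counts, links, and total link strength
# (reference lists are invented examples).
from itertools import combinations
from collections import defaultdict

citing_papers = [                 # each set is one paper's cited references
    {"Salton 1983", "Robertson & Sparck-Jones 1976", "Ingwersen 1996"},
    {"Salton 1983", "Robertson & Sparck-Jones 1976"},
    {"Salton 1983", "Kuhlthau 1991", "Bates 1989"},
]

strength = defaultdict(int)       # (doc_a, doc_b) -> co-citation count
for refs in citing_papers:
    for pair in combinations(sorted(refs), 2):
        strength[pair] += 1

def node_stats(doc):
    pairs = [p for p in strength if doc in p]
    links = len(pairs)                        # distinct co-cited partners
    total = sum(strength[p] for p in pairs)   # total link strength
    return links, total

print(node_stats("Salton 1983"))  # (4, 5) with these invented lists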
According to the map created by VosViewer, five different clusters were determined.
The first cluster (red) includes 179 publications on the information retrieval subject. The
book entitled "Introduction to Modern Information Retrieval" (Salton, 1983) is the
strongest node in this cluster, with 304 links and 1,695 total link strength. It is followed
by "Relevance Weighting of Search Terms" (Robertson & Sparck-
Jones, 1976). Publications in this cluster can serve as
reference guides for authors who work in information retrieval domain.
The cluster defined as the green color contains 107 publications on human information
behaviors. The most co-cited paper in this cluster is "Cognitive Perspectives of
Information Retrieval Interaction: Elements of a Cognitive IR Theory" with 316 links
and 1997 total link strength (Ingwersen, 1996). It is followed by the papers "The Design
of Browsing and Berrypicking Techniques for the Online Search Interface" (Bates,
1989) and "Inside the Search Process: Information Seeking from the Users
Perspective" (Kuhlthau, 1991).
The blue cluster contains 62 papers published on topics of human information behavior
and information retrieval. The most important difference of this cluster from the green
one is the similarity of the blue and red cluster subjects. The most frequently co-cited paper
in this cluster is "ASK for Information Retrieval: Background and Theory" by Belkin,
Oddy, and Brooks (1982).
Forty-four publications are shown in the yellow cluster. The cluster includes articles
published in the main LIS subfields. The green cluster containing the information
behavior issues resembles the yellow one in that it contains similar subjects. The most
frequently co-cited paper in this cluster is "Relevance: A Review of and a Framework
for Thinking on the Notion in Information Science" (Saracevic, 1975).
The last cluster contains 41 publications and is defined with a pink color. The
publications in this cluster seem to be focused on the philosophy of librarianship and
information retrieval. The most frequently co-cited publication in this cluster is
"Information Retrieval" by van Rijsbergen (1979).9
Discussion
Natural language processing can be used to understand and evaluate information
effectively, and it became especially useful with the continuous rise in the amount of
information. The main purpose of this study is to reveal opportunities to use NLP
techniques in LIS field as well as to draw a framework for early career researchers who
intend to conduct studies in the LIS field using NLP techniques. In accordance with this
purpose, this study focused on finding answers to three basic research questions,
which are discussed below.
The first research question was designed to determine which natural language
processing applications were generally used in the LIS field. According to the findings, the
first NLP studies in the field of LIS were based on the subject of information retrieval
and information retrieval systems. Besides, it was found that NLP applications were
used in many subjects in LIS from traditional librarianship to next-generation practices.
The main research topics of NLP applications in LIS are user experiences, information
behaviors, bibliometrics and information systems design.
The second research question aimed to determine the most frequently cited journals
in articles that used NLP applications in the LIS field. It was found that preferred journals
for citation varied according to the subjects of the studies. According to the results,
ACM is the most important organization for people working on the information retrieval
subject in LIS. JASIS&T and Online Information Review journals may be followed for
information behavior subjects; Electronic Library and Library Trends for designing
effective information retrieval systems; Scientometrics and Journal of Informetrics for
bibliometric studies; and journals of IEEE and MIS Quarterly for interdisciplinary issues.
The third research question aimed to determine the core publications in order to provide a
guideline for researchers who are new to the field. As a result, the core publications were
classified into certain subject categories such as information retrieval and the philosophy of
librarianship. Through the publications in each subject class, focusing on information
retrieval, information behaviors, information seeking, and the philosophy of LIS, it is
possible to follow the reference publications of the field.
9 For a complete list of co-cited papers and their distribution to the clusters, please follow this link:
https://goo.gl/hPKF7W
Conclusion
In this study, we aimed to reveal the main characteristics of NLP-based LIS publications
by using the benefits of social network analysis and bibliometrics. In recent years, the
increase in the amount of information and difficulties in information processing have
led to the use of NLP practices in LIS, which is also an information-based field.
Although it is thought to be a subfield of computer science, NLP techniques have been
widely used in LIS for the past 10 years. In addition, NLP techniques have made it
possible to perform more tasks with less human power in the field of LIS. Because of
their usefulness, it is important to provide ideas about next-generation methods to
people working in the LIS field. This study also revealed that NLP is a transdisciplinary
subject for LIS studies. NLP is not only used in technical studies, but also in studies on
library philosophy. Moreover, it was revealed that NLP methods have been used in
many LIS works in the literature, and through this study, the general framework of these
studies in the field of LIS has been drawn.
The most important benefit of this study is the guidance it provides to early-career
researchers who work in the LIS field. Through the obtained results, core journals and the
most influential publications were identified and relations of these publications were
revealed. In addition, this study provides a detailed literature review on which NLP
applications can be used in future studies, which journals can be followed, and
which core publications can be read.
Acknowledgement
This article was supported in part by a research grant from the Turkish Scientific and
Technological Research Center (115K440).
References
Adedayo, A.V. (2015). Citations in introduction and literature review sections should
not count for quality. Performance Measurement and Metrics, 16(3), 303-306.
Akbulut, M. (2016). [Relevance rankings using pennant diagrams]. Thesis,
Hacettepe University, Ankara.
Arısoy, E., Saraçlar, M., Roark, B. & Shafran, I. (2010). Syntactic and sub-lexical
features for Turkish discriminative language models. In IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP), 2010 (pp.
5538-5541). Dallas, TX: IEEE.
Avram, S., Velter, V. & Dumitrache, I. (2014). Semantic analysis applications in
computational bibliometrics. Control Engineering and Applied Informatics, 16(1), 62-69.
Bates, M.J. (1989). The design of browsing and berrypicking techniques for the
online search interface. Online Review, 13(5), 407-424.
Belkin, N.J., Oddy, R.N. & Brooks, H.M. (1982). ASK for information retrieval: Part I.
Background and theory. Journal of Documentation, 38(2), 61-71.
BIRNDL2018. (2018). 3rd Joint Workshop on Bibliometric-enhanced Information
Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2018).
http://wing.comp.nus.edu.sg/~birndl-sigir2018/
Blair, D.C. (1990). Language and representation. Annual Review of Information
Science and Technology, 44(1), 159-200.
Blake, C. (2013). Text mining. Annual Review of Information Science and
Technology, 45(1), 121-155.
Cambria, E. & White, B. (2014). Jumping NLP curves: a review of natural language
processing research. IEEE Computational Intelligence Magazine. doi:
10.1109/MCI.2014.2307227
Carevic, Z. & Mayr, P. (2014). Recommender systems using pennant diagrams in
digital libraries. 13th European Networked Knowledge Organization Systems
(NKOS) Workshop. https://arxiv.org/ftp/arxiv/papers/1407/1407.7276.pdf
Catalini, C., Lacetera, N. & Oettl, A. (2015). The incidence and role of negative
citations in science. PNAS, 112(45), 13823-13826.
Chen, C. (2018) The CiteSpace manual. https://leanpub.com/howtousecitespace
Chen, C., Ibekwe-SanJuan, F. & Hou, J. (2010). The structure and dynamics of
cocitation clusters: a multiple-perspective cocitation analysis. Journal of the
American Society for Information Science and Technology, 61(7), 1386-1409.
Chen, X., Xie, H., Wang, F.L., Liu, Z., Xu, J. & Hao, T. (2018). A bibliometric analysis of
natural language processing in medical research. BMC Medical Informatics and
Decision Making, 18(1).
Chowdhury, G.G. (2003). Natural language processing. Annual Review of Information
Science and Technology, 37(1), 51-89.
Clarivate Analytics. (2018). Web of Science core collection field tags.
https://images.webofknowledge.com/images/help/WOS/hs_wos_fieldtags.html
Çarkı, K., Geutner, P. & Schultz, T. (2000). Turkish LVCSR: Towards better speech
recognition for agglutinative languages. In Acoustics, Speech, and Signal
Processing, 2000. ICASSP '00. Proceedings (pp. 1563-1566). Istanbul: IEEE.
Ding, Y., Chowdhury, G.G. & Foo, S. (2001). Bibliometric cartography of information
retrieval research by using co-word analysis. Information Processing &
Management, 37, 817-842.
Eryiğit, G. (2014). ITU Turkish natural language processing pipeline.
http://tools.nlp.itu.edu.tr/MorphAnalyzer
Feldman, S. (1999). NLP meets the jabberwocky: Natural language processing in
information retrieval. Online, 23, 62-72.
Fiszman, M., Demner-Fushman, D., Kilicoglu, H. & Rindflesch, T.C. (2009). Automatic
summarization of MEDLINE citations for evidence-based medical treatment: a
topic-oriented evaluation. Journal of Biomedical Informatics, 42, 801-813.
Galvez, C. & Moya-Anegón, F. (2007). Standardizing formats of corporate source
data. Scientometrics, 70(1), 3-26.
Garfield, E. (2001). From bibliographic coupling to co-citation analysis via algorithmic
historio-bibliography.
http://garfield.library.upenn.edu/papers/drexelbelvergriffith92001.pdf
Glänzel, W., Heeffler, S. & Thijs, B. (2017). Lexical analysis of scientific publications
for nano-level scientometrics. Scientometrics, 111, 1897-1906.
Gumpenberger, C., Gorraiz, J., Wieland, M., Roche, I., Schiebel, E., Besagni, D. &
François, C. (2013). Exploring the bibliometric and semantic nature of negative
results. Scientometrics, 95, 277-297.
Hooper, C.J., Neves, B. & Bordea, G. (2015). A disciplinary analysis of internet
science. In Tiropanis, T., Vakali, A. Sartori, L & Burnap, P. (eds) Internet
Science. INSCI 2015. Lecture Notes in Computer Science, vol 9089 (pp. 63-77).
Switzerland: Springer Cham.
Ingwersen, P. (1996). Cognitive perspectives of information retrieval interaction:
Elements of a cognitive IR theory. Journal of Documentation, 52(1), 3-50.
Jha, R., Jbara, A-A., Qazvinian, V. & Radev, D.R. (2016). NLP-driven citation analysis
for scientometrics. Natural Language Engineering, 23(1), 93-130.
Kim, I.C., Le, D.X. & Thoma, G.R. (2014). Automated model for extracting citation
sentences from online biomedical articles using SVM-based text summarization
technique. In Proceedings 2014 IEEE International Conference on Systems, Man,
and Cybernetics (SMC), San Diego, CA (pp. 1991-1996). San Diego: IEEE.
Kuhlthau, C.C. (1991). Inside the search process: Information seeking from the
user's perspective. Journal of the American Society for Information Science,
42(5), 361-371.
Lewis, D.D. & Jones, K.S. (1996). Natural language processing for information
retrieval. Communications of the ACM, 39(1), 92-101.
Li, K., Rollins, J. & Yan, E. (2018). Web of Science use in published research and
review papers 1997-2017: a selective, dynamic, cross-domain, content-based
analysis. Scientometrics, 115, 1-20.
Liddy, E.D. (2010). Natural language processing. In Encyclopedia of Library and
Information Sciences, Third Edition (pp. 3864-3873). New York: Taylor and Francis.
Manning, C.D. & Schütze, H. (1999). Foundations of statistical natural language
processing. London: MIT Press.
Maričić, S., Spaventi, J., Pavičić, L. & Pifat-Mrzljak, G. (1998). Citation context versus
the frequency counts of citation histories. Journal of the American Society for
Information Science, 49(6), 530-540.
Mayr, P. & Scharnhorst, A. (2015). Combining bibliometrics and information retrieval:
preface. Scientometrics, 102(3), 2191-2192.
McCain, K.W. (1991). Mapping economics through the journal literature: an
experiment in journal cocitation analysis. Journal of the American Society For
Information Science, 42(4), 290-296.
Mikova, N. (2016). Recent trends in technology mining approaches: quantitative
analysis of GTM Conference Proceedings. In Daim, T.U., Chiavetta, D., Porter,
A.L. & Saritas, O. (eds) Anticipating Future Innovation Pathways through Large
Data Analysis (pp. 59-70). Switzerland: Springer Nature.
Nanba, H. & Okumura, M. (1999). Towards multi-paper summarization using reference
information. In IJCAI'99 Proceedings of the 16th International Joint Conference
on Artificial Intelligence - Volume 2 (pp. 926-931). Stockholm: Morgan
Kaufmann Publishers Inc.
Natural Language Processing. (2017). Oxford Living Dictionaries.
https://en.oxforddictionaries.com/definition/natural_language_processing
Névéol, A. & Zweigenbaum, P. (2016). Clinical natural language processing in 2015:
leveraging the variety of texts of clinical interest. IMIA Yearbook of Medical
Informatics 2016, 234-239. doi: 10.15265/IY-2016-049
Robertson, S.E. & Sparck-Jones, K. (1976). Relevance weighting of search terms.
Journal of the American Society for Information Science, 27(3), 129-146.
Rotto, E. & Morgan, R.P. (1997). An exploration of expert-based text analysis
techniques for assessing industrial relevance in U.S. engineering dissertation
abstracts. Scientometrics, 40, 83-102.
Salton, G. (1983). Introduction to modern information retrieval. New York: McGraw Hill.
Saracevic, T. (1975). Relevance: A review of and a framework for thinking on the
notion in information science. Journal of the American Society for Information
Science, 26(6), 321-343.
Schultz, T. & Waibel, A. (2001). Language-independent and language-adaptive acoustic
modeling for speech recognition. Speech Communication, 35(1-2), 31-51.
Sedighi, M. (2016). Application of word co-occurrence analysis method in mapping of the
scientific fields (case study: the field of Informetrics). Library Review, 65(1-2), 52-64.
Shen, H-P., Wu, C-H. & Tsai, P-S. (2015). Model generation of accented speech using
model transformation and verification for bilingual speech recognition. ACM
Transactions on Asian and Low-Resource Language Information Processing, 14(2).
concepts and algorithms]
Sue, H-N. & Lee, P-C. (2010). Mapping knowledge structure by keyword co-occurrence: a
first look at journal papers in Technology Foresight. Scientometrics, 85, 65-79.
Taşkın, Z. & Al, U. (2013). Institutional name confusion on citation indexes: The
example of the names of Turkish hospitals. Procedia - Social and Behavioral
Sciences, 73, 544-550.
Taşkın, Z. & Al, U. (2014). Standardization problem of author affiliations in citation
indexes. Scientometrics, 98(1), 347-368.
Taşkın, Z. & Al, U. (2018). A content-based citation analysis study based on text
categorization. Scientometrics, 114(1), 335-357.
Taşkın, Z., Al, U. & Sezen, U. (2017). A content-based
citation analysis study: detection of citation sentences. In STI2017, open
indicators: innovation, participation and actor-based STI indicators, Paris, 2017.
https://goo.gl/hdbnt3
Trujillo, C.M. & Long, T.M. (2018). Document co-citation analysis to enhance
transdisciplinary research. Science Advances, 4(1). doi: 10.1126/sciadv.1701130
van Eck, N.J. & Waltman, L. (2018). VOSviewer manual.
http://www.vosviewer.com/download/f-z2x2.pdf
van Rijsbergen, C.J. (1979). Information retrieval. London: Butterworths.
White, H.D. (2007). Combining bibliometrics, information retrieval, and relevance
theory. Part 1: First examples of a synthesis. Journal of the American Society for
Information Science and Technology, 58, 536-559.
White, H.D. (2018). Pennants for Garfield: bibliometrics and document retrieval.
Scientometrics, 114, 757-778.
Zemberek NLP. (2015). http://zembereknlp.blogspot.com.tr/