Text can be analysed by splitting it and extracting keywords, which may then be represented as summaries, tables, graphs, or images. The sheer volume of information available in textual form has motivated research into extracting text and transforming it from an unstructured to a structured format. This paper presents the importance of Natural Language Processing (NLP) and two of its applications implemented in Python: 1. Automatic text summarization [Domain: Newspaper articles] 2. Text-to-graph conversion [Domain: Stock news]. The main challenge in NLP is natural language understanding, i.e. deriving meaning from human or natural language input, which is done using regular expressions, artificial intelligence, and database concepts. The automatic summarization tool converts newspaper articles into summaries based on the frequency of words in the text. The text-to-graph converter takes a stock article as input, tokenizes it on various indices (points and percent) and time, and then maps the tokens to a graph. This paper proposes a business solution that helps users manage their time effectively.
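A minimal sketch of the frequency-based summarizer described above, assuming NLTK is installed with its punkt and stopwords resources downloaded. The scoring scheme (normalized word frequencies summed per sentence) is one plausible realization of "summary on the basis of frequency of words", not the paper's exact code.

    from collections import Counter

    from nltk.corpus import stopwords
    from nltk.tokenize import sent_tokenize, word_tokenize
    # one-time setup: nltk.download('punkt'); nltk.download('stopwords')

    def summarize(text, num_sentences=3):
        stop = set(stopwords.words("english"))
        words = [w.lower() for w in word_tokenize(text)
                 if w.isalpha() and w.lower() not in stop]
        freq = Counter(words)
        if not freq:
            return ""
        top_count = max(freq.values())
        weights = {w: c / top_count for w, c in freq.items()}  # normalize to [0, 1]
        sentences = sent_tokenize(text)
        scores = {i: sum(weights.get(w.lower(), 0.0) for w in word_tokenize(s))
                  for i, s in enumerate(sentences)}
        # keep the top-scoring sentences, emitted in original order
        best = sorted(sorted(scores, key=scores.get, reverse=True)[:num_sentences])
        return " ".join(sentences[i] for i in best)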
Text mining is the process of extracting interesting and non-trivial knowledge or information from unstructured text data. It is a multidisciplinary field drawing on data mining, machine learning, information retrieval, computational linguistics, and statistics. Important text mining processes include information extraction, information retrieval, natural language processing, text classification, content analysis, and text clustering. All of these processes require a preprocessing step before performing their intended task. Preprocessing significantly reduces the size of the input text documents and involves sentence boundary determination, language-specific stop-word elimination, tokenization, and stemming. Among these, the most essential action is tokenization, which divides the textual information into individual words. Many open source tools are available for performing tokenization. The main objective of this work is to analyze the performance of seven open source tokenization tools: Nlpdotnet Tokenizer, Mila Tokenizer, NLTK Word Tokenize, TextBlob Word Tokenize, MBSP Word Tokenize, Pattern Word Tokenize, and Word Tokenization with Python NLTK. Based on the results, we observed that the Nlpdotnet Tokenizer performs better than the other tools.
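To illustrate why tokenizer choice matters, here is a small comparison of one of the surveyed approaches (NLTK's word tokenizer) against a naive regular-expression split; the example sentence is ours, not the study's test data.

    import re

    from nltk.tokenize import word_tokenize  # needs nltk.download('punkt')

    text = "Dr. Smith's report wasn't finished."
    print(word_tokenize(text))       # splits clitics: ['Dr.', 'Smith', "'s", ..., 'was', "n't", ...]
    print(re.findall(r"\w+", text))  # naive split loses punctuation: ['Dr', 'Smith', 's', ...]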
The document describes a proposed system for automatic text summarization using latent semantic analysis and natural language processing. The system would take in a user-submitted document, preprocess it by removing stop words, generate a singular value decomposition matrix to derive the latent semantic structure, and output a summary by semantically arranging sentences to encompass the original text's concepts. The document provides an implementation example in Python using the LSA method from the scikit-learn machine learning library for natural language processing and matrix decomposition.
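A hedged sketch of the LSA step described above, using scikit-learn's TfidfVectorizer and TruncatedSVD (the library offers SVD-based decomposition rather than a single "LSA method"). Picking, for each latent topic, the sentence with the largest component is one common LSA summarization heuristic, assumed here rather than taken from the document.

    from nltk.tokenize import sent_tokenize
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    def lsa_summary(text, num_topics=2):
        sentences = sent_tokenize(text)
        matrix = TfidfVectorizer(stop_words="english").fit_transform(sentences)
        k = max(1, min(num_topics, len(sentences) - 1))
        svd = TruncatedSVD(n_components=k)
        topic_weights = svd.fit_transform(matrix)   # rows: sentences, cols: latent topics
        # take the sentence loading most heavily on each topic
        picked = sorted({int(topic_weights[:, j].argmax()) for j in range(k)})
        return " ".join(sentences[i] for i in picked)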
Integrating natural language processing and software engineering (Nakul Sharma)
This document summarizes research on integrating natural language processing and software engineering. It provides a literature review of works that have used natural language text as input to generate software engineering artifacts like UML diagrams, test cases, and process models. The paper also discusses how techniques from natural language processing can be applied to different phases of the software development life cycle and how natural language understanding can help automate software engineering tasks.
Keyword Extraction Based Summarization of Categorized Kannada Text Documents (ijsc)
The internet has caused a humongous growth in the number of documents available online. Summaries of documents can help find the right information and are particularly effective when the document base is very large. Keywords are closely associated with a document, as they reflect the document's content and act as indices for it. In this work, we present a method to produce extractive summaries of documents in the Kannada language, with a given number of sentences as the limit. The algorithm extracts keywords from pre-categorized Kannada documents collected from online resources. We use two feature selection techniques for obtaining features from documents, combining scores obtained by GSS (Galavotti, Sebastiani, Simi) coefficients and IDF (Inverse Document Frequency) along with TF (Term Frequency) to extract keywords, which are then used for summarization based on sentence rank. In the current implementation, a document from a given category is selected from our database and, depending on the number of sentences given by the user, a summary is generated.
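A hedged sketch of the keyword-scoring idea above: the GSS (simplified chi-square) category score combined with TF-IDF. The corpus layout (documents as token lists with category labels) and the multiplicative combination are illustrative assumptions, not the paper's exact formula.

    import math
    from collections import Counter

    def gss(term, docs, labels, category):
        # GSS(t,c) = P(t,c)P(~t,~c) - P(t,~c)P(~t,c)
        n = len(docs)
        tc   = sum(1 for d, l in zip(docs, labels) if term in d and l == category)
        tnc  = sum(1 for d, l in zip(docs, labels) if term in d and l != category)
        ntc  = sum(1 for d, l in zip(docs, labels) if term not in d and l == category)
        ntnc = n - tc - tnc - ntc
        return (tc / n) * (ntnc / n) - (tnc / n) * (ntc / n)

    def keyword_scores(doc, docs, labels, category):
        df = Counter(t for d in docs for t in set(d))   # document frequency
        tf = Counter(doc)                               # term frequency in this doc
        return {t: tf[t] * math.log(len(docs) / max(1, df[t]))
                   * (1.0 + gss(t, docs, labels, category))  # assumed combination
                for t in tf}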
A Novel Approach for Rule Based Translation of English to Marathi (aciijournal)
This paper presents the design of a rule-based machine translation system for the English-Marathi language pair. The system takes an English sentence as input and parses it with the Stanford parser, which handles the main source-side processing. An English-Marathi bilingual dictionary is created; the system takes the parsed output, separates the source text word by word, and searches for the corresponding target words in the bilingual dictionary. Hand-coded rules are written for Marathi inflections, along with reordering rules; after the reordering rules are applied, the English sentence is syntactically reordered to suit the Marathi language.
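A toy pipeline in the spirit of the paper: word-by-word dictionary lookup followed by a reordering rule. The bilingual dictionary entries and the single SVO-to-SOV rule below are illustrative placeholders, not the paper's actual resources.

    # Illustrative English -> Marathi (romanized) dictionary; placeholder data.
    BILINGUAL_DICT = {"he": "to", "reads": "vachto", "a": "ek", "book": "pustak"}

    def translate(english_tokens):
        # Word-by-word dictionary lookup; unknown words pass through unchanged.
        target = [BILINGUAL_DICT.get(w.lower(), w) for w in english_tokens]
        # English is SVO, Marathi is SOV: move the verb to the end
        # (a stand-in for the paper's hand-coded reordering rules).
        if len(target) >= 2:
            subject, verb, rest = target[0], target[1], target[2:]
            target = [subject] + rest + [verb]
        return " ".join(target)

    print(translate(["He", "reads", "a", "book"]))  # "to ek pustak vachto"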
Natural Language Processing Theory, Applications and Difficulties (ijtsrd)
The promise of a powerful computing device to help people in productivity as well as in recreation can only be realized with proper human-machine communication. Automatic recognition and understanding of spoken language is the first step toward natural human-machine interaction. Research in this field has produced remarkable results, leading to many exciting expectations and new challenges. This field is known as Natural Language Processing. In this paper, natural language generation and natural language understanding are discussed, along with difficulties in NLU, applications, and a comparison with structured programming languages. Mrs. Anjali Gharat | Mrs. Helina Tandel | Mr. Ketan Bagade, "Natural Language Processing Theory, Applications and Difficulties", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-6, October 2019. URL: https://www.ijtsrd.com/papers/ijtsrd28092.pdf Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/28092/natural-language-processing-theory-applications-and-difficulties/mrs-anjali-gharat
Summarization of software artifacts is an ongoing field of research in the software engineering community because of the benefits it provides, such as saving time and effort in tasks like code search, duplicate bug report detection, and traceability link recovery. Summarization produces short, concise summaries. This paper reviews the state of the art of summarization techniques in the software engineering context. It gives a brief overview of the software artifacts that are most often summarized or that benefit from summarization, describes the general process of summarization, reviews papers published from 2010 to June 2017, and classifies the works into extractive and abstractive summarization. It also reviews the evaluation techniques used for summarizing software artifacts, discusses the open problems and challenges in this field of research, and outlines the future scope in this area for new researchers.
Single document keywords extraction in Bahasa Indonesia using phrase chunking (TELKOMNIKA JOURNAL)
Keywords help readers to understand the idea of a document quickly. Unfortunately, considerable time and effort are often needed to come up with a good set of keywords manually. This research focused on generating keywords from a document automatically using phrase chunking. Firstly, we collected part of speech patterns from a collection of documents. Secondly, we used those patterns to extract candidate keywords from the abstract and the content of a document. Finally, keywords are selected from the candidates based on the number of words in the keyword phrases and some scenarios involving candidate reduction and sorting. We evaluated the result of each scenario using precision, recall, and F-measure. The experiment results show: i) shorter-phrase keywords with string reduction extracted from the abstract and sorted by frequency provides the highest score, ii) in every proposed scenario, extracting keywords using the abstract always presents a better result, iii) using shorter-phrase patterns in keywords extraction gives better score in comparison to using all phrase patterns, iv) sorting scenarios based on the multiplication of candidate frequencies and the weight of the phrase patterns offer better results.
Benchmarking NLP toolkits for enterprise application (Conference Papers)
The document summarizes a study that evaluated five popular natural language processing (NLP) toolkits (CoreNLP, NLTK, OpenNLP, SparkNLP, and spaCy) on their ability to perform common NLP tasks like sentence segmentation, tokenization, lemmatization, part-of-speech (POS) tagging, and named entity recognition (NER) on news articles in Malay. The study found that CoreNLP achieved the highest accuracy on four tasks, while spaCy was slightly better than CoreNLP for POS tagging. When retraining the NER models on Malaysian entities, CoreNLP and spaCy achieved the highest F-score of 0.78, beating OpenNLP.
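A minimal benchmarking harness of the kind such studies use: run the same tokenization task through NLTK and spaCy and time them. The model names assume the standard downloads (punkt, en_core_web_sm); the Malay models used in the study are not publicly bundled, so English stands in here.

    import time

    import spacy
    from nltk.tokenize import word_tokenize

    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    docs = ["The quick brown fox jumps over the lazy dog."] * 1000

    start = time.perf_counter()
    nltk_tokens = [word_tokenize(d) for d in docs]
    print(f"NLTK : {time.perf_counter() - start:.3f}s")

    start = time.perf_counter()
    spacy_tokens = [[t.text for t in doc] for doc in nlp.pipe(docs)]
    print(f"spaCy: {time.perf_counter() - start:.3f}s")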
This document provides an introduction to natural language processing (NLP) and discusses several key concepts:
- NLP aims to allow computers to understand human language in a way similar to humans. Examples of NLP applications discussed include spam filters, sentiment analysis tools, digital assistants, and language translators.
- The document outlines some of the core components of NLP systems, including natural language understanding to interpret text/speech meaning, and natural language generation to produce output text/speech.
- It introduces the NLTK (Natural Language Toolkit) as a popular Python package used for various NLP tasks like tokenization, tagging, parsing, and more. Basic NLTK structure and example modules are covered at a basic level.
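A short example of the NLTK tasks listed above: tokenization, POS tagging, and chunk-based named entity recognition. It assumes the punkt, averaged_perceptron_tagger, maxent_ne_chunker, and words resources have been fetched with nltk.download(...).

    import nltk

    sentence = "Google was founded by Larry Page in California."
    tokens = nltk.word_tokenize(sentence)   # tokenization
    tagged = nltk.pos_tag(tokens)           # part-of-speech tagging
    tree = nltk.ne_chunk(tagged)            # named entity chunking

    print(tagged)   # [('Google', 'NNP'), ('was', 'VBD'), ...]
    print(tree)     # a tree with entity subtrees, e.g. (PERSON Larry/NNP Page/NNP)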
Punjabi to Hindi Transliteration System for Proper Nouns Using Hybrid Approach (IJERA Editor)
Language is an effective medium of communication that conveys the ideas and expressions of the human mind. There are more than 5,000 languages in the world, and knowing all of them is not a feasible answer to the language barrier in communication. In this multilingual world, with huge amounts of information exchanged between regions in different languages in digitized form, it has become necessary to find an automated process for converting one language to another. Natural Language Processing (NLP) is an active area of research that explores how computers can be used to understand and manipulate natural language text or speech. In the proposed system, a hybrid approach to transliterating proper nouns from Punjabi to Hindi is developed; the hybrid approach is a combination of direct mapping, a rule-based approach, and Statistical Machine Translation (SMT). The proposed system is tested on various proper nouns from different domains and its accuracy is very good.
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM (ijnlc)
The document describes a machine transliteration system that transliterates Hindi and Marathi names and words to English using support vector machines (SVM). It segments source language names into phonetic units, and trains an SVM classifier using phonetic units and n-grams as features to label each unit with its English transliteration. The system achieves good accuracy for Hindi-English and Marathi-English transliteration.
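A sketch of the SVM setup described above: character n-grams of a source-language unit as features, the English rendering as the class label. The five Devanagari/Latin training pairs are tiny illustrative stand-ins for the paper's phonetic-unit data.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    units  = ["क", "ख", "ग", "घ", "च"]           # Devanagari phonetic units (toy set)
    labels = ["ka", "kha", "ga", "gha", "cha"]    # their English transliterations

    model = make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(1, 2)),  # character n-gram features
        LinearSVC(),
    )
    model.fit(units, labels)
    print(model.predict(["ख"]))  # ['kha']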
Survey on Indian CLIR and MT systems in Marathi Language (Editor IJCATR)
Cross Language Information Retrieval (CLIR) deals with retrieving relevant information stored in a language different from the language of the user's query. This helps users express their information need in their native languages. The machine-translation-based (MT-based) approach to CLIR uses existing machine translation techniques to translate queries automatically. This paper covers the research work done on CLIR and MT systems for the Marathi language in India.
IRJET - Text Optimization/Summarizer using Natural Language Processing (IRJET Journal)
1. The document discusses the development of an intelligent system to optimize the English language using natural language processing techniques. The system will perform functions like summarization, spell check, grammar check, and sentence auto-completion.
2. It describes the various algorithms used for each function, including extracting important sentences for summarization, comparing words to dictionaries for spell check, analyzing syntax for grammar check, and completing sentences based on previous user data for auto-completion.
3. The system aims to build a smart tool that can correct errors and summarize text in English to improve communication through optimized language.
XAI LANGUAGE TUTOR - A XAI-BASED LANGUAGE LEARNING CHATBOT USING ONTOLOGY AND... (ijnlc)
In this paper, we propose an XAI-based Language Learning Chatbot (the XAI Language Tutor) using ontology and transfer learning techniques. To facilitate three levels of language learning, the XAI Language Tutor consists of three levels for systematic English learning: 1) a phonetics level for speech recognition and pronunciation correction; 2) a semantic level for specific-domain conversation; and 3) simulation of "free-style conversation" in English, the highest level of language chatbot communication, as a "free-style conversation agent". In terms of academic contribution, we implement an ontology graph to explain the performance of free-style conversation, following the concept of XAI (Explainable Artificial Intelligence) to visualize the connections of the neural network and explain the output sentences of the language model. From an implementation perspective, our XAI Language Tutor agent integrates a WeChat mini-program as the front end and a fine-tuned GPT-2 transfer learning model as the back end, interpreting responses via the ontology graph. All of our source code has been uploaded to GitHub: https://github.com/p930203110/EnglishLanguageRobot
The Process of Information extraction through Natural Language Processing (Waqas Tariq)
Information Retrieval (IR) is the discipline that deals with the retrieval of unstructured data, especially textual documents, in response to a query or topic statement, which may itself be unstructured (e.g., a sentence or even another document) or structured (e.g., a boolean expression). The need for effective methods of automated IR has grown in importance because of the tremendous explosion in the amount of unstructured data, both in internal corporate document collections and in the immense and growing number of document sources on the Internet. The topics covered include: formulation of structured and unstructured queries and topic statements; indexing (including term weighting) of document collections; methods for computing the similarity of queries and documents; classification and routing of documents in an incoming stream to users on the basis of topic or need statements; clustering of document collections on the basis of language or topic; and statistical, probabilistic, and semantic methods of analyzing and retrieving documents. Information extraction from text has therefore been pursued actively as an attempt to present knowledge from published material in a computer-readable format. An automated extraction tool would not only save time and effort but also pave the way to discovering hitherto unknown information implicitly conveyed in text. Work in this area has focused on extracting a wide range of information, such as the chromosomal location of genes, protein functional information, associating genes by functional relevance, and relationships between entities of interest. While clinical records provide a semi-structured, technically rich data source for mining information, publications, in their unstructured format, pose a greater challenge, addressed by many approaches.
A prior case study of natural language processing on different domains (IJECEIAES)
This document summarizes a prior case study on natural language processing across different domains. It begins with an introduction to natural language processing, describing how it is a branch of artificial intelligence that allows computers to understand human language. It then reviews several existing studies that applied natural language processing techniques such as named entity recognition and text mining to tasks like identifying technical knowledge in resumes, enhancing reading skills for deaf students, and predicting student performance. The document concludes by highlighting some of the challenges in developing new natural language processing models.
Implementation of Urdu Probabilistic Parser (Waqas Tariq)
The implementation of Urdu probabilistic parser is the main contribution of this research work. In the beginning, a lot of Urdu text was collected from different sources. The sentences in the text were subsequently tagged. The tagged sentences were then parsed by a chart parser to formulate the rules. In the next step, probabilities were assigned to these rules to get a Probabilistic Context Free Grammar. For Urdu probabilistic parser, the idea of shift-reduce multi-path strategy is used. The developed software performs the syntactic analysis of a sentence, using a given set of probabilistic phrase structure rules. The parse with the highest probability is selected, as the most suitable one from a set of possible parses produced by this parser. The structure of each sentence is represented in the form of successive rules. This parser parses sentences with 74% accuracy.
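The paper's pipeline (tagged sentences, induced rules with probabilities, most-probable parse) can be illustrated with NLTK's PCFG support and Viterbi parser. The toy English grammar below is a stand-in for the induced Urdu rules.

    from nltk import PCFG, ViterbiParser

    grammar = PCFG.fromstring("""
        S   -> NP VP   [1.0]
        NP  -> Det N   [0.6] | N [0.4]
        VP  -> V NP    [0.7] | V [0.3]
        Det -> 'the'   [1.0]
        N   -> 'boy'   [0.5] | 'book' [0.5]
        V   -> 'reads' [1.0]
    """)

    parser = ViterbiParser(grammar)  # selects the highest-probability parse
    for tree in parser.parse("the boy reads the book".split()):
        print(tree)         # (S (NP (Det the) (N boy)) (VP (V reads) (NP ...)))
        print(tree.prob())  # probability of the selected parse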
BOOTSTRAPPING METHOD FOR DEVELOPING PART-OF-SPEECH TAGGED CORPUS IN LOW RESOU... (ijnlc)
Most languages, especially in Africa, have few or no established part-of-speech (POS) tagged corpora. However, a POS tagged corpus is essential for natural language processing (NLP) to support advanced research such as machine translation and speech recognition. Even where no POS tagged corpus exists, parallel texts are available online for some languages. The task of POS tagging a new language corpus with a new tagset usually faces a bootstrapping problem at the initial stages of the annotation process: the unavailability of automatic taggers to help the human annotator makes it infeasible to quickly produce adequate amounts of POS tagged corpus for advanced NLP research and for training taggers. In this paper, we demonstrate the efficacy of a POS annotation method that employs two automatic approaches to assist POS tagged corpus creation for a language new to NLP: cross-lingual and monolingual POS tag projection. We use the cross-lingual approach to automatically create an initial 'errorful' tagged corpus for a target language via word alignment, with the resources for creating it derived from a source language rich in NLP resources. A monolingual method is then applied to clean the noise induced by the alignment process and to transform the source language tags into target language tags. We use English and Igbo as our case study; this is possible because parallel English-Igbo texts exist and English has abundant NLP resources. The results of the experiment show a steady improvement in accuracy and in the rate of tag transformation, with scores ranging from 6.13% to 83.79% and from 8.67% to 98.37% respectively; the rate of tag transformation measures the rate at which source language tags are translated to target language tags.
An Approach To Automatic Text Summarization Using Simplified Lesk Algorithm A... (ijctcm)
This document summarizes an approach to automatic text summarization using the Simplified Lesk algorithm and WordNet. It analyzes sentences to determine relevance based on semantic information rather than surface features like position or format. Sentences are assigned weights based on the number of overlaps between words' dictionary definitions and the full text. Higher weighted sentences are selected for the summary based on a percentage of the original text length. The approach achieves 80% accuracy on 50% summarizations of diverse texts compared to human summaries.
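Sentence weighting in the spirit of the Simplified Lesk approach described above: a sentence's weight is the number of overlaps between its words' WordNet definitions and the words of the full text. This is a sketch of the idea, not the paper's exact scoring; it assumes the NLTK wordnet and punkt resources are available.

    from nltk.corpus import wordnet
    from nltk.tokenize import sent_tokenize, word_tokenize

    def sentence_weight(sentence, full_text_words):
        weight = 0
        for word in word_tokenize(sentence):
            for synset in wordnet.synsets(word):
                gloss_words = set(word_tokenize(synset.definition()))
                weight += len(gloss_words & full_text_words)  # gloss/text overlaps
        return weight

    def lesk_summary(text, ratio=0.5):
        sentences = sent_tokenize(text)
        full = set(w.lower() for w in word_tokenize(text))
        ranked = sorted(sentences, key=lambda s: sentence_weight(s, full), reverse=True)
        keep = set(ranked[: max(1, int(len(sentences) * ratio))])
        return " ".join(s for s in sentences if s in keep)  # original order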
Automatic text summarization is the process of reducing the text content while retaining the important points of the document. Generally, there are two approaches: extractive and abstractive. Extractive text summarization can be divided into two phases: pre-processing and processing. In this paper, we discuss some of the extractive summarization approaches used by researchers, provide the features for the extractive summarization process, and present the available linguistic preprocessing tools, with their features, that are used for automatic text summarization. The tools and parameters useful for evaluating the generated summary are also discussed. Moreover, we explain our proposed lexical chain analysis approach, with sample generated lexical chains, for extractive automatic text summarization, and provide the evaluation results of our system-generated summary. The proposed lexical chain analysis approach can be used to solve different text mining problems like topic classification, sentiment analysis, and summarization.
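A naive lexical-chain builder of the kind the proposed approach relies on: group nouns whose WordNet synsets are related (shared synset or shared immediate hypernym). This is a simplified sketch of the general technique, not the authors' system; real systems use richer relations and chain-strength scoring.

    from nltk.corpus import wordnet  # needs nltk.download('wordnet')

    def related(word_a, word_b):
        # True if two nouns share a synset or an immediate hypernym.
        syns_a = set(wordnet.synsets(word_a, pos=wordnet.NOUN))
        syns_b = set(wordnet.synsets(word_b, pos=wordnet.NOUN))
        if syns_a & syns_b:
            return True
        hyper_a = {h for s in syns_a for h in s.hypernyms()}
        hyper_b = {h for s in syns_b for h in s.hypernyms()}
        return bool(syns_a & hyper_b or syns_b & hyper_a or hyper_a & hyper_b)

    def build_chains(nouns):
        chains = []
        for noun in nouns:
            for chain in chains:
                if any(related(noun, member) for member in chain):
                    chain.append(noun)
                    break
            else:
                chains.append([noun])  # start a new chain
        return chains

    print(build_chains(["car", "automobile", "truck", "banana"]))
    # expected: [['car', 'automobile', 'truck'], ['banana']]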
Class Diagram Extraction from Textual Requirements Using NLP Techniques (iosrjce)
KANNADA NAMED ENTITY RECOGNITION AND CLASSIFICATION (ijnlc)
Named Entity Recognition and Classification (NERC) is the process of identifying proper nouns in text and classifying them into predefined categories such as person name, location, organization, date, and time. NERC in Kannada is an essential and challenging task. The aim of this work is to develop a novel model for NERC based on a Multinomial Naïve Bayes (MNB) classifier. The methodology adopted in this paper is based on feature extraction from the training corpus using term frequency and inverse document frequency, fitted with a tf-idf vectorizer. The paper discusses the various issues in developing the proposed model, along with the details of implementation and performance evaluation. The experiments are conducted on a training corpus of 95,170 tokens and a test corpus of 5,000 tokens. The model achieves Precision, Recall, and F1-measure of 83%, 79%, and 81% respectively.
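The MNB-over-tf-idf setup described above, sketched with scikit-learn on toy English tokens standing in for the Kannada corpus: each token is a training example and the label is its entity category.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    tokens = ["ramesh", "suresh", "bangalore", "mysore", "infosys", "wipro"]
    labels = ["person", "person", "location", "location", "organization", "organization"]

    # Character n-grams let the model generalize to unseen tokens.
    model = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
        MultinomialNB(),
    )
    model.fit(tokens, labels)
    print(model.predict(["rakesh"]))  # likely ['person'] on this toy data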
G2 pil: a grapheme-to-phoneme conversion tool for the Italian language (ijnlc)
This paper presents a knowledge-based approach to grapheme-to-phoneme conversion (G2P) for isolated words of the Italian language. With more than 7,000 languages in the world, the biggest challenge today is to rapidly port speech processing systems to new languages with low human effort and at reasonable cost. This includes the creation of qualified pronunciation dictionaries, which provide the mapping from the orthographic form of a word to its pronunciation and are useful in both speech synthesis and automatic speech recognition (ASR) systems. For training the acoustic models, we need an automatic routine that maps the spelling of the training set to a string of phonetic symbols representing the pronunciation.
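A tiny knowledge-based G2P converter in the spirit of the paper: scan the word left to right, preferring the longest matching orthographic rule. The Italian rules below are a small illustrative subset, not the tool's actual rule set.

    RULES = {
        "gn": "ɲ",   # as in "gnocchi"
        "ch": "k",   # as in "che"
        "gh": "g",   # as in "ghiro"
        "sc": "ʃ",   # before e/i, as in "scena" (context ignored in this sketch)
    }

    def g2p(word):
        phones, i = [], 0
        while i < len(word):
            digraph = word[i:i + 2]
            if digraph in RULES:           # prefer two-letter rules
                phones.append(RULES[digraph])
                i += 2
            else:                          # fall back to the letter itself
                phones.append(word[i])
                i += 1
        return "/" + "".join(phones) + "/"

    print(g2p("gnocchi"))  # /ɲocki/ - a rough approximation of /ˈɲɔkki/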
A HYBRID APPROACH TO WORD SENSE DISAMBIGUATION WITH AND WITH... (ijnlc)
Word Sense Disambiguation is the classification of the meaning of a word in a precise context, a tricky task to perform in Natural Language Processing, used in applications like machine translation, information extraction and retrieval, and automatic or closed-domain question answering, because of its semantic perceptiveness. Researchers have tried unsupervised and knowledge-based learning approaches, but such approaches have not proved very helpful. Various supervised learning algorithms have been developed, but in vain, as creating the training corpus, a tagged sense-marked corpus, is tricky. This paper presents a hybrid approach for resolving ambiguity in a sentence which is based on integrating lexical knowledge and world knowledge. The English WordNet developed at Princeton University, the SemCor corpus, and the JAWS library (Java API for WordNet Searching) have been used for this purpose.
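The paper accesses WordNet through the Java JAWS library; as a Python stand-in, NLTK ships a classic Lesk implementation that disambiguates a word against the glosses of its candidate senses.

    from nltk.wsd import lesk  # needs nltk.download('wordnet') and punkt

    sentence = "I went to the bank to deposit money".split()
    sense = lesk(sentence, "bank")
    print(sense, "-", sense.definition())
    # e.g. Synset('savings_bank.n.02') - a container (usually with a slot ...)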
International Journal on Natural Language Computing (IJNLC) Vol. 4, No. 2, Apri... (ijnlc)
Building dialogue systems for interaction has recently gained considerable attention, but most of the resources and systems built so far are tailored to English and other Indo-European languages. The need for designing systems for other languages, such as Arabic, is increasing. For this reason, there is growing interest in the Arabic dialogue act classification task, because it is a key player in Arabic language understanding for building these systems. This paper surveys different techniques for dialogue act classification for Arabic. We describe the main existing techniques for utterance segmentation and classification, annotation schemas, and test corpora for Arabic dialogue understanding that have been introduced in the literature.
Contextual Analysis for Middle Eastern Languages with Hidden Markov Models (ijnlc)
Displaying a document in Middle Eastern languages requires contextual analysis due to the different presentational forms of each character of the alphabet. The words of the document are formed by joining the correct positional glyphs representing the corresponding presentational forms of the characters. A set of rules defines the joining of the glyphs; as usual, these rules vary from language to language and are subject to interpretation by software developers.
Taxonomy extraction from automotive natural language requirements using unsup... (ijnlc)
In this paper we present a novel approach to semi-automatically learn concept hierarchies from natural language requirements in the automotive industry. The approach is based on the distributional hypothesis and the special characteristics of domain-specific German compounds. We extract taxonomies by using clustering techniques in combination with general thesauri. Such a taxonomy can be used to support requirements engineering in early stages by providing a common system understanding and an agreed-upon terminology. This work is part of an ontology-driven requirements engineering process, which builds on top of the taxonomy. Evaluation shows that this taxonomy extraction approach outperforms common hierarchical clustering techniques.
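Taxonomy extraction via unsupervised clustering, sketched with scikit-learn (the metric keyword assumes scikit-learn 1.2 or later): embed domain terms as character-n-gram vectors and cluster them agglomeratively. A real system would use distributional vectors from the requirements corpus; the German compound terms below are illustrative.

    from sklearn.cluster import AgglomerativeClustering
    from sklearn.feature_extraction.text import TfidfVectorizer

    terms = ["Bremse", "Bremssystem", "Bremspedal",
             "Motor", "Motorsteuerung", "Lenkrad"]

    vectors = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)).fit_transform(terms)
    clustering = AgglomerativeClustering(n_clusters=3, metric="cosine", linkage="average")
    labels = clustering.fit_predict(vectors.toarray())

    for label in sorted(set(labels)):
        print(label, [t for t, l in zip(terms, labels) if l == label])
    # The Bremse/Bremssystem/Bremspedal terms share a cluster via their common
    # stem, mirroring how German compounds signal taxonomic relations.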
Machine translation evaluation is a very important activity in machine translation development. The automatic evaluation metrics proposed in the literature are inadequate, as they require one or more human reference translations to compare against the output produced by machine translation; this does not always give accurate results, since a text can have several different translations. Human evaluation metrics, on the other hand, lack inter-annotator agreement and repeatability. In this paper we propose a new human evaluation metric which addresses these issues. Moreover, this metric also provides solid grounds for making sound assumptions about the quality of the text produced by a machine translation.
This document summarizes a research paper on understanding and estimating emotional expression using acoustic analysis of natural speech. The paper explores identifying seven emotional states (anger, surprise, sadness, happiness, fear, disgust, and neutral) using fifteen acoustic features extracted from the SAVEE speech database. Three models using different combinations of features were evaluated using various machine learning algorithms. The results showed that Model 2, using energy intensity, pitch, standard deviation, jitter, and shimmer, achieved the highest classification accuracy. Estimation of emotions using confidence intervals showed that most emotions could be accurately estimated using energy intensity and pitch. The paper concludes that expanding the study to include more features and databases could improve emotional state recognition.
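The model-comparison setup described above, sketched with scikit-learn: a classifier over a handful of acoustic features. The random matrix stands in for the energy intensity, pitch, standard deviation, jitter, and shimmer features of the paper's Model 2; actual feature extraction (e.g. with a tool like librosa) is out of scope here, so the printed accuracy will be near chance.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(420, 5))        # 420 utterances x 5 acoustic features (placeholder)
    y = rng.integers(0, 7, size=420)     # 7 emotion classes (placeholder labels)

    scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=5)
    print(f"mean accuracy: {scores.mean():.2f}")  # ~chance on random data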
SEMANTIC PROCESSING MECHANISM FOR LISTENING AND COMPREHENSION IN VNSCALENDAR ...ijnlc
The document describes the VNSCalendar system, which allows users to manage their personal calendar using Vietnamese speech commands. It has three key capabilities:
1. It can recognize Vietnamese speech commands and questions, convert them to text, analyze the syntax and semantics, generate database queries, and return results to the user.
2. The system's semantic processing mechanism analyzes syntax and semantics of Vietnamese commands and questions using Definite Clause Grammar and formal semantics techniques.
3. An evaluation of the system found it could accurately recognize over 91% of Vietnamese speech commands after being built and tested in a PC environment.
UNL-ization of Numbers and Ordinals in Punjabi with IANijnlc
In the field of Natural Language Processing, Universal Networking Language (UNL) has been an area of
immense interest among researchers during last couple of years. Universal Networking Language (UNL) is
an artificial Language used for representing information in a natural-language-independent format. This
paper presents UNL-ization of Punjabi sentences with the help of different examples, containing numbers
and ordinals written in words, using IAN (Interactive Analyzer) tool. In UNL approach, UNL-ization is a
process of converting natural language resource to UNL and NL-ization, is a process of generating a
natural language resource out of a UNL graph. IAN processes input sentences with the help of TRules and
Dictionary entries. The proposed system performs the UNL-ization of up to fourteen digit number and
ordinals, written in words in Punjabi language, with the help of 104 dictionary entries and 67 TRules. The
system is tested on a sample of 150 random Punjabi Numbers and Ordinals, written in words, and its FMeasure
comes out to be 1.000 (on a scale of 0 to 1).
Event detection and summarization based on social networks and semantic query...ijnlc
Events can be characterized by a set of descriptive, collocated keywords extracted documents. Intuitively,
documents describing the same event will contain similar sets of keywords, and the graph for a document collection will contain clusters individual events. Helping users to understand the event is an acute problem nowadays as the users are struggling to keep up with tremendous amount of information published every day in the Internet. The challenging task is to detect the events from online web resources, it is getting more attentions. The important data source for event detection is a Web search log because the information it contains reflects users’ activities and interestingness to various real world events. There are three major issues playing role for event detection from web search logs: effectiveness, efficiency of
detected events. We focus on modeling the content of events by their semantic relations with other events
and generating structured summarization. Event mining is a useful way to understand computer system behaviors. The focus of recent works on event mining has been shifted to event summarization from discovering frequent patterns. Event summarization provides a comprehensible explanation of the event sequence based on certain aspects.
Processing vietnamese news titles to answer relative questions in vnewsqa ict...ijnlc
This paper introduces two important elements of our VNewsQA/ICT system: its semantic models
of simple Vietnamese sentences and its semantic processing mechanism. The VNewsQA/ICT is a
Vietnamese based Question Answering system which has the ability to gather information from
some Vietnamese news title forms on the ICTnews websites (http://www.ictnews.vn), instead of
using a database or a knowledge base, to answer the related Vietnamese questions in the domain
of information and communications technology.
Word sense disambiguation using wsd specific wordnet of polysemy wordsijnlc
This paper presents a new model of WordNet that is used to disambiguate the correct sense of polysemy
word based on the clue words. The related words for each sense of a polysemy word as well as single sense
word are referred to as the clue words. The conventional WordNet organizes nouns, verbs, adjectives and
adverbs together into sets of synonyms called synsets each expressing a different concept. In contrast to the
structure of WordNet, we developed a new model of WordNet that organizes the different senses of
polysemy words as well as the single sense words based on the clue words. These clue words for each sense
of a polysemy word as well as for single sense word are used to disambiguate the correct meaning of the
polysemy word in the given context using knowledge based Word Sense Disambiguation (WSD) algorithms.
The clue word can be a noun, verb, adjective or adverb.
IMPLEMENTING A SUBCATEGORIZED PROBABILISTIC DEFINITE CLAUSE GRAMMAR FOR VIETN...ijnlc
This document describes a probabilistic definite clause grammar (PDCG) parser for Vietnamese sentence parsing that incorporates Chomsky's theory of subcategorization. The researchers built a treebank of 1000 Vietnamese training sentences that were syntactically analyzed and tagged by hand. They defined subcategorization tags for nouns, verbs, adjectives, adverbs, prepositions, and conjunctions in Vietnamese. They also defined phrasal tags and developed syntactic rules for phrases based on the subcategorization tags. Experimental results showed that the precisions, recalls and F-measures of the subcategorized PDCG parser were over 98%, demonstrating the effectiveness of the approach.
SEMANTIC PARSING OF SIMPLE SENTENCES IN UNIFICATION-BASED VIETNAMESE GRAMMARijnlc
The document discusses building a semantic parsing model for simple Vietnamese sentences using a unification-based Vietnamese grammar. It defines a taxonomy of Vietnamese nouns and feature structures for nouns and verbs. Unification rules for phrases, clauses, and sentences are presented based on the defined grammar. The semantic parser achieved 84% precision and recall on a test corpus of 500 Vietnamese news titles.
Developing links of compound sentences for parsing through marathi link gramm...ijnlc
Marathi is a verb-final language with a relatively free word order. Complex Sentences is one of the major types of sentences which are used commonly in any language. This paper explores the study of complex sentence structure of Marathi language. The paper proposes various links of complex sentence clauses and modelling of the complex sentences using proposed links in the Link Grammar Framework for parsing purpose.
Kridantas play a vital role in understanding Sanskrit language. Kridantas includes nouns, adjectives and
indeclinable words called avyayas. Kridantas are formed with root and certain suffixes called Krits. Some
times Kridantas may occur with certain prefixes. Many morphological analyzers are lacking the complete
analysis of Kridantas. This paper describes a novel approach to deal completely with Kridantas.
This document summarizes a survey of opinion mining and sentiment analysis techniques. It discusses how opinion mining uses natural language processing and machine learning to analyze sentiment in text sources like blogs, reviews and social media. It outlines several key tasks in opinion mining including sentiment classification at the document, sentence and feature levels. Supervised, unsupervised and semi-supervised machine learning algorithms are commonly used for sentiment classification tasks. Naive Bayes classification and text classification algorithms are also discussed.
Evaluation of subjective answers using glsa enhanced with contextual synonymyijnlc
Evaluation of subjective answers submitted in an exam is an essential but one of the most resource consuming educational activity. This paper details experiments conducted under our project to build a software that evaluates the subjective answers of informative nature in a given knowledge domain. The paper first summarizes the techniques such as Generalized Latent Semantic Analysis (GLSA) and Cosine Similarity that provide basis for the proposed model. The further sections point out the areas of improvement in the previous work and describe our approach towards the solutions of the same. We then discuss the implementation details of the project followed by the findings that show the improvements achieved. Our approach focuses on comprehending the various forms of expressing same entity and thereby capturing the subjectivity of text into objective parameters. The model is tested by evaluating answers submitted by 61 students of Third Year B. Tech. CSE class of Walchand College of Engineering Sangli in a test on Database Engineering.
An exhaustive font and size invariant classification scheme for ocr of devana...ijnlc
The document presents a classification scheme for recognizing Devanagari characters that is invariant to font and size. It identifies the basic symbols that commonly appear in the middle zone of Devanagari text across different fonts and sizes. Through an analysis of over 465,000 words from various sources, it finds that 345 symbols account for 99.97% of text and aims to classify these into groups based on structural properties like the presence or absence of vertical bars. The proposed classification scheme is validated on 25 fonts and 3 sizes to demonstrate its font and size invariance.
Summarization of software artifacts is an ongoing field of research among the software engineering community due to the benefits that summarization provides like saving of time and efforts in various software engineering tasks like code search, duplicate bug reports detection, traceability link recovery, etc. Summarization is to produce short and concise summaries. The paper presents the review of the state of the art of summarization techniques in software engineering context. The paper gives a brief overview to the software artifacts which are mostly used for summarization or have benefits from summarization. The paper briefly describes the general process of summarization. The paper reviews the papers published from 2010 to June 2017 and classifies the works into extractive and abstractive summarization. The paper also reviews the evaluation techniques used for summarizing software artifacts. The paper discusses the open problems and challenges in this field of research. The paper also discusses the future scopes in this area for new researchers.
A Novel Approach for Rule Based Translation of English to Marathiaciijournal
This paper presents a design for rule-based machine translation system for English to Marathi language pair. The machine translation system will take input script as English sentence and parse with the help of Stanford parser. The Stanford parser will be used for main purposes on the source side processing, in the machine translation system. English to Marathi Bilingual dictionary is going to be created. The system will take the parsed output and separate the source text word by word and searches for their corresponding target words in the bilingual dictionary. The hand coded rules are written for Marathi inflections and also reordering rules are there. After applying the reordering rules, English sentence will be syntactically reordered to suit Marathi language
A Novel Approach for Rule Based Translation of English to Marathiaciijournal
This paper presents a design for rule-based machine translation system for English to Marathi language
pair. The machine translation system will take input script as English sentence and parse with the help of
Stanford parser. The Stanford parser will be used for main purposes on the source side processing, in the
machine translation system. English to Marathi Bilingual dictionary is going to be created. The system will
take the parsed output and separate the source text word by word and searches for their corresponding
target words in the bilingual dictionary. The hand coded rules are written for Marathi inflections and also
reordering rules are there. After applying the reordering rules, English sentence will be syntactically
reordered to suit Marathi language.
This paper presents a vision on how the software development process could be a fully unified mechanized
process by getting benefits from the advances of Natural Language Processing and Program Synthesis
fields. The process begins from requirements written in natural language that is translated to sentences in
logical form. A program synthesizer gets those sentences in logical form (the translator's outcome) and
generates a source code. Finally, a compiler produces ready-to-run software. To find out how the building
blocks of our proposed approach works, we conducted an exploratory research on the literature in the
fields of Requirements Engineering, Natural Language Processing, and Program Synthesis. Currently, this
approach is difficult to accomplish in a fully automatic way due to the ambiguities inherent in natural
language, the reasoning context of the software, and the program synthesizer limitations in generating a
source code from logic.
Possibility of interdisciplinary research software engineering andnatural lan...Nakul Sharma
This document discusses the possibility of interdisciplinary research between software engineering and natural language processing. It provides a literature review of research papers from 2003 to 2014 related to applying tools and techniques from one field to the other. Some key areas discussed include generating UML diagrams from natural language text, developing ontologies to clarify meanings, and potential issues with joint research like determining complexity of sentences. The document proposes a flowchart for how artifacts could be analyzed using tasks from either field to enable interdisciplinary research.
Real time text stream processing - a dynamic and distributed nlp pipelineConference Papers
The document proposes a real-time architecture using Apache Storm and Apache Kafka to apply natural language processing (NLP) tasks to streams of text data. It allows developers to inject NLP modules from different programming languages in a distributed, scalable, and low-latency manner. An experiment was conducted using OpenNLP, Fasttext and SpaCy modules on Bahasa Malaysia and English text, and Apache Storm achieved the lowest latency compared to other frameworks.
A template based algorithm for automatic summarization and dialogue managemen...eSAT Journals
Abstract This paper describes an automated approach for extracting significant and useful events from unstructured text. The goal of research is to come out with a methodology which helps in extracting important events such as dates, places, and subjects of interest. It would be also convenient if the methodology helps in presenting the users with a shorter version of the text which contain all non-trivial information. We also discuss implementation of algorithms which exactly does this task, developed by us. Key Words: Cosine Similarity, Information, Natural Language, Summarization, Text Mining
This document discusses a proposed speech-to-speech translation system that would allow translation between English and Hindi. It outlines the objectives of integrating speech recognition, text translation, text-to-speech synthesis, and text extraction from images into a single application. The proposed system would use neural networks like RNNs and LSTMs to perform these functions. It describes the overall architecture and flow of information between the various modules, including preprocessing text, translating with rules and word embeddings, and generating speech output. The goal is to develop a user-friendly system to help overcome language barriers.
Text Mining: open Source Tokenization Tools � An Analysisaciijournal
Text mining is the process of extracting interesting and non-trivial knowledge or information from
unstructured text data. Text mining is the multidisciplinary field which draws on data mining, machine
learning, information retrieval, computational linguistics and statistics. Important text mining processes
are information extraction, information retrieval, natural language processing, text classification, content
analysis and text clustering. All these processes are required to complete the preprocessing step before
doing their intended task. Pre-processing significantly reduces the size of the input text documents and
the actions involved in this step are sentence boundary determination, natural language specific stop-word
elimination, tokenization and stemming. Among this, the most essential and important action is the
tokenization. Tokenization helps to divide the textual information into individual words. For performing
tokenization process, there are many open source tools are available. The main objective of this work is to
analyze the performance of the seven open source tokenization tools. For this comparative analysis, we
have taken Nlpdotnet Tokenizer, Mila Tokenizer, NLTK Word Tokenize, TextBlob Word Tokenize, MBSP
Word Tokenize, Pattern Word Tokenize and Word Tokenization with Python NLTK. Based on the results,
we observed that the Nlpdotnet Tokenizer tool performance is better than other tools.
IRJET- Automatic Recapitulation of Text DocumentIRJET Journal
The document describes an approach for automatically generating summaries of text documents. It involves preprocessing the input text by tokenizing, removing stop words, and stemming words. It then extracts features from the preprocessed text, such as term frequency-inverse document frequency (tf-idf) values. Sentences are scored based on these features and sentences with higher scores are selected to form the summary. Keywords from the text are also identified using WordNet to help select the most relevant sentences for the summary. The proposed approach aims to generate concise yet meaningful summaries using natural language processing techniques.
TEXT MINING: OPEN SOURCE TOKENIZATION TOOLS – AN ANALYSISaciijournal
Text mining is the process of extracting interesting and non-trivial knowledge or information from unstructured text data. Text mining is the multidisciplinary field which draws on data mining, machine
learning, information retrieval, computational linguistics and statistics. Important text mining processes are information extraction, information retrieval, natural language processing, text classification, content analysis and text clustering. All these processes are required to complete the preprocessing step before
doing their intended task.
The document discusses text summarization and describes the problem it addresses of producing concise summaries of lengthy documents. It outlines two main techniques for text summarization - extractive summarization which extracts key phrases and sentences, and abstractive summarization which generates a new summary using NLP techniques. The model used is natural language processing with Python's NLTK package. Tools used include Spyder/PyCharm and the technologies are NLP machine learning with the Python programming language. The overall goal is to create an efficient and accurate text summarizer.
The document discusses Python and the Natural Language Toolkit (NLTK). It explains that Python was chosen as the implementation language for NLTK due to its shallow learning curve, transparent syntax and semantics, and good string handling functionality. NLTK provides basic classes for natural language processing tasks, standard interfaces and implementations for tasks like tokenization and tagging, and extensive documentation. NLTK is organized into packages that encapsulate data structures and algorithms for specific NLP tasks.
A Survey of Various Methods for Text SummarizationIJERD Editor
Document summarization means retrieved short and important text from the source document. In this paper, we studied various techniques. Plenty of techniques have been developed on English summarization and other Indian languages but very less efforts have been taken for Hindi language. Here, we discusses various techniques in which so many features are included such as time and memory consumption, efficiency, accuracy, ambiguity, redundancy.
Conceptual framework for abstractive text summarizationijnlc
As the volume of information available on the Internet increases, there is a growing need for tools helping users to find, filter and manage these resources. While more and more textual information is available on-line, effective retrieval is difficult without proper indexing and summarization of the content. One of the possible solutions to this problem is abstractive text summarization. The idea is to propose a system that will accept single document as input in English and processes the input by building a rich semantic graph and then reducing this graph for generating the final summary.
Automatic Text Summarization: A Critical ReviewIRJET Journal
This document provides a literature review and critical analysis of automatic text summarization techniques. It discusses extractive and abstractive summarization approaches and reviews 10 papers published between 2009-2021 on topics like graph-based, keyword-based, and feature-based summarization methods. The document aims to identify strengths and limitations of the approaches discussed and opportunities for future work in automatic text summarization.
Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) and computational linguistics that focuses on enabling computers to understand and interact with human language. It combines techniques from computer science, linguistics, and statistics to bridge the gap between human language and machine understanding. NLP has gained significant attention in recent years due to advancements in AI and the increasing need for machines to process and interpret vast amounts of textual data.
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUEJournal For Research
Natural Language Processing (NLP) techniques are one of the most used techniques in the field of computer applications. It has become one of the vast and advanced techniques. Language is the means of communication or interaction among humans and in present scenario when everything is dependent on machine or everything is computerized, communication between computer and human has become a necessity. To fulfill this necessity NLP has been emerged as the means of interaction which narrows the gap between machines (computers) and humans. It was evolved from the study of linguistics which was passed through the Turing test to check the similarity between data but it was limited to small set of data. Later on various algorithms were developed along with the concept of AI (Artificial Intelligence) for the successful execution of NLP. In this paper, the main emphasis is on the different techniques of NLP which have been developed till now, their applications and the comparison of all those techniques on different parameters.
This document presents a new method for extracting class diagrams from textual requirements using natural language processing (NLP) techniques. It proposes the Requirements Analysis and Class diagram Extraction (RACE) system, which uses tools like the OpenNLP parser, a stemming algorithm, and WordNet to extract concepts and identify classes, attributes and relationships. The RACE system applies heuristic rules and a domain ontology to the output of the NLP tools to refine and finalize the extracted class diagram. The paper concludes that the RACE system demonstrates the effective use of NLP techniques to automate the extraction of class diagrams from informal natural language requirements specifications.
IRJET- Semantic based Automatic Text Summarization based on Soft ComputingIRJET Journal
This document discusses semantic-based automatic text summarization using soft computing techniques. It begins with an introduction describing how large amounts of data are generated daily and the need for automated summarization. The next sections cover related work on text summarization methods including syntactic parsing, extractive techniques using n-gram language models and A* search, and mathematical reduction techniques like singular value decomposition and non-negative matrix factorization. The document also discusses using part-of-speech tagging, hidden Markov models, and named entity recognition for extractive summarization in Indian languages.
Similar to ALGORITHM FOR TEXT TO GRAPH CONVERSION (20)
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfMalak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing discussed on the talk. ICT and testing must carry their part of global responsibility to help with the climat warming. We can minimize the carbon footprint but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be added with sustainability, and then measured continuously. Test environments can be used less, and in smaller scale and on demand. Test techniques can be used in optimizing or minimizing number of tests. Test automation can be used to speed up testing.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Driving Business Innovation: Latest Generative AI Advancements & Success StorySafe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer’s life and facilitate a rapid transition from concept to production-ready applications.He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Speck&Tech
ABSTRACT: A prima vista, un mattoncino Lego e la backdoor XZ potrebbero avere in comune il fatto di essere entrambi blocchi di costruzione, o dipendenze di progetti creativi e software. La realtà è che un mattoncino Lego e il caso della backdoor XZ hanno molto di più di tutto ciò in comune.
Partecipate alla presentazione per immergervi in una storia di interoperabilità, standard e formati aperti, per poi discutere del ruolo importante che i contributori hanno in una comunità open source sostenibile.
BIO: Sostenitrice del software libero e dei formati standard e aperti. È stata un membro attivo dei progetti Fedora e openSUSE e ha co-fondato l'Associazione LibreItalia dove è stata coinvolta in diversi eventi, migrazioni e formazione relativi a LibreOffice. In precedenza ha lavorato a migrazioni e corsi di formazione su LibreOffice per diverse amministrazioni pubbliche e privati. Da gennaio 2020 lavora in SUSE come Software Release Engineer per Uyuni e SUSE Manager e quando non segue la sua passione per i computer e per Geeko coltiva la sua curiosità per l'astronomia (da cui deriva il suo nickname deneb_alpha).
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Infrastructure Challenges in Scaling RAG with Custom AI modelsZilliz
Building Retrieval-Augmented Generation (RAG) systems with open-source and custom AI models is a complex task. This talk explores the challenges in productionizing RAG systems, including retrieval performance, response synthesis, and evaluation. We’ll discuss how to leverage open-source models like text embeddings, language models, and custom fine-tuned models to enhance RAG performance. Additionally, we’ll cover how BentoML can help orchestrate and scale these AI components efficiently, ensuring seamless deployment and management of RAG systems in the cloud.
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
International Journal on Natural Language Computing (IJNLC) Vol. 4, No. 4, August 2015
DOI: 10.5121/ijnlc.2015.4403
ALGORITHM FOR TEXT TO GRAPH
CONVERSION AND SUMMARIZING USING
NLP: A NEW APPROACH FOR BUSINESS
SOLUTIONS
Prajakta Yerpude, Rashmi Jakhotiya and Manoj Chandak
Department of Computer Science and Engineering, RCOEM, Nagpur
Abstract
Text can be analysed by splitting it and extracting the keywords. These may be represented as summaries, tabular representations, graphical forms, and images. The large amount of information present in textual format has led to research on extracting text and transforming its unstructured form into a structured format. The paper presents the importance of Natural Language Processing (NLP) and two of its interesting applications in the Python language: 1. Automatic text summarization [Domain: Newspaper articles]; 2. Text to graph conversion [Domain: Stock news]. The main challenge in NLP is natural language understanding, i.e. deriving meaning from human or natural language input, which is done using regular expressions, artificial intelligence and database concepts. The Automatic Summarization tool converts newspaper articles into a summary on the basis of the frequency of words in the text. The Text to Graph Converter takes a stock article as input, tokenizes it on various indices (points and percent) and time, and then maps the tokens to a graph. This paper proposes a business solution that helps users manage their time effectively.
Keywords
NLP, Automatic Summarizer, Text to Graph Converter, Data Visualization, Regular Expression,
Artificial Intelligence
1. Introduction
The paper deals with applications of natural language processing in various domains of textual analysis. Natural language processing (NLP) [1] is a bridge between human interpretation and computers. It makes use of artificial intelligence and various analysis techniques, giving about 90% accuracy on data. The term Natural Language Processing [4] covers a broad range of techniques for automatic generation, manipulation and analysis of natural or human languages. It includes various categories such as syntactic analysis [22], where sequences of words are converted into structures that show the relations between the words; semantic analysis [9], where meanings are assigned to groups of words; pragmatic analysis [24], where differences between the expected and actual interpretation are analysed; and morphological analysis [10], where punctuation marks are grouped and removed. The paper demonstrates two different applications that use NLP principles, as follows:
1. An automatic text summarizer (Domain: Newspaper articles)
2. Statistical unstructured text to graph conversion (Domain: Stock market articles)
The above applications deal with textual analysis and derive an optimum result to reduce a reader's time. It often becomes tedious for a reader to read and interpret a whole newspaper article, whatever domain it belongs to. Hence it becomes necessary to optimize this data by removing redundancies in an efficient way. Natural Language Processing provides various techniques for text processing and is available in technologies such as Python, Java, and Ruby. The technology used for these two applications is Python, which offers NLTK, the Natural Language Toolkit [4], with various libraries for textual analysis. Python also provides extensive support for the regular expressions and NLP facilities required for text processing.
Automatic summarization deals with removing redundancy from a text while maintaining its gist. Available techniques for textual analysis include text processing, text categorization [13], part-of-speech tagging [20], and regular expressions [8] to classify text and summarize it. Methods of summarization include extraction [20], where the main keywords and sentences are returned as the summary, and abstraction, which builds a new text based upon the content. The paper focuses on the extraction method, which provides insight into text analysis. Summarization APIs are available in Java, but they consume considerable memory and processing time. Python, being equipped with NLTK [15], provides an efficient way to implement NLP tasks, thereby reducing the user's time and space requirements. We have used Python to implement the summarizer.
Statistical data includes figures, comparisons of two different datasets, and numbers, all of which are most easily understood when explained with a visual aid. Graphs are an efficient visual aid for representing statistical data. Tools are available that convert structured data to graphs, such as Microsoft Excel, but the figures have to be entered manually, which becomes quite tedious. Python provides libraries for plotting graphs from lists of text tokens. Our focus is to convert unstructured data into a graphical format by extracting figures [4] and arranging them in the Python data structure named 'dict' [14].
The Software Development Lifecycle [18] gives a systematic approach to the development of any software. The modules were implemented in phases: planning, design, coding, testing and integration. Planning included requirements gathering, technological study, a survey of texts, and deciding upon the flow of work. Design and coding covered the stepwise implementation of the tool. Testing included the construction and execution of various use cases to determine the viability of the tool.
The remainder of the paper is organized as follows: Chapter 2 gives the background and related work done in the area of NLP and its applications using various technologies, and explains the advantages of the technology used in the project. Chapter 3 details the components of Python and NLTK that we adopted for implementation as part of our project on NLP. Chapter 4 shows the experimental details of both applications and the flow of working of the programs. Chapter 5 summarizes the whole paper with a conclusion and describes the future scope in the field of NLP.
2. Related Work:
NLP is an important area of research in many direct or indirect application problems, such as information extraction, machine translation, text correction, text identification, parsing, and sentiment analysis. Our work focuses mainly on information extraction, i.e. obtaining the important words and figures from the text.
The two projects, Automatic Text Summarizer and Text to Graph Conversion, both require extraction of text. In the former, the text is entered and tokens are extracted to calculate word frequencies, which are then aggregated to rank the sentences; the highest-ranked sentences form the summary. In the latter, tokens are again extracted, in the form of points, percentages, times, company names, etc., stored in the dictionary data structure, and mapped onto the graph.
The technology used is Python. Python offers a large number of libraries for simplified processing of textual data and is used to handle various NLP tasks, including part-of-speech tagging, classification, translation, and noun phrase extraction. NLP researchers and programmers have developed multiple ways of summarizing text, as well as various online tools using extractive techniques.
The earliest work on automatic text summarization focused on technical documents. The most cited paper on summarization is [11], describing research done at IBM in the 1950s. Related work [2], also done at IBM, provided early insight into a particular feature that assists in finding important parts of documents: the sentence position. Other research [7] describes a system that produces document extracts; its primary contribution was to develop a typical structure for an extractive summarization experiment.
Many tools are available in which the information has to be entered in a structured format before being mapped onto a graph. In most cases, a csv (comma separated values) file, an Excel file or some other structured data source must be attached to the tool in order to get a graphical representation of the information present in the document. Platforms for converting structured information to graphs include MicroStrategy, MS-Word, MS-Excel, and Tableau.
Our research focuses on extracting text from stock articles, which is in unstructured form, and then mapping it to a graph. It has the additional feature of extracting tokens from the unstructured document, which is based on text processing in NLP. Text classification [17] has been a subject of ongoing research aimed at an in-depth understanding of various languages and their meanings. In some languages, such as Chinese and Japanese, written words are not separated by whitespace, so the text has to undergo a word segmentation [5] process. This approach has been used to handle the whitespace between words in text.
Much research and many programs have been developed using Java as the technology. But for text processing, Python has a few advantages over Java. Python has various libraries for text processing, such as NLTK (Natural Language Toolkit) [15], TextBlob [24], and Pattern [16]. Python is less verbose than Java: a task that takes about 10 lines of code in Java often requires only 2 lines in Python. As Python is a dynamically-typed language, it is estimated that Python programmers can be 5-10 times more productive than programmers working in statically-typed Java. Input text can be taken from web pages using BeautifulSoup [3]. Python [23] has extensive standard libraries that support everything from string and regular expression processing to XML parsing and generation.
2.1 Text Segmentation:
Segmentation [5] involves splitting text into key phrases, words and tokens. Just as Google shows the most relevant results during a search, text segmentation obtains its results through Information Retrieval [25]. The process includes approaches such as stopword removal, suffix stripping and term weighing to find the most important keywords of the text. Stopwords are words that cause redundancy in the text; words like a, an, the, to, in, etc. are considered stopwords. The remaining terms are weighed according to their frequencies in the text. Certain algorithms, such as TextTiling [25], break the text up into multiple paragraphs (subparts) by semantic analysis. In this paper, text mapping is done using regular expressions to derive patterns, and information retrieval techniques such as stopword removal and term weighing are used.
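As a small illustration of these three steps, the fragment below removes stopwords, strips suffixes with NLTK's Porter stemmer, and weighs the surviving terms by frequency; the sample sentence is hypothetical, and the sketch assumes NLTK and its English stopword corpus are installed:

from collections import defaultdict
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer          # suffix stripping

text = "Stocks rose sharply as investors bought banking stocks."
stop = set(stopwords.words('english'))
stemmer = PorterStemmer()

weights = defaultdict(int)
for word in word_tokenize(text.lower()):
    if word.isalpha() and word not in stop:  # stopword and punctuation removal
        weights[stemmer.stem(word)] += 1     # term weighing on the stemmed form

print weights                                # e.g. 'stock' receives weight 2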
3. Operations Used For Text Segmentation:
3.1 Components for Text Analysis:

1. Collections: The collections module contains several container types, of which defaultdict(x) is used to declare a dictionary whose missing values default to data type 'x'. This data structure stores keys and their corresponding values as pairs. Associative arrays and hash tables are likewise realized by Python dictionaries, where keys are mapped to the locations of their values. The general syntax of a dictionary is given below:

dict = {p(key): x(value)}

Example: dict1 = {'9:00 am': '27,890.09', '4:00 pm': '26,990.01'}
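In particular, a defaultdict(int) initializes any missing key to 0, which is what makes the word-frequency counting later in the paper a one-line update (the tokens below are hypothetical):

>>> from collections import defaultdict
>>> freq = defaultdict(int)
>>> for word in ['stock', 'rose', 'stock']:
...     freq[word] += 1
...
>>> freq['stock']
2
>>> freq['bond']    # a missing key defaults to 0 instead of raising KeyError
0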
2. Heapq: This Python module gives a structured and systematic implementation of the heap queue algorithm. In heapq, a given list can be converted to a heap by means of the heapify() function. The method nlargest() was used to get the 'n' most important sentences. [This module is used in the text summarizer in order to fetch the 'n' sentences requested by the user for the summary.]
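A small example of nlargest(), with hypothetical sentence scores keyed by sentence index; the key argument ranks the indices by their scores:

>>> from heapq import nlargest
>>> ranking = {0: 3.2, 1: 7.5, 2: 5.1}
>>> nlargest(2, ranking, key=ranking.get)
[1, 2]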
3. Nltk.tokenize: Tokens are substrings of a whole text, and the tokenize methods are used to split a string into substrings according to the conditions provided. From this module the methods sent_tokenize [4] and word_tokenize were used. Sent_tokenize splits the input text (paragraph) into sentences, while word_tokenize divides these sentences into words.

If a sentence is 'History gives information about our ancestors.'

>>> word_sent = [word_tokenize(s.lower()) for s in sents]
>>> print word_sent
[['history', 'gives', 'information', 'about', 'our', 'ancestors', '.']]
4. Nltk.corpus: Corpora contain large sets of structured data. In Python, the nltk.corpus collection contains various classes which can be used to access these large data sets. Stopwords are the most common words, such as the, is, on, at, etc. The method call stopwords.words('english') was used to remove these unimportant words.
For example:
>>> from nltk.corpus import stopwords
>>> from string import punctuation
>>> a = set(stopwords.words('english') + list(punctuation))
>>> print a
set([u'all', u'just', u'being', '-', u'over', u'both', u'through', u'yourselves', u'its', u'before', '$',
u'herself', u'had', ',', u'should', u'to', u'only', u'under', u'ours', u'has', '<', u'do', u'them', u'his',
u'very', u'they', u'not', u'during', u'now', u'him', u'nor', '`', u'did', '^']) and so on.
3.2 Components for Pattern Matching:
Regular expressions (re): Regular expressions [14] cut down the time needed to process a whole text by providing simple, easy-to-use patterns for searching text [21], replacing matches and analysing them [8].
1. re.search(pattern, string)
This method scans the string for a location where the pattern matches and returns the corresponding match object instance. It returns nothing if the pattern is not matched anywhere in the string.
Example: >>> p = re.search('(?<=main)points', 'mainpoints')
>>> p.group(0)
'points'
2. re.match(pattern, string)
This method matches zero or more characters of the pattern at the beginning of a string and, if matched, returns the corresponding match object instance. Like the search method, it returns nothing if the pattern does not match at the start of the string.
Example: >>> u = re.match(r'(\w+) (\w+)', 'Lord Tyrell, King')
>>> u.group(0)
'Lord Tyrell'
3. re.findall(pattern, string)
This method returns all matches of the pattern as a list of strings, scanning the text sequentially and returning the matches in the order they are found. If the pattern contains a single group, the list contains the matched groups; if it contains several groups, a list of tuples is returned. If no match is found, an empty list is returned.
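For example, the figures (index points, percentages and times) can be pulled out of a hypothetical stock sentence in one call:

>>> import re
>>> text = 'Sensex rose 120.5 points or 0.45 percent by 10:30 am'
>>> re.findall(r'\d+(?:\.\d+)?', text)
['120.5', '0.45', '10', '30']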
4. re.compile(pattern)
This method compiles a regular expression pattern into a pattern object. The conditions specified in the expression are checked each time one of the object's methods (match(), search(), findall(), etc.) is applied, so a pattern used many times is executed efficiently.
5. strip()
Strictly speaking, strip() is a string method rather than a function of the re module: it removes unwanted leading and trailing characters (whitespace by default) from a string. Together with re.split(), it is used to bifurcate the text according to the conditions specified and to reduce the time needed for scanning.
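A brief illustration of both operations on hypothetical fragments:

>>> ' 27,890.09 '.strip()
'27,890.09'
>>> re.split(r'[,;]\s*', 'points, percent; time')
['points', 'percent', 'time']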
3.3 Components for Database Connectivity:
1. MySQLdb:
MySQLdb is a compatible interface to the MySQL database server that connects a Python program to a MySQL database. The next step to using MySQL in a Python script is to make a connection to the database [12]. All Python Database-API 2.0 modules provide a 'connect' function; this is the function that is used to connect to the database, in our case MySQL.

>>> db = MySQLdb.connect(host=HOST_NAME, user=USER_NAME, passwd=MY_PASSWORD, db=DB_NAME)
In order to put our new connection to good use we need to create a cursor object. The cursor object is used to work with the tables of the database, as specified in the Python Database-API 2.0. It gives us the ability to have multiple separate working environments over the same connection to the database. A cursor is created by calling the 'cursor' function of the database object.

>>> cur = db.cursor()

Executing queries is done using the execute() method.
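Putting the pieces together, a minimal round trip might look as follows; the table name 'quotes' and its columns are hypothetical stand-ins, not schema from the project:

>>> cur.execute("SELECT time_label, index_value FROM quotes")  # hypothetical table
>>> for row in cur.fetchall():
...     print row
>>> db.close()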
3.4 Components for Graphical User Interface:
Tkinter: Tkinter [14] is the Python module for GUI programming. It provides widgets such as buttons for navigation, message and dialogue boxes for entering text, scrollbars, text widgets and design templates for GUIs. A text widget is a text box in which multiple lines can be written, and Tkinter provides flexibility for working with widgets; they are also used for showing web links and images. Distributions of the Tk module are available for Unix as well as Windows.
Example: To create a new top-level window and start its event loop,
>>> import Tkinter
>>> new = Tkinter.Tk()
>>> new.mainloop()
The PIL module is used for inserting graphics such as images on the GUI. Images in formats like BMP, JPEG, CUR, DCX, EPS, FITS, FPX, GIF, etc. are supported by this library.
3.5 Components for Graph Plotting:
Matplotlib: Python provides a 2D plotting library for line graphs, bar graphs, pie charts, histograms, scatterplots, etc. Matplotlib can be used from the Python shell as well as from scripts, web servers and GUI toolkits [26]. Combined with IPython, simple plotting provides a MATLAB-like interface. The module also offers an object-oriented interface that gives the user full control over its working. The pyplot interface is imported from Matplotlib and provides a collection of plotting functions: constructing a figure, changing variables, plotting areas and lines within them, and adding labels can all be done easily through pyplot. A bar and line graph is used to present the statistical information extracted from stock articles in this project.

Example: To plot a line graph,

>>> import matplotlib.pyplot as plt
>>> plt.plot([1, 2, 3, 4], [1.5, 2.5, 3.5, 4.5])
>>> plt.xlabel('numbers')
>>> plt.ylabel('decimal numbers')
>>> plt.show()
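Since the Text to Graph Converter plots index values against times of day, a bar graph along the same lines may be a closer fit; the time labels and figures below are hypothetical:

>>> times = ['9:00 am', '12:00 pm', '4:00 pm']
>>> points = [27890.09, 27410.55, 26990.01]
>>> positions = range(len(times))
>>> plt.bar(positions, points)
>>> plt.xticks(positions, times)
>>> plt.xlabel('time')
>>> plt.ylabel('index (points)')
>>> plt.show()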
Functions of Modules used in the Project:

Table 1: Modules of Python

Purpose               | Module                          | Classes          | Functions
Text Analysis         | NLTK (Natural Language Toolkit) | Tokenize, Corpus | word_tokenize(), sent_tokenize(), stopwords()
Text Analysis         | Collections                     | dict             | defaultdict()
Text Analysis         | re (Regular Expressions)        | re               | replace(), split(), compile(), strip(), findall()
Text Analysis         | String                          | punctuation      | append()
Graph Plotting        | Matplotlib                      | pyplot           | figure(), plot(), bar(), line()
Database Connectivity | MySQLdb                         | mdb              | connect(), cursor(), fetchall()
4. Experimental Details:
4.1 AUTOMATIC TEXT SUMMARIZATION USING NLTK IN PYTHON
A summary states the most important points of a text in a shorter form. It presents only the salient information in a condensed format, so the reader can get acquainted with the subject matter without going through irrelevant detail and can decide whether reading the entire document is actually necessary. Automatic text summarization has two general approaches: extraction and abstraction. Abstraction works in a way similar to how humans summarize text: it first builds an internal semantic representation and then generates a summary with the help of natural language generation techniques. The extractive method instead selects a subset of words, phrases and sentences from the original text, which are then arranged in proper sequence to give the summary. This summarizer was developed using the extractive technique, wherein the most important sentences are extracted and retained in the summary.
Flow of Working of the Algorithm:
Input: News article. Output: Summarized article.
Steps: [In brief]
1. The news article and the number of sentences required in the summary are entered.
2. The entered text is split into sentences, and these sentences are then split into words.
3. These words are filtered by removing stopwords and punctuation.
4. The frequency of each remaining word is calculated.
5. The frequency of each word relative to the highest word frequency is calculated.
6. The rank of each sentence in the input text is calculated by adding up the relative frequencies of the words appearing in that sentence.
7. The sentences are then sorted using the nlargest method of heapq, which returns the 'n' sentences having the highest ranks.
8. These sentences are returned as the summary.
Detailed Explanation:
Take the text as input and tokenize it into sentences and words using the nltk.tokenize modules, namely sent_tokenize and word_tokenize, then filter the words by removing the stopwords using the nltk.corpus module. With sent_tokenize the entered text is split on the period (.) and sentences are obtained. These sentences are further operated on by word_tokenize to obtain tokens in the form of words. Stopwords are words like articles and 'to be' verbs (am, is, are, was, were, etc.), along with punctuation marks; calculating their frequency would only increase the complexity of the code, so these words are discarded as soon as the text is entered.
sents = sent_tokenize(text)                                # 'sents' contains sentences
word_sent = [word_tokenize(s) for s in sents]              # 'word_sent' contains the words of each sentence
a = set(stopwords.words('english') + list(punctuation))    # 'a' contains stopwords and punctuation
The next step is calculating the frequency of each word that belongs to 'word_sent' but not to the set 'a' of stopwords. The calculated frequencies are stored using the 'collections' module in Python.
freq[word] += 1
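A minimal sketch of steps 4 and 5, assuming the variables 'word_sent' and 'a' defined above (an illustration, not the authors' exact code):
from collections import defaultdict

freq = defaultdict(int)                  # unseen words start at a count of 0
for sent_words in word_sent:
    for word in sent_words:
        if word not in a:                # ignore stopwords and punctuation
            freq[word] += 1
m = float(max(freq.values()))            # highest raw frequency
for word in freq:
    freq[word] = freq[word] / m          # step 5: frequency relative to the maximum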
The next part is calculating the rank of each sentence. The relative frequencies of the words in a sentence are summed to give that sentence its rank, and the sentences are then sorted in descending order of rank using the nlargest method of the heapq module in Python.
rank[sent] += freq[word]
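Continuing the sketch (again illustrative rather than the authors' exact code, with 'n' being the number of sentences requested in step 1), steps 6 to 8 can be written as:
from heapq import nlargest

rank = defaultdict(int)
for i, sent_words in enumerate(word_sent):
    for word in sent_words:
        if word in freq:
            rank[i] += freq[word]                 # step 6: sum the relative frequencies
# step 7: indices of the n highest-ranked sentences, restored to document order
best = sorted(nlargest(n, rank, key=rank.get))
summary = ' '.join(sents[i] for i in best)        # step 8: the summary sentences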
The last part is displaying the summary: the highest-ranked 'n' sentences are returned as the summary of the entered text. The GUI for this program is created in Python using the Tkinter module. Here the input is the text and the number of sentences in which the summary is required; the output is the summary, the number of sentences in the entered text and the summary ratio (%).
Figure 1 Output of text summarizer
4.2 AUTOMATIC TEXT TO GRAPH CONVERSION USING NLP IN
PYTHON
A graph is a good method of condensing and representing data in a readily understandable form. The visual representation gives easy access to statistical data and allows it to be interpreted at a glance; graphical representation also makes data easy to recall. Our tool focuses on the automatic conversion of the statistical data that accompanies stock market news into graphs. Such automated graphs make it possible to overview and explore statistical data sets, are of immense importance for decision making in business and marketing, and hold great potential for research. The domain used here is stock news articles. Text processing is handled efficiently in Python, which integrates well with natural language processing: Python offers extensive libraries for graph plotting, database connectivity and textual analysis, all of which proved significant in designing the tool. NLTK is a platform that enables Python programs to work with natural language; it provides a collection of classes, modules and methods that make it easier to process text. Our domain, stock articles, consists of statistics and figures in different formats such as time and index values (points and percent). Tokens and figures were extracted using NLTK and regular expressions respectively, whereas the mapping to graphs is done using Python libraries.
The stock market is where dealers and buyers come together and trading takes place between them, and with it comes a lot of figures in the form of shares. The main aim is to convert the statistical information from stock news directly into a graph that is easy to access and compare. Our tool focuses on the BSE (Bombay Stock Exchange) and on private-sector companies, deriving the daily gainers and losers. It produces two graphs, one containing the Sensex points and one containing the gainers and losers. The stock articles have been extracted from the Economic Times [6], which gives the scenario of the entire day.
Flow of Working of the Algorithm:
Input: Stock article. Output: Graphical representation.
Steps: [In brief]
Note: Establish database connectivity first.
PART A: BSE
1. Tokenize the text into sentences on the basis of the full stop (period).
2. Tokenize each sentence into words.
3. Remove all stopwords from the list of words.
4. Traverse the text linearly and match it against the keywords from the database.
5. If a keyword is found, retrieve its tuple from the database and store it in a list.
6. Retrieve all the figures from the text.
7. Using regular expressions, fetch the sentences in which a time is mentioned and from them extract the time and stock points into a dictionary (time: points).
8. For the BSE:
a) Plot a bar graph from the dictionary of times and points.
b) Plot a line graph from the lists of figures that have no time stored against them.
PART B: COMPANIES (GAINERS AND LOSERS)
1. Fetch the sentences from the text containing gainers and losers.
2. In each such sentence, fetch the company names, match them with the database and store them in a list.
3. Use regular expressions to fetch the gain % or loss % (see the sketch after these steps).
4. If it is a gain, store it in the dictionary 'dict' [key: value] as [company name: +%].
5. If it is a loss, store it in the dictionary 'dict' [key: value] as [company name: -%].
6. Plot the bar graph as companies versus percent (gain or loss).
7. Exit.
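A hedged sketch of steps 3 to 5 (the patterns are our own illustration, not the authors' exact code; note that \w+ captures only single-word names, so multi-word companies such as 'ICICI Bank' rely on the database match of step 2):
import re

sentence = ("BHEL (up 1.2 per cent) and TataMotors (up 1.1 per cent) were among the "
            "major Sensex gainers. ITC (down 2.8 per cent) was a major index loser.")
# capture the word preceding "(up x.y per cent)" together with the percentage
gainers = re.findall(r"(\w+)\s*\(up\s+([\d.]+)\s+per cent\)", sentence)
stocks = {name: float(pct) for name, pct in gainers}        # step 4: gainers stored as +%
losers = re.findall(r"(\w+)\s*\(down\s+([\d.]+)\s+per cent\)", sentence)
stocks.update({name: -float(pct) for name, pct in losers})  # step 5: losers stored as -%
# stocks == {'BHEL': 1.2, 'TataMotors': 1.1, 'ITC': -2.8}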
Detailed Explanation:
The explanation could be better understood using an example as follows:
Enter the text: The S&P BSE Sensex started on a cautious note on Wednesday following muted
trend seen in other Asian markets. The index was trading in a narrow range, weighed down by
losses in ITC, ICICI Bank, HDFC Bank and SesaSterlite. Tracking the momentum, the 50-share
Nifty index also turned choppy after a positive start, weighed down by losses in banks, metal
and FMCG stocks. At 09:20 am, the 30-share index was at 27419, down 6 points or 0.02 per
cent. It touched a high of 27460.76 and a low of 27351.27 in trade today. At 02:30 pm, the 30-
share index was at 27362, down 62 points or 0.23 per cent. It touched a high of 27512.80 and a
low of 27203.25 in trade today. BHEL (up 1.2 per cent), TataMotors (up 1.1 per cent), HUL (up
1.1 per cent), Wipro (up 1.03 per cent) and Infosys (up 0.74 per cent) were among the major
Sensex gainers. ITC (down 2.8 per cent), SesaSterlite (down 2.3 per cent), Hindalco (down 1.3
per cent), ICICI Bank (down 0.94 per cent) and TataSteel (down 0.66 per cent) were the major
index losers.
1. After the input text is entered, stopwords are removed using
stops = stopwords.words('english') + list(punctuation)
and the main tokens are stored in a list as follows:
['p', 'bse', 'sensex', 'started', 'cautious', 'note', 'wednesday', 'following', 'muted', 'trend', 'seen',
'asian', 'markets', 'index', 'trading', 'narrow', 'range', 'weighed', 'losses', 'itc', 'icici', 'bank',
'hdfc', 'bank', 'sesa', 'sterlite', 'tracking', 'momentum', '50-share', 'nifty', 'index', 'also', 'turned',
'choppy', 'positive', 'start', 'weighed', 'losses', 'banks', 'metal', 'fmcg', 'stocks', '09:20', '30-share',
'index', '27419', '6', 'points', '0.02', 'per', 'cent', 'touched', 'high', '27460.76', 'low', '27351.27',
'trade', 'today', '02:30', 'pm', '30-share', 'index', '27362', '62', 'points', '0.23', 'per', 'cent',
'touched', 'high', '27512.80', 'low', '27203.25', 'trade', 'today', 'bhel', '1.2', 'per', 'cent',
'tatamotors', '1.1', 'per', 'cent', 'hul', '1.1', 'per', 'cent', 'wipro', '1.03', 'per', 'cent', 'infosys', '0.74',
'per', 'cent', 'among', 'major', 'sensex', 'gainers', 'itc', '2.8', 'per', 'cent', 'sesasterlite', '2.3', 'per',
'cent', 'hindalco', '1.3', 'per', 'cent', 'icici', 'bank', '0.94', 'per', 'cent', 'tatasteel', '0.66', 'per',
'cent', 'major', 'index', 'losers']
2. Next, keywords from the database are matched with those from the article, and the intermediate output is a list as given below:
('high'), ('low'), ('today'), ('trading')
3. In the next step, regular expressions are applied to the text and the figures are extracted. Below are examples of two such regular expressions:
>>>abc = re.findall(r"\d+,*\d+\.\d+[ points]", text)
>>>abc
['27460.76', '27351.27', '27512.80', '27203.25']
>>>time = re.findall(r"\d+:\d+[am/pm/ a.m./ p.m.]+", text)
>>>time
['09:20 am ', '02:30 pm']
4. These time and point values are then stored in a Python dictionary (time: points). For the example article, the resulting structure is:
['09:20', '27419'] ['02:30', '27362']
The general structure of the dictionary for the BSE is shown in Table 2.

Table 2: Dictionary for BSE
KEY      | VALUE
09:00 am | 28450.01
11:00 am | 28500.98
01:00 pm | 28601.89
03:00 pm | 28431.78
05:00 pm | 27997.01

5. A similar process is applied for the gainers and losers: positive and negative values are assigned to gainers and losers respectively, and both lists are maintained. Regular expressions are applied to store these values in two separate lists.
Gainers:[' BHEL (up 1.2 per cent), TataMotors (up 1.1 per cent), HUL (up 1.1 per cent), Wipro
(up 1.03 per cent) and Infosys (up 0.74 per cent) were among the major Sensex gainers. ']
Losers:['ITC (down 2.8 per cent), SesaSterlite (down 2.3 per cent), Hindalco (down 1.3 per
cent), ICICI Bank (down 0.94 per cent) and TataSteel (down 0.66 per cent) were the major
index losers. ']
6. These lists are then merged into a dictionary to obtain unified storage for plotting the graphs. For the example article, the dictionary for the companies is:
{'BHEL': 1.2, 'TataMotors': 1.1, 'HUL': 1.1, 'Wipro': 1.03, 'Infosys': 0.74, 'ITC': -2.8, 'SesaSterlite': -2.3, 'Hindalco': -1.3, 'ICICI': -0.94, 'TataSteel': -0.66}
The general structure of this dictionary is shown in Table 3.

Table 3: Dictionary for Companies
KEY           | VALUE
Tata Motors   | +1.78 %
Cipla         | -2.87 %
HDFC          | -1.89 %
Bharti Airtel | +2.78 %
Sun Pharma    | +1.01 %
ONGC          | -1.44 %
NTPC          | +3.24 %
7. These dictionaries are passed to Matplotlib's pyplot module to generate the respective graphs.
figure = plt.figure(figsize=(12,6), facecolor="white")
ax = figure.add_subplot(111)
ax.set_xlabel('TIME', color='black', fontsize=18)
ax.set_ylabel('STOCK POINTS', color='black', fontsize=18)
ax.set_title('BSE', fontsize=26)
set_xlabel and set_ylabel specify what is shown on the x and y axes respectively, and plt.figure determines the size and colour of the figure.
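As a hedged end-to-end sketch of this step (the values come from the example dictionary above; the bar colours and layout are our own choices, not necessarily the authors'):
import matplotlib.pyplot as plt

stocks = {'BHEL': 1.2, 'TataMotors': 1.1, 'HUL': 1.1, 'Wipro': 1.03, 'Infosys': 0.74,
          'ITC': -2.8, 'SesaSterlite': -2.3, 'Hindalco': -1.3, 'ICICI': -0.94, 'TataSteel': -0.66}

figure = plt.figure(figsize=(12, 6), facecolor="white")
ax = figure.add_subplot(111)
names = list(stocks.keys())
values = [stocks[n] for n in names]
colors = ['green' if v > 0 else 'red' for v in values]   # gainers green, losers red
ax.bar(range(len(names)), values, color=colors)
ax.set_xticks(range(len(names)))
ax.set_xticklabels(names, rotation=45)
ax.set_xlabel('COMPANIES', color='black', fontsize=18)
ax.set_ylabel('PER CENT CHANGE', color='black', fontsize=18)
ax.set_title('GAINERS AND LOSERS', fontsize=26)
plt.tight_layout()
plt.show()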
Figure 2 and Figure 3 give screenshots of the output for the given article.
Figure 2 BSE Graph
Figure 3 Companies (Gainers and Losers)
5.1 Conclusion:
Due to the World Wide Web, the rate of information growth has created a need for efficient techniques that reduce data, make it simpler to understand, and convey messages effectively. Natural Language Processing is used worldwide for its efficiency in text processing, although analyzing human interpretation with the various principles of NLP remains a delicate task. This led to the development of automatic summarization tools and text-to-graph converters. The two applications presented here focus on text analysis, thereby reducing the time needed to get the main gist of the input text.
The text summarizer, which uses the principle of extraction, preserves the pith of a lengthy article, while statistical text-to-graph conversion eases comparison by directly producing as output a graph that contains all the required figures.
In the text summarizer, the text is analyzed using word frequencies and each sentence is ranked and placed in its proper position, whereas in the text-to-graph converter the analysis is done using regular expressions, which manipulate the text according to the conditions specified. NLTK, being readily available in Python, provides a powerful approach to information extraction and processing; the text therefore underwent splitting, tokenizing and merging using various modules of NLTK.
5.2 Future Scope
The topics for future work are as follows:
1. Most current approaches to summarization are based on extraction; abstraction, which includes automatically generating new text in response to the input, can be implemented in future work.
2. Research across a wide spectrum of text domains to improve accuracy can be supported in Python using frameworks such as Django and Flask.
3. Text-to-graph converters can be extended to compare datasets having different time domains.
4. For convenience, these applications can also be ported to mobile applications on Android or iOS.
5. Databases can be used for handling multiple datasets, thereby reducing both memory usage and retrieval time.
References
[1] Allen, James, "Natural Language Understanding", Second edition (Redwood City:
Benjamin/Cummings, 1995).
[2] Baxendale, P. (1958). Machine-made index for technical literature - an experiment. IBM Journal
of Research Development, 2(4):354–361. [2, 3, 5]
[3] BeautifulSoup4 4.3.2, Retrieved from https://pypi.python.org/pypi/beautifulsoup4
[4] Bird Steven, Klein Ewan, Loper Edward June 2009, "Natural Language Processing with
Python", Pages 16,27,79
[5] Cortez Eli, Altigran S. da Silva 2013, "Unsupervised Information Extraction by Text
Segmentation", Ch 3
[6] Economic Times Archives Jan 2014-Dec 2014, Retrieved from
http://economictimes.indiatimes.com/
[7] Edmundson, H. P. (1969). New methods in automatic extracting. Journal of the ACM,
16(2):264–285. [2, 3, 4]
[8] Friedl Jeffrey E.F. August 2006,"Mastering Regular Expressions", Ch 1
[9] Goddard Cliff Second edition 2011,"Semantic Analysis: A practical introduction ", Section 1.1-
1.5
[10] Kumar Ela, "Artificial Intelligence", Pages 313-315
[11] Luhn, H. P. (1958). The automatic creation of literature abstracts. IBM Journal of Research
Development, 2(2):159–165. [2, 3, 6, 8]
[12] Lukaszewski Albert 2010, "MySQL for Python", Ch 1,2,3
[13] Manning Christopher D., Schütze Hinrich Sixth Edition 2003,"Foundations of Statistical
Natural Language Processing", Ch 4 Page no. 575
[14] Martelli Alex Second edition July 2006, "Python in a Nutshell", Pages 44,201.
[15] Natural Language Toolkit, Retrieved from http://www.nltk.org
[16] Pattern 2.6, Retrieved from http://www.clips.ua.ac.be/pattern
[17] Prasad Reshma, Mary Priya Sebastian, International Journal on Natural Language Computing
(IJNLC) Vol. 3, No.2, April 2014, " A survey on phrase structure learning methods for text
classification"
[18] Pressman Roger S., 6th edition, "Software Engineering – A Practitioner's Approach"
[19] Python Language, Retrieved from https://www.python.org/
[20] Rodrigues Mário, Teixeira António, "Advanced Applications of Natural Language Processing
for Performing Information Extraction", Ch 1,2,4
[21] Stubblebine Tony, "Regular Expression Pocket Reference: Regular Expressions for Perl, Ruby,
PHP, Python, C, Java and .NET "
[22] Sobin Nicholas 2011, "Syntactic Analysis: The Basics", Ch 1,2
[23] Swaroop C H, “A Byte of Python: Basics and Syntax of Python”, Ch 5,8,9,10
[24] TextBlob: Simplified Text Processing, Retrieved from http://textblob.readthedocs.org/en/dev
[25] Thanos Costantino ,"Research and Advanced Technology for Digital Libraries", Page 338-362
[26] Tosi Sandro November 2009, "Matplotlib for Python Developers", Ch 2,3
Authors
Prajakta R. Yerpude is pursuing her Bachelor of Engineering (2012-16) in Computer
Science and Engineering from Shri Ramdeobaba College of Engineering and
Management, Nagpur. She has been working in the domain of Natural Language
Processing for two years. Her interests include Natural Language Processing,
Databases and Artificial Intelligence. Email: yerpudepr@rknec.edu
Rashmi P. Jakhotiya is pursuing her Bachelor of Engineering (2012-16) in Computer
Science and Engineering from Shri Ramdeobaba College of Engineering and
Management, Nagpur. She has been working in the domains of Regular Expressions
and Natural Language Processing for two years. Her interests include Natural Language
Processing, Regular Expressions and Artificial Intelligence. Email: jakhotiyarp@rknec.edu
Dr. M. B. Chandak received his PhD degree from RTM Nagpur University, Nagpur.
He is presently working as Professor and Head of the Computer Science and
Engineering Department at Shri Ramdeobaba College of Engineering and
Management, Nagpur. He has a total of 21 years of academic experience, with
research interests in Natural Language Processing, advanced networking and Big Data Analytics.
Email: hodcs@rknec.edu