Natural Language Processing is a programmed approach to analyzing text that is based on both a set of theories and a set of technologies. This forum aims to bring together researchers who have designed and built software that will analyze, understand, and generate languages that humans use naturally to address computers.
LARQS: AN ANALOGICAL REASONING EVALUATION DATASET FOR LEGAL WORD EMBEDDING kevig
Applying natural language processing algorithms is currently popular in legal applications, for instance
document classification of legal documents, contract review and machine translation. All of these machine
learning algorithms need to encode the words in a document in the form of vectors. The word embedding
model is a modern distributed word representation approach and the most common unsupervised word
encoding method. Its output can be fed to other algorithms that then perform the downstream tasks of
natural language processing. The most common and practical approach to evaluating the accuracy of a word
embedding model uses a benchmark set built on linguistic rules or relationships between words and
performs analogical reasoning via algebraic calculation. This paper proposes a Legal Analogical Reasoning
Questions Set (LARQS) of 1,256 questions, built from a corpus of 2,388 Chinese Codex documents using five
kinds of legal relations, which is then used to evaluate the accuracy of Chinese word embedding models.
Moreover, we discovered that legal relations might be ubiquitous in the word embedding model.
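The analogy evaluation described above can be sketched as follows. This is a minimal illustration, not the LARQS evaluation code: the two-dimensional vectors and the English word labels are hypothetical stand-ins for real Chinese legal embeddings, and the scoring rule (nearest cosine neighbour of b - a + c) is the common convention for analogy benchmarks, assumed here rather than taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(emb, a, b, c):
    """Answer 'a is to b as c is to ?' with the word closest to b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))


def analogy_accuracy(emb, questions):
    """Share of (a, b, c, d) questions whose predicted answer equals d."""
    answerable = [q for q in questions if all(w in emb for w in q)]
    if not answerable:
        return 0.0
    correct = sum(solve_analogy(emb, a, b, c) == d for a, b, c, d in answerable)
    return correct / len(answerable)


# Hypothetical toy embeddings: the second axis encodes the "counterparty" relation.
toy_embeddings = {
    "buyer": [1.0, 0.0],
    "seller": [1.0, 1.0],
    "lessee": [3.0, 0.0],
    "lessor": [3.0, 1.0],
    "contract": [0.0, -1.0],
}
toy_questions = [
    ("buyer", "seller", "lessee", "lessor"),
    ("lessee", "lessor", "buyer", "seller"),
]

print(solve_analogy(toy_embeddings, "buyer", "seller", "lessee"))
print(analogy_accuracy(toy_embeddings, toy_questions))
```

Questions containing a word missing from the vocabulary are skipped here; whether a benchmark counts them as wrong instead is a design choice that changes the reported accuracy.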
LARQS: AN ANALOGICAL REASONING EVALUATION DATASET FOR LEGAL WORD EMBEDDING kevig
This document describes the development of a new legal word embedding evaluation dataset for Chinese called LARQS (Legal Analogical Reasoning Questions Set). It was created using a corpus of 2,388 Chinese legal documents and contains 1,256 questions evaluating 5 categories of legal relationships. The document discusses word embedding and existing evaluation benchmarks. It then describes how LARQS was created by legal experts and its potential usefulness compared to general-purpose benchmarks for evaluating legal-domain word embeddings.
French machine reading for question answering Ali Kabbadj
This paper proposes to unlock the main barrier to machine reading and comprehension of French natural language texts. This opens the way for a machine to find a precise answer to a question buried in the mass of unstructured French texts, or to create a universal French chatbot. Deep learning has produced extremely promising results for various tasks in natural language understanding, particularly topic classification, sentiment analysis, question answering, and language translation. But to be effective, deep learning methods need very large training datasets. Until now these techniques could not actually be used for French question answering (Q&A) applications, since there was no large Q&A training dataset. We produced a large (100,000+) French training dataset for Q&A by translating and adapting the English SQuAD v1.1 dataset, together with GloVe French word and character embedding vectors from the French Wikipedia dump. We trained and evaluated three different Q&A neural network architectures in French and obtained French Q&A models with an F1 score around 70%.
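For readers unfamiliar with how an F1 score is computed for extractive Q&A, the sketch below shows the token-overlap F1 used in SQuAD-style evaluation. It is an assumption that the paper's F1 follows this convention; the normalization here is deliberately simplified (lowercasing and whitespace splitting only), and the example strings are made up.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers match; one empty answer scores zero.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # occurs in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the prediction contains one extra token.
print(token_f1("la tour Eiffel", "tour Eiffel"))
```

A corpus-level score would average this value over all questions, taking the best score when a question has several reference answers.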
This document describes a proposed concept-based mining model that aims to improve document clustering and information retrieval by extracting concepts and semantic relationships rather than just keywords. The model uses natural language processing techniques like part-of-speech tagging and parsing to extract concepts from text. It represents concepts and their relationships in a semantic network and clusters documents based on conceptual similarity rather than term frequency. The model is evaluated using singular value decomposition to increase the precision of key term and phrase extraction.
An in-depth review on News Classification through NLP IRJET Journal
This document provides an in-depth literature review of news classification through natural language processing (NLP). It discusses several existing approaches to news classification, including models that use convolutional neural networks (CNNs), graph-based approaches, and attention mechanisms. The document also notes that current search engines often return too many irrelevant results, so classification could help layer search results. It concludes that while many techniques have been developed, inconsistencies remain in effectively classifying news, so further research on combining NLP, feature extraction, and fuzzy logic is needed.
A hybrid composite features based sentence level sentiment analyzer IAESIJAI
Current lexicon-based and machine learning-based sentiment analysis approaches
still suffer from a two-fold limitation. First, manual lexicon construction and
machine training are time-consuming and error-prone. Second, accurate
prediction requires that sentences and their corresponding training text
fall under the same domain. In this article, we experimentally
evaluate four sentiment classifiers, namely support vector machines (SVMs),
Naive Bayes (NB), logistic regression (LR) and random forest (RF). We
quantify the quality of each of these models using three real-world datasets
that comprise 50,000 movie reviews, 10,662 sentences, and 300 generic
movie reviews. Specifically, we study the impact of a variety of natural
language processing (NLP) pipelines on the quality of the predicted
sentiment orientations. Additionally, we measure the impact of incorporating
lexical semantic knowledge captured by WordNet on expanding original
words in sentences. Findings demonstrate that utilizing different NLP
pipelines and semantic relationships impacts the quality of the sentiment
analyzers. In particular, results indicate that coupling lemmatization and
knowledge-based n-gram features produced higher accuracy.
With this coupling, the accuracy of the SVM classifier improved to
90.43%, while it was 86.83%, 90.11%, and 86.20%, respectively, using the three
other classifiers.
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac... IRJET Journal
This paper proposes a method to mine rare sequential topic patterns (URSTPs) from tweet data. It involves preprocessing tweets to extract topics, identifying user sessions, generating sequential topic pattern (STP) candidates, and selecting URSTPs based on rarity analysis. Experiments show the approach can identify special users and interpretable URSTPs, indicating users' characteristics. The paper aims to capture personalized and abnormal user behaviors through sequential relationships between extracted topics from successive tweets.
Taxonomy extraction from automotive natural language requirements using unsup... ijnlc
In this paper we present a novel approach to semi-automatically learn concept hierarchies from natural
language requirements of the automotive industry. The approach is based on the distributional hypothesis
and the special characteristics of domain-specific German compounds. We extract taxonomies by using
clustering techniques in combination with general thesauri. Such a taxonomy can be used to support
requirements engineering in early stages by providing a common system understanding and an agreed-upon
terminology. This work is part of an ontology-driven requirements engineering process, which builds
on top of the taxonomy. Evaluation shows that this taxonomy extraction approach outperforms common
hierarchical clustering techniques.
This document discusses using automatic text analysis techniques to streamline the process of multi-dimensional analysis of collaborative learning discussions. It describes a tool called TagHelper that was evaluated against a hand-coded corpus with a 7-dimensional coding scheme. TagHelper achieved a Cohen's Kappa agreement of over 0.7 for 6 of the 7 dimensions when considering only the text segments it was most confident about, and was confident in its coding for at least 88% of the corpus for 5 of those dimensions. The document motivates the need for such automatic analysis to reduce the time and effort required for manual coding of collaborative learning data.
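Cohen's Kappa, the agreement measure quoted above, corrects raw agreement for the agreement expected by chance. The sketch below computes it for two label sequences; the label names and the ten-segment example are hypothetical and are not drawn from the TagHelper corpus.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the coders' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codes for ten text segments: human coder vs. automatic coder.
human = ["claim"] * 5 + ["question"] * 5
tool = ["claim"] * 4 + ["question"] * 6

print(round(cohens_kappa(human, tool), 3))
```

In this example the coders agree on 9 of 10 segments, chance agreement is 0.5, and Kappa is (0.9 - 0.5) / (1 - 0.5) = 0.8.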
An Ontology-Based Information Extraction Approach For Résumés Richard Hogue
This document discusses developing an ontology-driven information extraction system called the Ontology-based Résumé Parser (ORP) to extract information like experiences, qualifications, education, and personal details from millions of résumés in English and Turkish. The system uses various domain ontologies within its Ontology Knowledgebase to semantically parse résumés and match concepts. It aims to assist with expert finding and skills aggregation by analyzing data semantically rather than just syntactically matching keywords. The Résumé Ontology is described in detail to represent information typically included in résumés through semantic annotations.
This document is a thesis that proposes using word embeddings to improve information retrieval by addressing term mismatch issues. It discusses word2vec, a technique for learning word embeddings from large text corpora that capture semantic relationships between words. The thesis proposes two approaches: 1) incorporating word embedding similarities into a probabilistic language model for retrieval and 2) a vector space model. Due to time constraints, only the first approach is implemented, which integrates word embeddings into ALMasri and Chevallet's probabilistic language model. Experiments are conducted to evaluate the impact of using semantic features from word embeddings on retrieval effectiveness.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
The document discusses text mining and summarizes several key points:
1) Text mining involves deriving patterns and trends from text to discover useful knowledge, but it is challenging to accurately evaluate features due to issues like polysemy and synonymy.
2) Phrase-based approaches could perform better than term-based approaches by carrying more semantic meaning, but have faced challenges due to low phrase frequencies and redundant/noisy phrases.
3) The proposed approach uses pattern mining to discover specific patterns and evaluates term weights based on pattern distributions rather than full document distributions to address misinterpretation issues and improve accuracy.
SWSN UNIT-3.pptx we can information about swsn professionalgowthamnaidu0986
Ontology engineering involves constructing ontologies through various methods. It begins with defining the scope and evaluating existing ontologies for reuse. Terms are enumerated and organized in a taxonomy with defined properties, facets, and instances. The ontology is checked for anomalies and refined iteratively. Popular tools for ontology development include Protégé and WebOnto. Methods like METHONTOLOGY and the On-To-Knowledge methodology provide processes for building ontologies from scratch or reusing existing ones. Ontology sharing requires mapping between ontologies to allow interoperability, and libraries exist for storing and accessing ontologies.
Relevance feature discovery for text mining redpel dot com
The document discusses relevance feature discovery for text mining. It presents an innovative model that discovers both positive and negative patterns in text documents as higher-level features and uses them to classify terms into categories and update term weights based on their specificity and distribution in patterns. Experiments on standard datasets show the proposed model outperforms both term-based and pattern-based methods.
Great model a model for the automatic generation of semantic relations betwee...ijcsity
The large available amount of non-structured texts that belong to different domains such as healthcare (e.g. medical records), justice (e.g. laws, declarations), insurance (e.g. declarations), etc. increases the effort required for the analysis of information in a decision making process. Different projects and tools have proposed strategies to reduce this complexity by classifying, summarizing or annotating the texts. Particularly, text summary strategies have proven to be very useful to provide a compact view of an original text. However, the available strategies to generate these summaries do not fit very well within the domains that require taking into consideration the temporal dimension of the text (e.g. a recent piece of text in a medical record is more important than a previous one) and the profile of the person who requires the summary (e.g. the medical specialization). To cope with these limitations this paper presents "GReAT", a model for automatic summary generation that relies on natural language processing and text mining techniques to extract the most relevant information from narrative texts and discover new information from the detection of related information. The GReAT Model was implemented in software to be validated in a health institution, where it has been shown to be very useful to display a preview of the information about medical health records and discover new facts and hypotheses within the information. Several tests were executed, such as Functionality, Usability and Performance, regarding the implemented software. In addition, precision and recall measures were applied on the results obtained through the implemented tool, as well as on the loss of information obtained by providing a text shorter than the original
Classification of News and Research Articles Using Text Pattern Mining IOSR Journals
This document summarizes a research paper that proposes a method for classifying news and research articles using text pattern mining. The method involves preprocessing text to remove stop words and perform stemming. Frequent and closed patterns are then discovered from the preprocessed text. These patterns are structured into a taxonomy and deployed to classify new documents. The method also involves evolving patterns by reshuffling term supports within patterns to reduce the effects of noise from negative documents. Over 80% of documents were successfully classified using this pattern-based approach.
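The difference between frequent and closed patterns mentioned above can be illustrated with a small brute-force sketch. It treats each preprocessed document as a set of terms, which is a simplifying assumption (the paper may mine patterns at paragraph level or use sequences), and the three example documents are hypothetical. A frequent pattern is closed when no larger pattern has the same support.

```python
from itertools import combinations


def frequent_patterns(docs, min_support):
    """Map each term set occurring in at least min_support docs to its support."""
    vocabulary = sorted(set().union(*docs))
    patterns = {}
    for size in range(1, len(vocabulary) + 1):
        found_any = False
        for combo in combinations(vocabulary, size):
            support = sum(1 for doc in docs if doc.issuperset(combo))
            if support >= min_support:
                patterns[frozenset(combo)] = support
                found_any = True
        if not found_any:
            # No frequent pattern of this size means none of any larger size.
            break
    return patterns


def closed_patterns(patterns):
    """Keep patterns that have no proper superset with the same support."""
    return {
        pattern: support
        for pattern, support in patterns.items()
        if not any(pattern < other and support == patterns[other] for other in patterns)
    }


# Hypothetical documents after stop-word removal and stemming.
example_docs = [
    {"stock", "market"},
    {"stock", "market", "bank"},
    {"stock", "bank"},
]

frequent = frequent_patterns(example_docs, min_support=2)
for pattern, support in closed_patterns(frequent).items():
    print(sorted(pattern), support)
```

Here "market" alone is frequent but not closed, because the larger pattern {market, stock} occurs in exactly the same two documents. The exhaustive enumeration is only practical for tiny vocabularies; real systems use dedicated mining algorithms.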
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH IJDKP
Text mining is an emerging research field evolving from the information retrieval area. Classification and
clustering are two approaches in data mining which may also be used to perform text classification
and text clustering. The former is supervised while the latter is unsupervised. In this paper, our objective is
to perform text clustering by defining an improved distance metric to compute the similarity between two
text files. We use incremental frequent pattern mining to find frequent items and reduce dimensionality.
The improved distance metric may also be used to perform text classification. The distance metric is
validated for the worst, average and best case situations [15]. The results show that the proposed distance
metric outperforms the existing measures.
EXPERT OPINION AND COHERENCE BASED TOPIC MODELING ijnlc
In this paper, we propose a novel algorithm that rearranges the topic assignment results obtained from topic
modeling algorithms, including NMF and LDA. The effectiveness of the algorithm is measured by how much
the results conform to expert opinion, which is a data structure called TDAG that we defined to represent the
probability that a pair of highly correlated words appear together. In order to make sure that the internal
structure is not changed too much by the rearrangement, coherence, which is a well-known metric
for measuring the effectiveness of topic modeling, is used to control the balance of the internal structure.
We developed two ways to systematically obtain the expert opinion from data, depending on whether the
data has relevant expert writing or not. The final algorithm, which takes into account both coherence and
expert opinion, is presented. Finally, we compare the amount of adjustment needed for each topic
modeling method, NMF and LDA.
ONTOLOGY BASED DOCUMENT CLUSTERING USING MAPREDUCE ijdms
Nowadays, document clustering is considered as a data intensive task due to the dramatic, fast increase in
the number of available documents. Nevertheless, the features that represent those documents are also too
large. The most common method for representing documents is the vector space model, which represents
document features as a bag of words and does not represent semantic relations between words. In this
paper we introduce a distributed implementation for the bisecting k-means using MapReduce programming
model. The aim behind our proposed implementation is to solve the problem of clustering intensive data
documents. In addition, we propose integrating the WordNet ontology with bisecting k-means in order to
utilize the semantic relations between words to enhance document clustering results. Our presented
experimental results show that using lexical categories for nouns only enhances internal evaluation
measures of document clustering; and decreases the documents features from thousands to tens features.
Our experiments were conducted using Amazon ElasticMapReduce to deploy the Bisecting k-means
algorithm
The document describes a comparative study of various machine learning and neural network models for detecting abusive language on Twitter. It finds that a bidirectional GRU network trained on word-level features, with a Latent Topic Clustering module, achieves the most accurate results with an F1 score of 0.805 for detecting abusive tweets. Additionally, it explores using context tweets as additional features and finds this improves some models' performance.
This document provides an overview of natural language processing (NLP) research trends presented at ACL 2020, including shifting away from large labeled datasets towards unsupervised and data augmentation techniques. It discusses the resurgence of retrieval models combined with language models, the focus on explainable NLP models, and reflections on current achievements and limitations in the field. Key papers on BERT and XLNet are summarized, outlining their main ideas and achievements in advancing the state-of-the-art on various NLP tasks.
International Journal of Computational Engineering Research (IJCER) ijceronline
International Journal of Computational Engineering Research (IJCER) is an international online journal published monthly in English. The journal publishes original research work that contributes significantly to furthering scientific knowledge in engineering and technology.
A Review Of Text Mining Techniques And Applications Lisa Graves
This document provides a review of various text mining techniques and applications. It discusses techniques used for text classification and summarization, including Naive Bayes classification, backpropagation neural networks, keyword matching, and information extraction. It also covers applications of text mining in areas like sentiment analysis of social media posts and hotel reviews. Finally, it discusses the need for organizational text mining to extract useful information and insights from large amounts of unstructured text data.
A SEMANTIC METADATA ENRICHMENT SOFTWARE ECOSYSTEM BASED ON TOPIC METADATA ENR...IJDKP
As existing computer search engines struggle to understand the meaning of natural language, semantically
enriched metadata may improve interest-based search engine capabilities and user satisfaction.
This paper presents an enhanced version of the ecosystem focusing on semantic topic metadata detection
and enrichments. It is based on a previous paper, a semantic metadata enrichment software ecosystem
(SMESE). Through text analysis approaches for topic detection and metadata enrichments this paper
proposes an algorithm to enhance search engine capabilities and consequently help users find content
according to their interests. It presents the design, implementation and evaluation of the SATD (Scalable
Annotation-based Topic Detection) model and algorithm using metadata from the web, linked open data,
concordance rules, and bibliographic record authorities. It includes a prototype of a semantic engine using
keyword extraction, classification and concept extraction that allows generating semantic topics by text,
and multimedia document analysis using the proposed SATD model and algorithm.
The performance of the proposed ecosystem is evaluated using a number of prototype simulations by
comparing them to existing enriched metadata techniques (e.g., AlchemyAPI, DBpedia, Wikimeta, Bitext,
AIDA, TextRazor). It was noted that SATD algorithm supports more attributes than other algorithms. The
results show that the enhanced platform and its algorithm enable greater understanding of documents
related to user interests.
This paper discusses the capabilities and limitations of GPT-3, a state-of-the-art language model, in the
context of text understanding. We begin by describing the architecture and training process of GPT-3, and
provide an overview of its impressive performance across a wide range of natural language processing
tasks, such as language translation, question-answering, and text completion. Throughout this research
project, a summarizing tool was also created to help us retrieve content from any type of document,
specifically IELTS Reading Test data in this project. We also aimed to improve the accuracy of the
summarizing, as well as question-answering capabilities of GPT-3 via long text
Data-to-text technologies present an enormous and exciting opportunity to help
audiences understand some of the insights present in today’s vast and growing amounts of electronic
data. In this article we analyze the potential value and benefits of these solutions as well as their risks
and limitations for a wider penetration. These technologies already bring substantial advantages of
cost, time, accuracy and clarity versus other traditional approaches or formats. On the other hand,
there are still important limitations that restrict the broad applicability of these solutions, most
importantly the limited quality of their output. However, we find that the current state of
development is sufficient for the application of these solutions across many domains and use cases and
recommend that businesses of all sectors consider how to deploy them to enhance the value they are
currently getting from their data. As the availability of data keeps growing exponentially and natural
language generation technology keeps improving, we expect data-to-text solutions to take a much
bigger role in the production of automated content across many different domains.
Text mining aims to discover new, previously unknown or hidden information by automatically extracting
it from various written resources. Applying knowledge discovery methods to
unstructured text is known as Knowledge Discovery in Text, text data mining, or text mining.
Most of the techniques used in text mining are founded on the statistical study of a term, either a word or
a phrase. Previous methods use a variety of text mining algorithms. For example, the
Single-Link Algorithm and Self-Organizing Mapping (SOM) provide an approach for visualizing
high-dimensional data and are very useful tools for processing textual data based on a projection method.
Genetic and sequential algorithms provide the capability for multiscale representation of datasets and
are fast to compute with less CPU time, based on the Isolet reduced subsets in unsupervised feature
selection. We propose a Vector Space Model and concept-based analysis algorithm that
may improve text clustering quality and achieve a better text clustering result. We think the
proposed algorithm behaves well in terms of robustness and stability with respect to the formation of
the neural network.
Identification and Classification of Named Entities in Indian Languageskevig
The process of identifying Named Entities (NEs) in a given document and then classifying them into
different categories of NEs is referred to as Named Entity Recognition (NER). A great effort is needed
to perform NER in Indian languages and achieve the same or higher accuracy as that obtained for
English and the European languages. In this paper, we present the results that we have achieved by
performing NER in Hindi, Bengali and Telugu using a Hidden Markov Model (HMM) and Performance
Metrics.
Effect of Query Formation on Web Search Engine Resultskevig
A query in a search engine is generally based on natural language. A query can be expressed in more than
one way without changing its meaning, as it depends on the thinking of the human being at a particular moment.
The aim of the searcher is to get the most relevant results irrespective of how the query has been expressed. In the
present paper, we have examined the results of a search engine for changes in coverage and in the similarity of the first
few results when a query is entered in two semantically equivalent but different formats. Searching was
carried out with the Google search engine. Fifteen pairs of queries were chosen for the study. The t-test was
used for this purpose, and the results were checked on the basis of the total documents found and the
similarity of the first five and first ten documents found in the results of a query entered in two different
formats. It was found that the total coverage is the same but the first few results are significantly different.
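The comparison above relies on a t-test over pairs of query formulations. As a minimal sketch of how such a paired test statistic could be computed, the snippet below uses only the Python standard library. The function name `paired_t_statistic` and the sample values are hypothetical and chosen purely for illustration; they are not the study's data, and the abstract does not say which t-test variant the authors applied.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for two paired samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    # Work on the per-pair differences, as a paired t-test does.
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    n = len(diffs)
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical illustration: number of shared top-five results observed
# for five query pairs under two ways of counting (made-up values).
format_a = [5, 3, 4, 2, 5]
format_b = [4, 3, 2, 2, 3]
t_value, dof = paired_t_statistic(format_a, format_b)
print(f"t = {t_value:.3f}, degrees of freedom = {dof}")
```

The resulting statistic would then be compared against a critical value of the t distribution for the chosen significance level and degrees of freedom.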
Similar to May 2024 - Top10 Cited Articles in Natural Language Computing
This document discusses using automatic text analysis techniques to streamline the process of multi-dimensional analysis of collaborative learning discussions. It describes a tool called TagHelper that was evaluated against a hand-coded corpus with a 7-dimensional coding scheme. TagHelper achieved a Cohen's Kappa agreement of over 0.7 for 6 of the 7 dimensions when considering only the text segments it was most confident about, and was confident in its coding for at least 88% of the corpus for 5 of those dimensions. The document motivates the need for such automatic analysis to reduce the time and effort required for manual coding of collaborative learning data.
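The agreement figure quoted above is Cohen's Kappa, which corrects raw agreement between two coders for the agreement expected by chance. A minimal sketch of the calculation follows; the function name `cohens_kappa` and the toy labels are hypothetical and are not taken from the TagHelper evaluation.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both coders.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each coder's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical labels from a human coder and an automatic coder.
human = ["question", "question", "answer", "answer"]
tool = ["question", "answer", "answer", "answer"]
kappa = cohens_kappa(human, tool)
print(f"kappa = {kappa:.2f}")
```

In this toy case the coders agree on three of four segments, but half of that agreement is expected by chance, so the corrected score is lower than the raw 75% agreement.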
An Ontology-Based Information Extraction Approach For RésumésRichard Hogue
This document discusses developing an ontology-driven information extraction system called the Ontology-based Résumé Parser (ORP) to extract information like experiences, qualifications, education, and personal details from millions of résumés in English and Turkish. The system uses various domain ontologies within its Ontology Knowledgebase to semantically parse résumés and match concepts. It aims to assist with expert finding and skills aggregation by analyzing data semantically rather than just syntactically matching keywords. The Résumé Ontology is described in detail to represent information typically included in résumés through semantic annotations.
This document is a thesis that proposes using word embeddings to improve information retrieval by addressing term mismatch issues. It discusses word2vec, a technique for learning word embeddings from large text corpora that capture semantic relationships between words. The thesis proposes two approaches: 1) incorporating word embedding similarities into a probabilistic language model for retrieval and 2) a vector space model. Due to time constraints, only the first approach is implemented, which integrates word embeddings into ALMasri and Chevallet's probabilistic language model. Experiments are conducted to evaluate the impact of using semantic features from word embeddings on retrieval effectiveness.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
The document discusses text mining and summarizes several key points:
1) Text mining involves deriving patterns and trends from text to discover useful knowledge, but it is challenging to accurately evaluate features due to issues like polysemy and synonymy.
2) Phrase-based approaches could perform better than term-based approaches by carrying more semantic meaning, but have faced challenges due to low phrase frequencies and redundant/noisy phrases.
3) The proposed approach uses pattern mining to discover specific patterns and evaluates term weights based on pattern distributions rather than full document distributions to address misinterpretation issues and improve accuracy.
SWSN UNIT-3.pptx we can information about swsn professionalgowthamnaidu0986
Ontology engineering involves constructing ontologies through various methods. It begins with defining the scope and evaluating existing ontologies for reuse. Terms are enumerated and organized in a taxonomy with defined properties, facets, and instances. The ontology is checked for anomalies and refined iteratively. Popular tools for ontology development include Protege and WebOnto. Methods like METHONTOLOGY and the On-To-Knowledge methodology provide processes for building ontologies from scratch or reusing existing ones. Ontology sharing requires mapping between ontologies to allow interoperability, and libraries exist for storing and accessing ontologies.
Relevance feature discovery for text miningredpel dot com
The document discusses relevance feature discovery for text mining. It presents an innovative model that discovers both positive and negative patterns in text documents as higher-level features and uses them to classify terms into categories and update term weights based on their specificity and distribution in patterns. Experiments on standard datasets show the proposed model outperforms both term-based and pattern-based methods.
Great model a model for the automatic generation of semantic relations betwee...ijcsity
The large available amount of non-structured texts that belong to different domains such as healthcare (e.g. medical records), justice (e.g. laws, declarations), insurance (e.g. declarations), etc. increases the effort required for the analysis of information in a decision making process. Different projects and tools have proposed strategies to reduce this complexity by classifying, summarizing or annotating the texts. Particularly, text summary strategies have proven to be very useful to provide a compact view of an original text. However, the available strategies to generate these summaries do not fit very well within the domains that require taking into consideration the temporal dimension of the text (e.g. a recent piece of text in a medical record is more important than a previous one) and the profile of the person who requires the summary (e.g. the medical specialization). To cope with these limitations, this paper presents "GReAT", a model for automatic summary generation that relies on natural language processing and text mining techniques to extract the most relevant information from narrative texts and discover new information from the detection of related information. The GReAT model was implemented in software to be validated in a health institution, where it has been shown to be very useful to display a preview of the information about medical health records and discover new facts and hypotheses within the information. Several tests, such as Functionality, Usability and Performance, were executed on the implemented software. In addition, precision and recall measures were applied on the results obtained through the implemented tool, as well as on the loss of information obtained by providing a text shorter than the original.
Classification of News and Research Articles Using Text Pattern MiningIOSR Journals
This document summarizes a research paper that proposes a method for classifying news and research articles using text pattern mining. The method involves preprocessing text to remove stop words and perform stemming. Frequent and closed patterns are then discovered from the preprocessed text. These patterns are structured into a taxonomy and deployed to classify new documents. The method also involves evolving patterns by reshuffling term supports within patterns to reduce the effects of noise from negative documents. Over 80% of documents were successfully classified using this pattern-based approach.
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACHIJDKP
Text mining is an emerging research field evolving from the information retrieval area. Clustering and
classification are the two approaches in data mining which may also be used to perform text classification
and text clustering. Classification is supervised while clustering is unsupervised. In this paper, our objective is
to perform text clustering by defining an improved distance metric to compute the similarity between two
text files. We use incremental frequent pattern mining to find frequent items and reduce dimensionality.
The improved distance metric may also be used to perform text classification. The distance metric is
validated for the worst, average and best case situations [15]. The results show the proposed distance
metric outperforms the existing measures.
EXPERT OPINION AND COHERENCE BASED TOPIC MODELINGijnlc
In this paper, we propose a novel algorithm that rearranges the topic assignment results obtained from topic
modeling algorithms, including NMF and LDA. The effectiveness of the algorithm is measured by how much
the results conform to expert opinion, which is a data structure called TDAG that we defined to represent the
probability that a pair of highly correlated words appear together. In order to make sure that the internal
structure does not get changed too much by the rearrangement, coherence, which is a well known metric
for measuring the effectiveness of topic modeling, is used to control the balance of the internal structure.
We developed two ways to systematically obtain the expert opinion from data, depending on whether the
data has relevant expert writing or not. The final algorithm, which takes into account both coherence and
expert opinion, is presented. Finally, we compare the amount of adjustment needed for each topic
modeling method, NMF and LDA.
ONTOLOGY BASED DOCUMENT CLUSTERING USING MAPREDUCE ijdms
Nowadays, document clustering is considered as a data intensive task due to the dramatic, fast increase in
the number of available documents. Nevertheless, the features that represent those documents are also too
large. The most common method for representing documents is the vector space model, which represents
document features as a bag of words and does not represent semantic relations between words. In this
paper we introduce a distributed implementation for the bisecting k-means using MapReduce programming
model. The aim behind our proposed implementation is to solve the problem of clustering intensive data
documents. In addition, we propose integrating the WordNet ontology with bisecting k-means in order to
utilize the semantic relations between words to enhance document clustering results. Our presented
experimental results show that using lexical categories for nouns only enhances internal evaluation
measures of document clustering; and decreases the documents features from thousands to tens features.
Our experiments were conducted using Amazon ElasticMapReduce to deploy the Bisecting k-means
algorithm
The document describes a comparative study of various machine learning and neural network models for detecting abusive language on Twitter. It finds that a bidirectional GRU network trained on word-level features, with a Latent Topic Clustering module, achieves the most accurate results with an F1 score of 0.805 for detecting abusive tweets. Additionally, it explores using context tweets as additional features and finds this improves some models' performance.
This document provides an overview of natural language processing (NLP) research trends presented at ACL 2020, including shifting away from large labeled datasets towards unsupervised and data augmentation techniques. It discusses the resurgence of retrieval models combined with language models, the focus on explainable NLP models, and reflections on current achievements and limitations in the field. Key papers on BERT and XLNet are summarized, outlining their main ideas and achievements in advancing the state-of-the-art on various NLP tasks.
International Journal of Computational Engineering Research(IJCER) ijceronline
International Journal of Computational Engineering Research (IJCER) is an international online journal published monthly in English. The journal publishes original research work that contributes significantly to furthering scientific knowledge in engineering and technology.
A Review Of Text Mining Techniques And ApplicationsLisa Graves
This document provides a review of various text mining techniques and applications. It discusses techniques used for text classification and summarization, including Naive Bayes classification, backpropagation neural networks, keyword matching, and information extraction. It also covers applications of text mining in areas like sentiment analysis of social media posts and hotel reviews. Finally, it discusses the need for organizational text mining to extract useful information and insights from large amounts of unstructured text data.
A SEMANTIC METADATA ENRICHMENT SOFTWARE ECOSYSTEM BASED ON TOPIC METADATA ENR...IJDKP
As existing computer search engines struggle to understand the meaning of natural language, semantically
enriched metadata may improve interest-based search engine capabilities and user satisfaction.
This paper presents an enhanced version of the ecosystem focusing on semantic topic metadata detection
and enrichments. It is based on a previous paper, a semantic metadata enrichment software ecosystem
(SMESE). Through text analysis approaches for topic detection and metadata enrichment, this paper
proposes an algorithm to enhance search engine capabilities and consequently help users find content
according to their interests. It presents the design, implementation and evaluation of the SATD (Scalable
Annotation-based Topic Detection) model and algorithm using metadata from the web, linked open data,
concordance rules, and bibliographic record authorities. It includes a prototype of a semantic engine using
keyword extraction, classification and concept extraction that allows generating semantic topics by text,
and multimedia document analysis using the proposed SATD model and algorithm.
The performance of the proposed ecosystem is evaluated using a number of prototype simulations by
comparing them to existing enriched metadata techniques (e.g., AlchemyAPI, DBpedia, Wikimeta, Bitext,
AIDA, TextRazor). It was noted that SATD algorithm supports more attributes than other algorithms. The
results show that the enhanced platform and its algorithm enable greater understanding of documents
related to user interests.
This paper discusses the capabilities and limitations of GPT-3, a state-of-the-art language model, in the
context of text understanding. We begin by describing the architecture and training process of GPT-3, and
provide an overview of its impressive performance across a wide range of natural language processing
tasks, such as language translation, question-answering, and text completion. Throughout this research
project, a summarizing tool was also created to help us retrieve content from any type of document,
specifically IELTS Reading Test data in this project. We also aimed to improve the accuracy of the
summarizing, as well as question-answering capabilities of GPT-3 via long text
Investigations of the Distributions of Phonemic Durations in Hindi and Dogrikevig
Speech generation is one of the most important areas of research in speech signal processing and is now gaining serious attention. Speech is a natural form of communication in all living things. Computers with the ability to understand speech and speak with a human-like voice are expected to contribute to the development of a more natural man-machine interface. However, in order to give them functions that are even closer to those of human beings, we must learn more about the mechanisms by which speech is produced and perceived, and develop speech information processing technologies that can generate more natural sounding systems. This field of study, also called speech synthesis and more prominently known as text-to-speech synthesis, originated in the mid eighties because of the emergence of DSP and the rapid advancement of VLSI techniques. To understand this field of speech, it is necessary to understand the basic theory of speech production. Every language has different phonetic alphabets and a different set of possible phonemes and their combinations.
For the analysis of the speech signal, we have carried out the recording of five speakers in Dogri (3 male and 5 females) and eight speakers in Hindi language (4 male and 4 female). For estimating the durational distributions, the mean of the means of ten instances of vowels of each speaker in both languages has been calculated. Investigations have shown that the two durational distributions differ significantly with respect to mean and standard deviation. The duration of a phoneme is speaker dependent. The whole investigation can be concluded with the end result that almost all the Dogri phonemes have a shorter duration in comparison to Hindi phonemes. The duration in milliseconds of the same phonemes when uttered in Hindi was found to be longer compared to when they were spoken by a person with Dogri as his mother tongue. There are many applications which are directly or indirectly related to the research being carried out. For instance, the main application may be transforming Dogri speech into Hindi and vice versa, and, further utilizing this application, we can develop a speech aid to teach Dogri to children. The results may also be useful for synthesizing the phonemes of Dogri using the parameters of the phonemes of Hindi and for building large vocabulary speech recognition systems.
Effect of Singular Value Decomposition Based Processing on Speech Perceptionkevig
Speech is an important biological signal: it is the primary mode of communication among human beings and the most natural and efficient form of exchanging information. Speech processing is one of the most important aspects of signal processing. In this paper the linear algebra technique called singular value decomposition (SVD) is applied to the speech signal. SVD is a technique for deriving important parameters of a signal. The parameters derived using SVD may further be reduced by perceptual evaluation of the synthesized speech using only perceptually important parameters, so that the speech signal can be compressed and the information transformed into compressed form without losing its quality. This technique finds wide applications in speech compression, speech recognition, and speech synthesis. The objective of this paper is to investigate the effect of SVD based feature selection of the input speech on the perception of the processed speech signal. Speech signals in the form of the vowels \a\, \e\, \u\ were recorded from each of six speakers (3 males and 3 females). The vowels for the six speakers were analyzed using SVD based processing and the effect of the reduction in singular values was investigated on the perception of the resynthesized vowels using reduced singular values. Investigations have shown that the number of singular values can be drastically reduced without significantly affecting the perception of the vowels.
Identifying Key Terms in Prompts for Relevance Evaluation with GPT Modelskevig
Relevance evaluation of a query and a passage is essential in Information Retrieval (IR). Recently, numerous studies have been conducted on tasks related to relevance judgment using Large Language Models (LLMs) such as GPT-4,
demonstrating significant improvements. However, the efficacy of LLMs is considerably influenced by the design of the prompt. The purpose of this paper is to
identify which specific terms in prompts positively or negatively impact relevance
evaluation with LLMs. We employed two types of prompts: those used in previous
research and generated automatically by LLMs. By comparing the performance of
these prompts in both few-shot and zero-shot settings, we analyze the influence of
specific terms in the prompts. We have observed two main findings from our study.
First, we discovered that prompts using the term ‘answer’ lead to more effective
relevance evaluations than those using ‘relevant.’ This indicates that a more direct
approach, focusing on answering the query, tends to enhance performance. Second,
we noted the importance of appropriately balancing the scope of ‘relevance.’ While
the term ‘relevant’ can extend the scope too broadly, resulting in less precise evaluations, an optimal balance in defining relevance is crucial for accurate assessments.
The inclusion of few-shot examples helps in more precisely defining this balance.
By providing clearer contexts for the term ‘relevance,’ few-shot examples help
refine relevance criteria. In conclusion, our study highlights the significance of
carefully selecting terms in prompts for relevance evaluation with LLMs.
In recent years, great advances have been made in the speed, accuracy, and coverage of automatic word
sense disambiguation systems that, given a word appearing in a certain context, can identify the sense of
that word. In this paper we consider the problem of deciding whether the same words contained in different
documents carry the same meaning or are homonyms. Our goal is to improve the estimate of the
similarity of documents in which some words may be used with different meanings. We present three new
strategies for solving this problem, which are used to filter out homonyms from the similarity computation.
Two of them are intrinsically non-semantic, whereas the other one has a semantic flavor and can also be
applied to word sense disambiguation. The three strategies have been embedded in an article document
recommendation system that one of the most important Italian ad-serving companies offers to its customers.
Genetic Approach For Arabic Part Of Speech Taggingkevig
With the growing number of textual resources available, the ability to understand them becomes critical.
An essential first step in understanding these sources is the ability to identify the parts-of-speech in each
sentence. Arabic is a morphologically rich language, which presents a challenge for part of speech
tagging. In this paper, our goal is to propose, improve, and implement a part-of-speech tagger based on a
genetic algorithm. The accuracy obtained with this method is comparable to that of other probabilistic
approaches.
Rule Based Transliteration Scheme for English to Punjabikevig
Machine transliteration has emerged as a very important research area in the field of
machine translation. Transliteration basically aims to preserve the phonological structure of words. Proper
transliteration of named entities plays a very significant role in improving the quality of machine translation.
In this paper we perform machine transliteration for the English-Punjabi language pair using a rule based
approach. We have constructed some rules for syllabification. Syllabification is the process of extracting or
separating the syllables from words. We calculate probabilities for named entities (proper
names and locations). For words that do not fall under the category of named entities, separate
probabilities are calculated using relative frequency through a statistical machine translation
toolkit known as MOSES. Using these probabilities, we transliterate our input text from English to
Punjabi.
Improving Dialogue Management Through Data Optimizationkevig
In task-oriented dialogue systems, the ability for users to effortlessly communicate with machines and computers through natural language stands as a critical advancement. Central to these systems is the dialogue manager, a pivotal component tasked with navigating the conversation to effectively meet user goals by selecting the most appropriate response. Traditionally, the development of sophisticated dialogue management has embraced a variety of methodologies, including rule-based systems, reinforcement learning, and supervised learning, all aimed at optimizing response selection in light of user inputs. This research casts a spotlight on the pivotal role of data quality in enhancing the performance of dialogue managers. Through a detailed examination of prevalent errors within acclaimed datasets, such as Multiwoz 2.1 and SGD, we introduce an innovative synthetic dialogue generator designed to control the introduction of errors precisely. Our comprehensive analysis underscores the critical impact of dataset imperfections, especially mislabeling, on the challenges inherent in refining dialogue management processes.
Document Author Classification using Parsed Language Structurekevig
Over the years there has been ongoing interest in detecting authorship of a text based on statistical properties of the text, such as by using occurrence rates of noncontextual words. In previous work, these techniques have been used, for example, to determine authorship of all of The Federalist Papers. Such methods may be useful in more modern times to detect fake or AI authorship. Progress in statistical natural language parsers introduces the possibility of using grammatical structure to detect authorship. In this paper we explore a new possibility for detecting authorship using grammatical structural information extracted using a statistical natural language parser. This paper provides a proof of concept, testing author classification based on grammatical structure on a set of “proof texts,” The Federalist Papers and Sanditon, which have been used as test cases in previous authorship detection studies. Several features extracted from the statistical natural language parser were explored: all subtrees of some depth from any level; rooted subtrees of some depth; part of speech; and part of speech by level in the parse tree. It was found to be helpful to project the features into a lower dimensional space. Statistical experiments on these documents demonstrate that information from a statistical parser can, in fact, assist in distinguishing authors.
Rag-Fusion: A New Take on Retrieval Augmented Generationkevig
Infineon has identified a need for engineers, account managers, and customers to rapidly obtain product information. This problem is traditionally addressed with retrieval-augmented generation (RAG) chatbots, but in this study, I evaluated the use of the newly popularized RAG-Fusion method. RAG-Fusion combines RAG and reciprocal rank fusion (RRF) by generating multiple queries, reranking them with reciprocal scores and fusing the documents and scores. Through manually evaluating answers on accuracy, relevance, and comprehensiveness, I found that RAG-Fusion was able to provide accurate and comprehensive answers due to the generated queries contextualizing the original query from various perspectives. However, some answers strayed off topic when the generated queries' relevance to the original query was insufficient. This research marks significant progress in artificial intelligence (AI) and natural language processing (NLP) applications and demonstrates transformations in a global and multi-industry context.
Performance, Energy Consumption and Costs: A Comparative Analysis of Automati...kevig
The common practice in Machine Learning research is to evaluate the top-performing models based on their performance. However, this often leads to overlooking other crucial aspects that should be given careful consideration. In some cases, the performance differences between various approaches may be insignificant, whereas factors like production costs, energy consumption, and carbon footprint should be taken into account. Large Language Models (LLMs) are widely used in academia and industry to address NLP problems. In this study, we present a comprehensive quantitative comparison between traditional approaches (SVM-based) and more recent approaches such as LLM (BERT family models) and generative models (GPT2 and LLAMA2), using the LexGLUE benchmark. Our evaluation takes into account not only performance parameters (standard indices), but also alternative measures such as timing, energy consumption and costs, which collectively contribute to the carbon footprint. To ensure a complete analysis, we separately considered the prototyping phase (which involves model selection through training-validation-test iterations) and the in-production phases. These phases follow distinct implementation procedures and require different resources. The results indicate that simpler algorithms often achieve performance levels similar to those of complex models (LLM and generative models), consuming much less energy and requiring fewer resources. These findings suggest that companies should weigh additional factors when choosing machine learning (ML) solutions. The analysis also demonstrates that it is increasingly necessary for the scientific world to begin to consider aspects of energy consumption in model evaluations, in order to be able to give real meaning to the results obtained using standard metrics (Precision, Recall, F1 and so on).
Evaluation of Medium-Sized Language Models in German and English Languagekevig
Large language models (LLMs) have garnered significant attention, but the definition of “large” lacks clarity. This paper focuses on medium-sized language models (MLMs), defined as having at least six billion parameters but fewer than 100 billion. The study evaluates MLMs regarding zero-shot generative question answering, which requires models to provide elaborate answers without external document retrieval. The paper introduces its own test dataset and presents results from human evaluation. Results show that combining the best answers from different MLMs yielded an overall correct answer rate of 82.7%, which is better than the 60.9% of ChatGPT. The best MLM achieved 71.8% and has 33B parameters, which highlights the importance of using appropriate training data for fine-tuning rather than solely relying on the number of parameters. More fine-grained feedback should be used to further improve the quality of answers. The open source community is quickly closing the gap to the best commercial models.
International Conference on NLP, Artificial Intelligence, Machine Learning an...gerogepatton
International Conference on NLP, Artificial Intelligence, Machine Learning and Applications (NLAIM 2024) offers a premier global platform for exchanging insights and findings in the theory, methodology, and applications of NLP, Artificial Intelligence, Machine Learning, and their applications. The conference seeks substantial contributions across all key domains of NLP, Artificial Intelligence, Machine Learning, and their practical applications, aiming to foster both theoretical advancements and real-world implementations. With a focus on facilitating collaboration between researchers and practitioners from academia and industry, the conference serves as a nexus for sharing the latest developments in the field.
May 2024 - Top10 Cited Articles in Natural Language Computing
1. May 2024: Top10 Cited Articles in Natural
Language Computing
International Journal on Natural Language
Computing (IJNLC)
https://airccse.org/journal/ijnlc/index.html
ISSN: 2278 - 1307 [Online]; 2319 - 4111 [Print]
Google Scholar
https://scholar.google.com/citations?user=A5tqIdoAAAAJ&hl=en
2. Rag-Fusion: A New Take on Retrieval Augmented Generation
Zackary Rackauckas, Infineon Technologies, California
Abstract
Infineon has identified a need for engineers, account managers, and customers to rapidly obtain
product information. This problem is traditionally addressed with retrieval-augmented generation
(RAG) chatbots, but in this study, I evaluated the use of the newly popularized RAG-Fusion
method. RAG-Fusion combines RAG and reciprocal rank fusion (RRF) by generating multiple
queries, reranking them with reciprocal scores and fusing the documents and scores. Through
manually evaluating answers on accuracy, relevance, and comprehensiveness, I found that RAG-
Fusion was able to provide accurate and comprehensive answers due to the generated queries
contextualizing the original query from various perspectives. However, some answers strayed off
topic when the generated queries' relevance to the original query was insufficient. This research
marks significant progress in artificial intelligence (AI) and natural language processing (NLP)
applications and demonstrates transformations in a global and multi-industry context.
Keywords
Chatbot, Retrieval-augmented Generation, Reciprocal Rank Fusion, Natural Language
Processing
Full Text: https://aircconline.com/ijnlc/V13N1/13124ijnlc03.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol13.html
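The fusion step described in the abstract above can be illustrated with a minimal sketch of reciprocal rank fusion. This is not the author's implementation: the function name, the document ids, and the smoothing constant k = 60 (a commonly used default) are all assumptions made for illustration, and query generation and retrieval are replaced by hard-coded ranked lists.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first ranked lists of document ids into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in;
    the contributions are summed. k = 60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical retrieval results, one ranked list per generated query.
rankings = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_b", "doc_c", "doc_a"],
]
fused = reciprocal_rank_fusion(rankings)
fused_ids = [doc_id for doc_id, _ in fused]
print(fused_ids)
```

A document that ranks well under several generated queries accumulates a higher fused score than one that ranks first under a single query, which is one way to read the abstract's remark about contextualizing the original query from various perspectives.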
3. Performance, Energy Consumption and Costs: A Comparative Analysis of Automatic Text
Classification Approaches in the Legal Domain
Leonardo Rigutini1, Achille Globo1, Marco Stefanelli2, Andrea Zugarini1, Sinan Gultekin1,
Marco Ernandes1, 1expert.ai spa, Italy, 2University of Siena, Italy
Abstract
The common practice in Machine Learning research is to evaluate the top-performing models
based on their performance. However, this often leads to overlooking other crucial aspects that
should be given careful consideration. In some cases, the performance differences between
various approaches may be insignificant, whereas factors like production costs, energy
consumption, and carbon footprint should be taken into account. Large Language Models
(LLMs) are widely used in academia and industry to address NLP problems. In this study, we
present a comprehensive quantitative comparison between traditional approaches (SVM-based)
and more recent approaches such as LLM (BERT family models) and generative models (GPT2
and LLAMA2), using the LexGLUE benchmark. Our evaluation takes into account not only
performance parameters (standard indices), but also alternative measures such as timing, energy
consumption and costs, which collectively contribute to the carbon footprint. To ensure a
complete analysis, we separately considered the prototyping phase (which involves model
selection through training-validation-test iterations) and the in-production phases. These phases
follow distinct implementation procedures and require different resources. The results indicate
that simpler algorithms often achieve performance levels similar to those of complex models
(LLM and generative models), consuming much less energy and requiring fewer resources.
These findings suggest that companies should weigh additional factors when choosing
machine learning (ML) solutions. The analysis also demonstrates that it is increasingly necessary
for the scientific world to also begin to consider aspects of energy consumption in model
evaluations, in order to be able to give real meaning to the results obtained using standard
metrics (Precision, Recall, F1 and so on).
Keywords
NLP, text mining, green AI, green NLP, carbon footprint, energy consumption, evaluation.
Full Text: https://aircconline.com/ijnlc/V13N1/13124ijnlc02.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol13.html
4. A Study on the Appropriate Size of the Mongolian General Corpus
Choi Sun Soo1 and Ganbat Tsend2, 1University of the Humanities, Mongolia, 2Otgontenger
University, Mongolia
Abstract
This study aims to determine the appropriate size of the Mongolian general corpus. This study
used the Heaps’ function and Type-Token Ratio (TTR) to determine the appropriate size of the
Mongolian general corpus. This study’s sample corpus of 906,064 tokens comprised texts from
10 domains of newspaper politics, economy, society, culture, sports, world articles and laws,
middle and high school literature textbooks, interview articles, and podcast transcripts. First, we
estimated the Heaps’ function with this sample corpus. Next, we observed changes in the number
of types and TTR values while increasing the number of tokens by one million using the
estimated Heaps’ function. As a result of observation, we found that the TTR value hardly
changed when the number of tokens exceeded 39-42 million. Thus, we conclude that an
appropriate size for a Mongolian general corpus is 39-42 million tokens.
Keywords
Mongolian general corpus, Appropriate size of corpus, Sample corpus, Heaps’ function, TTR,
Type, Token.
Full Text: https://aircconline.com/ijnlc/V12N3/12323ijnlc02.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol12.html
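The procedure in the abstract above (estimate a Heaps' function from a sample, then watch how the type count and TTR change as tokens grow) can be sketched as follows. The sketch assumes the usual form of Heaps' law, V(N) = K * N^beta, fitted by least squares in log-log space; the abstract does not state the exact functional form or fitting method used. The sample measurements are synthetic values generated from K = 30 and beta = 0.6, not figures from the study.

```python
import math


def type_token_ratio(num_types, num_tokens):
    """TTR: distinct word types divided by total tokens."""
    return num_types / num_tokens


def fit_heaps(token_counts, type_counts):
    """Fit V = K * N**beta via least squares on log V = log K + beta * log N."""
    xs = [math.log(n) for n in token_counts]
    ys = [math.log(v) for v in type_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    K = math.exp(mean_y - beta * mean_x)
    return K, beta


# Synthetic (tokens, types) measurements; purely illustrative.
token_counts = [10_000, 100_000, 500_000, 900_000]
type_counts = [30 * n ** 0.6 for n in token_counts]

K, beta = fit_heaps(token_counts, type_counts)

# Project types and TTR while increasing the corpus by one million tokens.
projected = []
for millions in range(1, 51):
    n = millions * 1_000_000
    v = K * n ** beta
    projected.append((n, v, type_token_ratio(v, n)))

for n, v, ttr in projected[::10]:
    print(f"{n:>11,d} tokens  {v:>12,.0f} types  TTR={ttr:.5f}")
```

Under this model the TTR falls monotonically but ever more slowly, so an "appropriate size" depends on a chosen threshold for how small the change in TTR must become; that threshold is a judgment the abstract does not specify.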
5. Evaluating BERT and ParsBERT for Analyzing Persian Advertisement Data
Ali Mehrban1 and Pegah Ahadian2, 1Newcastle University, UK, 2Kent State University,
USA
Abstract
This paper discusses the impact of the Internet on modern trading and the importance of data
generated from these transactions for organizations to improve their marketing efforts. The paper
uses the example of Divar, an online marketplace for buying and selling products and services in
Iran, and presents a competition to predict the percentage of a car sales ad that would be
published on the Divar website. Since the dataset provides a rich source of Persian text data, the
authors use the Hazm library, a Python library designed for processing Persian text, and two
state-of-the-art language models, mBERT and ParsBERT, to analyze it. The paper's primary
objective is to compare the performance of mBERT and ParsBERT on the Divar dataset. The
authors provide some background on data mining, Persian language, and the two language
models, examine the dataset's composition and statistical features, and provide details on their
fine-tuning and training configurations for both approaches. They present the results of their
analysis and highlight the strengths and weaknesses of the two language models when applied to
Persian text data. The paper offers valuable insights into the challenges and opportunities of
working with low-resource languages such as Persian and the potential of advanced language
models like BERT for analyzing such data. The paper also explains the data mining process,
including steps such as data cleaning and normalization techniques. Finally, the paper discusses
the types of machine learning problems, such as supervised, unsupervised, and reinforcement
learning, and the pattern evaluation techniques, such as confusion matrix. Overall, the paper
provides an informative overview of the use of language models and data mining techniques for
analyzing text data in low-resource languages, using the example of the Divar dataset.
Keywords
Text Recognition, Persian text, NLP, mBERT, ParsBERT
Full Text: https://aircconline.com/ijnlc/V12N2/12223ijnlc02.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol12.html
6. Understanding Chinese Moral Stories with Further Pre-Training
Jing Qian1, Yong Yue1, Katie Atkinson2 and Gangmin Li3, 1Xi’an Jiaotong Liverpool
University, China, 2University of Liverpool, UK, 3University of Bedfordshire, UK
Abstract
The goal of moral understanding is to grasp the theoretical concepts embedded in a narrative by
delving beyond the concrete occurrences and dynamic personas. Specifically, the narrative is
compacted into a single statement without involving any characters within the original text,
necessitating a more astute language model that can comprehend connotative morality and
exhibit commonsense reasoning. The “pre-training + fine-tuning” paradigm is widely embraced
in neural language models. In this paper, we propose an intermediary phase to establish an
improved paradigm of “pre-training + further pre-training + fine-tuning”. Further pre-training
generally refers to continual learning on task-specific or domain-relevant corpora before being
applied to target tasks, which aims at bridging the gap in data distribution between the phases of
pre-training and fine-tuning. Our work is based on a Chinese dataset named STORAL-ZH that
consists of 4k human-written story-moral pairs. Furthermore, we design a two-step process of
domain-adaptive pre-training in the intermediary phase. The first step depends on a newly
collected Chinese dataset of Confucian moral culture. The second step is based on the Chinese
version of a frequently used commonsense knowledge graph (i.e. ATOMIC) to enrich the
backbone model with inferential knowledge besides morality. By comparison with several
advanced models including BERT-base, RoBERTa-base and T5-base, experimental results on
two understanding tasks demonstrate the effectiveness of our proposed three-phase paradigm.
Keywords
Moral Understanding, Further Pre-training, Knowledge Graph, Pre-trained Language Model
Full Text: https://aircconline.com/ijnlc/V12N2/12223ijnlc01.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol12.html
7. LOCATION-BASED SENTIMENT ANALYSIS OF 2019 NIGERIA PRESIDENTIAL
ELECTION USING A VOTING ENSEMBLE APPROACH
Ikechukwu Onyenwe1, Samuel N.C. Nwagbo2, Ebele Onyedinma1, Onyedika Ikechukwu-Onyenwe1,
Chidinma A. Nwafor3 and Obinna Agbata1, 1Computer Science Department, Nnamdi Azikiwe
University, Onitsha-Enugu Expressway, Awka, PMB 5025, Anambra, Nigeria, 2Political Science
Department, Nnamdi Azikiwe University, Onitsha-Enugu Expressway, Awka, PMB 5025, Anambra,
Nigeria, 3Computer Science Department, Nigerian Army College of Environmental Science and
Technology, North-Bank, Makurdi, PMB 102272, Benue, Nigeria
Abstract
Nigerian President Buhari defeated his closest rival Atiku Abubakar by over 3 million votes. He
was issued a Certificate of Return and was sworn in on 29 May 2019. However, there were
claims of widespread hoax by the opposition. The sentiment analysis captures the opinions of the
masses over social media for global events. In this paper, we use 2019 Nigeria presidential
election tweets to perform sentiment analysis through the application of a voting ensemble
approach (VEA) in which the predictions from multiple techniques are combined to find the best
polarity of a tweet (sentence). This is to determine public views on the 2019 Nigeria Presidential
elections and compare them with actual election results. Our sentiment analysis experiment is
focused on location-based viewpoints where we used Twitter location data. For this experiment,
we live-streamed Nigeria 2019 election tweets via the Twitter API to create a dataset of 583,816
tweets, pre-processed the data, and applied VEA by utilizing three different sentiment classifiers
to obtain the best polarity of a given tweet. Furthermore, we segmented our tweets dataset
into Nigerian states and geopolitical zones, then plotted state-wise and geopolitical-wise user
sentiments towards Buhari and Atiku and their political parties. The objective of segmenting
by state and geopolitical zone is to evaluate how closely the sentiment of location-based
tweets matches the actual election results. The results reveal that, in most cases, election
outcomes coincide with the sentiment expressed on Twitter, as shown by the polarity scores of
different locations, but there are also some election results for which our location analysis
similarity test failed.
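The voting step described in this abstract can be illustrated with a minimal sketch. It assumes simple majority voting over the labels returned by three classifiers; the label names and the tie-breaking rule (falling back to "neutral") are assumptions for illustration and are not taken from the paper.

```python
from collections import Counter


def vote_polarity(predictions, tie_label="neutral"):
    """Combine polarity labels from several classifiers by majority vote.

    `predictions` is a list of labels such as "positive", "negative" or
    "neutral", one per classifier. If the top two labels receive the same
    number of votes, `tie_label` is returned (an assumed tie-breaking rule).
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    ranked = Counter(predictions).most_common()
    # A tie exists when the two most frequent labels have equal counts.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_label
    return ranked[0][0]


# Hypothetical outputs of three sentiment classifiers for one tweet.
print(vote_polarity(["positive", "positive", "negative"]))
```

With three classifiers and three possible labels, a tie only occurs when all three disagree, so the fallback label is used rarely.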
Keywords
Nigeria, Election, Sentiment Analysis, Politics, Tweets, Exploration Data Analysis, location data
Full Text: https://aircconline.com/ijnlc/V12N1/12123ijnlc01.pdf
Volume URL: https://airccse.org/journal/ijnlc/vol12.html
8. Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional Context
for Continuous Speech Recognition
Piyush Behre, Sharman Tan, Padma Varadharajan and Shuangyu Chang, Microsoft
Corporation
Abstract
While speech recognition Word Error Rate (WER) has reached human parity for English,
continuous speech recognition scenarios such as voice typing and meeting transcriptions still
suffer from segmentation and punctuation problems, resulting from irregular pausing patterns or
slow speakers. Transformer sequence tagging models are effective at capturing long bidirectional
context, which is crucial for automatic punctuation. Automatic Speech Recognition
(ASR) production systems, however, are constrained by real-time requirements, making it hard
to incorporate the right context when making punctuation decisions. Context within the segments
produced by ASR decoders can be helpful, but it limits overall punctuation performance for a
continuous speech session. In this paper, we propose a streaming approach for punctuation or
re-punctuation of ASR output using dynamic decoding windows and measure its impact on
punctuation and segmentation accuracy across scenarios. The new system tackles
over-segmentation issues, improving segmentation F0.5-score by 13.9%. Streaming punctuation
achieves an average BLEU score improvement of 0.66 for the downstream task of Machine
Translation (MT).
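For readers unfamiliar with the F0.5-score mentioned above, the following sketch computes the general F-beta score from precision and recall. With beta = 0.5, precision is weighted more heavily than recall. The function name and the example values are illustrative and do not come from the paper.

```python
def f_beta(precision, recall, beta=0.5):
    """Return the F-beta score: (1 + b^2) * P * R / (b^2 * P + R).

    beta < 1 favours precision; beta > 1 favours recall; beta = 1 gives
    the familiar F1-score. Returns 0.0 when the denominator is zero.
    """
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator


# Illustrative values only: high precision, moderate recall.
print(round(f_beta(1.0, 0.5, beta=0.5), 4))
```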
Keywords
automatic punctuation, automatic speech recognition, re-punctuation, speech segmentation.
Full Text: https://aircconline.com/ijnlc/V11N6/11622ijnlc01.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol11.html
9. A Robust Three-Stage Hybrid Framework for English to Bangla Transliteration
Redwan Ahmed Rizvee, Asif Mahmood, Shakur Shams Mullick and Sajjadul Hakim, Tiger
IT Bangladesh Limited, Dhaka, Bangladesh
Abstract
Phonetic typing using the English alphabet has become widely popular for social
media and chat services. As a result, text containing a mix of English and Bangla words and
phrases has become increasingly common. Existing transliteration tools perform poorly
on such texts. This paper proposes a robust Three-stage Hybrid Transliteration
(THT) framework that can transliterate both English words and phonetically typed Bangla words
satisfactorily. This is achieved by adopting a hybrid approach of dictionary-based and rule-based
techniques. Experimental results confirm the superiority of THT, as it significantly outperforms the
benchmark transliteration tool.
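The general idea of combining dictionary-based and rule-based techniques can be sketched as a lookup with a fallback. This is not the THT framework itself: the abstract does not describe its three stages, and the tiny dictionary and the rule table below are hypothetical toy data chosen only to make the example run.

```python
# Toy dictionary of known words (hypothetical entries, not from the paper).
WORD_DICTIONARY = {
    "ami": "আমি",
    "tumi": "তুমি",
}

# Toy grapheme rules (hypothetical, far from a complete Bangla mapping).
GRAPHEME_RULES = {
    "kh": "খ",
    "k": "ক",
    "m": "ম",
    "a": "া",
}


def transliterate_word(word):
    """Transliterate one word: dictionary first, then greedy rules."""
    key = word.lower()
    if key in WORD_DICTIONARY:
        return WORD_DICTIONARY[key]
    longest = max(len(g) for g in GRAPHEME_RULES)
    output, i = [], 0
    while i < len(key):
        # Try the longest grapheme first so "kh" wins over "k".
        for size in range(min(longest, len(key) - i), 0, -1):
            chunk = key[i:i + size]
            if chunk in GRAPHEME_RULES:
                output.append(GRAPHEME_RULES[chunk])
                i += size
                break
        else:
            # No rule matched: keep the character unchanged.
            output.append(key[i])
            i += 1
    return "".join(output)


print(transliterate_word("ami"), transliterate_word("kham"))
```

The dictionary handles frequent or irregular words exactly, while the rules give a best-effort result for anything unseen.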
Keywords
Transliteration framework, phonetic typing, English to Bangla, hybrid framework, THT.
Full Text: https://aircconline.com/ijnlc/V11N1/11122ijnlc04.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol11.html
10. Analyzing Architectures for Neural Machine Translation using Low Computational
Resources
Aditya Mandke, Onkar Litake, and Dipali Kadam, SCTR’s Pune Institute of Computer
Technology, India
Abstract
With the recent developments in the field of Natural Language Processing, there has been a rise
in the use of different architectures for Neural Machine Translation. Transformer architectures
are used to achieve state-of-the-art accuracy, but they are very computationally expensive to
train. Not everyone has access to setups with high-end GPUs and other resources. We
train our models on low computational resources and investigate the results. As expected,
transformers outperformed other architectures, but there were some surprising results.
Transformers consisting of more encoders and decoders took more time to train but achieved lower
BLEU scores. LSTM performed well in the experiment and took comparatively less time to train
than transformers, making it suitable for situations with time constraints.
Keywords
Machine Translation, Indic Languages, Natural Language Processing.
Full Text: https://aircconline.com/ijnlc/V10N5/10521ijnlc02.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol10.html
11. Developing Products Update-Alert System for E-Commerce Websites Users using Html
Data and Web Scraping Technique
Ikechukwu Onyenwe, Ebele Onyedinma, Chidinma Nwafor and Obinna Agbata, Nnamdi
Azikiwe University, Nigeria
Abstract
Websites are regarded as domains of limitless information which anyone and everyone can
access. The new trend of technology has shaped the way we do and manage our businesses.
Today, advancements in Internet technology have given rise to the proliferation of e-commerce
websites. This, in turn, has made the activities and lifestyles of marketers/vendors, retailers and
consumers (collectively regarded as users in this paper) easier, as it provides convenient
platforms to sell/order items through the internet. Unfortunately, these desirable benefits are not
without drawbacks, as these platforms require users to spend a lot of time and effort
searching for the best product deals, product updates and offers on e-commerce websites.
Furthermore, they need to filter and compare search results by themselves, which takes a lot of
time and may yield ambiguous results. In this paper, we applied web crawling and
scraping methods on an e-commerce website to obtain HTML data for identifying product
updates based on the current time. These HTML data are preprocessed to extract details of the
products, such as name, price, and post date and time, to serve as useful information for users.
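The extraction step described above can be sketched with the Python standard library. The HTML structure, the class names (`product`, `name`, `price`, `posted`) and the 24-hour window are assumptions made for illustration; the paper's actual website, markup and tooling are not given in the abstract.

```python
from datetime import datetime, timedelta
from html.parser import HTMLParser


class ProductParser(HTMLParser):
    """Collect name, price and post time from assumed product markup."""

    FIELDS = {"name", "price", "posted"}

    def __init__(self):
        super().__init__()
        self.products = []
        self._field = None

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "product":
            self.products.append({})
        elif css_class in self.FIELDS and self.products:
            self._field = css_class

    def handle_data(self, data):
        if self._field and data.strip():
            self.products[-1][self._field] = data.strip()
            self._field = None

    def handle_endtag(self, tag):
        self._field = None


def recent_products(html, now, window=timedelta(hours=24)):
    """Return products whose post time falls within `window` before `now`."""
    parser = ProductParser()
    parser.feed(html)
    cutoff = now - window
    return [
        p for p in parser.products
        if "posted" in p and cutoff <= datetime.fromisoformat(p["posted"]) <= now
    ]


SAMPLE_HTML = """
<div class="product"><span class="name">Phone</span>
  <span class="price">120000</span>
  <span class="posted">2021-10-01T09:30:00</span></div>
<div class="product"><span class="name">Laptop</span>
  <span class="price">450000</span>
  <span class="posted">2021-09-20T12:00:00</span></div>
"""

for item in recent_products(SAMPLE_HTML, now=datetime(2021, 10, 1, 12, 0, 0)):
    print(item["name"], item["price"], item["posted"])
```

Passing `now` explicitly, rather than calling `datetime.now()` inside the function, keeps the filter deterministic and easy to test.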
Keywords
NATURAL LANGUAGE PREPROCESSING (NLP), E-COMMERCE, E-RETAIL, HTML,
DATA, Web, Web Scraping
Full Text: https://aircconline.com/ijnlc/V10N5/10521ijnlc01.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol10.html