Information retrieval systems retrieves relevant documents based on a query submitted by the user. The documents are initially indexed and the words in the documents are assigned weights using a weighting technique called TFIDF which is the product of Term Frequency (TF) and Inverse Document Frequency (IDF). TF represents the number of occurrences of a term in a document. IDF measures whether the term is common or rare across all documents. It is computed by dividing the total number of documents in the system by the number of documents containing the term and then computing the logarithm of the quotient. By default, we use base 10 to calculate the logarithm. In this paper, we are going to test this weighting technique by using a range of log bases from 0.1 to 100.0 to calculate the IDF. Testing different log bases for vector model weighting technique is to highlight the importance of understanding the performance of the system at different weighting values. We use the documents of MED, CRAN, NPL, LISA, and CISI test collections that scientists assembled explicitly for experiments in data information retrieval systems.
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCESkevig
Relation extraction is one of the most important parts of natural language processing. It is the process of extracting relationships from a text. Extracted relationships actually occur between two or more entities of a certain type and these relations may have different patterns. The goal of the paper is to find out the noisy patterns for relation extraction of Bangla sentences. For the research work, seed tuples were needed containing two entities and the relation between them. We can get seed tuples from Freebase. Freebase is a large collaborative knowledge base and database of general, structured information for public use. But for Bangla language, there is no available Freebase. So we made Bangla Freebase which was the real challenge and it can be used for any other NLP based works. Then we tried to find out the noisy patterns for relation extraction by measuring conflict score.
ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...ijnlc
In this paper we implement a document retrieval system using the Lucene tool and we conduct some experiments in order to compare the efficiency of two different weighting schema: the well-known TF-IDF and the BM25. Then, we expand queries using a comparable corpus (wikipedia) and word embeddings. Obtained results show that the latter method (word embeddings) is a good way to achieve higher precision rates and retrieve more accurate documents
COMPARISON OF HIERARCHICAL AGGLOMERATIVE ALGORITHMS FOR CLUSTERING MEDICAL DO...ijseajournal
Extensive amount of data stored in medical documents require developing methods that help users to find
what they are looking for effectively by organizing large amounts of information into a small number of
meaningful clusters. The produced clusters contain groups of objects which are more similar to each other
than to the members of any other group. Thus, the aim of high-quality document clustering algorithms is to
determine a set of clusters in which the inter-cluster similarity is minimized and intra-cluster similarity is
maximized. The most important feature in many clustering algorithms is treating the clustering problem as
an optimization process, that is, maximizing or minimizing a particular clustering criterion function
defined over the whole clustering solution. The only real difference between agglomerative algorithms is
how they choose which clusters to merge. The main purpose of this paper is to compare different
agglomerative algorithms based on the evaluation of the clusters quality produced by different hierarchical
agglomerative clustering algorithms using different criterion functions for the problem of clustering
medical documents. Our experimental results showed that the agglomerative algorithm that uses I1 as its
criterion function for choosing which clusters to merge produced better clusters quality than the other
criterion functions in term of entropy and purity as external measures.
APPLYING GENETIC ALGORITHMS TO INFORMATION RETRIEVAL USING VECTOR SPACE MODEL IJCSEA Journal
Genetic algorithms are usually used in information retrieval systems (IRs) to enhance the information retrieval process, and to increase the efficiency of the optimal information retrieval in order to meet the users’ needs and help them find what they want exactly among the growing numbers of available information. The improvement of adaptive genetic algorithms helps to retrieve the information needed by the user accurately, reduces the retrieved relevant files and excludes irrelevant files. In this study, the researcher explored the problems embedded in this process, attempted to find solutions such as the way of choosing mutation probability and fitness function, and chose Cranfield English Corpus test collection on mathematics. Such collection was conducted by Cyrial Cleverdon and used at the University of Cranfield in 1960 containing 1400 documents, and 225 queries for simulation purposes. The researcher also used cosine similarity and jaccards to compute similarity between the query and documents, and used two proposed adaptive fitness function, mutation operators as well as adaptive crossover. The process aimed at evaluating the effectiveness of results according to the measures of precision and recall. Finally, the study concluded that we might have several improvements when using adaptive genetic algorithms.
Applying genetic algorithms to information retrieval using vector space modelIJCSEA Journal
Genetic algorithms are usually used in information retrieval systems (IRs) to enhance the information retrieval process, and to increase the efficiency of the optimal information retrieval in order to meet the users’ needs and help them find what they want exactly among the growing numbers of available information. The improvement of adaptive genetic algorithms helps to retrieve the information needed by the user accurately, reduces the retrieved relevant files and excludes irrelevant files. In this study, the researcher explored the problems embedded in this process, attempted to find solutions such as the way of choosing mutation probability and fitness function, and chose Cranfield English Corpus test collection on
mathematics. Such collection was conducted by Cyrial Cleverdon and used at the University of Cranfield in
1960 containing 1400 documents, and 225 queries for simulation purposes. The researcher also used
cosine similarity and jaccards to compute similarity between the query and documents, and used two
proposed adaptive fitness function, mutation operators as well as adaptive crossover. The process aimed at
evaluating the effectiveness of results according to the measures of precision and recall. Finally, the study
concluded that we might have several improvements when using adaptive genetic algorithms.
Applying Genetic Algorithms to Information Retrieval Using Vector Space ModelIJCSEA Journal
Genetic algorithms are usually used in information retrieval systems (IRs) to enhance the information retrieval process, and to increase the efficiency of the optimal information retrieval in order to meet the users’ needs and help them find what they want exactly among the growing numbers of available information. The improvement of adaptive genetic algorithms helps to retrieve the information needed by the user accurately, reduces the retrieved relevant files and excludes irrelevant files. In this study, the researcher explored the problems embedded in this process, attempted to find solutions such as the way of choosing mutation probability and fitness function, and chose Cranfield English Corpus test collection on mathematics. Such collection was conducted by Cyrial Cleverdon and used at the University of Cranfield in 1960 containing 1400 documents, and 225 queries for simulation purposes. The researcher also used cosine similarity and jaccards to compute similarity between the query and documents, and used two proposed adaptive fitness function, mutation operators as well as adaptive crossover. The process aimed at evaluating the effectiveness of results according to the measures of precision and recall. Finally, the study concluded that we might have several improvements when using adaptive genetic algorithms.
Answer extraction and passage retrieval forWaheeb Ahmed
—Question Answering systems (QASs) do the task of
retrieving text portions from a collection of documents that
contain the answer to the user’s questions. These QASs use a
variety of linguistic tools that be able to deal with small
fragments of text. Therefore, to retrieve the documents which
contains the answer from a large document collections, QASs
employ Information Retrieval (IR) techniques to minimize the
number of documents collections to a treatable amount of
relevant text. In this paper, we propose a model for passage
retrieval model that do this task with a better performance for
the purpose of Arabic QASs. We first segment each the top five
ranked documents returned by the IR module into passages.
Then, we compute the similarity score between the user’s
question terms and each passage. The top five passages (with
high similarity score) are retrieved are retrieved. Finally,
Answer Extraction techniques are applied to extract the final
answer. Our method achieved an average for precision of
87.25%, Recall of 86.2% and F1-measure of 87%.
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCESkevig
Relation extraction is one of the most important parts of natural language processing. It is the process of extracting relationships from a text. Extracted relationships actually occur between two or more entities of a certain type and these relations may have different patterns. The goal of the paper is to find out the noisy patterns for relation extraction of Bangla sentences. For the research work, seed tuples were needed containing two entities and the relation between them. We can get seed tuples from Freebase. Freebase is a large collaborative knowledge base and database of general, structured information for public use. But for Bangla language, there is no available Freebase. So we made Bangla Freebase which was the real challenge and it can be used for any other NLP based works. Then we tried to find out the noisy patterns for relation extraction by measuring conflict score.
ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...ijnlc
In this paper we implement a document retrieval system using the Lucene tool and we conduct some experiments in order to compare the efficiency of two different weighting schema: the well-known TF-IDF and the BM25. Then, we expand queries using a comparable corpus (wikipedia) and word embeddings. Obtained results show that the latter method (word embeddings) is a good way to achieve higher precision rates and retrieve more accurate documents
COMPARISON OF HIERARCHICAL AGGLOMERATIVE ALGORITHMS FOR CLUSTERING MEDICAL DO...ijseajournal
Extensive amount of data stored in medical documents require developing methods that help users to find
what they are looking for effectively by organizing large amounts of information into a small number of
meaningful clusters. The produced clusters contain groups of objects which are more similar to each other
than to the members of any other group. Thus, the aim of high-quality document clustering algorithms is to
determine a set of clusters in which the inter-cluster similarity is minimized and intra-cluster similarity is
maximized. The most important feature in many clustering algorithms is treating the clustering problem as
an optimization process, that is, maximizing or minimizing a particular clustering criterion function
defined over the whole clustering solution. The only real difference between agglomerative algorithms is
how they choose which clusters to merge. The main purpose of this paper is to compare different
agglomerative algorithms based on the evaluation of the clusters quality produced by different hierarchical
agglomerative clustering algorithms using different criterion functions for the problem of clustering
medical documents. Our experimental results showed that the agglomerative algorithm that uses I1 as its
criterion function for choosing which clusters to merge produced better clusters quality than the other
criterion functions in term of entropy and purity as external measures.
APPLYING GENETIC ALGORITHMS TO INFORMATION RETRIEVAL USING VECTOR SPACE MODEL IJCSEA Journal
Genetic algorithms are usually used in information retrieval systems (IRs) to enhance the information retrieval process, and to increase the efficiency of the optimal information retrieval in order to meet the users’ needs and help them find what they want exactly among the growing numbers of available information. The improvement of adaptive genetic algorithms helps to retrieve the information needed by the user accurately, reduces the retrieved relevant files and excludes irrelevant files. In this study, the researcher explored the problems embedded in this process, attempted to find solutions such as the way of choosing mutation probability and fitness function, and chose Cranfield English Corpus test collection on mathematics. Such collection was conducted by Cyrial Cleverdon and used at the University of Cranfield in 1960 containing 1400 documents, and 225 queries for simulation purposes. The researcher also used cosine similarity and jaccards to compute similarity between the query and documents, and used two proposed adaptive fitness function, mutation operators as well as adaptive crossover. The process aimed at evaluating the effectiveness of results according to the measures of precision and recall. Finally, the study concluded that we might have several improvements when using adaptive genetic algorithms.
Applying genetic algorithms to information retrieval using vector space modelIJCSEA Journal
Genetic algorithms are usually used in information retrieval systems (IRs) to enhance the information retrieval process, and to increase the efficiency of the optimal information retrieval in order to meet the users’ needs and help them find what they want exactly among the growing numbers of available information. The improvement of adaptive genetic algorithms helps to retrieve the information needed by the user accurately, reduces the retrieved relevant files and excludes irrelevant files. In this study, the researcher explored the problems embedded in this process, attempted to find solutions such as the way of choosing mutation probability and fitness function, and chose Cranfield English Corpus test collection on
mathematics. Such collection was conducted by Cyrial Cleverdon and used at the University of Cranfield in
1960 containing 1400 documents, and 225 queries for simulation purposes. The researcher also used
cosine similarity and jaccards to compute similarity between the query and documents, and used two
proposed adaptive fitness function, mutation operators as well as adaptive crossover. The process aimed at
evaluating the effectiveness of results according to the measures of precision and recall. Finally, the study
concluded that we might have several improvements when using adaptive genetic algorithms.
Applying Genetic Algorithms to Information Retrieval Using Vector Space ModelIJCSEA Journal
Genetic algorithms are usually used in information retrieval systems (IRs) to enhance the information retrieval process, and to increase the efficiency of the optimal information retrieval in order to meet the users’ needs and help them find what they want exactly among the growing numbers of available information. The improvement of adaptive genetic algorithms helps to retrieve the information needed by the user accurately, reduces the retrieved relevant files and excludes irrelevant files. In this study, the researcher explored the problems embedded in this process, attempted to find solutions such as the way of choosing mutation probability and fitness function, and chose Cranfield English Corpus test collection on mathematics. Such collection was conducted by Cyrial Cleverdon and used at the University of Cranfield in 1960 containing 1400 documents, and 225 queries for simulation purposes. The researcher also used cosine similarity and jaccards to compute similarity between the query and documents, and used two proposed adaptive fitness function, mutation operators as well as adaptive crossover. The process aimed at evaluating the effectiveness of results according to the measures of precision and recall. Finally, the study concluded that we might have several improvements when using adaptive genetic algorithms.
Answer extraction and passage retrieval forWaheeb Ahmed
—Question Answering systems (QASs) do the task of
retrieving text portions from a collection of documents that
contain the answer to the user’s questions. These QASs use a
variety of linguistic tools that be able to deal with small
fragments of text. Therefore, to retrieve the documents which
contains the answer from a large document collections, QASs
employ Information Retrieval (IR) techniques to minimize the
number of documents collections to a treatable amount of
relevant text. In this paper, we propose a model for passage
retrieval model that do this task with a better performance for
the purpose of Arabic QASs. We first segment each the top five
ranked documents returned by the IR module into passages.
Then, we compute the similarity score between the user’s
question terms and each passage. The top five passages (with
high similarity score) are retrieved are retrieved. Finally,
Answer Extraction techniques are applied to extract the final
answer. Our method achieved an average for precision of
87.25%, Recall of 86.2% and F1-measure of 87%.
An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi...iosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
IMPROVING SEARCH ENGINES BY DEMOTING NON-RELEVANT DOCUMENTSkevig
A good search engine aims to have more relevant documents on the top of the list. This paper describes a new technique called “Improving search engines by demoting non-relevant documents” (DNR) that improves the precision by detecting and demoting non-relevant documents. DNR generates a new set of queries that are composed of the terms of the original query combined in different ways. The documents retrieved from those new queries are evaluated using a heuristic algorithm to detect the non-relevant ones. These non-relevant documents are moved down the list which will consequently improve the precision. The new technique is tested on WT2g test collection. The testing of the new technique is done using variant retrieval models, which are the vector model based on the TFIDF weighing measure, the probabilistic models based on the BM25, and DFR-BM25 weighing measures. The recall and precision ratios are used to compare the performance of the new technique against the performance of the original query
A good search engine aims to have more relevant documents on the top of the list. This paper describes a new technique called “Improving search engines by demoting non-relevant documents” (DNR) that
improves the precision by detecting and demoting non-relevant documents. DNR generates a new set of
queries that are composed of the terms of the original query combined in different ways
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Supreme court dialogue classification using machine learning models IJECEIAES
Legal classification models help lawyers identify the relevant documents required for a study. In this study, the focus is on sentence level classification. To be more precise, the work undertaken focuses on a conversation in the supreme court between the justice and other correspondents. In the study, both the naïve Bayes classifier and logistic regression are used to classify conversations at the sentence level. The performance is measured with the help of the area under the curve score. The study found that the model that was trained on a specific case yielded better results than a model that was trained on a larger number of conversations. Case specificity is found to be more crucial in gaining better results from the classifier.
Clustering the results of a search helps the user to overview the information returned. In this paper, we
look upon the clustering task as cataloguing the search results. By catalogue we mean a structured label
list that can help the user to realize the labels and search results. Labelling Cluster is crucial because
meaningless or confusing labels may mislead users to check wrong clusters for the query and lose extra
time. Additionally, labels should reflect the contents of documents within the cluster accurately. To be able
to label clusters effectively, a new cluster labelling method is introduced. More emphasis was given to
/produce comprehensible and accurate cluster labels in addition to the discovery of document clusters. We
also present a new metric that employs to assess the success of cluster labelling. We adopt a comparative
evaluation strategy to derive the relative performance of the proposed method with respect to the two
prominent search result clustering methods: Suffix Tree Clustering and Lingo.
we perform the experiments using the publicly available Datasets Ambient and ODP-239
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...Editor IJCATR
Report handling on "LAPOR!" (Laporan, Aspirasi dan Pengaduan Online Rakyat) system depending on the system administrator who manually reads every incoming report [3]. Read manually can lead to errors in handling complaints [4] if the data flow is huge and grows rapidly, it needs at least three days to prepare a confirmation and it sensitive to inconsistencies [3]. In this study, the authors propose a model that can measure the identities of the Query (Incoming) with Document (Archive). The authors employed Class-Based Indexing term weighting scheme, and Cosine Similarities to analyse document similarities. CoSimTFIDF, CoSimTFICF and CoSimTFIDFICF values used in classification as feature for K-Nearest Neighbour (K-NN) classifier. The optimum result evaluation is pre-processing employ 75% of training data ratio and 25% of test data with CoSimTFIDF feature. It deliver a high accuracy 84%. The k = 5 value obtain high accuracy 84.12%
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringIJMER
International Journal of Modern Engineering Research (IJMER) is Peer reviewed, online Journal. It serves as an international archival forum of scholarly research related to engineering and science education.
International Journal of Modern Engineering Research (IJMER) covers all the fields of engineering and science: Electrical Engineering, Mechanical Engineering, Civil Engineering, Chemical Engineering, Computer Engineering, Agricultural Engineering, Aerospace Engineering, Thermodynamics, Structural Engineering, Control Engineering, Robotics, Mechatronics, Fluid Mechanics, Nanotechnology, Simulators, Web-based Learning, Remote Laboratories, Engineering Design Methods, Education Research, Students' Satisfaction and Motivation, Global Projects, and Assessment…. And many more.
International Journal of Engineering Research and Applications (IJERA) is a team of researchers not publication services or private publications running the journals for monetary benefits, we are association of scientists and academia who focus only on supporting authors who want to publish their work. The articles published in our journal can be accessed online, all the articles will be archived for real time access.
Our journal system primarily aims to bring out the research talent and the works done by sciaentists, academia, engineers, practitioners, scholars, post graduate students of engineering and science. This journal aims to cover the scientific research in a broader sense and not publishing a niche area of research facilitating researchers from various verticals to publish their papers. It is also aimed to provide a platform for the researchers to publish in a shorter of time, enabling them to continue further All articles published are freely available to scientific researchers in the Government agencies,educators and the general public. We are taking serious efforts to promote our journal across the globe in various ways, we are sure that our journal will act as a scientific platform for all researchers to publish their works online.
Arabic text categorization algorithm using vector evaluation methodijcsit
Text categorization is the process of grouping documents into categories based on their contents. This
process is important to make information retrieval easier, and it became more important due to the huge
textual information available online. The main problem in text categorization is how to improve the
classification accuracy. Although Arabic text categorization is a new promising field, there are a few
researches in this field. This paper proposes a new method for Arabic text categorization using vector
evaluation. The proposed method uses a categorized Arabic documents corpus, and then the weights of the
tested document's words are calculated to determine the document keywords which will be compared with
the keywords of the corpus categorizes to determine the tested document's best category.
A novel approach for text extraction using effective pattern matching techniqueeSAT Journals
Abstract
There are many data mining techniques have been proposed for mining useful patterns from documents. Still, how to effectively use and update discovered patterns is open for future research , especially in the field of text mining. As most existing text mining methods adopted term-based approaches, they all suffer from the problems of polysemy(words have multiple meanings) and synonymy(multiple words have same meaning). People have held hypothesis that pattern-based approaches should perform better than the term-based, but many experiments does not support this hypothesis. This paper presents an innovative and effective pattern discovery technique which includes the processes of pattern deploying and pattern matching, to improve the effective use of discovered patterns.
Keywords: Pattern Mining, Pattern Taxonomy Model, Inner Pattern Evolving, TF-IDF, NLP etc.
Using data mining methods knowledge discovery for text miningeSAT Journals
Abstract Many data mining techniques have been proposed for mining useful patterns in text documents. However, how to effectively use and update discovered patterns is still an open research issue, especially in the domain of text mining. Since most existing text mining methods adopted term-based approaches, they all suffer from the problems of polysemy and synonymy. Over the years, people have often held the hypothesis that pattern (or phrase)-based approaches should perform better than the term-based ones, but many experiments do not support this hypothesis. Proposed work presents an innovative and effective pattern discovery technique which includes the processes of pattern deploying and pattern evolving, to improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information. Keywords:-Text mining, text classification, pattern mining, pattern evolving, information filtering.
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
Elevating forensic investigation system for file clusteringeSAT Journals
Abstract In computer forensic investigation, thousands of files are usually surveyed. Much of the data in those files consists of formless manuscript, whose investigation by computer examiners is very tough to accomplish. Clustering is the unverified organization of designs that is data items, remarks, or feature vectors into groups (clusters). To find a noble clarification for this automated method of analysis are of great interest. In particular, algorithms such as K-means, K-medoids, Single Link, Complete Link and Average Link can simplify the detection of new and valuable information from the documents under investigation. This paper is going to present an tactic that applies text clustering algorithms to forensic examination of computers seized in police investigations using multithreading technique for data clustering. Keywords- Clustering, forensic computing, text mining, multithreading.
The paper aims at giving an overview about inverted files , signature files, suffix array and suffix tree
based on Arabic documents collection. The paper also aims at giving the comparison points between all
these techniques and the performance of this techniques on each of the comparison points. Any information
retrieval System is usually evaluated through efficiency and effectiveness of this system. Moreover, there
are two aspects of efficiency: Time and Space. The time measure represents the time needed to retrieve a
document relevant to a specified query, while space represents the capacity of memory needed to create
the two indices.
In this paper, four indices will be built: inverted-file , signature-file, suffix array and suffix tree. However,
to measure the performance of each one, a retrieval system must be built to compare the results of using
these indices.
A collection of 242 Arabic Abstracts from the proceeding of the Saudi Arabian National Computer
Conferences have been used in these systems, and a collection of 60 Arabic queries have been run on the
there systems. We found out that the retrieval result for inverted files is better than the retrieval result for
other indices.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Effect of Query Formation on Web Search Engine Resultskevig
Query in a search engine is generally based on natural language. A query can be expressed in more than
one way without changing its meaning as it depends on thinking of human being at a particular moment.
Aim of the searcher is to get most relevant results immaterial of how the query has been expressed. In the
present paper, we have examined the results of search engine for change in coverage and similarity of first
few results when a query is entered in two semantically same but in different formats. Searching has been
made through Google search engine. Fifteen pairs of queries have been chosen for the study. The t-test has
been used for the purpose and the results have been checked on the basis of total documents found,
similarity of first five and first ten documents found in the results of a query entered in two different
formats. It has been found that the total coverage is same but first few results are significantly different.
Investigations of the Distributions of Phonemic Durations in Hindi and Dogrikevig
Speech generation is one of the most important areas of research in speech signal processing which is now gaining a serious attention. Speech is a natural form of communication in all living things. Computers with the ability to understand speech and speak with a human like voice are expected to contribute to the development of more natural man-machine interface. However, in order to give those functions that are even closer to those of human beings, we must learn more about the mechanisms by which speech is produced and perceived, and develop speech information processing technologies that can generate a more natural sounding systems. The so described field of stud, also called speech synthesis and more prominently acknowledged as text-to-speech synthesis, originated in the mid eighties because of the emergence of DSP and the rapid advancement of VLSI techniques. To understand this field of speech, it is necessary to understand the basic theory of speech production. Every language has different phonetic alphabets and a different set of possible phonemes and their combinations.
For the analysis of the speech signal, we have carried out the recording of five speakers in Dogri (3 male and 5 females) and eight speakers in Hindi language (4 male and 4 female). For estimating the durational distributions, the mean of mean of ten instances of vowels of each speaker in both the languages has been calculated. Investigations have shown that the two durational distributions differ significantly with respect to mean and standard deviation. The duration of phoneme is speaker dependent. The whole investigation can be concluded with the end result that almost all the Dogri phonemes have shorter duration, in comparison to Hindi phonemes. The period in milli seconds of same phonemes when uttered in Hindi were found to be longer compared to when they were spoken by a person with Dogri as his mother tongue. There are many applications which are directly of indirectly related to the research being carried out. For instance the main application may be for transforming Dogri speech into Hindi and vice versa, and further utilizing this application, we can develop a speech aid to teach Dogri to children. The results may also be useful for synthesizing the phonemes of Dogri using the parameters of the phonemes of Hindi and for building large vocabulary speech recognition systems.
More Related Content
Similar to Testing Different Log Bases for Vector Model Weighting Technique
An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi...iosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
IMPROVING SEARCH ENGINES BY DEMOTING NON-RELEVANT DOCUMENTSkevig
A good search engine aims to have more relevant documents on the top of the list. This paper describes a new technique called “Improving search engines by demoting non-relevant documents” (DNR) that improves the precision by detecting and demoting non-relevant documents. DNR generates a new set of queries that are composed of the terms of the original query combined in different ways. The documents retrieved from those new queries are evaluated using a heuristic algorithm to detect the non-relevant ones. These non-relevant documents are moved down the list which will consequently improve the precision. The new technique is tested on WT2g test collection. The testing of the new technique is done using variant retrieval models, which are the vector model based on the TFIDF weighing measure, the probabilistic models based on the BM25, and DFR-BM25 weighing measures. The recall and precision ratios are used to compare the performance of the new technique against the performance of the original query
A good search engine aims to have more relevant documents on the top of the list. This paper describes a new technique called “Improving search engines by demoting non-relevant documents” (DNR) that
improves the precision by detecting and demoting non-relevant documents. DNR generates a new set of
queries that are composed of the terms of the original query combined in different ways
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Supreme court dialogue classification using machine learning models IJECEIAES
Legal classification models help lawyers identify the relevant documents required for a study. In this study, the focus is on sentence level classification. To be more precise, the work undertaken focuses on a conversation in the supreme court between the justice and other correspondents. In the study, both the naïve Bayes classifier and logistic regression are used to classify conversations at the sentence level. The performance is measured with the help of the area under the curve score. The study found that the model that was trained on a specific case yielded better results than a model that was trained on a larger number of conversations. Case specificity is found to be more crucial in gaining better results from the classifier.
Clustering the results of a search helps the user to overview the information returned. In this paper, we
look upon the clustering task as cataloguing the search results. By catalogue we mean a structured label
list that can help the user to realize the labels and search results. Labelling Cluster is crucial because
meaningless or confusing labels may mislead users to check wrong clusters for the query and lose extra
time. Additionally, labels should reflect the contents of documents within the cluster accurately. To be able
to label clusters effectively, a new cluster labelling method is introduced. More emphasis was given to
/produce comprehensible and accurate cluster labels in addition to the discovery of document clusters. We
also present a new metric that employs to assess the success of cluster labelling. We adopt a comparative
evaluation strategy to derive the relative performance of the proposed method with respect to the two
prominent search result clustering methods: Suffix Tree Clustering and Lingo.
we perform the experiments using the publicly available Datasets Ambient and ODP-239
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...Editor IJCATR
Report handling on "LAPOR!" (Laporan, Aspirasi dan Pengaduan Online Rakyat) system depending on the system administrator who manually reads every incoming report [3]. Read manually can lead to errors in handling complaints [4] if the data flow is huge and grows rapidly, it needs at least three days to prepare a confirmation and it sensitive to inconsistencies [3]. In this study, the authors propose a model that can measure the identities of the Query (Incoming) with Document (Archive). The authors employed Class-Based Indexing term weighting scheme, and Cosine Similarities to analyse document similarities. CoSimTFIDF, CoSimTFICF and CoSimTFIDFICF values used in classification as feature for K-Nearest Neighbour (K-NN) classifier. The optimum result evaluation is pre-processing employ 75% of training data ratio and 25% of test data with CoSimTFIDF feature. It deliver a high accuracy 84%. The k = 5 value obtain high accuracy 84.12%
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringIJMER
International Journal of Modern Engineering Research (IJMER) is Peer reviewed, online Journal. It serves as an international archival forum of scholarly research related to engineering and science education.
International Journal of Modern Engineering Research (IJMER) covers all the fields of engineering and science: Electrical Engineering, Mechanical Engineering, Civil Engineering, Chemical Engineering, Computer Engineering, Agricultural Engineering, Aerospace Engineering, Thermodynamics, Structural Engineering, Control Engineering, Robotics, Mechatronics, Fluid Mechanics, Nanotechnology, Simulators, Web-based Learning, Remote Laboratories, Engineering Design Methods, Education Research, Students' Satisfaction and Motivation, Global Projects, and Assessment…. And many more.
International Journal of Engineering Research and Applications (IJERA) is a team of researchers not publication services or private publications running the journals for monetary benefits, we are association of scientists and academia who focus only on supporting authors who want to publish their work. The articles published in our journal can be accessed online, all the articles will be archived for real time access.
Our journal system primarily aims to bring out the research talent and the works done by sciaentists, academia, engineers, practitioners, scholars, post graduate students of engineering and science. This journal aims to cover the scientific research in a broader sense and not publishing a niche area of research facilitating researchers from various verticals to publish their papers. It is also aimed to provide a platform for the researchers to publish in a shorter of time, enabling them to continue further All articles published are freely available to scientific researchers in the Government agencies,educators and the general public. We are taking serious efforts to promote our journal across the globe in various ways, we are sure that our journal will act as a scientific platform for all researchers to publish their works online.
Arabic text categorization algorithm using vector evaluation methodijcsit
Text categorization is the process of grouping documents into categories based on their contents. This
process is important to make information retrieval easier, and it became more important due to the huge
textual information available online. The main problem in text categorization is how to improve the
classification accuracy. Although Arabic text categorization is a new promising field, there are a few
researches in this field. This paper proposes a new method for Arabic text categorization using vector
evaluation. The proposed method uses a categorized Arabic documents corpus, and then the weights of the
tested document's words are calculated to determine the document keywords which will be compared with
the keywords of the corpus categorizes to determine the tested document's best category.
A novel approach for text extraction using effective pattern matching techniqueeSAT Journals
Abstract
There are many data mining techniques have been proposed for mining useful patterns from documents. Still, how to effectively use and update discovered patterns is open for future research , especially in the field of text mining. As most existing text mining methods adopted term-based approaches, they all suffer from the problems of polysemy(words have multiple meanings) and synonymy(multiple words have same meaning). People have held hypothesis that pattern-based approaches should perform better than the term-based, but many experiments does not support this hypothesis. This paper presents an innovative and effective pattern discovery technique which includes the processes of pattern deploying and pattern matching, to improve the effective use of discovered patterns.
Keywords: Pattern Mining, Pattern Taxonomy Model, Inner Pattern Evolving, TF-IDF, NLP etc.
Using data mining methods knowledge discovery for text miningeSAT Journals
Abstract Many data mining techniques have been proposed for mining useful patterns in text documents. However, how to effectively use and update discovered patterns is still an open research issue, especially in the domain of text mining. Since most existing text mining methods adopted term-based approaches, they all suffer from the problems of polysemy and synonymy. Over the years, people have often held the hypothesis that pattern (or phrase)-based approaches should perform better than the term-based ones, but many experiments do not support this hypothesis. Proposed work presents an innovative and effective pattern discovery technique which includes the processes of pattern deploying and pattern evolving, to improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information. Keywords:-Text mining, text classification, pattern mining, pattern evolving, information filtering.
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
Elevating forensic investigation system for file clusteringeSAT Journals
Abstract In computer forensic investigation, thousands of files are usually surveyed. Much of the data in those files consists of formless manuscript, whose investigation by computer examiners is very tough to accomplish. Clustering is the unverified organization of designs that is data items, remarks, or feature vectors into groups (clusters). To find a noble clarification for this automated method of analysis are of great interest. In particular, algorithms such as K-means, K-medoids, Single Link, Complete Link and Average Link can simplify the detection of new and valuable information from the documents under investigation. This paper is going to present an tactic that applies text clustering algorithms to forensic examination of computers seized in police investigations using multithreading technique for data clustering. Keywords- Clustering, forensic computing, text mining, multithreading.
The paper aims at giving an overview about inverted files , signature files, suffix array and suffix tree
based on Arabic documents collection. The paper also aims at giving the comparison points between all
these techniques and the performance of this techniques on each of the comparison points. Any information
retrieval System is usually evaluated through efficiency and effectiveness of this system. Moreover, there
are two aspects of efficiency: Time and Space. The time measure represents the time needed to retrieve a
document relevant to a specified query, while space represents the capacity of memory needed to create
the two indices.
In this paper, four indices will be built: inverted-file , signature-file, suffix array and suffix tree. However,
to measure the performance of each one, a retrieval system must be built to compare the results of using
these indices.
A collection of 242 Arabic Abstracts from the proceeding of the Saudi Arabian National Computer
Conferences have been used in these systems, and a collection of 60 Arabic queries have been run on the
there systems. We found out that the retrieval result for inverted files is better than the retrieval result for
other indices.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Similar to Testing Different Log Bases for Vector Model Weighting Technique (20)
Effect of Query Formation on Web Search Engine Resultskevig
Query in a search engine is generally based on natural language. A query can be expressed in more than
one way without changing its meaning as it depends on thinking of human being at a particular moment.
Aim of the searcher is to get most relevant results immaterial of how the query has been expressed. In the
present paper, we have examined the results of search engine for change in coverage and similarity of first
few results when a query is entered in two semantically same but in different formats. Searching has been
made through Google search engine. Fifteen pairs of queries have been chosen for the study. The t-test has
been used for the purpose and the results have been checked on the basis of total documents found,
similarity of first five and first ten documents found in the results of a query entered in two different
formats. It has been found that the total coverage is same but first few results are significantly different.
Investigations of the Distributions of Phonemic Durations in Hindi and Dogrikevig
Speech generation is one of the most important areas of research in speech signal processing which is now gaining a serious attention. Speech is a natural form of communication in all living things. Computers with the ability to understand speech and speak with a human like voice are expected to contribute to the development of more natural man-machine interface. However, in order to give those functions that are even closer to those of human beings, we must learn more about the mechanisms by which speech is produced and perceived, and develop speech information processing technologies that can generate a more natural sounding systems. The so described field of stud, also called speech synthesis and more prominently acknowledged as text-to-speech synthesis, originated in the mid eighties because of the emergence of DSP and the rapid advancement of VLSI techniques. To understand this field of speech, it is necessary to understand the basic theory of speech production. Every language has different phonetic alphabets and a different set of possible phonemes and their combinations.
For the analysis of the speech signal, we have carried out the recording of five speakers in Dogri (3 male and 5 females) and eight speakers in Hindi language (4 male and 4 female). For estimating the durational distributions, the mean of mean of ten instances of vowels of each speaker in both the languages has been calculated. Investigations have shown that the two durational distributions differ significantly with respect to mean and standard deviation. The duration of phoneme is speaker dependent. The whole investigation can be concluded with the end result that almost all the Dogri phonemes have shorter duration, in comparison to Hindi phonemes. The period in milli seconds of same phonemes when uttered in Hindi were found to be longer compared to when they were spoken by a person with Dogri as his mother tongue. There are many applications which are directly of indirectly related to the research being carried out. For instance the main application may be for transforming Dogri speech into Hindi and vice versa, and further utilizing this application, we can develop a speech aid to teach Dogri to children. The results may also be useful for synthesizing the phonemes of Dogri using the parameters of the phonemes of Hindi and for building large vocabulary speech recognition systems.
May 2024 - Top10 Cited Articles in Natural Language Computingkevig
Natural Language Processing is a programmed approach to analyze text that is based on both a set of theories and a set of technologies. This forum aims to bring together researchers who have designed and build software that will analyze, understand, and generate languages that humans use naturally to address computers.
Effect of Singular Value Decomposition Based Processing on Speech Perceptionkevig
Speech is an important biological signal for primary mode of communication among human being and also the most natural and efficient form of exchanging information among human in speech. Speech processing is the most important aspect in signal processing. In this paper the theory of linear algebra called singular value decomposition (SVD) is applied to the speech signal. SVD is a technique for deriving important parameters of a signal. The parameters derived using SVD may further be reduced by perceptual evaluation of the synthesized speech using only perceptually important parameters, where the speech signal can be compressed so that the information can be transformed into compressed form without losing its quality. This technique finds wide applications in speech compression, speech recognition, and speech synthesis. The objective of this paper is to investigate the effect of SVD based feature selection of the input speech on the perception of the processed speech signal. The speech signal which is in the form of vowels \a\, \e\, \u\ were recorded from each of the six speakers (3 males and 3 females). The vowels for the six speakers were analyzed using SVD based processing and the effect of the reduction in singular values was investigated on the perception of the resynthesized vowels using reduced singular values. Investigations have shown that the number of singular values can be drastically reduced without significantly affecting the perception of the vowels.
Identifying Key Terms in Prompts for Relevance Evaluation with GPT Modelskevig
Relevance evaluation of a query and a passage is essential in Information Retrieval (IR). Recently, numerous studies have been conducted on tasks related to relevance judgment using Large Language Models (LLMs) such as GPT-4,
demonstrating significant improvements. However, the efficacy of LLMs is considerably influenced by the design of the prompt. The purpose of this paper is to
identify which specific terms in prompts positively or negatively impact relevance
evaluation with LLMs. We employed two types of prompts: those used in previous
research and generated automatically by LLMs. By comparing the performance of
these prompts in both few-shot and zero-shot settings, we analyze the influence of
specific terms in the prompts. We have observed two main findings from our study.
First, we discovered that prompts using the term ‘answer’ lead to more effective
relevance evaluations than those using ‘relevant.’ This indicates that a more direct
approach, focusing on answering the query, tends to enhance performance. Second,
we noted the importance of appropriately balancing the scope of ‘relevance.’ While
the term ‘relevant’ can extend the scope too broadly, resulting in less precise evaluations, an optimal balance in defining relevance is crucial for accurate assessments.
The inclusion of few-shot examples helps in more precisely defining this balance.
By providing clearer contexts for the term ‘relevance,’ few-shot examples contribute
to refine relevance criteria. In conclusion, our study highlights the significance of
carefully selecting terms in prompts for relevance evaluation with LLMs.
Identifying Key Terms in Prompts for Relevance Evaluation with GPT Modelskevig
Relevance evaluation of a query and a passage is essential in Information Retrieval (IR). Recently, numerous studies have been conducted on tasks related to relevance judgment using Large Language Models (LLMs) such as GPT-4, demonstrating significant improvements. However, the efficacy of LLMs is considerably influenced by the design of the prompt. The purpose of this paper is to identify which specific terms in prompts positively or negatively impact relevance evaluation with LLMs. We employed two types of prompts: those used in previous research and generated automatically by LLMs. By comparing the performance of these prompts in both few-shot and zero-shot settings, we analyze the influence of specific terms in the prompts. We have observed two main findings from our study. First, we discovered that prompts using the term ‘answer’ lead to more effective relevance evaluations than those using ‘relevant.’ This indicates that a more direct approach, focusing on answering the query, tends to enhance performance. Second, we noted the importance of appropriately balancing the scope of ‘relevance.’ While the term ‘relevant’ can extend the scope too broadly, resulting in less precise evaluations, an optimal balance in defining relevance is crucial for accurate assessments. The inclusion of few-shot examples helps in more precisely defining this balance. By providing clearer contexts for the term ‘relevance,’ few-shot examples contribute to refine relevance criteria. In conclusion, our study highlights the significance of carefully selecting terms in prompts for relevance evaluation with LLMs.
In recent years, great advances have been made in the speed, accuracy, and coverage of automatic word
sense disambiguator systems that, given a word appearing in a certain context, can identify the sense of
that word. In this paper we consider the problem of deciding whether same words contained in different
documents are related to the same meaning or are homonyms. Our goal is to improve the estimate of the
similarity of documents in which some words may be used with different meanings. We present three new
strategies for solving this problem, which are used to filter out homonyms from the similarity computation.
Two of them are intrinsically non-semantic, whereas the other one has a semantic flavor and can also be
applied to word sense disambiguation. The three strategies have been embedded in an article document
recommendation system that one of the most important Italian ad-serving companies offers to its customers.
Genetic Approach For Arabic Part Of Speech Taggingkevig
With the growing number of textual resources available, the ability to understand them becomes critical.
An essential first step in understanding these sources is the ability to identify the parts-of-speech in each
sentence. Arabic is a morphologically rich language, which presents a challenge for part of speech
tagging. In this paper, our goal is to propose, improve, and implement a part-of-speech tagger based on a
genetic algorithm. The accuracy obtained with this method is comparable to that of other probabilistic
approaches.
Rule Based Transliteration Scheme for English to Punjabikevig
Machine Transliteration has come out to be an emerging and a very important research area in the field of
machine translation. Transliteration basically aims to preserve the phonological structure of words. Proper
transliteration of name entities plays a very significant role in improving the quality of machine translation.
In this paper we are doing machine transliteration for English-Punjabi language pair using rule based
approach. We have constructed some rules for syllabification. Syllabification is the process to extract or
separate the syllable from the words. In this we are calculating the probabilities for name entities (Proper
names and location). For those words which do not come under the category of name entities, separate
probabilities are being calculated by using relative frequency through a statistical machine translation
toolkit known as MOSES. Using these probabilities we are transliterating our input text from English to
Punjabi.
Improving Dialogue Management Through Data Optimizationkevig
In task-oriented dialogue systems, the ability for users to effortlessly communicate with machines and computers through natural language stands as a critical advancement. Central to these systems is the dialogue manager, a pivotal component tasked with navigating the conversation to effectively meet user goals by selecting the most appropriate response. Traditionally, the development of sophisticated dialogue management has embraced a variety of methodologies, including rule-based systems, reinforcement learning, and supervised learning, all aimed at optimizing response selection in light of user inputs. This research casts a spotlight on the pivotal role of data quality in enhancing the performance of dialogue managers. Through a detailed examination of prevalent errors within acclaimed datasets, such as Multiwoz 2.1 and SGD, we introduce an innovative synthetic dialogue generator designed to control the introduction of errors precisely. Our comprehensive analysis underscores the critical impact of dataset imperfections, especially mislabeling, on the challenges inherent in refining dialogue management processes.
Document Author Classification using Parsed Language Structurekevig
Over the years there has been ongoing interest in detecting authorship of a text based on statistical properties of the text, such as by using occurrence rates of noncontextual words. In previous work, these techniques have been used, for example, to determine authorship of all of The Federalist Papers. Such methods may be useful in more modern times to detect fake or AI authorship. Progress in statistical natural language parsers introduces the possibility of using grammatical structure to detect authorship. In this paper we explore a new possibility for detecting authorship using grammatical structural information extracted using a statistical natural language parser. This paper provides a proof of concept, testing author classification based on grammatical structure on a set of “proof texts,” The Federalist Papers and Sanditon which have been as test cases in previous authorship detection studies. Several features extracted from the statisticalnaturallanguage parserwere explored: all subtrees of some depth from any level; rooted subtrees of some depth, part of speech, and part of speech by level in the parse tree. It was found to be helpful to project the features into a lower dimensional space. Statistical experiments on these documents demonstrate that information from a statistical parser can, in fact, assist in distinguishing authors.
Rag-Fusion: A New Take on Retrieval Augmented Generationkevig
Infineon has identified a need for engineers, account managers, and customers to rapidly obtain product information. This problem is traditionally addressed with retrieval-augmented generation (RAG) chatbots, but in this study, I evaluated the use of the newly popularized RAG-Fusion method. RAG-Fusion combines RAG and reciprocal rank fusion (RRF) by generating multiple queries, reranking them with reciprocal scores and fusing the documents and scores. Through manually evaluating answers on accuracy, relevance, and comprehensiveness, I found that RAG-Fusion was able to provide accurate and comprehensive answers due to the generated queries contextualizing the original query from various perspectives. However, some answers strayed off topic when the generated queries' relevance to the original query is insufficient. This research marks significant progress in artificial intelligence (AI) and natural language processing (NLP) applications and demonstrates transformations in a global and multi-industry context.
Performance, Energy Consumption and Costs: A Comparative Analysis of Automati...kevig
The common practice in Machine Learning research is to evaluate the top-performing models based on their performance. However, this often leads to overlooking other crucial aspects that should be given careful consideration. In some cases, the performance differences between various approaches may be insignificant, whereas factors like production costs, energy consumption, and carbon footprint should be taken into account. Large Language Models (LLMs) are widely used in academia and industry to address NLP problems. In this study, we present a comprehensive quantitative comparison between traditional approaches (SVM-based) and more recent approaches such as LLM (BERT family models) and generative models (GPT2 and LLAMA2), using the LexGLUE benchmark. Our evaluation takes into account not only performance parameters (standard indices), but also alternative measures such as timing, energy consumption and costs, which collectively contribute to the carbon footprint. To ensure a complete analysis, we separately considered the prototyping phase (which involves model selection through training-validation-test iterations) and the in-production phases. These phases follow distinct implementation procedures and require different resources. The results indicate that simpler algorithms often achieve performance levels similar to those of complex models (LLM and generative models), consuming much less energy and requiring fewer resources. These findings suggest that companies should consider additional considerations when choosing machine learning (ML) solutions. The analysis also demonstrates that it is increasingly necessary for the scientific world to also begin to consider aspects of energy consumption in model evaluations, in order to be able to give real meaning to the results obtained using standard metrics (Precision, Recall, F1 and so on).
Evaluation of Medium-Sized Language Models in German and English Languagekevig
Large language models (LLMs) have garnered significant attention, but the definition of “large” lacks clarity. This paper focuses on medium-sized language models (MLMs), defined as having at least six billion parameters but less than 100 billion. The study evaluates MLMs regarding zero-shot generative question answering, which requires models to provide elaborate answers without external document retrieval. The paper introduces an own test dataset and presents results from human evaluation. Results show that combining the best answers from different MLMs yielded an overall correct answer rate of 82.7% which is better than the 60.9% of ChatGPT. The best MLM achieved 71.8% and has 33B parameters, which highlights the importance of using appropriate training data for fine-tuning rather than solely relying on the number of parameters. More fine-grained feedback should be used to further improve the quality of answers. The open source community is quickly closing the gap to the best commercial models.
IMPROVING DIALOGUE MANAGEMENT THROUGH DATA OPTIMIZATIONkevig
In task-oriented dialogue systems, the ability for users to effortlessly communicate with machines and
computers through natural language stands as a critical advancement. Central to these systems is the
dialogue manager, a pivotal component tasked with navigating the conversation to effectively meet user
goals by selecting the most appropriate response. Traditionally, the development of sophisticated dialogue
management has embraced a variety of methodologies, including rule-based systems, reinforcement
learning, and supervised learning, all aimed at optimizing response selection in light of user inputs. This
research casts a spotlight on the pivotal role of data quality in enhancing the performance of dialogue
managers. Through a detailed examination of prevalent errors within acclaimed datasets, such as
Multiwoz 2.1 and SGD, we introduce an innovative synthetic dialogue generator designed to control the
introduction of errors precisely. Our comprehensive analysis underscores the critical impact of dataset
imperfections, especially mislabeling, on the challenges inherent in refining dialogue management
processes.
Document Author Classification Using Parsed Language Structurekevig
Over the years there has been ongoing interest in detecting authorship of a text based on statistical properties of the
text, such as by using occurrence rates of noncontextual words. In previous work, these techniques have been used,
for example, to determine authorship of all of The Federalist Papers. Such methods may be useful in more modern
times to detect fake or AI authorship. Progress in statistical natural language parsers introduces the possibility of
using grammatical structure to detect authorship. In this paper we explore a new possibility for detecting authorship
using grammatical structural information extracted using a statistical natural language parser. This paper provides a
proof of concept, testing author classification based on grammatical structure on a set of “proof texts,” The Federalist
Papers and Sanditon which have been as test cases in previous authorship detection studies. Several features extracted
of some depth, part of speech, and part of speech by level in the parse tree. It was found to be helpful to project the
features into a lower dimensional space. Statistical experiments on these documents demonstrate that information
from a statistical parser can, in fact, assist in distinguishing authors.
RAG-FUSION: A NEW TAKE ON RETRIEVALAUGMENTED GENERATIONkevig
Infineon has identified a need for engineers, account managers, and customers to rapidly obtain product
information. This problem is traditionally addressed with retrieval-augmented generation (RAG) chatbots,
but in this study, I evaluated the use of the newly popularized RAG-Fusion method. RAG-Fusion combines
RAG and reciprocal rank fusion (RRF) by generating multiple queries, reranking them with reciprocal
scores and fusing the documents and scores. Through manually evaluating answers on accuracy,
relevance, and comprehensiveness, I found that RAG-Fusion was able to provide accurate and
comprehensive answers due to the generated queries contextualizing the original query from various
perspectives. However, some answers strayed off topic when the generated queries' relevance to the
original query is insufficient. This research marks significant progress in artificial intelligence (AI) and
natural language processing (NLP) applications and demonstrates transformations in a global and multiindustry context
Performance, energy consumption and costs: a comparative analysis of automati...kevig
The common practice in Machine Learning research is to evaluate the top-performing models based on their
performance. However, this often leads to overlooking other crucial aspects that should be given careful
consideration. In some cases, the performance differences between various approaches may be insignificant, whereas factors like production costs, energy consumption, and carbon footprint should be taken into
account. Large Language Models (LLMs) are widely used in academia and industry to address NLP problems. In this study, we present a comprehensive quantitative comparison between traditional approaches
(SVM-based) and more recent approaches such as LLM (BERT family models) and generative models (GPT2 and LLAMA2), using the LexGLUE benchmark. Our evaluation takes into account not only performance
parameters (standard indices), but also alternative measures such as timing, energy consumption and costs,
which collectively contribute to the carbon footprint. To ensure a complete analysis, we separately considered the prototyping phase (which involves model selection through training-validation-test iterations) and
the in-production phases. These phases follow distinct implementation procedures and require different resources. The results indicate that simpler algorithms often achieve performance levels similar to those of
complex models (LLM and generative models), consuming much less energy and requiring fewer resources.
These findings suggest that companies should consider additional considerations when choosing machine
learning (ML) solutions. The analysis also demonstrates that it is increasingly necessary for the scientific
world to also begin to consider aspects of energy consumption in model evaluations, in order to be able to
give real meaning to the results obtained using standard metrics (Precision, Recall, F1 and so on).
EVALUATION OF MEDIUM-SIZED LANGUAGE MODELS IN GERMAN AND ENGLISH LANGUAGEkevig
Large language models (LLMs) have garnered significant attention, but the definition of “large” lacks
clarity. This paper focuses on medium-sized language models (MLMs), defined as having at least six
billion parameters but less than 100 billion. The study evaluates MLMs regarding zero-shot generative
question answering, which requires models to provide elaborate answers without external document
retrieval. The paper introduces an own test dataset and presents results from human evaluation. Results
show that combining the best answers from different MLMs yielded an overall correct answer rate of
82.7% which is better than the 60.9% of ChatGPT. The best MLM achieved 71.8% and has 33B
parameters, which highlights the importance of using appropriate training data for fine-tuning rather than
solely relying on the number of parameters. More fine-grained feedback should be used to further improve
the quality of answers. The open source community is quickly closing the gap to the best commercial
models.
Natural Language Processing is a programmed approach to analyze text that is based on both a set of theories and a set of technologies. This forum aims to bring together researchers who have designed and build software that will analyze, understand, and generate languages that humans use naturally to address computers.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Testing Different Log Bases for Vector Model Weighting Technique
1. International Journal on Natural Language Computing (IJNLC) Vol.12, No.3, June 2023
DOI: 10.5121/ijnlc.2023.12301 1
TESTING DIFFERENT LOG BASES FOR VECTOR
MODEL WEIGHTING TECHNIQUE
Kamel Assaf
Programmer, Edmonton, AB, Canada
ABSTRACT
Information retrieval systems retrieves relevant documents based on a query submitted by the user. The
documents are initially indexed and the words in the documents are assigned weights using a weighting
technique called TFIDF which is the product of Term Frequency (TF) and Inverse Document Frequency
(IDF). TF represents the number of occurrences of a term in a document. IDF measures whether the term
is common or rare across all documents. It is computed by dividing the total number of documents in the
system by the number of documents containing the term and then computing the logarithm of the quotient.
By default, we use base 10 to calculate the logarithm. In this paper, we are going to test this weighting
technique by using a range of log bases from 0.1 to 100.0 to calculate the IDF. Testing different log bases
for vector model weighting technique is to highlight the importance of understanding the performance of
the system at different weighting values. We use the documents of MED, CRAN, NPL, LISA, and CISI test
collections that scientists assembled explicitly for experiments in data information retrieval systems.
KEYWORDS
Information Retrieval, Vector Model, Logarithms, tfidf
1. INTRODUCTION
Just not from long time ago libraries were the right place to search for an information. People
usually ask the librarian for a particular book, content or search among the available books.
Consequently, this becomes a tedious and inefficient task to find what you really need. With the
advancements of technology and computing, it is now possible to create a digital information
retrieval system capable of letting users input queries, process it and deliver relevant information
as output to the user[1, 13]. It is used to find relevant information to the user query [9, 13].
Relevant information is any set of items such as images, documents, videos and other types of
data. For the satisfaction of end users, it is expected by the system to retrieve precise and
accurate information. But, it is not always that the system returns with the desired data. Hence,
allowing the expansion of research in the subject. The main goal of this research is to elevate the
performance of information retrieval systems by finding new values as we calculate the TFIDF
weighting model. Modifying the default log base when calculating the IDF portion of the
equation could potentially make the system retrieve more precise and accurate information.
Many models exist for information retrieval systems such as the vector model[4, 13] and the
probabilistic model[12, 13]. In particular to this research, we will explore the vector model and
its common method of weighting called TFIDF which is the product of Term Frequency and
Inverse Document Frequency[4, 8]. Term Frequency is the representation of a word occurrences
in a text and the Inverse Document Frequency define if a word is common or rare across all
documents. This research looks into a different way to improve the information retrieval system
results. My idea is to change the log base when calculating the IDF from the default base 10 to
2. International Journal on Natural Language Computing (IJNLC) Vol.12, No.3, June 2023
2
test a range of variant log bases ranging from 0.1 to 100.0. To solve this problem, we explore a
mathematical method used to calculate the log of a value using a different base.
2. IR SYSTEM REVIEW
This section is reserved to explore some of the components of an information retrieval system
and its characteristics. I did develop my own system in order to made this research possible and
understand the subject in deep.
2.1. Test Collection
Many standard collections are available for testing information retrieval system[2]. Table 1,
shows the five test collections that was used to evaluate this research, it contains information
regarding the size, number of documents and the number of terms in the test collection. Each one
of the test collections comes with a queries list and a relevance judgment list that indicates the
relevant documents for each query.
Table 1. Small Test Collections: Size, Documents and Terms
Test Collection Size in Megabytes No of Documents No of Terms
MED 1.05 1,033 8,915
CRAN 1.40 1,400 4,217
NPL 3.02 11,429 7,934
LISA 3.04 6,003 11,291
CISI 1.98 1,460 5,591
2.2. Indexing
Indexing refers to the steps of eliminating stopwords, removing connectives, punctuations and
stemming words by removing its affixes and suffixes[11].
2.2.1. Stopwords
In the first step, we read all the documents and compare its words with a list of stopwords found
in a stoplist[1]. The stopwords list consists of adjectives, adverbs, and connectives such as
another, again, between, mostly, the, a, in and others are considered unnecessary words,
because they have different functionality in the grammar, they are mostly used to connect words
and create sentences. This step also include the removal of punctuations and special characters
from the documents.
2.2.2. Stemming
The stemming process is the technique used to transform particular words by removing the
suffixes from the word[7]. A well-known technique is the Porter stemming algorithm, developed
at the University of Cambridge[11]. A simple example is the word playing that becomes play
and books becomes book, removing the endings -ing and -srespectivelly. Another example
below shows the removal of some suffixes from the words ending:
3. International Journal on Natural Language Computing (IJNLC) Vol.12, No.3, June 2023
3
SSES → SS
• caresses = caress
• ponies = poni
• ties = ti
IES → I
• ponies = poni
ES → in this case, remove the ending “ES”and add nothing to it.
• beaches = beach
• bushes = bush
The stemming algorithm cleans and make words smaller in size and it is proven in previous
research that using it will improve the performance of your system[1, 7].
2.3. Vector Model
This model represents documents as vectors in space[2]. In this model, words are referred as
terms and every indexed term extracted from a document becomes an independent dimension in
a dimensional space. Therefore, any text document is represented by a vector in the dimension
space[4]. Various techniques for weighting exists[4]. One of the primary weights is the TFIDF
weighting measure. It is composed of a local and global weight. For instance, Term Frequency
(TF) is the number of a term occurrences in a document. The global weight is the document
frequency (DF), it defines in how many documents the word occurs. Inverse document frequency
(IDF) has DF scaled to the total number of documents in the collection as shown the equation
below [12, 13]:
Where N is the total number of documents in the test collection, DocFreq(t) is the total number
of documents containing term t and b is calculated using base ten. Finally, TFIDF is computed as
shown in below equation [12, 13]:
𝑤𝑡 = 𝑡𝑓𝑖,j 𝑖𝑑𝑓𝑖 (2)
This technique of weighting is applied for both the terms in the test collection and also to the
terms in the query.
2.4. Querying
The set of documents in a collection are a set of vectors in space[1]. To compare a query
submitted by the user to the documents, we use the cosine similarity[3, 13] as shown below:
4. International Journal on Natural Language Computing (IJNLC) Vol.12, No.3, June 2023
4
In the above equation, a andb represent the words in the query and in the document respectivelly.
It was used to apply the cosine rule for each of the documents against the submitted query. The
results indicate the top-scoring documents for a query [1]. Documents with the highest similarity
value are considered the most relevant. After applying this rule for each document in the test
collection, we use the above equation to calculate the query and rank the relevant documents
related to the query in the space vector, a vector space relies between 0.0 and 1.0.
2.5. Assessing
For a better understanding of the research and the results, it was used some sort of assessment
that clarifies the results. In the case of information retrieval, the process involves the use of two
files that comes with the test collection. Usually, a file which contains a list of predefined
queries[5] and another file containing a judgment list that has a query number that indicates the
number of the query in the queries list file and a list of relevant documents for that specific
query. Both files has been assembled by researchers to make further assessments possible in
information retrieval system evaluations[3]. When a researcher submits a query to the
information retrieval system, a list of documents are usually returned, for each iteration in the
execution of a query in the queries list, the results are then used to calculate precision and recall
values[6]. The below equations [12] shows how precision is calculated:
On the other hand, we also calculate the recall which gives a view on how effective is your
system.
Calculating the precision and recall values could be done in parallel as they don’t rely on values
of each other at the time of their calculations. The below figure, shows two possible rankings
together with the recall and precision values calculated at every rank position for a query that has
six relevant documents.
Figure 1. Precisions values averaged for Ranking #1 and Ranking #2
Consequently, precision and recall values are plotted to give a precision-recall curve as shown
below:
5. International Journal on Natural Language Computing (IJNLC) Vol.12, No.3, June 2023
5
Figure 2. Precision-Recall curve for Rankings #1 and #2
The above example is about ten documents only. However, when we have too many documents,
the most typical representation is to set the recall levels to fixed values 0.0, 0.1, 0.2, 0.3, 0.4, 0.5,
0.6, 0.7, 0.8, 0.9, 1.0, then calculate the precision values to the closest fixed recall. Below is an
example of assessing many documents:
Table 2. Precision values for many documents at different recall levels
It is consider in the above table that all precision values in the recall range 0.00 and 0.05 belong
to the recall level 0.0. Likewise, we consider that all precision values that are in the recall range
0.05 and 0.15 belong to the recall level 0.1. Ongoing in the same sequence, all precision values
that are in the recall range 0.15 and 0.25 belong to the recall level 0.2 and so on. Table 2 shows
how we group the precision values based on the recall ranges and then we average them as
shown in table 3.
Table 3. Average Precisions values for all fixed recall levels
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
0.867 0.675 0.570 0.520 0.500 0.420 0.350 0.340 0.330 0.313 0.000
Mean Average Precision (MAP) is a further step to assess the results. MAP is the average of
precision values at the fixed recall levels together[1]. The MAP value for the precision values in
table 3 is 0.867 + 0.675 + 0.570 + 0.520 + 0.500 + 0.420 + 0.350 + 0.340 + 0.330 + 0.313 +
0.000 = 0.444. MAP@30 is another measure that is more significant than MAP since it indicates
how well the system is retrieving relevant documents on the first few pages. MAP@30 is the
average of precision values at recall levels 0.0, 0.1, 0.2, and 0.3. MAP@30 for the precision
values in table 3 is 0.867 + 0.675 + 0.570 + 0.520 = 0.658.
6. International Journal on Natural Language Computing (IJNLC) Vol.12, No.3, June 2023
6
3. RESEARCH METHODOLOGY
In this paper, a research targeting information retrieval systems has been applied to study
opportunities that could lead to further improvements in existing methods for information
retrieval systems. The goal is to find new log base values that potentially elevates the
performance and effectiveness of the system, which means more accurate and precise results
retrieved. Based on books and other researchers, it is used to strongly point that the vector model
and the probabilistic model are the most interesting ones and they rely heavily on complex
equations to calculate weight values of words in documents[1, 2]. In particular to this paper, the
research was focused on the vector model. It was used to change the log base value to calculate
the IDF portion of the TFIDF using different log bases varying from 0.1 to 100.0, then compare
the results with the default log base 10.
In mathematics, common logarithms calculations are done using the value of 10 as the default
base. As we need to change that base for this research, it was used another way to calculate the
log, the below equation shows a method that already exists in mathematics that is used to
calculate the log using different bases:
As described in the above equation, we note that it divides the log value of x by the log of the
new desired base b. Many properties of logarithms exist, but in this case, we focus on the
property of changing the base of the logarithm. It was used to implement an algorithm that will
allow us to calculate different log bases, simply by diving the log base of the value of idf by the
log of the desired base, the algorithm below describes:
1 for ∃ 𝑏𝑎𝑠𝑒 (𝑥) 𝑖𝑛 𝑟𝑎𝑛𝑔𝑒 (0.1 → 100.0)
2 tf = occurrences in document
3 𝑣 = docfreq(t)
4 𝑖𝑑𝑓 = log (𝑣)/ log (𝑥)
5 tfidf = tf * idf
Therefore, for each log base, a new test for the system is considered, using the equation number 6
of this section for idf calculation. This has been applied to all five test collections, in a total of a
thousand execution per test collection.
4. RESEARCH RESULTS
The experiments were applied to the public available small test collections such as MED, CRAN,
NPL, LISA, and CISI. I have obtained them from
https://ir.dcs.gla.ac.uk/resources/test_collections. In this section we explain the results in a
tabular and graph form. Each of below points describes the results found in the five test
collections, we classify the results as the best top for MAP and MAP@30, finalizing it with a
graph comparison between the best log found, the standard log base 10 and the worst log base
found for the test collection.
7. International Journal on Natural Language Computing (IJNLC) Vol.12, No.3, June 2023
7
4.1. MED Test Collection
For the MED test collection, it was found significant improvements using log base 0.1, better
precision values found at recall levels 0.6, 0.7, 0.8, 0.9 and 1.0 in comparison with the standard
log base 10.
Table 4. MED with top 5 log values for MAP
RECALL
LOG 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 MAP
0.1 1 0.83 0.75 0.68 0.63 0.62 0.62 0.58 0.5 0.43 0.42 0.641818
0.2 1 0.83 0.75 0.68 0.66 0.63 0.62 0.5 0.47 0.43 0.42 0.635455
0.3 1 0.83 0.75 0.68 0.66 0.63 0.62 0.5 0.47 0.43 0.42 0.635455
1.5 1 0.83 0.75 0.68 0.63 0.62 0.61 0.5 0.43 0.42 0.41 0.625455
2 1 0.83 0.75 0.68 0.63 0.62 0.61 0.5 0.43 0.42 0.41 0.625455
Table 5. MED with top 5 log values for MAP@30
RECALL
LOG 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 MAP@30
0.1 1 0.83 0.75 0.68 0.63 0.62 0.62 0.58 0.5 0.43 0.42 0.815
0.2 1 0.83 0.75 0.68 0.66 0.63 0.62 0.5 0.47 0.43 0.42 0.815
0.3 1 0.83 0.75 0.68 0.66 0.63 0.62 0.5 0.47 0.43 0.42 0.815
1.5 1 0.83 0.75 0.68 0.63 0.62 0.61 0.5 0.43 0.42 0.41 0.815
2 1 0.83 0.75 0.68 0.63 0.62 0.61 0.5 0.43 0.42 0.41 0.815
In the above tables, it was indicated the top five log values, for MAP and MAP@30, it was found
that the best MAP is at log 0.1 with a MAP score 0.641, the other two best log values are the 0.2
and 0.3 respectively with both of them having a MAP value of 0.635.
Table 6. MED Compare the best, standard and worst for MAP@30
RECALL
LOG 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 MAP@30
0.1 1 0.83 0.75 0.68 0.63 0.62 0.62 0.58 0.5 0.43 0.42 0.815
10 1 0.83 0.75 0.68 0.63 0.62 0.61 0.5 0.43 0.42 0.36 0.815
0.5 1 0.83 0.75 0.68 0.63 0.62 0.58 0.5 0.43 0.42 0.36 0.815
In the above table, we define the values of the best log results, for the base 10 and other bases. A
graph then is plotted in below figure with the values from the above table for a better
visualization of the results. We can see that log base 0.1 performs better than other bases and
compared to base 10 on many recall levels such as 0.6, 0.7, 0.8, 0.9 and 1.0.
8. International Journal on Natural Language Computing (IJNLC) Vol.12, No.3, June 2023
8
Figure 3. MED Best, Worst and Standard Precision values.
We have determined the logarithm’s base, which gives the worst and best performance. When we
assess the MED test collection, we find that the system has a MAP@30 equal to 0.815 at all
logarithm’s bases. The system is at its best performance with a MAP value equal to 0.641 when
we apply the TFIDF weighting technique using logarithm at base 0.1. The results are better than
the standard logarithm at base 10. It was found the worst performance using log base 0.5.
4.2.CRAN Test Collection
In the CRAN test collection, it was found some improvements using log base, higher precision
values found at recall levels 0.1, 0.7 and 0.8 in comparison with the standard log base 10.
Table 7. CRAN top 5 log values for MAP
RECALL
LOG 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 MAP
0.3 1 0.88 0.83 0.8 0.78 0.76 0.75 0.73 0.7 0.68 0.68 0.780909
0.2 1 0.9 0.83 0.8 0.78 0.75 0.73 0.72 0.7 0.66 0.68 0.777273
0.5 1 0.88 0.83 0.8 0.78 0.75 0.73 0.72 0.7 0.68 0.68 0.777273
1.6 1 0.88 0.83 0.8 0.78 0.75 0.73 0.72 0.7 0.68 0.68 0.777273
18.7 1 0.88 0.83 0.8 0.78 0.75 0.73 0.72 0.7 0.68 0.68 0.777273
Table 8. CRAN top 5 log values for MAP@30
RECALL
LOG 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 MAP@30
0.2 1 0.9 0.83 0.8 0.78 0.75 0.73 0.72 0.7 0.68 0.66 0.8825
0.3 1 0.88 0.83 0.8 0.78 0.76 0.75 0.73 0.7 0.68 0.68 0.8775
0.5 1 0.88 0.83 0.8 0.78 0.75 0.73 0.72 0.7 0.68 0.68 0.8775
1.6 1 0.88 0.83 0.8 0.78 0.75 0.73 0.72 0.7 0.68 0.68 0.8775
18.7 1 0.88 0.83 0.8 0.78 0.75 0.73 0.72 0.7 0.68 0.68 0.8775
9. International Journal on Natural Language Computing (IJNLC) Vol.12, No.3, June 2023
9
In above tables, it indicates the top five log values for MAP and MAP@30. It was found the best
MAP at log 0.3 with a MAP score 0.7809, the other two best log values are the 0.2 and 0.5 with
both of them having a MAP value of 0.7772. For MAP@30 log base 0.2 performs better with a
score of 0.8825.
Table 9. CRAN compare the best, standard and worst for MAP@30
RECALL
LOG 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 MAP@30
0.2 1 0.9 0.83 0.8 0.78 0.75 0.73 0.72 0.7 0.68 0.66 0.8825
10 1 0.88 0.83 0.8 0.78 0.75 0.73 0.7 0.68 0.66 0.68 0.8775
21.7 1 0.83 0.8 0.78 0.76 0.75 0.73 0.7 0.68 0.68 0.66 0.8525
We define the values of the best log results compared with the base 10 and others. In the below
figure, a graph is plotted with the values from the above table for a better visualization of the
results. It indicates that log base 0.2 is really better than at 0.1, 0.8 and 0.9 recall levels.
Figure 4. CRAN Best, Worst, and Standard Precision Values
It was determined the logarithm’s base 0.2 is better than the standard log base 10. The
assessment results in MAP@30 is equal 0.8825 at the best logarithm 0.2. The results are better
than the standard logarithm at base 10 where MAP@30 value is equal to 0.8775. It was
identified the worst performance at log base 21.7 for MAP@30 with a score of 0.8525. The best
MAP value is equal to 0.7809 when we apply the TFIDF weighting technique using logarithm at
base 0.3.
4.3. NPL Test Collection
The NPL test collection is the largest in the group of small test collections used for this research,
it was found some improvements using log base 32.6, better precision values found at recall
levels 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9 and 1.0 in comparison with the standard log base 10, which
is a very good improvement based on the data and in comparison, to other test collection results.
10. International Journal on Natural Language Computing (IJNLC) Vol.12, No.3, June 2023
10
Table 10. NPL top results for MAP
RECALL
LOG 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 MAP
32.6 1 0.9 0.88 0.87 0.85 0.84 0.83 0.83 0.8 0.78 0.77 0.85
26.3 1 0.9 0.87 0.85 0.84 0.83 0.82 0.83 0.8 0.78 0.78 0.845455
55.4 1 0.9 0.87 0.85 0.84 0.83 0.82 0.83 0.8 0.78 0.78 0.845455
0.1 1 0.9 0.87 0.85 0.84 0.83 0.82 0.83 0.8 0.78 0.77 0.844545
1.8 1 0.9 0.87 0.85 0.84 0.83 0.82 0.83 0.8 0.78 0.77 0.844545
Above table defines log base 32.6 with MAP computed result of 0.85.
Table 11. NPL top results for MAP@30
RECALL
LOG 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 MAP@30
32.6 1 0.9 0.88 0.87 0.85 0.84 0.83 0.83 0.8 0.78 0.77 0.9125
0.7 1 0.92 0.87 0.85 0.84 0.83 0.8 0.83 0.77 0.76 0.75 0.91
0.1 1 0.9 0.87 0.85 0.84 0.83 0.82 0.83 0.8 0.78 0.77 0.905
1.8 1 0.9 0.87 0.85 0.84 0.83 0.82 0.83 0.8 0.78 0.77 0.905
24.7 1 0.9 0.87 0.85 0.84 0.83 0.82 0.83 0.8 0.78 0.77 0.905
Also for MAP @30, it was found that the best result was gained when using the log base 32.6
with a score 0.9125.
Table 12. NPL comparison of MAP@30 of the best, worst and standard base 10
RECALL
LOG 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 MAP@30
32.6 1 0.9 0.88 0.87 0.85 0.84 0.83 0.83 0.8 0.78 0.77 0.9125
10 1 0.9 0.87 0.85 0.84 0.83 0.8 0.83 0.78 0.77 0.76 0.905
26.3 1 0.9 0.87 0.85 0.84 0.83 0.83 0.82 0.8 0.78 0.78 0.905
The differences between the best, worst and the standard log base 10 results of MAP@30, are
plotted in below graph in the figure to provide a better view of the results.
11. International Journal on Natural Language Computing (IJNLC) Vol.12, No.3, June 2023
11
Figure 5. NPL test collection, best log at 32.6
The log base 32.6 has better precision at many recall levels, a second log base value of 0.7 was
also determined having better MAP@30 compared to the standard log base 10.
4.4. LISA Test Collection
In this test collection, it was found other log bases that performs better than the standard base 10.
Log base 49 has a score of 0.44 for MAP and 0.6925 for MAP@30 and is considered the best
found per below data.
Table 13. LISA best log values for MAP
RECALL
LOG 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 MAP
49 1 0.72 0.55 0.5 0.48 0.42 0.42 0.29 0.18 0.15 0.13 0.44
49.1 1 0.72 0.55 0.5 0.48 0.42 0.42 0.29 0.18 0.15 0.13 0.44
85.1 1 0.72 0.55 0.5 0.48 0.42 0.42 0.29 0.18 0.15 0.13 0.44
0.3 1 0.72 0.55 0.5 0.48 0.42 0.42 0.29 0.18 0.14 0.13 0.439091
0.4 1 0.72 0.55 0.5 0.48 0.42 0.42 0.29 0.18 0.14 0.13 0.439091
Table 14. LISA best log values for MAP@30
RECALL
LOG 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 MAP@30
49 1 0.72 0.55 0.5 0.48 0.42 0.42 0.29 0.18 0.15 0.13 0.6925
49.1 1 0.72 0.55 0.5 0.48 0.42 0.42 0.29 0.18 0.15 0.13 0.6925
85.1 1 0.72 0.55 0.5 0.48 0.42 0.42 0.29 0.18 0.15 0.13 0.6925
0.3 1 0.72 0.55 0.5 0.48 0.42 0.42 0.29 0.18 0.14 0.13 0.6925
0.4 1 0.72 0.55 0.5 0.48 0.42 0.42 0.29 0.18 0.14 0.13 0.6925
12. International Journal on Natural Language Computing (IJNLC) Vol.12, No.3, June 2023
12
The table above indicates top log values for MAP and MAP@30, it was found that the best MAP
is at log 49 with a MAP score 0.85.
Table 15. LISA best and worst values compared to the standard base 10
RECALL
LOG 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 MAP@30
49 1 0.72 0.55 0.5 0.48 0.42 0.42 0.29 0.18 0.15 0.13 0.6925
10 1 0.72 0.51 0.5 0.45 0.42 0.42 0.29 0.18 0.13 0.11 0.6825
0.6 1 0.72 0.51 0.5 0.45 0.42 0.42 0.29 0.18 0.15 0.14 0.6825
The above table’s data is shown as a graph in the below figure.
Figure 6. List best and worst compared with log base 10
At recall levels 0.2, and 0.4, for example, the curve is higher than the other log bases compared.
Therefore, log base 49 is considered better than the others. This has improvements in the
performance for MAP@30 value equal to 0.6925. The results are better than the standard
logarithm at base ten where the MAP@30 value is equal to 0.6825. We found the worst
performance was of log base 0.6 and it has the same MAP@30 value of 0.6825 compared to the
standard log base 10.
4.5. CISI Test Collection
It was also used to find better results using different log bases for this test collection. Log 84.6
have a MAP@30 score of 0.87 and MAP score of 0.7136. Its precision is higher at recall levels
0.1, 0.2, 0.5, 0.7 and 0.8.
13. International Journal on Natural Language Computing (IJNLC) Vol.12, No.3, June 2023
13
Table 16. CISI best 5 log values MAP
RECALL
LOG 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 MAP
84.6 1 0.9 0.83 0.75 0.72 0.68 0.63 0.62 0.61 0.56 0.55 0.713636
0.7 1 0.88 0.75 0.73 0.72 0.67 0.66 0.63 0.61 0.6 0.6 0.713636
53.1 1 0.88 0.75 0.73 0.72 0.67 0.66 0.63 0.61 0.6 0.6 0.713636
58.2 1 0.88 0.75 0.73 0.72 0.67 0.66 0.63 0.61 0.6 0.6 0.713636
58.3 1 0.88 0.75 0.73 0.72 0.67 0.66 0.63 0.61 0.6 0.6 0.713636
Table 17. CISI best 5 log values MAP @30
Recall
LOG 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 MAP@30
84.6 1 0.9 0.83 0.75 0.72 0.68 0.63 0.62 0.61 0.56 0.55 0.87
91.8 1 0.88 0.8 0.75 0.72 0.66 0.63 0.61 0.6 0.6 0.58 0.8575
94 1 0.88 0.8 0.75 0.72 0.66 0.63 0.61 0.6 0.6 0.58 0.8575
0.9 1 0.88 0.8 0.75 0.72 0.66 0.63 0.61 0.6 0.6 0.55 0.8575
58 1 0.88 0.8 0.75 0.72 0.66 0.63 0.61 0.6 0.6 0.55 0.8575
When we assess the CISI test collection, we find that the system is at its best performance with a
MAP@30 value equal to 0.87 when we apply the TFIDF weighting technique using logarithm
base equal to 84.6. The results are better than the standard logarithm at base ten where the
MAP@30 value is equal to 0.8575.
Table 18. CISI Comparison of Best and Worst with the Standard log 10.
RECALL
LOG 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 MAP@30
84.6 1 0.9 0.83 0.75 0.72 0.68 0.63 0.62 0.61 0.56 0.55 0.87
10 1 0.88 0.8 0.75 0.72 0.66 0.63 0.61 0.6 0.6 0.55 0.8575
85.1 1 0.9 0.72 0.66 0.63 0.61 0.6 0.6 0.55 0.54 0.52 0.82
Based on the above table, improvements has been determined by the results of calculation for
MAP@30 and it is illustrated in below figure the graph.
14. International Journal on Natural Language Computing (IJNLC) Vol.12, No.3, June 2023
14
Figure 7. CISI Best, Worst, and Standard based comparison on MAP@30
5. CONCLUSIONS
The experiments show that calculating logarithm’s using base 10 is not always the optimal
solution and it was found new values to calculate the TFIDF in all five test collections, resulting
in improvements in the matter of efficiency and effectiveness of the information retrieval system.
It has been proved in this research that a variation in the logarithm’s bases improves the
precision at high, middle and low recall levels, which is considered a good improvement. Many
log bases outperforms log base 10. Log values such as 84.6, 49, 32.6, 0.2 and 0.1 had an impact
for better performance in the system. It must be taken in consideration that in the test collections
CISI, LISA and NPL, better results were found using log base higher than 10 and that MED and
CRAN had better results with log bases lower than base ten. I conclude that in this research, a
variant of the logarithms bases in the TFIDF weighting technique affects the performance of the
results and it was found in the context. It has been determined which log base returns better
results for each of the test collections that was used. Further research must be done on large test
collections such as TREC, GOV. The method of changing the log base should also be applied in
a probabilistic model.
ACKNOWLEDGEMENTS
I would like to thank everyone that have taught me a lesson in my life, specially authors of
scientific papers and books.
REFERENCES
[1] Christopher D. M.,P.Raghavanand H.Schütze(2009), “An Introduction to Information Retrieval”,
Cambridge University Press, ISBN-10: 0521865719
[2] Baeza and Yates (1999), “Modern Information Retrieval”, Pearson Education, ISBN-10:
020139829X
[3] Josef Steinberger, Karel Jezek (2009), “Evaluation Measures for Text Summarization”, Computing
and Informatics, Vol. 28, 2009, pages 1001-1026
[4] Amit Singhal (2001), “Modern Information Retrieval: A Brief Overview”, IEEE Data Eng. Bull,
Vol. 24, pp. 35-43
[5] Mark Sanderson (2010), “Test Collection Based Evaluation of Information Retrieval”, Foundations
and Trends in Information Retrieval: Vol. 4: No. 4, pp 247-375
15. International Journal on Natural Language Computing (IJNLC) Vol.12, No.3, June 2023
15
[6] Croft, Metzler &Strohman (2009), “Search Engines Information Retrieval in Practice”, Pearson
Education, ISBN-10: 0136072240
[7] Martin Porter (1980), “An Algorithm for suffix stripping”, Program: electronic library and
information systems, Vol. 14 No. 3, pp. 130-137
[8] Justin Zobel (2006), “Inverted Files for Text Search Engines”, ACM Computing Surveys, Vol. 38,
Issue 2, pp 6-es
[9] Arvind Rangaswamy, C. Lee Giles &Silvija Se res (2009), “A Strategic Perspective on Search
Engines, Thoughts Candies for Practitioners and Researchers”, Journal of Interactive Marketing,
Vol 23, Issue 1, Pages 49-60
[11] William B. Frakes and Ricardo Baeza-Yates (2004), “Information Retrieval Data Structures and
Algorithms”, Prentice Hall, ISBN-10: 0134638379, ISBN-13: 978-0134638379
[12] F. Yamout and M. Makary (2019), “Improving Search Engines By Demoting Non-Relevant
Documents”, International Journal on Natural Language Computing (IJNLC), Vol. 8, No 4
[13] Ab. Jain, A. Jain, N. Chauhan, V. Singh, N. Thakur, “ Information Retrieval using Cosine and
Jaccard Similarity Measures in Vector Space Model “, Vol. 164, No 6, April 2017.
AUTHOR
Kamel Assaf is a programmer in the I.T. department for a company in Edmonton, Ab,
Canada. He has completed his M.Sc. in Computer Science from the Lebanese
International University. His research interest lies in the area of Information Retrieval,
Data Mining and Computer Graphics