In this paper, we presented a method to retrieve documents containing unstructured text written in different languages. Unlike ordinary document retrieval systems, the proposed system can also process queries whose terms span more than one language. Unicode, the universally accepted encoding standard, is used to represent the data on a common platform while converting the text into the Vector Space Model. We obtained notable F-measure values in the experiments, irrespective of the languages used in the documents and queries.
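A minimal sketch (not the authors' implementation) of the idea above: Unicode gives terms from any language one canonical form, so mixed-language documents and queries can share a single vector space. The sample texts below are hypothetical.

```python
import unicodedata
from collections import Counter

docs = [
    "information retrieval of documents",
    "recuperación de información de documentos",   # Spanish
    "retrieval of información in two languages",   # mixed-language text
]

def tokenize(text):
    # NFC normalization maps each term to one canonical Unicode form, so
    # identical words always hit the same vector dimension.
    return unicodedata.normalize("NFC", text).casefold().split()

vocab = sorted({t for d in docs for t in tokenize(d)})
index = {t: i for i, t in enumerate(vocab)}

def to_vector(text):
    v = [0.0] * len(vocab)
    for term, freq in Counter(tokenize(text)).items():
        if term in index:                           # ignore out-of-vocabulary terms
            v[index[term]] = float(freq)
    return v

query_vector = to_vector("información retrieval")   # a query mixing two languages
```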
The document describes latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA represents documents as random mixtures over latent topics, characterized by distributions over words. It is a three-level hierarchical Bayesian model where documents are generated by first sampling a per-document topic distribution from a Dirichlet prior, then repeatedly sampling topics and words from these distributions. LDA addresses limitations of previous models by capturing statistical structure within and between documents through the hierarchical Bayesian formulation.
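The three-level generative process just described can be sketched directly; the sizes and hyperparameters below (K, V, alpha, beta) are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, alpha, beta = 3, 20, 0.1, 0.01           # topics, vocabulary size, priors
phi = rng.dirichlet([beta] * V, size=K)        # per-topic word distributions

def generate_document(n_words):
    theta = rng.dirichlet([alpha] * K)         # per-document topic mixture
    doc = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)             # sample a topic for this position
        w = rng.choice(V, p=phi[z])            # sample a word from that topic
        doc.append(w)
    return doc

corpus = [generate_document(50) for _ in range(10)]   # a tiny synthetic corpus
```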
HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE... (ijseajournal)
With the emergence of XML as the de facto format for storing and exchanging information over the Internet, the search for ever more innovative and effective techniques for querying XML is a major, ongoing concern of the XML database community. Most studies addressing this problem are oriented towards the evaluation of so-called exact queries which, unfortunately (especially in the case of semi-structured documents), tend to yield either abundant results (for vague queries) or empty results (for very precise queries). Starting from the observation that users are not necessarily interested in all possible solutions, but rather in those closest to their needs, an important field of research has opened up on the evaluation of preference queries. In this paper, we propose an approach for evaluating such queries when the preferences concern the structure of the document. The solution revolves around an evaluation plan in three phases: rewriting, evaluation, and merging. The rewriting phase obtains, through a partitioning-transformation operation on the initial query, a hierarchical set of preference path queries, which are holistically evaluated in the second phase by an instrumented version of the TwigStack algorithm. The merge phase synthesizes the best results.
This document summarizes a research paper that introduces a novel multi-viewpoint similarity measure for clustering text documents. The paper begins with background on commonly used similarity measures like Euclidean distance and cosine similarity. It then presents the novel multi-viewpoint measure, which considers multiple viewpoints (objects not assumed to be in the same cluster) rather than a single viewpoint. The paper proposes two new clustering criterion functions based on this measure and compares them to other algorithms on benchmark datasets. The goal is to develop a similarity measure and clustering methods that provide high-quality, consistent performance like k-means but can better handle sparse, high-dimensional text data.
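A rough sketch of the multi-viewpoint idea (the paper's exact measure may differ): similarity between two documents is taken relative to many reference documents assumed to lie outside their cluster, then averaged.

```python
import numpy as np

def multi_viewpoint_similarity(di, dj, viewpoints):
    """di, dj: document vectors; viewpoints: vectors assumed outside their cluster."""
    sims = []
    for dh in viewpoints:
        a, b = di - dh, dj - dh                # view both documents from dh
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        if denom > 0:
            sims.append((a @ b) / denom)       # cosine as seen from viewpoint dh
    return float(np.mean(sims)) if sims else 0.0
```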
Farthest Neighbor Approach for Finding Initial Centroids in K-Means (Waqas Tariq)
Text document clustering is gaining popularity in the knowledge discovery field for effectively navigating, browsing and organizing large amounts of textual information into a small number of meaningful clusters. Text mining is a semi-automated process of extracting knowledge from voluminous unstructured data, and clustering is a widely studied data mining problem in the text domain. Clustering is an unsupervised learning method that aims to find groups of similar objects in the data with respect to some predefined criterion. In this work we propose a variant method for finding initial centroids: whereas partitioning-based clustering algorithms traditionally choose the initial centroids randomly, the proposed method chooses them using farthest neighbors. The accuracy of the clusters and the efficiency of partition-based clustering algorithms depend on the initial centroids chosen. In the experiments, the k-means algorithm is applied with initial centroids chosen using farthest neighbors. Our experimental results show that the accuracy of the clusters and the efficiency of the k-means algorithm are improved compared to the traditional way of choosing initial centroids.
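A minimal sketch of farthest-neighbor seeding for k-means (names are illustrative, not from the paper): each new centroid is the point farthest from all centroids chosen so far, which spreads the seeds across the data.

```python
import numpy as np

def farthest_neighbor_centroids(X, k, rng=None):
    rng = rng or np.random.default_rng(0)
    centroids = [X[rng.integers(len(X))]]      # first seed: a random point
    while len(centroids) < k:
        # each point's distance to its nearest already-chosen centroid
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[int(np.argmax(d))]) # farthest point becomes the next seed
    return np.array(centroids)

X = np.random.default_rng(1).random((200, 10))  # toy document vectors
seeds = farthest_neighbor_centroids(X, k=5)     # pass these to k-means as init
```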
This document discusses probabilistic models used for text mining. It introduces mixture models, Bayesian nonparametric models, and graphical models including Bayesian networks, hidden Markov models, Markov random fields, and conditional random fields. It provides details on the general framework of mixture models and examples like topic models PLSA and LDA. It also discusses learning algorithms for probabilistic models like EM algorithm and Gibbs sampling.
This document provides an introduction and overview of 5 papers related to topic modeling techniques. It begins with introducing the speaker and their research interests in text analysis using topic modeling. It then lists the 5 papers that will be discussed: LSA, pLSI, LDA, Gaussian LDA, and criticisms of topic modeling. The document focuses on summarizing each paper's motivation, key points, model, parameter estimation methods, and deficiencies. It provides high-level summaries of key aspects of influential topic modeling papers to introduce the topic.
Paper presentation for the final course Advanced Concepts in Machine Learning.
The paper is "Topic Modeling using Topics from Many Domains, Lifelong Learning and Big Data":
http://jmlr.org/proceedings/papers/v32/chenf14.pdf
This document describes the topicmodels package in R, which provides tools for fitting topic models to text data. The package interfaces with existing C/C++ code for fitting LDA and CTM topic models using either variational EM or Gibbs sampling algorithms. It builds on the tm package to preprocess text into a document-term matrix. The topicmodels package allows fitting different topic model types with different estimation methods and provides functions for model selection and analyzing fitted models.
Topic modeling is a technique for discovering hidden semantic patterns in large document collections. It represents documents as probability distributions over latent topics, where each topic is characterized by a distribution over words. Two common probabilistic topic models are latent Dirichlet allocation (LDA) and probabilistic latent semantic analysis (pLSA). LDA assumes each document exhibits multiple topics in different proportions, with topics modeled as distributions over words. Topic modeling provides dimensionality reduction and can be applied to problems like text classification, collaborative filtering, and computer vision tasks like image classification.
Comparative study of classification algorithm for text based categorization (eSAT Journals)
Abstract
Text categorization is a process in data mining which assigns predefined categories to free-text documents using machine learning techniques. Any document in the form of text, image, music, etc. can be classified using some categorization technique. It provides conceptual views of the collected documents and has important applications in the real world. Text-based categorization is used for document classification with pattern recognition and machine learning. This paper studies a number of classification algorithms for classifying documents, such as the Naive Bayes algorithm, K-Nearest Neighbor, and Decision Tree, and presents a comparative study of their advantages and disadvantages.
Keywords: Data Mining, Text Mining, Text Categorization, Machine Learning, Pattern Analysis, Naive Bayes, KNN, Decision Tree.
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL... (ijnlc)
The tremendous increase in the number of available research documents impels researchers to propose topic models to extract the latent semantic themes of a document collection. How to extract the hidden topics of a document collection has become a crucial task for many topic model applications, and conventional topic modeling approaches suffer from scalability problems as the size of the collection increases. In this paper, the Correlated Topic Model (CTM) with a variational Expectation-Maximization algorithm is implemented in the MapReduce framework to solve the scalability problem. The proposed approach uses a dataset crawled from a public digital library, and the full texts of the crawled documents are analysed to enhance the accuracy of MapReduce CTM. Experiments are conducted to demonstrate the performance of the proposed algorithm; the evaluation shows that the proposed approach has comparable performance, in terms of topic coherence, with LDA implemented in the MapReduce framework.
The document describes the Correlated Topic Model (CTM), which addresses a limitation of LDA and other topic models by directly modeling correlations between topics. CTM uses a logistic normal distribution over topic proportions instead of a Dirichlet, allowing for covariance structure between topics. This provides a more realistic model of latent topic structure where presence of one topic may be correlated with another. Variational inference is used to approximate posterior inference in CTM. The model is shown to provide a better fit than LDA on a corpus of journal articles.
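A sketch of the contrast just described, with illustrative values: LDA draws topic proportions from a Dirichlet (no covariance between topics), while CTM draws from a logistic normal whose covariance matrix Sigma can encode topic correlations.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3

theta_lda = rng.dirichlet(np.ones(K))          # LDA: Dirichlet proportions

mu = np.zeros(K)
Sigma = np.array([[1.0,  0.8, -0.5],
                  [0.8,  1.0, -0.4],
                  [-0.5, -0.4, 1.0]])          # topics 0 and 1 tend to co-occur
eta = rng.multivariate_normal(mu, Sigma)       # CTM: Gaussian draw in log-odds space
theta_ctm = np.exp(eta) / np.exp(eta).sum()    # logistic map onto the simplex
```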
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION (IJDKP)
This article introduces some approaches for improving text categorization models by integrating previously imported ontologies. From the Reuters Corpus Volume I (RCV1) dataset, categories very similar in content and related to the telecommunications, Internet, and computer areas were selected for model experiments. Several domain ontologies covering these areas were built and integrated into the categorization models to improve them.
This document proposes online inference algorithms for topic models as an alternative to traditional batch algorithms. It introduces two related online algorithms: incremental Gibbs samplers and particle filters. These algorithms update estimates of topics incrementally as each new document is observed, making them suitable for applications where the document collection grows over time. The algorithms are evaluated in comparison to existing batch algorithms to analyze their runtime and performance.
This document describes a proposed concept-based mining model that aims to improve document clustering and information retrieval by extracting concepts and semantic relationships rather than just keywords. The model uses natural language processing techniques like part-of-speech tagging and parsing to extract concepts from text. It represents concepts and their relationships in a semantic network and clusters documents based on conceptual similarity rather than term frequency. The model is evaluated using singular value decomposition to increase the precision of key term and phrase extraction.
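A generic sketch of the SVD step mentioned above (toy data, not the paper's pipeline): truncating the SVD of a term-by-document matrix keeps the strongest latent dimensions, which is how SVD is typically used to sharpen key term and phrase extraction.

```python
import numpy as np

A = np.random.default_rng(0).random((100, 12))     # term-by-document matrix (toy)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3                                              # latent "concepts" to keep
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]               # best rank-k approximation of A
term_salience = np.abs(U[:, :k] * s[:k]).sum(axis=1)  # crude per-term importance
top_terms = np.argsort(term_salience)[::-1][:10]      # indices of candidate key terms
```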
Distribution Similarity based Data Partition and Nearest Neighbor Search on U... (Editor IJMTER)
Databases are built with a fixed number of fields and records, whereas an uncertain database contains a varying number of fields and records. Clustering techniques are used to group relevant records based on similarity values. Conventional similarity measures are designed to estimate the relationship between transactions with fixed attributes; uncertain data similarity is estimated using such measures with some modifications.

Clustering uncertain data is one of the essential tasks in mining uncertain data. Existing methods extend traditional partitioning clustering methods like k-means and density-based clustering methods like DBSCAN to uncertain data. Such methods cannot properly handle uncertain objects: probability distributions, which are essential characteristics of uncertain objects, have not been considered in measuring similarity between them.

In this work, customer purchase transaction data is analyzed using an uncertain data clustering scheme. A density-based clustering mechanism is used for the clustering process, but this model produces results with low accuracy. The clustering technique is therefore improved with a distribution-based similarity model for uncertain data, and a nearest neighbor search technique is applied in the distribution-based data environment. The system is implemented using Java as the front end and Oracle as the back end.
Boolean, Vector Space Retrieval Models (Primya Tamil)
The document discusses various information retrieval models including Boolean, vector space, and probabilistic models. It provides details on how documents and queries are represented and compared in the vector space model. Specifically, it explains that in this model, documents and queries are represented as vectors of term weights in a multi-dimensional space. The similarity between a document and query vector is calculated using measures like the inner product or cosine similarity to retrieve and rank documents.
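A compact sketch of the comparison just described: documents and the query become term-weight vectors (raw term frequencies here, for brevity) and are ranked by cosine similarity. The three documents are a standard toy example.

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["shipment of gold damaged in a fire",
        "delivery of silver arrived in a silver truck",
        "shipment of gold arrived in a truck"]
query = "gold silver truck"

vectors = [Counter(d.split()) for d in docs]
qv = Counter(query.split())
ranking = sorted(range(len(docs)),
                 key=lambda i: cosine(vectors[i], qv), reverse=True)
print(ranking)   # document indices, best match first
```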
This document proposes improvements to domain-specific term extraction for ontology construction. It discusses issues with existing term extraction approaches and presents a new method that selects and organizes target and contrastive corpora. Terms are extracted using linguistic rules on part-of-speech tagged text. Statistical distributions are calculated to identify terms based on their frequency across multiple contrastive corpora. The approach achieves better precision in extracting simple and complex terms for computer science and biomedical domains compared to existing methods.
This document summarizes the agenda and key topics for a CIS 890 project final presentation on topic modelling with LDA. The presentation will cover LDA modelling, HMM-LDA modelling, LDA with collocations modelling, and experimental results on the NIPS collection. It will discuss topic modelling approaches like LDA, discriminative vs. generative methods, and limitations of bag-of-words assumptions.
The Search of New Issues in the Detection of Near-duplicated Documents (ijceronline)
International Journal of Computational Engineering Research (IJCER) is an international, English-language monthly online journal. It publishes original research work that contributes significantly to furthering scientific knowledge in engineering and technology.
International Journal of Engineering Research and Development (IJERD) (IJERD Editor)
International Journal of Engineering Research and Development is an international premier peer reviewed open access engineering and technology journal promoting the discovery, innovation, advancement and dissemination of basic and transitional knowledge in engineering, technology and related disciplines.
The document discusses using word sense disambiguation (WSD) in concept identification for ontology construction. It describes implementing an approach that forms concepts from terms by meeting certain criteria, such as having an intentional definition and instances. WSD is needed to identify the sense of terms related to the domain when forming concepts. The Lesk algorithm is discussed as one method for WSD and concept disambiguation, involving calculating similarity between terms and WordNet senses. Evaluation shows the approach identified domain-specific concepts with reasonable precision and recall compared to other methods. Choosing the best WSD algorithm depends on factors like the problem nature and performance metrics.
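A sketch of the simplified Lesk algorithm discussed above: choose the sense whose gloss shares the most words with the surrounding context. The glosses below are hypothetical stand-ins for WordNet definitions.

```python
def simplified_lesk(context_words, senses):
    """senses: mapping of sense id -> gloss text."""
    context = {w.lower() for w in context_words}
    best_sense, best_overlap = None, -1
    for sense_id, gloss in senses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:             # keep the highest-overlap sense
            best_sense, best_overlap = sense_id, overlap
    return best_sense

senses = {   # toy glosses for the word "bank"
    "bank.n.01": "a financial institution that accepts deposits of money",
    "bank.n.02": "sloping land beside a body of water such as a river",
}
print(simplified_lesk("I deposited my money at the bank".split(), senses))
```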
Different Similarity Measures for Text Classification Using KNN (IOSR Journals)
This document summarizes research on classifying textual data using the k-nearest neighbors (KNN) algorithm and different similarity measures. It explores generating 9 different vector representations of text documents and using KNN with similarity measures like Euclidean, Manhattan, squared Euclidean, etc. to classify documents. The researchers tested KNN on a Reuters news corpus with 5,485 training documents across 8 classes and found that normalization and k=4 produced the best accuracy of 94.47%. They conclude KNN with different similarity measures and vector representations is effective for multi-class text classification.
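A small sketch of the setup just described: KNN classification with pluggable distance measures and a majority vote. The vectors and labels are toy data, not the Reuters corpus; the default k=4 echoes the paper's best-performing value.

```python
import math
from collections import Counter

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def knn_classify(query, examples, k=4, dist=euclidean):
    """examples: list of (vector, label) pairs."""
    nearest = sorted(examples, key=lambda ex: dist(query, ex[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1.0, 0.0), "earn"), ((0.9, 0.1), "earn"),
         ((0.1, 1.0), "trade"), ((0.0, 0.9), "trade")]
print(knn_classify((0.8, 0.2), train, k=3, dist=manhattan))   # -> "earn"
```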
This document summarizes a presentation on using string kernels for text classification. It introduces text classification and the challenge of representing text documents as feature vectors. It then discusses how kernel methods can be used as an alternative, by mapping documents into a feature space without explicitly extracting features. Different string kernel algorithms are described that measure similarity between documents based on common subsequences of characters. The document evaluates the performance of these kernels on a text dataset and explores ways to improve efficiency, such as through kernel approximation.
The document discusses two main types of retrieval models: Boolean models which use set theory and vector space models which use statistical and algebraic approaches. Vector space models represent documents and queries as vectors of keywords weighted by factors like term frequency and inverse document frequency. Similarity between document and query vectors is calculated using measures like the inner product or cosine similarity to retrieve and rank documents.
Topic modeling using big data analytics can analyze large datasets. It involves installing Hadoop on multiple nodes for distributed processing, preprocessing data into a desired format, and using modeling tools to parallelize computation and select algorithms. Topic modeling identifies patterns in corpora to develop new ways to search, browse, and summarize large text archives. Tools like Mallet use algorithms like LDA and PLSI to achieve topic modeling on Hadoop, applying it to analyze news articles, search engine rankings, genetic and image data, and more.
Text Categorization Using Improved K Nearest Neighbor Algorithm (IJTET Journal)
Abstract— Text categorization is the process of identifying and assigning a predefined class to which a document belongs. A wide variety of algorithms are currently available to perform text categorization. Among them, the K-Nearest Neighbor text classifier is the most commonly used: it tests the degree of similarity between a document and k training samples, thereby determining the category of test documents. In this paper, an improved K-Nearest Neighbor algorithm for text categorization is proposed. In this method, text is categorized into different classes based on the K-Nearest Neighbor algorithm and constrained one-pass clustering, which provides an effective strategy for categorizing text and improves the efficiency of the K-Nearest Neighbor algorithm by generating the classification model. Text classification using the K-Nearest Neighbor algorithm has a wide variety of text mining applications.
The document discusses text categorization, which involves assigning categories or topics to documents. It covers key aspects of text categorization including definitions, applications, document representation, feature selection, dimensionality reduction, knowledge engineering and machine learning approaches. Specific classification algorithms discussed include naïve Bayes, Bayesian logistic regression, decision trees, decision rules, and more. The document provides details on how these algorithms work and their advantages/disadvantages for text categorization tasks.
Writers can market themselves through various methods including leaflets, print media, websites, blogging, and social networking sites. A survey found that 50% of people felt social networking sites were the best way for a new business to advertise as they provide worldwide reach and easy accessibility. While traditional print advertising is declining due to increased online options, print still targets specific audiences. Websites are also effective but require promotion so people know how to find the site. The most successful promotions utilize multiple methods rather than relying on just one.
This document summarizes a study that examined the effects of electronic textbook-aided remedial teaching on the learning outcomes of junior high school students in Taiwan with low academic achievement in optics. 92 grade-8 students participated in the study. Students scoring in the bottom 35% on an optics test were assigned either to an experimental group that received remedial teaching using electronic textbooks or to a control group that received traditional teaching; both groups took the test before and after teaching. The experimental group scored significantly higher after electronic-textbook teaching than before, and the control group also improved with traditional teaching. However, the experimental group scored significantly higher than the control group after teaching, indicating that electronic textbooks improved learning outcomes more than traditional teaching for these low-achieving students.
Improving initial generations in PSO algorithm for transportation network des... (ijcsit)
The Transportation Network Design Problem (TNDP) aims to select the best project set among a number of new projects. Recently, metaheuristic methods have been applied to solve the TNDP in the sense of finding better solutions sooner. PSO, as a metaheuristic method, is based on stochastic optimization and is a parallel evolutionary computation technique. The PSO system is initialized with a number of random solutions and searches for the optimal solution by improving generations. This paper studies the behavior of PSO with respect to improving the initial generation and the fitness value domain to find better solutions in comparison with previous attempts.
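A generic PSO sketch (toy objective, not the TNDP) illustrating the point above: the initial generation is stratified across the search space instead of purely uniform, and the usual velocity/position updates then refine it.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, iters = 20, 4, 100
lo, hi = -5.0, 5.0

# Stratified (Latin-hypercube-style) initial positions: one sample per bin
# along each dimension, with bins shuffled independently per dimension.
strata = (np.arange(n)[:, None] + rng.random((n, dim))) / n
for d in range(dim):
    strata[:, d] = strata[rng.permutation(n), d]
pos = lo + strata * (hi - lo)
vel = np.zeros_like(pos)

def f(x):                                     # stand-in fitness (minimize)
    return np.sum(x ** 2, axis=-1)

pbest, pbest_val = pos.copy(), f(pos)
gbest = pbest[np.argmin(pbest_val)]
w, c1, c2 = 0.7, 1.5, 1.5                     # inertia and attraction weights
for _ in range(iters):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, lo, hi)
    vals = f(pos)
    improved = vals < pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[np.argmin(pbest_val)]

print(gbest, f(gbest))                        # best solution found
```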
This document provides tips and sample answers for common interview questions for an HR generalist position. It discusses how to answer questions about yourself, your strengths, career goals, reasons for leaving previous jobs, weaknesses, and knowledge of the organization. For each question, it offers a step-by-step approach and emphasizes connecting your experiences to the employer's needs, providing evidence for your strengths, and avoiding negative responses. Sample answers are provided for each question to demonstrate effective responses.
Handover management scheme in LTE FEMTOCELL networks (ijcsit)
This document discusses handover management in LTE femtocell networks. It presents the architecture of LTE femtocell networks and investigates different handover scenarios, particularly macrocell to femtocell handover which is difficult due to the large number of candidate femtocells. The document proposes using the HeNB Policy Function entity to optimize handover decision making by selecting the target femtocell based on constraints to make the optimal decision and reduce unnecessary handovers. An analytical model is also presented to evaluate handover signalling costs.
This document provides tips and sample answers for common interview questions for lawyers. It discusses how to answer questions about yourself, your strengths, career goals, reasons for leaving previous jobs, weaknesses, knowledge of the organization, and ways you've improved your legal knowledge. For each question, it offers steps and guidelines for effective answers, including giving relevant background, connecting your experience to the role, and providing evidence without criticizing past employers or colleagues. Sample answers are provided for questions about strengths, career goals, reasons for leaving a job, knowledge of the organization, and professional development.
ANALYSIS OF ELEMENTARY CELLULAR AUTOMATA BOUNDARY CONDITIONS (ijcsit)
We present the findings of an analysis of elementary cellular automata (ECA) boundary conditions. Both fixed and variable boundaries are attempted. The outputs of linear feedback shift registers (LFSRs) act as continuous inputs to the two boundaries of a one-dimensional (1-D) elementary cellular automaton, and the results are analyzed and compared. The results show superior randomness features, and the output string has passed the Diehard statistical battery of tests. The design has strong correlation immunity and is inherently amenable to VLSI implementation. Therefore it can be considered a good and viable candidate for parallel pseudo-random number generation.
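A sketch of the construction described above, with illustrative rule and register choices: a 1-D elementary CA (rule 30 here) whose two boundary cells are driven each step by the output bits of two small Fibonacci LFSRs.

```python
def lfsr_step(state, taps, nbits):
    """Return (next_state, output_bit) for a Fibonacci LFSR."""
    fb = 0
    for t in taps:
        fb ^= (state >> t) & 1                 # XOR of the tap bits
    state = ((state << 1) | fb) & ((1 << nbits) - 1)
    return state, fb

def eca_step(cells, rule, left, right):
    padded = [left] + cells + [right]          # boundary bits come from the LFSRs
    return [(rule >> ((padded[i-1] << 2) | (padded[i] << 1) | padded[i+1])) & 1
            for i in range(1, len(padded) - 1)]

cells = [0] * 31
cells[15] = 1                                  # single seed cell in the middle
s1, s2 = 0b1011, 0b0110                        # nonzero LFSR seeds (illustrative)
for _ in range(16):
    s1, lbit = lfsr_step(s1, taps=(3, 2), nbits=4)   # feeds the left boundary
    s2, rbit = lfsr_step(s2, taps=(3, 0), nbits=4)   # feeds the right boundary
    cells = eca_step(cells, rule=30, left=lbit, right=rbit)
    print("".join("#" if c else "." for c in cells))
```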
Portraying Indonesia's Media Power: Election and Political Control (Hijjaz Sutriadi)
A 5-minute presentation for the Media and Information Discussion Group (DG-8) of the 41st Ship for Southeast Asian and Japanese Youth Programme (SSEAYP).
This document provides lecture notes on information retrieval systems. It covers key concepts like precision and recall, different retrieval strategies including vector space model and probabilistic models, and retrieval utilities. The vector space model represents documents and queries as vectors in a shared space and calculates similarity using cosine similarity. Probabilistic models assign probabilities to terms and documents and estimate relevance probabilities. The notes discuss term weighting schemes, inverted indexes to improve efficiency, and integrating structured data with text retrieval. The overall objective is for students to learn fundamental models and techniques for information storage and retrieval.
A new approach to achieve the users’ habitual opportunities on social media (IAESIJAI)
The data generated from social media is very large, yet it has not been fully exploited to produce new knowledge. One thing that can become new knowledge is user habits on social media. User habits on Twitter can be discovered from user tweets by means of a model: after the data has been preprocessed, the words are ranked and then checked against a dictionary, and the model estimates the chance that a ranked word appears in the dictionary. The general benefit of the model is to gain an understanding of the mechanism behind the problem so that it can predict events arising from a phenomenon, which in this case is user habits. With this model available, it becomes possible to capture the users' habitual opportunities on the Twitter social media platform.
A number of benefits have been reported for computer-based assessments over traditional paper-based exams, in terms of IT support for question development, reduced distribution and test administration costs, and the possibility of automated support for ranking. However, existing computerized assessment systems do not support all kinds of questions, namely open-ended questions that require written solutions. To overcome these challenges, the objective of this work is to build an intelligent evaluation system (IES) that responds to the problems identified and adapts to the different types of questions, especially open-ended questions whose answers require writing sentences or program code.
The document discusses various techniques for information retrieval and language modeling approaches to IR, including:
- Clustering documents into similar groups to aid in retrieval
- Using term frequency-inverse document frequency (TF-IDF) to measure word importance in documents
- Language models that represent documents and queries as probability distributions over words
- Smoothing language models to address data sparsity issues (a short sketch follows this list)
- Cluster-based scoring methods that incorporate information from query-relevant document clusters
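As a sketch of the two language-modeling items above (standard IR practice, not code from these notes): each document is treated as a unigram language model, Dirichlet smoothing backs off to collection statistics for sparse terms, and documents are ranked by the log-likelihood of the query.

```python
import math
from collections import Counter

docs = ["stock market news today", "sports news and scores",
        "stock prices fall sharply", "weather report and news"]
tokenized = [d.split() for d in docs]
collection = Counter(t for doc in tokenized for t in doc)
coll_len = sum(collection.values())
mu = 10.0                                      # Dirichlet smoothing parameter (toy value)

def log_query_likelihood(query_terms, doc_tokens):
    tf, dl = Counter(doc_tokens), len(doc_tokens)
    score = 0.0
    for term in query_terms:
        p_coll = collection[term] / coll_len   # collection (background) model
        if p_coll == 0:
            continue                           # term absent from the whole collection
        score += math.log((tf[term] + mu * p_coll) / (dl + mu))
    return score

query = "stock news".split()
ranking = sorted(range(len(docs)),
                 key=lambda i: log_query_likelihood(query, tokenized[i]),
                 reverse=True)
print(ranking)                                 # document indices, best first
```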
Convincing a customer is always considered a challenging task in every business, but when it comes to online business this task becomes even more difficult. Online retailers try everything possible to gain the trust of the customer. One solution is to provide an area for existing users to leave their comments. This service can effectively develop customer trust; however, customers normally comment on the product in their native language using Roman script. When there are hundreds of comments, this makes it difficult even for native customers to make a buying decision. This research proposes a system which extracts comments posted in Roman Urdu, translates them, finds their polarity, and then gives the rating of the product. This rating will help native and non-native customers make buying decisions efficiently from comments posted in Roman Urdu.
Spatial databases have become more and more popular in recent years, with growing commercial and research interest in location-based search over spatial data. Spatial keyword search has been well studied for years due to its importance to commercial search engines. Specifically, a spatial keyword query takes a user location and user-supplied keywords as arguments and returns objects that are spatially and textually relevant to those arguments. Geo-textual indices play an important role in spatial keyword querying, and a number of them have been proposed in recent years, mainly combining the R-tree and its variants with the inverted file. This paper proposes a new index structure that combines a k-d tree and an inverted file for spatial range keyword queries, based on the objects most spatially and textually relevant to the query point within a given range.
Document ranking using QPRP with concept of multi-dimensional subspace (Prakash Dubey)
This presentation discusses a project titled "Document Ranking Using QPRP with Concept of Multi-Dimensional Subspace". It was presented by Prakash Kumar Dubey and guided by Mr. Sourish Dhar and Mr. Bhagaban Swain of the Department of IT. The presentation provides an overview of the project, including an introduction to information retrieval, classical IR models such as Boolean, vector space, and probabilistic models. It then discusses quantum probability and how it can be applied to document ranking. The presentation outlines the proposed solution, data collection and implementation, and concludes with future work.
Testing Different Log Bases for Vector Model Weighting Technique (kevig)
Information retrieval systems retrieve relevant documents based on a query submitted by the user. The documents are initially indexed, and the words in the documents are assigned weights using the TF-IDF weighting technique, the product of Term Frequency (TF) and Inverse Document Frequency (IDF). TF is the number of occurrences of a term in a document; IDF measures whether the term is common or rare across all documents, and is computed by dividing the total number of documents in the system by the number of documents containing the term and then taking the logarithm of the quotient. By default, base 10 is used for the logarithm. In this paper, we test this weighting technique using a range of log bases from 0.1 to 100.0 to calculate the IDF, in order to highlight how the system performs at different weighting values. We use the documents of the MED, CRAN, NPL, LISA, and CISI test collections, which were assembled explicitly for experiments on information retrieval systems.
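A small sketch of the experiment above: computing TF-IDF while varying the logarithm base used in the IDF. The documents are toy examples, not the MED/CRAN/NPL/LISA/CISI collections.

```python
import math
from collections import Counter

docs = [["gold", "truck"], ["silver", "truck"], ["gold", "shipment"]]
N = len(docs)
df = Counter(t for d in docs for t in set(d))  # document frequency per term

def tfidf(doc, term, base):
    tf = doc.count(term)
    # log_base(N / df) = ln(N / df) / ln(base); the base is the tested knob
    return tf * math.log(N / df[term]) / math.log(base)

for base in (0.5, 2.0, 10.0, 100.0):           # base 1.0 is undefined (ln 1 = 0)
    weights = {t: round(tfidf(docs[0], t, base), 3) for t in set(docs[0])}
    print(base, weights)
```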
IRJET - Document Comparison based on TF-IDF Metric (IRJET Journal)
This document discusses comparing documents based on the TF-IDF metric and cosine similarity. It begins by representing documents as vectors of terms weighted by TF-IDF. Cosine similarity is then used to measure the similarity between document vectors, with values ranging from 0 (completely dissimilar) to 1 (identical). The document demonstrates this approach on 5 sample documents from different domains, showing their pairwise cosine similarities. Comparing documents based on TF-IDF and cosine similarity allows analyzing relationships between documents in large corpora.
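A sketch of this pipeline using scikit-learn (assumed available): TF-IDF vectors plus pairwise cosine similarity, which lies in [0, 1] for these non-negative vectors. The documents are toy stand-ins for the paper's samples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the market rallied on strong quarterly earnings",
        "the team won the championship game last night",
        "earnings growth lifted the stock market again"]

tfidf = TfidfVectorizer().fit_transform(docs)  # documents x terms matrix
sim = cosine_similarity(tfidf)                 # pairwise similarity matrix
print(sim.round(2))                            # diagonal is 1.0 (identical docs)
```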
This document summarizes an article from the International Journal of Computer Engineering and Technology (IJCET). It discusses clustering algorithms for automatically organizing text documents into meaningful groups.
It begins by introducing the journal and describing document clustering as a technique to organize large amounts of text data. Then, it reviews four prominent clustering algorithms - cliques, single linkage, stars, and connected components - that use similarity metrics to group documents based on a term-by-document matrix.
The document analyzes the performance of these four algorithms on a document corpus, comparing their clustering results and processing times to determine documents' inherent groupings in an efficient manner.
ONTOLOGY BASED DOCUMENT CLUSTERING USING MAPREDUCE (ijdms)
Nowadays, document clustering is considered a data-intensive task due to the dramatic, fast increase in the number of available documents, and the feature sets that represent those documents are also very large. The most common method for representing documents is the vector space model, which represents document features as a bag of words and does not represent semantic relations between words. In this paper we introduce a distributed implementation of bisecting k-means using the MapReduce programming model, with the aim of clustering data-intensive document collections. In addition, we propose integrating the WordNet ontology with bisecting k-means in order to exploit the semantic relations between words to enhance document clustering results. Our experimental results show that using lexical categories for nouns only enhances the internal evaluation measures of document clustering and reduces the document features from thousands to tens of features. Our experiments were conducted using Amazon Elastic MapReduce to deploy the bisecting k-means algorithm.
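A compact single-machine sketch of bisecting k-means (the paper's MapReduce distribution and WordNet integration are omitted): repeatedly split the largest cluster in two with plain 2-means until k clusters remain.

```python
import numpy as np

def two_means(X, iters=20, rng=None):
    rng = rng or np.random.default_rng(0)
    c = X[rng.choice(len(X), size=2, replace=False)].astype(float)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - c[None]) ** 2).sum(-1), axis=1)
        for j in (0, 1):
            if (labels == j).any():
                c[j] = X[labels == j].mean(axis=0)
    return labels

def bisecting_kmeans(X, k):
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        members = clusters.pop(i)              # always split the largest cluster
        labels = two_means(X[members])
        clusters += [members[labels == 0], members[labels == 1]]
    return clusters

X = np.random.default_rng(1).random((60, 5))   # toy document vectors
print([len(c) for c in bisecting_kmeans(X, 4)])
```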
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac... (IRJET Journal)
This paper proposes a method to mine users' rare sequential topic patterns (URSTPs) from tweet data. It involves preprocessing tweets to extract topics, identifying user sessions, generating sequential topic pattern (STP) candidates, and selecting URSTPs based on rarity analysis. Experiments show the approach can identify special users and interpretable URSTPs, indicating users' characteristics. The paper aims to capture personalized and abnormal user behaviors through sequential relationships between extracted topics from successive tweets.
This document summarizes a research paper that proposes a method to improve web image search results by re-ranking the initial text-based search results using visual similarity measures. The researchers develop an adaptive visual similarity approach that categorizes the query image and then uses a specific similarity measure tailored to each category to re-rank the images based on visual features. They test their method on search results from Google Image Search and Microsoft Live Image Search and find it effectively improves search accuracy by better incorporating visual content into the ranking.
A Document Exploring System on LDA Topic Model for Wikipedia Articlesijma
A large amount of digital text information is generated every day. Effectively searching, managing and exploring this text data has become a main task. In this paper, we first present an introduction to text mining and the LDA topic model. Then we explain in detail how to apply the LDA topic model to a text corpus by doing experiments on Simple Wikipedia documents. The experiments include all necessary steps: data retrieval, pre-processing, fitting the model, and an application of a document exploring system. The results of the experiments show the LDA topic model working effectively on document clustering and finding similar documents. Furthermore, the document exploring system could be a useful research tool for students and researchers.
NOVELTY DETECTION VIA TOPIC MODELING IN RESEARCH ARTICLES cscpconf
In today's world, redundancy is the most vital problem faced in almost all domains. Novelty detection is the identification of new or unknown data or signals that a machine learning system is not aware of during training. The problem becomes more intense when it comes to research articles. A method of identifying novelty in each section of an article is highly required for determining the novel idea proposed in a research paper. Since research articles are semi-structured, detecting novelty of information from them requires more accurate systems. Topic models provide a useful means to process them and a simple way to analyze them. This work compares the most predominantly used topic model, Latent Dirichlet Allocation, with the hierarchical Pachinko Allocation Model. The results obtained are promising towards the hierarchical Pachinko Allocation Model when used for document retrieval.
Interactive Information Retrieval inspired by Quantum TheoryIngo Frommholz
Triggered by van Rijsbergen's seminal work about the geometry of information retrieval, a recent development is the utilisation of the theory of quantum mechanics and quantum probabilities as an expressive integrated framework to capture a user's context and interaction with the system. In our talk, we will briefly introduce some fundamentals and early works behind information retrieval inspired by quantum theory. We will discuss how information needs and relevance can be expressed in our interactive framework, neatly combining geometry and (quantum) probability theory. We will then outline how quantum probabilities and Hilbert spaces can be utilised to express concepts of information foraging theory as an example of modelling user interaction.
Impact of Crowdsourcing OCR Improvements on Retrievability Bias Myriam Traub
Digitized document collections often suffer from OCR errors that may impact a document’s readability and retrievability. We studied the effects of correcting OCR errors on the retrievability of documents in a historic newspaper corpus of a digital library. We computed retrievability scores for the uncorrected documents using queries from the library’s search log, and found that the document OCR character error rate and retrievability score are strongly correlated. We computed retrievability scores for manually corrected versions of the same documents, and report on differences in their total sum, the overall retrievability bias, and the distribution of these changes over the documents, queries and query terms. For large collections, often only a fraction of the corpus is manually corrected. Using a mixed corpus, we assess how this mix affects the retrievability of the corrected and uncorrected documents. The correction of OCR errors increased the number of documents retrieved in all conditions. The increase contributed to a less biased retrieval, even when taking the potential lower ranking of uncorrected documents into account.
This document provides an overview of various information retrieval models and techniques used in search engines, including:
- Boolean, vector space, probabilistic models like BM25, and language models are described as older retrieval models.
- Learning to rank uses machine learning techniques to optimize ranking functions using features and training data.
- Relevance feedback, query likelihood models, and pseudo-relevance feedback are discussed as techniques for improving retrieval effectiveness by incorporating user feedback.
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIVladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
The UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
UiPath Test Automation using UiPath Test Suite series, part 5DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with devops.
Topics covered:
CI/CD with in UiPath
End-to-end overview of CI/CD pipeline with Azure devops
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Full-RAG: A modern architecture for hyper-personalizationZilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfMalak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing is discussed in the talk. ICT and testing must carry their part of the global responsibility to help mitigate climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Zilliz
Join us to introduce Milvus Lite, a vector database that can run on notebooks and laptops, share the same API with Milvus, and integrate with every popular GenAI framework. This webinar is perfect for developers seeking easy-to-use, well-integrated vector databases for their GenAI apps.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Mind map of terminologies used in context of Generative AI
International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 4, August 2014
LANGUAGE INDEPENDENT DOCUMENT
RETRIEVAL USING UNICODE STANDARD
Vidhya M1 and Aji S2
Department of Computer Science, University of Kerala
Thiruvananthapuram, Kerala, India
ABSTRACT
In this paper, we presented a method to retrieve documents with unstructured text data written in different
languages. Apart from the ordinary document retrieval systems, the proposed system can also process
queries with terms in more than one language. Unicode, the universally accepted encoding standard is used
to present the data in a common platform while converting the text data into Vector Space Model. We got
notable F measure values in the experiments irrespective of languages used in documents and queries.
KEYWORDS
Language independent searching, Information Retrieval, Multilingual searching, Unicode, QR
Factorization, Vector Space Model.
1. INTRODUCTION
The digital world is becoming a universal storehouse of knowledge and culture, which has
allowed an unswerving sharing of ideas and information in an unpredictable rate. The efficiency
of a depository is always measured in terms of the accessibility of information, which depends on
the searching process that essentially acts as filters for the richness of information available in a
data container. The source of information or data in the depository can be any type of digital data
that have been produced or transformed into the digital format such as electronic books, articles,
music, movies, images, etc. The language is also not becoming a barrier for expressing the ideas
and thoughts in the digital form; as a result, the diversity of data is also increasing along with the
volume.
Research and findings in Information Retrieval methods that support multiple languages have a vital role in the coming years of the information era. Even though there are many findings in information retrieval mechanisms [1, 11], there is not much work on language-independent information retrieval methods. The output of the CLEF (Cross-Language Evaluation Forum) workshop [18] also points to the need for increased research in multilingual processing and retrieval.
Text is the natural way of recording the thoughts and feelings of human beings, and as a result, text data became the major component of the entire body of digital data. In traditional Information Retrieval (IR) methods [1], the text is tokenized into words and a corresponding Vector Space Model (VSM) [11] is created in the initial phase. IR algorithms such as LSI [19], QR Factorization [10], etc. are applied to retrieve the documents that are most relevant to the query given by the user. Since the steps of stemming and stop word elimination [20] are purely language dependent, the VSM formed in most works related to IR is language dependent. Naturally, this type of VSM cannot accommodate documents that are written in two or more different languages. This paper gives a new mechanism that converts a document into a VSM irrespective of the language used. A special module in our work, called AlltoOne, converts the tokens in a document, irrespective of the language used, into a Unicode representation. The query can also be specified in different languages. The results obtained in the experiments show that the proposed method can bring some improvements to existing document retrieval models.
DOI: 10.5121/ijcsit.2014.6413
2. METHODS AND PRACTICES
Information retrieval (IR) systems are intended to identify the documents that are most relevant to a particular query given by the user. Most information retrieval systems, such as search engines, use a generalized or knowledge-imparted version of the document listing problem: given a collection $D$ of text documents $d_1, d_2, \ldots, d_k$, each $d_i$ being a string over an alphabet, the document listing problem tries to find the set of all documents that contain one or more copies of a pattern string $p$ of length $m$. This version of the document listing problem is called the document mining [21] problem, which finds the set of all documents that contain $p$ at least $K$ times, where $K$ is a predefined threshold.
Formally, the output is $\{\, i \mid \text{there exist at least } K \text{ indices } j \text{ such that } d_i[j \ldots j+m-1] = p \,\}$.
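As a concrete reading of this definition, here is a minimal sketch (illustrative only, not the authors' implementation) that counts possibly overlapping occurrences and applies the threshold K:

    def occurrences(text, pattern):
        """Count (possibly overlapping) occurrences of pattern in text."""
        m = len(pattern)
        return sum(1 for j in range(len(text) - m + 1) if text[j:j + m] == pattern)

    def document_mining(docs, pattern, K):
        """Indices i such that d_i contains pattern at at least K positions j."""
        return [i for i, d in enumerate(docs) if occurrences(d, pattern) >= K]

    print(document_mining(["abababa", "abcab", "xyz"], "aba", 2))  # -> [0]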
The works and findings put forward by various database groups [7] supplement the modifications of the document mining problem. Later developments in this area have had significant impact on molecular data, which is extensively used in computational biology as well [2, 15].
2.1. Methods of Information Retrieval
The works that have evolved in IR can be classified according to the methods used to measure the relevance of the document to be retrieved for a specific query. In mathematical modeling, a method to explain the features and characteristics of a problem with mathematical equations and techniques, the works can be classified into three groups: Probabilistic Relevance Models, Probabilistic Inference Models and Similarity-based Models [22].
It is hard to measure the true relevance of a document for a given query, and the probabilistic relevance model is a mechanism to estimate it. Consider the random variables $D$, $Q$ and $R$ that represent the document $d$, the query $q$ and the relevance of $d$ for $q$. The probabilistic relevance model is used to estimate values such as $p(R = r \mid D, Q)$. There are two possible values for $R$: $r$ (relevant) and $\bar{r}$ (not relevant), which can be calculated either directly by a discriminative (regression) model or indirectly by a generative model.
The earlier works in the regression model [3] deal with features that characterize the matching of D and Q. Later, polynomial regression [4] came into the picture to approximate relevance.
generative model finds the value of R as
196
(1)
= = = =
( , | ) ( )
p D Q R r p R r
( , )
( | , )
p D Q
p R r D Q
The probabilistic inference model tries to prove that the query supplied to the system is drawn from a document in the collection. The measure of uncertainty associated with this inference is treated as the relevance of a document with respect to the query. The logic-based probabilistic inference model [5], the Boolean retrieval model and the general probabilistic inference model [6] are some of the published works on probabilistic inference models.
In Similarity-based Models, the correlation between the query and the document is treated as the measure of relevance. In order to find the correlation, the text data needs to be converted into some other common representation. The vector space model [11], an algebraic framework for processing text documents, is a common platform for finding the correlation. In the vector space model, each document is treated as a vector of frequencies of the elements (words or terms) in it. That is, each document can be represented as a point in a space of all documents in the collection. The degree of closeness of points in this space shows semantic similarity.
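To illustrate this representation, here is a small sketch (illustrative names, plain term-frequency weighting) that maps tokenized documents to frequency vectors and measures the closeness of two such points:

    from collections import Counter
    import math

    def to_vector(tokens, vocabulary):
        """A document as a vector of term frequencies over a fixed vocabulary."""
        counts = Counter(tokens)
        return [counts[term] for term in vocabulary]

    def cosine(u, v):
        """Degree of closeness of two document vectors (cosine of their angle)."""
        dot = sum(a * b for a, b in zip(u, v))
        norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norms if norms else 0.0

    docs = [["data", "retrieval", "text"], ["text", "mining", "text"]]
    vocab = sorted({t for d in docs for t in d})
    u, v = (to_vector(d, vocab) for d in docs)
    print(cosine(u, v))  # closeness of the two documents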
Figure 1: Representation of documents in Vector Space Model
The flexibility of vector space models in IR is that they can easily incorporate different indexing models [8, 9].
2.2. QR Factorization
The size of the Vector Space Model increases along with the number of unique documents in the collection. Once the high-dimensional data is mapped into a low-dimensional space, IR models can be applied effectively to retrieve the information [10]. QR factorization is one of the well-known dimension reduction techniques in Information Retrieval [13, 11]. The basis of the column space of a term-document matrix W with t unique terms and d documents can be computed through QR factorization [11]:

$W = QR$  (2)

where R is a t × d upper triangular matrix and Q is a t × t orthogonal matrix.
In order to find the dependence among the columns of W by examining the matrix Q, we can rewrite the relation column by column as

$[\, w_1 \;\; w_2 \;\; \cdots \;\; w_n \,] = [\, Q r_1 \;\; Q r_2 \;\; \cdots \;\; Q r_n \,]$  (3)

where $r_j$ denotes the $j$th column of R. The following block multiplication is used to show that the basis of W is contained in the independent columns of Q.
$W = QR = [\, Q_W \;\; Q_1 \,] \begin{pmatrix} R_W \\ 0 \end{pmatrix} = Q_W R_W + Q_1 \cdot 0 = Q_W R_W$  (4)

where $R_W$ is the non-zero part of R with $r_W = \operatorname{rank}(W)$ rows, $Q_W$ is the matrix of the first $r_W$ columns of Q, and $Q_1$ is the matrix of the remaining columns. This partitioning shows that the columns of $Q_1$ do not contribute to the value of W and that the ranks of W, R and $R_W$ are equal. Thus the columns of $Q_W$ constitute a basis for the column space of W.
The similarity between two vectors can be measured using the equation

$\cos\theta = \dfrac{D_1 \cdot D_2}{\lVert D_1 \rVert \, \lVert D_2 \rVert}$  (5)

so that the similarity measure between the $j$th document $w_j$ and the query $q$ will be

$\cos\theta_j = \dfrac{w_j^T q}{\lVert w_j \rVert \, \lVert q \rVert}$  (6)

By substituting the partitioning of equation (4), we get

$\cos\theta_j = \dfrac{(Q_W r_j)^T q}{\lVert Q_W r_j \rVert \, \lVert q \rVert}$  (7)

and by applying the properties of an orthogonal matrix we can write the cosine similarity as

$\cos\theta_j = \dfrac{r_j^T (Q_W^T q)}{\lVert r_j \rVert \, \lVert q \rVert}$  (8)
The query vector $q$ can be written as the sum of its components in the column space of W and in the orthogonal complement of that column space:

$q = Iq = QQ^T q = [\, Q_W \;\; Q_1 \,] \begin{pmatrix} Q_W^T \\ Q_1^T \end{pmatrix} q = Q_W Q_W^T q + Q_1 Q_1^T q$  (9)

It is noted that $Q_A$ is a basis for the column space of a matrix A, and $QQ^T$ is the projection matrix onto the column space of Q. Then
$q = Q_W Q_W^T q + Q_1 Q_1^T q = q_W + q_1$  (10)

where $q_W$ and $q_1$ are the projections of $q$ onto the column spaces of $Q_W$ and $Q_1$ respectively. Therefore, the properties of the projection allow us to say that $q_W$ is the best approximation of the query vector in the column space of W.
Now recall the similarity measure specified above: the similarity between the $j$th document $w_j$ and $q$ will be

$\cos\theta_j = \dfrac{w_j^T q}{\lVert w_j \rVert \, \lVert q \rVert} = \dfrac{w_j^T (q_W + q_1)}{\lVert w_j \rVert \, \lVert q \rVert} = \dfrac{w_j^T q_W + w_j^T Q_1 Q_1^T q}{\lVert w_j \rVert \, \lVert q \rVert}$  (11)

Note that $w_j$ is in the column space of W, which is the orthogonal complement of the column space of $Q_1$. Then $w_j^T Q_1 = 0$, and the formula becomes

$\cos\theta_j = \dfrac{w_j^T q_W + 0}{\lVert w_j \rVert \, \lVert q \rVert} = \dfrac{w_j^T q_W}{\lVert w_j \rVert \, \lVert q \rVert}$  (12)
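The whole derivation can be exercised numerically; the following NumPy sketch (a toy illustration under our reading of equations (2)-(8), not the authors' code) computes the cosine scores via the reduced QR factorization:

    import numpy as np

    def qr_cosine_scores(W, q):
        """Cosine similarity of each document column of W with query q,
        computed as r_j^T (Q_W^T q) / (||r_j|| ||q||), cf. equations (2)-(8)."""
        Q_W, R_W = np.linalg.qr(W, mode='reduced')  # W = Q_W R_W
        q_w = Q_W.T @ q                   # coordinates of q in the column space of W
        numerators = R_W.T @ q_w          # r_j^T (Q_W^T q) for every document j
        # ||Q_W r_j|| = ||r_j|| since the columns of Q_W are orthonormal
        denominators = np.linalg.norm(R_W, axis=0) * np.linalg.norm(q)
        return numerators / denominators

    # Toy term-document matrix (4 terms x 3 documents) and a query vector:
    W = np.array([[1., 0., 1.],
                  [0., 2., 0.],
                  [1., 1., 0.],
                  [0., 0., 3.]])
    q = np.array([1., 0., 1., 0.])
    print(qr_cosine_scores(W, q))  # highest score for the document matching q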
3. PROPOSED METHOD
Even though the meaning of the term information retrieval is very broad, the people of our digital world visualise or simplify it as the searching process in information repositories, especially in web content. Web searching has undergone remarkable improvements, but research in multi-language searching is still in its infancy. The method explained in this paper is an attempt to boost work on language-supported information retrieval. The working of our proposed method is abstracted in the block diagram of Figure 2.
As mentioned in the introduction, the AlltoOne module converts all the unique words in the documents into a Unicode representation, the universally accepted encoding scheme for symbols. The vector space model gives two outputs: the term-document matrix (TDM) and the bag of unique words in the entire collection. The TDM is a weighted matrix to which the TF-IDF weight function has been applied. The query processing module converts the query terms into a query vector with the help of the bag of words. The similarity checker measures the relevance of each document with regard to the query vector using QR factorization. Finally, the list of documents, if any, is given as output.
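The paper does not detail the internals of AlltoOne; a minimal, hypothetical sketch of the idea (mapping every token, whatever its script, to its sequence of Unicode code points so that all languages share one key space) might look like this:

    def all_to_one(token):
        """Hypothetical AlltoOne sketch: a language-independent key for a token,
        namely the sequence of Unicode code points of its characters."""
        return tuple(ord(ch) for ch in token)

    # The same mapping works for any script:
    for word in ["data", "தமிழ்", "മലയാളം"]:
        print(word, "->", [hex(cp) for cp in all_to_one(word)])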
3.1. Unicode Standard
Historically there were several coding schemes, but no two schemes were compatible. The Unicode Standard not only solved these problems but made multilingual computing a less daunting task to tackle [23]. The Unicode specification includes a huge number of different letters, symbols, characters, mathematical and musical symbols, dingbats, etc. (known collectively as 'glyphs'). Unicode provides a unique number for every glyph, no matter what the platform, no matter what the program, no matter what the language. Any software that has Unicode-enabled fonts containing the relevant glyphs can, in theory, display any of the glyphs.
The latest version of Unicode (7.0) adds 2,834 new characters. This latest version adds the new
currency symbols for the Russian ruble and Azerbaijani manat, approximately 250 emoji
(pictographic symbols), many other symbols, and 23 new lesser-used and historic scripts, as well
as character additions to many existing scripts.
The Unicode Standard has been adopted by industry leaders such as Apple, HP, IBM, Microsoft,
Oracle, SAP, Sun, etc. The emergence of the Unicode Standard and the availability of tools
supporting it, are among the most significant recent global software technology trends.
[Block diagram: Documents → AlltoOne → Vector Space Modelling → Bag of words and Term-document matrix; Query Terms → Query processing → Similarity checker → List of documents]
Figure 2: Block diagram of proposed system
Figure 3: Unicode of some letters in Tamil and Malayalam languages
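In the spirit of Figure 3, the code points assigned by the standard can be inspected directly (a small illustrative snippet using Python's unicodedata module):

    import unicodedata

    for ch in ["அ", "க", "അ", "ക"]:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
    # U+0B85  TAMIL LETTER A
    # U+0B95  TAMIL LETTER KA
    # U+0D05  MALAYALAM LETTER A
    # U+0D15  MALAYALAM LETTER KA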
3.2. Term frequency-inverse document frequency (tf-idf)
Term weighting models capture a term's ability to represent a document's content, or to distinguish it from other documents. Weighting models give more weight to surprising events and less weight to expected events [14]. The term 'inverse document frequency' [16], a simple and classical approach to term weighting, was proposed in 1972.
The TF factor has been used for term weighting for years in text analysis, especially in classification. The IDF is inversely proportional to the number of documents n to which a term is assigned in a set of N documents. A typical IDF factor is log(N / n) [17]. The best index terms to identify the contents of a document are those able to distinguish certain individual documents from the rest of the set. This implies that the best terms should have high term frequencies but low overall collection frequencies. A reasonable measure of the importance of a term can therefore be obtained by the product of term frequency and inverse document frequency (TF × IDF). Hence the weight can be derived as
$w_{ij} = tf_{ij} \times \log\left(\dfrac{n}{df_i}\right)$  (13)

where $tf_{ij}$ is the number of occurrences of term $i$ in the $j$th document, $df_i$ is the number of documents that contain the $i$th term, and $n$ is the number of documents in the entire collection.
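A direct transcription of equation (13) into code (an illustrative sketch; the names are ours, and the logarithm base, which the equation leaves unspecified, only rescales the weights):

    import math

    def tfidf(tf_ij, df_i, n):
        """w_ij = tf_ij * log(n / df_i), as in equation (13); natural log used here."""
        return tf_ij * math.log(n / df_i)

    # Term appearing 3 times in document j, present in 2 of 100 documents:
    print(tfidf(3, 2, 100))     # high weight: rare, discriminating term
    # Term appearing 5 times but present in every document:
    print(tfidf(5, 100, 100))   # weight 0: ubiquitous term carries no weight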
4. EXPERIMENT AND RESULTS
We conducted the experiments using a collection of text documents in English and two South Indian languages, Tamil and Malayalam. A collection of text documents was prepared from these documents by randomly mixing content in different languages. We prepared the documents under the condition that there should be at least 300 words and 12-15 sentences in a single document, and a minimum of 30% of the words and three sentences should be from a single language. The same sentences can appear in more than one document, so that we can directly check the retrieval process. A sample of five randomly selected documents, with statistics of words and sentences after the initial processing, is tabulated below.
Table 1: Sentence and word statistics of five random documents

Doc No. | Sentences (Eng / Tam / Mal) | Words (Eng / Tam / Mal) | Unique Words (Eng / Tam / Mal)
27      | 4 / 6 / 5                   | 86 / 138 / 142          | 69 / 121 / 118
58      | 7 / 5 / 6                   | 163 / 115 / 139         | 147 / 102 / 112
96      | 6 / 4 / 5                   | 137 / 93 / 108          | 119 / 79 / 88
115     | 8 / 5 / 4                   | 174 / 111 / 72          | 153 / 94 / 61
176     | 4 / 4 / 5                   | 95 / 90 / 83            | 87 / 69 / 72
It is also noted that 13-17 percent of the total words are duplicates, irrespective of language, and the following figure reveals the same.
[Bar chart: total number of words versus number of unique words (0-1500) for Malayalam, Tamil and English]
There are 3278 unique words in the entire collection of documents prepared for the experiments. This bag of words is used to create the VSM of the text collection and to prepare the query vector.
4.1. Results and Analysis
As noted in the initial sections of this paper, language-dependent searching is the familiar mechanism to all information seekers, while our experiments concentrate on language-independent searching. In the first phase of the experiments, the bag of words is converted into Unicode, the unique and standard representation of alphabets. The TF-IDF weight is applied to the term-document matrices of the VSM to keep the semantic relations in the documents. The query set for testing is also generated by randomly picking words, irrespective of language, from the collection of documents. There can be words from more than one language in a query. Three different sets of queries are generated for testing: queries with 3 words, queries with 7 words and queries with 12 words. A query is converted into a query vector using the bag of words identified in the VSM, and the query vector is used to identify the documents that are most relevant to the query.
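A minimal sketch of this query-processing step (illustrative only; it assumes the query terms have already been normalized by the same AlltoOne mapping as the documents):

    def query_vector(query_terms, bag_of_words):
        """Convert query terms into a vector over the collection's bag of words;
        terms that do not occur in the bag are ignored."""
        index = {term: i for i, term in enumerate(bag_of_words)}
        q = [0.0] * len(bag_of_words)
        for term in query_terms:
            if term in index:
                q[index[term]] += 1.0
        return q

    bag = ["data", "text", "தமிழ்"]
    print(query_vector(["text", "தமிழ்", "unknown"], bag))  # [0.0, 1.0, 1.0]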
The first experiment was carried out using the query with 3 words, and the result is shown below.
[Bar chart: F measure values (0 to 0.7) for Malayalam, English and Tamil]
Figure 4: F Measures of different languages with Type 1 Query
It is noticed that the experiments with all languages obtained almost the same range of F measure values [24]. We conducted the rest of the experiments with the other sets of queries, and the results obtained are abstracted in the following figure.
[Plot: Precision (0.4 to 1) versus Recall (0 to 1) curves for Type 1, Type 2 and Type 3 queries]
Figure 5: Precision-Recall curve of entire experiments
It is seen that the value of the F measure increases with the number of terms in the query, and there was no influence of words from different languages in the query. We got an F measure value of 0.664, which is a remarkable result and shows that there are enough possibilities in the area of language-independent searching using a standard encoding mechanism such as Unicode.
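For reference, the F measure reported here is the harmonic mean of precision and recall; a small sketch of its computation on one query's result set (with illustrative values):

    def f_measure(retrieved, relevant):
        """Balanced F measure: harmonic mean of precision and recall."""
        retrieved, relevant = set(retrieved), set(relevant)
        hits = len(retrieved & relevant)
        if hits == 0:
            return 0.0
        precision = hits / len(retrieved)
        recall = hits / len(relevant)
        return 2 * precision * recall / (precision + recall)

    print(f_measure(retrieved=[1, 2, 3, 4], relevant=[2, 3, 5]))  # ~0.571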
5. CONCLUSIONS
Just as humans can express their feelings without any barrier of language, they should have the facility to access information in different languages and styles. In this work, the searching process was carried out in the VSM, which was generated from the common Unicode representation. We obtained an overall F measure of 0.642 in the different experiments with words from different languages. Even though morphological operations could not be addressed in this work, the results show that the proposed method can generate a new way of thinking among researchers in IR, especially in language-independent searching.
REFERENCES
[1] Salton, G. (1971). The SMART Retrieval System: Experiments in Automatic Document Retrieval. Prentice Hall Inc., Englewood Cliffs, NJ.
[2] Gusfield, D. (1997). Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press. ISBN 0521585198.
[3] Fox, E. (1983). Extending the Boolean and Vector Space Models of Information Retrieval with P-Norm Queries and Multiple Concept Types. PhD thesis, Cornell University.
[4] Fuhr, N. and Buckley, C. (1991). A probabilistic learning approach for document indexing. ACM Transactions on Information Systems, 9(3):223-248.
[5] Van Rijsbergen, C. J. (1986). A non-classical logic for information retrieval. The Computer Journal, 29(6).
[6] Wong, S. K. M. and Yao, Y. Y. (1995). On modeling information retrieval with probabilistic inference. ACM Transactions on Information Systems, 13(1):69-99.
[7] Han, J., Lakshmanan, L. and Pei, J. (2001). Scalable Frequent-Pattern Mining Methods: An Overview. 7th ACM Conf. on Knowledge Discovery and Data Mining (KDD), Tutorial.
[8] Bookstein, A. and Swanson, D. (1975). A decision theoretic foundation for indexing. Journal of the American Society for Information Science, 26:45-50.
[9] Harter, S. P. (1975). A probabilistic approach to automatic keyword indexing (Parts I & II). Journal of the American Society for Information Science, 26:197-206 (Part I), 280-289 (Part II).
[10] Ravi Kanth, K. V., Agrawal, D., El Abbadi, A. and Singh, A. (1999). Dimensionality reduction for similarity searching in dynamic databases. Computer Vision and Image Understanding (CVIU), 75(1-2):59-72.
[11] Berry, M. W., Drmac, Z. and Jessup, E. R. (1999). Matrices, Vector Spaces and Information Retrieval. SIAM Review, 41(2):335-362.
[12] Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern Information Retrieval. Addison Wesley Longman.
[13] Schiffman, B. and McKeown, K. R. (2005). Context and Learning in Novelty Detection. HLT/EMNLP.
[14] Turney, P. D. and Pantel, P. (2010). From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research (JAIR), 37:141-188.
[15] Benson, G. and Waterman, M. (1994). A Method for Fast Database Search for All k-nucleotide Repeats. Nucleic Acids Research, 22(22).
[16] Robertson, S. (2004). Understanding Inverse Document Frequency: on Theoretical Arguments for IDF. Journal of Documentation, 60:503-520.
[17] Salton, G. and Buckley, C. (1987). Term Weighting Approaches in Automatic Text Retrieval. Technical Report TR87-881, Department of Computer Science, Cornell University. Information Processing and Management, 32(4):431-443.
[18] Information Access Evaluation: Multilinguality, Multimodality, and Visualization. 4th International Conference of the CLEF Initiative, CLEF 2013, Valencia, Spain, September 23-26, 2013. Proceedings. Lecture Notes in Computer Science, Vol. 8138.
[19] Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W. and Harshman, R. A. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science (JASIS), 41(6):391-407.
[20] Lim, H. and Kim, U. (1995). Word recognition by morphological analysis. Intelligent Information Systems, ANZIIS-95, pp. 236-241.
[21] Muthukrishnan, S. (2002). Efficient algorithms for document retrieval problems. In SODA, pp. 657-666.
[22] Zhai, C. (2008). Statistical Language Models for Information Retrieval: A Critical Review. Foundations and Trends in Information Retrieval, 2(3):137-213.
[23] Tablan, V., Ursu, C., Bontcheva, K., Cunningham, H., Maynard, D., Hamza, O., McEnery, T., Baker, P. and Leisher, M. (2002). A Unicode-based Environment for Creation and Use of Language Resources. LREC.
[24] Raghavan, V., Bollmann, P. and Jung, G. S. (1989). A critical investigation of recall and precision as measures of retrieval system performance. ACM Transactions on Information Systems, 7:205-229.