The document is a presentation on text mining with Julia. It introduces vector space models (VSM), latent semantic indexing (LSI) using singular value decomposition (SVD), and their implementation in Julia. The major steps are preprocessing text data, creating a term-document matrix, applying VSM and LSI to measure document similarity, and evaluating performance using precision and recall as the tolerance varies. LSI projects documents into lower dimensions using SVD to remove noise and better capture the latent semantic structure of the data.
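The document similarity step of the VSM can be sketched in a few lines (here in Python rather than the talk's Julia, with an invented toy corpus). The LSI step would additionally project these term vectors onto the top-k singular directions of the term-document matrix, via SVD, before comparing:

```python
import math

# Toy corpus: term counts per document (bag-of-words vector space model).
docs = {
    "d1": {"julia": 2, "text": 1, "mining": 1},
    "d2": {"julia": 1, "svd": 2, "lsi": 1},
    "d3": {"cooking": 3, "recipes": 2},
}

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Documents sharing vocabulary score higher than unrelated ones.
print(cosine(docs["d1"], docs["d2"]) > cosine(docs["d1"], docs["d3"]))  # True
```

Precision and recall can then be swept by varying the similarity threshold at which a pair of documents counts as "relevant".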
A scalable Gibbs sampler for probabilistic entity linking, by Sunny Kr
This document summarizes a research paper that proposes a scalable Gibbs sampling approach for probabilistic entity linking. The approach formulates entity linking as probabilistic inference in a topic model where each topic corresponds to a Wikipedia article. It introduces an efficient Gibbs sampling scheme that exploits the sparsity in the Wikipedia-LDA model to allow inference over millions of topics. Experimental results show it achieves state-of-the-art performance on the Aida-CoNLL dataset.
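The paper's contribution is making the Gibbs update fast when the number of topics reaches the millions by exploiting sparsity; the dense collapsed Gibbs update it optimizes can be sketched as follows (a vanilla LDA baseline, not the paper's Wikipedia-LDA sampler; the two tiny documents are invented):

```python
import random

def collapsed_gibbs_lda(docs, K, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Vanilla collapsed Gibbs sampling for LDA: resample each token's topic
    from P(z=k | rest) proportional to (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    z = [[rng.randrange(K) for _ in d] for d in docs]  # topic per token
    n_dk = [[0] * K for _ in docs]   # doc-topic counts
    n_kw = [{} for _ in range(K)]    # topic-word counts
    n_k = [0] * K                    # topic totals
    for di, d in enumerate(docs):
        for wi, w in enumerate(d):
            k = z[di][wi]
            n_dk[di][k] += 1
            n_kw[k][w] = n_kw[k].get(w, 0) + 1
            n_k[k] += 1
    for _ in range(iters):
        for di, d in enumerate(docs):
            for wi, w in enumerate(d):
                k = z[di][wi]  # remove the token's current assignment
                n_dk[di][k] -= 1; n_kw[k][w] -= 1; n_k[k] -= 1
                weights = [(n_dk[di][j] + alpha) * (n_kw[j].get(w, 0) + beta)
                           / (n_k[j] + V * beta) for j in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                z[di][wi] = k  # add it back under the newly sampled topic
                n_dk[di][k] += 1; n_kw[k][w] = n_kw[k].get(w, 0) + 1; n_k[k] += 1
    return z, n_dk

toy_docs = [["ball", "goal", "match"], ["vote", "law", "senate"]]
z, n_dk = collapsed_gibbs_lda(toy_docs, K=2)
```

The inner loop touches all K topics per token; the paper's scheme avoids that by iterating only over the sparse non-zero counts.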
This document provides an introduction and overview of 5 papers related to topic modeling techniques. It begins with introducing the speaker and their research interests in text analysis using topic modeling. It then lists the 5 papers that will be discussed: LSA, pLSI, LDA, Gaussian LDA, and criticisms of topic modeling. The document focuses on summarizing each paper's motivation, key points, model, parameter estimation methods, and deficiencies. It provides high-level summaries of key aspects of influential topic modeling papers to introduce the topic.
Neural Models for Information Retrieval, by Bhaskar Mitra
In the last few years, neural representation learning approaches have achieved very good performance on many natural language processing (NLP) tasks, such as language modelling and machine translation. This suggests that neural models may also yield significant performance improvements on information retrieval (IR) tasks, such as relevance ranking, addressing the query-document vocabulary mismatch problem by using semantic rather than lexical matching. IR tasks, however, are fundamentally different from NLP tasks leading to new challenges and opportunities for existing neural representation learning approaches for text.
In this talk, I will present my recent work on neural IR models. We begin with a discussion on learning good representations of text for retrieval. I will present visual intuitions about how different embeddings spaces capture different relationships between items, and their usefulness to different types of IR tasks. The second part of this talk is focused on the applications of deep neural architectures to the document ranking task.
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks, by Leonardo Di Donato
Experimental work on using Topic Modeling to implement and improve some common tasks in Information Retrieval and Word Sense Disambiguation.
It first describes the scenario, the pre-processing pipeline built, and the framework used, and then discusses an investigation of different hyperparameter configurations for the LDA algorithm.
The work continues with the retrieval of relevant documents through two different approaches: inferring the topic distribution of a held-out document (or query) and comparing it against the collection to retrieve similar documents, or an approach driven by probabilistic querying. The last part of the work is devoted to the word sense disambiguation task.
Deep neural methods have recently demonstrated significant performance improvements in several IR tasks. In this lecture, we will present a brief overview of deep models for ranking and retrieval.
This is a follow-up lecture to "Neural Learning to Rank" (https://www.slideshare.net/BhaskarMitra3/neural-learning-to-rank-231759858)
In this natural language understanding (NLU) project, we implemented and compared various approaches for predicting the topics of paragraph-length texts. This paper explains our methodology and results for the following approaches: Naive Bayes, One-vs-Rest Support Vector Machine (OvR SVM) with GloVe vectors, Latent Dirichlet Allocation (LDA) with OvR SVM, Convolutional Neural Networks (CNN), and Long Short Term Memory networks (LSTM).
Paper presentation for the final course Advanced Concept in Machine Learning.
The paper is "Topic Modeling using Topics from Many Domains, Lifelong Learning and Big Data":
http://jmlr.org/proceedings/papers/v32/chenf14.pdf
Using Text Embeddings for Information Retrieval, by Bhaskar Mitra
Neural text embeddings provide dense vector representations of words and documents that encode various notions of semantic relatedness. Word2vec models typical similarity by representing words based on neighboring context words, while models like latent semantic analysis encode topical similarity through co-occurrence in documents. Dual embedding spaces can separately model both typical and topical similarities. Recent work has applied text embeddings to tasks like query auto-completion, session modeling, and document ranking, demonstrating their ability to capture semantic relationships between text beyond just words.
Neural Models for Information Retrieval, by Bhaskar Mitra
In the last few years, neural representation learning approaches have achieved very good performance on many natural language processing (NLP) tasks, such as language modelling and machine translation. This suggests that neural models will also yield significant performance improvements on information retrieval (IR) tasks, such as relevance ranking, addressing the query-document vocabulary mismatch problem by using semantic rather than lexical matching. IR tasks, however, are fundamentally different from NLP tasks leading to new challenges and opportunities for existing neural representation learning approaches for text.
We begin this talk with a discussion on text embedding spaces for modelling different types of relationships between items which makes them suitable for different IR tasks. Next, we present how topic-specific representations can be more effective than learning global embeddings. Finally, we conclude with an emphasis on dealing with rare terms and concepts for IR, and how embedding based approaches can be augmented with neural models for lexical matching for better retrieval performance. While our discussions are grounded in IR tasks, the findings and the insights covered during this talk should be generally applicable to other NLP and machine learning tasks.
Topic modeling using big data analytics can analyze large datasets. It involves installing Hadoop on multiple nodes for distributed processing, preprocessing data into a desired format, and using modeling tools to parallelize computation and select algorithms. Topic modeling identifies patterns in corpora to develop new ways to search, browse, and summarize large text archives. Tools like Mallet use algorithms like LDA and PLSI to achieve topic modeling on Hadoop, applying it to analyze news articles, search engine rankings, genetic and image data, and more.
This document is a thesis that proposes using word embeddings to improve information retrieval by addressing term mismatch issues. It discusses word2vec, a technique for learning word embeddings from large text corpora that capture semantic relationships between words. The thesis proposes two approaches: 1) incorporating word embedding similarities into a probabilistic language model for retrieval and 2) a vector space model. Due to time constraints, only the first approach is implemented, which integrates word embeddings into ALMasri and Chevallet's probabilistic language model. Experiments are conducted to evaluate the impact of using semantic features from word embeddings on retrieval effectiveness.
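The first approach can be illustrated with a toy query-likelihood scorer in which an out-of-document query term borrows probability mass from similar in-document terms. This is a generic sketch of the idea only, not the exact ALMasri and Chevallet model, and the similarity table is invented:

```python
import math

def lm_embedding_score(query, doc, sim, lam=0.7):
    """Toy query-likelihood score: each query term's in-document probability
    is interpolated with embedding-similarity 'translation' mass contributed
    by terms that do occur in the document."""
    dlen = len(doc)
    score = 0.0
    for t in query:
        p_exact = doc.count(t) / dlen
        p_sem = max((sim.get((t, w), 0.0) for w in set(doc)), default=0.0) / dlen
        score += math.log(lam * p_exact + (1 - lam) * p_sem + 1e-12)
    return score

# 'car' never appears in doc2, but its similarity to 'automobile' rescues the match.
sim = {("car", "automobile"): 0.9}
doc1 = ["bike", "repair", "guide"]
doc2 = ["automobile", "repair", "guide"]
print(lm_embedding_score(["car"], doc2, sim) > lm_embedding_score(["car"], doc1, sim))  # True
```

This is precisely the term-mismatch scenario the thesis targets: lexical matching alone would score both documents identically for the query "car".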
A Text Mining Research Based on LDA Topic Modelling, by csandit
A large amount of digital text information is generated every day, and effectively searching, managing and exploring this text data has become a main task. In this paper, we first present an introduction to text mining and the probabilistic topic model Latent Dirichlet Allocation. Then two experiments are proposed: topic modelling of Wikipedia articles and of users' tweets. The former builds up a document topic model, aiming at a topic-perspective solution for searching, exploring and recommending articles. The latter sets up a user topic model, providing a full study and analysis of Twitter users' interests. The experiment process, including data collection, data pre-processing and model training, is fully documented and commented. Furthermore, the conclusions and applications of this paper could serve as a useful computational tool for social and business research.
In recent years, Word2Vec and its extensions (Doc2Vec, Paragraph2Vec, etc.) have been receiving a lot of attention in the NLP field.
In these slides, we introduce our approach for applying Doc2Vec to an item recommender system, and we report the results of a performance evaluation of the Doc2Vec-based recommender using Rakuten Singapore EC data.
A Simple Introduction to Neural Information Retrieval, by Bhaskar Mitra
Neural Information Retrieval (or neural IR) is the application of shallow or deep neural networks to IR tasks. In this lecture, we will cover some of the fundamentals of neural representation learning for text retrieval. We will also discuss some of the recent advances in the applications of deep neural architectures to retrieval tasks.
(These slides were presented at a lecture as part of the Information Retrieval and Data Mining course taught at UCL.)
Neural Text Embeddings for Information Retrieval (WSDM 2017), by Bhaskar Mitra
The document describes a tutorial on using neural networks for information retrieval. It discusses an agenda for the tutorial that includes fundamentals of IR, word embeddings, using word embeddings for IR, deep neural networks, and applications of neural networks to IR problems. It provides context on the increasing use of neural methods in IR applications and research.
This document describes a proposed concept-based mining model that aims to improve document clustering and information retrieval by extracting concepts and semantic relationships rather than just keywords. The model uses natural language processing techniques like part-of-speech tagging and parsing to extract concepts from text. It represents concepts and their relationships in a semantic network and clusters documents based on conceptual similarity rather than term frequency. The model is evaluated using singular value decomposition to increase the precision of key term and phrase extraction.
The document discusses a neural model called Duet for ranking documents based on their relevance to a query. Duet uses both a local model that operates on exact term matches between queries and documents, and a distributed model that learns embeddings to match queries and documents in the embedding space. The two models are combined using a linear combination and trained jointly on labeled query-document pairs. Experimental results show Duet performs significantly better at document ranking and other IR tasks compared to using the local and distributed models individually. The amount of training data is also important, with larger datasets needed to learn better representations.
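Duet's core idea, combining an exact-match signal with an embedding-space signal, can be sketched at toy scale. The 2-d "embeddings" below are invented for illustration, and the real model learns both sub-networks jointly rather than using a fixed linear mix:

```python
import math

EMB = {  # hypothetical 2-d word embeddings, for illustration only
    "seattle": [0.9, 0.1], "rain": [0.8, 0.3],
    "tax": [0.1, 0.9], "law": [0.2, 0.8],
}

def mean_vec(terms):
    """Mean embedding of the known terms (zero vector if none are known)."""
    vecs = [EMB[t] for t in terms if t in EMB]
    if not vecs:
        return [0.0, 0.0]
    return [sum(v[i] for v in vecs) / len(vecs) for i in (0, 1)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def duet_like_score(query, doc, w_local=0.5):
    """Linear mix of a 'local' exact-term-match signal and a 'distributed'
    embedding-space signal, echoing Duet's two sub-models."""
    local = sum(1.0 for t in query if t in doc) / len(query)
    distributed = cosine(mean_vec(query), mean_vec(doc))
    return w_local * local + (1 - w_local) * distributed
```

Even with no exact term overlap, "seattle" still matches a document containing "rain" through the distributed signal, while the local signal rewards exact matches that embeddings might blur.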
Learning deep structured semantic models for web search, by hyunsung lee
DSSM is a model for learning semantic representations of text to improve web search using clickthrough data. It maps queries and documents to vectors using letter n-grams and multilayer perceptrons. The model is trained by maximizing the likelihood of clicked documents given queries. In experiments, DSSM outperforms other models on ranking metrics, achieving the best performance with two hidden layers and without word hashing. The model effectively learns semantic similarities that can be evaluated on analogy tasks.
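The word-hashing step mentioned above maps each word to its letter n-grams with boundary markers, which keeps the input layer small regardless of vocabulary size. A minimal sketch:

```python
def letter_ngrams(word, n=3):
    """DSSM-style word hashing: pad with boundary markers and slide an
    n-character window, so 'good' -> {'#go', 'goo', 'ood', 'od#'}."""
    padded = "#" + word + "#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

print(sorted(letter_ngrams("good")))  # ['#go', 'goo', 'od#', 'ood']
```

A word is then represented as the multi-hot vector over the (much smaller) trigram vocabulary, which also gives unseen words a usable representation.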
Exploring Session Context using Distributed Representations of Queries and Re..., by Bhaskar Mitra
Search logs contain examples of frequently occurring patterns of user reformulations of queries. Intuitively, the reformulation "san francisco" → "san francisco 49ers" is semantically similar to "detroit" →"detroit lions". Likewise, "london"→"things to do in london" and "new york"→"new york tourist attractions" can also be considered similar transitions in intent. The reformulation "movies" → "new movies" and "york" → "new york", however, are clearly different despite the lexical similarities in the two reformulations. In this paper, we study the distributed representation of queries learnt by deep neural network models, such as the Convolutional Latent Semantic Model, and show that they can be used to represent query reformulations as vectors. These reformulation vectors exhibit favourable properties such as mapping semantically and syntactically similar query changes closer in the embedding space. Our work is motivated by the success of continuous space language models in capturing relationships between words and their meanings using offset vectors. We demonstrate a way to extend the same intuition to represent query reformulations.
Furthermore, we show that the distributed representations of queries and reformulations are both useful for modelling session context for query prediction tasks, such as for query auto-completion (QAC) ranking. Our empirical study demonstrates that short-term (session) history context features based on these two representations improves the mean reciprocal rank (MRR) for the QAC ranking task by more than 10% over a supervised ranker baseline. Our results also show that by using features based on both these representations together we achieve a better performance, than either of them individually.
Paper: http://research.microsoft.com/apps/pubs/default.aspx?id=244728
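The MRR metric used in the evaluation above is the reciprocal of the rank of the first relevant result, averaged over queries; a minimal sketch with invented toy data:

```python
def mean_reciprocal_rank(ranked_lists, relevant):
    """MRR: reciprocal of the 1-based rank of the first relevant item in each
    ranked list, averaged over all queries; contributes 0 if none is relevant."""
    total = 0.0
    for ranking, rel in zip(ranked_lists, relevant):
        rr = 0.0
        for i, item in enumerate(ranking, start=1):
            if item in rel:
                rr = 1.0 / i
                break
        total += rr
    return total / len(ranked_lists)

# Two queries: first hits at rank 1 and rank 2 -> MRR = (1 + 0.5) / 2 = 0.75
print(mean_reciprocal_rank([["a", "b"], ["x", "y"]], [{"a"}, {"y"}]))  # 0.75
```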
Adversarial and reinforcement learning-based approaches to information retrieval, by Bhaskar Mitra
Traditionally, machine learning based approaches to information retrieval have taken the form of supervised learning-to-rank models. Recently, other machine learning approaches, such as adversarial learning and reinforcement learning, have started to find interesting applications in retrieval systems. At Bing, we have been exploring some of these methods in the context of web search. In this talk, I will share a couple of our recent works in this area that we presented at SIGIR 2018.
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track, by Bhaskar Mitra
We benchmark Conformer-Kernel models under the strict blind evaluation setting of the TREC 2020 Deep Learning track. In particular, we study the impact of incorporating: (i) Explicit term matching to complement matching based on learned representations (i.e., the “Duet principle”), (ii) query term independence (i.e., the “QTI assumption”) to scale the model to the full retrieval setting, and (iii) the ORCAS click data as an additional document description field. We find evidence which supports that all three aforementioned strategies can lead to improved retrieval quality.
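The QTI assumption in (ii) says the query-document score decomposes into a sum of independent per-term scores, which is what makes precomputing term-document scores into an inverted index feasible for full retrieval. A minimal sketch with a trivially simple term scorer:

```python
def qti_score(query_terms, doc, term_score):
    """Under the query term independence (QTI) assumption, the query-document
    score is a sum of per-term scores, so each term's contribution can be
    computed offline and served from an inverted index."""
    return sum(term_score(t, doc) for t in query_terms)

tf = lambda t, doc: doc.count(t)  # stand-in scorer; the real model is neural
doc = ["deep", "learning", "for", "retrieval", "deep"]
print(qti_score(["deep", "retrieval"], doc, tf))  # 3
```

The trade-off is that any scoring signal that depends on interactions between query terms is given up in exchange for full-collection scalability.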
Computational Intelligence Methods for Clustering of Sense Tagged Nepali Docu..., by IOSR Journals
The document describes a hybrid computational intelligence method for clustering sense-tagged Nepali documents. It combines self-organizing maps (SOM), particle swarm optimization (PSO), and k-means clustering. Feature vectors are generated from the sense-tagged documents and the hybrid algorithm is applied in three phases: (1) SOM produces prototype vectors from the feature vectors, (2) PSO initializes k-means centroids, (3) k-means clusters the prototypes. The method aims to address limitations of bag-of-words representations by incorporating word sense information. Experiments show the approach effectively clusters sense-tagged Nepali texts.
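Phase (3) of the pipeline is plain k-means; a minimal pure-Python version, with toy 2-D vectors standing in for the SOM prototype vectors:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: assign each vector to its nearest centroid, then
    recompute centroids as cluster means, repeated for a fixed number of iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        # keep the old centroid if a cluster ends up empty
        centroids = [[sum(c) / len(cl) for c in zip(*cl)] if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters

# Two well-separated toy blobs; k-means recovers them.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centroids, clusters = kmeans(points, k=2)
```

In the paper's pipeline the initial centroids come from PSO rather than random sampling, which is what the `rng.sample` line stands in for here.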
This document summarizes a research paper that proposes a method to improve web image search results by re-ranking the initial text-based search results using visual similarity measures. The researchers develop an adaptive visual similarity approach that categorizes the query image and then uses a specific similarity measure tailored to each category to re-rank the images based on visual features. They test their method on search results from Google Image Search and Microsoft Live Image Search and find it effectively improves search accuracy by better incorporating visual content into the ranking.
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ..., by Jimmy Lai
Big data analysis relies on exploiting various handy tools to gain insight from data easily. In this talk, the speaker demonstrates a data mining flow for text classification using many Python tools. The flow consists of feature extraction/selection, model training/tuning, and evaluation. Various tools are used in the flow, including Pandas for feature processing, scikit-learn for classification, IPython Notebook for fast sketching, and matplotlib for visualization.
This document provides an introduction to Julia and DataFrames. It demonstrates basic Julia syntax for characters, strings, and the NA type to represent missing values. It then shows how to create and access a DataFrame, including mixed indexing of rows and columns by number and name. As an example, it loads the iris dataset to demonstrate creating a DataFrame from external data.
This document discusses information retrieval techniques. It begins by defining information retrieval as selecting the most relevant documents from a large collection based on a query. It then discusses some key aspects of information retrieval including document representation, indexing, query representation, and ranking models. The document also covers specific techniques used in information retrieval systems like parsing documents, tokenization, removing stop words, normalization, stemming, and lemmatization.
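The parsing steps listed above chain together naturally; a minimal sketch, where the stop-word list and suffix rules are invented stand-ins for a real stop list and a Porter-style stemmer:

```python
import re

STOPWORDS = {"the", "a", "of", "and", "is"}  # tiny illustrative stop list

def preprocess(text):
    """Minimal pipeline: lowercase-normalize, tokenize on letter runs, drop
    stop words, then apply a crude suffix-stripping 'stemmer' (a real system
    would use Porter stemming or lemmatization)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    stemmed = []
    for t in tokens:
        for suf in ("ing", "ies", "es", "s"):
            if t.endswith(suf) and len(t) > len(suf) + 2:
                t = t[: -len(suf)] + ("y" if suf == "ies" else "")
                break
        stemmed.append(t)
    return stemmed

print(preprocess("The rankings of the queries"))  # ['ranking', 'query']
```

The output tokens are what an indexer would then feed into the document representation and ranking stages described above.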
This talk features the basics behind the science of Information Retrieval with a story-mode on information and its various aspects. It then takes you through a quick journey into the process behind building of the search engine.
Neural Models for Information RetrievalBhaskar Mitra
In the last few years, neural representation learning approaches have achieved very good performance on many natural language processing (NLP) tasks, such as language modelling and machine translation. This suggests that neural models will also yield significant performance improvements on information retrieval (IR) tasks, such as relevance ranking, addressing the query-document vocabulary mismatch problem by using semantic rather than lexical matching. IR tasks, however, are fundamentally different from NLP tasks leading to new challenges and opportunities for existing neural representation learning approaches for text.
We begin this talk with a discussion on text embedding spaces for modelling different types of relationships between items which makes them suitable for different IR tasks. Next, we present how topic-specific representations can be more effective than learning global embeddings. Finally, we conclude with an emphasis on dealing with rare terms and concepts for IR, and how embedding based approaches can be augmented with neural models for lexical matching for better retrieval performance. While our discussions are grounded in IR tasks, the findings and the insights covered during this talk should be generally applicable to other NLP and machine learning tasks.
Topic modeling using big data analytics can analyze large datasets. It involves installing Hadoop on multiple nodes for distributed processing, preprocessing data into a desired format, and using modeling tools to parallelize computation and select algorithms. Topic modeling identifies patterns in corpora to develop new ways to search, browse, and summarize large text archives. Tools like Mallet use algorithms like LDA and PLSI to achieve topic modeling on Hadoop, applying it to analyze news articles, search engine rankings, genetic and image data, and more.
This document is a thesis that proposes using word embeddings to improve information retrieval by addressing term mismatch issues. It discusses word2vec, a technique for learning word embeddings from large text corpora that capture semantic relationships between words. The thesis proposes two approaches: 1) incorporating word embedding similarities into a probabilistic language model for retrieval and 2) a vector space model. Due to time constraints, only the first approach is implemented, which integrates word embeddings into ALMasri and Chevallet's probabilistic language model. Experiments are conducted to evaluate the impact of using semantic features from word embeddings on retrieval effectiveness.
A Text Mining Research Based on LDA Topic Modellingcsandit
A Large number of digital text information is gener
ated every day. Effectively searching,
managing and exploring the text data has become a m
ain task. In this paper, we first represent
an introduction to text mining and a probabilistic
topic model Latent Dirichlet allocation. Then
two experiments are proposed - Wikipedia articles a
nd users’ tweets topic modelling. The
former one builds up a document topic model, aiming
to a topic perspective solution on
searching, exploring and recommending articles. The
latter one sets up a user topic model,
providing a full research and analysis over Twitter
users’ interest. The experiment process
including data collecting, data pre-processing and
model training is fully documented and
commented. Further more, the conclusion and applica
tion of this paper could be a useful
computation tool for social and business research.
In recent years, Word2Vec and its expansion (Doc2Vec, Paragraph2Vec, etc.) is receiving a lot of attention in the NLP field.
In this slide, we will introduce our approach for applying the Doc2Vec to the item recommender system. And we report the results of the performance evaluation of Doc2Vec-based recommender by using Rakuten Singapore EC data.
A Simple Introduction to Neural Information RetrievalBhaskar Mitra
Neural Information Retrieval (or neural IR) is the application of shallow or deep neural networks to IR tasks. In this lecture, we will cover some of the fundamentals of neural representation learning for text retrieval. We will also discuss some of the recent advances in the applications of deep neural architectures to retrieval tasks.
(These slides were presented at a lecture as part of the Information Retrieval and Data Mining course taught at UCL.)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Bhaskar Mitra
The document describes a tutorial on using neural networks for information retrieval. It discusses an agenda for the tutorial that includes fundamentals of IR, word embeddings, using word embeddings for IR, deep neural networks, and applications of neural networks to IR problems. It provides context on the increasing use of neural methods in IR applications and research.
This document describes a proposed concept-based mining model that aims to improve document clustering and information retrieval by extracting concepts and semantic relationships rather than just keywords. The model uses natural language processing techniques like part-of-speech tagging and parsing to extract concepts from text. It represents concepts and their relationships in a semantic network and clusters documents based on conceptual similarity rather than term frequency. The model is evaluated using singular value decomposition to increase the precision of key term and phrase extraction.
The document discusses a neural model called Duet for ranking documents based on their relevance to a query. Duet uses both a local model that operates on exact term matches between queries and documents, and a distributed model that learns embeddings to match queries and documents in the embedding space. The two models are combined using a linear combination and trained jointly on labeled query-document pairs. Experimental results show Duet performs significantly better at document ranking and other IR tasks compared to using the local and distributed models individually. The amount of training data is also important, with larger datasets needed to learn better representations.
Learning deep structured semantic models for web searchhyunsung lee
DSSM is a model for learning semantic representations of text to improve web search using clickthrough data. It maps queries and documents to vectors using letter n-grams and multilayer perceptrons. The model is trained by maximizing the likelihood of clicked documents given queries. In experiments, DSSM outperforms other models on ranking metrics, achieving the best performance with two hidden layers and without word hashing. The model effectively learns semantic similarities that can be evaluated on analogy tasks.
Exploring Session Context using Distributed Representations of Queries and Re...Bhaskar Mitra
Search logs contain examples of frequently occurring patterns of user reformulations of queries. Intuitively, the reformulation "san francisco" → "san francisco 49ers" is semantically similar to "detroit" →"detroit lions". Likewise, "london"→"things to do in london" and "new york"→"new york tourist attractions" can also be considered similar transitions in intent. The reformulation "movies" → "new movies" and "york" → "new york", however, are clearly different despite the lexical similarities in the two reformulations. In this paper, we study the distributed representation of queries learnt by deep neural network models, such as the Convolutional Latent Semantic Model, and show that they can be used to represent query reformulations as vectors. These reformulation vectors exhibit favourable properties such as mapping semantically and syntactically similar query changes closer in the embedding space. Our work is motivated by the success of continuous space language models in capturing relationships between words and their meanings using offset vectors. We demonstrate a way to extend the same intuition to represent query reformulations.
Furthermore, we show that the distributed representations of queries and reformulations are both useful for modelling session context for query prediction tasks, such as for query auto-completion (QAC) ranking. Our empirical study demonstrates that short-term (session) history context features based on these two representations improves the mean reciprocal rank (MRR) for the QAC ranking task by more than 10% over a supervised ranker baseline. Our results also show that by using features based on both these representations together we achieve a better performance, than either of them individually.
Paper: http://research.microsoft.com/apps/pubs/default.aspx?id=244728
Adversarial and reinforcement learning-based approaches to information retrievalBhaskar Mitra
Traditionally, machine learning based approaches to information retrieval have taken the form of supervised learning-to-rank models. Recently, other machine learning approaches—such as adversarial learning and reinforcement learning—have started to find interesting applications in retrieval systems. At Bing, we have been exploring some of these methods in the context of web search. In this talk, I will share couple of our recent work in this area that we presented at SIGIR 2018.
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track - Bhaskar Mitra
We benchmark Conformer-Kernel models under the strict blind evaluation setting of the TREC 2020 Deep Learning track. In particular, we study the impact of incorporating: (i) Explicit term matching to complement matching based on learned representations (i.e., the “Duet principle”), (ii) query term independence (i.e., the “QTI assumption”) to scale the model to the full retrieval setting, and (iii) the ORCAS click data as an additional document description field. We find evidence which supports that all three aforementioned strategies can lead to improved retrieval quality.
Computational Intelligence Methods for Clustering of Sense Tagged Nepali Docu... - IOSR Journals
The document describes a hybrid computational intelligence method for clustering sense-tagged Nepali documents. It combines self-organizing maps (SOM), particle swarm optimization (PSO), and k-means clustering. Feature vectors are generated from the sense-tagged documents and the hybrid algorithm is applied in three phases: (1) SOM produces prototype vectors from the feature vectors, (2) PSO initializes k-means centroids, (3) k-means clusters the prototypes. The method aims to address limitations of bag-of-words representations by incorporating word sense information. Experiments show the approach effectively clusters sense-tagged Nepali texts.
This document summarizes a research paper that proposes a method to improve web image search results by re-ranking the initial text-based search results using visual similarity measures. The researchers develop an adaptive visual similarity approach that categorizes the query image and then uses a specific similarity measure tailored to each category to re-rank the images based on visual features. They test their method on search results from Google Image Search and Microsoft Live Image Search and find it effectively improves search accuracy by better incorporating visual content into the ranking.
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ... - Jimmy Lai
Big data analysis relies on exploiting various handy tools to gain insight from data easily. In this talk, the speaker demonstrates a data mining flow for text classification using many Python tools. The flow consists of feature extraction/selection, model training/tuning and evaluation. Various tools are used in the flow, including Pandas for feature processing, scikit-learn for classification, IPython Notebook for fast sketching, and matplotlib for visualization.
This document provides an introduction to Julia and DataFrames. It demonstrates basic Julia syntax for characters, strings, and the NA type to represent missing values. It then shows how to create and access a DataFrame, including mixed indexing of rows and columns by number and name. As an example, it loads the iris dataset to demonstrate creating a DataFrame from external data.
This document discusses information retrieval techniques. It begins by defining information retrieval as selecting the most relevant documents from a large collection based on a query. It then discusses some key aspects of information retrieval including document representation, indexing, query representation, and ranking models. The document also covers specific techniques used in information retrieval systems like parsing documents, tokenization, removing stop words, normalization, stemming, and lemmatization.
This talk features the basics behind the science of Information Retrieval with a story-mode on information and its various aspects. It then takes you through a quick journey into the process behind building a search engine.
Indexes in information retrieval systems serve three main purposes:
1) To construct representations of documents that are suitable for users to browse.
2) To maximize users' searching success.
3) To minimize the time and effort required for users to find information.
The document discusses information retrieval, which involves obtaining information resources relevant to an information need from a collection. The information retrieval process begins when a user submits a query. The system matches queries to database information, ranks objects based on relevance, and returns top results to the user. The process involves document acquisition and representation, user problem representation as queries, and searching/retrieval through matching and result retrieval.
INTRODUCTION TO INFORMATION RETRIEVAL
This lecture will introduce the information retrieval problem, introduce the terminology related to IR, and provide a history of IR. In particular, the history of the web and its impact on IR will be discussed. Special attention and emphasis will be given to the concept of relevance in IR and the critical role it has played in the development of the subject. The lecture will end with a conceptual explanation of the IR process, and its relationships with other domains as well as current research developments.
INFORMATION RETRIEVAL MODELS
This lecture will present the models that have been used to rank documents according to their estimated relevance to user-given queries, where the most relevant documents are shown ahead of those less relevant. Many of these models form the basis of the ranking algorithms used in past and present search applications. The lecture will describe models of IR such as Boolean retrieval, vector space, probabilistic retrieval, language models, and logical models. Relevance feedback, a technique that either implicitly or explicitly modifies user queries in light of their interaction with retrieval results, will also be discussed, as it is particularly relevant to web search and personalization.
This document summarizes research on frequent itemset mining techniques in data mining. It discusses the Apriori algorithm and how the authors improved it by introducing a vertical sort, which lowers memory usage at all support thresholds and makes it feasible to mine at lower support thresholds. The document also reviews related work on probabilistic frequent itemset mining and algorithms like U-Apriori and UF-growth for handling uncertainty.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Classification of News and Research Articles Using Text Pattern Mining - IOSR Journals
This document summarizes a research paper that proposes a method for classifying news and research articles using text pattern mining. The method involves preprocessing text to remove stop words and perform stemming. Frequent and closed patterns are then discovered from the preprocessed text. These patterns are structured into a taxonomy and deployed to classify new documents. The method also involves evolving patterns by reshuffling term supports within patterns to reduce the effects of noise from negative documents. Over 80% of documents were successfully classified using this pattern-based approach.
Zilliz - Overview of Generative models in ML - Zilliz
Learn the following topics:
What is Generative AI?
Use cases of generative AI
Large Language Models, Neural Networks, and Parameters
GAN (Generative Adversarial Network)
Diffusion models
Multimodal models
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le... - ijtsrd
This document presents a method for generating suggestions for specific erroneous parts of sentences in Indian languages like Malayalam using deep learning. The method uses recurrent neural networks with long short-term memory layers to train a model on input-output examples of sentences and their corrections. The model takes in preprocessed sentence data and generates a set of possible corrections for erroneous parts through multiple network layers. An analysis of the model shows that it can accurately generate suggestions for words of length three, but more data and further study are required to handle the complex morphology and symbols of Malayalam. The performance of the method is limited by the hardware used and could be improved with a more powerful system and additional training data.
Bridging the gap between AI and UI - DSI Vienna - full version - Liad Magen
This is a summary of the latest research on model interpretability, including what recurrent neural networks (RNNs) used for natural language processing (NLP) actually learn internally.
In addition, it contains suggestions for improving machine-learning-based user interfaces to engage users and encourage them to contribute data that adapts the models to them.
The document discusses text mining and summarizes several key points:
1) Text mining involves deriving patterns and trends from text to discover useful knowledge, but it is challenging to accurately evaluate features due to issues like polysemy and synonymy.
2) Phrase-based approaches could perform better than term-based approaches by carrying more semantic meaning, but have faced challenges due to low phrase frequencies and redundant/noisy phrases.
3) The proposed approach uses pattern mining to discover specific patterns and evaluates term weights based on pattern distributions rather than full document distributions to address misinterpretation issues and improve accuracy.
B sc it syit sem 3 sem 4 syllabus as per mumbai university - tanujaparihar
This document contains the syllabus for the S.Y.B.Sc. Information Technology program at the University of Mumbai for the academic year 2017-2018. It outlines the courses, course codes, credits, and topics to be covered for semesters 3 and 4. Semester 3 includes courses in Python programming, data structures, computer networks, databases, applied mathematics, and their corresponding practical courses. Semester 4 includes courses in Core Java, embedded systems, computer statistics, software engineering, computer graphics, and their practical courses.
Construction of Keyword Extraction using Statistical Approaches and Document ... - IJERA Editor
Organizing the continuing growth of dynamic unstructured documents is a major challenge for field experts, and handling such unorganized documents is expensive. Clustering these dynamic documents helps to reduce the cost. Document clustering by analysing the keywords of the documents is one of the best methods to organize unstructured dynamic documents, and statistical analysis is a well-suited adaptive method for extracting the keywords. In this paper an algorithm is proposed to cluster the documents. It has two parts: the first extracts the keywords using a statistical method, and the second constructs the clusters from the keywords using an agglomerative method. The proposed algorithm gives more than 90% accuracy.
Automated News Categorization Using Machine Learning Techniques - Drjabez
This document summarizes a research paper that compared different machine learning algorithms for automated news categorization. The researchers used a news article dataset from Kaggle to test Naive Bayes, Support Vector Machine (SVM), and Neural Network classifiers. SVM performed best with an accuracy of 75.84%, execution time of 243 milliseconds, and mean absolute error of 0.28. The paper concludes SVM is the best algorithm for classifying news articles out of the three compared based on accuracy, speed and error rate.
This document discusses building an inverted index to efficiently support information retrieval on large document collections. It describes tokenizing documents, building a dictionary of normalized terms, and creating postings lists that map each term to the documents it appears in. Inverted indexes allow skipping linear scanning and support flexible queries by indexing term locations. The document also covers calculating precision and recall to measure system effectiveness.
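The steps this summary describes — tokenize, build a term-to-postings dictionary, intersect postings lists instead of scanning linearly, and score with precision and recall — can be sketched as follows (a minimal illustration with hypothetical helper names, not the document's own code):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def and_query(index, terms):
    """Answer a conjunctive query by intersecting postings lists (no linear scan)."""
    sets = [set(index.get(t.lower(), ())) for t in terms]
    return sorted(set.intersection(*sets)) if sets else []

def precision_recall(retrieved, relevant):
    """Precision = fraction of retrieved docs that are relevant;
    recall = fraction of relevant docs that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

docs = {
    1: "julia text mining tutorial",
    2: "text mining with latent semantic indexing",
    3: "julia language tutorial",
}
index = build_inverted_index(docs)
result = and_query(index, ["text", "mining"])    # docs containing both terms
p, r = precision_recall(result, relevant={2, 3})
```

A production index would also store term positions within each document to support phrase and proximity queries, which is the "indexing term locations" point above.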
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo... - IRJET Journal
This document proposes a methodology to automatically assign topics to unlabeled datasets using topic modeling techniques. It applies latent Dirichlet allocation (LDA) and non-negative matrix factorization (NMF) with term frequency-inverse document frequency (TF-IDF) weighting to product reviews to generate topics. Word similarities are used to cluster words for each topic. Sentiment analysis and word clouds are also used to gain insights. The methodology successfully converts unlabeled to labeled data and provides automatic topic labeling to facilitate further research and opportunity discovery.
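As a rough illustration of the NMF half of such a pipeline (a sketch using the classic Lee-Seung multiplicative updates, not the paper's own code or settings), a small non-negative term-weight matrix can be factored into document-topic and topic-term matrices:

```python
import numpy as np

def nmf(V, k, iters=500, eps=1e-9, seed=0):
    """Factor V ~= W @ H with W, H >= 0 via Lee-Seung multiplicative updates.
    Rows of W hold document-topic weights; rows of H hold topic-term weights."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy term-frequency matrix: 4 "reviews" over 6 terms with two obvious
# topics (terms 0-2 about one subject, terms 3-5 about another).
V = np.array([
    [3, 2, 1, 0, 0, 0],
    [2, 3, 2, 0, 0, 0],
    [0, 0, 0, 3, 2, 2],
    [0, 0, 0, 2, 3, 1],
], dtype=float)

W, H = nmf(V, k=2)
topic_of = W.argmax(axis=1)  # dominant topic per document
```

In practice the input would be a TF-IDF matrix rather than raw counts, and each topic would be labeled by the highest-weighted terms in the corresponding row of H.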
- The document proposes developing a Java API for WSO2 Machine Learner (ML) that supports streaming data and machine learning algorithms like k-means clustering and generalized linear models.
- The API would work by periodically retraining ML models on mini-batches of data extracted from incoming data streams, such as event streams from a Complex Event Processor.
- This would allow real-time predictive analysis of streaming data using WSO2 ML, by incrementally updating models with new streaming data.
Discovering User's Topics of Interest in Recommender Systems - Gabriel Moreira
This talk introduces the main techniques of Recommender Systems and Topic Modeling.
Then, we present a case of how we've combined those techniques to build Smart Canvas (www.smartcanvas.com), a service that allows people to bring, create and curate content relevant to their organization, and also helps to tear down knowledge silos.
We present some of Smart Canvas features powered by its recommender system, such as:
- Highlight relevant content, explaining to users which of their topics of interest generated each recommendation.
- Associate tags to users’ profiles based on topics discovered from content they have contributed. These tags become searchable, allowing users to find experts or people with specific interests.
- Recommend people with similar interests, explaining which topics bring them together.
We give a deep dive into the design of our large-scale recommendation algorithms, giving special attention to our content-based approach that uses topic modeling techniques (like LDA and NMF) to discover people’s topics of interest from unstructured text, and social-based algorithms using a graph database connecting content, people and teams around topics.
Our typical data pipeline includes the ingestion of millions of user events (using Google PubSub and BigQuery), batch processing of the models (with PySpark, MLlib, and Scikit-learn), online recommendations (with Google App Engine, Titan Graph Database and Elasticsearch), and data-driven evaluation of UX and algorithms through A/B testing experimentation. We also touch on non-functional requirements of software-as-a-service, such as scalability, performance, availability, reliability and multi-tenancy, and how we addressed them in a robust architecture deployed on Google Cloud Platform.
This document provides an overview of information retrieval models, including vector space models, TF-IDF, Doc2Vec, and latent semantic analysis. It begins with basic concepts in information retrieval like document indexing and relevance scoring. Then it discusses vector space models and how documents and queries are represented as vectors. TF-IDF weighting is explained as assigning higher weight to rare terms. Doc2Vec is introduced as an extension of word2vec to learn document embeddings. Latent semantic analysis uses singular value decomposition to project documents to a latent semantic space. Implementation details and examples are provided for several models.
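The TF-IDF weighting and vector-space cosine scoring described above can be sketched in a few lines (a toy illustration with made-up documents, not any model's reference implementation):

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocab, vectors): each document as a tf * idf weighted vector,
    with idf = log(N / df) so rarer terms get higher weight."""
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))          # document frequency, not raw counts
    vocab = sorted(df)
    idf = {t: math.log(N / df[t]) for t in vocab}
    return vocab, [[Counter(toks)[t] * idf[t] for t in vocab] for toks in tokenized]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [
    "latent semantic analysis of text",
    "semantic analysis with singular value decomposition",
    "cooking pasta at home",
]
vocab, vecs = tfidf_matrix(docs)
sim_related   = cosine(vecs[0], vecs[1])  # shared terms give a positive score
sim_unrelated = cosine(vecs[0], vecs[2])  # no shared terms: exactly 0.0
```

Latent semantic analysis then replaces these sparse vectors with low-rank projections obtained from an SVD of the term-document matrix, so that documents can match even without shared terms.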
Automated Software Requirements Labeling - Data Works MD
Video of the presentation is available here: https://youtu.be/L6EMnvALYtU
Talk: Machine Learning for Requirements Engineering
Speaker: Jon Patton
This project applies a number of machine learning, deep learning, and NLP techniques to solve challenging problems in requirements engineering.
This document presents a feature clustering algorithm to reduce the dimensionality of feature vectors for text classification. The algorithm groups words in documents into clusters based on similarity, with each cluster characterized by a membership function. Words not similar to existing clusters form new clusters. This avoids specifying features in advance and the need for trial and error. Experimental results showed the method can classify text faster and with better extracted features than other methods.
professional fuzzy type-ahead rummage around in xml type-ahead search techni... - Kumar Goud
Abstract – This is a research venture on the new information-access paradigm called type-ahead search, in which systems discover answers to a keyword query on-the-fly as users type in the query. In this paper we study how to support fuzzy type-ahead search in XML. Supporting fuzzy search is important when users have limited knowledge about the exact representation of the entities they are looking for, such as people records in an online directory. We have developed and deployed several such systems, some of which have been used by many people on a daily basis. The systems received overwhelmingly positive feedback from users due to their friendly interfaces with the fuzzy-search feature. We describe the design and implementation of the systems, and demonstrate several of them. We show that our efficient techniques can indeed allow this search paradigm to scale to large amounts of data.
Index Terms - type-ahead, large data set, server side, online directory, search technique.
Communications Mining Series - Zero to Hero - Session 1 - DianaGray10
This session provides an introduction to UiPath Communication Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
UiPath Test Automation using UiPath Test Suite series, part 5 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series, part 5. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of a CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer's life and facilitate a rapid transition from concept to production-ready applications. He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack - shyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
How to Get CNIC Information System with Paksim Ga.pptx - danishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Full-RAG: A modern architecture for hyper-personalization - Zilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Unlock the Future of Search with MongoDB Atlas: Vector Search Unleashed.pdf - Malak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... - SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Climate Impact of Software Testing at Nordic Testing Days - Kari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact and sustainability of software testing are discussed in the talk. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you... - Zilliz
Join us to introduce Milvus Lite, a vector database that can run on notebooks and laptops, share the same API with Milvus, and integrate with every popular GenAI framework. This webinar is perfect for developers seeking easy-to-use, well-integrated vector databases for their GenAI apps.
20 Comprehensive Checklist of Designing and Developing a Website - Pixlogix Infotech
Dive into the world of Website Designing and Developing with Pixlogix! Looking to create a stunning online presence? Look no further! Our comprehensive checklist covers everything you need to know to craft a website that stands out. From user-friendly design to seamless functionality, we've got you covered. Don't miss out on this invaluable resource! Check out our checklist now at Pixlogix and start your journey towards a captivating online presence today.
Removing Uninteresting Bytes in Software Fuzzing - Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
A tale of scale & speed: How the US Navy is enabling software delivery from l... - sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
A tale of scale & speed: How the US Navy is enabling software delivery from l...
Julia text mining_inmobi
1. Introduction Preprocessing VSM Model Query and Performance Modeling LSI Model
Hands-on Introduction to Text Mining with Julia
- A Mathematical approach
Abhijith Chandraprabhu
April 19, 2014
Abhijith Chandraprabhu Julia
Hands-on Introduction to Text Mining with Julia - A Mathematical approach
Overview
Introduction
Preprocessing
VSM Model
Query and Performance Modeling
LSI Model
About today's session
We will be dealing with an age-old technique (1999).
Old wine in a new bottle!
The emphasis is on understanding the math behind it.
Vectors, Matrices
Dimension, Space
Projection, Matrix factorization
SVD
Hands on session.
“For almost a decade the computational linguistics community has
viewed large text collections as a resource to be tapped in order to
produce better text analysis algorithms. In this paper, I have
attempted to suggest a new emphasis: the use of large online text
collections to discover new facts and trends about the world itself.
I suggest that to make progress we do not need fully artificial
intelligent text analysis; rather, a mixture of
computationally-driven and user-guided analysis may open the door
to exciting new results.”
Untangling Text Data Mining, Marti A. Hearst
Text Mining
A type of information retrieval, where we try to extract relevant
information from a huge collection of textual data (documents).
Documents: web pages, biomedical literature, movie reviews,
research articles, etc.
Non-semantic: based on the Vector-Space Model.
Applications: web search engines, biomedical information
retrieval, etc.
Document1
It’s also important to point out that even though these arrays are generic, they’re not boxed: an Int8 array will take
up much less memory than an Int64 array, and both will be laid out as continuous blocks of memory; Julia can deal
seamlessly and generically with these different immediate types as well as pointer types like String.
Document2
To celebrate some of the amazing work that’s already been done to make Julia usable for day-to-day data analysis,
I’d like to give a brief overview of the state of statistical programming in Julia. There are now several packages
that, taken as a whole, suggest that Julia may really live up to its potential and become the next generation
language for data analysis.
Document3
Only later did I realize what makes Julia different from all the others. Julia breaks down the second wall — the wall
between your high-level code and native assembly. Not only can you write code with the performance of C in Julia,
you can take a peek behind the curtain of any function into its LLVM Intermediate Representation as well as its
generated assembly code — all within the REPL. Check it out.
Document4
Homoiconicity — the code can be operated on by other parts of the code. Again, R kind of has this too! Kind of,
because I’m unaware of a good explanation for how to use it productively, and R’s syntax and scoping rules make it
tricky to pull off. But I’m still excited to see it in Julia, because I’ve heard good things about macros and I’d like to
appreciate them.
Document5
Graphics. One of the big advantages of R over similar languages is the sophisticated graphics available in ggplot2
or lattice. Work is underway on graphics models for Julia but, again, it is early days still.
Terminologies explained:
Corpus: structured set of text, obtained by preprocessing raw
textual data.
Lexicon: the distinct collection of all the words/terms in the
corpus.
Document, Query: a bag of words/terms, represented as a vector.
Keyword, term: elements of the lexicon.
Term frequency: frequency of occurrence of a term in a
document.
TDM: a matrix with term frequencies as entries, across all
documents.
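The TDM construction just defined can be sketched numerically. Below is a minimal illustration in plain Python with a hypothetical toy corpus (the session itself builds the same matrix in Julia via TextAnalysis.jl's DocumentTermMatrix):

```python
from collections import Counter

# Toy corpus: each document is already a preprocessed bag of terms.
docs = [
    ["julia", "fast", "dynamic"],
    ["julia", "data", "analysis", "data"],
    ["fast", "assembly", "julia"],
]

# Lexicon: the distinct terms across the corpus, in a fixed order.
lexicon = sorted({t for d in docs for t in d})

# Term-document matrix: rows = terms, columns = documents,
# entry (i, j) = frequency of term i in document j.
counts = [Counter(d) for d in docs]
tdm = [[c[t] for c in counts] for t in lexicon]

for term, row in zip(lexicon, tdm):
    print(f"{term:10s} {row}")
```

Note that the row for "julia" is [1, 1, 1]: a term occurring in every document contributes little to distinguishing them, which motivates the weighting schemes later in the deck.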
Major Steps in Text Mining
1. Preprocess the documents to form a corpus.
2. Identify the lexicon from the corpus.
3. Form the Term-Document matrix (Numericized Text).
4. Apply VSM, LSI, K-Means to measure the proximity between
all the documents and the query.
Preprocessing using TextAnalysis.jl package
Raw input text is usually a stream of characters. We need to
convert this into a stream of terms (the basic processing units).
The first step is to create Documents, which are
collections of terms.
Four types of documents can be created in Julia.
A FileDocument represents a file on disk.
str = "Julia is a high-level, high-performance dynamic
programming language for technical computing"
sd = StringDocument(str)
td = TokenDocument(str)
nd = NGramDocument(str)
Preprocessing
Most of the time we have huge volumes of unstructured data, with
content not carrying any useful information w.r.t. text mining:
HTML tags
Numbers
Stop words
Prepositions, articles, pronouns
In Julia, we can remove this unnecessary content using the functions
remove_articles!(), remove_indefinite_articles!()
remove_definite_articles!(), remove_pronouns!()
remove_prepositions!(), remove_stop_words!()
Stemming
I have a different opinion
I differ with your opinion
I opine differently
In information retrieval, morphological variants of
words/terms carrying the same semantic information add
redundancy.
Stemming linguistically normalizes the lexicon of a corpus.
stem!(Corpus)
stem!(Document) #sd, td or nd
Vector-Space Model
[Figure: three document vectors D1, D2, D3 plotted in the space spanned by the terms T1, T2, T3.]
Documents are vectors in the term-document space.
The elements of the vector are the weights wij,
corresponding to term i and document j.
The weights are the frequencies of the terms in the
documents.
Proximity of documents is calculated by the cosine of
the angle between them. (See the Weighting Schemes slide.)
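The cosine proximity just described is easy to illustrate numerically. A minimal sketch in plain Python, with hypothetical toy vectors in a three-term space:

```python
import math

def cosine(u, v):
    # cos(theta) = (u . v) / (||u||2 * ||v||2)
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Three toy document vectors in the space of terms T1, T2, T3.
d1 = [1, 0, 2]
d2 = [1, 1, 1]
d3 = [0, 3, 0]

print(cosine(d1, d2))  # shares terms with d1 -> high similarity
print(cosine(d1, d3))  # no shared terms -> 0.0
```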
Term-Document Space
Let {d1, d2, ..., dn} be a collection of n documents.
T = {t1, t2, t3, ..., tm} is the lexicon set.
wij is the frequency of occurrence of the term ti in document dj.
Document: dj = [w1j, w2j, w3j, ..., wmj]T.
TDM: A = [d1 d2 d3 ... dn] ∈ Rm×n, with dj ∈ Rm.
Weighting Schemes
Associate each occurrence of a term with a weight that
represents its relevance with respect to the meaning of the
document it appears in.
Longer documents do not always carry more informative
content (or more relevant content w.r.t. a query).
Binary scheme: wij = 1 if ti occurs in dj, else 0.
Term Frequency (TF) scheme: wij = fij, i.e. the number of
times ti occurs in document dj.
Term Frequency - Inverse Document Frequency (TF-IDF) scheme:
Term frequency: tfij = fij / max[f1j, f2j, f3j, ..., fmj].
Inverse document frequency: idfi = log(N / dfi),
where N is the number of documents and dfi is the number of
documents in which the term ti occurs.
wij = tfij × idfi.
m=DocumentTermMatrix(crps)
tf_idf(m)
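The TF-IDF scheme can be sketched directly from these formulas. The following plain-Python illustration follows the slide's definitions on a hypothetical frequency matrix (TextAnalysis.jl's tf_idf may differ in normalization details):

```python
import math

# Raw term frequencies f[i][j]: rows = terms, columns = documents.
f = [
    [2, 0, 1],   # t1
    [1, 3, 0],   # t2
    [1, 1, 1],   # t3
]

def tf_idf_weights(f):
    n_terms, n_docs = len(f), len(f[0])
    # Document frequency df_i: number of documents containing term i.
    df = [sum(1 for j in range(n_docs) if f[i][j] > 0) for i in range(n_terms)]
    w = [[0.0] * n_docs for _ in range(n_terms)]
    for j in range(n_docs):
        col_max = max(f[i][j] for i in range(n_terms))  # max frequency in doc j
        for i in range(n_terms):
            tf = f[i][j] / col_max            # normalized term frequency
            idf = math.log(n_docs / df[i])    # inverse document frequency
            w[i][j] = tf * idf
    return w

w = tf_idf_weights(f)
# t3 occurs in every document, so idf = log(1) = 0 and its weights vanish.
```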
Query Matching
Finding the relevant documents for a query, q.
The cosine distance measure is used:
cos(θ) = qT dj / (‖q‖2 ‖dj‖2),
where θ is the angle between the query q and document dj.
The documents for which cos(θ) > tol are considered
relevant, where tol is a predefined tolerance.
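This matching rule takes only a few lines. A minimal sketch in plain Python, with a hypothetical toy TDM and query (query_match is an illustrative name, not from the session's code):

```python
import math

def cosine(q, d):
    dot = sum(a * b for a, b in zip(q, d))
    return dot / (math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d)))

def query_match(q, docs, tol):
    # Keep the documents whose cosine with the query exceeds the tolerance.
    return [j for j, d in enumerate(docs) if cosine(q, d) > tol]

docs = [[1, 0, 2], [1, 1, 1], [0, 3, 0]]   # columns of a toy TDM
q = [1, 0, 1]                              # the query, in the same term space
print(query_match(q, docs, tol=0.5))       # -> [0, 1]
```

Raising tol shrinks the result set: with tol=0.9 only document 0 survives, which is exactly the trade-off the next slide quantifies.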
Performance Modeling
The tolerance tol decides the number of documents returned.
With a low tol, more documents are returned,
but the chance of the documents being irrelevant increases.
Ideally we want a high number of documents returned, with the
majority of the returned documents relevant.
Precision: P = Dr / Dt, where Dr is the number of relevant
documents retrieved and Dt is the total number of
documents retrieved.
Recall: R = Dr / Nr, where Nr is the total number of relevant
documents in the database.
VSMModel(QueryNum::Int,A::Array{Float64,2},NumQueries::Int)
This function is used to obtain Dr, Dt and Nr for any specific query
using the vector space model.
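Given Dr, Dt and Nr, the two measures are one line each. A plain-Python sketch with hypothetical retrieval results:

```python
def precision_recall(retrieved, relevant):
    Dr = len(set(retrieved) & set(relevant))  # relevant documents retrieved
    Dt = len(retrieved)                       # total documents retrieved
    Nr = len(relevant)                        # total relevant documents
    return Dr / Dt, Dr / Nr

retrieved = [0, 1, 4, 7]   # documents returned for cos > tol
relevant = [1, 4, 5]       # ground-truth relevant documents
P, R = precision_recall(retrieved, relevant)
print(P, R)   # 0.5 0.666...
```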
[Figure: four panels plotting Dr, Dt and Nr against tolerance (0.2–0.8) for Queries 10, 13, 6 and 23.]
Latent Semantic Indexing
There exists an underlying latent semantic structure in the data.
We can identify this structure through the SVD.
Project the data onto two lower-dimensional spaces:
the term space and the document space.
Dimension reduction is achieved through the truncated SVD.
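One common way to realize this projection (a sketch of standard LSI practice, not necessarily the exact scheme in the session's code): truncate the SVD to rank k, represent documents by the rows of Vk scaled by the singular values, and fold queries into the same space via q_k = Σk⁻¹ Ukᵀ q. In NumPy, with a hypothetical toy TDM:

```python
import numpy as np

# Toy term-document matrix (terms x documents).
A = np.array([[2., 0., 1.],
              [1., 3., 0.],
              [1., 1., 1.],
              [0., 2., 2.]])

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T   # truncated factors

# Documents in the k-dimensional latent space: rows of Vk scaled by sk.
docs_k = Vk * sk

# Fold the query into the same space: q_k = Sigma_k^{-1} U_k^T q.
q = np.array([1., 0., 1., 0.])
q_k = (Uk.T @ q) / sk

# Cosine similarities between the query and each document, in latent space.
sims = docs_k @ q_k / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k))
print(sims)
```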
Singular Value Decomposition
Theorem (SVD)
For any matrix A ∈ Rm×n with m ≥ n, there exist two
orthogonal matrices U = (u1, . . . , um) ∈ Rm×m and
V = (v1, . . . , vn) ∈ Rn×n such that
A = U [Σ; 0] VT,
where Σ ∈ Rn×n is diagonal, Σ = diag(σ1, . . . , σn), with
σ1 ≥ σ2 ≥ . . . ≥ σn ≥ 0. The σ1, . . . , σn are called the singular
values of A. The columns of U and V are called the left and right
singular vectors of A, respectively.
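The theorem is easy to check numerically. A small NumPy sketch on a hypothetical 3×2 matrix:

```python
import numpy as np

A = np.array([[3., 1.],
              [2., 2.],
              [1., 3.]])            # m = 3 > n = 2

U, s, Vt = np.linalg.svd(A)         # full SVD: U is 3x3, Vt is 2x2

# Singular values come out sorted and nonnegative.
print(s[0] >= s[1] >= 0)            # True

# Reassemble A = U [Sigma; 0] V^T.
Sigma = np.zeros((3, 2))
Sigma[:2, :2] = np.diag(s)
print(np.allclose(A, U @ Sigma @ Vt))       # True
print(np.allclose(U.T @ U, np.eye(3)))      # U is orthogonal: True
```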
Low Rank Matrix Approximation using SVD
A = ∑i=1..n σi ui viT ≈ ∑i=1..k σi ui viT =: Ak, k < r = rank(A).
The above approximation is based on the Eckart-Young
theorem.
It helps in removing noise, solving ill-conditioned problems,
and, mainly, in dimension reduction of data.
Use the function below to examine the effect of rank reduction.
Why is the SVD good enough to decompose the TDM?
svdRedRank(A,k)
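svdRedRank is a helper from the session's material; a minimal NumPy version of the same idea (the body here is an assumption about what it does) might look like this, together with a check of the Eckart-Young error bound:

```python
import numpy as np

def svd_red_rank(A, k):
    # Best rank-k approximation A_k = sum_{i=1..k} s_i u_i v_i^T (Eckart-Young).
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
A1 = svd_red_rank(A, 1)

# Eckart-Young: the spectral-norm error of the best rank-k approximation
# equals the (k+1)-th singular value.
s = np.linalg.svd(A, compute_uv=False)
print(np.isclose(np.linalg.norm(A - A1, 2), s[1]))   # True
```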
[Diagram: truncated SVD, A (m×n) ≈ U (m×k) S (k×k) V (k×n), with k ≤ n.]
VSMModel(A::Array{Float64,2},nq::Int)
SVDModel(A::Array{Float64,2},nq::Int,rank::Int)
The above functions return the recall and precision using
the vector space model and the LSI model.
The first nq vectors of the matrix A are the queries; the rest
are the document vectors.
For LSI (SVDModel), the rank must also be passed as a
parameter.
Use the functions shown below to view precision vs. recall for
VSM and LSI:
plotNew_RecPrec()
plotAdd_RecPrec()
[Figure: precision vs. recall curves for VSM, LSI and K-Means (KM), with precision roughly 30–70% over recall 10–90%.]
Thank you.