Malek Chebil, Rim Jallouli, Mohamed Anis Bach Tobji and Chiheb Eddine Ben Ncir. Topic modeling of marketing scientific papers: An experimental survey. (ICDEc 2021)
Language Models for Information Retrieval (Dustin Smith)
The document provides background information on Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze, authors of the book "Introduction to Information Retrieval", whose chapter "Language models for information retrieval" the presentation follows. It then outlines the presentation, which discusses language models for information retrieval, including query likelihood models, estimation of query generation probabilities, and experiments comparing language modeling approaches to other IR techniques.
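As a rough illustration of the query likelihood idea, here is a minimal sketch (the corpus and query are invented for illustration) that ranks documents by log P(q|d) under Jelinek-Mercer smoothing:

```python
import math
from collections import Counter

def jm_score(query, doc, collection, lam=0.5):
    """Log query likelihood log P(q|d) with Jelinek-Mercer smoothing:
    P(t|d) = (1 - lam) * tf(t, d) / |d| + lam * cf(t) / |C|."""
    tf, cf = Counter(doc), Counter(collection)
    score = 0.0
    for t in query:
        p = (1 - lam) * tf[t] / len(doc) + lam * cf[t] / len(collection)
        score += math.log(p) if p > 0 else float("-inf")
    return score

# Toy two-document collection; the document that generates the query
# terms with higher probability is ranked first.
docs = ["the quick brown fox".split(),
        "language models for retrieval".split()]
collection = [t for d in docs for t in d]
query = "language retrieval".split()
ranked = sorted(docs, key=lambda d: jm_score(query, d, collection), reverse=True)
```

The interpolation with collection frequencies is what keeps unseen query terms from zeroing out the whole document score.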
The document presents an overview of probabilistic models for information retrieval. It discusses how probability theory can be applied to model the uncertain nature of retrieval, where queries only vaguely represent user needs and relevance is uncertain. The document outlines different probabilistic IR models including the classical probabilistic retrieval model, probability ranking principle, binary independence model, Bayesian networks, and language modeling approaches. It also describes datasets used to evaluate these models, including collections from TREC, Cranfield, and others. Basic probability theory concepts are reviewed, including joint probability, conditional probability, and rules relating probabilities.
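The basic probability identities the document reviews can be checked on a toy contingency table (the counts below are invented for illustration):

```python
from fractions import Fraction

# Invented counts: (term state, relevance state) over 10 judged documents.
counts = {("present", "relevant"): 3, ("present", "nonrelevant"): 1,
          ("absent", "relevant"): 1, ("absent", "nonrelevant"): 5}
total = sum(counts.values())

def p(event):                       # marginal probability of one state
    return Fraction(sum(v for k, v in counts.items() if event in k), total)

def p_joint(term_state, rel_state):
    return Fraction(counts[(term_state, rel_state)], total)

def p_cond(term_state, rel_state):  # P(term_state | rel_state)
    return p_joint(term_state, rel_state) / p(rel_state)

# Chain rule: P(a, b) = P(a | b) * P(b)
chain_ok = p_joint("present", "relevant") == p_cond("present", "relevant") * p("relevant")
# Bayes' rule: P(relevant | present) = P(present | relevant) * P(relevant) / P(present)
posterior = p_cond("present", "relevant") * p("relevant") / p("present")
```

Using exact rationals makes the identities hold with equality rather than up to floating-point error.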
The document discusses probabilistic retrieval models in information retrieval. It introduces three influential probabilistic models: (1) Maron and Kuhns' 1960 model which calculates the probability of relevance based on historical user data; (2) Salton's model which estimates the probability of term occurrence in relevant documents; (3) A model that ranks documents by the probability of relevance and considers retrieval as a decision between costs of retrieving non-relevant vs. not retrieving relevant documents. The document provides background on the development of probabilistic IR models and challenges of estimating probabilities for evaluation.
This summarizes an academic paper that proposes an automatic ontology creation method for classifying research papers. It uses text mining techniques like classification and clustering algorithms. It first builds a research ontology by extracting keywords and patterns from previous papers. It then uses a decision tree algorithm to classify new papers into disciplines defined in the ontology. The classified papers are then clustered based on similarities to group them. The method was tested on a dataset of 100 papers and achieved average precision of 85.7% for term-based and 89.3% for pattern-based keyword extraction.
The document discusses a study that trained a GPT-2 model to generate contextual definitions for words based on the provided context. The model was trained on a new dataset containing definition and context pairs from various sources. It was evaluated through surveys where human raters assessed definitions generated by the model for short and long contexts, as well as real human-generated definitions. The results found that while the model performed significantly better at generating definitions for short contexts compared to long ones, human-generated definitions were still significantly more accurate. Areas for improvement included reducing fluctuations depending on context and better interpreting some contexts.
Mining from Open Answers in Questionnaire Data (feiwin)
The document summarizes a system called Survey Analyzer (SA) that analyzes open-ended answers in questionnaire data. SA combines rule analysis, through classification and association rules, with correspondence analysis to automatically summarize open answers and mine useful information. It employs statistical learning methods such as stochastic complexity to acquire rules from categorized text and to classify new text. SA pairs each analysis target with its associated open answers in order to learn rules and analyze relationships between targets and words.
Order out of Chaos: Construction of Knowledge Models from PDF Textbooks (Isaac Alpizar-Chacon)
Textbooks are educational documents created, structured and formatted by domain experts with the main purpose to explain the knowledge in the domain to a novice. Authors use their understanding of the domain when structuring and formatting the content of a textbook to facilitate this explanation. As a result, the formatting and structural elements of textbooks carry the elements of domain knowledge implicitly encoded by their authors. Our paper presents an extendable approach towards automated extraction of this knowledge from textbooks taking into account their formatting rules and internal structure. We focus on PDF as the most common textbook representation format; however, the overall method is applicable to other formats as well. The evaluation experiments examine the accuracy of the approach, as well as the pragmatic quality of the obtained knowledge models using one of their possible applications --- semantic linking of textbooks in the same domain. The results indicate high accuracy of model construction on symbolic, syntactic and structural levels across textbooks and domains, and demonstrate the added value of the extracted models on the semantic level.
Presented at Document Engineering 2020
This document provides an overview of content analysis as a research technique. It defines content analysis as objective, systematic analysis and categorization of communication content. The workshop covers the procedures of content analysis, including design, unitizing, sampling, coding, drawing inferences and validation. Examples of using content analysis to analyze text data from interviews are presented, showing coding categories, frequency calculations and correlations. Content analysis is described as a useful technique for making inferences from qualitative data in an objective and replicable manner.
Mathematical Language Processing via Tree Embeddings (Sergey Sosnovsky)
This document proposes a framework that uses tree embeddings to process mathematical language via encoding equations as trees. The framework includes a novel encoder-decoder architecture that learns representations of mathematical formulae. This approach achieves state-of-the-art performance on formula retrieval tasks by computing the similarity between embedding vectors of query and dataset equations. Future work will explore joint processing of math and text, deploying the system for textbook search, and using the embeddings for open-ended math problem solving.
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks (Leonardo Di Donato)
Experimental work on using topic modeling to implement and improve common Information Retrieval and Word Sense Disambiguation tasks.
It first describes the scenario, the pre-processing pipeline, and the framework used, then discusses the investigation of different hyperparameter configurations for the LDA algorithm.
The work then addresses the retrieval of relevant documents through two approaches: inferring the topic distribution of a held-out document (or query) and comparing it against the collection's documents to retrieve similar ones, or ranking via probabilistic querying. The last part of the work is devoted to the word sense disambiguation task.
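One way to compare an inferred topic distribution of a query against those of collection documents, as the first approach describes, is a divergence measure over topic mixtures. A minimal sketch using Jensen-Shannon divergence (the distributions below are invented):

```python
import math

def kl(p, q):   # Kullback-Leibler divergence in bits (0 * log 0 treated as 0)
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: symmetric and bounded in [0, 1] with base-2 log."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

query_topics = [0.7, 0.2, 0.1]   # topic mixture inferred for the held-out query
doc_a = [0.6, 0.3, 0.1]
doc_b = [0.1, 0.1, 0.8]
# smaller divergence = more topically similar, so doc_a ranks above doc_b
best = min([doc_a, doc_b], key=lambda d: js(query_topics, d))
```

JS divergence is a common choice here because, unlike plain KL, it is symmetric and defined even when a topic has zero probability on one side.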
Search: Probabilistic Information Retrieval (Vipul Munot)
Probabilistic information retrieval ranks documents by their estimated probability of relevance. The Binary Independence Model assumes binary relevance and independence between terms: documents and queries are represented as binary term-incidence vectors, and documents are ranked by their odds of relevance based on query term matches. In practice, with no relevance information available, the probability estimates fall back on collection statistics such as document frequencies. Extensions allow dependencies between terms and non-binary representations.
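A minimal sketch of BIM-style ranking under the usual no-relevance-information fallback, where each term weight reduces to an idf-like quantity (toy collection invented for illustration):

```python
import math

def bim_weights(docs):
    """BIM term weights with no relevance information, using the usual
    add-0.5 smoothed estimate: c_t = log((N - df_t + 0.5) / (df_t + 0.5))."""
    N = len(docs)
    vocab = {t for d in docs for t in d}
    return {t: math.log((N - sum(t in d for d in docs) + 0.5) /
                        (sum(t in d for d in docs) + 0.5)) for t in vocab}

def rsv(query, doc, weights):
    # retrieval status value: sum the weights of query terms present in the doc
    return sum(weights[t] for t in query if t in doc)

docs = [{"binary", "independence", "model"},
        {"markov", "chain"},
        {"news", "sports"},
        {"weather", "news"}]
w = bim_weights(docs)
scores = [rsv({"independence", "model"}, d, w) for d in docs]
```

Note the weights are per-term and binary in document representation: a term contributes the same amount whether it occurs once or many times, which is exactly the limitation the non-binary extensions address.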
The Boolean model is a classical information retrieval model based on set theory and Boolean logic. Queries are specified as Boolean expressions to retrieve documents that either contain or do not contain the query terms. All term frequencies are binary and documents are retrieved based on an exact match to the query terms. However, this model has limitations as it does not rank documents and queries are difficult for users to translate into Boolean expressions, often returning too few or too many results.
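The exact-match behaviour is easy to sketch with an inverted index of postings sets; Python set operators stand in for AND, OR, and AND NOT (toy collection invented for illustration):

```python
# Minimal Boolean retrieval: an inverted index mapping each term to the
# set of document ids containing it, queried with set operations.
docs = {
    0: "information retrieval with boolean logic",
    1: "probabilistic models of retrieval",
    2: "boolean algebra for circuits",
}
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def postings(term):
    return index.get(term, set())

# "boolean AND retrieval"
and_result = postings("boolean") & postings("retrieval")
# "boolean OR probabilistic"
or_result = postings("boolean") | postings("probabilistic")
# "retrieval AND NOT boolean"
not_result = postings("retrieval") - postings("boolean")
```

The results are unranked sets, which illustrates the model's main limitation: every matching document is returned with equal status.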
This document reviews probabilistic matrix factorization, topic modeling, collaborative topic modeling, and collaborative deep learning approaches. It describes the generative processes and graphical models of probabilistic matrix factorization, latent Dirichlet allocation, collaborative topic modeling, and collaborative deep learning. It also compares collaborative topic modeling and collaborative deep learning methods and discusses potential extensions, including using deep learning models instead of LDA to more deeply extract features from documents.
Integrating Textbooks with Smart Interactive Content for Learning Programming (Isaac Alpizar-Chacon)
Online textbooks with interactive content have emerged as a popular medium for learning programming and other computer science topics. While the textbook component supports acquisition of programming concepts by reading, various types of ``smart'' interactive learning content such as worked examples, code animations, Parson's puzzles, and coding problems allow students to immediately practice and master the newly learned concepts. This paper attempts to automate the time-consuming manual process of augmenting textbooks with ``smart'' interactive content. We introduce an ontology-based approach that can link fragments of text with ``smart'' content activities, demonstrate its application to two practical linking cases, and present the results of its pilot evaluation.
This document summarizes a research paper that proposes a new representation for relational learning that allows the use of propositional learning algorithms. The paper argues that traditional inductive logic programming (ILP) approaches have limitations like intractability and inefficiency. It presents a representation using a restricted first-order logic and graph structures that can be converted to propositions, enabling the use of propositional and probabilistic learning algorithms. An information extraction system using this approach achieved better performance than other ILP-based systems. The paper contributes a new paradigm for relational learning but did not fully analyze the contributions of its two-stage architecture.
Semantics2018 Zhang, Petrak, Maynard: Adapted TextRank for Term Extraction: A G... (Johann Petrak)
Slides for the talk about the paper:
Ziqi Zhang, Johann Petrak and Diana Maynard, 2018: Adapted TextRank for Term Extraction: A Generic Method of Improving Automatic Term Extraction Algorithms. Semantics-2018, Vienna, Austria
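For orientation, here is a plain TextRank-style sketch, not the paper's adapted variant: build a word co-occurrence graph over a sliding window and rank words by PageRank-style power iteration (toy token list invented for illustration):

```python
from collections import defaultdict

def textrank(tokens, window=2, damping=0.85, iters=50):
    # Build an undirected co-occurrence graph over a sliding window.
    graph = defaultdict(set)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != w:
                graph[w].add(tokens[j])
                graph[tokens[j]].add(w)
    nodes = list(graph)
    score = {n: 1.0 for n in nodes}
    for _ in range(iters):   # PageRank-style power iteration
        new = {}
        for n in nodes:
            new[n] = (1 - damping) + damping * sum(
                score[m] / len(graph[m]) for m in graph[n])
        score = new
    return sorted(nodes, key=score.get, reverse=True)

tokens = ("term extraction improves term ranking "
          "term graphs rank candidate term strings").split()
ranking = textrank(tokens)
```

Well-connected words (here, the repeated word "term") accumulate score and surface as term candidates; real systems add stopword filtering, POS filters, and multi-word candidate assembly on top.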
An efficient classification model for unstructured text document (SaleihGero)
The document presents a classification model for unstructured text documents that aims to support both generality and efficiency. The model follows the logical sequence of text classification steps and proposes a combination of techniques for each step. Specifically, it uses multinomial naive bayes classification with term frequency-inverse document frequency (TF-IDF) representation. The model is tested on the 20-Newsgroups dataset and results show improved performance over precision, recall, and f-score compared to other models.
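A minimal multinomial naive Bayes sketch with add-one smoothing; note it uses raw term counts rather than the TF-IDF weighting the model combines it with (toy training data invented for illustration):

```python
import math
from collections import Counter, defaultdict

def train_mnb(labeled_docs):
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""
    class_docs = defaultdict(int)       # documents per class (for priors)
    class_terms = defaultdict(Counter)  # term counts per class
    vocab = set()
    for label, tokens in labeled_docs:
        class_docs[label] += 1
        class_terms[label].update(tokens)
        vocab.update(tokens)
    total = sum(class_docs.values())
    return class_docs, class_terms, vocab, total

def predict(model, tokens):
    class_docs, class_terms, vocab, total = model
    best, best_lp = None, float("-inf")
    for label in class_docs:
        lp = math.log(class_docs[label] / total)   # log prior
        denom = sum(class_terms[label].values()) + len(vocab)
        for t in tokens:                           # smoothed log likelihoods
            lp += math.log((class_terms[label][t] + 1) / denom)
        best, best_lp = (label, lp) if lp > best_lp else (best, best_lp)
    return best

model = train_mnb([
    ("sport", "goal match team win".split()),
    ("sport", "match score team".split()),
    ("tech", "code bug compiler".split()),
    ("tech", "compiler code release".split()),
])
```

Add-one smoothing keeps unseen words from driving a class's likelihood to zero, which matters greatly on small training sets like the 20-Newsgroups splits.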
Two Level Disambiguation Model for Query Translation (IJECEIAES)
Selecting the most suitable translation among all candidates returned by a bilingual dictionary has always been quite a challenging task in cross-language query translation. Researchers have frequently tried to use word co-occurrence statistics to determine the most probable translation of a user query. Algorithms using such statistics have certain shortcomings, which this paper addresses. We propose a novel method for ambiguity resolution, named the "two level disambiguation model". At the first level of disambiguation, the model properly weighs the importance of the translation alternatives of query terms obtained from the dictionary. The importance factor measures the probability of a translation candidate being selected as the final translation of a query term, which removes the problem of making a binary decision for translation candidates. At the second level of disambiguation, the model treats the user query as a single concept and deduces the translations of all query terms simultaneously, taking the weights of the translation alternatives into account. This is contrary to previous research, which selects the translation of each word in the source-language query independently. Experimental results with English-Hindi cross-language information retrieval show that the proposed two level disambiguation model achieved 79.53% and 83.50% of monolingual translation, and 21.11% and 17.36% improvement over greedy disambiguation strategies, in terms of MAP for short and long queries respectively.
Slides: Concurrent Inference of Topic Models and Distributed Vector Represen... (Parang Saraf)
Abstract: Topic modeling techniques have been widely used to uncover dominant themes hidden inside an unstructured document collection. Though these techniques first originated in the probabilistic analysis of word distributions, many deep learning approaches have been adopted recently. In this paper, we propose a novel neural network based architecture that produces distributed representation of topics to capture topical themes in a dataset. Unlike many state-of-the-art techniques for generating distributed representation of words and documents that directly use neighboring words for training, we leverage the outcome of a sophisticated deep neural network to estimate the topic labels of each document. The networks, for topic modeling and generation of distributed representations, are trained concurrently in a cascaded style with better runtime without sacrificing the quality of the topics. Empirical studies reported in the paper show that the distributed representations of topics represent intuitive themes using smaller dimensions than conventional topic modeling approaches.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
This document summarizes the agenda and key topics for a CIS 890 project final presentation on topic modelling with LDA. The presentation covers LDA modelling, HMM-LDA modelling, LDA with collocations modelling, and experimental results on the NIPS collection. It discusses topic modelling approaches like LDA, discriminative vs. generative methods, and limitations of bag-of-words assumptions.
Survey of Generative Clustering Models 2008 (Roman Stanchak)
Survey of Generative Clustering Models "Probabilistic Topic Models" circa 2008. Class presentation by Roman Stanchak and Prithviraj Sen for University of Maryland College Park cmsc828g, Link Mining and Dynamic Graph Analysis. Spring 2008. Instructor: Prof. Lise Getoor
Topic modeling with big data analytics makes it possible to analyze very large datasets. It involves installing Hadoop on multiple nodes for distributed processing, preprocessing data into the desired format, and using modeling tools to parallelize computation and select algorithms. Topic modeling identifies patterns in corpora to develop new ways to search, browse, and summarize large text archives. Tools like Mallet implement algorithms such as LDA and PLSI to run topic modeling on Hadoop, applying it to analyze news articles, search engine rankings, genetic data, image data, and more.
Topic modeling is a technique for discovering hidden semantic patterns in large document collections. It represents documents as probability distributions over latent topics, where each topic is characterized by a distribution over words. Two common probabilistic topic models are latent Dirichlet allocation (LDA) and probabilistic latent semantic analysis (pLSA). LDA assumes each document exhibits multiple topics in different proportions, with topics modeled as distributions over words. Topic modeling provides dimensionality reduction and can be applied to problems like text classification, collaborative filtering, and computer vision tasks like image classification.
This document provides an overview of various information retrieval models and techniques used in search engines, including:
- Boolean, vector space, probabilistic models such as BM25, and language models are described as the classical retrieval models.
- Learning to rank uses machine learning techniques to optimize ranking functions using features and training data.
- Relevance feedback, query likelihood models, and pseudo-relevance feedback are discussed as techniques for improving retrieval effectiveness by incorporating user feedback.
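To make one of the listed models concrete, here is a sketch of Okapi BM25 scoring (toy collection invented for illustration; k1 and b set to common defaults):

```python
import math

def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    """Okapi BM25: idf-weighted, saturated term frequency with
    document-length normalization against the average length."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    score = 0.0
    for t in set(query):
        df = sum(t in d for d in docs)
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        tf = doc.count(t)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

docs = ["bm25 ranks documents by term weights".split(),
        "boolean retrieval returns unranked sets".split(),
        "language models smooth term estimates".split()]
scores = [bm25_score("bm25 term".split(), d, docs) for d in docs]
```

The k1 parameter controls how quickly repeated term occurrences saturate, and b controls how strongly long documents are penalized.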
Comparative study of classification algorithm for text based categorization (eSAT Journals)
Abstract
Text categorization is a process in data mining which assigns predefined categories to free-text documents using machine learning techniques. Any document in the form of text, image, music, etc. can be classified using some categorization technique. It provides conceptual views of the collected documents and has important applications in the real world. Text-based categorization is used for document classification with pattern recognition and machine learning. The advantages of a number of classification algorithms are studied in this paper to classify documents; examples of these algorithms are the Naive Bayes algorithm, K-Nearest Neighbor, and Decision Tree. This paper presents a comparative study of the advantages and disadvantages of the above-mentioned classification algorithms.
Keywords: Data Mining, Text Mining, Text Categorization, Machine Learning, Pattern Analysis, Naive Bayes, KNN, Decision Tree.
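A k-nearest-neighbour text classifier, one of the compared algorithms, can be sketched with cosine similarity over term-count vectors and a majority vote (toy training data invented for illustration):

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(train, tokens, k=3):
    vec = Counter(tokens)
    # Take the k most similar training documents and vote on the label.
    neighbours = sorted(train, key=lambda lv: cosine(vec, lv[1]), reverse=True)[:k]
    votes = Counter(label for label, _ in neighbours)
    return votes.most_common(1)[0][0]

train = [("spam", Counter("win prize money now".split())),
         ("spam", Counter("free money prize".split())),
         ("ham", Counter("meeting agenda notes".split())),
         ("ham", Counter("project meeting schedule".split()))]
```

kNN needs no training phase at all, which is the flip side of its main disadvantage in the comparison: every prediction scans the whole training set.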
Combining IR with Relevance Feedback for Concept Location (Sonia Haiduc)
This document discusses using information retrieval and relevance feedback techniques to help with the concept location task during software maintenance and evolution. It describes the concept location process, challenges with query formulation, and how relevance feedback can help developers iteratively refine queries to more accurately locate relevant code. The document outlines two studies conducted that show relevance feedback generally improves the results of information retrieval for concept location, though the benefits depend on properly calibrating the relevance feedback parameters for each software system.
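Relevance feedback of the kind the studies evaluate is classically implemented with Rocchio-style query refinement: move the query vector toward documents the developer marked relevant and away from non-relevant ones (a sketch with invented weights; the studies' own parameter calibration is not reproduced here):

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio query refinement over sparse term-weight dicts:
    q' = alpha*q + beta*centroid(relevant) - gamma*centroid(nonrelevant)."""
    terms = set(query) | {t for d in relevant + nonrelevant for t in d}
    new = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if relevant:
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if nonrelevant:
            w -= gamma * sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant)
        new[t] = max(w, 0.0)   # negative weights are conventionally clipped to zero
    return new

q = {"parser": 1.0}
rel = [{"parser": 1.0, "grammar": 1.0}]
nonrel = [{"parser": 1.0, "network": 1.0}]
q2 = rocchio(q, rel, nonrel)
```

After one round, the refined query gains weight on terms from relevant code ("grammar") and suppresses terms from non-relevant code ("network"); the alpha, beta, gamma knobs are exactly the kind of parameters whose calibration the studies found to matter per system.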
What to read next? Challenges and Preliminary Results in Selecting Represen... (MOVING Project)
1. The document presents an approach for selecting representative documents from a set of search results to provide users with an overview of the content and subtopics. It compares different document representations, clustering algorithms, and selection methods on two datasets.
2. The evaluation measures of coverage and redundancy were found to be insufficient for accurately evaluating representativeness, as the scores increased with the number of selected documents and were sometimes independent of the actual selection method.
3. The research questions explored how document representation, clustering algorithm, and selection method influence coverage and redundancy, finding the choice of clustering had the largest impact. Coverage and redundancy were found to be inflated and not directly reflect representativeness.
Presentation of the main IR models
Presentation of our submission to TREC KBA 2014 (Entity oriented information retrieval), in partnership with Kware company (V. Bouvier, M. Benoit)
This document provides an overview of topic modeling. It defines topic modeling as discovering the thematic structure of a corpus by modeling relationships between words and documents through learned topics. The document introduces Latent Dirichlet Allocation (LDA) as a widely used topic modeling technique. It outlines LDA's generative process and inference methods like Gibbs sampling and variational inference. The document also discusses extensions to LDA, evaluation strategies, open questions, and applications like topic labeling and browsing.
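The collapsed Gibbs sampling inference mentioned above can be sketched on a toy corpus (illustrative only; the data and hyperparameters are invented, and no optimization is attempted):

```python
import random
from collections import defaultdict

def lda_gibbs(docs, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    ndk = [[0] * K for _ in docs]               # document-topic counts
    nkw = [defaultdict(int) for _ in range(K)]  # topic-word counts
    nk = [0] * K                                # tokens assigned to each topic
    z = []                                      # topic assignment per token
    for di, d in enumerate(docs):               # random initialization
        zs = []
        for w in d:
            t = rng.randrange(K)
            zs.append(t)
            ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
        z.append(zs)
    for _ in range(iters):                      # resample each token's topic
        for di, d in enumerate(docs):
            for wi, w in enumerate(d):
                t = z[di][wi]                   # remove the current assignment
                ndk[di][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                # full conditional: P(z=k | rest) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
                weights = [(ndk[di][k] + alpha) * (nkw[k][w] + beta)
                           / (nk[k] + V * beta) for k in range(K)]
                t = rng.choices(range(K), weights=weights)[0]
                z[di][wi] = t
                ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return ndk, nkw

docs = ["apple banana apple fruit".split(),
        "banana fruit apple".split(),
        "gpu code compiler".split(),
        "code gpu kernel".split()]
ndk, nkw = lda_gibbs(docs, K=2)
```

The returned count tables, smoothed by alpha and beta, give the usual point estimates of the document-topic and topic-word distributions.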
Harnessing Textbooks for High-Quality Labeled Data: An Approach to Automatic ... (Sergey Sosnovsky)
As textbooks evolve into digital platforms, they open a world of opportunities for Artificial Intelligence in Education (AIED) research. This paper delves into the novel use of textbooks as a source of high-quality labeled data for automatic keyword extraction, demonstrating an affordable and efficient alternative to traditional methods. By utilizing the wealth of structured information provided in textbooks, we propose a methodology for annotating corpora across diverse domains, circumventing the costly and time-consuming process of manual data annotation. Our research presents a deep learning model based on Bidirectional Encoder Representations from Transformers (BERT) fine-tuned on this newly labeled dataset. This model is applied to keyword extraction tasks, with the model’s performance surpassing established baselines. We further analyze the transformation of BERT’s embedding space before and after the fine-tuning phase, illuminating how the model adapts to specific domain goals. Our findings substantiate textbooks as a resource-rich, untapped well of high-quality labeled data, underpinning their significant role in the AIED research landscape.
Mathematical Language Processing via Tree EmbeddingsSergey Sosnovsky
This document proposes a framework that uses tree embeddings to process mathematical language via encoding equations as trees. The framework includes a novel encoder-decoder architecture that learns representations of mathematical formulae. This approach achieves state-of-the-art performance on formula retrieval tasks by computing the similarity between embedding vectors of query and dataset equations. Future work will explore joint processing of math and text, deploying the system for textbook search, and using the embeddings for open-ended math problem solving.
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksLeonardo Di Donato
Experimental work done regarding the use of Topic Modeling for the implementation and the improvement of some common tasks of Information Retrieval and Word Sense Disambiguation.
First of all it describes the scenario, the pre-processing pipeline realized and the framework used. After we we face a discussion related to the investigation of some different hyperparameters configurations for the LDA algorithm.
This work continues dealing with the retrieval of relevant documents mainly through two different approaches: inferring the topics distribution of the held out document (or query) and comparing it to retrieve similar collection’s documents or through an approach driven by probabilistic querying. The last part of this work is devoted to the investigation of the word sense disambiguation task.
Search: Probabilistic Information RetrievalVipul Munot
Probabilistic Information Retrieval uses probability rankings to effectively retrieve documents. It assumes binary relevance and independence between documents. The Binary Independence Model represents documents and queries as term vectors and estimates probabilities of relevance using term frequencies. Documents are ranked by their odds of relevance based on query term matches. In practice, probability estimates use collection frequencies. Extensions allow dependencies between terms and non-binary representations.
The Boolean model is a classical information retrieval model based on set theory and Boolean logic. Queries are specified as Boolean expressions to retrieve documents that either contain or do not contain the query terms. All term frequencies are binary and documents are retrieved based on an exact match to the query terms. However, this model has limitations as it does not rank documents and queries are difficult for users to translate into Boolean expressions, often returning too few or too many results.
This document reviews probabilistic matrix factorization, topic modeling, collaborative topic modeling, and collaborative deep learning approaches. It describes the generative processes and graphical models of probabilistic matrix factorization, latent Dirichlet allocation, collaborative topic modeling, and collaborative deep learning. It also compares collaborative topic modeling and collaborative deep learning methods and discusses potential extensions, including using deep learning models instead of LDA to more deeply extract features from documents.
Integrating Textbooks with Smart Interactive Content for Learning ProgrammingIsaac Alpizar-Chacon
Online textbooks with interactive content emerged as a popular medium for learning programming and other computer science topics. While the textbook component supports acquisition of programming concepts by reading, various types of ``smart'' interactive learning content such as worked examples, code animations, Parson's puzzles, and coding problems allow students to immediately practice and master the newly learned concepts. This paper attempts to automate the time-consuming manual process of augmenting textbooks with ``smart'' interactive content. We introduce an ontology-based approach that can link fragment of text with ``smart'' content activities, demonstrate its application to two practical linking cases, and present the results of its pilot evaluation.
This document summarizes a research paper that proposes a new representation for relational learning that allows the use of propositional learning algorithms. The paper argues that traditional inductive logic programming (ILP) approaches have limitations like intractability and inefficiency. It presents a representation using a restricted first-order logic and graph structures that can be converted to propositions, enabling the use of propositional and probabilistic learning algorithms. An information extraction system using this approach achieved better performance than other ILP-based systems. The paper contributes a new paradigm for relational learning but did not fully analyze the contributions of its two-stage architecture.
Semantics2018 Zhang, Petrak, Maynard: Adapted TextRank for Term Extraction: A G... (Johann Petrak)
Slides for the talk about the paper:
Ziqi Zhang, Johann Petrak and Diana Maynard, 2018: Adapted TextRank for Term Extraction: A Generic Method of Improving Automatic Term Extraction Algorithms. Semantics-2018, Vienna, Austria
An efficient-classification-model-for-unstructured-text-document (SaleihGero)
The document presents a classification model for unstructured text documents that aims to support both generality and efficiency. The model follows the logical sequence of text classification steps and proposes a combination of techniques for each step. Specifically, it uses multinomial naive Bayes classification with term frequency-inverse document frequency (TF-IDF) representation. The model is tested on the 20-Newsgroups dataset, and the results show improved precision, recall, and F-score compared to other models.
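The pipeline described above (TF-IDF features fed to a multinomial naive Bayes classifier) can be sketched in miniature. The toy corpus and labels below are invented for illustration, not the paper's 20-Newsgroups setup, and the from-scratch implementation stands in for whatever library the authors used:

```python
import math
from collections import Counter, defaultdict

# Toy corpus standing in for the 20-Newsgroups data; labels are invented.
train = [
    ("sports", "the team won the game"),
    ("sports", "the player scored a goal"),
    ("tech", "the cpu runs the program"),
    ("tech", "the program uses memory"),
]
docs = [text.split() for _, text in train]
n_docs = len(docs)

# Inverse document frequency over the training corpus.
df = Counter(w for doc in docs for w in set(doc))
idf = {w: math.log(n_docs / df[w]) + 1.0 for w in df}
vocab = set(df)

def tfidf(tokens):
    tf = Counter(tokens)
    return {w: tf[w] * idf.get(w, 1.0) for w in tf}

# Multinomial naive Bayes trained on TF-IDF weights instead of raw counts.
class_counts = Counter(label for label, _ in train)
class_totals = defaultdict(float)
word_weights = defaultdict(lambda: defaultdict(float))
for label, text in train:
    for w, wt in tfidf(text.split()).items():
        word_weights[label][w] += wt
        class_totals[label] += wt

def predict(text):
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # Log prior plus Laplace-smoothed log likelihood of each token.
        score = math.log(class_counts[label] / n_docs)
        denom = class_totals[label] + len(vocab)
        for w in text.split():
            score += math.log((word_weights[label].get(w, 0.0) + 1.0) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict("the player won"))  # → sports
print(predict("the cpu memory"))  # → tech
```

Weighting the naive Bayes counts by TF-IDF rather than raw frequency is one common way to combine the two techniques the summary names.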
Two Level Disambiguation Model for Query Translation (IJECEIAES)
Selecting the most suitable translation among all the candidates returned by a bilingual dictionary has always been quite a challenging task in cross-language query translation. Researchers have frequently tried to use word co-occurrence statistics to determine the most probable translation for a user query. Algorithms using such statistics have certain shortcomings, which this paper addresses. We propose a novel method for ambiguity resolution, named the "two level disambiguation model". At the first level of disambiguation, the model properly weighs the importance of the translation alternatives of query terms obtained from the dictionary. The importance factor measures the probability of a translation candidate being selected as the final translation of a query term, which removes the problem of making a binary decision for translation candidates. At the second level of disambiguation, the model treats the user query as a single concept and deduces the translations of all query terms simultaneously, taking the weights of the translation alternatives into account. This is contrary to previous research, which selects the translation for each word in the source-language query independently. Experimental results with English-Hindi cross-language information retrieval show that the proposed two level disambiguation model achieved 79.53% and 83.50% of monolingual retrieval performance and improvements of 21.11% and 17.36% over greedy disambiguation strategies, in terms of MAP, for short and long queries respectively.
Slides: Concurrent Inference of Topic Models and Distributed Vector Represent... (Parang Saraf)
Abstract: Topic modeling techniques have been widely used to uncover dominant themes hidden inside an unstructured document collection. Though these techniques first originated in the probabilistic analysis of word distributions, many deep learning approaches have been adopted recently. In this paper, we propose a novel neural network based architecture that produces distributed representation of topics to capture topical themes in a dataset. Unlike many state-of-the-art techniques for generating distributed representation of words and documents that directly use neighboring words for training, we leverage the outcome of a sophisticated deep neural network to estimate the topic labels of each document. The networks, for topic modeling and generation of distributed representations, are trained concurrently in a cascaded style with better runtime without sacrificing the quality of the topics. Empirical studies reported in the paper show that the distributed representations of topics represent intuitive themes using smaller dimensions than conventional topic modeling approaches.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
This document summarizes the agenda and key topics for a CIS 890 final project presentation on topic modelling with LDA. The presentation covers LDA modelling, HMM-LDA modelling, LDA with collocations modelling, and experimental results on the NIPS collection. It also discusses topic modelling approaches such as LDA, discriminative vs. generative methods, and limitations of the bag-of-words assumption.
Survey of Generative Clustering Models 2008 (Roman Stanchak)
Survey of Generative Clustering Models "Probabilistic Topic Models" circa 2008. Class presentation by Roman Stanchak and Prithviraj Sen for University of Maryland College Park cmsc828g, Link Mining and Dynamic Graph Analysis. Spring 2008. Instructor: Prof. Lise Getoor
Topic modeling with big data analytics can scale to very large datasets. It involves installing Hadoop on multiple nodes for distributed processing, preprocessing data into the desired format, and using modeling tools to parallelize computation and select algorithms. Topic modeling identifies patterns in corpora, enabling new ways to search, browse, and summarize large text archives. Tools like Mallet implement algorithms such as LDA and PLSI on Hadoop, with applications to news articles, search-engine rankings, genetic data, image data, and more.
Topic modeling is a technique for discovering hidden semantic patterns in large document collections. It represents documents as probability distributions over latent topics, where each topic is characterized by a distribution over words. Two common probabilistic topic models are latent Dirichlet allocation (LDA) and probabilistic latent semantic analysis (pLSA). LDA assumes each document exhibits multiple topics in different proportions, with topics modeled as distributions over words. Topic modeling provides dimensionality reduction and can be applied to problems like text classification, collaborative filtering, and computer vision tasks like image classification.
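A collapsed Gibbs sampler is one standard way to fit the LDA model described above. The from-scratch sketch below (toy corpus; K, alpha, and beta are illustrative choices) shows the core update, in which each token's topic is resampled conditioned on all other assignments:

```python
import random
from collections import Counter

random.seed(0)

# Toy corpus with two obvious themes; K, alpha, beta are illustrative choices.
docs = [
    "ball goal team ball win".split(),
    "team goal win ball".split(),
    "code bug python code test".split(),
    "python test code bug".split(),
]
K, alpha, beta = 2, 0.1, 0.01
vocab = sorted({w for d in docs for w in d})
V = len(vocab)

# Random initial topic assignment for every token, plus count tables.
z = [[random.randrange(K) for _ in d] for d in docs]
doc_topic = [Counter(zd) for zd in z]          # n_{d,k}
topic_word = [Counter() for _ in range(K)]     # n_{k,w}
topic_total = [0] * K
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        topic_word[z[d][i]][w] += 1
        topic_total[z[d][i]] += 1

def sample(weights):
    r = random.uniform(0, sum(weights))
    for k, wt in enumerate(weights):
        r -= wt
        if r <= 0:
            return k
    return len(weights) - 1

# Collapsed Gibbs sampling: resample each token's topic given all the others.
for _ in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k_old = z[d][i]
            doc_topic[d][k_old] -= 1
            topic_word[k_old][w] -= 1
            topic_total[k_old] -= 1
            weights = [
                (doc_topic[d][k] + alpha)
                * (topic_word[k][w] + beta) / (topic_total[k] + V * beta)
                for k in range(K)
            ]
            k_new = sample(weights)
            z[d][i] = k_new
            doc_topic[d][k_new] += 1
            topic_word[k_new][w] += 1
            topic_total[k_new] += 1

for k in range(K):
    print(f"topic {k}:", [w for w, _ in topic_word[k].most_common(3)])
```

On this tiny, well-separated corpus the two topics typically split into the "ball/goal/team" theme and the "code/python/bug" theme; production systems use optimized samplers or variational inference instead.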
This document provides an overview of various information retrieval models and techniques used in search engines, including:
- Boolean, vector space, probabilistic models like BM25, and language models are described as older retrieval models.
- Learning to rank uses machine learning techniques to optimize ranking functions using features and training data.
- Relevance feedback, query likelihood models, and pseudo-relevance feedback are discussed as techniques for improving retrieval effectiveness by incorporating user feedback.
Comparative study of classification algorithm for text based categorization (eSAT Journals)
Abstract
Text categorization is a process in data mining which assigns predefined categories to free-text documents using machine learning techniques. Any document in the form of text, image, music, etc. can be classified using some categorization technique. It provides conceptual views of the collected documents and has important applications in the real world. Text-based categorization is used for document classification with pattern recognition and machine learning. The advantages of a number of classification algorithms for classifying documents are studied in this paper; examples of these algorithms include the Naive Bayes algorithm, K-Nearest Neighbor, and Decision Tree. The paper presents a comparative study of the advantages and disadvantages of the above-mentioned classification algorithms.
Keywords: Data Mining, Text Mining, Text Categorization, Machine Learning, Pattern Analysis, Naive Bayes, KNN, Decision Tree.
Combining IR with Relevance Feedback for Concept Location (Sonia Haiduc)
This document discusses using information retrieval and relevance feedback techniques to help with the concept location task during software maintenance and evolution. It describes the concept location process, challenges with query formulation, and how relevance feedback can help developers iteratively refine queries to more accurately locate relevant code. The document outlines two studies conducted that show relevance feedback generally improves the results of information retrieval for concept location, though the benefits depend on properly calibrating the relevance feedback parameters for each software system.
What to read next? Challenges and Preliminary Results in Selecting Represen... (MOVING Project)
1. The document presents an approach for selecting representative documents from a set of search results to provide users with an overview of the content and subtopics. It compares different document representations, clustering algorithms, and selection methods on two datasets.
2. The evaluation measures of coverage and redundancy were found to be insufficient for accurately evaluating representativeness, as the scores increased with the number of selected documents and were sometimes independent of the actual selection method.
3. The research questions explored how document representation, clustering algorithm, and selection method influence coverage and redundancy, finding the choice of clustering had the largest impact. Coverage and redundancy were found to be inflated and not directly reflect representativeness.
Presentation of the main IR models
Presentation of our submission to TREC KBA 2014 (Entity oriented information retrieval), in partnership with Kware company (V. Bouvier, M. Benoit)
This document provides an overview of topic modeling. It defines topic modeling as discovering the thematic structure of a corpus by modeling relationships between words and documents through learned topics. The document introduces Latent Dirichlet Allocation (LDA) as a widely used topic modeling technique. It outlines LDA's generative process and inference methods like Gibbs sampling and variational inference. The document also discusses extensions to LDA, evaluation strategies, open questions, and applications like topic labeling and browsing.
Harnessing Textbooks for High-Quality Labeled Data: An Approach to Automatic ... (Sergey Sosnovsky)
As textbooks evolve into digital platforms, they open a world of opportunities for Artificial Intelligence in Education (AIED) research. This paper delves into the novel use of textbooks as a source of high-quality labeled data for automatic keyword extraction, demonstrating an affordable and efficient alternative to traditional methods. By utilizing the wealth of structured information provided in textbooks, we propose a methodology for annotating corpora across diverse domains, circumventing the costly and time-consuming process of manual data annotation. Our research presents a deep learning model based on Bidirectional Encoder Representations from Transformers (BERT) fine-tuned on this newly labeled dataset. This model is applied to keyword extraction tasks, with the model’s performance surpassing established baselines. We further analyze the transformation of BERT’s embedding space before and after the fine-tuning phase, illuminating how the model adapts to specific domain goals. Our findings substantiate textbooks as a resource-rich, untapped well of high-quality labeled data, underpinning their significant role in the AIED research landscape.
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly... (Angelo Salatino)
Classifying research papers according to their research topics is an important task to improve their retrievability, assist the creation of smart analytics, and support a variety of approaches for analysing and making sense of the research environment. In this paper, we present the CSO Classifier, a new unsupervised approach for automatically classifying research papers according to the Computer Science Ontology (CSO), a comprehensive ontology of re-search areas in the field of Computer Science. The CSO Classifier takes as input the metadata associated with a research paper (title, abstract, keywords) and returns a selection of research concepts drawn from the ontology. The approach was evaluated on a gold standard of manually annotated articles yielding a significant improvement over alternative methods.
This document outlines a course on Knowledge Representation (KR) on the Web. The course aims to expose students to challenges of applying traditional KR techniques to the scale and heterogeneity of data on the Web. Students will learn about representing Web data through formal knowledge graphs and ontologies, integrating and reasoning over distributed datasets, and how characteristics such as volume, variety and veracity impact KR approaches. The course involves lectures, literature reviews, and milestone projects where students publish papers on building semantic systems, modeling Web data, ontology matching, and reasoning over large knowledge graphs.
This document discusses topic extraction for domain ontology. It describes domain ontology as a collection of vocabularies and conceptualization of a given domain. The purpose of topic extraction is to identify relevant concepts in documents, obtain domain-specific terms, classify documents, and identify key concepts and relationships for an ontology. The project stages include obtaining domain knowledge, preprocessing documents, and applying either K-Means clustering or Latent Dirichlet Allocation to extract topics. K-Means partitions data into clusters while LDA represents documents as mixtures over topics characterized by word distributions.
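The K-Means alternative mentioned above can be sketched with a tiny from-scratch implementation over term-frequency vectors. The documents, vocabulary, and farthest-point initialisation below are illustrative choices, not the project's actual setup:

```python
import math
from collections import Counter

# Toy documents from two domains; vocabulary and k=2 are illustrative choices.
docs = [
    "ontology concept domain term",
    "domain ontology concept vocabulary",
    "neural network training layer",
    "network layer neural weights",
]
vocab = sorted({w for d in docs for w in d.split()})

def vectorize(text):
    # Raw term-frequency vector over the shared vocabulary.
    tf = Counter(text.split())
    return [tf[w] for w in vocab]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=20):
    # Farthest-point initialisation keeps this toy example deterministic.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every document.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

labels = kmeans([vectorize(d) for d in docs], 2)
print(labels)  # → [0, 0, 1, 1]: the two ontology documents vs. the two neural ones
```

Unlike LDA, which assigns each document a mixture over topics, this partitions every document into exactly one cluster, which is the contrast the summary draws.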
Introduction to Text Mining and Topic Modelling (David Paule)
A brief introduction to Text Mining and Topic Modelling given at the Urban Big Data Centre (University of Glasgow).
Want to know more? Visit my website davidpaule.es
This document provides an overview of design-based research (DBR) for studying educational innovations. It discusses DBR as a flexible methodology that uses iterative design, development, implementation, and analysis to improve educational practices and develop design principles and theories. Key aspects of DBR include collaboration between researchers and practitioners in real-world settings, qualitative and multimethod approaches, and exploring new domains to design effective solutions while allowing theories to emerge. The document also provides recommendations for conducting DBR, such as rigorous data collection and clear project structure.
1 Nova Southeastern University College of Computing.docx (ShiraPrater50)
Nova Southeastern University
College of Computing and Engineering
Master of Management Information Systems
MMIS 643 Data Mining
Fall 2019 (August 19 – December 8, 2019)
Class Project
Due Date: November 17, 2019 (Firm)
Instructor: Dr. Junping Sun

In this project, you will be expected to do a comprehensive literature search and survey, select and study a specific topic in one subject area of data mining and its applications in business intelligence and analytics (BIA), and write a research paper on the selected topic by yourself. The research paper can be a detailed comprehensive study of a specific topic or original research work done by yourself.

Requirements and Instructions for the Research Paper:
1. The objective of the paper should be very clear about the subject, scope, domain, and the goals to be achieved.
2. The paper should address important advanced and critical issues in a specific area of data mining and its applications in business intelligence and analytics. Your research paper should emphasize not only breadth of coverage but also depth of coverage in the specific area.
3. The research paper should give measurable conclusions and future research directions (this is your contribution).
4. It may be beneficial to review or browse through about 15 to 20 relevant technical articles before deciding on the topic of the research project.
5. The research paper can be:
a. A literature review of data mining techniques and their applications for business intelligence and analytics.
b. An in-depth study and examination of data mining techniques with technical details.
c. Applied research that applies a data mining method to solve a real-world application in the domain of BIA.
6. The research paper should reflect quality at an academic research level.
7. The paper should be at least 3000-3500 words, double-spaced.
8. The paper should include an adequate abstract or introduction, and a reference list.
9. Please write the paper in your own words, and give the names of references, citations, and sources of reference materials if you use statements from other reference articles.
10. From a systematic study point of view, you may want to read a list of technical papers from relevant magazines, journals, conference proceedings, and theses in the area of the topic you choose.
11. For the format and style of your research paper, please refer to the CEC Dissertation Guide (http://cec.nova.edu/doctoral/documents/nsu-cec-dissertation-guides.html), the Publication Manual of the APA, or the format of ACM and IEEE journal publications.

Suggested and Possible Topics for the Written Report (But Not Limited To):
Supervised Learning Methods:
Classification Methods:
Regression Methods
Multiple Linear Regression
Logistic Regression ...
Topic Extraction using Machine Learning (Sanjib Basak)
This document discusses topic extraction using machine learning techniques. It provides a history of topic models, including TF-IDF, LSI, pLSI and LDA. It describes how LDA uses a hierarchical Bayesian model to represent documents as mixtures of topics and topics as mixtures of words. The document demonstrates LDA and k-means topic modeling in R and Spark. It concludes that LDA provides mixtures of topics while k-means provides distinct topics, and unsupervised LDA may need domain experts to improve topic representation.
May 2024 - Top10 Cited Articles in Natural Language Computing (kevig)
Natural Language Processing is a programmed approach to analyzing text that is based on both a set of theories and a set of technologies. This forum aims to bring together researchers who have designed and built software that can analyze, understand, and generate the languages humans naturally use to address computers.
Cross-domain Document Retrieval: Matching between Conversational and Formal W... (Jinho Choi)
This paper challenges a cross-genre document retrieval task, where the queries are in formal writing and the target documents are in conversational writing. In this task, a query is a sentence extracted from either a summary or a plot of an episode in a TV show, and the target document consists of transcripts from the corresponding episode. To establish a strong baseline, we employ the current state-of-the-art search engine to perform document retrieval on the dataset collected for this work. We then introduce a structure reranking approach to improve the initial ranking by utilizing syntactic and semantic structures generated by NLP tools. Our evaluation shows an improvement of more than 4% when structure reranking is applied, which is very promising.
Pattern-based Acquisition of Scientific Entities from Scholarly Article Title... (Jennifer D'Souza)
We describe a rule-based approach for the automatic acquisition of salient scientific entities from Computational Linguistics (CL) scholarly article titles. Two observations motivated the approach: (i) noting salient aspects of an article’s contribution in its title; and (ii) pattern regularities capturing the salient terms that could be expressed in a set of rules. Only those lexico-syntactic patterns were selected that were easily recognizable, occurred frequently, and positionally indicated a scientific entity type. The rules were developed on a collection of 50,237 CL titles covering all articles in the ACL Anthology. In total, 19,799 research problems, 18,111 solutions, 20,033 resources, 1,059 languages, 6,878 tools, and 21,687 methods were extracted at an average precision of 75%.
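Rule-based title mining of this kind can be illustrated with a couple of regular expressions. The patterns and entity labels below are hypothetical simplifications in the spirit of the approach, not the paper's actual hand-crafted rules:

```python
import re

# Hypothetical lexico-syntactic patterns loosely in the spirit of the paper
# (the real rules were hand-crafted over 50,237 ACL Anthology titles).
patterns = [
    # "X for Y" → X is a solution, Y is the research problem.
    (re.compile(r"^(.+?)\s+for\s+(.+)$", re.I), ("solution", "research_problem")),
    # "X using Y" → X is the research problem, Y is the method.
    (re.compile(r"^(.+?)\s+using\s+(.+)$", re.I), ("research_problem", "method")),
]

def extract(title):
    # Return (entity_type, span) pairs from the first matching pattern.
    for regex, labels in patterns:
        m = regex.match(title)
        if m:
            return [(lab, m.group(i + 1).strip()) for i, lab in enumerate(labels)]
    return []

print(extract("Neural Models for Coreference Resolution"))
print(extract("Relation Extraction using Distant Supervision"))
```

Because each pattern positionally indicates an entity type, precision depends on how unambiguous the trigger words ("for", "using") are in real titles, which is why the authors selected only frequent, easily recognizable patterns.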
Neural Models for Information Retrieval (Bhaskar Mitra)
In the last few years, neural representation learning approaches have achieved very good performance on many natural language processing (NLP) tasks, such as language modelling and machine translation. This suggests that neural models will also yield significant performance improvements on information retrieval (IR) tasks, such as relevance ranking, addressing the query-document vocabulary mismatch problem by using semantic rather than lexical matching. IR tasks, however, are fundamentally different from NLP tasks leading to new challenges and opportunities for existing neural representation learning approaches for text.
We begin this talk with a discussion on text embedding spaces for modelling different types of relationships between items which makes them suitable for different IR tasks. Next, we present how topic-specific representations can be more effective than learning global embeddings. Finally, we conclude with an emphasis on dealing with rare terms and concepts for IR, and how embedding based approaches can be augmented with neural models for lexical matching for better retrieval performance. While our discussions are grounded in IR tasks, the findings and the insights covered during this talk should be generally applicable to other NLP and machine learning tasks.
MULTI-LEARNING SPECIAL SESSION / EDUCON 2018 / EMADRID TEAM (eMadrid network)
1) The document proposes an approach to assist course creators in generating or restructuring courses by exploiting text mining techniques, semantic information from DBpedia, and linking educational resources.
2) The approach was implemented as a prototype that retrieves online courses, identifies key elements from text, formulates queries to other courses, and returns related courses to help creators generate mashups.
3) Preliminary tests on 265 computer science courses showed promising results, though future work is needed to improve similarity measures and generate concept maps between related courses.
Charting the Digital Library Evaluation Domain with a Semantically Enhanced M... (Giannis Tsakonas)
This document proposes a methodology for discovering patterns in scientific literature using a case study of digital library evaluation. It involves:
1. Classifying documents to identify relevant papers using naive Bayes classification.
2. Semantically annotating papers with concepts from a Digital Library Evaluation Ontology using the GoNTogle annotation tool. Over 2,600 annotations were generated.
3. Clustering the annotated papers into coherent groups using k-means clustering.
4. Interpreting the clusters with the assistance of the ontology to discover patterns and trends in the literature. Benchmarking tests were performed to evaluate effectiveness of the methodology.
EXPERT OPINION AND COHERENCE BASED TOPIC MODELING (ijnlc)
In this paper, we propose a novel algorithm that rearranges the topic assignment results obtained from topic modeling algorithms, including NMF and LDA. The effectiveness of the algorithm is measured by how closely the results conform to expert opinion, represented by a data structure we define, called TDAG, which captures the probability that a pair of highly correlated words appears together. To ensure that the internal structure does not change too much under the rearrangement, coherence, a well-known metric for measuring the effectiveness of topic modeling, is used to control the balance of the internal structure. We develop two ways to systematically obtain the expert opinion from data, depending on whether the data has relevant expert writing or not. The final algorithm, which takes into account both coherence and expert opinion, is presented. Finally, we compare the amount of adjustment needed for each topic modeling method, NMF and LDA.
This document provides an overview of probabilistic topic models. It discusses latent Dirichlet allocation (LDA), a commonly used topic model, and how it represents documents as mixtures over latent topics and words as generated by those topics. Parameter selection and inference algorithms for LDA are also summarized. Evaluation methods for topic models like held-out likelihood, topic coherence, and topic intrusion are outlined to assess how well models fit data and how interpretable topics are for humans.
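Topic coherence, one of the evaluation methods mentioned above, can be computed directly from document co-occurrence counts. The sketch below implements a simplified UMass-style coherence over a toy reference corpus; real evaluations use a large external corpus and typically a library implementation:

```python
import math
from itertools import combinations
from collections import Counter

# Toy reference corpus (each document reduced to its set of words).
corpus = [
    {"ball", "goal", "team"},
    {"ball", "team", "win"},
    {"code", "bug", "python"},
    {"code", "python", "test"},
]

# Document frequency of single words and of unordered word pairs.
doc_freq = Counter(w for doc in corpus for w in doc)
co_freq = Counter(frozenset(p) for doc in corpus for p in combinations(sorted(doc), 2))

def umass_coherence(topic_words):
    """Simplified UMass coherence: sum of log((D(wi,wj)+1)/D(wj)) over pairs."""
    score = 0.0
    for wi, wj in combinations(topic_words, 2):
        score += math.log((co_freq[frozenset((wi, wj))] + 1) / doc_freq[wj])
    return score

good_topic = ["ball", "team", "goal"]     # words that co-occur in the corpus
mixed_topic = ["ball", "python", "goal"]  # words drawn from unrelated themes
print(umass_coherence(good_topic) > umass_coherence(mixed_topic))  # → True
```

A topic whose top words frequently co-occur in documents scores higher, which is why coherence correlates better with human topic-intrusion judgments than held-out likelihood does.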
Similar to Topic modeling of marketing scientific papers: An experimental survey (20)
Two faces of the same coin: Exploring the multilateral perspective of informa... (ICDEcCnferenece)
Adriana AnaMaria Davidescu, Professor, PhD, Department of Statistics and Econometrics. Two faces of the same coin: Exploring the multilateral perspective of informality in relation to Sustainable Development Goals. Fostering formal work with digital tools. (ICDEc 2022)
The document summarizes an upcoming special issue of the Journal of Telecommunications and the Digital Economy on the topic of digital technologies and innovation. It provides details on the topics covered in the special issue such as banking/finance, business data, and social media. It also outlines the submission process and acceptance rates. Additionally, it discusses future special issues that will focus on areas like AI technologies for smart cities and women's participation in the digital economy.
Possibilities and limitations of the Croatian police in communication via soc... (ICDEcCnferenece)
Ivana Radic, Robert Idlbek and Irena Cajner Mraović. Possibilities and limitations of the Croatian police in communication via social networks. (ICDEc 2022)
Changes in Global Virtual Team Conflict Over Time: The Role of Openness to Li... (ICDEcCnferenece)
Longzhu Dong, Robert Stephens and Ana Maria Soares. Changes in Global Virtual Team Conflict Over Time: The Role of Openness to Linguistic Diversity. (ICDEc 2022)
Cause-related marketing: towards an exploration of the factors favoring the p... (ICDEcCnferenece)
The document presents research on factors favoring purchase intention of Tunisian consumers towards cause-related marketing campaigns. Through qualitative interviews, the study found that congruence between the cause and brand/consumer had the strongest impact on purchase intention. Additionally, consumer identification with the cause and attribution of altruistic motivations to the company positively influenced purchase intention. The research contributes to understanding consumer behavior towards cause marketing and provides recommendations for companies on aligning causes with brands and consumers. Limitations included a small sample size and non-representative group, calling for future quantitative research.
Relationship between culture and user's behavior in the context of informatio... (ICDEcCnferenece)
Olfa Ismail. Relationship between culture and user's behavior in the context of information security systems: A qualitative study in SMEs. (ICDEc 2022)
A Maturity Model for Open Educational Resources in Higher Education Instituti... (ICDEcCnferenece)
Carla Reinken, Nicole Draxler-Weber and Uwe Hoppe. A Maturity Model for Open Educational Resources in Higher Education Institutions - Development and Evaluation. (ICDEc 2022)
AI-based Business Models in Healthcare: An Empirical Study of Clinical Decisi... (ICDEcCnferenece)
Marija Radic, Claudia Vienken, Laurin Nikschat, Thore Dietrich, Holger König, Lorenz Laderick and Dubravko Radic. AI-based Business Models in Healthcare: An Empirical Study of Clinical Decision Support Systems. (ICDEc 2022)
Towards a better digital transformation: learning from the experience of a di... (ICDEcCnferenece)
Houda Mahboub and Hicham Sadok. Towards a better digital transformation: learning from the experience of a digital transformation project. (ICDEc 2022)
Consumer Satisfaction using fitness technology innovation (ICDEcCnferenece)
This document summarizes research into the determinants of customer satisfaction with fitness technology innovations. It reviews theories of diffusion of innovation, planned behavior, and technology acceptance. The research methodology related the independent variables (service quality, device friendliness, helpfulness, and quickness) to the dependent variable, customer satisfaction. The results found that all hypotheses about positive relationships between the independent and dependent variables were accepted. The practical implication is that focusing on customer satisfaction through service quality, device usability, and the other determinants can improve research and development, product and service quality, and customer satisfaction with fitness technologies.
Closing session: awards for best papers and reviewers (ICDEcCnferenece)
The document summarizes the 6th International Conference on Digital Economy (ICDEC) 2021, including statistics about papers submitted and accepted. It provides details on the best papers selected, which covered topics on business innovation and emerging technologies. The top three papers and their authors are described. Statistics on paper reviews and reviewers from over 20 countries are provided. The document concludes by recognizing the best paper reviewers based on criteria like providing rich, positive comments and proposing new directions and references in their reviews. The two reviewers receiving awards are named and described.
Transition to Tertiary Education and eLearning in Lebanon against the backdro... (ICDEcCnferenece)
Jacqueline Saad Harfouche and Nizar Hariri. Transition to Tertiary Education and eLearning in Lebanon against the backdrop of economic collapse and Covid-19 pandemic. (ICDEc 2021)
Internet of Things healthcare system for reducing economic burden (ICDEcCnferenece)
The document outlines a proposed IoT monitoring healthcare system for patients with COPD. It discusses:
1) The problems with existing COPD treatment and monitoring, including a lack of comprehensive systems that can accurately assess risk and enable fast intervention.
2) The goals of extending lifetime, improving quality of life, and reducing economic burden for COPD patients.
3) The proposed architecture of the monitoring system, which would use an IoT approach with three dimensions - an ontological model, medical rules, and context awareness - to continuously monitor patients' medical, environmental, and behavioral contexts.
Ready to Unlock the Power of Blockchain! (Toptal Tech)
Imagine a world where data flows freely, yet remains secure. A world where trust is built into the fabric of every transaction. This is the promise of blockchain, a revolutionary technology poised to reshape our digital landscape.
Toptal Tech is at the forefront of this innovation, connecting you with the brightest minds in blockchain development. Together, we can unlock the potential of this transformative technology, building a future of transparency, security, and endless possibilities.
Gen Z and the marketplaces - let's translate their needs (Laura Szabó)
The product workshop focused on exploring the requirements of Generation Z in relation to marketplace dynamics. We delved into their specific needs, examined their shopping preferences, and analyzed their preferred methods for accessing information and making purchases within a marketplace. Through the study of real-life cases, we tried to gain valuable insights into enhancing the marketplace experience for Generation Z.
The workshop was held on the DMA Conference in Vienna June 2024.
Discover the benefits of outsourcing SEO to India (davidjhones387)
"Discover the benefits of outsourcing SEO to India! From cost-effective services and expert professionals to round-the-clock work advantages, learn how your business can achieve digital success with Indian SEO solutions.
HijackLoader Evolution: Interactive Process Hollowing (Donato Onofri)
CrowdStrike researchers have identified a HijackLoader (aka IDAT Loader) sample that employs sophisticated evasion techniques to enhance the complexity of the threat. HijackLoader, an increasingly popular tool among adversaries for deploying additional payloads and tooling, continues to evolve as its developers experiment and enhance its capabilities.
In their analysis of a recent HijackLoader sample, CrowdStrike researchers discovered new techniques designed to increase the defense evasion capabilities of the loader. The malware developer used a standard process hollowing technique coupled with an additional trigger that was activated by the parent process writing to a pipe. This new approach, called "Interactive Process Hollowing", has the potential to make defense evasion stealthier.
Topic modeling of marketing scientific papers: An experimental survey
1. Presentation of the article
Title: Topic modeling of marketing scientific papers: An experimental survey
Presented by: Malek Chebil
Authors: Malek Chebil, Rim Jallouli, Mohamed Anis Bach Tobji, Chiheb Eddine Ben Ncir
2020-2021
Video link: https://drive.google.com/file/d/1ppGoL0qirOlZ4-ecdNg3JG_v85-ZohQ-/view?usp=sharing
2. Introduction
Plan
Needs analysis
Design
Implementation
Conclusion and perspectives
Introduction
Natural Language Processing (NLP)
Topic modeling
Application of topic modeling techniques on
marketing scientific papers' corpus
Objective and subjective evaluation
Conclusion and perspectives
5.
Topic modeling: definition
• A type of statistical model for discovering the abstract "topics" that occur in a collection of documents (corpus).
• A topic is a cluster of words that frequently occur together in the corpus.
• Each document consists of a mixture of topics.
• Each topic consists of a collection of words.
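The last two bullets can be made concrete with a toy example (the topic-word and document-topic numbers below are illustrative, not from the paper): each document's word distribution is the mixture of its topics' word distributions.

```python
import numpy as np

# Hypothetical toy model: 2 topics over a 4-word vocabulary.
vocab = ["price", "brand", "model", "data"]

# Each topic is a probability distribution over words (rows sum to 1).
topic_word = np.array([
    [0.50, 0.40, 0.05, 0.05],   # topic 0: "marketing" words
    [0.05, 0.05, 0.40, 0.50],   # topic 1: "analytics" words
])

# Each document is a mixture of topics (rows sum to 1).
doc_topic = np.array([
    [0.9, 0.1],   # document 0 is mostly topic 0
    [0.2, 0.8],   # document 1 is mostly topic 1
])

# Word probability per document: P(w|d) = sum_k P(k|d) * P(w|k)
doc_word = doc_topic @ topic_word
print(doc_word.round(3))
```

Each row of `doc_word` is again a valid probability distribution over the vocabulary, which is exactly the "mixture of topics" view of a document.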
6.
Topic modeling: techniques

Latent Semantic Analysis (LSA)
• Algebraic method.
• Analyzes relationships between a set of documents and the terms contained within.
• Assumes that words that are similar in meaning will appear in similar pieces of text (distributional hypothesis).
• Uses singular value decomposition (SVD) to scan unstructured data and find hidden relationships between terms and concepts.

Latent Dirichlet Allocation (LDA)
• Generative probabilistic model.
• Improves on mixture models by capturing the exchangeability of both words and documents.
• Each document is a probability distribution over topics, and each topic is a probability distribution over words from the corpus.
• Uses a Dirichlet prior to model the variability among the topic proportions.

Correlated Topic Model (CTM)
• Generative probabilistic model.
• Like LDA, improves on mixture models by capturing the exchangeability of both words and documents; each document is a probability distribution over topics and each topic a probability distribution over words.
• Uses the logistic normal distribution to model the pairwise topic correlations.
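LSA's reliance on SVD can be sketched with a minimal numpy example (the term-document counts are made up): truncating the SVD to rank k gives the best rank-k approximation of the matrix, and the k retained latent dimensions play the role of topics.

```python
import numpy as np

# Hypothetical tiny term-document count matrix (4 terms x 3 documents).
X = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 3.0, 1.0],
    [0.0, 1.0, 2.0],
])

# LSA: truncated SVD keeping k = 2 latent "topics".
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_k = (U[:, :k] * s[:k]) @ Vt[:k, :]   # best rank-k approximation of X
print(X_k.round(2))
```

By the Eckart-Young theorem, the Frobenius error of `X_k` equals the first discarded singular value, which is why truncation degrades the reconstruction gracefully.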
7.
Application of topic modeling techniques on marketing scientific papers' corpus (1/4)
NLP process
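The NLP process is not detailed on the slide; a minimal preprocessing sketch of the usual first steps (lowercasing, tokenisation, stop-word and short-token removal) could look like the following. The stop-word list here is illustrative, not the one used in the paper.

```python
import re

# Illustrative stop-word list; a real pipeline would use a full list.
STOPWORDS = {"the", "of", "on", "a", "and", "in", "for", "to", "is"}

def preprocess(text):
    # Lowercase, keep alphabetic tokens, drop stop words and short tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]

abstract = "Topic modeling of marketing scientific papers: an experimental survey"
print(preprocess(abstract))
```

The resulting token lists are what a document-term matrix is built from before fitting LSA, LDA or CTM.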
8.
Application of topic modeling techniques on marketing scientific papers' corpus (2/4)
Topic model generated by LSA for k=6
9.
Application of topic modeling techniques on marketing scientific papers' corpus (3/4)
Topic model generated by LDA for k=6
10.
Application of topic modeling techniques on marketing scientific papers' corpus (4/4)
Topic model generated by CTM for k=6
11.
Objective evaluation (1/4)
Probabilistic coherence
• Measures how well the topics are extracted.
• Scores a topic by measuring the degree of coherence between its words.
• A higher probabilistic coherence score indicates a better model.
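One common formulation of probabilistic coherence averages P(w_j | w_i) − P(w_j) over pairs of a topic's top words: it rewards pairs that co-occur more often than the lower-ranked word occurs on its own. A minimal sketch on an illustrative binary document-term matrix (not the paper's data):

```python
import numpy as np
from itertools import combinations

def probabilistic_coherence(dtm, top_words):
    """Average of P(w_j | w_i) - P(w_j) over pairs of top words.

    dtm: binary document-term matrix (documents x vocabulary).
    top_words: column indices of the topic's top words, best first.
    """
    scores = []
    for i, j in combinations(top_words, 2):
        p_j = dtm[:, j].mean()                       # P(w_j)
        co_occurrences = (dtm[:, i] * dtm[:, j]).sum()
        p_j_given_i = co_occurrences / max(dtm[:, i].sum(), 1)  # P(w_j | w_i)
        scores.append(p_j_given_i - p_j)
    return float(np.mean(scores))

# Illustrative 4-document, 3-word presence matrix.
dtm = np.array([[1, 1, 0],
                [1, 1, 1],
                [0, 0, 1],
                [1, 0, 0]])
score = probabilistic_coherence(dtm, [0, 1])
print(score)
```

A score near zero means the top words co-occur no more than chance; positive scores indicate a coherent topic.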
12.
Objective evaluation (2/4)
R-squared
• Known as the coefficient of determination (or the coefficient of multiple determination in multiple regression).
• Evaluates how well the model fits the data.
• Interpretable as the proportion of variability in the data explained by the model.
• A higher R-squared indicates a better fit; a value of 1 means the model fits the data perfectly.
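The definition above reduces to one line of arithmetic: R² = 1 − (residual sum of squares / total sum of squares). A minimal sketch with made-up observations and predictions:

```python
import numpy as np

def r_squared(y_true, y_pred):
    # Residual sum of squares: unexplained variability.
    ss_res = np.sum((y_true - y_pred) ** 2)
    # Total sum of squares: variability around the mean.
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
r2 = r_squared(y_true, y_pred)
print(r2)
```

Here the model explains 98% of the variability in the toy data; R² = 1 only when every prediction matches exactly.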
13.
Objective evaluation (3/4)
Perplexity
• Measures how well a probability distribution or probability model predicts a set of data.
• Applies only to probability models (LDA, CTM), not to algebraic models such as LSA.
• Lower perplexity suggests a better model.
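Perplexity is the exponential of the negative average log-likelihood the model assigns to held-out words; intuitively, it is the size of the uniform distribution that would be equally "confused". A minimal sketch:

```python
import numpy as np

def perplexity(word_probs):
    """Perplexity from the model's probability of each held-out word."""
    return float(np.exp(-np.mean(np.log(word_probs))))

# A model assigning uniform probability 1/4 to every held-out word
# has perplexity 4: it is as uncertain as choosing among 4 options.
pp = perplexity([0.25, 0.25, 0.25, 0.25])
print(pp)
```

A topic model that concentrates probability on the words that actually occur drives the held-out probabilities up and the perplexity down.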
14.
Objective evaluation (4/4)
Arun2010, CaoJuan2009 and Griffiths2004
• Arun2010: computed from the symmetric KL-divergence between two matrices (Topic-Word and Document-Topic). The lower the value, the better.
• CaoJuan2009: calculates the cosine distance between topics. The minimum value indicates that the corresponding K is the optimal number of topics.
• Griffiths2004: computed from an estimated multinomial distribution of the K topics over the words of the corpus. The maximum value indicates that the corresponding K is the optimal number of topics.
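The idea behind CaoJuan2009 can be sketched by measuring how similar topics are to each other: the average pairwise cosine similarity between topic-word distributions, where lower similarity (greater cosine distance) means better-separated topics. The two toy topics below are illustrative.

```python
import numpy as np
from itertools import combinations

def avg_topic_cosine(topic_word):
    """Average cosine similarity between all pairs of topic-word rows."""
    sims = []
    for i, j in combinations(range(topic_word.shape[0]), 2):
        a, b = topic_word[i], topic_word[j]
        sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return float(np.mean(sims))

# Two illustrative topics over a 3-word vocabulary.
topics = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.2, 0.7]])
sim = avg_topic_cosine(topics)
print(sim)
```

Sweeping K and picking the value that minimizes this average similarity selects the K with the most distinct topics, matching the "minimum value is optimal" rule on the slide.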
18.
Subjective evaluation (4/4)
Summary comparative table of topic modeling technique rankings in the context of a scientific papers' corpus
19.
Conclusion and perspectives
• A comparative study of LSA, LDA and CTM on a corpus of marketing scientific papers.
• Objective evaluation using different metrics.
• Subjective evaluation by a marketing expert.
• The LDA and CTM models perform better than LSA.
Perspectives:
• Using a larger corpus of scientific papers from other fields or contexts.
• Applying topic modeling techniques to the full text of the corpus.
• Comparing other topic modeling techniques.
• Applying cognitive analytics to certain tasks, such as improving topic labels, to minimize the cost and effort required of the experts.