This document discusses basic research on text mining conducted at UCSD. It introduces text mining and poses three central questions: how to represent documents, how to model closely related documents, and how to model distantly related documents. It then describes the Dirichlet compound multinomial distribution model (DCM), which allows modeling of bursty words and capturing multiple topics within documents. The document proposes extending latent Dirichlet allocation (LDA) with DCM to create the DCMLDA model, and describes the training and inference procedures for this model.
1. Basic Research on Text Mining at UCSD
Charles Elkan
University of California, San Diego
June 5, 2009
2. Text mining
What is text mining?
Working answer: Learning to classify documents,
and learning to organize documents.
Three central questions:
(1) how to represent a document?
(2) how to model a set of closely related documents?
(3) how to model a set of distantly related documents?
3. Why "basic" research?
Mindsets:
Probability versus linear algebra
Linguistics versus databases
Single topic per document versus multiple.
Which issues are important? From most to least interesting :-)
Sequencing of words
Burstiness of words
Titles versus bodies
Repeated documents
Included text
Feature selection
Applications:
Recognizing helpful reviews on Amazon
Finding related topics across books published decades apart.
4. With thanks to ...
Gabe Doyle, David Kauchak, Rasmus Madsen.
Amarnath Gupta, Chaitan Baru.
5. Three central questions:
(1) how to represent a document?
(2) how to model a set of closely related documents?
(3) how to model a set of distantly related documents?
Answers:
(1) "bag of words"
(2) Dirichlet compound multinomial (DCM) distribution
(3) DCM-based topic model (DCMLDA)
6. The "bag of words" representation
Let V be a fixed vocabulary. The vocabulary size is m = |V|.
Each document is represented as a vector x of length m,
where x_j is the number of appearances of word j in the document.
The length of the document is n = Σ_j x_j.
For typical documents, n ≪ m and x_j = 0 for most words j.
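As a concrete illustration, here is a minimal Python sketch of the bag-of-words
encoding; the four-word vocabulary and the sample sentence are invented for the example:

from collections import Counter

def bag_of_words(tokens, vocab):
    # Count vector x over a fixed vocabulary: x[j] is the count of vocab word j.
    counts = Counter(tokens)              # out-of-vocabulary tokens are simply ignored
    return [counts[w] for w in vocab]

vocab = ["toyota", "inaba", "u.s.", "sales"]   # toy vocabulary, illustration only
x = bag_of_words("toyota sales fell but toyota and inaba rose".split(), vocab)
print(x)   # [2, 1, 0, 1]; n = sum(x) = 4, and most entries are 0 for a real vocabulary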
7. The multinomial distribution
The probability of document x according to model θ is

  p(x | θ) = n! / (∏_j x_j!) · ∏_j θ_j^{x_j}.

Each appearance of the same word j always has the same
probability θ_j.
Computing the probability of a document needs O(n) time,
not O(m) time.
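The O(n) claim holds because only nonzero counts contribute; a sketch in Python,
using scipy.special.gammaln for the log-factorials (log n! = log Γ(n + 1)):

import numpy as np
from scipy.special import gammaln

def multinomial_logprob(x, theta):
    # log p(x | theta) = log n! - sum_j log x_j! + sum_j x_j log theta_j,
    # computed over the nonzero counts only, hence O(n) rather than O(m) work.
    x, theta = np.asarray(x, dtype=float), np.asarray(theta, dtype=float)
    nz = x > 0
    return (gammaln(x.sum() + 1) - gammaln(x[nz] + 1).sum()
            + (x[nz] * np.log(theta[nz])).sum())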
8. The phenomenon of burstiness
In reality, additional appearances of the same word are less
surprising, i.e. they have higher probability. Example:
Toyota Motor Corp. is expected to announce a major
overhaul. Yoshi Inaba, a former senior Toyota executive, was
formally asked by Toyota this week to oversee the U.S.
business. Mr. Inaba is currently head of an international
airport close to Toyota's headquarters in Japan.
Toyota's U.S. operations now are suffering from plunging
sales. Mr. Inaba was credited with laying the groundwork for
Toyota's fast growth in the U.S. before he left the company.
Recently, Toyota has had to idle U.S. assembly lines and
offer a limited number of voluntary buyouts. Toyota now
employs 36,000 in the U.S.
9. Empirical evidence of burstiness
How to interpret the figure: the chance that a given rare word occurs
10 times in a document is 10^{-6}; the chance that it occurs 20 times
is 10^{-6.5}.
10. Moreover ...
A multinomial is appropriate only for modeling common words,
which are not informative about the topic of a document.
Burstiness and significance are correlated: more informative words
are also more bursty.
11. A trained DCM model gives correct probabilities for all counts of all
types of words.
12. The Polya urn
So what is the Dirichlet compound multinomial (DCM)?
Consider a bucket with balls of m = |V| different colors.
After a ball is selected randomly, it is replaced and one more ball of
the same color is added.
Each time a ball is drawn, the chance of drawing the same color
again is increased.
The initial number of balls with color j is α_j.
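A minimal simulation of this urn scheme in Python; the function name and the
choice of pseudo-random generator are ours, not the paper's:

import numpy as np

def polya_urn_document(alpha, n, rng=None):
    # Draw a document of n words from a Polya urn that starts with alpha[j]
    # balls of color (word) j; each drawn ball is replaced along with one more.
    rng = np.random.default_rng(0) if rng is None else rng
    counts = np.asarray(alpha, dtype=float).copy()
    doc = np.zeros(len(counts), dtype=int)
    for _ in range(n):
        j = rng.choice(len(counts), p=counts / counts.sum())
        doc[j] += 1
        counts[j] += 1.0      # reinforcement: repeats of word j become more likely
    return doc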
13. The bag-of-bag-of-words process
Let ϕ be the parameter vector of a multinomial,
i.e. a fixed probability for each word.
Let Dir(β) be a Dirichlet distribution over ϕ.
To generate a document:
(1) draw document-specific probabilities ϕ ∼ Dir(β)
(2) draw n words w ∼ Mult(ϕ).
Each document consists of words drawn from a multinomial that is
fixed for that document, but different for other documents.
Remarkably, the Polya urn and the bag-of-bag-of-words process
yield the same probability distribution over documents.
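The same distribution over count vectors can be sampled directly from the
two-stage process; a sketch:

import numpy as np

def dcm_document(beta, n, rng=None):
    # phi ~ Dirichlet(beta), then n words ~ Multinomial(phi). Marginally over phi,
    # this matches the Polya urn above with initial ball counts beta.
    rng = np.random.default_rng(0) if rng is None else rng
    phi = rng.dirichlet(beta)          # document-specific word probabilities
    return rng.multinomial(n, phi)     # bag-of-words counts for one document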
14. How is burstiness captured?
A multinomial parameter vector ϕ has length |V| and is
constrained: Σ_j ϕ_j = 1.
A DCM parameter vector β has the same length |V| but is
unconstrained.
The one extra degree of freedom allows the DCM to discount
multiple observations of the same word, in an adjustable way.
The smaller the sum s = Σ_j β_j, the more bursty the words are.
15. Moving forward ...
Three central questions:
(1) how to represent documents?
(2) how to model closely related documents?
(3) how to model distantly related documents?
A DCM is a good model of documents that all share a single theme.
β represents the central theme; for each document ϕ represents its
variation on this theme.
By combining DCMs with latent Dirichlet allocation (LDA),
we answer (3).
16. Digression 1: Mixture of DCMs
Because of the 1:1 mapping between multinomials and documents,
in a DCM model each document comes entirely from one subtopic.
We want to allow multiple topics, multiple documents from the
same topic, and multiple topics within one document.
In 2006 we extended the DCM model to a mixture of DCM
distributions. This allows multiple topics, but not multiple topics
within one document.
17. Digression 2: A non-text application
Goal: Find companies whose stock prices tend to move together.
Example: { IBM+, MSFT+, AAPL- } means IBM and Microsoft
often rise, and Apple falls, on the same days.
Let each day be a document containing words like IBM+.
Each word is a stock symbol and a direction (+ or -). Each day has
one copy of the word for each 1% change in the stock price.
Let a co-moving group of stocks be a topic. Each day is a
combination of multiple topics.
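A sketch of this encoding; the symbols and percentage changes below are made up
for illustration:

def day_to_words(pct_changes):
    # One copy of 'SYM+' or 'SYM-' per whole percentage point of change.
    words = []
    for sym, change in pct_changes.items():
        token = sym + ("+" if change > 0 else "-")
        words.extend([token] * int(abs(change)))
    return words

print(day_to_words({"IBM": 3.2, "MSFT": 1.5, "AAPL": -2.7}))
# ['IBM+', 'IBM+', 'IBM+', 'MSFT+', 'AAPL-', 'AAPL-']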
18. Examples of discovered topics

  Computer Related                  Real Estate
  Stock  Company                    Stock  Company
  NVDA   Nvidia                     SPG    Simon Properties
  SNDK   SanDisk                    AIV    Apt. Investment
  BRCM   Broadcom                   KIM    Kimco Realty
  JBL    Jabil Circuit              AVB    AvalonBay
  KLAC   KLA-Tencor                 DDR    Developers
  NSM    Nat'l Semicond.            EQR    Equity Residential

The dataset contains 501 days of transactions between January
2007 and September 2008.
19. DCMLDA advantages
Unlike a mixture model, a topic model allows many topics to occur
in each document.
DCMLDA allows the same topic to occur with different words in
different documents.
Consider a sports topic. Suppose rugby and hockey are
equally common. But within each document, seeing rugby makes
seeing rugby again more likely than seeing hockey.
A standard topic model cannot represent this burstiness, unless the
words rugby and hockey are spread across two topics.
20. Hypothesis
"A DCMLDA model with a few topics can fit a corpus as well as an
LDA model with many topics."
Motivation: A single DCMLDA topic can explain related aspects of
documents more effectively than a single LDA topic.
The hypothesis is confirmed by the experimental results below.
21. Latent Dirichlet Allocation (LDA)
LDA is a generative model:
For each of K topics, draw a multinomial to describe it.
For each of D documents:
(1) Determine the probability of each of K topics in this document.
(2) For each of N words:
first draw a topic, then draw a word based on that topic.
[Plate diagram: α → θ → z → w ← ϕ ← β, with plates over the N words
in each document, the K topics, and the D documents.]
22. Graphical model
[Same plate diagram: α → θ → z → w ← ϕ ← β; plates N, K, D.]
The only fixed parameters of the model are α and β.
ϕ ∼ Dirichlet(β)
θ ∼ Dirichlet(α)
z ∼ Multinomial(θ)
w ∼ Multinomial(ϕ)
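These conditional distributions translate directly into a generative sampler;
a minimal sketch (the array shapes and parameter values are ours):

import numpy as np

def lda_generate(alpha, beta, doc_lengths, rng=None):
    # alpha has length K, beta has length V; both parameterize Dirichlets.
    rng = np.random.default_rng(0) if rng is None else rng
    K, V = len(alpha), len(beta)
    phi = np.array([rng.dirichlet(beta) for _ in range(K)])  # phi_k ~ Dirichlet(beta)
    docs = []
    for n in doc_lengths:
        theta = rng.dirichlet(alpha)                         # theta_d ~ Dirichlet(alpha)
        z = rng.choice(K, size=n, p=theta)                   # z ~ Multinomial(theta)
        docs.append([rng.choice(V, p=phi[k]) for k in z])    # w ~ Multinomial(phi_z)
    return docs

docs = lda_generate(np.full(3, 0.7), np.full(20, 0.01), doc_lengths=[10, 15])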
23. Using LDA for text mining
Training finds maximum-likelihood values for ϕ for each topic,
and for θ for each document.
For each topic, ϕ is a vector of word probabilities indicating the
content of that topic.
The distribution θ of each document is a reduced-dimensionality
representation. It is useful for:
learning to classify documents
measuring similarity between documents (see the sketch below)
more?
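For instance, one simple similarity measure on the θ vectors (cosine similarity
is our choice here; the slide leaves the measure open):

import numpy as np

def topic_similarity(theta_a, theta_b):
    # Cosine similarity between two documents' topic distributions.
    a, b = np.asarray(theta_a), np.asarray(theta_b)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))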
24. Extending LDA to DCMLDA
Goal: Allow multiple topics in a single document, while making
subtopics be document-specific.
In DCMLDA, for each topic k and each document d, a fresh
multinomial word distribution ϕ_kd is drawn.
For each topic k, these multinomials are drawn from the same
Dirichlet β_k, so all versions of the same topic are linked.
Per-document instances of each topic allow for burstiness.
26. DCMLDA generative process
for document d ∈ {1, . . . , D} do
  draw topic distribution θ_d ∼ Dir(α)
  for topic k ∈ {1, . . . , K} do
    draw topic-word distribution ϕ_kd ∼ Dir(β_k)
  end for
  for word n ∈ {1, . . . , N_d} do
    draw topic z_{d,n} ∼ θ_d
    draw word w_{d,n} ∼ ϕ_{z_{d,n}, d}
  end for
end for
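The pseudocode above maps line by line onto a Python sketch (the shapes are
our assumptions: α of length K, β a K × V array):

import numpy as np

def dcmlda_generate(alpha, beta, doc_lengths, rng=None):
    # Each document draws its own fresh copy phi_kd of every topic's word distribution.
    rng = np.random.default_rng(0) if rng is None else rng
    K, V = beta.shape
    docs = []
    for n in doc_lengths:
        theta = rng.dirichlet(alpha)                                   # theta_d ~ Dir(alpha)
        phi_d = np.array([rng.dirichlet(beta[k]) for k in range(K)])   # phi_kd ~ Dir(beta_k)
        z = rng.choice(K, size=n, p=theta)                             # z_{d,n} ~ theta_d
        docs.append([rng.choice(V, p=phi_d[k]) for k in z])            # w_{d,n} ~ phi_{z,d}
    return docs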
27. Meaning of α and β
When applying LDA, it is not necessary to learn α and β.
Steyvers and Griffiths recommend fixed uniform values
α = 50/K and β = .01, where K is the number of topics.
However, the information in the LDA ϕ values is in the DCMLDA β
values.
Without an effective method to learn the hyperparameters, the
DCMLDA model is not useful.
28. Training
Given a training set of documents, alternate:
(a) optimize parameters ϕ, θ, and z given hyperparameters,
(b) optimize hyperparameters α, β given document parameters.
For fixed α and β, do collapsed Gibbs sampling to find the
distribution of z.
Given a z sample, find α and β by Monte Carlo
expectation-maximization.
When desired, compute ϕ and θ from samples of z .
29. Gibbs sampling
Gibbs sampling for DCMLDA is similar to the method for LDA.
Start by factoring the complete likelihood of the model:

  p(w, z | α, β) = p(w | z, β) · p(z | α).

DCMLDA and LDA are identical over the α-to-z pathway, so p(z | α)
in DCMLDA is the same as for LDA:

  p(z | α) = ∏_d B(n_{··d} + α) / B(α).

B(·) is the (multivariate) Beta function, and n_{tkd} is how many times
word t has topic k in document d; a dot in place of an index denotes
summation over that index.
30. To get p(w | z, β), average over all possible ϕ distributions:

  p(w | z, β) = ∫ p(w | z, ϕ) p(ϕ | β) dϕ
              = ∫ p(ϕ | β) ∏_d ∏_{n=1}^{N_d} ϕ_{w_{d,n}, z_{d,n}, d} dϕ
              = ∫ p(ϕ | β) ∏_{d,k,t} (ϕ_{tkd})^{n_{tkd}} dϕ.

Expand p(ϕ | β) as a Dirichlet distribution:

  p(w | z, β) = ∏_{d,k} 1/B(β_{·k}) ∫ ∏_t (ϕ_{tkd})^{β_{tk} − 1 + n_{tkd}} dϕ
              = ∏_{d,k} B(n_{·kd} + β_{·k}) / B(β_{·k}).
31. Gibbs sampling cont.
Combining equations, the complete likelihood is

  p(w, z | α, β) = ∏_d B(n_{··d} + α)/B(α) · ∏_{d,k} B(n_{·kd} + β_{·k})/B(β_{·k}).
32. Finding optimal α and β
Optimal α and β values maximize p(w | α, β). Unfortunately, this
likelihood is intractable.
The complete likelihood p(w, z | α, β) is tractable. Based on it, we
use single-sample Monte Carlo EM.
Run Gibbs sampling for a burn-in period, with guesses for α and β.
Then draw a topic assignment z for each word of each document.
Use this vector in the M-step to estimate new values for α and β.
Run Gibbs sampling for more iterations, to let topic assignments
stabilize based on the new α and β values.
Then repeat.
33. Training algorithm
Start with initial α and β
repeat
Run Gibbs sampling to approximate steady state
Choose a topic assignment for each word
Choose α and β to maximize complete likelihood p(w, z | α, β)
until convergence of α and β
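In outline, the loop looks like the following sketch; gibbs_sweep and
maximize_hyperparameters are hypothetical callables standing in for the collapsed
Gibbs sampler and the M-step described on the next slides:

def train_dcmlda(docs, z, alpha, beta, gibbs_sweep, maximize_hyperparameters,
                 n_outer=20, burn_in=200):
    # Single-sample Monte Carlo EM: alternate a Gibbs E-step with an M-step
    # that re-estimates the hyperparameters from one drawn topic assignment z.
    for _ in range(n_outer):                  # until convergence of alpha and beta
        for _ in range(burn_in):              # let the sampler approach steady state
            z = gibbs_sweep(docs, z, alpha, beta)
        alpha, beta = maximize_hyperparameters(docs, z, alpha, beta)
    return alpha, beta, z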
34. α and β to maximize complete likelihood
The log complete likelihood is

  L(α, β; w, z) = Σ_{d,k} [log Γ(n_{·kd} + α_k) − log Γ(α_k)]
                + Σ_d [log Γ(Σ_k α_k) − log Γ(Σ_k (n_{·kd} + α_k))]
                + Σ_{d,k,t} [log Γ(n_{tkd} + β_{tk}) − log Γ(β_{tk})]
                + Σ_{d,k} [log Γ(Σ_t β_{tk}) − log Γ(Σ_t (n_{tkd} + β_{tk}))].

The first two lines depend only on α, and the second two on β.
Furthermore, β_{·k} can be independently maximized for each k.
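This expression is easy to evaluate with scipy.special.gammaln; a sketch, where
n_dk[d, k] stores n_{·kd} and n_dkt[d, k, t] stores n_{tkd} (the array names are ours):

import numpy as np
from scipy.special import gammaln

def complete_log_likelihood(n_dk, n_dkt, alpha, beta):
    # L(alpha, beta; w, z): alpha has length K, beta is K x V,
    # n_dk is D x K, n_dkt is D x K x V.
    term_a = (gammaln(n_dk + alpha) - gammaln(alpha)).sum()
    term_a += (gammaln(alpha.sum()) - gammaln(n_dk.sum(axis=1) + alpha.sum())).sum()
    term_b = (gammaln(n_dkt + beta) - gammaln(beta)).sum()
    term_b += (gammaln(beta.sum(axis=1))
               - gammaln(n_dkt.sum(axis=2) + beta.sum(axis=1))).sum()
    return term_a + term_b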
35. We get K + 1 equations to maximize:

  α = argmax_α Σ_{d,k} [log Γ(n_{·kd} + α_k) − log Γ(α_k)]
             + Σ_d [log Γ(Σ_k α_k) − log Γ(Σ_k (n_{·kd} + α_k))]

  β_{·k} = argmax_{β_{·k}} Σ_{d,t} [log Γ(n_{tkd} + β_{tk}) − log Γ(β_{tk})]
                         + Σ_d [log Γ(Σ_t β_{tk}) − log Γ(Σ_t (n_{tkd} + β_{tk}))]

Each equation defines a vector, either {α_k}_k or {β_{tk}}_t.
With a carefully coded Matlab implementation of L-BFGS, one
iteration of EM takes about 100 seconds on sample data.
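As one way to carry out the M-step for α, here is a sketch using scipy's L-BFGS-B
(the paper's implementation used L-BFGS in Matlab; this version relies on
finite-difference gradients for brevity):

import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def fit_alpha(n_dk, alpha0):
    # Maximize the alpha-dependent part of L, holding the counts n_dk fixed.
    def neg_objective(alpha):
        t1 = (gammaln(n_dk + alpha) - gammaln(alpha)).sum()
        t2 = (gammaln(alpha.sum()) - gammaln(n_dk.sum(axis=1) + alpha.sum())).sum()
        return -(t1 + t2)
    bounds = [(1e-6, None)] * len(alpha0)       # each alpha_k must stay positive
    return minimize(neg_objective, alpha0, method="L-BFGS-B", bounds=bounds).x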
36. Non-uniform α and β
Implementations of DCMLDA must allow the α vector and β array
to be non-uniform.
In DCMLDA, β carries the information that ϕ carries in LDA.
α could be uniform in DCMLDA, but learning non-uniform values
allows certain topics to have higher overall probability than others.
37. Experimental design
Goal: Check whether handling burstiness in the DCMLDA model yields
a better topic model than LDA.
Compare DCMLDA only with LDA, for two reasons:
(1) Comparable conceptual complexity.
(2) DCMLDA is not in competition with more complex topic
models, since those can be modified to include DCM topics.
38. Comparison method
Given a test set of documents not used for training, estimate the
likelihood p(w | α, β) for LDA and DCMLDA models.
For DCMLDA, p(w | α, β) uses the trained α and β.
For LDA, p(w | α, β) uses α = ᾱ and β = β̄, the scalar means of the
DCMLDA values.
Also compare to LDA with heuristic values β = .01 and α = 50/K,
where K is the number of topics.
39. Datasets
Compare LDA and DCMLDA as models for both text and non-text
datasets.
The text dataset is a collection of papers from the 2002 and 2003
NIPS conferences, with 520,955 words in 390 documents, and
|V| = 6871.
The SP500 dataset contains 501 days of stock transactions
between January 2007 and September 2008. |V| = 1000.
40. Digression: Computing likelihood
The incomplete likelihood p(w | α, β) is intractable for topic models.
The complete likelihood p(w, z | α, β) is tractable, so previous work
has averaged it over z, but this approach is unreliable.
Another possibility is to measure classification accuracy.
But our datasets do not have obvious classification schemes. Also,
topics may be more accurate than predefined classes.
41. Empirical likelihood
To calculate empirical likelihood (EL), first train each model.
Feed obtained parameter values α and β into a generative model.
Get a large set of pseudo documents. Use the pseudo documents to
train a tractable model: a mixture of multinomials.
Estimate the test set likelihood as its likelihood under the tractable
model.
42. Digression²: Stability of EL
Investigate stability by running EL multiple times for the same
DCMLDA model.
Train three independent 20-topic DCMLDA models on the SP500
dataset, and run EL five times for each model.
Mean absolute difference of EL values for the same model is 0.08%.
Mean absolute difference between EL values for separately trained
DCMLDA models is 0.11%.
Conclusion: Likelihood values are stable over DCMLDA models
with a constant number of topics.
43. Cross-validation
Perform five 5-fold cross-validation trials for each number of topics
and each dataset.
First train a DCMLDA model, then create two LDA models.
Fitted LDA uses the means of the DCMLDA hyperparameters.
Heuristic LDA uses fixed parameter values.
Results: For both datasets, DCMLDA is better than fitted LDA,
which is better than heuristic LDA.
44. [Figure: mean log-likelihood (y-axis, −6700 to −6200) on the SP500
dataset versus number of topics (x-axis, 0 to 250), with curves for
DCMLDA and LDA. Heuristic model likelihood is too low to show.
Max. standard error is 11.2.]
45. SP500 discussion
On the SP500 dataset, the best fit is DCMLDA with seven topics.
A DCMLDA model with few topics is comparable to an LDA model
with many topics.
Above seven topics, DCMLDA likelihood drops. Data sparsity may
prevent the estimation of β values that generalize well (overfitting).
LDA seems to underfit regardless of how many topics are used.
46. [Figure: mean log-likelihood (y-axis, −28000 to −10000) on the NIPS
dataset versus number of topics (x-axis, 0 to 100), with curves for
DCMLDA, trained LDA, and heuristic LDA. Max. standard error is 21.5.]
47. NIPS discussion
On the NIPS dataset, DCMLDA outperforms the LDA model at every
number of topics.
LDA with heuristic hyperparameter values almost equals the fitted
LDA model at 50 topics.
The fitted model is better when the number of topics is small.
48. Alternative heuristic values for hyperparameters
Learning α and β is beneficial, both for LDA and DCMLDA models.
Optimal values are significantly different from previously suggested
heuristic values.
Best α values around 0.7 seem independent of the number of
topics, unlike the suggested value 50/K .
49. Newer topic models
Variants include the Correlated Topic Model (CTM) and the
Pachinko Allocation Model (PAM). These outperform LDA on
many tasks.
However, DCMLDA competes only with LDA. The LDA core in
other models can be replaced by DCMLDA to improve their
performance.
DCMLDA and complex topic models are complementary.
51. Conclusion
The ability of the DCMLDA model to account for burstiness leads
to a significant improvement in likelihood over LDA.
The burstiness of words, and of some non-text data, is an
important phenomenon to capture in topic modeling.