Topic Cube: Topic Modeling for OLAP on Multidimensional Text ...
Topic Cube: Topic Modeling for OLAP on Multidimensional Text Databases

Duo Zhang, Chengxiang Zhai, Jiawei Han
dzhang22@cs.uiuc.edu, czhai@cs.uiuc.edu, hanj@cs.uiuc.edu
Department of Computer Science, University of Illinois at Urbana-Champaign

Abstract

As the amount of textual information grows explosively in various kinds of business systems, it becomes more and more desirable to analyze both structured data records and unstructured text data simultaneously. While online analytical processing (OLAP) techniques have been proven very useful for analyzing and mining structured data, they face challenges in handling text data. On the other hand, probabilistic topic models are among the most effective approaches to latent topic analysis and mining on text data. In this paper, we propose a new data model called topic cube to combine OLAP with probabilistic topic modeling and enable OLAP on the dimension of text data in a multidimensional text database. Topic cube extends the traditional data cube to cope with a topic hierarchy and store probabilistic content measures of text documents learned through a probabilistic topic model. To materialize topic cubes efficiently, we propose a heuristic method to speed up the iterative EM algorithm for estimating topic models by leveraging the models learned on component data cells to choose a good starting point for iteration. Experiment results show that this heuristic method is much faster than the baseline method of computing each topic cube from scratch. We also discuss potential uses of topic cube and show sample experimental results.

1 Introduction

Data warehouses are widely used in today's business market for organizing and analyzing large amounts of data. An important technology to exploit data warehouses is the Online Analytical Processing (OLAP) technology [4, 10, 16], which enables flexible interactive analysis of multidimensional data in different granularities. It has been widely applied to many different domains [15, 22, 31]. OLAP on data warehouses is mainly supported through data cubes [11, 12].

As unstructured text data grows quickly, it is more and more important to go beyond the traditional OLAP on structured data to also tap into the huge amounts of text data available to us for data analysis and knowledge discovery. These text data often exist either in the character fields of data records or in a separate place with links to the data records through joinable common attributes. Thus conceptually we have both structured data and unstructured text data in a database. For convenience, we will refer to such a database as a multidimensional text database, to distinguish it from both the traditional relational databases and the text databases which consist primarily of text documents.

As argued convincingly in [13], simultaneous analysis of both structured data and unstructured text data is needed in order to fully take advantage of all the knowledge in all the data, and will mutually enhance each other in terms of knowledge discovery, thus bringing more value to business. Unfortunately, traditional data cubes, though capable of dealing with structured data, would face challenges for analyzing unstructured text data.

As a specific example, consider ASRS [2], which is the world's largest repository of safety information provided by aviation's frontline personnel. The database has both structured data (e.g., time, airport, and light condition) and text data such as narratives about an anomalous event written by a pilot or flight attendant, as illustrated in Table 1. A text narrative usually contains hundreds of words.

Table 1: An example of text database in ASRS

  ACN     Time    Airport  ···  Light     Narrative
  101285  199901  MSP      ···  Daylight  Document 1
  101286  199901  CKB      ···  Night     Document 2
  101291  199902  LAX      ···  Dawn      Document 3

This repository contains valuable information about aviation safety and is a main resource for analyzing causes of recurring anomalous aviation events. Since causal factors do not explicitly exist in the structured data part of the repository, but are buried in many narrative text fields, it is crucial to support an analyst to mine such data flexibly with both OLAP and text content analysis in an integrative manner. Unfortunately, the current data cube and OLAP techniques can only provide limited support for such integrative analysis. In particular, although they can easily support drill-down and roll-up on structured attribute dimensions such as "time" and "location", they cannot support an analyst to drill-down and roll-up on the text dimension according to some meaningful topic hierarchy defined for the analysis task (i.e., anomalous event analysis), such as the one illustrated in Figure 1.

In the tree shown in Figure 1, the root represents the aggregation of all the topics (each representing an anomalous event). The second level contains some general anomaly types defined in [1], such as "Anomaly Altitude Deviation" and "Anomaly Maintenance Problem." A child topic node represents a specialized event type of the event type of its parent node. For example, "Undershoot" and "Overshoot" are two special anomaly events of "Anomaly Altitude Deviation."

Being able to drill-down and roll-up along such a hierarchy would be extremely useful for causal factor analysis of anomalous events. Unfortunately, with today's OLAP

1124 Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.
Figure 1: Hierarchical Topic Tree for Anomaly Event
  ALL
    Anomaly Altitude Deviation: Undershoot, ..., Overshoot
    Anomaly Maintenance Problem: Improper Documentation, ..., Improper Maintenance
    Anomaly Inflight Encounter: Birds, ..., Turbulence
    ...

techniques, the analyst cannot easily do this. Imagine that an analyst is interested in analyzing altitude deviation problems of flights in Los Angeles in 1999. With the traditional data cube, the analyst can only pose a query such as (Time="1999", Location="LA") and obtain a large set of text narratives, which would have to be further analyzed with separate text mining tools.

Even if the data warehouse can support keyword queries on the text field, it still would not help the analyst that much. Specifically, the analyst can now pose a more constrained query (Time="1999", Location="LA", Keyword="altitude deviation"), which would give the analyst a smaller set of more relevant text narratives (i.e., those matching the keywords "altitude deviation") to work with. However, the analyst would still need to use a separate text mining tool to further analyze the text information to understand causes of this anomaly. Moreover, exact keyword matching would also likely miss many relevant text narratives that are about deviation problems but do not contain or match the keywords "altitude" and "deviation" exactly; for example, a more specific word such as "overshooting" may have been used in a narrative about an altitude deviation problem.

A more powerful OLAP system should ideally integrate text mining more tightly with the traditional cube and OLAP, and allow the analyst to drill-down and roll-up on the text dimension along a specified topic hierarchy in exactly the same way as he/she could on the location dimension. For example, it would be very helpful if the analyst could use a similar query (Time="1999", Location="LA", Topic="altitude deviation") to obtain all the narratives relevant to this topic (including those that do not necessarily match exactly the words "altitude" and "deviation"), and then drill down into the lower-level categories such as "overshooting" and "undershooting" according to the hierarchy (and roll up again) to change the view of the content of the text narrative data. Note that although the query has a similar form to the keyword query mentioned above, its intended semantics is different: "altitude deviation" is a topic taken from the hierarchy specified by the analyst, which is meant to match all the narratives covering this topic, including those that may not match the keywords "altitude" and "deviation."

Furthermore, the analyst would also need to digest the content of all the narratives in the cube cell corresponding to each topic category and compare the content across different cells that correspond to interesting context variations. For example, at the level of "altitude deviation", it would be desirable to provide a summary of the content of all the narratives matching this topic, and when we drill-down to "overshooting", it would also be desirable to allow the analyst to easily obtain a summary of the narratives corresponding to the specific subcategory of "overshooting deviation." With such summaries, the analyst would be able to compare the content of narratives about the encountering problem across different locations in 1999. Such a summary can be regarded as a content measure of the text in a cell.

This example illustrates that in order to integrate text analysis seamlessly into OLAP, we need to support the following functions:

Topic dimension hierarchy: We need to map text documents semantically to an arbitrary topic hierarchy specified by an analyst so that the analyst can drill-down and roll-up on the text dimension (i.e., adding a text dimension to OLAP). Note that we do not always have training data (i.e., documents with known topic labels) for learning. Thus we must be able to perform this mapping without training data.

Text content measure: We need to summarize the content of text documents in a data cell (i.e., computing a content measure on text). Since different applications may prefer different forms of summaries, we need to be able to represent the content in a general way that can accommodate many different ways to further customize a summary according to the application needs.

Efficient materialization: We need to materialize cubes with text content measures efficiently.

Although there has been some previous work on text database analysis [9, 20] and integrating text analysis with OLAP [19, 27], to the best of our knowledge, no previous work has proposed a specific solution to extend OLAP to support all these functions mentioned above. The closest previous work is the IBM work [13], where the authors proposed some high-level ideas for leveraging existing text mining algorithms to integrate text mining with OLAP. However, in their work, the cubes are still traditional data cubes, thus the integration is loose and text mining is in nature external to OLAP. Moreover, the issue of materializing cubes to efficiently handle the text dimension has not been addressed. We will review the related work in more detail in Section 2.

In this paper, we propose a new cube data model called topic cube to support the two key components of OLAP on the text dimension (i.e., topic dimension hierarchy and text content measure) with a unified probabilistic framework. Our basic idea is to leverage probabilistic topic modeling [18, 8], which is a principled method for text mining, and combine it with OLAP. Indeed, PLSA and similar topic models have recently been very successfully applied to a large range of text mining problems including hierarchical topic modeling [17, 7], author-topic analysis [29], spatiotemporal text mining [24], sentiment analysis [23], and multi-stream bursty pattern finding [32]. They are among the most effective text mining techniques. We propose Topic Cube to combine them with OLAP to enable effective mining of both structured data
and unstructured text within a unified framework.

Specifically, we will extend the traditional cube to incorporate the probabilistic latent semantic analysis (PLSA) model [18] so that a data cube would carry parameters of a probabilistic model that can indicate the text content of the cell. Our basic assumption is that we can use a probability distribution over words to model a topic in text. For example, a distribution may assign high probabilities to words such as "encounter", "turbulence", "height", "air", and it would intuitively capture the theme "encountering turbulence" in the aviation report domain. We assume all the documents in a cell to be word samples drawn from a mixture of many such topic distributions, and can thus estimate these hidden topic models by fitting the mixture model to the documents. These topic distributions can thus serve as content measures of text documents. In order to respect the topic hierarchy specified by the analyst and enable drill-down and roll-up on the text dimension, we further structure such distribution-based content measures based on the topic hierarchy specified by an analyst, using the concept hierarchy to impose a prior (preference) on the word distributions characterizing each topic, so that each word distribution would correspond to precisely one topic in the hierarchy. This way, we will learn word distributions characterizing each topic in the hierarchy. Once we have a distributional representation of each topic, we can easily map any set of documents to the topic hierarchy.

Note that topic cube supports the first two functions in a quite general way. First, when mapping documents into a topic hierarchy, the model could work with just some keyword description of each topic but no training data. Our experiment results show that this is feasible. If we do have training data available, the model can also easily use it to enrich our prior; indeed, if we have much training data and impose an infinitely strong prior, we essentially perform supervised text categorization with a generative model. Second, a multinomial word distribution serves well as a content measure. Such a model (often called a unigram language model) has been shown to outperform the traditional vector space models in information retrieval [28, 34], and can be further exploited to generate more informative summaries if needed. For example, in [26], such a unigram language model has been successfully used to generate a sentence-based impact summary of scientific literature. In general, we may further use these word distributions to generate informative phrases [25] or comparative summaries for comparing content across different contexts [23]. Thus topic cube has potentially many interesting applications.

Computing and materializing such a topic cube in a brute-force way is time-consuming. So to better support the third function, we propose a heuristic algorithm to leverage estimated models for "component cells" to speed up the estimation of the model for a combined cell. Estimation of parameters is done with an iterative EM algorithm. Its efficiency highly depends on where to start in the parameter space. Our idea for speeding it up is as follows: We would start with the smallest cells to be materialized, and fit the PLSA to them first. We then work on larger cells by leveraging the estimated parameters for the small cells as a more efficient starting point. Experiment results show that the proposed strategy can indeed speed up the estimation algorithm significantly.

The main contributions of this paper are: (1) We introduce the new concept of topic cube to combine OLAP and probabilistic topic models to generalize OLAP techniques to handle the text dimension. (2) We propose a heuristic method that can improve the efficiency of topic cube materialization. (3) We present experiment results to show the accuracy and efficiency of topic cube construction.

2 Related Work

To the best of our knowledge, no previous work has unified topic modeling with OLAP. However, some previous studies have attempted to analyze text data in a relational database and support OLAP for text analysis. These studies can be grouped into three categories, depending on how they treat the text data.

Text as fact: In this kind of approach, the text data is regarded as a fact of the data records. When a user queries the database, the fact of the text data, e.g., term frequency, will be returned. BlogScope [5], which is an analysis and visualization tool for the blogosphere, belongs to this category. One feature of BlogScope is to depict the trend of a keyword. This is done by counting relevant blog posts in each time slot according to the input keyword and then drawing a curve of counts along the time axis. However, such an approach cannot support OLAP operations such as drill-down and roll-up on the text dimension, which the proposed topic cube would support.

Text as character field: A major representative work in this group is [33], where the text data is treated as a character field. Given a keyword query, the records which have the most relevant text in the field will be returned as results. For example, the following query (Location="Columbus", keyword="LCD") will fetch those records with location equal to "Columbus" and text field containing "LCD". This essentially extends the query capability of a relational database to support search over a text field. However, this approach cannot support OLAP on the text dimension either.

Text as categorical data: The two works most similar to ours are BIW [13] and Polyanalyst [3]. Both of them use classification methods to classify documents into categories and attach documents with class labels. Such category labels would allow a user to drill-down or roll-up along the category dimension, thus achieving OLAP on text. However, in [13], only some high-level function descriptions are given, with no algorithms given to efficiently support such OLAP operations on the text dimension. Moreover, in both works, the notion of cube is still the traditional data cube. Our topic cube differs from these two systems in that we integrate text mining (specifically topic modeling) more tightly with OLAP
by extending the traditional data cube to cover a topic dimension and support text content measures, which leads to a new cube model (i.e., topic cube). The topic model provides a principled way to map arbitrary text documents into topics specified in any topic hierarchy determined by an application without needing any training data. Previous work would mostly either need documents with known labels (for learning), which do not always exist, or cluster text documents in an unsupervised way, which does not necessarily produce meaningful clusters to an analyst, reducing the usefulness for performing OLAP on multidimensional text databases. We also propose heuristic methods for materializing the new topic cubes efficiently.

Topic models have been extensively studied in recent years [6, 7, 8, 17, 18, 23, 24, 25, 29, 32], all showing that they are very useful for analyzing latent topics in text and discovering topical patterns. However, all the work in this line only deals with pure text data. Our work can be regarded as a novel application of such models to support OLAP on multidimensional text databases.

3 Topic Cube as an Extension of Data Cube

The basic idea of topic cube is to extend the standard data cube by adding a topic hierarchy and probabilistic content measures of text so that we can perform OLAP on the text dimension in the same way as we perform OLAP on structured data. In order to understand this idea, it is necessary to understand some basic concepts about data cubes and OLAP. So before we introduce topic cube in detail, we give a brief introduction to these concepts.

3.1 Standard Data Cube and OLAP  A data cube is a multidimensional data model. It has three components as input: a base table, dimensional hierarchies, and measures. A base table is a relational table in a database. A dimensional hierarchy gives a tree structure of values in a column field of the base table so that we can define aggregation in a meaningful way. A measure is a fact of the data.

Roll-up and drill-down are two typical operations in OLAP. Roll-up would "climb up" a dimensional hierarchy to merge cells, while drill-down does the opposite and splits cells. Other OLAP operations include slice, dice, pivot, etc.

Two kinds of OLAP queries are supported in a data cube: point query and subcube query. A point query seeks a cell by specifying the values of some dimensions, while a subcube query would return a set of cells satisfying the query.

3.2 Overview of Topic Cube  A topic cube is constructed based on a multidimensional text database (MTD), which we define as a multidimensional database with text fields. An example of such a database is shown in Table 1. We may distinguish a text dimension (denoted by TD) from a standard (i.e., non-text) dimension (denoted by SD) in a multidimensional text database.

Another component used to construct a topic cube is a hierarchical topic tree. A hierarchical topic tree defines a set of hierarchically organized topics that users are mostly interested in, which are presumably also what we want to mine from the text. A sample hierarchical topic tree is shown in Fig. 1. In a hierarchical topic tree, each node represents a topic, and its child nodes represent the sub-topics under this super topic. Formally, the topics are placed in several levels L1, L2, ..., Lm. Each level contains ki topics, i.e., Li = (T1, ..., Tki).

Given a multidimensional text database and a hierarchical topic tree, the main idea of a topic cube is to use the hierarchical topic tree as the hierarchy for the text dimension so that a user can drill-down and roll-up along this hierarchy to explore the content of the text documents in the database. In order to achieve this, we would need to (1) map all the text documents to topics specified in the tree and (2) compute a measure of the content of the text documents falling into each cell.

As will be explained in detail later, we can solve both problems simultaneously with a probabilistic topic model called probabilistic latent semantic analysis (PLSA) [18]. Specifically, given any set of text documents, we can fit the PLSA model to these documents to obtain a set of latent topics in text, each represented by a word distribution (also called a unigram language model). These word distributions can serve as the basis of the "content measure" of text.

Since a basic assumption we make is that the analyst would be most interested in viewing the text data from the perspective of the specified hierarchical topic tree, we would also like these word distributions to correspond well to the topics defined in the tree. Note that an analyst will be potentially interested in multiple levels of granularity of topics, thus we would also like our content measure to have "multiple resolutions", corresponding to the multiple levels of topics in the tree. Formally, for each level Li, if the tree has defined ki topics, we would like the PLSA to compute precisely ki word distributions, each characterizing one of these ki topics. We will denote these word distributions as θj, for j = 1, ..., ki, and p(w|θj) is the probability of word w according to distribution θj. Intuitively, the distribution θj reflects the content of the text documents when "viewed" from the perspective of the j-th topic at level Li.

We achieve this goal of aligning a topic to a word distribution in PLSA by using keyword descriptions of the topics in the tree to define a prior on the word distribution parameters in PLSA so that all these parameters will be biased toward capturing the topics in the tree. We estimate PLSA for each level of topics separately so as to obtain a multi-resolution view of the content.

This established correspondence between a topic and a word distribution in PLSA has another immediate benefit, which is to help us map the text documents into topics in the tree, because the word distribution for each topic can serve as a model for the documents that cover the topic. Actually, after estimating parameters in PLSA we also obtain another set of parameters that indicate to what extent each document
covers each topic. It is denoted as p(θj|d), which means the probability that document d covers topic θj. Thus we can easily predict which topic is the dominant topic in a set of documents by aggregating the coverage of a topic over all the documents in the set. That is, with p(θj|d), we can also compute the topic distribution in a cell of documents C as p(θj|C) = (1/|C|) Σ_{d∈C} p(θj|d) (we assume p(d) are equal for all d ∈ C). While θj is the primary content measure which we will store in each cell, we will also store p(θj|d) as an auxiliary measure to support other ways of aggregating and summarizing text content.

Thus essentially, our idea is to define a topic cube as an extension of a standard data cube by adding (1) a hierarchical topic tree as a topic dimension for the text field and (2) a set of probabilistic distributions as the content measure of text documents in the hierarchical topic dimension. We now give a systematic definition of the topic cube.

3.3 Definition of Topic Cube

DEFINITION 3.1. A topic cube is constructed based on a text database D and a hierarchical topic tree H. It not only has dimensions directly defined from the standard dimensions SD in the database D, but it also has a topic dimension which corresponds to the hierarchical topic tree. Drill-down and roll-up along this topic dimension will allow users to view the data from different granularities of topics. The primary measure stored in a cell of a topic cube consists of a word distribution characterizing the content of text documents constrained by values on both the topic dimension and the standard dimensions (contexts).

The star schema of a topic cube for the ASRS example is given in Fig. 2. The dimension table for the topic dimension is built based on the hierarchical topic tree. Two kinds of measures are stored in a topic cube cell, namely the word distribution of a topic p(wi|topic) and the topic coverage by documents p(topic|dj).

Figure 2: Star Schema of a Topic Cube
  Fact table: Time_key, Location_key, Environment_key, Topic_key, Measures
  Time dimension: Time_key, Day, Month, Year
  Location dimension: Location_key, City, State, Country
  Environment dimension: Environment_key, Light
  Topic dimension: Topic_key, Lower level topic, Higher level topic
  Measures: {wi: p(wi|topic)}, {dj: p(topic|dj)}

Figure 3: An example of a Topic Cube

Fig. 3 shows an example of a topic cube which is built based on ASRS data. The "Time" and "Location" dimensions are defined from the standard dimensions in the ASRS text database, and the topic dimension is added from the hierarchical tree shown in Fig. 1. For example, the left cuboid in Fig. 3 shows us word distributions of some finer topics like "overshoot" at "LAX" in "Jan. 99", while the right cuboid shows us word distributions of some coarser topics like "Deviation" at "LA" in "1999". Fig. 4 shows two example cells of a topic cube (with only the word distribution measure) constructed from ASRS. The meaning of the first record is that the top words of the aircraft equipment problem topic occurring in flights during January 1999 are (engine 0.104, pressure 0.029, oil 0.023, checklist 0.022, hydraulic 0.020, ...). So when an expert gets the result from the topic cube, she will quickly see what the main equipment problems during January 1999 are, which shows the power of a topic cube.

Figure 4: Example Cells in a Topic Cube

  Time     Anomaly Event      Word Distribution
  1999.01  equipment          engine 0.104, pressure 0.029, oil 0.023, checklist 0.022, hydraulic 0.020, ...
  1999.01  ground encounters  tug 0.059, park 0.031, pushback 0.031, ramp 0.029, brake 0.027, taxi 0.026, tow 0.023, ...

Query  A topic cube supports the following query: (a1, a2, ..., am, t). Here, ai represents the value of the i-th dimension and t represents the value of the topic dimension. Both ai and t could be a specific value, a character "?", or a character "*". For example, in Fig. 3, a query ("LAX", "Jan. 99", t="turbulence") will return the word distribution of topic "turbulence" at "LAX" in "Jan. 99", while a query ("LAX", "Jan. 99", t="?") will return the word distributions of all the topics at "LAX" in "Jan. 99". If t is specified as "*", e.g., ("LAX", "Jan. 99", t="*"), a topic cube will only return all the documents belonging to (Location="LAX") and (Time="Jan. 99").

Operations  A topic cube not only allows users to carry out traditional OLAP operations on the standard dimensions, but also allows users to do the same kinds of operations on the topic dimension. The roll-up and drill-down operations along the topic dimension will allow users to view the data
in different levels (granularities) of topics in the hierarchical topic tree. Roll-up corresponds to changing the view from a lower level to a higher level in the tree, and drill-down is the opposite. For example, in Fig. 3, the operation

  Roll-up on Anomaly Event (from Level 2 to Level 1)

changes the view of the topic cube from finer topics like "turbulence" and "overshoot" to coarser topics like "Encounter" and "Deviation". The operation

  Drill-down on Anomaly Event (from Level 1 to Level 2)

does the opposite.

4 Construction of Topic Cube

To construct a topic cube, we first construct a general data cube (called GDC from now on) based on the standard dimensions in the multidimensional text database D. Each cell of this cube stores the set of documents aggregated from the text dimension. Then, from the set of documents in each cell, we mine word distributions of the topics defined in the hierarchical topic tree, level by level. Next, we split each cell into K = Σ_{i=1}^{m} k_i cells, where k_i is the number of topics in level i. Each of these new cells corresponds to one topic and stores its word distribution (primary measure) and the topic coverage probabilities (auxiliary measure). Finally, a topic dimension is added to the data cube, which allows users to view the data by selecting topics.

For example, to obtain the topic cube shown in Fig. 3, we first construct a data cube with only two dimensions, "Time" and "Location". Each cell in this data cube stores a set of documents; for example, the cell ("LAX", "Jan. 99") stores the documents belonging to all the records in the database whose "Location" field is "LAX" and whose "Time" field is "Jan. 99". Then, for the second level of the hierarchical topic tree in Fig. 1, we mine topics such as "turbulence", "bird", "overshoot", and "undershoot" from the document set; for the first level, we mine topics such as "Encounter" and "Deviation". Next, we split the original cell, say ("LAX", "Jan. 99"), into K cells, e.g. ("LAX", "Jan. 99", "turbulence"), ("LAX", "Jan. 99", "Deviation"), etc., where K is the total number of topics defined in the hierarchical topic tree. Finally, we add a topic dimension to the original data cube, and a topic cube is constructed.

Since a major component of our algorithm for constructing a topic cube is the estimation of the PLSA model, we first give a brief introduction to this model before discussing the exact construction algorithm in detail.

4.1 Probabilistic Latent Semantic Analysis (PLSA)

Probabilistic topic models are generative models of text with latent variables representing topics (more precisely, subtopics) buried in text. When using a topic model for text mining, we generally fit the model to the text data to be analyzed and estimate all the parameters. These parameters would usually include a set of word distributions corresponding to latent topics, thus allowing us to discover and characterize hidden topics in the text.

Most topic models proposed so far are based on two representative basic models: probabilistic latent semantic analysis (PLSA) [18] and latent Dirichlet allocation (LDA) [8]. While in principle both PLSA and LDA can be incorporated into OLAP with our ideas, we have chosen PLSA because its estimation can be done much more efficiently than for LDA. Below we give a brief introduction to PLSA.

Basic PLSA The basic PLSA [18] is a finite mixture model with k multinomial component models. Each word in a document is assumed to be a sample from this mixture model. Formally, suppose we use θ_i to denote a multinomial distribution capturing the i-th topic, and p(w|θ_i) is the probability of word w given by θ_i. Let Θ = {θ_1, θ_2, ..., θ_k} be the set of all k topics. The log likelihood of a collection of text C is:

(4.1)  L(C|Λ) ∝ Σ_{d∈C} Σ_{w∈V} c(w,d) log Σ_{j=1}^{k} p(θ_j|d) p(w|θ_j)

where V is the vocabulary set of all words, c(w,d) is the count of word w in document d, and Λ is the parameter set composed of {p(θ_j|d), p(w|θ_j)}_{d,w,j}.

Given a collection, we may estimate PLSA using the maximum likelihood (ML) estimator by choosing the parameters to maximize the likelihood function above. The ML estimate can be computed using the Expectation-Maximization (EM) algorithm [14]. The EM algorithm is a hill-climbing algorithm guaranteed to find a local maximum of the likelihood function. It finds this solution by iteratively improving an initial guess of parameter values using the following updating formulas (alternating between the E-step and the M-step):

E-step:

(4.2)  p(z_{d,w} = j) = p^(n)(θ_j|d) p^(n)(w|θ_j) / Σ_{j'=1}^{k} p^(n)(θ_{j'}|d) p^(n)(w|θ_{j'})

M-step:

(4.3)  p^(n+1)(θ_j|d) = Σ_w c(w,d) p(z_{d,w} = j) / Σ_{j'} Σ_w c(w,d) p(z_{d,w} = j')

(4.4)  p^(n+1)(w|θ_j) = Σ_d c(w,d) p(z_{d,w} = j) / Σ_{w'} Σ_d c(w',d) p(z_{d,w'} = j)

In the E-step, we compute the probability of a hidden variable z_{d,w}, indicating which topic has been used to generate word w in d, based on the current generation of parameter values. In the M-step, we use the information obtained from the E-step to re-estimate (i.e., update) our parameters.
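To make the estimation procedure concrete, the E- and M-steps of Eqs. (4.2)-(4.4) can be sketched in a few lines of NumPy. This is our own illustration, not the authors' implementation; the toy count matrix and topic number in the usage example are invented:

```python
import numpy as np

def plsa_em(counts, k, n_iter=100, seed=0):
    """Basic PLSA via EM. counts is an (n_docs, n_words) term-count
    matrix; returns p(theta_j|d) and p(w|theta_j) per Eqs. (4.2)-(4.4)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_td = rng.random((n_docs, k)); p_td /= p_td.sum(axis=1, keepdims=True)
    p_wt = rng.random((k, n_words)); p_wt /= p_wt.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step (4.2): p(z_{d,w}=j) proportional to p(theta_j|d) p(w|theta_j)
        z = p_td[:, :, None] * p_wt[None, :, :]          # (n_docs, k, n_words)
        z /= z.sum(axis=1, keepdims=True) + 1e-300
        ec = counts[:, None, :] * z                      # expected counts c(w,d) p(z=j)
        # M-step (4.3)-(4.4): re-estimate both parameter families
        p_td = ec.sum(axis=2); p_td /= p_td.sum(axis=1, keepdims=True) + 1e-300
        p_wt = ec.sum(axis=0); p_wt /= p_wt.sum(axis=1, keepdims=True) + 1e-300
    return p_td, p_wt

def log_likelihood(counts, p_td, p_wt):
    """Eq. (4.1), up to the proportionality constant."""
    return float((counts * np.log(p_td @ p_wt + 1e-300)).sum())
```

Running more EM iterations from the same starting point never decreases the value of Eq. (4.1), which is the hill-climbing property the materialization strategy later exploits.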
It can be shown that the M-step always increases the value of the likelihood function, meaning that the next generation of parameter values will be better than the current one [14]. This updating process continues until the likelihood function converges to a local maximum point, which gives us the ML estimate of the model parameters. Since the EM algorithm can only find a local maximum, its result is clearly affected by the choice of the initial parameter values it starts with. If the starting point is already close to the maximum point, the algorithm converges quickly. As we will discuss in detail later, we leverage this property to speed up the process of materializing a topic cube. Naturally, when a model has multiple local maxima (as in the case of PLSA), we generally need to run the algorithm multiple times, each time with a different starting point, and finally use the run with the highest likelihood.

PLSA Aligned to a Specified Topic Hierarchy By directly applying the PLSA model to a data set, we can extract k word distributions {p(w|θ_i)}_{i=1,...,k}, characterizing k topics. Intuitively, these distributions capture word co-occurrences in the data, but the co-occurrence patterns captured do not necessarily correspond to the topics in our hierarchical topic tree. A key idea in our application of PLSA to constructing a topic cube is to align the discovered topics with the topics specified in the tree by defining a prior with the topic tree and using Bayesian estimation instead of the maximum likelihood estimator, which solely listens to the data.

Specifically, we could define a conjugate Dirichlet prior and use the MAP (Maximum A Posteriori) estimator to estimate the parameters [30]. We would first define a prior word distribution p'(w|θ_j) based on the keyword description of the corresponding topic in the tree; for example, we may define it based on the relative frequency of each keyword in the description. We assume that it is quite easy for an analyst to give at least a few keywords to describe each topic. We then define a Dirichlet prior based on this distribution to essentially "force" θ_j to assign a reasonably high probability to words that have high probability according to our prior, i.e., the keywords used by the analyst to specify this topic would all tend to have high probabilities according to θ_j, which further biases the distribution to attract other words co-occurring with them, achieving the purpose of extracting the content about this topic from the text.

4.2 Materialization

As described in Section 4, to fully materialize a topic cube, we need to mine topics for each cell in the original data cube. As discussed earlier, we use the PLSA model as our topic modeling method. Suppose there are d standard dimensions in the text database D, each dimension has L_i levels (i = 1, ..., d), and each level has n_i^(l) values (i = 1, ..., d; l = 1, ..., L_i). Then, in total (Σ_{l=1}^{L_1} n_1^(l)) × ··· × (Σ_{l=1}^{L_d} n_d^(l)) cells need to be mined if we want to fully materialize a topic cube. One baseline strategy of materialization is an exhaustive method which computes the topics cell by cell. However, this method is not efficient, for the following reasons:

1. For each cell in GDC, the PLSA model uses the EM algorithm to calculate the parameters of the topic models. This is an iterative method, and it always needs hundreds of iterations before convergence.

2. For each cell in GDC, the PLSA model has the local maximum problem. To avoid this problem and find the global maximum, it always starts from a number of different random points and selects the best one.

3. The number of cells in GDC could be huge.

On the other hand, based on the difficulty of aggregation, measures in a data cube can be classified into three categories: distributive, algebraic, and holistic [12]. As the measure in a topic cube is the word distributions of topics obtained from PLSA, we can easily see that it is a holistic measure. Therefore, there is no simple way to aggregate measures from sub cells to super cells in a topic cube. To overcome this problem, we propose a heuristic method to materialize a topic cube more efficiently, which is described below.

A Heuristic Materialization Algorithm The basic idea of our heuristic algorithm is the following: when mining topics from the documents of one cell in GDC, we compute the topics by first aggregating the word distributions of topics in its sub cells into a starting point and then running the EM algorithm from this starting point to reach a local maximum. In this way, the EM algorithm converges very quickly, and we do not need to restart it from several different random points. The outline of the heuristic method is shown in Table 2.

Basically, in the first step of our algorithm, we construct a general data cube GDC based on the standard dimensions. Then, in step 2, for each cell in the base cuboid, we use the exhaustive method (starting the EM algorithm from several random points and selecting the best local maximum) to mine topics from its associated document set level by level. The reason for using the exhaustive method for base cells is to provide a solid foundation for future aggregation when we materialize higher level cuboids. In steps 3 and 4, we use our heuristic aggregation method to mine topics from cells in higher level cuboids. For example, when mining topics from a cell (a, b) in GDC, we aggregate all the same-level topics from its sub cells (a, b, c_j), ∀c_j, in the base cuboid.

Specifically, suppose c_a is a cell in GDC aggregated from a set of sub cells {c_1, ..., c_m}, so that S_{c_a} = ∪_{i=1}^{m} S_{c_i}, where S_c is the document set associated with a cell c. For each level L_i in the hierarchical topic tree, we have word distributions {p_{c_i}(w|θ_1^(L_i)), ..., p_{c_i}(w|θ_{k_i}^(L_i))} for each sub cell c_i, and we are going to mine k_i topics {θ_1^(L_i), θ_2^(L_i), ..., θ_{k_i}^(L_i)} from S_{c_a}. The basic idea of the heuristic method is that, when we apply the EM algorithm to mine topics from S_{c_a}, we start from a good starting point, aggregated from the word distributions of topics in its sub cells.
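As an illustration of this pooling (a sketch under our own assumed data layout, not the paper's code), suppose each sub cell retains, for every topic j, its expected word counts Σ_{d∈S_ci} c(w,d) p(z_{d,w}=j) from its own EM run; the merged cell's starting word distributions are then obtained by summing and normalizing:

```python
import numpy as np

def init_word_dists(subcell_expected_counts):
    """Aggregate per-topic expected word counts from sub cells into
    starting word distributions p^(0)(w|theta_j) for the merged cell.
    subcell_expected_counts: list of (k, n_words) arrays, one per sub cell,
    each holding sum_d c(w,d) p(z_{d,w}=j) over that sub cell's documents."""
    pooled = np.sum(subcell_expected_counts, axis=0)   # sum over sub cells
    return pooled / (pooled.sum(axis=1, keepdims=True) + 1e-300)
```

The merged cell's EM run is then initialized with these distributions (and with each document's topic coverage p(θ_j|d) carried over from its own sub cell) instead of with random values.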
The benefit of a good starting point is that it saves the EM algorithm time to converge and saves it from being restarted at several different random points. The aggregation formulas are as follows:

       p^(0)(w|θ_j^(L_i)) = Σ_{c_i} Σ_{d∈S_{c_i}} c(w,d) p(z_{d,w} = j) / Σ_{w'} Σ_{c_i} Σ_{d∈S_{c_i}} c(w',d) p(z_{d,w'} = j)

(4.5)  p^(0)(θ_j^(L_i)|d) = p_{c_i}(θ_j^(L_i)|d),  if d ∈ S_{c_i}

Intuitively, we simply pool together the expected counts of a word from each sub cell to get an overall count of the word for a topic distribution. An initial distribution estimated in this way can be expected to be closer to the optimal distribution than a randomly picked initial distribution.

Table 2: Outline of a heuristic method for materialization

  Suppose a topic cube has three standard dimensions A, B, C and a topic dimension T. The hierarchical topic tree H has n levels, and each level Li has ki topics.
  Step 1. Build a general data cube GDC based on the standard dimensions; each cell stores the corresponding document set.
  Step 2. For each cell (a, b, c) in the base cuboid with associated document set Sabc: for each level Li in H (i from 1 to n-1), estimate PLSA to mine ki topics from Sabc using the exhaustive method.
  Step 3. For each cell (a, b) in cuboid AB with associated document set Sab: for each level Li in H (i from 1 to n-1), estimate PLSA to mine ki topics from Sab by aggregating the same level of topics in all sub cells (a, b, cj) of (a, b) in the base cuboid. Do similar aggregation in cuboids BC and CA.
  Step 4. Calculate topics for cells in cuboids A, B, C by aggregating from cuboids AB, BC, CA.

4.3 Saving Storage Cost

One critical issue about a topic cube is its storage. As discussed in Section 4.2, for a topic cube with d standard dimensions, we have in total (Σ_{l=1}^{L_1} n_1^(l)) × ··· × (Σ_{l=1}^{L_d} n_d^(l)) cells in GDC. If there are N topic nodes in the hierarchical topic tree and the vocabulary size is V, then we need to store at least (Σ_{l=1}^{L_1} n_1^(l)) × ··· × (Σ_{l=1}^{L_d} n_d^(l)) × N × V values for the word distribution measure. This is a huge storage cost, because both the number of cells and the size of the vocabulary are large in most cases.

There are three possible strategies to solve the storage problem. One is to reduce the storage by only storing the top k words for each topic. This method is reasonable, because in the word distribution of a topic the top words carry most of the information about the topic and can be used to represent its meaning. Although the cost of this method is a loss of information about the topics, it saves a great deal of disk space. For example, we generally have tens of thousands of words in our vocabulary; if we store only the top 10 words of each topic in the topic cube instead of all the words, we save a factor of thousands in disk space. The efficiency of this method is studied in our experiments.

Another possible strategy is to use a general term to replace a set of words or phrases so that the size of the vocabulary is compressed. For example, when an engine problem is discussed, words or phrases like "high temperature", "noise", "left engine", and "right engine" always appear, which motivates us to use a general term "engine-problem" to replace all those correlated words or phrases. Such a replacement is meaningful especially when an expert only cares about the general causes of an aviation anomaly rather than the details. But the disadvantage of this method is that it loses much detailed information, so there is a trade-off between storage and the precision of the information we need.

The third possible strategy is that, instead of storing the topics at all levels of the hierarchical topic tree, we can store only the topics in the odd levels or only those in the even levels. The intuition is the following: suppose a topic's parent node has a word w at the top of its word distribution and its child node also has the same word w at the top of its word distribution; then it is highly probable that this topic also has w at the top of its own word distribution. In other words, for a specific topic, we can use the word distributions of both its parent topic and its child topic to quickly induce its own word distribution. This is also an interesting direction to explore.

5 Experiments

In this section, we present our evaluation of the topic cube model. First, we compare the computational efficiency of our heuristic method with a baseline method which materializes a topic cube exhaustively, cell by cell. Next, we show three usages of the topic cube to demonstrate its power.

5.1 Data Set

The data set used in our experiments is downloaded from the ASRS database [2]. Three fields of the database are used as our standard dimensions, namely Time {1998, 1999}, Location {CA, TX, FL}, and Environment {Night, Daylight}. We use A, B, C to represent them, respectively. Therefore, in the first step the constructed general data cube GDC has 12 cells in the base cuboid ABC, 16 cells in the cuboids {AB, BC, CA}, and 7 cells in the cuboids {A, B, C}. The number of documents in each base cell is summarized in Table 3.
Table 3: The Number of Documents in Each Base Cell

                   CA    TX    FL
  1998 Daylight   456   306   266
  1998 Night      107    64    62
  1999 Daylight   493   367   321
  1999 Night      136    87    68

A hierarchical topic tree with three levels is used in our experiments, with 6 topics in the first level and 16 topics in the second level, as shown in Fig. 5. In real applications, the prior knowledge of each topic can be given by domain experts. For our experiments, we first collect a large set of aviation safety report data (also from the ASRS database), then manually check the documents related to each anomaly event, and select the top k (k < 10) most representative words of each topic as its prior.

[Figure 5: Hierarchical Topic Tree used in the Experiments. Level 0 is the root (ALL); Level 1 contains six anomaly categories (Equipment Problem, Altitude Deviation, Conflict, Ground Incursion, In-Flight Encounter, Maintain Problem); Level 2 contains sixteen finer topics, such as Crossing Restriction Not Meet, Excursion From Assigned Altitude, Overshoot, Undershoot, NMAC, Landing Without Clearance, Turbulence, VFR in IMC, Weather, Improper Documentation, and Improper Maintenance. NMAC: Near Midair Collision; VFR: Visual Flight Rules; IMC: Instrument Meteorological Conditions.]

5.2 Efficiency Comparison

We compare three strategies for constructing a topic cube. (1) The heuristic aggregation method proposed in Section 4.2, which we denote by Agg. (2) An approximation method which stores only the top k words of the word distribution of each topic, denoted by App; its purpose is to test the storage-saving strategy proposed in Section 4.3. When calculating topics from the document set in a cell, App uses the same formula as Agg to combine the word distributions of topics in the sub cells, but with only the top k words, to get a good starting point; the EM algorithm is then initialized with this starting point and run until convergence. In our experiments, we set the constant k equal to 10. (3) The third strategy is the baseline of our method, which initializes the EM algorithm with random points; we denote it by Rdm. As stated before, the exhaustive method of materializing a topic cube runs the EM algorithm from several different random points and then selects the best local maximum. Obviously, if the exhaustive method runs the EM algorithm M times, its time cost is M times that of the Agg method, because every run of the EM algorithm in Rdm has the same computational complexity as the Agg method; the heuristic aggregation method is therefore clearly faster than the exhaustive method. So, in our experiments, we use the average performance and the best run of the random method to compare efficiency with the Agg method. The average performance is calculated by running the EM algorithm from M random points and averaging the performance of these runs. The best run is the one which converges to the best local optimum (highest log likelihood) among these M runs.

To measure the efficiency of these strategies, we look at how long each strategy takes to reach the same closeness to the global optimum point. Here, we assume that the convergence point of the best of the M runs is the global optimum point. The experimental results are shown in Fig. 6. The first three graphs show the efficiency comparison of the three strategies; each graph represents the result in one level of cuboid, and we use one representative cell to show the comparison. The experiments on other cells show similar performance and lead to the same conclusion. In the graphs, Best Rdm represents the best run among the M random runs of the third strategy, and Avg Rdm represents the average performance of the M runs. The abscissa represents how close a point is to the global optimum point; for example, the value "0.24" on the axis means that the point's log likelihood is 0.24% smaller than the log likelihood of the global optimum point. The vertical axis is time, measured in seconds. So a point in the plane indicates how much time a method needs to reach a certain closeness to the global optimum point. We can conclude that in all three cells the proposed heuristic methods perform more efficiently than the baseline method, and that this advantage of the heuristic aggregation is not affected by the scale of the document set.

An interesting discovery is that the App method performs more stably than Agg. For example, in Fig. 6 (b) and (c), although the Agg method starts from a better point than App, after reaching a certain point the Agg method seems to be "trapped" and needs longer than App to get closer to the optimum point. Therefore, in Fig. 6 (d), we also test the scalability of the three strategies; the graph depicts how much time each method needs to get as close as 99.99% to the global maximum value. We can see that the performance of App is more stable than that of Agg. This indicates that, when the size of the document set becomes larger, the App method is preferred.

Table 4 shows the log likelihood of the starting points of the three strategies. Here, the log likelihood of the objective function is calculated by Eq. (4.1). This value indicates how likely it is that the documents are generated by the topic models, so the larger the better. In all the cells, both the Agg and App strategies have higher values than the average value of the Rdm strategy. This supports our conclusion that the proposed heuristic methods start much closer to the optimum point than a random starting point and thus need less time to converge.

5.3 Topic Comparison in Different Contexts

One major application of a topic cube is to allow users to explore and analyze topics in different contexts. Here, we regard all the standard dimensions as contexts for topics. Fig. 7 shows four cells in the topic cube constructed on our experiment data.
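This kind of side-by-side comparison amounts to selecting cells that differ only in the context dimension. As a toy sketch (the in-memory cube below is our own hypothetical layout, with word distributions copied from Fig. 7, not the paper's storage format):

```python
# A toy in-memory "topic cube": cells keyed by (environment, topic),
# each storing the topic's word distribution (values from Fig. 7).
cube = {
    ("daylight", "landing without clearance"):
        {"tower": 0.075, "pattern": 0.061, "final": 0.060,
         "runway": 0.052, "land": 0.051, "downwind": 0.039},
    ("night", "landing without clearance"):
        {"tower": 0.035, "runway": 0.027, "light": 0.026,
         "lit": 0.014, "ils": 0.014, "beacon": 0.013},
}

def compare_topic(cube, topic, contexts):
    """Return the same topic's words, ranked by probability, under each context."""
    return {c: sorted(cube[(c, topic)], key=cube[(c, topic)].get, reverse=True)
            for c in contexts}

views = compare_topic(cube, "landing without clearance", ["daylight", "night"])
```

Comparing the two views directly surfaces the observation made below: lighting-related words such as "light" and "beacon" rank highly only in the "night" context.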
[Figure 6: Efficiency Comparison of Different Strategies. (a) Cell = (1999, CA, *) with 629 documents; (b) Cell = (1999, *, *) with 1472 documents; (c) Cell = (*, *, *) with 2733 documents; (d) Comparison of Scalability. In (a)-(c) the horizontal axis is the closeness to the global optimum point (%) and the vertical axis is time in seconds; in (d) the horizontal axis is the size of the document set.]

Table 4: Starting Point of Each Strategy

  Strategy   (1999, CA, *)   (1999, *, *)   (*, *, *)
  Agg           -501098        -1079750      -2081270
  App           -517922        -1102810      -2117920
  Avg Rdm       -528778        -1125987      -2165459

The column "Environment" can be viewed as the context of the topic dimension "Anomaly Event". Comparing the same topic in different contexts reveals some interesting knowledge. For example, from the figure we can see that the "landing without clearance" anomaly event puts more emphasis on the words "light", "ils" (instrument landing system), and "beacon" in the context of "night" than in the context of "daylight". This tells safety experts that these factors are most important for landing and are mentioned a lot by pilots. On the other hand, the anomaly event "altitude deviation: overshoot" is not affected much by the environment light, because the word distributions in these two contexts are quite similar.

  Environment | Anomaly Event                 | Word Distribution
  daylight    | landing without clearance     | tower 0.075, pattern 0.061, final 0.060, runway 0.052, land 0.051, downwind 0.039
  night       | landing without clearance     | tower 0.035, runway 0.027, light 0.026, lit 0.014, ils 0.014, beacon 0.013
  daylight    | altitude deviation: overshoot | altitude 0.116, level 0.029, 10000 0.028, f 0.028, o 0.024, altimeter 0.023
  night       | altitude deviation: overshoot | altitude 0.073, set 0.029, altimeter 0.022, level 0.022, 11000 0.018, climb 0.015

Figure 7: Application of Topic Cube in ASRS

5.4 Topic Coverage in Different Contexts

Topic coverage analysis is another function of a topic cube. As described above, one family of parameters in PLSA, {p(θ|d)}, is stored as an auxiliary measure in a topic cube. These parameters give the topic coverage of each document, and with them we can analyze the topic coverage in different contexts. For example, given a context (Location = "LA", Time = "1999"), we can calculate the coverage or proportion of a topic t as the average of p(t|d_i) over all the documents d_i in the corresponding cell of GDC. From another point of view, the coverage of a topic also reflects the severity of the corresponding anomaly event.

Fig. 8 shows the topic coverage analysis on our experiment data set: Fig. 8(a) is the topic coverage over different places, and Fig. 8(b) is the topic coverage over different environments. With this kind of analysis, we can easily answer questions like: What is the most severe anomaly among all the flights in the state of California? What kind of anomaly is more likely to happen during the night than during daylight? For example, Fig. 8 helps us reveal some very interesting facts. Flights in Texas have more "turbulence" problems than those in California and Florida, while Florida has the most severe "Encounter: Airborne" problem among the three places. There is no evident difference in the coverage of anomalies like "Improper documentation" between night and daylight, which indicates that these kinds of anomalies are not strongly correlated with environment factors. On the other hand, the anomaly "Landing without clearance" obviously has a strong correlation with the environment.

5.5 Accuracy of Categorization

In this experiment, we test how accurate the topic modeling method is for document categorization. Since we only have a prior for each topic and no training examples in our data set, we do not compare our method with supervised classification. Instead, we use the following method as our baseline. First, we use the prior of each topic to create a language model ζ_j for each topic j. Then, we create a document language model ζ_d for each document using Dirichlet smoothing: p(w|ζ_d) = (c(w,d) + μ p(w|C)) / (|d| + μ), where c(w,d) is the count of word w in document d and p(w|C) = c(w,C)/|C| is the collection background model (with |C| the total number of words in the collection). Finally, we use the negative KL-divergence [21] to measure the similarity between a document d and a topic j: S = -D(ζ_j || ζ_d) = Σ_w p(w|ζ_j) log (p(w|ζ_d) / p(w|ζ_j)). If a document d has a similarity score S higher than a threshold δ with a topic j, then it is classified into that topic. On the other hand, when we use the word distribution measure in a topic cube for categorization, we use the word distribution θ_j of topic j as its language model and then compute the negative KL-divergence between θ_j and ζ_d to obtain the similarity score of each topic j and document d.
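A minimal sketch of this scoring function (the value of μ and the toy models in the test are our illustrative choices, not the paper's settings):

```python
import math

def neg_kl_score(topic_lm, doc_counts, coll_lm, mu=1000.0):
    """S = -D(zeta_j || zeta_d): negative KL-divergence between a topic
    language model and a Dirichlet-smoothed document language model
    p(w|zeta_d) = (c(w,d) + mu * p(w|C)) / (|d| + mu)."""
    doc_len = sum(doc_counts.values())
    score = 0.0
    for w, p_t in topic_lm.items():
        if p_t > 0.0:
            p_d = (doc_counts.get(w, 0) + mu * coll_lm.get(w, 1e-9)) / (doc_len + mu)
            score += p_t * math.log(p_d / p_t)
    return score
```

A document is assigned to topic j whenever its score against ζ_j (or θ_j) exceeds the threshold δ; varying δ traces out the recall-precision curves discussed below.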
Our experiment is conducted on the whole data set, using the first level of topics in Fig. 5 as the target categories, i.e., we classify the documents into 6 categories. The gold standard we use is the "Anomaly Event" labels in the ASRS data, which are tagged by pilots. We then obtain recall-precision curves by varying the value of the threshold δ, as shown in Fig. 9. We can see that the curve of PLSA is above that of the baseline method. This means that PLSA achieves better categorization results when only prior knowledge about the topics is available.

[Figure 8: Topic Coverage Comparison among Different Contexts. (a) Place Comparison; (b) Environment Comparison. The bar labels are not recoverable from the source.]

[Figure 9: Comparison of Categorization Accuracy. Recall (horizontal axis) versus precision (vertical axis) curves for the Baseline and PLSA methods.]

6 Conclusions

OLAP is a powerful technique for mining structured data, while probabilistic topic models are among the most effective techniques for mining topics in text and analyzing their patterns. In this paper, we proposed a new data model (i.e., the Topic Cube) to combine OLAP and topic models, extending OLAP to the text dimension and allowing an analyst to flexibly explore the content of text documents together with other standard dimensions in a multidimensional text database. Technically, we extended the standard data cube in two ways: (1) we adopt a hierarchical topic tree to define a topic dimension for exploring text information, and (2) we store word distributions as the primary content measure (and topic coverage probabilities as an auxiliary measure) of the text information. All these probabilities can be computed simultaneously by estimating a probabilistic latent semantic analysis model on the text data in a cell. To materialize a topic cube efficiently, we propose a heuristic algorithm that leverages previously estimated models in component cells to choose a good starting point for estimating the model of a merged larger cell. Experimental results show that this heuristic algorithm is effective and that the topic cube can be used for many different applications.

Our work represents only the first step in combining OLAP with text mining based on topic models, and there are many interesting directions for further study. First, it would be interesting to further explore other possible strategies for materializing a topic cube efficiently. Second, since the topic cube gives us an intermediate result of text analysis in cells, it is important to explore how to summarize documents according to the context of each cell based on the topic cube. Finally, our work represents only one way to combine OLAP and topic models; it should be very interesting to explore other ways to integrate these two kinds of effective, yet quite different, mining techniques.

Acknowledgment

We sincerely thank the anonymous reviewers for their comprehensive and constructive comments. The work was supported in part by NASA grant NNX08AC35A, the U.S. National Science Foundation grants IIS-0842769, IIS-0713571 and IIS-0713581, and the Air Force Office of Scientific Research MURI award FA9550-08-1-0265.

References

[1] Anomaly event schema. http://www.asias.faa.gov/pls/portal/stage.meta_show_column?v_table_id=165477.
[2] Aviation safety reporting system. http://asrs.arc.nasa.gov/.
[3] Megaputer's PolyAnalyst. http://www.megaputer.com/.
[4] S. Agarwal, R. Agrawal, P. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi,
On the computation of multidimensional aggregates, in VLDB, 1996, pp. 506-521.
[5] N. Bansal and N. Koudas, Blogscope: a system for online analysis of high volume text streams, in VLDB '07: Proceedings of the 33rd international conference on Very large data bases, VLDB Endowment, 2007, pp. 1410-1413.
[6] D. Blei and J. Lafferty, Correlated topic models, in NIPS '05: Advances in Neural Information Processing Systems 18, 2005.
[7] D. M. Blei, T. L. Griffiths, M. I. Jordan, and J. B. Tenenbaum, Hierarchical topic models and the nested chinese restaurant process, in NIPS, 2003.
[8] D. M. Blei, A. Y. Ng, and M. I. Jordan, Latent dirichlet allocation, Journal of Machine Learning Research, 3 (2003), pp. 993-1022.
[9] V. T. Chakaravarthy, H. Gupta, P. Roy, and M. Mohania, Efficiently linking text documents with relevant structured information, in VLDB '06: Proceedings of the 32nd international conference on Very large data bases, VLDB Endowment, 2006, pp. 667-678.
[10] S. Chaudhuri and U. Dayal, An overview of data warehousing and olap technology, SIGMOD Rec., 26 (1997), pp. 65-74.
[11] B.-C. Chen, L. Chen, Y. Lin, and R. Ramakrishnan, Prediction cubes, in VLDB '05: Proceedings of the 31st international conference on Very large data bases, VLDB Endowment, 2005, pp. 982-993.
[12] Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, Multi-dimensional regression analysis of time-series data streams, in Proc. 2002 Int. Conf. Very Large Data Bases (VLDB'02), Hong Kong, China, Aug. 2002, pp. 323-334.
[13] W. F. Cody, J. T. Kreulen, V. Krishna, and W. S. Spangler, The integration of business intelligence and knowledge management, IBM Syst. J., 41 (2002), pp. 697-713.
[14] A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of Royal Statist. Soc. B, 39 (1977), pp. 1-38.
[15] F. M. Fei Jiang, J. Pei, and A. W. Chee Fu, Ix-cubes: iceberg cubes for data warehousing and olap on xml data, in CIKM '07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, New York, NY, USA, 2007, ACM, pp. 905-908.
[16] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh, Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals, ICDE, 00 (1996), p. 152.
[17] T. Hofmann, The cluster-abstraction model: Unsupervised learning of topic hierarchies from text data, in IJCAI, T. Dean, ed., Morgan Kaufmann, 1999, pp. 682-687.
[18] T. Hofmann, Probabilistic latent semantic indexing, in SIGIR '99: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, ACM Press, 1999, pp. 50-57.
[19] A. Inokuchi and K. Takeda, A method for online analytical processing of text data, in CIKM '07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, New York, NY, USA, 2007, ACM, pp.
[21] S. Kullback and R. A. Leibler, On information and sufficiency, The Annals of Mathematical Statistics, 22 (1951), pp. 79-86.
[22] E. Lo, B. Kao, W.-S. Ho, S. D. Lee, C. K. Chui, and D. W. Cheung, Olap on sequence data, in SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, New York, NY, USA, 2008, ACM, pp. 649-660.
[23] Q. Mei, X. Ling, M. Wondra, H. Su, and C. Zhai, Topic sentiment mixture: Modeling facets and opinions in weblogs, in Proceedings of WWW '07, 2007.
[24] Q. Mei, C. Liu, H. Su, and C. Zhai, A probabilistic approach to spatiotemporal theme pattern mining on weblogs, in WWW, L. Carr, D. D. Roure, A. Iyengar, C. A. Goble, and M. Dahlin, eds., ACM, 2006, pp. 533-542.
[25] Q. Mei, X. Shen, and C. Zhai, Automatic labeling of multinomial topic models, in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p. 490.
[26] Q. Mei and C. Zhai, Generating impact-based summaries for scientific literature, in Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, p. 816.
[27] J. M. Pérez, R. Berlanga, M. J. Aramburu, and T. B. Pedersen, A relevance-extended multi-dimensional model for a data warehouse contextualized with documents, in DOLAP '05: Proceedings of the 8th ACM international workshop on Data warehousing and OLAP, New York, NY, USA, 2005, ACM, pp. 19-28.
[28] J. Ponte and W. B. Croft, A language modeling approach to information retrieval, in Proceedings of the ACM SIGIR'98, 1998, pp. 275-281.
[29] M. Steyvers, P. Smyth, M. Rosen-Zvi, and T. Griffiths, Probabilistic author-topic models for information discovery, in Proceedings of KDD'04, 2004, pp. 306-315.
[30] T. Tao and C. Zhai, Regularized estimation of mixture models for robust pseudo-relevance feedback, in SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, New York, NY, USA, 2006, ACM, pp. 162-169.
[31] Y. Tian, R. A. Hankins, and J. M. Patel, Efficient aggregation for graph summarization, in SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, New York, NY, USA, 2008, ACM, pp. 567-580.
[32] X. Wang, C. Zhai, X. Hu, and R. Sproat, Mining correlated bursty topic patterns from coordinated text streams, in KDD, P. Berkhin, R. Caruana, and X. Wu, eds., ACM, 2007, pp. 784-793.
[33] P. Wu, Y. Sismanis, and B. Reinwald, Towards keyword-driven analytical processing, in SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data, New York, NY, USA, 2007, ACM, pp. 617-628.
[34] C. Zhai and J. Lafferty, Model-based feedback in the language modeling approach to information retrieval, in Tenth International Conference on Information and Knowledge

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.
455–464. edge Management (CIKM 2001), 2001, pp. 403–410. [20] P. G. I PEIROTIS , A. N TOULAS , J. C HO , AND L. G RAVANO, Modeling and managing changes in text databases, ACM Trans. Database Syst., 32 (2007), p. 14. 1135 Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.