Online Submission ID: 135
TStreamMonitor: Visually Monitoring Evolving Text Streams
Category: Technique
ABSTRACT
We develop a dynamic visualization system, TStreamMonitor,
aimed at monitoring realtime text streams. The streams consist of
a large amount of continuously arriving documents, such as news,
blogs and emails. Visually monitoring such streams helps analysts
and users quickly observe and understand the topics, events and
trends of the evolving datasets. Our system encodes the incom-
ing text documents into user-friendly dynamic visualizations which
support realtime in situ exploration of the streams. Such encod-
ing is implemented based on a multilevel clustering scheme over an
evolving graph of streaming documents. The clusters discover the-
matic topics of the streaming documents. They are visualized based
on text summarization so that users can easily identify salient and
changing focuses of the text streams.
Keywords: Text Stream; Realtime Monitoring; Visual Knowledge
Discovery
1 INTRODUCTION
News, blogs, emails, and social media records are becoming more and
more ubiquitous. Such textual documents continuously “stream”
into people’s lives, creating constantly evolving text streams. In
many cases, there is a real need to monitor the streams while they
thrive, so that users and analysts can quickly observe and understand
the characteristics, underlying structure, trends, and events
in the evolving datasets. In such a monitoring scenario, the
visualization design should follow several principles:
• Support in situ processing The visualization needs to han-
dle the evolving text items as they pass through the system.
Realtime computation and visualization are demanded.
• Capture salient knowledge The visual elements should cap-
ture sufficient knowledge of the streaming data. Critical fea-
tures and trends should be easily identified and studied.
• Use understandable visualization The cognitive abilities of
analysts should be taken into account. The dynamic visualization
needs to avoid complex metaphors and promote prompt
understanding.
• Facilitate user participation Users should be able to control
the visualization process by tuning the system for their pre-
ferred outcome, so as to maintain flexible analytical utility.
Most current text visualization techniques, like many analytical
processes, rely on post-processing of stored data. They are
successful in visualizing features and trends in text collections, but
are not directly capable of monitoring realtime text streams when
there is no prior knowledge of incoming items. A few efforts have
addressed the visualization of text streams, but they do not satisfy
all of these principles. Users may struggle to identify salient
features in the rapidly changing and complex visualizations.
In this paper, we develop a visual monitoring system, namely
TStreamMonitor, for users to study evolving text streams. It en-
codes the actively incoming text documents into user-friendly dy-
namic visualizations that characterize evolving salient features.
TStreamMonitor is developed with respect to the design principles
above. Using an evolving stream of daily news as an example, our
system is designed:
• To support in situ processing, the computation and visualization
do not rely on prior knowledge of the incoming
news. Instead, it is assumed that unknown news items continuously
flow into the system for users to characterize.
• To capture salient knowledge, the system visualizes the
dynamically-formed clusters of a large amount of news items
to elucidate the hidden thematic information. The clusters
represent specific news topics of the rapidly changing stream,
such as Ukraine crisis, Malaysia Airlines MH370, etc.
• To promote easy understanding, instead of displaying the many
individual news items contained in each cluster, the dynamic
visualization provides a lightweight interface with minimal
clutter and distraction. Users are able to identify essential
insights easily and promptly.
• To facilitate user interaction, the size of the clusters can be
controlled by users in realtime together with the dynamic vi-
sualization. A big cluster can also be visualized as a set
of sub-clusters representing sub-topics. Consequently, the
formed clusters and sub-clusters show different levels of news
topics. For example, Ukraine crisis has sub-topics of Ukraine
protest and Crimea votes.
To support such designs of realtime stream processing, TStreamMonitor
starts from a dynamically changing force-directed layout
of individual documents. The particles representing documents
then form an evolving graph through 2D geometric triangulation.
Topical clusters are then generated by a user-controlled graph cut.
This approach needs no special text clustering methods, which
mostly perform batch processing over the whole dataset and are
therefore unsuitable for visualizing continuously arriving documents.
Moreover, the clusters are naturally formed in a multilevel scheme in
which users can control their sizes and levels.
The topical clusters are analyzed by text summarization utilizing
keyword frequency, importance weights and graph-based ranking
algorithms (e.g. PageRank [22]). The salient information of important
clusters is visualized for easy identification and understanding,
instead of massive numbers of individual documents. Moreover, the
visualization clearly shows the splitting and merging of topical clusters,
enabled by identifying temporal topic evolution through an optimized
matching algorithm. Their evolutionary history is further
shown in a topic stream view, which also supports user
interaction, e.g. playback.
The TStreamMonitor system provides a new web-based platform
which supports effective realtime monitoring and study of evolving
text streams. It can be used in visual exploration of news stories,
blogs, emails, business transactions, and many more text datasets,
where it can raise the level of understanding of realtime analysts.
We implemented the prototype system and conducted case
studies using a dataset of New York Times news. We also performed
a user study to evaluate the system.
2 RELATED WORK
Visualizing large text corpora is an emerging research topic. Tag
cloud techniques (e.g. [18]) use visual depictions of tags (or
words) giving greater prominence to words that appear more fre-
quently. Similarity-based projections help users get insights from
large text collections in 2D views. Typically, multidimensional
Figure 1: TStreamMonitor system framework. A text stream continuously injects arriving documents into the system, which provides an interac-
tive, realtime monitoring interface for users to quickly characterize and understand the streaming data.
scaling (MDS) or force-directed methods are used to form a 2D layout
from the text documents. For instance, “galaxies” or “mountains”
are formed in the displays [26]. A hierarchy of the documents
is built and projected as circles in [23]. InfoSky [3] exploits
hierarchically structured documents at each level with Voronoi diagrams.
Exemplar-based visualization [6] visualizes extremely large
text corpora by probabilistic MDS projection with approximation
and decomposition. Alsakran et al. [2] use a dynamic similarity-based
projection system to depict text streams. Gansner et al. [11]
draw the dynamic graph of streaming documents by MDS.
Text collections often include time-stamped documents. ThemeRiver [13]
and LensRiver [12] depict the frequency changes of
keywords as river currents. T-scroll [14] employs a novelty-based
clustering algorithm on time-series documents. StoryTracker [16]
presents an incremental visual analytics system for exploration of
news topics in dynamic information streams. Meme-tracking [19]
extracts keywords from a news corpus to generate themes, and then
visualizes them to indicate the flow of stories. Utilizing text mining
tools such as topic modeling, some efforts have been made for visual
analysis of text collections, including EventRiver [21], Cloudlines [15],
Leadline [9], visual text summary [20], TIARA [25],
TextFlow [7], Textpool [1], and TextWheel [8]. These methods
are very successful in visualizing the temporal trends in archives
of text. However, most of them are not directly designed for monitoring
live text streams.
The most closely related work to TStreamMonitor is that of
Alsakran et al. [2] and Gansner et al. [11], which target
realtime monitoring and dynamic visualization of text streams.
In [11], modularity clustering is applied after MDS projects
tweet documents onto a 2D layout. A map metaphor depicts
the clusters, where highly related messages form countries
enclosed by a boundary. Each country has a color different from
its neighbors. This approach creates a static layout at a given time.
To handle dynamic text streams, a two-step approach is used. A
newly arriving document is placed at the average position of its
neighbors in the previous graph, and local optimization of the MDS
layout is applied. However, a special Procrustes transformation has
to be applied to best align the new graph with the old one. Moreover,
to avoid overlapping components after local MDS, a packing algorithm
has to be introduced. In [2], such dynamic stability is handled
through continuous force-directed simulation. The dynamic evolution
of documents inside the 2D projected layout is seamless and
smooth. However, this approach visualizes all the particles inside
clusters, which makes the layout too complex, and the dynamic motion
of many particles distracts users from salient information.
Similarly, the dynamic maps approach [11], which places many
cities and countries on the screen, may also impose a heavy burden
on observers. In our approach, we use a force-directed layout for
the dynamically evolving graph, and propose a lightweight visualization
based on the dynamic clusters directly computed from the 2D
geometric distribution of documents. More importantly, these related
approaches do not give users easy access to and control over
clusters and their topics in the visualization. Our approach instead
computes and depicts multiple levels of clusters and their dynamic
merging and splitting in a fast and effective way. TStreamMonitor
thus helps users interactively discover interesting topics and their
subtopics with an easy-to-read interface.
3 DYNAMIC TOPICAL CLUSTERS
Fig. 1 illustrates the processing framework of TStreamMonitor. A
text stream continuously injects arriving documents into the system.
Each document has a title, content, and time stamp. It is
typically represented by a sequence of keywords, based on which
we compute the similarity between each pair of documents. This
is the most common approach in text clustering and
visualization work. We use the cosine similarity in this computation.
For in situ computation, we update the current set of keywords
whenever new documents arrive. Not all keywords are included in
the set; instead, TF-IDF (Term Frequency-Inverse Document Frequency)
weights are used to find the N most important keywords
for similarity computation.
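
The keyword selection and similarity step described above can be sketched as follows. This is a minimal illustration, not the system's C++ implementation: the function names are hypothetical, and summing per-document TF-IDF weights to rank keywords is an assumption, since the text does not specify how the weights are aggregated.

```python
import math
from collections import Counter

def top_keywords(docs, n):
    """Return the n keywords with the highest summed TF-IDF weight.

    docs is a list of keyword lists. Summing the weights over documents
    is an assumption made for this sketch.
    """
    num_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weight = Counter()
    for doc in docs:
        if not doc:
            continue
        for word, count in Counter(doc).items():
            idf = math.log(num_docs / doc_freq[word])
            weight[word] += (count / len(doc)) * idf
    # Sort by weight, then alphabetically, so that ties are deterministic.
    return sorted(weight, key=lambda w: (-weight[w], w))[:n]

def to_vector(doc, vocabulary):
    """Count how often each vocabulary keyword occurs in the document."""
    counts = Counter(doc)
    return [counts[word] for word in vocabulary]

def cosine_similarity(u, v):
    """Cosine of the angle between two keyword vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = [
    ["ukraine", "russia", "crimea"],
    ["russia", "sanctions", "ukraine"],
    ["fashion", "paris", "week"],
]
vocabulary = top_keywords(docs, 10)
vectors = [to_vector(doc, vocabulary) for doc in docs]
sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
```

In a streaming setting, the vocabulary would be recomputed over the currently active documents whenever new ones arrive.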
A 2D layout of active documents is generated, in which similar
documents are placed close together. To achieve this we adopt a
force-directed method. Over this 2D layout, we triangulate
the document particles, which creates an evolving graph. Then, the
edges longer than a given threshold are removed
to find document clusters. This approach is based on the actively
evolving graph, similar to the work in [2]. It automatically handles
continuously incoming documents and creates clusters over these
documents. It has a mechanism for dropping old and stale documents.
Our contribution behind the new visual interface is to further
extend this method to find multilevel clusters and handle
evolutionary topics.
3.1 Multilevel Clusters
Our system can find multilevel clusters, as shown in Figure 2.
Figure 2a is the evolving graph created after applying Delaunay
triangulation to the 2D layout of document particles. By applying
a graph cut with a threshold φ1, we discover several clusters over
the graph in Figure 2b. We can further apply a threshold φ2 inside
these clusters to find sub-clusters, which are shown
in different particle colors in Figure 2c. If needed, more clusters can
be identified inside these sub-clusters (Figure 2d).
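
A minimal sketch of the multilevel graph cut is given below. It assumes that the edge list comes from a triangulation computed elsewhere, that edges longer than the threshold are cut, and that the thresholds decrease from level to level (φ1 > φ2 > ...). The function names and the nested dictionary output are hypothetical.

```python
import math

def cut_clusters(points, edges, members, threshold):
    """Connected components of `members`, keeping only edges no longer than threshold."""
    parent = {m: m for m in members}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        if a in parent and b in parent:
            if math.dist(points[a], points[b]) <= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for m in members:
        groups.setdefault(find(m), []).append(m)
    return sorted(sorted(group) for group in groups.values())

def multilevel_clusters(points, edges, thresholds, members=None):
    """Nested clusters: thresholds[0] yields clusters, thresholds[1] sub-clusters, and so on."""
    if members is None:
        members = list(range(len(points)))
    if not thresholds:
        return []
    return [
        {
            "members": cluster,
            "children": multilevel_clusters(points, edges, thresholds[1:], cluster),
        }
        for cluster in cut_clusters(points, edges, members, thresholds[0])
    ]

# Six particles on a line; the edges stand in for a triangulation.
points = [(0, 0), (1, 0), (3, 0), (4, 0), (20, 0), (21, 0)]
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
tree = multilevel_clusters(points, edges, thresholds=[5.0, 1.5])
```

Because each level only re-examines the edges inside an existing cluster, changing a threshold requires no new layout or similarity computation, which fits the statement that the clusters can be regenerated immediately.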
Important clusters and their sub-clusters represent the thematic
knowledge of the text stream, since each cluster can be regarded as
a topic and its sub-clusters as related sub-topics. Such a
hierarchical structure provides a powerful visual analytics capability to
characterize critical topics and, at the same time, analyze the
subtopics inside them.
In [2], though some clusters are identified, the visualization still
shows individual particles over the force-directed layout. Such a
visualization is too complex for users to understand the changing
graph. Moreover, it cannot easily visualize the information of
sub-clusters and their relationships. Instead, our visualization design is
based on the hierarchical clusters to show more salient knowledge
of the text stream.
Figure 2: Multilevel clusters are easily created from the evolving graph: (a) evolving graph; (b) discovered clusters; (c) sub-clusters; (d) sub-sub-clusters. Their generation is controlled by threshold parameters and can be
achieved immediately.
3.2 Evolutionary Topical Clusters
The key problem in mining and visualizing text streams is to iden-
tify the evolutionary topics. In the monitoring scenario it is even
more challenging as many analytical tools, such as the widely used
LDA (Latent Dirichlet Allocation) model [4], cannot be directly
applied to the actively arriving documents. In TStreamMonitor, we
find the topical evolution based on the relationships between docu-
ment particles. This approach naturally models the topical cluster
variation without seeking help from the text mining techniques. It
can be completed immediately in realtime monitoring which is vital
for the in situ processing.
To trace the evolving clusters, we match the important pairs of
clusters at consecutive times that share most of their document
particles. In detail, consider two clusters, Ci and Cj, at consecutive
times t and t’, each of which accommodates a set of particles. We
compute a match factor δij between the two clusters as:
$$\delta_{ij}(C_i, C_j) = \frac{N(C_i \cap C_j)}{N(C_i)}, \tag{1}$$
where N(C) is the number of particles inside a cluster C. Then,
considering a set of clusters {Ci}, i = 1···N found at time t, and a
set of clusters {Cj}, j = 1···M formed at time t’, we need to find
the best matching pair Ci → Cj for each Ci. This cluster matching
problem can be solved by a combinatorial optimization algorithm,
the Hungarian method [17].
This method provides a better solution than the naive matching
implementation in [2]. In particular, we have a nonnegative N × M
matrix, where the element in the i-th row and j-th column is
δij(Ci,Cj). It can be regarded as the score of assigning Ci → Cj.
Since a larger match factor means that more particles are shared, we
apply the Hungarian algorithm to find an optimal assignment in which the
total score Σi δij(Ci,Cj) over all i = 1···N is maximized (equivalently,
the total cost Σi (1 − δij(Ci,Cj)) is minimized). In our implementation,
we further extend the algorithm by setting a lower bound
of δij(Ci,Cj) = 0.15 for any matched pair: if the match factor is too
small, i.e., the two corresponding clusters do not share enough
documents, we do not regard Cj as the cluster evolved from Ci.
Based on this result, we
can identify the merging and splitting of the topical clusters from t
to t’. Finally, any cluster at time t’ that has no match from
the previous time is considered a newly emerging cluster.
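
The matching step can be illustrated with the sketch below. It assumes that a larger match factor indicates a better match, so the total match factor is maximized. For brevity it searches all assignments by brute force, which finds the same optimum as the Hungarian method but is only practical for a handful of clusters. The function names are hypothetical, and the detection of merging and splitting is not covered.

```python
from itertools import permutations

def match_factor(ci, cj):
    """Eq. (1): fraction of the particles of ci that also appear in cj."""
    return len(ci & cj) / len(ci) if ci else 0.0

def match_clusters(prev, curr, lower_bound=0.15):
    """Return ({i: j}, new_indices) for the assignment with the largest total match factor.

    prev and curr are lists of particle-id sets at times t and t'.
    Brute force stands in for the Hungarian method in this sketch.
    """
    n, m = len(prev), len(curr)
    size = max(n, m)
    # Pad with zero scores to a square matrix so every row receives a column.
    score = [
        [match_factor(prev[i], curr[j]) if i < n and j < m else 0.0
         for j in range(size)]
        for i in range(size)
    ]
    best = max(
        permutations(range(size)),
        key=lambda p: sum(score[i][p[i]] for i in range(size)),
    )
    # Keep only real (unpadded) pairs whose match factor reaches the lower bound.
    matches = {
        i: best[i]
        for i in range(n)
        if best[i] < m and score[i][best[i]] >= lower_bound
    }
    matched = set(matches.values())
    new_indices = [j for j in range(m) if j not in matched]
    return matches, new_indices

prev = [{1, 2, 3, 4}, {5, 6, 7}]
curr = [{5, 6, 7, 8}, {1, 2, 3}, {9, 10}]
matches, new_indices = match_clusters(prev, curr)
```

In this example the first cluster at time t keeps three of its four particles in the second cluster at time t’ (match factor 0.75), while the third cluster at time t’ has no predecessor and is reported as newly emerging.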
4 TOPICAL CLUSTER SUMMARIZATION
Figure 3: Summarization of a cluster. The hottest keywords are shown (in this example: United States, International Relations, Ukraine, Russia, and Putin, Vladimir V).
Moreover, three different methods are used to find important documents,
whose titles are shown in different colors: (green) the newest
arriving document; (blue) the document having the largest average
similarity to other documents; (red) the document having the largest
PageRank [22].
Visualizing the topical clusters relies on a fast, automatic way to
find and present the most important points of the original documents
inside a cluster or sub-cluster. Due to the limited screen real estate
in the visualization, only a very small set of words, phrases, or
sentences can be used in the visual representation. For evolving text
streams, we utilize three types of information to quickly convey the
important content of a topical cluster:
1. Active time period: Topics in a text stream evolve along time.
During monitoring, an active cluster which has newly incom-
ing documents may need more attention. The active time pe-
riod of clusters can provide information of a cluster’s time
variation.
2. Representative keywords: Just as a document is represented by a
set of important keywords, a cluster is also represented by
some representative keywords. Such keywords can be selected
as the most frequent keywords,
or as the most highly weighted keywords (e.g. by TF-IDF).
3. Significant titles: Significant documents inside a cluster usu-
ally represent the important information. Visualizing the titles
of them helps users understand the topic. Finding the most
significant document can be implemented in different ways:
• The most recently arriving document;
Figure 4: TStreamMonitor interface. (A) Dynamic Topic Monitor; (B) Topic Stream View; (C) Detail View; (D) Control Panel.
• The document that has the largest average similarity to
other documents;
• The document that has the highest PageRank [22] among the
documents.
Figure 3 is an example of three different particles found in a
cluster with these three options. It indicates the particles and
the titles of their corresponding documents in different colors.
The representative hottest keywords are also shown. We
implement the algorithms for all of these options. In realtime
monitoring, users can switch between them freely without
disturbing the dynamic process. In the future, we will study more
summarization techniques (e.g. [27]) to better represent a
topical cluster.
In TStreamMonitor, we show this information in simple visualizations
so that users can promptly identify the topical information.
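
The summarization options above can be sketched as follows. The document fields ("keywords", "time"), the function names, and the PageRank variant (power iteration over the pairwise similarity matrix with a damping factor of 0.85) are assumptions made for illustration. The text only states that frequency, average similarity, and PageRank [22] are used.

```python
from collections import Counter

def hottest_keywords(docs, k):
    """The k most frequent keywords in a cluster (ties broken alphabetically)."""
    counts = Counter(word for doc in docs for word in doc["keywords"])
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]

def newest(docs):
    """Index of the most recently arrived document."""
    return max(range(len(docs)), key=lambda i: docs[i]["time"])

def most_similar(sim):
    """Index of the document with the largest average similarity to the others."""
    n = len(sim)

    def average(i):
        others = [sim[i][j] for j in range(n) if j != i]
        return sum(others) / len(others) if others else 0.0

    return max(range(n), key=average)

def pagerank_top(sim, damping=0.85, iterations=50):
    """Index of the document with the highest PageRank on the similarity graph."""
    n = len(sim)
    out = [sum(sim[i][j] for j in range(n) if j != i) for i in range(n)]
    rank = [1.0 / n] * n
    for _ in range(iterations):
        rank = [
            (1 - damping) / n
            + damping * sum(
                rank[i] * sim[i][j] / out[i]
                for i in range(n)
                if i != j and out[i] > 0
            )
            for j in range(n)
        ]
    return max(range(n), key=lambda i: rank[i])

cluster = [
    {"keywords": ["russia", "ukraine"], "time": 1},
    {"keywords": ["russia", "crimea"], "time": 3},
    {"keywords": ["ukraine", "russia"], "time": 2},
]
sim = [
    [1.0, 0.9, 0.8],
    [0.9, 1.0, 0.1],
    [0.8, 0.1, 1.0],
]
keywords = hottest_keywords(cluster, 2)
picks = (newest(cluster), most_similar(sim), pagerank_top(sim))
```

Each selector returns an index into the cluster's document list, so switching between the three options at run time only changes which title is displayed.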
5 TSTREAMMONITOR VISUALIZATION DESIGN
Figure 4 is an overview of the interface in TStreamMonitor. Figure
4(A) is the dynamic topic monitor. It is used to monitor the dynamic
changes of topical clusters in an active text stream. Figure 4(B)
is the topic stream view, which shows the temporal evolution of
the topics. It is also used as a panel for users to explore historical
changes when they want to. Figure 4(C) is the detail view used
to show the details of documents. In Figure 4(D) users can adjust
several parameters and thresholds, and control playback, pause and
forward.
5.1 Dynamic Topic Monitor
The key function of TStreamMonitor is to provide an easy-to-read
view of the important topics discovered by topical clusters. In
visualizing dynamic scenes, too many visual elements on the interface
usually confound observers’ perception and understanding, since
the changes of complex visual metaphors can easily become
distracting. Therefore, our main design rationale is to provide an
interface that is:
• lightweight: with a limited number of visual elements that do
not overwhelm and defocus observers during monitoring;
• informative: including necessary information for users to
quickly identify the topical contents;
• smooth: with minimized abrupt changes during topic evolu-
tion;
• controllable: giving users the ease of parameter adjustment
that leads to natural and sleek view changes.
5.1.1 Visualizing Topic Clusters
For a lightweight topic monitor, we do not overwhelm users by vi-
sualizing all the individual document particles as in [2]. Instead,
we only show the discovered important topics. In Figure 4(A), five
important topics are displayed, each as a circular bubble metaphor.
A bubble represents one corresponding cluster containing a set of
documents. The size of the bubbles is determined by the total num-
ber of the documents inside a cluster. The bubble’s center location
is computed from the document particles’ locations originally de-
fined by the force-directed layout of the evolving graph (Figure 2).
Figure 5: One topical cluster shown as a bubble. The associated
information includes: the active time period shown in the curved color bar,
four representative keywords in the linked labels, and one significant
title shown at the top. The rectangular label further shows the title
of the latest arriving document in the whole TStreamMonitor system,
which happens to be in this cluster.
Figure 6: Subtopic view of a topical cluster. Representative keywords
and significant titles are shown for each of the three sub-clusters.
Therefore in the simplified view, the distances between the cluster
bubbles represent the similarities of the topics. Related topics are
placed closely.
A straightforward solution is to place the bubble’s center at the
average location of the particles belonging to it. However,
during dynamic evolution, such an average location jitters slightly
at each time step, since any location change of a
single particle moves the average. This effect creates temporal
instability that greatly distracts observers in animation. We instead
apply a simple but effective algorithm. When a cluster bubble is
drawn for the first time, we find the document particle that is
closest to the average location. Then this particle’s location is used
as the bubble center. In this way, the jitter is removed,
and the motion of the bubble becomes stable and consistent.
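
The anchoring idea can be sketched in a few lines. The function names and the dictionary of particle positions are hypothetical, and the sketch does not address what happens when the anchor particle is later dropped from the cluster, which the text leaves open.

```python
import math

def centroid(positions):
    """Average location of a collection of (x, y) positions."""
    xs, ys = zip(*positions)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def choose_anchor(positions):
    """Id of the particle closest to the centroid when the bubble is first drawn.

    positions maps particle ids to (x, y); ties are broken by id.
    """
    center = centroid(list(positions.values()))
    return min(positions, key=lambda pid: (math.dist(positions[pid], center), pid))

def bubble_center(anchor, positions):
    """In later frames the bubble simply follows its anchor particle."""
    return positions[anchor]

first_frame = {"a": (0.0, 0.0), "b": (4.0, 0.0), "c": (2.0, 1.0), "d": (2.0, 5.0)}
anchor = choose_anchor(first_frame)

# Particle "a" moves slightly: the centroid shifts, the anchored center does not.
next_frame = dict(first_frame, a=(0.4, 0.0))
center_now = bubble_center(anchor, next_frame)
```

The bubble then moves only when its anchor particle moves, instead of reacting to every small displacement of any member.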
Figure 5 is a zoomed-in view of one topical cluster. The three types
of summarized information discussed in Section 4 are visualized to
provide a succinct and informative view. At the top, a significant
title from this cluster is shown. The linked labels present the
representative keywords of this topic. A rectangular text box with
black letters shows the title of the latest arriving document in the
whole system. Moreover, a curved bar over the bubble encodes in color
the most recent active time of this topic, i.e., the latest time
a new document was added to this cluster. This latest time is also shown
in digits above the bar. In particular, fresh topics are shown
in light green, and topics with no recently joined documents
are shown in red. The length of the colored curve inside this bar
represents the length of the active time period (i.e., the time interval
between the first and the latest document arrival).
Inside a transparent bubble, we show a group of small particles
representing the accommodated documents. Here the particles are
simply placed together to avoid complexity since they are mainly
used to show the merging and splitting of clusters (see below for
details).
5.1.2 Visualizing Subtopics
Based on our multilevel clusters, we can further visualize the
subtopics inside one topical cluster. Figure 6 depicts an example
of three subtopics shown as small bubbles inside the original big
one. To keep the view clear, the significant title and keywords are
shrunk inside the bubbles and the time bars are removed. A green
box shows the newest document title inside this cluster. Users can
switch between the sub-cluster view and the original one, as well as
control the levels.
5.1.3 Visualizing Topic Merging and Splitting
The merging and splitting of topical clusters are critical cues that
help users understand the text stream evolution. Visualizing such events is
a unique feature of our system. We implement a smooth animated
effect for this purpose. Figure 7 depicts a splitting process. Particles
inside different cluster bubbles fly away and generate new bubbles
to represent new topical clusters. The merging effect is visualized
in the same way. Such a flying effect creates a smooth change of topics.
It attracts users’ attention and promotes a clear understanding of
the effect, which is illustrated in the supplemental video.
5.2 Topic Stream Visualization
We design a topic stream view, based on the theme river, to show the
evolution of topic clusters, as shown in Figure 4(B). It provides the
context for the monitoring process. At each time axis, the topical
clusters are ordered from top to bottom by size, and similar topics
are connected by ribbons. The width of a ribbon relates to the size of
its cluster. Major keywords are shown over the ribbons. The similar
topics that evolve over time are detected by the algorithm in Section
3.2. The ribbons stream forward automatically over time during
dynamic monitoring. Users can directly observe a topic’s changes,
including emerging, splitting, merging, and disappearing, over the
recent time. Moreover, this view can be dragged backward and
forward. Users can revisit any preferred time period repeatedly
by clicking on any time axis or ribbon to explore the topical clusters
at the selected time. The stream view thus also serves as a temporal
control panel.
5.3 Detail Visualization
When users need to investigate the details of a topic, Figure 4(C)
displays the information of individual documents with time and title
in one or multiple clusters. Users can further read more contents if
needed. This view can be turned on or off, since sometimes turning
it off can help users better focus on the monitor view. When it
is turned on, it can also be used to dynamically show the newly
arriving documents, or to dynamically show the documents in a
particular cluster ordered by time.
5.4 Interactive Control
Figure 4(D) provides controls of the monitoring process for the users.
First, the dynamic process can be played forward, played backward, and
paused. Second, users can choose among the options for the different types
of displayed summary information, such as different representative
titles, as discussed in Section 4. Third, users can change
the clustering parameters for further exploration, including the
threshold that determines which particles belong to one cluster in
the graph cut (Section 3), and how many topical clusters are shown in
the dynamic monitor. Figure 8 depicts two different views when
the parameters are changed. They have different numbers of topical
clusters (5 and 10). Users can control such granularities
according to their interest. More importantly, such control operations can
be done in realtime during the monitoring process. This shows one
important merit of our dynamic clustering method: it can
achieve such different clustering effects with an immediate
switch by users, even in a dynamic process.
Figure 7: Cluster splitting effect: (a) 3 clusters at the beginning; (b) splitting with flying particles; (c) split into 4 clusters. Particles inside the original clusters fly away to generate new clusters. The merging effect is implemented in the
same style.
Figure 8: Changing clustering parameters: (a) 5 clusters; (b) 10 clusters. Different numbers of clusters
are generated with different topical details.
6 SYSTEM IMPLEMENTATION
To maximize the usability of TStreamMonitor, we develop the system
in a realtime client-server scheme. In particular, after a
text stream injects new documents into the system server, the
documents are cleaned and their keywords are extracted. Then
the server-side program updates the active list of important keywords,
performs in situ similarity computation, and uses the force-directed
algorithm to create the 2D layout of particles. Dynamic
and multilevel clustering is then performed (Section 3), followed
by the summarization (Section 4). Finally, a hierarchical data structure
of clusters is created, in which each cluster stores its summarization
information as well as the indices of the document particles belonging
to it. All these operations on the server are implemented by
optimized C++ programs for fast performance.
The server starts and maintains a realtime data exchange service.
One or more web clients can be started by connecting them to the
server over the internet. The data exchanged between
the server and clients is encoded in the JSON (JavaScript
Object Notation) format [10], a text format that is completely
language independent. The communication between server and client
is carried out through the WebSocket API [24], which enables web pages to
use the WebSocket protocol for full-duplex communication with
the remote server. The full-duplex channel is important because, in
realtime, we transfer the control parameters adjusted by users to the server
side to generate different results according to their input.
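
The two directions of the exchange can be illustrated with the JSON messages below. The message layout and every field name ("type", "clusters", "threshold", "max_clusters", and so on) are hypothetical, since the text does not specify the format, and the WebSocket transport itself is assumed to be handled elsewhere.

```python
import json

def encode_update(timestamp, clusters):
    """Server to client: serialize the hierarchical cluster structure."""
    message = {"type": "update", "time": timestamp, "clusters": clusters}
    return json.dumps(message, sort_keys=True)

def decode_control(text):
    """Client to server: parse the parameters adjusted by the user."""
    data = json.loads(text)
    if data.get("type") != "control":
        raise ValueError("not a control message")
    return {
        "threshold": float(data["threshold"]),
        "max_clusters": int(data["max_clusters"]),
    }

clusters = [
    {
        "keywords": ["Russia", "Ukraine"],
        "title": "EU Slaps Initial Sanctions on Russia ...",
        "particles": [0, 1, 2, 3],
        "children": [
            {"keywords": ["Crimea"], "title": "", "particles": [2, 3], "children": []},
        ],
    },
]
update_text = encode_update("2014-03-06T12:31:30", clusters)
control = decode_control('{"type": "control", "threshold": 40, "max_clusters": 5}')
```

Each cluster carries its summary and the indices of its particles, with sub-clusters nested under a child list, which mirrors the hierarchical data structure described for the server.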
A client’s sole task is to provide web-based visualization. It
continuously receives the information of the multilevel clusters and then
visualizes them on web pages. The visualization is completely web-based,
implemented in JavaScript and HTML5 with D3.js [5].
Documents of a text stream may arrive at any speed (with a vari-
ety of time intervals). Our system can follow the incoming speed to
inject new documents. We also implement a buffering mechanism
so that the interval between consecutive streaming documents can
be controlled. This provides a flexible scheme since we found if
the stream goes too fast, users do not have enough time for percep-
tion and forming insights. Moreover, documents can stream into
the system one by one, or following their publishing time, in which
case a set of documents may be inserted at the same time.
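
The buffering scheme can be sketched as follows: documents are grouped by publishing time into fixed windows, and one window is released per playback interval. The function names and the tuple layout are hypothetical. Keeping empty windows, so that quiet periods still take playback time, is an assumption of this sketch.

```python
def batch_by_window(docs, window_seconds):
    """Group (publish_time_in_seconds, title) pairs into consecutive time windows."""
    if not docs:
        return []
    ordered = sorted(docs)
    first = ordered[0][0] // window_seconds
    last = ordered[-1][0] // window_seconds
    batches = [[] for _ in range(last - first + 1)]
    for publish_time, title in ordered:
        batches[publish_time // window_seconds - first].append(title)
    return batches

def release_schedule(batches, interval_seconds):
    """Pair each batch with the playback time at which it enters the system."""
    return [(index * interval_seconds, batch) for index, batch in enumerate(batches)]

docs = [(0, "a"), (600, "b"), (1300, "c"), (3700, "d")]
# 20-minute windows of publishing time, one window released every 5 seconds.
schedule = release_schedule(batch_by_window(docs, 20 * 60), 5)
```

Changing the window length or the release interval adjusts the pace of the stream without touching the documents themselves.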
7 EVALUATION
To evaluate our application, Sec. 7.1 presents a case study
of a real-world data stream, and Sec. 7.2 reports a preliminary user study
with 8 participants.
7.1 Case Study: New York Times News Stream
In this case study, we explored a text stream collected from the New
York Times (www.nytimes.com). The text stream includes all (892)
news items of the New York Times on March 6th, 2014. Each individual
news document includes the keywords, published time, title, and
the first paragraph of the news. We mimic continuously arriving
documents flowing into TStreamMonitor by their publishing time,
and perform monitoring tasks over the stream. We feed all the news
of every 20 minutes into the system at an interval of 5 seconds.
On average, one news item is injected into the system per second. News
items older than 12 hours are dropped. All the computations
are completed in realtime. The supplemental video illustrates this whole
dynamic monitoring process.
Figure 9a is a snapshot of TStreamMonitor. The dynamic monitor
shows 5 topical clusters at 12:31:30 pm on March 6th, 2014. The
stream view shows the historical evolution of topics until this time.
From the labeled keywords, we can characterize the clusters as related
to “Russia” (green), “British” (purple), “Factory” (blue), “France”
(orange), and “Africa” (red), respectively. The green cluster is the
largest and includes many documents. We can investigate the details
of this topic by turning on the detail list view on the right. Moreover,
we switched this cluster to a multilevel view, where we obtained
three sub-topics, as shown in Figure 9b. Major keywords and titles
indicate the subtopics: “Russia, Crimea, Ukrainian” (top),
“Russia” (bottom right), and “Crimea” (bottom left).
Figure 9: Case study: Monitoring a text stream of New York Times news on March 6th, 2014. (a)-(f) are snapshots shown in TStreamMonitor.
Figure 9c shows the new layout at the next timestamp. The
blue cluster merged into the purple cluster after new items were
injected, and a new cluster about “China” emerged. The rectangular
label of each cluster further shows the titles of newly arriving
news. Many news items flowed into the “Russia” topical cluster.
This topic showed the representative title “EU Slaps Initial Sanctions
on Russia ...” and came with frequent keywords including “Russia”,
“Ukraine”, “Crimea” and “European”. The representative title is the
one with the highest PageRank score in the cluster. We further enlarged
the cluster and found three sub-topics (Figure 9d), which revealed an
important event involving Russia, Ukraine and Europe. This topical
cluster also became the largest one. The view thus made us aware of
the Crimea crisis and the violence in Ukraine. By studying more
details, we found that the Crimean referendum was reset to March 30th
from its initially planned date of May 25th, 2014.
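Choosing a cluster's representative title by PageRank can be sketched as below. This is an illustrative power-iteration version under assumptions: the cluster is given as a matrix of pairwise document similarities (for instance cosine similarities), the damping factor 0.85 is the conventional default, and the names `pagerank` and `representative_title` are hypothetical. The paper does not specify how the ranking graph is built.

```python
def pagerank(weights, damping=0.85, iterations=50):
    """Power-iteration PageRank on a weighted graph.

    weights[i][j] is the non-negative weight of edge i -> j,
    e.g. the similarity between documents i and j (assumed input).
    """
    n = len(weights)
    rank = [1.0 / n] * n
    out = [sum(row) for row in weights]
    for _ in range(iterations):
        new = [(1.0 - damping) / n] * n
        for i in range(n):
            if out[i] == 0:
                # Dangling node: spread its rank evenly over all nodes.
                for j in range(n):
                    new[j] += damping * rank[i] / n
            else:
                for j in range(n):
                    new[j] += damping * rank[i] * weights[i][j] / out[i]
        rank = new
    return rank


def representative_title(titles, weights):
    """Return the title of the document with the highest PageRank score."""
    ranks = pagerank(weights)
    best = max(range(len(titles)), key=lambda i: ranks[i])
    return titles[best]
```

A document that is strongly similar to many others in the cluster accumulates the most rank, so its title is a reasonable label for the whole topic.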
Monitoring the stream further, at 18:51:30 the green cluster
became the largest cluster (Figure 9e). This indicated that “California”
and “France” had become the hottest topics in the news stream at that
time. A closer look showed that several news items about Paris Fashion
Week (February 25th to March 5th, 2014) and the state of California
arrived at that time. Figure 9f advances to the end of March 6th. The
purple cluster, which is related to “Police”, “Paris Fashion Week”
and “executive”, became the largest cluster. This suggests that
entertainment and social news became the hot topic at night.
7.2 User Study
We conducted a preliminary user study to evaluate our TStreamMonitor
system. The text stream for the user study was collected from the
New York Times using the NYT API. It includes 956 news items about
“Obama”, starting on Mar. 1st, 2014 and ending on Mar. 20th, 2014.
Every 5 seconds, we injected all the news published within the next
5-hour window into our system, so that continuously arriving news
flowed into TStreamMonitor in order of publication time. On average,
2 news items are injected into our system per second. News items older
than 10 days are dropped.
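As a rough consistency check of the stated rate, assume the 20-day span (Mar. 1st to Mar. 20th inclusive) is split into back-to-back 5-hour windows and each window is replayed in one 5-second tick. Under these assumptions the arithmetic works out as:

```latex
\[
\frac{20 \times 24\ \text{h}}{5\ \text{h}} = 96\ \text{windows},
\qquad
\frac{956\ \text{news}}{96 \times 5\ \text{s}}
  = \frac{956}{480}\ \text{news/s}
  \approx 1.99\ \text{news/s} \approx 2\ \text{news/s}.
\]
```

The whole 20-day stream is thus replayed in about 480 seconds, that is, 8 minutes.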
We designed the user study to evaluate the merits of our monitoring
system in the following three aspects:
• TStreamMonitor helps users identify critical information and
trends of a text stream;
• TStreamMonitor provides an understandable and easy-to-use
interface for dynamic visualization;
• TStreamMonitor facilitates user participation in the monitoring
process.
7.2.1 Tasks
We designed three tasks in our user study. The monitoring process
follows the one described in the case study in Section 7.1. These
tasks are to:
• T1: Identify the important events during the monitoring
of the text stream. In this task, participants were asked
whether or not several salient events can be identified.
• T2: Provide feedback on the understandability of the
visualization. In this task, participants were asked: (1) whether
the topical clusters can be identified clearly; (2) whether the
splitting and merging of clusters is helpful in understanding the
events; (3) whether the topical clusters can be easily traced along
time.
• T3: Provide feedback on user participation in the monitoring
process. In this task, participants were asked to evaluate
whether or not the monitoring process can be controlled
easily.
7.2.2 Participants
The user study was conducted with 8 participants (3 females and 5
males), all of whom were computer science majors. Participants were
between 25 and 31 years old. Most of them had a basic understanding
of data mining and information visualization. None of them had used
a text stream monitoring system before. An instructor spent
15 to 20 minutes briefing the participants on the user study. This
involved explaining the tasks, introducing the interface, and
describing the concepts of document keywords and titles. Participants
could complete the tasks in their preferred order, and had 25 minutes
to complete all three tasks by using our system to explore the news
stream.
7.2.3 Results
The feedback collected from the participants was positive about our
system.
In Task T1, on average, participants could identify 70.8% of the
important events from the text stream, which corresponds to 17 of the
24 participant-event pairs across the three events detailed here.
Specifically, 7 out of 8 participants reported identifying the event
on March 2nd, and all 8 participants reported identifying the event
on March 16th. However, only 2 out of 8 participants reported
identifying the event on March 7th. A possible cause is that the size
of this topic cluster did not increase enough to attract their
attention, although it had become the largest cluster. This is because
we did not make the on-screen size of a cluster linearly proportional
to the number of documents it contains, in order to prevent small
clusters from being too small to identify on screen and big ones from
being too large. This points to one of our major directions for future
work, which is to improve the visual interface in many subtle details
by working closely with user feedback.
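The paper does not state which non-linear size mapping is used. The following sketch shows one common choice, a square-root scale clamped to a minimum and maximum radius, purely as an assumption to illustrate the trade-off discussed above. The function name `display_radius` and the pixel constants are hypothetical.

```python
import math


def display_radius(doc_count, r_min=12.0, r_max=80.0, scale=6.0):
    """Map a cluster's document count to an on-screen radius in pixels.

    Square-root scaling with clamping is an assumed mapping, not the
    one used by TStreamMonitor; r_min, r_max and scale are placeholders.
    """
    if doc_count <= 0:
        return 0.0
    radius = scale * math.sqrt(doc_count)   # sublinear growth
    return max(r_min, min(r_max, radius))   # keep clusters readable
```

With such a mapping, a cluster growing from 100 to 121 documents (21% more) only grows from a radius of 60 to 66 pixels (10% more), which illustrates how a real increase in topic size can remain visually subtle.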
In Task T2, 7 out of 8 participants reported that the topical clusters
can be identified and traced easily during dynamic monitoring.
All 8 participants agreed that the splitting and merging of clusters
makes the text stream more understandable. Finally, in Task T3,
6 out of 8 participants agreed that they can control the monitoring
process easily.
These preliminary results indicate that our TStreamMonitor system
provides a useful and efficient monitoring platform for users. It
still needs to be improved through a more refined design of its visual
representations, such as color, size, and labels.
8 CONCLUSION
We have presented a visual monitoring system, called TStreamMonitor,
which helps users understand and study the topic information of text
streams. The system facilitates real-time computation over streaming
data, uses easy-to-understand visual metaphors, and promotes user
participation in the analysis process. Users can easily characterize
salient topics in a large set of arriving documents. By observing the
merging and splitting of topics in the dynamic monitor, users can
further understand their evolutionary trends. A real-world case study
illustrated the application to monitoring realtime text streams, and
the feedback from the user study indicates a reduction in the effort
needed to learn about a text stream. Motivated by these results, we
believe that our work will have a positive impact on the monitoring of
evolving data streams.
In the future, we will study more topic and cluster summarization
techniques. Since more than 8 thematic clusters are difficult to track
at the same time, the system is currently limited when a document
stream has a much higher number of topical clusters. We will use other
data mining methods to generate more stable and reliable clustering
results. We will also perform more extensive user studies over
specific datasets.
REFERENCES
[1] C. Albrecht-Buehler, B. Watson, and D. A. Shamma. Visualizing
live text streams using motion and temporal pooling. IEEE Computer
Graphics and Applications, 25(3):52–59, 2005.
[2] J. Alsakran, Y. Chen, D. Luo, Y. Zhao, J. Yang, W. Dou, and S. Liu.
Real-time visualization of streaming text with a force-based dynamic
system. IEEE Computer Graphics and Applications, 32(1):34–45, Jan. 2012.
[3] K. Andrews, W. Kienreich, V. Sabol, J. Becker, G. Droschl, F. Kappe,
M. Granitzer, P. Auer, and K. Tochtermann. The infosky visual ex-
plorer: Exploiting hierarchical structure and document similarities.
Information Visualization, 1(3):166–181, Dec. 2002.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J.
Mach. Learn. Res., 3:993–1022, March 2003.
[5] M. Bostock, V. Ogievetsky, and J. Heer. D3 data-driven docu-
ments. IEEE Transactions on Visualization and Computer Graphics,
17(12):2301–2309, 2011.
[6] Y. Chen, L. Wang, M. Dong, and J. Hua. Exemplar-based visualization
of large document corpus (infovis2009-1115). IEEE Transactions on
Visualization and Computer Graphics, 15(6):1161–1168, 2009.
[7] W. Cui, S. Liu, L. Tan, C. Shi, Y. Song, Z. Gao, H. Qu, and
X. Tong. Textflow: Towards better understanding of evolving topics
in text. IEEE Transactions on Visualization and Computer Graphics,
17(12):2412–2421, Dec. 2011.
[8] W. Cui, H. Qu, H. Zhou, W. Zhang, and S. Skiena. Watch the story un-
fold with textwheel: Visualization of large-scale news streams. ACM
Transactions on Intelligent Systems and Technology (TIST), 3(2):20,
2012.
[9] W. Dou, X. Wang, D. Skau, W. Ribarsky, and M. X. Zhou. Leadline:
Interactive visual analysis of text data through event identification and
exploration. In Proceedings of IEEE Conference on Visual Analytics
Science and Technology, pages 93–102, 2012.
[10] Ecma-International. Ecma-404 the JSON data interchange standard.
http://www.json.org, 2013.
[11] E. R. Gansner, Y. Hu, and S. C. North. Interactive visualization of
streaming text data with dynamic maps. Journal of Graph Algorithms
and Applications, 17(4):515–540, 2013.
[12] M. Ghoniem, D. Luo, J. Yang, and W. Ribarsky. Newslab: Exploratory
broadcast news video analysis. In Proceedings of the 2007 IEEE Sym-
posium on Visual Analytics Science and Technology, pages 123–130,
2007.
[13] S. Havre, P. Whitney, and L. Nowell. Themeriver: Visualizing the-
matic changes in large document collections. IEEE Transactions on
Visualization and Computer Graphics, 8:9–20, 2002.
[14] Y. Ishikawa and M. Hasegawa. T-scroll: Visualizing trends in a time-
series of documents for interactive user exploration. Lecture Notes in
Computer Science, 4675:235–246, Nov. 2007.
[15] M. Krstajic, E. Bertini, and D. Keim. Cloudlines: compact display
of event episodes in multiple time-series. IEEE Transactions on
Visualization and Computer Graphics, 17(12):2432–2439, 2011.
[16] M. Krstajic, M. Najm-Araghi, F. Mansmann, and D. A. Keim. Incre-
mental visual text analytics of news story development. In IS&T/SPIE
Electronic Imaging, pages 829407–829407. International Society for
Optics and Photonics, 2012.
[17] H. W. Kuhn. The Hungarian method for the assignment problem.
Naval Research Logistics Quarterly, 2:83–97, 1955.
[18] B. Lee, N. H. Riche, A. K. Karlson, and S. Carpendale. SparkClouds:
Visualizing trends in tag clouds. IEEE Transactions on Visualization
and Computer Graphics, 16(6):1182–1189, 2010.
[19] J. Leskovec, L. Backstrom, and J. Kleinberg. Meme-tracking and the
dynamics of the news cycle. In Proceedings of the 15th ACM SIGKDD
international conference on Knowledge discovery and data mining,
pages 497–506, 2009.
[20] S. Liu, M. X. Zhou, S. Pan, W. Qian, W. Cai, and X. Lian. Interactive,
topic-based visual text summarization and analysis. In Proceeding
of the 18th ACM conference on Information and knowledge manage-
ment, pages 543–552, 2009.
[21] D. Luo, J. Yang, M. Krstajic, W. Ribarsky, and D. Keim. Eventriver:
Visually exploring text collections with temporal references. IEEE
Transactions on Visualization and Computer Graphics, 18(1):93–105,
2012.
[22] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation
ranking: Bringing order to the web. Technical Report, 1998.
[23] F. Paulovich and R. Minghim. Hipp: A novel hierarchical point
placement strategy and its application to the exploration of document
collections. IEEE Transactions on Visualization and Computer Graphics,
16(8):1229–1236, Nov. 2008.
[24] V. Wang, F. Salim, and P. Moskovits. The Definitive Guide to HTML5
WebSocket, Build Real-Time Applications with HTML5. Apress, 2013.
[25] F. Wei, S. Liu, Y. Song, S. Pan, M. Zhou, W. Qian, L. Shi, L. Tan, and
Q. Zhang. Tiara: a visual exploratory text analytic system. In Proc.
KDD, pages 153–162, 2010.
[26] J. A. Wise, J. J. Thomas, K. Pennock, D. Lantrip, M. Pottier, A. Schur,
and V. Crow. Visualizing the non-visual: spatial analysis and interac-
tion with information for text documents. Readings in information
visualization: using vision to think, pages 442–450, 1999.
[27] X. Zhu, A. Goldberg, J. V. Gael, and D. Andrzejewski. Improving di-
versity in ranking using absorbing random walks. HLT-NAACL, pages
97–104, 2007.
Visually Monitor Evolving Text Streams Using TStreamMonitor

  • 1. Online Submission ID: 135 TStreamMonitor: Visually Monitoring Evolving Text Streams Category: Technique ABSTRACT We develop a dynamic visualization system, TStreamMonitor, aimed at monitoring realtime text streams. The streams consist of a large amount of continuously arriving documents, such as news, blogs and emails. Visually monitoring such streams helps analysts and users quickly observe and understand the topics, events and trends of the evolving datasets. Our system encodes the incom- ing text documents into user-friendly dynamic visualizations which support realtime in situ exploration of the streams. Such encod- ing is implemented based on a multilevel clustering scheme over an evolving graph of streaming documents. The clusters discover the- matic topics of the streaming documents. They are visualized based on text summarization so that users can easily identify salient and changing focuses of the text streams. Keywords: Text Stream; Realtime Monitoring; Visual Knowledge Discovery 1 INTRODUCTION News, blogs, emails, and social media records become more and more ubiquitous. Such textual documents continuously “stream” into people’s life, creating constantly evolving text streams. In may cases, there is a real need to monitor the streams while they thrive, so that users and analysts can quickly observe and under- stand the characterization, underlying structure, trends, and events in the evolving datasets. In such a monitoring scenario, the visual- ization design should follow several principles: • Support in situ processing The visualization needs to han- dle the evolving text items as they pass through the system. Realtime computation and visualization are demanded. • Capture salient knowledge The visual elements should cap- ture sufficient knowledge of the streaming data. Critical fea- tures and trends should be easily identified and studied. • Use understandable visualization The cognitive abilities of analyzers should be taken into account. 
The dynamic visual- ization need to avoid complex metaphors and promote prompt understanding. • Facilitate user participation Users should be able to control the visualization process by tuning the system for their pre- ferred outcome, so as to maintain flexible analytical utility. Most of current text visualization techniques, like many analyt- ical processes, rely on post-processing of stored data. They are successful in visualizing features and trends in text collections, but are not directly capable of monitoring realtime text streams when there is no prior knowledge of incoming items. Few efforts are con- ducted over visualizing text streams, which however do not satisfy all the principles. User may struggle to identify salient features in the rapidly changing and complex visualizations. In this paper, we develop a visual monitoring system, namely TStreamMonitor, for users to study evolving text streams. It en- codes the actively incoming text documents into user-friendly dy- namic visualizations that characterize evolving salient features. TStreamMonitor is developed with respect to the design principles above. Using an evolving stream of daily news as an example, our system is designed: • To support in situ processing, the computation and visual- ization has no reliance on priori knowledge of the incoming news. Instead, it is assumed that unknown news items contin- uously flow into the system for users to characterize. • To capture salient knowledge, the system visualizes the dynamically-formed clusters of a large amount of news items to elucidate the hidden thematic information. The clusters represent specific news topics of the rapidly changing stream, such as Ukraine crisis, Malaysia Airlines MH370, etc. • To promote easy understanding, instead of displaying lots of individual news items contained in each cluster, the dynamic visualization provides a lightweight interface with minimal clutters and distractions. 
Users are capable to identify essen- tial insights easily and promptly. • To facilitate user interaction, the size of the clusters can be controlled by users in realtime together with the dynamic vi- sualization. A big cluster can also be visualized as a set of sub-clusters representing sub-topics. Consequently, the formed clusters and sub-clusters show different levels of news topics. For example, Ukraine crisis has sub-topics of Ukraine protest and Crimea votes. To support such designs of realtime stream processing, TStream- Monitor starts from a dynamically-changing force directed layout of individual documents. The particles representing documents then create an evolving graph by 2D geometric triangulation. Then topical clusters are generated based on user controlled graph cut. This approach needs no special text clustering methods, as they mostly perform batch processing over the whole dataset, which is incapable for visualizing continuous arriving documents. More- over, the clusters are naturally formed in a multilevel scheme where user’s control of their sizes and levels are enabled. The topical clusters are analyzed by text summarization utilizing keyword frequency, importance weights and graph-based ranking algorithms (e.g. PageRank [22]). The salient information of impor- tant clusters are visualized for easy identification and understand- ing, instead of massive individual documents. Moreover, the visu- alization clearly shows the splitting and merging of topical clusters, enabled by identifying temporal topic evolution through an opti- mized matching algorithm. Their evolutionary history is further showed over a topic stream view, which also plays a role for user interaction, e.g. playback. The TStreamMonitor system achieves a new web-based platform which supports effective realtime monitoring and study of evolving text streams. 
It can be used in visual exploration of news stories, blogs, emails, business transactions, and many more text datasets, where they will raise the level of understanding of realtime ana- lyzers. We implemented the prototype system and conducted case studies using a dataset of New York Times news. We also per- formed a user study to evaluate the system. 2 RELATED WORK Visualizing large text corpora is an emerging research topic. Tag cloud techniques (e.g. [18]) use visual depictions of tags (or words) giving greater prominence to words that appear more fre- quently. Similarity-based projections help users get insights from large text collections in 2D views. Typically, multidimensional 1
  • 2. Online Submission ID: 135 Figure 1: TStreamMonitor system framework. A text stream continuously injects arriving documents into the system, which provides an interac- tive, realtime monitoring interface for users to quickly characterize and understand the streaming data. scaling (MDS) or force-directed methods are used to form 2D lay- out from the text documents. For instance, “galaxies” or “moun- tains” are formed in the displays [26]. A hierarchy of the docu- ments are built and projected as circles in [23]. InfoSky [3] exploits hierarchically structured documents at each level with Voronoi dia- grams. Examplar-based visualization [6] visualizes extremely large text corpus by probabilistic MDS projection with approximation and decomposition. Alsakran et al. [2] use a dynamic similarity- based projection system to depict text streams. Gansner et al. [11] draw the dynamic graph of streaming documents by MDS. Text collections often include time-stamped documents. The- meRiver [13] and LensRiver [12] depict the frequency changes of keywords as river currents. T-scroll [14] employs a novelty-based clustering algorithm on time-series documents. StoryTracker [16] presents a incremental visual analytics system for exploration of news topics in dynamic information streams. Meme-tracking [19] extracts keywords from news corpus to generate themes, and then visualizes them to indicate the flow of stories. Utilizing text mining tools such as topic modeling, some efforts have been made for vi- sual analysis of text collections, including EventRiver [21], Cloud- lines [15], Leadline [9], visual text summary [20], TIARA [25], TextFlow [7], Textpool [1], and TextWheel [8]. These methods make great success in visualizing the temporal trends in archives of text. However, most of them are not directly designed for moni- toring live text streams. The most related existing work to TStreamMonitor is from Al- sakran et al. [2] and Gansner et al. 
[11], which are targeted at realtime monitoring and dynamic visualization of text streams. In [11], the modularity clustering is used after MDS projects tweet documents onto 2D layout. They use a map metaphor to depict the clusters, where highly related messages form countries enclosed by a boundary. Each country has a color different from its neighbors. Such approach is to create static layout at a time. To handle dynamic text streams, a two-step approach is used. A newly arriving document is placed at the average of its neighbors in the previous graph and local optimization of MDS layout is ap- plied. However, a special Procrustes transformation has to be ap- plied to make the new graph best aligned to the old one. Moreover, to avoid overlapping components after local MDS, a packing algo- rithm has to be introduced. In [2], such dynamic stability is handled through continuous force-directed simulation. The dynamic evolu- tion of documents inside the 2D projected layout is seamless and smooth. However, this approach visualizes all the particles inside clusters, whose layout is too complex. And the dynamic motion of lots of particles is disturbing for users to focus on salient informa- tion. Similarly, the dynamic maps approach [11] including many cities and countries on the screen may also impose heavy burden for observers. In our approach, we use force-directed layout for dynamic evolving graph, and propose a lightweight visualization based on the dynamic clusters directly computed from the 2D ge- ometric distribution of documents. More importantly, these related approaches do not grant easy access and control for users to manip- ulate clusters and their topics in visualization. Our approach instead computes and depicts multiple levels of clusters and their dynamic merging and splitting in a fast and effective way. TStreamMonitor thus helps users to interactively discover interesting topics and their subtopics with an easy-to-read interface. 
3 DYNAMIC TOPICAL CLUSTERS Fig. 1 illustrates the processing framework of TStreamMonitor. A text stream continuously injects arriving documents into the sys- tem. Each document has its title, content, and time stamp. It is typically represented by a sequence of keywords, based on which we compute the similarities between pair-wise documents. Such approach is the most popular process used in text clustering and visualization works. We use the cosine similarity in this computa- tion. For in situ computation, we update a current set of keywords once new documents arrive. Not all the keywords are included in the set, while the TF-IDF (Term Frequency-Inverse Document Fre- quency) weights are used to find the most important N keywords for similarity computation. A 2D layout of active documents is generated, in which similar documents are placed closely. To achieve this we adopted a force- directed method. Over this 2D layout, we use triangulation over the document particles which creates an evolving graph. Then, the edges having a length smaller than a given threshold are removed to find document clusters. This approach is based on the actively evolving graph, similar to the work [2]. It can automatically han- dle continuously incoming documents and create clusters over these documents. It has a mechanism for dropping old and stale docu- ments. Our contribution behind the new visual interface is further extending this method to find multilevel clusters and handle evolu- tionary topics. 3.1 Multilevel Clusters Our system can find multilevel clusters, as shown in Figure 2. Fig- ure 2a is the evolving graph created after applying Delaunay tri- angulation to the 2D layout of document particles. By applying graph cut with a threshold φ1, we can discover several clusters over the graph in Figure 2b. We further can apply a threshold φ2 inside these clusters, where some sub-clusters are found, which has shown in different particle colors in Figure 2c. 
If needed, more clusters can be identified inside these sub-clusters (Figure 2d). Important clusters and their sub-clusters represent the thematic knowledge of the text stream: each cluster can be regarded as a topic, and its sub-clusters as related subtopics. Such a hierarchical structure supports visual analytics by characterizing critical topics while also exposing the subtopics inside them.

In [2], although some clusters are identified, the visualization still shows individual particles over the force-directed layout. Such a visualization is too complex for users to follow the changing graph. Moreover, it cannot easily visualize sub-clusters and their relationships. Instead, our visualization design is based on the hierarchical clusters to show more salient knowledge of the text stream.

Figure 2: Multilevel clusters are easily created from the evolving graph. Their generation is controlled by threshold parameters and can be achieved immediately. (a) Evolving graph; (b) Discovered clusters; (c) Sub-clusters; (d) Sub-sub-clusters.

3.2 Evolutionary Topical Clusters
The key problem in mining and visualizing text streams is to identify the evolutionary topics. In the monitoring scenario this is even more challenging, as many analytical tools, such as the widely used LDA (Latent Dirichlet Allocation) model [4], cannot be directly applied to actively arriving documents. In TStreamMonitor, we find the topical evolution based on the relationships between document particles. This approach naturally models the variation of topical clusters without relying on text mining techniques. It can be completed immediately during realtime monitoring, which is vital for in situ processing.

To trace the evolving clusters, we match the pairs of clusters at consecutive times that share most of their document particles. In detail, consider two clusters, $C_i$ and $C_j$, at consecutive times t and t', each of which accommodates a set of particles. We compute a match factor $\delta_{ij}$ between the two clusters as:

$$\delta_{ij}(C_i, C_j) = \frac{N(C_i \cap C_j)}{N(C_i)}, \qquad (1)$$

where $N(C)$ is the number of particles inside a cluster $C$. Then, given a set of clusters $\{C_i\}$, i = 1···N, found at time t, and a set of clusters $\{C_j\}$, j = 1···M, formed at time t', we need to find the best matching pair $C_i \to C_j$ for each $C_i$. This cluster matching problem can be solved by a combinatorial optimization algorithm, the Hungarian method [17]. This method provides a better solution than the naive matching implementation in [2]. In particular, we have a nonnegative N × M matrix, where the element in the i-th row and j-th column is $\delta_{ij}(C_i, C_j)$. Each element can be regarded as the weight of assigning $C_i \to C_j$.
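The matching step can be sketched as follows. The sketch computes the match factor of Eq. (1) and then searches for a one-to-one assignment, assuming that the intended objective favors pairs with high overlap. An exhaustive search over permutations stands in for the Hungarian method here, which is only practical for a handful of clusters. The function names are hypothetical, and the default `min_match` of 0.15 mirrors the lower bound the implementation places on matched pairs.

```python
from itertools import permutations


def match_factor(ci, cj):
    """Eq. (1): fraction of the particles in ci that also appear in cj."""
    return len(ci & cj) / len(ci) if ci else 0.0


def match_clusters(prev, curr, min_match=0.15):
    """Map each index in `prev` to at most one index in `curr`.

    Exhaustive search over one-to-one assignments (a stand-in for the
    Hungarian method), maximizing the summed match factor.
    """
    delta = [[match_factor(ci, cj) for cj in curr] for ci in prev]
    # Pad with None when there are fewer current clusters than previous ones,
    # so that some previous clusters can remain unmatched.
    cols = list(range(len(curr))) + [None] * max(0, len(prev) - len(curr))
    best, best_score = (), -1.0
    for perm in permutations(cols, len(prev)):
        score = sum(delta[i][j] for i, j in enumerate(perm) if j is not None)
        if score > best_score:
            best, best_score = perm, score
    # Discard pairs that share too few particles.
    return {i: j for i, j in enumerate(best)
            if j is not None and delta[i][j] >= min_match}


prev = [{1, 2, 3, 4}, {5, 6, 7}]
curr = [{5, 6, 7, 8}, {1, 2, 3}, {9}]
matches = match_clusters(prev, curr)
# Current clusters without a predecessor are treated as newly emerging.
emerging = [j for j in range(len(curr)) if j not in matches.values()]
```

In this toy input the first previous cluster is matched to the second current cluster and the second previous cluster to the first, while the third current cluster has no predecessor.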
We apply the Hungarian algorithm to find an optimal assignment over all i = 1···N. Because $\delta_{ij}$ measures overlap, the optimal assignment is the one that maximizes the total match factor $\sum_i \delta_{ij}(C_i, C_j)$, or equivalently minimizes a total cost such as $\sum_i (1 - \delta_{ij}(C_i, C_j))$. In our implementation, we further extend the algorithm by setting a lower bound of $\delta_{ij}(C_i, C_j) = 0.15$ on the match factor of any matched pair: if the match factor is too small, i.e., the two clusters do not share enough documents, we do not treat $C_j$ as the cluster evolved from $C_i$. Based on this result, we can identify the merging and splitting of the topical clusters from t to t'. Finally, any cluster at time t' that has no match from the previous time is considered a newly emerging cluster.

4 TOPICAL CLUSTER SUMMARIZATION
Visualizing the topical clusters relies on a fast, automatic way to find and present the most important points of the original documents inside a cluster or sub-cluster. Due to the limited screen real estate, only a very small set of words, phrases, or sentences can be used in the visual representation. For evolving text streams, we use three types of information to quickly convey the important content of a topical cluster:
1. Active time period: Topics in a text stream evolve over time. During monitoring, an active cluster that keeps receiving new documents may need more attention. The active time period of a cluster conveys how the cluster varies over time.
2. Representative keywords: Just as a document is represented by a set of important keywords, a cluster is represented by some representative keywords. These can be selected as the most frequent keywords or as the most highly weighted keywords (e.g. by TF-IDF).
3. Significant titles: Significant documents inside a cluster usually carry the important information. Visualizing their titles helps users understand the topic. The most significant document can be found in different ways:
• The most recently arrived document;
• The document that has the largest average similarity to the other documents;
• The document that has the highest PageRank [22] among the documents.
Figure 3 is an example of the three different particles found in a cluster with these three options. It indicates the particles and the titles of their corresponding documents in different colors. The representative hottest keywords are also shown. We implement the algorithms for all of these options. In realtime monitoring, users can switch between them freely without disturbing the dynamic process. In the future, we will study more summarization techniques (e.g. [27]) to better represent a topical cluster.

In TStreamMonitor, we present this information through simple visualizations so that users can promptly identify the topical information.

Figure 3: Summarization of a cluster. The hottest keywords are shown (in this example: #United States International Relations, #Ukraine, #Russia, #Putin, Vladimir V). Moreover, three different methods are used to find important documents, whose titles are shown in different colors: (Green) the most recently arrived document; (Blue) the document having the largest average similarity to the other documents; (Red) the document having the highest PageRank [22].

Figure 4: TStreamMonitor interface. (A) Dynamic Topic Monitor; (B) Topic Stream View; (C) Detail View; (D) Control Panel.

5 TSTREAMMONITOR VISUALIZATION DESIGN
Figure 4 is an overview of the TStreamMonitor interface. Figure 4(A) is the dynamic topic monitor, which is used to monitor the dynamic changes of topical clusters in an active text stream. Figure 4(B) is the topic stream view, which shows the temporal evolution of the topics. It also serves as a panel for exploring historical changes on demand. Figure 4(C) is the detail view, which shows the details of documents. In Figure 4(D) users can adjust several parameters and thresholds, and control playback, pause and forward.

5.1 Dynamic Topic Monitor
The key function of TStreamMonitor is to provide an easy-to-read view of the important topics discovered as topical clusters. When visualizing dynamic scenes, too many visual elements on the interface usually confound observers' perception and understanding, since the changes of complex visual metaphors can easily become disturbing.
Therefore, our major design rationale is to provide an interface that is:
• lightweight: with a limited number of visual elements that do not overwhelm or distract observers during monitoring;
• informative: including the information users need to quickly identify the topical contents;
• smooth: with minimal abrupt changes during topic evolution;
• controllable: giving users easy parameter adjustment that leads to natural and sleek view changes.

5.1.1 Visualizing Topic Clusters
For a lightweight topic monitor, we do not overwhelm users by visualizing all the individual document particles as in [2]. Instead, we only show the discovered important topics. In Figure 4(A), five important topics are displayed, each as a circular bubble. A bubble represents one cluster containing a set of documents. The size of a bubble is determined by the total number of documents inside the cluster. The bubble's center location is computed from the locations of the document particles, which are originally defined by the force-directed layout of the evolving graph (Figure 2).
Therefore, in the simplified view, the distances between the cluster bubbles represent the similarities of the topics: related topics are placed close together.

A straightforward solution is to place the bubble's center at the average location of its particles. However, during dynamic evolution, such an average location jitters slightly at each time step, since a location change of any single particle moves the average. This temporal instability greatly distracts observers during animation. We instead apply a simple but effective algorithm. When a cluster bubble is drawn for the first time, we find the document particle that is closest to the average location and use this particle's location as the bubble center. In this way, the jitter is removed, and the motion of the bubble becomes stable and consistent.

Figure 5: One topical cluster shown as a bubble. The associated information includes: the active time period shown in the curved color bar, four representative keywords in the linked labels, and one significant title shown at the top. The rectangular label further shows the title of the latest arriving document in the whole TStreamMonitor system, which happens to be in this cluster.

Figure 6: Subtopic view of a topical cluster. Representative keywords and significant titles are shown for each of the three sub-clusters.

Figure 5 is a zoomed-in view of one topical cluster. The three types of summarized information discussed in Section 4 are visualized to provide a succinct and informative view. At the top, a significant title inside this cluster is shown. The linked labels present the representative keywords of this topic. A rectangular text box with black letters shows the title of the latest arriving document in the whole system.
Moreover, a curved bar over the bubble encodes the most active time of this topic in color, which reflects the latest time a new document was added to this cluster. This latest time is also shown in digits above the bar. In particular, fresh topics are shown in light green, and topics with no recently joined documents are shown in red. The length of the colored curve inside this bar represents the length of the active time period (i.e., the time interval between the first and the latest document arrival). Inside a transparent bubble, we show a group of small particles representing the accommodated documents. Here the particles are simply placed together to avoid complexity, since they are mainly used to show the merging and splitting of clusters (see below for details).

5.1.2 Visualizing Subtopics
Based on our multilevel clusters, we can further visualize the subtopics inside one topical cluster. Figure 6 depicts an example of three subtopics shown as small bubbles inside the original big one. To keep the view clear, the significant title and keywords are shrunk to fit inside the bubbles and the time bars are removed. A green box shows the newest document title inside this cluster. Users can switch between the sub-cluster view and the original one, as well as control the number of levels.

5.1.3 Visualizing Topic Merging and Splitting
The merging and splitting of topical clusters are critical cues for understanding the evolution of the text stream. Visualizing such events is a unique feature of our system. We implement a smooth animated effect for this purpose. Figure 7 depicts a splitting process. Particles inside different cluster bubbles fly away and form new bubbles that represent new topical clusters. The merging effect is visualized in the same way. Such a flying effect creates a smooth change of topics. It attracts users' attention and promotes a clear understanding of the event, as can be seen in the supplemental video.
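Before a split or merge can be animated, the corresponding event has to be recognized from the clusters at two consecutive times. The paper does not spell out this step, so the following is only one plausible sketch: a previous cluster that sends a sufficient share of its particles to two or more current clusters is treated as a split, and a current cluster that receives such shares from two or more previous clusters is treated as a merge. The function name, the dictionary-based cluster representation, and the reuse of 0.15 as the `min_share` threshold are all assumptions.

```python
def detect_events(prev, curr, min_share=0.15):
    """Classify cluster transitions between two time steps.

    prev, curr: dicts mapping a cluster id to its set of particle ids.
    Returns (splits, merges), each a dict of id -> list of related ids.
    """
    # A link i -> j exists when enough of cluster i's particles end up in j.
    links = [(i, j)
             for i, ci in prev.items()
             for j, cj in curr.items()
             if ci and len(ci & cj) / len(ci) >= min_share]

    targets, sources = {}, {}
    for i, j in links:
        targets.setdefault(i, []).append(j)
        sources.setdefault(j, []).append(i)

    splits = {i: js for i, js in targets.items() if len(js) > 1}
    merges = {j: ids for j, ids in sources.items() if len(ids) > 1}
    return splits, merges


prev = {"A": {1, 2, 3, 4}, "B": {5, 6}, "C": {7, 8}}
curr = {"X": {1, 2}, "Y": {3, 4}, "Z": {5, 6, 7, 8}}
splits, merges = detect_events(prev, curr)
```

In this toy input, cluster A splits into X and Y while B and C merge into Z. The shared particle sets (`ci & cj`) identify which particles would fly to which new bubble.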
5.2 Topic Stream Visualization
We design a topic stream view, based on the theme river, to show the evolution of topical clusters, as shown in Figure 4(B). It provides the context for the monitoring process. At each time axis, the topical clusters are ordered from top to bottom by size, and similar topics are connected by ribbons. The width of a ribbon relates to the size of the cluster. Major keywords are shown over the ribbons. Similar topics that evolve over time are detected by the algorithm in Section 3.2. The ribbons stream forward automatically over time during dynamic monitoring. Users can directly observe a topic's changes over the recent time, including emerging, splitting, merging, and disappearing. Moreover, this view can be dragged backward and forward. Users can revisit any time period repeatedly by clicking on a time axis or ribbon to explore the topical clusters at the selected time. The stream view thus also plays the role of a temporal control panel.

5.3 Detail Visualization
When users need to investigate the details of a topic, Figure 4(C) displays the individual documents, with time and title, in one or multiple clusters. Users can read more of the content if needed. This view can be turned on or off, since turning it off sometimes helps users focus on the monitor view. When it is turned on, it can also be used to dynamically show the newly arriving documents, or the documents in a particular cluster ordered by time.

5.4 Interactive Control
Figure 4(D) provides controls over the monitoring process. First, the dynamic process can be played forward, played backward, and paused. Second, users can choose among the options for the displayed summary information, such as the different representative titles discussed in Section 4.
Third, users can change the clustering parameters for further exploration, including the threshold that determines which particles belong to one cluster in the graph cut (Section 3) and the number of topical clusters shown in the dynamic monitor. Figure 8 depicts two different views obtained by changing the parameters. They have different numbers of topical clusters (5 and 10). Users can control such granularities according to their interest. More importantly, such control operations can be done in realtime during the monitoring process. This shows one important merit of our dynamic clustering method: it can switch between different clustering results immediately at the user's request, even during a dynamic process.

Figure 7: Cluster splitting effect. Particles inside the original clusters fly away to generate new clusters. The merging effect is implemented in the same style. (a) 3 clusters at the beginning; (b) Splitting with flying particles; (c) Split to 4 clusters.

Figure 8: Changing clustering parameters. Different numbers of clusters are generated with different topical details. (a) 5 clusters; (b) 10 clusters.

6 SYSTEM IMPLEMENTATION
To maximize the usability of TStreamMonitor, we develop the system in a realtime client-server scheme. In particular, after a text stream injects new documents into the system server, the documents are cleaned and their keywords are extracted. Then the server-side program updates the active list of important keywords, performs in situ similarity computation, and uses the force-directed algorithm to create the 2D layout of particles. Dynamic and multilevel clustering is then performed (Section 3), followed by summarization (Section 4). Finally, a hierarchical data structure of clusters is created, in which each cluster stores its summarization information as well as the indices of the document particles that belong to it. All these operations on the server are implemented as optimized C++ programs for fast performance.

The server starts and maintains a realtime data exchange service. One or more web clients can be started by connecting them to the server over the internet. The data exchange between the server and clients uses the JSON (JavaScript Object Notation) format [10], a text format that is completely language independent.
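As an illustration of such a data exchange, the sketch below serializes a small cluster hierarchy to JSON. The actual message schema of TStreamMonitor is not given in the paper, so every field name here (`id`, `keywords`, `title`, `particles`, `children`, `time`) and both helper functions are hypothetical, and Python is used only for illustration rather than the C++ of the actual server.

```python
import json


def cluster_to_dict(cluster):
    """Recursively convert a cluster record into a JSON-serializable dict."""
    return {
        "id": cluster["id"],
        "keywords": cluster["keywords"],
        "title": cluster["title"],
        # Indices of the document particles that belong to this cluster.
        "particles": sorted(cluster["particles"]),
        "children": [cluster_to_dict(c) for c in cluster.get("children", [])],
    }


def encode_update(timestamp, clusters):
    """Build one update message for the web clients."""
    return json.dumps({
        "time": timestamp,
        "clusters": [cluster_to_dict(c) for c in clusters],
    })


sub = {"id": "c1.1", "keywords": ["Crimea"],
       "title": "Example subtopic title", "particles": {7, 3}}
top = {"id": "c1", "keywords": ["Russia", "Ukraine"],
       "title": "Example topic title", "particles": {3, 7, 12},
       "children": [sub]}

message = encode_update("2014-03-06T12:31:30", [top])
decoded = json.loads(message)  # what a client would see after parsing
```

Nesting sub-clusters under a `children` key keeps the multilevel structure explicit, so a client can render either the top-level bubbles or the subtopic view from the same message.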
The communication between server and client uses the WebSocket API [24], which enables web pages to use the WebSocket protocol for full-duplex communication with the remote server. The full-duplex channel is important because, in realtime, the control parameters adjusted by users are transferred to the server, which generates different results according to this input.

A client's sole task is to provide web-based visualization. It continuously receives the information of the multilevel clusters and visualizes it on web pages. The visualization is completely web-based, implemented in JavaScript and HTML5 with D3.js [5].

Documents of a text stream may arrive at any speed (with a variety of time intervals). Our system can inject new documents at the incoming speed. We also implement a buffering mechanism so that the interval between consecutive streaming documents can be controlled. This provides a flexible scheme, since we found that if the stream goes too fast, users do not have enough time to perceive it and form insights. Moreover, documents can stream into the system one by one, or according to their publishing time, in which case a set of documents may be inserted at the same time.

7 EVALUATION
To evaluate our application, Sec. 7.1 presents a case study on a real-world data stream, and Sec. 7.2 reports a preliminary user study with 8 participants.

7.1 Case Study: New York Times News Stream
In this case study, we explored a text stream collected from the New York Times (www.nytimes.com). The text stream includes all (892) news items of the New York Times on March 6th, 2014. Each news document includes the keywords, published time, title, and first paragraph of the news item. We mimic continuously arriving documents flowing into TStreamMonitor by their publishing time, and perform monitoring tasks over the stream. We put the news of every 20 minutes into the system at an interval of 5 seconds. On average, one news item is injected into the system per second.
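The replay scheme used in the case study (all news of every 20 minutes injected at 5-second intervals) can be sketched as below. This is a minimal sketch under stated assumptions: windows are anchored at the first document's publishing time, empty windows are skipped, and the `inject` callback stands in for whatever entry point the server exposes. The function names and sample documents are hypothetical.

```python
import time
from datetime import datetime, timedelta


def batch_by_window(docs, window=timedelta(minutes=20)):
    """Group (published_time, title) pairs into consecutive time windows."""
    docs = sorted(docs)
    if not docs:
        return []
    start = docs[0][0]
    batches = {}
    for published, title in docs:
        index = (published - start) // window  # integer window number
        batches.setdefault(index, []).append(title)
    return [batches[k] for k in sorted(batches)]


def replay(batches, inject, interval=5.0, sleep=time.sleep):
    """Inject one batch, wait `interval` seconds, inject the next, and so on."""
    for n, batch in enumerate(batches):
        if n:
            sleep(interval)
        inject(batch)


docs = [
    (datetime(2014, 3, 6, 0, 0), "doc A"),
    (datetime(2014, 3, 6, 0, 5), "doc B"),
    (datetime(2014, 3, 6, 0, 25), "doc C"),
    (datetime(2014, 3, 6, 1, 10), "doc D"),
]
batches = batch_by_window(docs)
received = []
replay(batches, received.append, sleep=lambda s: None)  # no real waiting here
```

Passing `sleep` as a parameter keeps the pacing adjustable, which matches the buffering mechanism's goal of controlling the interval between consecutive injections.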
News items older than 12 hours are dropped. All computation is completed in realtime. The supplemental video illustrates this whole dynamic monitoring process as an animation.

Figure 9a is a snapshot of TStreamMonitor. The dynamic monitor shows 5 topical clusters at 12:31:30 pm on March 6th, 2014. The stream view shows the historical evolution of topics up to this time. From the labeled keywords, we can characterize the clusters as related to “Russia” (green), “British” (purple), “Factory” (blue), “France” (orange), and “Africa” (red), respectively. The green cluster is the largest and includes many documents. We can investigate the details of this topic by turning on the detail list view on the right. Moreover, we switched this cluster to a multilevel view, which revealed three subtopics, as shown in Figure 9b. Major keywords and titles indicate the subtopics: “Russia, Crimea, Ukrainian” (top), “Russia” (bottom right), and “Crimea” (bottom left).

Figure 9c shows the new layout at the next timestamp. The blue cluster merged into the purple cluster after new items were injected, and a new cluster about “China” emerged. The rectangular label of each cluster further shows the titles of newly arriving news. Many news items flowed into the “Russia” topical cluster. This topic showed the representative title “EU Slaps Initial Sanctions on Russia ...” and came with frequent keywords including “Russia”, “Ukraine”, “Crimea” and “European”. Here we showed the title with the highest PageRank in the cluster. We further enlarged the cluster and found three subtopics (Figure 9d). From these we could identify an important event involving Russia, Ukraine and Europe, and this topical cluster became the largest one. This made us aware of the Crimea crisis and the violence in Ukraine. By studying more details, we found that the Crimean referendum was reset to March 30th from its initially planned date of May 25th, 2014.

Monitoring the stream further, at 18:51:30 the green cluster became the largest cluster (Figure 9e). It indicated that “California” and “France” became the hottest topics in the news stream at that time. By studying more details, we found that several news items about the Paris Fashion Week (February 25th to March 5th, 2014) and the state of California had arrived at that time.

Figure 9f advances to the end of March 6th. The purple cluster, which is related to “Police”, “Paris Fashion Week” and “executive”, became the largest cluster. This told us that entertainment and social news became the hot topic at night.

Figure 9: Case study: Monitoring a text stream of New York Times news on March 6th, 2014. (a)-(f) are snapshots shown in TStreamMonitor.

7.2 User Study
We conducted a preliminary user study to evaluate our TStreamMonitor system. The text stream for the user study was collected from the New York Times using the NYT API. It includes 956 news items about “Obama”, starting on Mar. 1st, 2014 and ending on Mar. 20th, 2014. We put the news of every 5 hours into our system at an interval of 5 seconds, so that continuously arriving news flowed into TStreamMonitor by publishing time. On average, 2 news items are injected into our system per second. News items older than 10 days are dropped.
We designed the user study to evaluate the merits of our monitoring system in the following three aspects:
• TStreamMonitor helps users identify critical information and trends of a text stream;
• TStreamMonitor provides an understandable and easy-to-use interface for dynamic visualization;
• TStreamMonitor facilitates user participation in the monitoring process.

7.2.1 Tasks
We designed three tasks in our user study. The monitoring process is the one described in the case study in Section 7.1. The tasks are to:
• T1: Identify the important events during the monitoring of the text stream. In this task, participants were asked whether or not several salient events could be identified.
• T2: Provide feedback on the understandability of the visualization. In this task, participants were asked: (1) whether the topical clusters can be identified clearly; (2) whether the splitting and merging of clusters is helpful in understanding the events; (3) whether the topical clusters can be easily traced over time.
• T3: Provide feedback on user participation in the monitoring process. In this task, participants were asked to evaluate whether or not the monitoring process can be controlled easily.

7.2.2 Participants
The user study was conducted with 8 participants (3 females and 5 males), all of whom were computer science majors. Participants were between 25 and 31 years old. Most of them had a basic understanding of data mining and information visualization. None had used a text stream monitoring system before. An instructor spent 15 to 20 minutes briefing the participants on the user study. This involved explaining the tasks, introducing the interface, and describing the concepts of keywords and document titles. Participants were free to complete the tasks in their preferred order, and had 25 minutes to complete all three tasks by using our system to explore the news stream.
7.2.3 Results
The feedback collected from the participants was positive overall.

In Task T1, on average, participants could identify 70.8% of the important events in the text stream. In detail, 7 out of 8 participants reported identifying the event on March 2nd, and all 8 participants reported identifying the event on March 16th. However, only 2 out of 8 participants reported identifying the event on March 7th. A possible cause is that the bubble of this topical cluster did not grow enough to attract their attention, although it had become the largest cluster. The bubble size is not linearly proportional to the cluster size, which prevents small clusters from being too small to identify on screen and big ones from being too large. This points to one of our major directions for future work: improving the visual interface in many subtle details by working closely with user feedback.

In Task T2, 7 out of 8 participants reported that the topical clusters can be identified and traced easily during dynamic monitoring. All 8 participants agreed that the splitting and merging of clusters makes the text stream more understandable.

Finally, in Task T3, 6 out of 8 participants agreed that they can control the monitoring process easily.

These preliminary results indicate that our TStreamMonitor system provides a useful and efficient monitoring platform for users. It still needs a more refined design of the visual representations, such as color, size, and labels.

8 CONCLUSION
We have presented a visual monitoring system, called TStreamMonitor, which helps users understand and study the topic information of text streams. The system facilitates real-time computation over streaming data, uses easy-to-understand visual metaphors, and promotes user participation in the analysis process.
Users can easily characterize salient topics in a large set of arriving documents. By observing the topics' merging and splitting in the dynamic monitor, users can further understand their evolutionary trends. A real-world case study illustrated the application to monitoring realtime text streams, and the feedback from the user study suggests that the system reduces the effort needed to understand a text stream. Motivated by these results, we believe that our work will have a positive impact on the monitoring of evolving data streams.

In the future, we will study more topic and cluster summarization techniques. Because more than 8 thematic clusters are difficult to track at the same time, the system is limited when a document stream has a much larger number of topical clusters. We will use other data mining methods to generate more stable and reliable clustering results. We will also perform more extensive user studies over specific datasets.

REFERENCES
[1] C. Albrecht-Buehler, B. Watson, and D. A. Shamma. Visualizing live text streams using motion and temporal pooling. IEEE Computer Graphics and Applications, 25(3):52–59, 2005.
[2] J. Alsakran, Y. Chen, D. Luo, Y. Zhao, J. Yang, W. Dou, and S. Liu. Real-time visualization of streaming text with a force-based dynamic system. IEEE Comput. Graph. Appl., 32(1):34–45, Jan. 2012.
[3] K. Andrews, W. Kienreich, V. Sabol, J. Becker, G. Droschl, F. Kappe, M. Granitzer, P. Auer, and K. Tochtermann. The infosky visual explorer: Exploiting hierarchical structure and document similarities. Information Visualization, 1(3):166–181, Dec. 2002.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, March 2003.
[5] M. Bostock, V. Ogievetsky, and J. Heer. D3 data-driven documents. IEEE Transactions on Visualization and Computer Graphics, 17(12):2301–2309, 2011.
[6] Y. Chen, L. Wang, M. Dong, and J. Hua. Exemplar-based visualization of large document corpus (infovis2009-1115).
IEEE Transactions on Visualization and Computer Graphics, 15(6):1161–1168, 2009.
[7] W. Cui, S. Liu, L. Tan, C. Shi, Y. Song, Z. Gao, H. Qu, and X. Tong. Textflow: Towards better understanding of evolving topics in text. IEEE Transactions on Visualization and Computer Graphics, 17(12):2412–2421, Dec. 2011.
[8] W. Cui, H. Qu, H. Zhou, W. Zhang, and S. Skiena. Watch the story unfold with textwheel: Visualization of large-scale news streams. ACM Transactions on Intelligent Systems and Technology (TIST), 3(2):20, 2012.
[9] W. Dou, X. Wang, D. Skau, W. Ribarsky, and M. X. Zhou. Leadline: Interactive visual analysis of text data through event identification and exploration. In Proceedings of IEEE Conference on Visual Analytics Science and Technology, pages 93–102, 2012.
[10] Ecma-International. Ecma-404 the JSON data interchange standard. http://www.json.org, 2013.
[11] E. R. Gansner, Y. Hu, and S. C. North. Interactive visualization of streaming text data with dynamic maps. Journal of Graph Algorithms and Applications, 17(4):515–540, 2013.
[12] M. Ghoniem, D. Luo, J. Yang, and W. Ribarsky. Newslab: Exploratory broadcast news video analysis. In Proceedings of the 2007 IEEE Symposium on Visual Analytics Science and Technology, pages 123–130, 2007.
[13] S. Havre, P. Whitney, and L. Nowell. Themeriver: Visualizing thematic changes in large document collections. IEEE Transactions on Visualization and Computer Graphics, 8:9–20, 2002.
[14] Y. Ishikawa and M. Hasegawa. T-scroll: Visualizing trends in a time-series of documents for interactive user exploration. Lecture Notes in Computer Science, 4675:235–246, Nov. 2007.
[15] M. Krstajic, E. Bertini, and D. Keim. Cloudlines: compact display of event episodes in multiple time-series. Visualization and Computer Graphics, IEEE Transactions on, 17(12):2432–2439, 2011.
[16] M. Krstajic, M. Najm-Araghi, F. Mansmann, and D. A. Keim. Incremental visual text analytics of news story development.
In IS&T/SPIE Electronic Imaging, pages 829407–829407. International Society for Optics and Photonics, 2012.
[17] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83–97, 1955.
[18] B. Lee, N. H. Riche, A. K. Karlson, and S. Carpendale. SparkClouds: Visualizing trends in tag clouds. IEEE Trans. Visualization and Computer Graphics, 16(6):1182–1189, 2010.
[19] J. Leskovec, L. Backstrom, and J. Kleinberg. Meme-tracking and the dynamics of the news cycle. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 497–506, 2009.
[20] S. Liu, M. X. Zhou, S. Pan, W. Qian, W. Cai, and X. Lian. Interactive, topic-based visual text summarization and analysis. In Proceeding of the 18th ACM conference on Information and knowledge management, pages 543–552, 2009.
[21] D. Luo, J. Yang, M. Krstajic, W. Ribarsky, and D. Keim. Eventriver: Visually exploring text collections with temporal references. Visualization and Computer Graphics, IEEE Transactions on, 18(1):93–105, 2012.
[22] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical Report, 1998.
[23] F. Paulovich and R. Minghim. Hipp: A novel hierarchical point placement strategy and its application to the exploration of document collections. IEEE Transaction on Visualization and Computer Graphics, 16(8):1229–1236, Nov. 2008.
[24] V. Wang, F. Salim, and P. Moskovits. The Definitive Guide to HTML5 WebSocket, Build Real-Time Applications with HTML5. Apress, 2013.
[25] F. Wei, S. Liu, Y. Song, S. Pan, M. Zhou, W. Qian, L. Shi, L. Tan, and Q. Zhang. Tiara: a visual exploratory text analytic system. In Proc. KDD, pages 153–162, 2010.
[26] J. A. Wise, J. J. Thomas, K. Pennock, D. Lantrip, M. Pottier, A. Schur, and V. Crow. Visualizing the non-visual: spatial analysis and interaction with information for text documents.
Readings in information visualization: using vision to think, pages 442–450, 1999.
[27] X. Zhu, A. Goldberg, J. V. Gael, and D. Andrzejewski. Improving diversity in ranking using absorbing random walks. HLT-NAACL, pages 97–104, 2007.