The document describes a system called TStreamMonitor that visually monitors evolving text streams in real-time. TStreamMonitor encodes incoming text documents into dynamic visualizations using multilevel clustering of an evolving graph. It discovers thematic topics through clusters that are visualized based on text summarization to identify salient trends. The system supports real-time processing and analysis, easy understanding through minimal visualizations, and user interaction through control of cluster sizes and levels.
Online Submission ID: 135
TStreamMonitor: Visually Monitoring Evolving Text Streams
Category: Technique
ABSTRACT
We develop a dynamic visualization system, TStreamMonitor,
aimed at monitoring realtime text streams. The streams consist of
a large amount of continuously arriving documents, such as news,
blogs and emails. Visually monitoring such streams helps analysts
and users quickly observe and understand the topics, events and
trends of the evolving datasets. Our system encodes the incom-
ing text documents into user-friendly dynamic visualizations which
support realtime in situ exploration of the streams. Such encod-
ing is implemented based on a multilevel clustering scheme over an
evolving graph of streaming documents. The clusters discover the-
matic topics of the streaming documents. They are visualized based
on text summarization so that users can easily identify salient and
changing focuses of the text streams.
Keywords: Text Stream; Realtime Monitoring; Visual Knowledge
Discovery
1 INTRODUCTION
News, blogs, emails, and social media records are becoming more and
more ubiquitous. Such textual documents continuously “stream”
into people’s lives, creating constantly evolving text streams. In
many cases, there is a real need to monitor the streams while they
thrive, so that users and analysts can quickly observe and under-
stand the characterization, underlying structure, trends, and events
in the evolving datasets. In such a monitoring scenario, the visual-
ization design should follow several principles:
• Support in situ processing The visualization needs to han-
dle the evolving text items as they pass through the system.
Realtime computation and visualization are demanded.
• Capture salient knowledge The visual elements should cap-
ture sufficient knowledge of the streaming data. Critical fea-
tures and trends should be easily identified and studied.
• Use understandable visualization The cognitive abilities of
analysts should be taken into account. The dynamic visualization
needs to avoid complex metaphors and promote prompt
understanding.
• Facilitate user participation Users should be able to control
the visualization process by tuning the system for their pre-
ferred outcome, so as to maintain flexible analytical utility.
Most current text visualization techniques, like many analytical
processes, rely on post-processing of stored data. They are
successful in visualizing features and trends in text collections, but
are not directly capable of monitoring realtime text streams when
there is no prior knowledge of incoming items. A few efforts have
addressed the visualization of text streams, but they do not satisfy
all of the principles above: users may struggle to identify salient
features in the rapidly changing and complex visualizations.
In this paper, we develop a visual monitoring system, namely
TStreamMonitor, for users to study evolving text streams. It en-
codes the actively incoming text documents into user-friendly dy-
namic visualizations that characterize evolving salient features.
TStreamMonitor is developed with respect to the design principles
above. Using an evolving stream of daily news as an example, our
system is designed:
• To support in situ processing, the computation and visualization
have no reliance on a priori knowledge of the incoming
news. Instead, it is assumed that unknown news items contin-
uously flow into the system for users to characterize.
• To capture salient knowledge, the system visualizes the
dynamically-formed clusters of a large amount of news items
to elucidate the hidden thematic information. The clusters
represent specific news topics of the rapidly changing stream,
such as Ukraine crisis, Malaysia Airlines MH370, etc.
• To promote easy understanding, instead of displaying lots of
individual news items contained in each cluster, the dynamic
visualization provides a lightweight interface with minimal
clutter and distraction. Users are able to identify essential
insights easily and promptly.
• To facilitate user interaction, the size of the clusters can be
controlled by users in realtime together with the dynamic vi-
sualization. A big cluster can also be visualized as a set
of sub-clusters representing sub-topics. Consequently, the
formed clusters and sub-clusters show different levels of news
topics. For example, Ukraine crisis has sub-topics of Ukraine
protest and Crimea votes.
To support such designs of realtime stream processing, TStream-
Monitor starts from a dynamically-changing force directed layout
of individual documents. The particles representing documents
then create an evolving graph by 2D geometric triangulation. Then
topical clusters are generated based on user controlled graph cut.
This approach needs no special text clustering methods, as they
mostly perform batch processing over the whole dataset and are thus
unsuited to visualizing continuously arriving documents. Moreover,
the clusters are naturally formed in a multilevel scheme in which
users can control their sizes and levels.
The topical clusters are analyzed by text summarization utilizing
keyword frequency, importance weights and graph-based ranking
algorithms (e.g. PageRank [22]). The salient information of important
clusters is visualized for easy identification and understanding,
instead of massive numbers of individual documents. Moreover, the
visualization clearly shows the splitting and merging of topical clusters,
enabled by identifying temporal topic evolution through an optimized
matching algorithm. Their evolutionary history is further
shown in a topic stream view, which also supports user
interaction, e.g. playback.
The TStreamMonitor system provides a new web-based platform
that supports effective realtime monitoring and study of evolving
text streams. It can be used for visual exploration of news stories,
blogs, emails, business transactions, and many other text datasets,
where it can raise the level of understanding of realtime analysts.
We implemented the prototype system and conducted case
studies using a dataset of New York Times news. We also performed
a user study to evaluate the system.
2 RELATED WORK
Visualizing large text corpora is an emerging research topic. Tag
cloud techniques (e.g. [18]) use visual depictions of tags (or
words) giving greater prominence to words that appear more fre-
quently. Similarity-based projections help users get insights from
large text collections in 2D views. Typically, multidimensional
Figure 1: TStreamMonitor system framework. A text stream continuously injects arriving documents into the system, which provides an interac-
tive, realtime monitoring interface for users to quickly characterize and understand the streaming data.
scaling (MDS) or force-directed methods are used to form a 2D layout
from the text documents. For instance, “galaxies” or “mountains”
are formed in the displays [26]. A hierarchy of the documents
is built and projected as circles in [23]. InfoSky [3] exploits
hierarchically structured documents at each level with Voronoi diagrams.
Exemplar-based visualization [6] visualizes extremely large
text corpora by probabilistic MDS projection with approximation
and decomposition. Alsakran et al. [2] use a dynamic similarity-
based projection system to depict text streams. Gansner et al. [11]
draw the dynamic graph of streaming documents by MDS.
Text collections often include time-stamped documents. The-
meRiver [13] and LensRiver [12] depict the frequency changes of
keywords as river currents. T-scroll [14] employs a novelty-based
clustering algorithm on time-series documents. StoryTracker [16]
presents an incremental visual analytics system for exploration of
news topics in dynamic information streams. Meme-tracking [19]
extracts keywords from news corpus to generate themes, and then
visualizes them to indicate the flow of stories. Utilizing text mining
tools such as topic modeling, some efforts have been made for vi-
sual analysis of text collections, including EventRiver [21], Cloud-
lines [15], Leadline [9], visual text summary [20], TIARA [25],
TextFlow [7], Textpool [1], and TextWheel [8]. These methods
are highly successful in visualizing the temporal trends in archives
of text. However, most of them are not directly designed for moni-
toring live text streams.
The existing work most closely related to TStreamMonitor is that of
Alsakran et al. [2] and Gansner et al. [11], both of which target
realtime monitoring and dynamic visualization of text streams.
In [11], modularity clustering is applied after MDS projects
tweet documents onto a 2D layout. They use a map metaphor to
depict the clusters, where highly related messages form countries
enclosed by a boundary. Each country has a color different from
its neighbors. Such an approach creates one static layout at a time.
To handle dynamic text streams, a two-step approach is used. A
newly arriving document is placed at the average of its neighbors
in the previous graph and local optimization of MDS layout is ap-
plied. However, a special Procrustes transformation has to be ap-
plied to make the new graph best aligned to the old one. Moreover,
to avoid overlapping components after local MDS, a packing algo-
rithm has to be introduced. In [2], such dynamic stability is handled
through continuous force-directed simulation. The dynamic evolu-
tion of documents inside the 2D projected layout is seamless and
smooth. However, this approach visualizes all the particles inside
clusters, which makes the layout too complex, and the dynamic motion of
many particles makes it hard for users to focus on salient information.
Similarly, the dynamic maps approach [11], which places many
cities and countries on the screen, may also impose a heavy burden
on observers. In our approach, we use a force-directed layout for the
dynamically evolving graph, and propose a lightweight visualization
based on the dynamic clusters directly computed from the 2D geometric
distribution of documents. More importantly, these related
approaches do not grant easy access and control for users to manip-
ulate clusters and their topics in visualization. Our approach instead
computes and depicts multiple levels of clusters and their dynamic
merging and splitting in a fast and effective way. TStreamMonitor
thus helps users to interactively discover interesting topics and their
subtopics with an easy-to-read interface.
3 DYNAMIC TOPICAL CLUSTERS
Fig. 1 illustrates the processing framework of TStreamMonitor. A
text stream continuously injects arriving documents into the sys-
tem. Each document has its title, content, and time stamp. It is
typically represented by a sequence of keywords, based on which
we compute the pairwise similarities between documents. This is
the most common approach in text clustering and
visualization work. We use the cosine similarity in this computation.
For in situ computation, we update the current set of keywords
whenever new documents arrive. Not all keywords are included in
the set; instead, the TF-IDF (Term Frequency-Inverse Document Frequency)
weights are used to find the N most important keywords
for similarity computation.
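The keyword selection and similarity step described above can be sketched as follows. This is a minimal illustration, not the system's implementation: the function names, the particular TF-IDF variant (relative term frequency times log inverse document frequency, summed over documents), and the raw-count document vectors are all assumptions.

```python
import math
from collections import Counter

def top_keywords(docs, n):
    """Return the n keywords with the highest summed TF-IDF weight.

    docs is a list of keyword lists, one per active document (assumed input).
    """
    num_docs = len(docs)
    doc_freq = Counter()
    for words in docs:
        doc_freq.update(set(words))          # count each word once per document
    weight = Counter()
    for words in docs:
        for word, count in Counter(words).items():
            idf = math.log(num_docs / doc_freq[word])
            weight[word] += (count / len(words)) * idf
    return [word for word, _ in weight.most_common(n)]

def to_vector(words, vocabulary):
    """Represent one document as keyword counts over the selected vocabulary."""
    counts = Counter(words)
    return [counts[w] for w in vocabulary]

def cosine_similarity(u, v):
    """Cosine of the angle between two keyword vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

In a streaming setting, `top_keywords` would be re-run over the active documents whenever new ones arrive, and document vectors rebuilt over the refreshed vocabulary. A keyword that occurs in every active document receives zero weight under this variant and therefore drops out of the set.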
A 2D layout of active documents is generated, in which similar
documents are placed closely. To achieve this we adopt a force-directed
method. Over this 2D layout, we triangulate
the document particles, which creates an evolving graph. Then, the
edges having a length larger than a given threshold are removed
to find document clusters. This approach is based on the actively
evolving graph, similar to the work in [2]. It can automatically handle
continuously incoming documents and create clusters over these
documents. It has a mechanism for dropping old and stale documents.
Our contribution behind the new visual interface is to
extend this method to find multilevel clusters and handle evolutionary
topics.
3.1 Multilevel Clusters
Our system can find multilevel clusters, as shown in Figure 2. Figure 2a
is the evolving graph created after applying Delaunay triangulation
to the 2D layout of document particles. By applying
a graph cut with a threshold φ1, we can discover several clusters over
the graph in Figure 2b. We can further apply a threshold φ2 inside
these clusters, where some sub-clusters are found, which are shown
in different particle colors in Figure 2c. If needed, more clusters can
be identified inside these sub-clusters (Figure 2d).
Important clusters and their sub-clusters represent the thematic
knowledge of the text stream, as we can look at each cluster as
a topic. Then the sub-clusters refer to related sub-topics. Such a
hierarchical structure provides a powerful visual analytics ability to
characterize critical topics and, meanwhile, analyze subtopics inside
them.
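The multilevel graph cut can be sketched as a recursive connected-components search. This is a minimal illustration under stated assumptions: the edge list is taken as given (for example, from a Delaunay triangulation, which the sketch does not compute), an edge is treated as cut when it is longer than the current threshold so that nearby (similar) documents stay connected, and all names are hypothetical.

```python
import math

def cut_clusters(points, edges, members, threshold):
    """Connected components of `members`, keeping only edges <= threshold."""
    parent = {i: i for i in members}

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in edges:
        if a in parent and b in parent:
            if math.dist(points[a], points[b]) <= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for i in members:
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def multilevel_clusters(points, edges, thresholds, members=None):
    """Nested clusters: thresholds[0] yields topics, thresholds[1] sub-topics, ..."""
    if members is None:
        members = list(range(len(points)))
    if not thresholds:
        return []
    return [
        {"members": group,
         "children": multilevel_clusters(points, edges, thresholds[1:], group)}
        for group in cut_clusters(points, edges, members, thresholds[0])
    ]

# Six particles on a line, connected in a chain (illustrative data only).
points = [(0, 0), (1, 0), (3, 0), (4, 0), (20, 0), (21, 0)]
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
tree = multilevel_clusters(points, edges, thresholds=[5, 1.5])
```

Because each level only re-filters the existing edges, changing a threshold requires no new layout or text clustering, which is in line with the claim that multilevel clusters can be produced immediately.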
In [2], though some clusters are identified, the visualization still
shows individual particles over the force-directed layout. Such vi-
sualization is too complex for users to understand the changing
graph. Moreover, it cannot easily visualize the information of sub-
clusters and their relationships. Instead, our visualization design is
(a) Evolving graph (b) Discovered clusters (c) Sub-clusters (d) Sub-sub-clusters
Figure 2: Multilevel clusters are easily created from the evolving graph. Their generation is controlled by threshold parameters and can be
achieved immediately.
based on the hierarchical clusters to show more salient knowledge
of the text stream.
3.2 Evolutionary Topical Clusters
The key problem in mining and visualizing text streams is to iden-
tify the evolutionary topics. In the monitoring scenario it is even
more challenging as many analytical tools, such as the widely used
LDA (Latent Dirichlet Allocation) model [4], cannot be directly
applied to the actively arriving documents. In TStreamMonitor, we
find the topical evolution based on the relationships between docu-
ment particles. This approach naturally models the topical cluster
variation without seeking help from the text mining techniques. It
can be completed immediately in realtime monitoring which is vital
for the in situ processing.
To trace the evolving clusters, we match the important pairs of
clusters at consecutive times that share most of their document
particles. In detail, consider two clusters, Ci and Cj, at consecutive
times t and t’, each of which accommodates a set of particles. We can
compute a match factor δij between the two clusters as:
$$\delta_{ij}(C_i, C_j) = \frac{N(C_i \cap C_j)}{N(C_i)}, \tag{1}$$
where N(C) is the number of particles inside a cluster C. Then,
considering a set of clusters {Ci},i = 1···N found at time t, and a
set of clusters {Cj}, j = 1···M formed at time t’, we need to find
the best matching pairs Ci → Cj for each Ci. This cluster matching
problem can be solved by a combinatorial optimization algorithm,
the Hungarian method [17].
This method provides a better solution than the naive matching
implementation in [2]. In particular, we have a nonnegative N × M
matrix, where the element in the i-th row and j-th column represents
δij(Ci,Cj). It can be regarded as the benefit of assigning Ci → Cj. We
apply the Hungarian algorithm to find an optimal assignment in which the
total match factor Σiδij(Ci,Cj) over all i = 1···N is maximized
(equivalently, the total of the negated match factors, taken as costs,
is minimized). In our implementation, we further extend the algorithm
by setting a lower bound of δij(Ci,Cj) = 0.15 on the match factor of any
matched pair: if the match factor is too small, i.e., the two
corresponding clusters do not share enough documents, we cannot
match Cj as the evolved cluster from Ci. Based on this result, we
can identify the merging and splitting of the topical clusters from t
to t’. Finally, any cluster at time t’ that does not have a match from
the previous time is considered a newly emerging cluster.
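The matching step can be sketched as below. Two assumptions are made and should be read as such: the assignment is chosen to maximize the total match factor, and an exhaustive search over permutations stands in for the Hungarian method, which is only practical for a handful of clusters. Function names are hypothetical; the lower bound of 0.15 is the value stated in the text.

```python
from itertools import permutations

def match_factor(ci, cj):
    """Equation (1): fraction of Ci's particles that also appear in Cj."""
    return len(ci & cj) / len(ci) if ci else 0.0

def match_clusters(old, new, lower_bound=0.15):
    """Return ({old index: new index}, [indices of new clusters with no match]).

    old and new are lists of sets of particle ids at times t and t'.
    """
    n, m = len(old), len(new)
    delta = [[match_factor(ci, cj) for cj in new] for ci in old]
    size = max(n, m)

    def score(i, j):
        # Padded rows/columns make the problem square and contribute nothing.
        return delta[i][j] if i < n and j < m else 0.0

    # Brute-force stand-in for the Hungarian method (assumption, small inputs only).
    best = max(permutations(range(size)),
               key=lambda p: sum(score(i, p[i]) for i in range(size)))

    matches = {i: best[i] for i in range(n)
               if best[i] < m and delta[i][best[i]] >= lower_bound}
    matched_new = set(matches.values())
    emerging = [j for j in range(m) if j not in matched_new]
    return matches, emerging

old = [{1, 2, 3, 4}, {5, 6}]
new = [{5, 6, 7}, {1, 2, 3}, {8, 9}]
matches, emerging = match_clusters(old, new)
```

Here the first old cluster is matched to the second new cluster (match factor 0.75), the second old cluster to the first new one (match factor 1.0), and the third new cluster is reported as emerging.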
4 TOPICAL CLUSTER SUMMARIZATION
[Figure 3 panel text] Hottest Keywords: #United States, International Relations, #Ukraine, #Russia, #Putin, Vladimir V
Figure 3: Summarization of a cluster. The hottest keywords are shown.
Moreover, three different methods are used to find important documents,
whose titles are shown in different colors: (Green) the newest
arriving document; (Blue) the document having the largest average
similarity to other documents; (Red) the document having the largest
PageRank [22].
Visualizing the topical clusters relies on a fast, automatic way to
find and present the most important points of the original documents
inside a cluster or sub-cluster. Due to the limited real estate
in visualization, only a very small set of words, phrases, or sentences
can be used in the visual representation. For evolving text
streams, we utilize three types of information to quickly convey the
important content of a topical cluster:
1. Active time period: Topics in a text stream evolve along time.
During monitoring, an active cluster which has newly incom-
ing documents may need more attention. The active time pe-
riod of clusters can provide information of a cluster’s time
variation.
2. Representative keywords: As a document is represented by a
set of important keywords, a cluster is also represented by
some representative keywords. The selection of such key-
words can be completed by using the most frequent keywords,
or by using the highly weighted keywords (e.g. by TF-IDF).
3. Significant titles: Significant documents inside a cluster usu-
ally represent the important information. Visualizing the titles
of them helps users understand the topic. Finding the most
significant document can be implemented in different ways:
• The most recently arriving document;
Figure 4: TStreamMonitor interface. (A) Dynamic Topic Monitor; (B) Topic Stream View; (C) Detail View; (D) Control Panel.
• The document that has the largest average similarity to
other documents;
• The document that has the highest PageRank [22] among the
documents.
Figure 3 is an example of three different particles found in a
cluster with these three options. It indicates the particles and
the titles of their corresponding documents in different colors.
The representative hottest keywords are also shown. We
implement the algorithms for all of these options. In realtime
monitoring, users can switch between them freely without disturbing
the dynamic process. In the future, we will study more
summarization techniques (e.g. [27]) to better represent a topical
cluster.
In TStreamMonitor, we show this information in simple visualizations
so that users can promptly identify the topical information.
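The three ways of choosing a significant document can be sketched as follows. This is an illustration only: it assumes a precomputed pairwise similarity matrix for the documents in one cluster, uses a plain power-iteration PageRank over similarity-weighted links (the damping factor 0.85 and the iteration count are conventional defaults, not values from the paper), and all names are hypothetical.

```python
def newest(docs):
    """Index of the most recently arrived document (largest time stamp)."""
    return max(range(len(docs)), key=lambda i: docs[i]["time"])

def most_central(sim):
    """Index of the document with the largest average similarity to the others."""
    n = len(sim)

    def average(i):
        return sum(sim[i][j] for j in range(n) if j != i) / max(n - 1, 1)

    return max(range(n), key=average)

def pagerank(sim, damping=0.85, iterations=50):
    """Power-iteration PageRank on a graph weighted by document similarity."""
    n = len(sim)
    rank = [1.0 / n] * n
    # Total outgoing weight of each document, ignoring self-similarity.
    out = [sum(sim[i][j] for j in range(n) if j != i) for i in range(n)]
    for _ in range(iterations):
        new_rank = []
        for j in range(n):
            incoming = sum(rank[i] * sim[i][j] / out[i]
                           for i in range(n) if i != j and out[i] > 0)
            new_rank.append((1 - damping) / n + damping * incoming)
        rank = new_rank
    return rank

# Illustrative similarity matrix for a three-document cluster.
sim = [[1.0, 0.9, 0.8],
       [0.9, 1.0, 0.1],
       [0.8, 0.1, 1.0]]
ranks = pagerank(sim)
highest_ranked = max(range(len(ranks)), key=ranks.__getitem__)
```

Since all three selectors return an index into the same document list, switching between them during monitoring only changes which title is displayed, not the clustering itself.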
5 TSTREAMMONITOR VISUALIZATION DESIGN
Figure 4 is an overview of the interface in TStreamMonitor. Figure
4(A) is the dynamic topic monitor. It is used to monitor the dynamic
changes of topical clusters in an active text stream. Figure 4(B)
is the topic stream view, which shows the temporal evolution of
the topics. It is also used as a panel for users to explore historical
changes when they want to. Figure 4(C) is the detail view used
to show the details of documents. In Figure 4(D) users can adjust
several parameters and thresholds, and control playback, pause and
forward.
5.1 Dynamic Topic Monitor
The key function of TStreamMonitor is to provide an easy-to-read
view of the important topics discovered by topical clusters. In visu-
alizing dynamic scenes, too many visual elements on the interface
usually confound observers’ perception and understanding, since
the changes of complex visual metaphors can easily become disturbing.
Therefore, our main design rationale is to provide an
interface that is:
• lightweight: with a limited number of visual elements that do
not overwhelm and defocus observers during monitoring;
• informative: including necessary information for users to
quickly identify the topical contents;
• smooth: with minimized abrupt changes during topic evolu-
tion;
• controllable: giving users the ease of parameter adjustment
that leads to natural and sleek view changes.
5.1.1 Visualizing Topic Clusters
For a lightweight topic monitor, we do not overwhelm users by vi-
sualizing all the individual document particles as in [2]. Instead,
we only show the discovered important topics. In Figure 4(A), five
important topics are displayed, each as a circular bubble metaphor.
A bubble represents one corresponding cluster containing a set of
documents. The size of the bubbles is determined by the total num-
ber of the documents inside a cluster. The bubble’s center location
is computed from the document particles’ locations originally de-
fined by the force-directed layout of the evolving graph (Figure 2).
Figure 5: One topical cluster shown as a bubble. The associated information
includes: the active time period shown in the curved color bar,
four representative keywords in the linked labels, and one significant
title shown at the top. The rectangular label further shows the title
of the latest arriving document in the whole TStreamMonitor system,
which happens to be in this cluster.
Figure 6: Subtopic view of a topical cluster. Representative keywords
and significant titles are shown for each of the three sub-clusters.
Therefore in the simplified view, the distances between the cluster
bubbles represent the similarities of the topics. Related topics are
placed closely.
A straightforward solution is to place the bubble’s center at the
average of its member particles’ locations. However,
during dynamic evolution, such an average location jitters slightly
at each time step, since any location change of a
single particle moves this average. This effect causes temporal
instability that greatly distracts observers during animation. We instead
apply a simple but effective algorithm. When a cluster bubble is
drawn for the first time, we find the document particle that is
closest to the average location. Then this particle’s location is used
as the bubble center. In this way, the jitter is removed,
and the motion of the bubble becomes stable and consistent.
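The anchoring idea can be sketched as a small class that remembers which particle was chosen at the first draw. The class and method names are hypothetical, and re-selecting an anchor when the original particle leaves the cluster is an assumption added for completeness; the text does not say how that case is handled.

```python
import math

class BubbleAnchor:
    """Keep a bubble center stable by following a single anchor particle."""

    def __init__(self):
        self.anchor_id = None

    def center(self, positions):
        """Return the bubble center.

        positions maps particle id -> (x, y) for the cluster's current members.
        """
        if self.anchor_id not in positions:
            # First draw, or (assumption) the anchor left the cluster:
            # choose the particle closest to the mean location.
            cx = sum(x for x, _ in positions.values()) / len(positions)
            cy = sum(y for _, y in positions.values()) / len(positions)
            self.anchor_id = min(
                positions,
                key=lambda pid: math.dist(positions[pid], (cx, cy)))
        return positions[self.anchor_id]

anchor = BubbleAnchor()
first = anchor.center({"a": (0, 0), "b": (1, 0), "c": (5, 0)})
# Particle "c" jumps far away; the mean shifts but the anchor "b" does not.
second = anchor.center({"a": (0, 0), "b": (1, 0), "c": (50, 0)})
```

The bubble now moves only when its anchor particle moves, so a location change of any other single particle no longer shifts the center.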
Figure 5 is a zoom-in view of one topical cluster. The three types
of summarized information discussed in Section 4 are visualized to
provide a succinct and informative view. On the top, a significant
title inside this cluster is shown. The linked labels present the
representative keywords of this topic. A rectangular text box with
black letters shows the title of the latest arriving document in the
whole system. Moreover, a curved bar over the bubble encodes the
most active time of this topic in colors, which shows the latest time
a new document was added to this cluster. This latest time is also shown
in digits above the bar. In particular, fresh topics are shown
in light green, and topics with no recently joined documents
are shown in red. The length of the colored curve inside this bar
represents the length of the active time period (i.e., the time interval
between the first and the latest document arrival).
Inside a transparent bubble, we show a group of small particles
representing the accommodated documents. Here the particles are
simply placed together to avoid complexity since they are mainly
used to show the merging and splitting of clusters (see below for
details).
5.1.2 Visualizing Subtopics
Based on our multilevel clusters, we can further visualize the
subtopics inside one topical cluster. Figure 6 depicts an example
of three subtopics shown in small bubbles inside the original big
one. To make clear views, the significant title and keywords are
shrunk inside the bubbles and the time bars are removed. A green
box shows the newest document title inside this cluster. Users can
switch between the sub-cluster view and the original one, as well as
control the levels.
5.1.3 Visualizing Topic Merging and Splitting
The merging and splitting of topical clusters are critical cues that help
users understand the text stream evolution. Visualizing such events is
a unique feature of our system. We implement a smooth animated
effect for this purpose. Figure 7 depicts a splitting process. Particles
inside different cluster bubbles fly away and generate new bubbles
to represent new topical clusters. The merging effect is visualized
in the same way. Such a flying effect creates a smooth change of topics.
It attracts users’ attention and promotes a clear understanding of
the effect, which is best seen in the supplemental video.
5.2 Topic Stream Visualization
We design a topic stream view, based on ThemeRiver [13], to show the
evolution of topic clusters, as shown in Figure 4(B). It provides the
context for the monitoring process. The topical clusters are ordered
from top to bottom by size at each time axis, and similar topics are
connected by ribbons. The width of a ribbon relates to the size of
its clusters. Major keywords are shown over the ribbons. Similar
topics that evolve over time are detected by the algorithm in Section
3.2. The ribbons stream forward automatically along time during
dynamic monitoring. Users can directly observe a topic’s change,
including emerging, splitting, merging, and disappearing, over the
recent time. Moreover, this view can be dragged backward and
forward. Users can monitor any preferred time period repeatedly
by clicking on any time axis or ribbon to explore the topical clusters
at the selected time. The stream view thus also serves as a temporal
control panel.
5.3 Detail Visualization
When users need to investigate the details of a topic, Figure 4(C)
displays the information of individual documents with time and title
in one or multiple clusters. Users can further read more contents if
needed. This view can be turned on or off, since sometimes turning
it off can help users better focus on the monitor view. When it
is turned on, it can also be used to dynamically show the newly
arriving documents, or to dynamically show the documents in a
particular cluster ordered by time.
5.4 Interactive Control
Figure 4(D) provides controls over the monitoring process for users.
First, the dynamic process can be played forward, backward, and
paused. Second, users can choose among the options for the different types
of displayed summary information, such as different representative
titles, as discussed in Section 4. Third, users can change
the clustering parameters for further exploration, including the
threshold that determines which particles belong to one cluster in
the graph cut (Section 3), and how many topical clusters are shown in
the dynamic monitor. Figure 8 depicts two different views when
the parameters are changed. They have different numbers of topical
clusters (5 and 10). Users can control such different granularities
(a) 3 clusters at the beginning (b) Splitting with flying particles (c) Split to 4 clusters
Figure 7: Cluster splitting effect. Particles inside the original clusters fly away to generate new clusters. Merging effect is implemented in the
same style.
(a) 5 clusters (b) 10 clusters
Figure 8: Changing clustering parameters. Different numbers of clus-
ters are generated with different topical details.
according to their interests. More importantly, such control operations can
be done in realtime during the monitoring process. This shows one
important merit of our dynamic clustering method. Our method can
easily achieve such different clustering effects with an immediate
switch by users, even in a dynamic process.
6 SYSTEM IMPLEMENTATION
To maximize the usability of TStreamMonitor, we develop the sys-
tem in a realtime client-and-server scheme. In particular, after a
text stream injects new documents into the system server, the documents
are processed for cleaning and keyword extraction. Then
the server-side program updates the active list of the important keywords,
performs in situ similarity computation, and uses the force-directed
algorithm to create the 2D layout of particles. Dynamic
and multilevel clustering are then performed (Section 3), followed
by the summarization (Section 4). Finally, a hierarchical data structure
of clusters is created; each cluster stores its summarization
information as well as the indices of its member document particles.
All these operations on the server are implemented by optimized
C++ programs for fast performance.
The server starts and maintains a realtime data exchange service.
One or more web clients can be initiated by linking them to the
server over an Internet connection. The data exchange service between
the server and clients is implemented through the JSON (JavaScript
Object Notation) format [10], a text format that is completely language
independent. The communication between server and client
is carried out through the WebSocket API [24], which enables web pages to
use the WebSocket protocol for full-duplex communication with
the remote server. The full-duplex channel is important since, in realtime,
we transfer control parameters adjusted by users to the server
side to generate different results according to their input.
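To make the two directions of the exchange concrete, the sketch below serializes one cluster update and parses one control message as JSON text. The schema is entirely hypothetical: the field names ("type", "clusters", "children", "name", "value"), the parameter name "cut_threshold", and the document indices are illustrative assumptions, and the WebSocket transport itself is not shown. The keywords, title and time stamp are taken from the case study text.

```python
import json

def encode_update(timestamp, clusters):
    """Serialize one server-to-client update (hypothetical schema) as JSON text."""
    return json.dumps({"type": "update", "time": timestamp, "clusters": clusters})

def decode_control(message):
    """Parse a client-to-server control message; return (parameter, value)."""
    data = json.loads(message)
    if data.get("type") != "control":
        raise ValueError("not a control message")
    return data["name"], data["value"]

# Server to client: a cluster hierarchy with summaries and member indices.
update = encode_update("2014-03-06T12:31:30", [{
    "id": 0,
    "keywords": ["Russia", "Ukraine", "Crimea"],
    "title": "EU Slaps Initial Sanctions on Russia ...",
    "documents": [12, 40, 57],   # illustrative particle indices
    "children": [],              # sub-clusters would nest here
}])

# Client to server: a user adjusts a clustering parameter.
name, value = decode_control(
    '{"type": "control", "name": "cut_threshold", "value": 0.4}')
```

Nesting sub-clusters under "children" mirrors the hierarchical cluster structure built on the server, and the small control message illustrates why a full-duplex channel is needed.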
A client’s sole task is to provide web-based visualization. It receives
the information of the multilevel clusters continuously and then
visualizes them on web pages. The visualization is completely web-based,
implemented in JavaScript and HTML5 with D3.js [5].
Documents of a text stream may arrive at any speed (with a vari-
ety of time intervals). Our system can follow the incoming speed to
inject new documents. We also implement a buffering mechanism
so that the interval between consecutive streaming documents can
be controlled. This provides a flexible scheme, since we found that if
the stream goes too fast, users do not have enough time for perception
and forming insights. Moreover, documents can stream into
the system one by one, or following their publishing time, in which
case a set of documents may be inserted at the same time.
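The buffering scheme can be sketched as grouping documents into fixed windows of publishing time and replaying each window at a fixed playback interval. This is a minimal illustration with hypothetical function names; time stamps are assumed to be seconds, and the window and interval values in the example (20 minutes and 5 seconds) are the ones reported in the case study of Section 7.1.

```python
from collections import defaultdict

def batch_by_window(docs, window_seconds):
    """Group (publish_time_seconds, doc_id) pairs by publishing-time window.

    Returns a sorted list of (window index, [doc ids]); empty windows are omitted.
    """
    batches = defaultdict(list)
    for publish_time, doc_id in docs:
        batches[publish_time // window_seconds].append(doc_id)
    return sorted(batches.items())

def replay_schedule(docs, window_seconds, interval_seconds):
    """Pair each batch with the playback offset (seconds) at which it is injected."""
    windows = batch_by_window(docs, window_seconds)
    if not windows:
        return []
    first = windows[0][0]
    return [((w - first) * interval_seconds, ids) for w, ids in windows]

docs = [(0, "a"), (600, "b"), (1300, "c"), (3700, "d")]
schedule = replay_schedule(docs, window_seconds=20 * 60, interval_seconds=5)
```

In this example documents "a" and "b" fall into the same 20-minute window and are injected together, while the empty third window simply leaves a 5-second gap in the playback.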
7 EVALUATION
To evaluate our application, Sec. 7.1 presents a case study
of a real-world data stream. Sec. 7.2 reports a preliminary user study
with 8 participants.
7.1 Case Study: New York Times News Stream
In this case study, we explored a text stream collected from the New
York Times (www.nytimes.com). The text stream includes all (892)
news items of the New York Times on March 6th, 2014. Each individual
news document includes the keywords, published time, title, and
the first paragraph of the news. We mimic continuously arriving
documents flowing into TStreamMonitor by their publishing time,
and perform monitoring tasks over the stream. We put all the news
of every 20 minutes into the system with an interval of 5 seconds.
On average, one news item is injected into the system per second. News
items older than 12 hours are dropped. All the computations
are completed in realtime. In the supplemental video, this whole
dynamic monitoring process is illustrated in the animation.
Figure 9a is a snapshot of TStreamMonitor. The dynamic mon-
itor shows 5 topical clusters at 12:31:30 pm March 6th, 2014. The
stream view shows the historical evolution of topics until this time.
From the labeled keywords, we can characterize the clusters related
to “Russia” (green), “British” (purple), “Factory” (blue), “France”
(orange), and “Africa” (red), respectively. The green cluster has the
largest size, including many documents. We can investigate the details
of this topic by turning on the detail list view on the right. Moreover,
we switched this cluster to a multilevel view where we got
three sub-topics, as shown in Figure 9b. Major keywords and titles
indicate the subtopics: “Russia, Crimea, Ukrainian” (top), “Russia”
(bottom right), and “Crimea” (bottom left).
Figure 9: Case study: Monitoring a text stream of New York Times news on March 6th, 2014. (a)-(f) are snapshots shown in TStreamMonitor.
Figure 9c shows the new layout at the next timestamp. The blue cluster merged into the purple cluster after new items were injected, and a new cluster about “China” emerged. The rectangular labels of each cluster further show the titles of newly arriving news. A lot of news filled the “Russia” topical cluster. This topic showed the representative title “EU Slaps Initial Sanctions on Russia ...” and came with frequent keywords including “Russia”, “Ukraine”, “Crimea” and “European”. The title shown is the one with the highest PageRank score in the cluster. We further enlarged the cluster and found three subtopics (Figure 9d). From these we identified an important event involving Russia, Ukraine and Europe, and this topical cluster became the largest one. We thus became aware of the Crimea crisis and the violence in Ukraine. By studying more details, we found that the Crimean referendum was reset to March 30th from its initially planned date of May 25th, 2014.
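Selecting a cluster's representative title by PageRank can be sketched as follows. This section does not restate how the ranking graph is built, so the sketch assumes a weighted similarity graph among the titles of one cluster; the function names, the damping factor of 0.85, and the toy weights are all assumptions made for illustration.

```python
def pagerank(weights, damping=0.85, iterations=50):
    """Power-iteration PageRank on a weighted graph.

    weights[u][v] is the non-negative weight of edge u -> v; every node
    that appears as a target must also appear as a key of `weights`.
    """
    nodes = list(weights)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    out_sum = {u: sum(weights[u].values()) for u in nodes}
    for _ in range(iterations):
        new = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            if out_sum[u] == 0:
                # Dangling node: spread its rank evenly over all nodes.
                for v in nodes:
                    new[v] += damping * rank[u] / n
            else:
                for v, w in weights[u].items():
                    new[v] += damping * rank[u] * w / out_sum[u]
        rank = new
    return rank


def representative_title(weights):
    """Return the title with the highest PageRank score."""
    ranks = pagerank(weights)
    return max(ranks, key=ranks.get)


# Toy cluster (hypothetical): "hub" is similar to both other titles.
similarity = {
    "hub": {"a": 1.0, "b": 1.0},
    "a": {"hub": 1.0},
    "b": {"hub": 1.0},
}
ranks = pagerank(similarity)
best = representative_title(similarity)
```

In this toy cluster the title connected to both others receives the highest score and would be displayed as the cluster label.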
Further monitoring the stream, at 18:51:30, the green cluster became the largest cluster (Figure 9e). It indicated that “California” and “France” became the hottest topics in the news stream at that time. By studying more details, we found that several news items about the Paris Fashion Week (February 25th to March 5th, 2014) and the state of California arrived at that time. Figure 9f advances to the end of March 6th. The purple cluster, which is related to “Police”, “Paris Fashion Week” and “executive”, became the largest cluster. We thus became aware that entertainment and social news became the hot topic at night.
7.2 User Study
We conducted a preliminary user study to evaluate our TStreamMonitor system. The text stream for the user study was collected from the New York Times using the NYT API. The text stream includes 956 news items about “Obama”, starting on Mar. 1st, 2014 and ending on Mar. 20th, 2014. We put the news of every 5 hours into our system with an interval of 5 seconds, so that continuously arriving news flowed into TStreamMonitor by publishing time. On average, 2 news items are injected into our system per second. News items older than 10 days are dropped.
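The stated average rate follows from the replay settings, as the short calculation below shows. It assumes the stream spans 20 full days (Mar. 1st to Mar. 20th inclusive); the function name is hypothetical.

```python
def average_injection_rate(num_docs, span_hours, window_hours, interval_s):
    """Average number of documents injected per second of playback."""
    num_windows = span_hours / window_hours
    playback_seconds = num_windows * interval_s
    return num_docs / playback_seconds


# 956 documents; 20 days (assumed inclusive, i.e. 480 hours);
# 5-hour windows released every 5 seconds.
rate = average_injection_rate(956, 20 * 24, 5, 5)
print(round(rate, 2))  # prints 1.99
```

Under this assumption the stream is replayed in 96 windows over 480 seconds, which gives roughly 2 news items per second.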
We designed the user study to evaluate the merits of our monitoring system in the following three aspects:
• TStreamMonitor helps users identify critical information and trends of a text stream;
• TStreamMonitor provides an understandable and easy-to-use interface for dynamic visualization;
• TStreamMonitor facilitates user participation in the monitoring process.
7.2.1 Tasks
We designed three tasks in our user study. The monitoring process
has been described in the case study discussed in Section 7.1. These
tasks are to:
• T1: Identify the important events during the monitoring of the text stream. In this task, participants were asked whether or not several salient events can be identified.
• T2: Provide feedback on the understandability of the visualization. In this task, participants were asked: (1) whether the topical clusters can be identified clearly; (2) whether the splitting and merging of clusters is helpful in understanding the events; (3) whether the topical clusters can be easily traced along time.
• T3: Provide feedback on user participation in the monitoring process. In this task, participants were asked to evaluate whether or not the monitoring process can be controlled easily.
7.2.2 Participants
The user study was conducted with 8 participants (3 females and 5 males), all of whom were computer science majors. Participants were between 25 and 31 years old. Most of them had a basic understanding of data mining and information visualization. None of them had used a text stream monitoring system before. An instructor spent 15 to 20 minutes briefing the participants on the user study. This involved explaining the tasks, introducing the interface, and describing the concepts of keywords and titles of documents. Participants could complete the tasks in their preferred order and had 25 minutes to complete all three tasks by using our system to explore the news stream.
7.2.3 Results
The feedback collected from the participants was generally positive.
In Task T1, on average, participants could identify 70.8% of the important events from the text stream. In detail, 7 out of 8 participants reported identifying the event on March 2nd, and all 8 participants reported identifying the event on March 16th. However, only 2 out of 8 participants reported identifying the event on March 7th. A possible cause is that the size of the topic cluster did not increase enough to attract their attention, even though it had become the largest cluster. This is because we did not set the displayed size of a cluster linearly proportional to its actual size, in order to avoid small clusters being too small to identify on screen while big ones are too large. This points to one of our major directions for future work: improving the visual interface in many subtle details by working closely with user feedback.
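The effect of non-linear size scaling can be illustrated with a small sketch. The actual mapping used by TStreamMonitor is not specified here; square-root scaling with clamping, the function name `display_radius`, and all parameter values below are assumptions chosen only to show why a growing cluster may not look much larger.

```python
import math


def display_radius(doc_count, r_min=20.0, r_max=120.0, max_count=500):
    """Map a cluster's document count to an on-screen radius.

    Hypothetical mapping: square-root scaling compresses large counts,
    and the result stays inside [r_min, r_max] so that no cluster
    vanishes or dominates the screen.
    """
    clamped = min(max(doc_count, 0), max_count)
    frac = math.sqrt(clamped / max_count)
    return r_min + (r_max - r_min) * frac


small = display_radius(125)   # a quarter of max_count
large = display_radius(500)   # four times as many documents
```

Under these assumed parameters, a cluster with four times as many documents is drawn with a radius of 120 instead of 70, which is less than twice as large, so its growth is visually understated.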
In Task T2, 7 out of 8 participants reported that the topical clusters can be identified and traced easily during dynamic monitoring. All 8 participants agreed that the splitting and merging of clusters makes the text stream more understandable. Finally, in Task T3, 6 out of 8 participants agreed that they could control the monitoring process easily.
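The reported percentages can be reproduced from the per-event counts. The calculation below assumes that the three events named in Task T1 (March 2nd, March 7th, and March 16th) are the only events that were scored, which is consistent with the reported 70.8% average.

```python
# Participants (out of 8) who reported identifying each event in Task T1.
t1_hits = {"March 2nd": 7, "March 7th": 2, "March 16th": 8}
participants = 8

# Average share of events identified: (7 + 2 + 8) / (8 * 3) = 17 / 24.
t1_rate = sum(t1_hits.values()) / (participants * len(t1_hits))
print(f"T1 average identification rate: {t1_rate:.1%}")  # 70.8%

# Agreement rates for Task T2 (identify and trace clusters) and Task T3.
t2_rate = 7 / participants
t3_rate = 6 / participants
```

The same counts give agreement rates of 87.5% for identifying and tracing clusters in Task T2 and 75% for Task T3.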
These preliminary results suggest that our TStreamMonitor system provides a useful and efficient monitoring platform for users. It still needs to be improved through a more refined design of the visual representations, such as color, size, and labels.
8 CONCLUSION
We have presented a visual monitoring system, called TStreamMonitor, which helps users understand and study the topic information of text streams. The system facilitates real-time computation over streaming data, uses easy-to-understand visual metaphors, and promotes user participation in the analysis process. Users can easily characterize salient topics in a large set of arriving documents. By observing the merging and splitting of topics in the dynamic monitor, users can further understand their evolutionary trends. A real-world case study illustrated the application to monitoring realtime text streams, and the feedback from the user study indicates reduced effort in understanding text streams. Motivated by these results, we believe that our work will have a positive impact on the monitoring of evolving data streams.
In the future, we will study more topic and cluster summarization techniques. Since more than 8 thematic clusters are difficult to track at the same time, the system is currently limited for document streams with a much higher number of topic clusters. We will use other data mining methods to generate more stable and reliable clustering results. We will also perform more extensive user studies over specific datasets.
REFERENCES
[1] C. Albrecht-Buehler, B. Watson, and D. A. Shamma. Visualizing
live text streams using motion and temporal pooling. IEEE Computer
Graphics and Applications, 25(3):52–59, 2005.
[2] J. Alsakran, Y. Chen, D. Luo, Y. Zhao, J. Yang, W. Dou, and S. Liu.
Real-time visualization of streaming text with a force-based dynamic
system. IEEE Comput. Graph. Appl., 32(1):34–45, Jan. 2012.
[3] K. Andrews, W. Kienreich, V. Sabol, J. Becker, G. Droschl, F. Kappe,
M. Granitzer, P. Auer, and K. Tochtermann. The infosky visual ex-
plorer: Exploiting hierarchical structure and document similarities.
Information Visualization, 1(3):166–181, Dec. 2002.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J.
Mach. Learn. Res., 3:993–1022, March 2003.
[5] M. Bostock, V. Ogievetsky, and J. Heer. D3 data-driven docu-
ments. IEEE Transactions on Visualization and Computer Graphics,
17(12):2301–2309, 2011.
[6] Y. Chen, L. Wang, M. Dong, and J. Hua. Exemplar-based visualization
of large document corpus (infovis2009-1115). IEEE Transactions on
Visualization and Computer Graphics, 15(6):1161–1168, 2009.
[7] W. Cui, S. Liu, L. Tan, C. Shi, Y. Song, Z. Gao, H. Qu, and
X. Tong. Textflow: Towards better understanding of evolving topics
in text. IEEE Transactions on Visualization and Computer Graphics,
17(12):2412–2421, Dec. 2011.
[8] W. Cui, H. Qu, H. Zhou, W. Zhang, and S. Skiena. Watch the story un-
fold with textwheel: Visualization of large-scale news streams. ACM
Transactions on Intelligent Systems and Technology (TIST), 3(2):20,
2012.
[9] W. Dou, X. Wang, D. Skau, W. Ribarsky, and M. X. Zhou. Leadline:
Interactive visual analysis of text data through event identification and
exploration. In Proceedings of IEEE Conference on Visual Analytics
Science and Technology, pages 93–102, 2012.
[10] Ecma-International. Ecma-404 the JSON data interchange standard.
http://www.json.org, 2013.
[11] E. R. Gansner, Y. Hu, and S. C. North. Interactive visualization of
streaming text data with dynamic maps. Journal of Graph Algorithms
and Applications, 17(4):515–540, 2013.
[12] M. Ghoniem, D. Luo, J. Yang, and W. Ribarsky. Newslab: Exploratory
broadcast news video analysis. In Proceedings of the 2007 IEEE Sym-
posium on Visual Analytics Science and Technology, pages 123–130,
2007.
[13] S. Havre, P. Whitney, and L. Nowell. Themeriver: Visualizing the-
matic changes in large document collections. IEEE Transactions on
Visualization and Computer Graphics, 8:9–20, 2002.
[14] Y. Ishikawa and M. Hasegawa. T-scroll: Visualizing trends in a time-
series of documents for interactive user exploration. Lecture Notes in
Computer Science, 4675:235–246, Nov. 2007.
[15] M. Krstajic, E. Bertini, and D. Keim. Cloudlines: compact display
of event episodes in multiple time-series. Visualization and Computer
Graphics, IEEE Transactions on, 17(12):2432–2439, 2011.
[16] M. Krstajic, M. Najm-Araghi, F. Mansmann, and D. A. Keim. Incre-
mental visual text analytics of news story development. In IS&T/SPIE
Electronic Imaging, pages 829407–829407. International Society for
Optics and Photonics, 2012.
[17] H. W. Kuhn. The Hungarian method for the assignment problem.
Naval Research Logistics Quarterly, 2:83–97, 1955.
[18] B. Lee, N. H. Riche, A. K. Karlson, and S. Carpendale. SparkClouds:
Visualizing trends in tag clouds. IEEE Trans. Visualization and Com-
puter Graphics, 16(6):1182–1189, 2010.
[19] J. Leskovec, L. Backstrom, and J. Kleinberg. Meme-tracking and the
dynamics of the news cycle. In Proceedings of the 15th ACM SIGKDD
international conference on Knowledge discovery and data mining,
pages 497–506, 2009.
[20] S. Liu, M. X. Zhou, S. Pan, W. Qian, W. Cai, and X. Lian. Interactive,
topic-based visual text summarization and analysis. In Proceeding
of the 18th ACM conference on Information and knowledge manage-
ment, pages 543–552, 2009.
[21] D. Luo, J. Yang, M. Krstajic, W. Ribarsky, and D. Keim. Eventriver:
Visually exploring text collections with temporal references. Visual-
ization and Computer Graphics, IEEE Transactions on, 18(1):93–105,
2012.
[22] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation
ranking: Bringing order to the web. Technical Report, 1998.
[23] F. Paulovich and R. Minghim. Hipp: A novel hierarchical point place-
ment strategy and its application to the exploration of document col-
lections. IEEE Transaction on Visualization and Computer Graphics,
16(8):1229–1236, Nov. 2008.
[24] V. Wang, F. Salim, and P. Moskovits. The Definitive Guide to HTML5
WebSocket, Build Real-Time Applications with HTML5. Apress, 2013.
[25] F. Wei, S. Liu, Y. Song, S. Pan, M. Zhou, W. Qian, L. Shi, L. Tan, and
Q. Zhang. Tiara: a visual exploratory text analytic system. In Proc.
KDD, pages 153–162, 2010.
[26] J. A. Wise, J. J. Thomas, K. Pennock, D. Lantrip, M. Pottier, A. Schur,
and V. Crow. Visualizing the non-visual: spatial analysis and interac-
tion with information for text documents. Readings in information
visualization: using vision to think, pages 442–450, 1999.
[27] X. Zhu, A. Goldberg, J. V. Gael, and D. Andrzejewski. Improving di-
versity in ranking using absorbing random walks. HLT-NAACL, pages
97–104, 2007.
8