Streaming Ultra High Resolution Images to Large Tiled
Display at Nearly Interactive Frame Rates with vl3
Jie Jiang
University of Illinois-Chicago
jjiang24@uic.edu
Mark Hereld
Argonne National Laboratory
hereld@anl.gov
Joseph Insley
Argonne National Laboratory
insley@anl.gov
Michael E. Papka
Argonne National Laboratory
Northern Illinois University
papka@anl.gov
Silvio Rizzi
Argonne National Laboratory
srizzi@anl.gov
Thomas Uram
Argonne National Laboratory
turam@anl.gov
Venkatram Vishwanath
Argonne National Laboratory
venkat@anl.gov
ABSTRACT
Visualization of large-scale simulations running on super-
computers requires ultra-high resolution images to capture
important features in the data. In this work, we present
a system for streaming ultra-high resolution images from a
visualization cluster to a remote tiled display at nearly in-
teractive frame rates. vl3, a modular framework for large
scale data visualization and analysis, provides the backbone
of our implementation. With this system we are able to
stream over the network volume renderings of a 2048³ voxel
dataset at a resolution of 6144x3072 pixels with a frame rate
of approximately 3.3 frames per second.
Categories and Subject Descriptors
I.3.2 [Computing Methodologies]: Computer Graphics-
Graphics Systems: Distributed/network graphics
1. INTRODUCTION
The increasing computing power of leadership supercom-
puters enables scientific simulations at a very large scale
and produces enormous amounts of data. Large scale data
may be challenging to analyze and visualize. The first chal-
lenge is to efficiently recognize and perceive features in the
data. For this, large ultra-high resolution displays become a
prevalent tool for presenting big data with great detail. The
survey in [9] summarizes quantitative and qualitative eval-
uations regarding visual effects and user interaction with
large high-resolution displays. These studies validate the
positive effects of large high-resolution displays on human
performance exploring big datasets.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SC ’15 Austin, Texas USA
Copyright 2015 ACM X-XXXXX-XX-X/XX/XX ...$15.00.
Figure 1: A 6144x3072 pixel image streamed to a 6x4
projector-based tiled display. Image rendered from a 4096³
voxel dataset.
The second challenge is visualizing large scale data effi-
ciently. In many domains, including cosmology, astrophysics
and biosciences, large scale simulations running on lead-
ership supercomputers generate extremely large data sets.
For example, the Hardware/Hybrid Accelerated Cosmology
Code (HACC) framework [2] has modeled more than a tril-
lion particles in their simulations. Exploring data interac-
tively at such scales requires a visualization framework that
can render, composite, and stream frames at a sufficient rate.
Our system is built on vl3, an efficient parallel visual-
ization framework. It achieves a nearly-interactive stream-
ing frame rate of ultra-high resolution images by leverag-
ing a parallel compositing scheme and multiple streaming
channels between a visualization cluster and a large high-
resolution tiled display. Figure 1 shows a visualization of a
4096³ voxel volumetric fluid simulation on a 6144x3072 pixel
tiled display.
2. BACKGROUND
The system presented in this work relies on large tiled
displays, parallel volume rendering, and remote rendering
and streaming. In this section we describe previous research
in these areas.
2.1 SAGE
The scalable adaptive graphics environment (SAGE) [6] is
a middleware developed by the Electronic Visualization Lab-
oratory at the University of Illinois at Chicago to support
distance collaboration in ultra resolution display environ-
ments. The SAGE architecture enables data, high-definition
video, and high-resolution graphics to be streamed in real-
time from remotely distributed rendering and storage clus-
ters to scalable display walls over high-speed networks. The
framework supports the streaming of multiple dynamic ap-
plication windows [4], but it does not support user interac-
tion within each application. It also lacks a native visual-
ization tool for large scale data. SAGE2 [7] is a complete
redesign and implementation of SAGE built on cloud-based
and web browser technologies, focusing on data intensive
co-located and remote collaboration.
2.2 DisplayCluster
DisplayCluster [5] is an interactive visualization environ-
ment for cluster-driven tiled displays. It is designed as a
desktop-like windowing system that can present media in na-
tive high-resolution and also stream graphics content. Dis-
playCluster has been used on Stallion, a 15 x 5 tiled dis-
play wall with a resolution of 2560 x 1600 for each tile, and
is driven by a visualization cluster consisting of 23 render
nodes and one head node. In contrast, our approach focuses
on driving a tiled display from a single node, though it also
supports a cluster configuration.
2.3 vl3
vl3 is a parallel visualization and data analysis framework
developed at Argonne National Laboratory and the Com-
putation Institute of the University of Chicago. It supports
hardware-accelerated rendering of point sprites for particle
data sets and ray casting for volume rendering of regular
grids. vl3 has been used to interactively render large scale
datasets [11]. Its modular design and extensible architecture
allow for rapid development and testing of new functional-
ities. Moreover, a high level of parallelism across all stages
of the rendering pipeline is crucial for achieving an excellent
scalability, as shown in [10].
2.4 Streaming
Support for streaming of large scale rendered images was
added to vl3 to enable remote visualization on tiled displays
[3]. This capability was first showcased at the SC09 con-
ference, where real-time volume renderings of a 4096³ Enzo
cosmology data set were generated on a visualization clus-
ter at Argonne National Laboratory and streamed to a 5x3
multi-panel display wall on the conference exhibit floor in
Portland, Oregon. This was demonstrated again at the SC10
conference, with the addition of interactive controls and used
to stream multiple visualizations of different variables from
the simulation for interactive exploration and comparison.
3. FRAMEWORK
vl3, a parallel framework for real-time interactive visual-
ization, can run on multiple hardware platforms. Here we
focus on high performance visualization clusters for remote
rendering and streaming of ultra high resolution visualiza-
tions.
Figure 2: System topology and pipeline.
3.1 Design
Our design consists of four components (Figure 2): visual-
ization cluster, tiled display, high speed network connection,
and interaction node.
The process starts at the visualization cluster generating
images and streaming them over a high speed network to the
tiled display. The tiled display presents the current image
to the user, while the interaction node provides a graphical
user interface that takes user input and sends control events
to the visualization cluster.
The system leverages parallelism during data loading, ren-
dering, compositing, streaming and displaying for maximum
performance. It exploits asynchronous streaming to over-
come the disparity between the rendering capacity of the
visualization cluster and the bandwidth of the high speed
network. With this architecture we have three frame rates
that measure performance in various stages of the pipeline:
(i) rendering frame rate on the visualization cluster, (ii)
streaming frame rate for the network communication, and
(iii) graphics update frame rate on the tiled display.
3.1.1 Visualization cluster
The visualization cluster provides computational power to
load, process, and render large data sets in parallel. The par-
allel direct send algorithm is used for compositing [1]. Each
compositor sends one or more streams of images to the tiled
display resource. The last step of gathering tiles from each
compositor and generating a single integrated image is left to
the displaying end. This approach reduces the communica-
tion overhead during compositing, increasing the rendering
frame rate. This adds no extra work on the displaying end
and works consistently with our multiple channel streaming.
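As a minimal sketch of how the full frame might be partitioned among compositors for direct send, consider a simple horizontal-strip layout; the actual partitioning used by vl3 may differ, and the function name here is illustrative:

```python
# Sketch (not vl3 source): divide a full-resolution frame into
# per-compositor regions for direct-send compositing, where each
# compositor streams its finished sub-image straight to the display.

def partition_image(width, height, n_compositors):
    """Split the frame into n_compositors horizontal strips.
    Returns an (x, y, w, h) region for each compositor."""
    regions = []
    base = height // n_compositors
    extra = height % n_compositors
    y = 0
    for i in range(n_compositors):
        h = base + (1 if i < extra else 0)
        regions.append((0, y, width, h))
        y += h
    return regions

# Example: a 6144x3072 frame split among 4 compositors.
for region in partition_image(6144, 3072, 4):
    print(region)
```

Because each compositor sends its region directly to the display, no gather step is needed on the cluster side.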
3.1.2 Tiled display
The tiled display can be driven by either a single com-
puter with multiple graphics cards, or a cluster of worksta-
tions. An MPI-enabled parallel client runs on the tiled dis-
play single computer or cluster. Each client process receives
a chunk of the final image from the corresponding streamer
and shows it on its corresponding area of the tiled display.
The client application has been designed with flexibility in
mind, so that it can run on different hardware configura-
tions.
3.1.3 High speed network
The visualization cluster and tiled display are connected
through a high speed network. The network bandwidth
and latency determine an upper bound for the streaming
frame rate. We use a multi-channel streaming scheme to
take into account various hardware environments. We ad-
just the streaming configuration according to the network
topology between the visualization cluster and the tiled dis-
play, optimizing the bandwidth utilization and streaming
frame rate.
3.1.4 Interaction node
The interaction node provides a graphical user interface
(GUI) for intuitive user interactions. It sends control events
to the visualization cluster using the http protocol. Large
tiled displays do not necessarily support a common interac-
tion device. This poses a challenge for capturing user in-
put across a diverse set of display technologies. By decou-
pling the display and interaction components, we overcome
the variability of available interaction methods between tiled
displays. A consistent interaction application, with familiar
GUI components, should also have an easier learning curve
for the scientists using the system.
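A control event might be posted as in the following Python sketch; the endpoint path, port, and payload fields are our assumptions for illustration, since the paper only states that events travel over HTTP:

```python
import json
from urllib import request

# Hypothetical control event; the endpoint and payload schema are
# illustrative, not vl3's actual protocol.
event = {"type": "camera", "position": [0.0, 0.0, 5.0],
         "look_at": [0.0, 0.0, 0.0]}
body = json.dumps(event).encode()
req = request.Request("http://viz-master:8080/control", data=body,
                      headers={"Content-Type": "application/json"})
print(req.get_method())  # urllib infers POST when data is attached
```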
3.2 Implementation
vl3 runs on a visualization cluster and is the core of our
system. In this section, we present extensions developed to
support driving the tiled display and communicating with
the interaction node. The system architecture is shown in
Figure 2.
3.2.1 Streaming Configuration
In this section we define two concepts: group stream-
ing partition and parallel compositing configuration. Group
streaming partition refers to the layout for dividing the im-
age into rectangular sub-images. It defines the total num-
ber of streams as well as the resolution and offset of each
streamed image. Adjusting the total number of streams af-
fects utilization of the aggregated network bandwidth to the
tiled display. Parallel compositing configuration defines the
total number of parallel compositing processes on the visu-
alization cluster, as well as the resolution and offset of the
partial image for each compositor. Altering the number of
compositors affects both the visualization performance and
bandwidth utilization for each compositor.
Combining the group streaming partition and the parallel
compositing configuration we generate a streaming configu-
ration. The streaming configuration contains an entry for
each stream along with information of the resolution, offset,
and hostname of the streaming source. The master process
on the visualization cluster generates the streaming config-
uration as a json package. The tiled display application re-
trieves the streaming configuration. With this configuration
file it has the information required to initiate group stream-
ing receivers and assemble the image tiles appropriately to
produce the final image. In this way, the streaming configuration
improves the usability of the system by simplifying the
configuration procedure.
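The streaming configuration might look like the following JSON sketch; the field names and hostnames are our illustration rather than vl3's actual schema:

```python
import json

# Hypothetical streaming configuration: one entry per stream, with
# the resolution, offset, and hostname of the streaming source.
config_json = """
{
  "streams": [
    {"host": "cooley-node01", "port": 9000,
     "offset": [0, 0],    "resolution": [3072, 3072]},
    {"host": "cooley-node02", "port": 9000,
     "offset": [3072, 0], "resolution": [3072, 3072]}
  ]
}
"""

config = json.loads(config_json)
for stream in config["streams"]:
    x, y = stream["offset"]
    w, h = stream["resolution"]
    print(f"receive {w}x{h} tile from {stream['host']} at ({x}, {y})")
```

With this information the display client knows how many receivers to launch and where each incoming tile belongs in the final image.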
3.2.2 Display application
We developed a Python-based and MPI-aware display client.
The Python code is light-weight and provides great cross-
platform compatibility. We use MPI for inter-process com-
munication, which makes our code compatible with both
single-node and cluster-driven tiled displays.
The display client launches in two stages. In stage one,
the display client connects to the master process on the vi-
sualization cluster to retrieve the streaming configuration.
Subsequently, in stage two, the client parses the stream-
ing configuration and launches MPI-aware instances of the
core threads that receive images from the visualization clus-
ter and display them at their corresponding position on the
tiled display.
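The two-stage launch can be sketched as follows. Rank 0 would fetch the streaming configuration and broadcast it with mpi4py (e.g. `config = comm.bcast(config, root=0)`); the rank-to-stream mapping below is the part that stays runnable without an MPI launcher, and the one-stream-per-rank layout is an assumption for illustration:

```python
# Sketch (not vl3 source) of stage two of the client launch: each
# MPI process takes the stream entry matching its rank and would
# then start a receiver thread for that stream.

def assign_stream(streams, rank):
    """Return the stream entry owned by this MPI rank."""
    if rank >= len(streams):
        raise ValueError("more ranks than streams in the configuration")
    return streams[rank]

streams = [
    {"host": "cooley-node01", "offset": (0, 0)},
    {"host": "cooley-node02", "offset": (3072, 0)},
]
print(assign_stream(streams, rank=1)["host"])  # cooley-node02
```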
3.2.3 Streaming
We use Celeritas [12] for high-throughput streaming of raw
pixel data over TCP. The visualization cluster uses asyn-
chronous rendering and streaming. We take into account
the difference between the streaming frame rate and render-
ing frame rate using a fixed-length frame queue. For that,
the streaming sender pops the head frame from the queue
and sends it out. In the meantime, if there is an empty slot
in the streaming queue, the renderer will push the current
frame into the queue; otherwise the current frame will be
discarded.
When the streaming frame rate is higher than the ren-
dering frame rate, every rendered frame will be sent out.
Otherwise, when the streaming frame rate is lower than the
rendering frame rate, new frames are pushed into the queue
faster than they are streamed to the client. In that case, once
the queue reaches its maximum capacity, newly rendered
frames are discarded until a slot in the streaming queue be-
comes available again. This ensures that the server always
streams the latest frame, also improving interactivity.
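The fixed-length frame queue described above can be sketched with Python's standard `queue` module; the queue length and function names are illustrative:

```python
import queue

# Sketch of the fixed-length frame queue: the renderer pushes frames
# without blocking, and a frame is discarded when the queue is full,
# so the renderer is never stalled by a slow network.
frame_queue = queue.Queue(maxsize=4)

def renderer_push(frame):
    """Called by the render loop; returns False if the frame was dropped."""
    try:
        frame_queue.put_nowait(frame)
        return True
    except queue.Full:
        return False  # streamer is behind; discard this frame

def streamer_pop():
    """Called by the streaming thread; blocks until a frame is ready."""
    return frame_queue.get()

# Renderer outpacing the streamer: frames 0-3 fill the queue, 4-5 drop.
sent = [renderer_push(i) for i in range(6)]
print(sent)            # [True, True, True, True, False, False]
print(streamer_pop())  # 0 (frames leave the queue in FIFO order)
```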
3.2.4 Synchronization
Synchronization among processes of the display client is
indispensable to ensure the image integrity on the tiled dis-
play and to avoid tearing between tiles while updating ren-
dering parameters. We implement a dual synchronization
mechanism similar to the one described in [8]: data syn-
chronization and swap buffer synchronization.
Data synchronization ensures that each streaming re-
ceiver (i.e., a single client process) gets the same frame from
its streaming source. On the visualization cluster, the
streaming sender adds a frame id in increasing order at the
end of each send buffer. Streaming sources synchronize be-
fore pushing a rendered frame into the streaming queue; they
only push the current frame into the queue when every pro-
cess has an empty slot. This guarantees that every streaming
client will be receiving an identical frame sequence.
Swap buffer synchronization takes care of synchro-
nization of the receiving and graphics threads. The dis-
play client uses a double buffer for graphics update. Every
streaming receiver synchronizes at the end of each received
frame. They confirm everyone has an identical frame and
then update their graphics buffer. At the same time, graph-
ics threads synchronize before each buffer swap to ensure
that the whole display updates the displaying content syn-
chronously.
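The dual mechanism can be sketched with two thread barriers, one per synchronization point; the structure below is our illustration of the idea, not vl3's implementation:

```python
import threading

# Sketch of dual synchronization: receivers agree on a common frame
# before updating the back buffer (data synchronization), then all
# tiles barrier again before the buffer swap (swap buffer
# synchronization), so the whole wall updates at once.

N_TILES = 4
data_barrier = threading.Barrier(N_TILES)   # data synchronization
swap_barrier = threading.Barrier(N_TILES)   # swap buffer synchronization
results = []

def tile_client(tile_id, frame_id):
    # ... receive frame bytes; the sender appends the frame id ...
    data_barrier.wait()   # all tiles now hold the same frame id
    # ... copy pixels into the back buffer ...
    swap_barrier.wait()   # swap only when every tile is ready
    results.append((tile_id, frame_id))

threads = [threading.Thread(target=tile_client, args=(t, 42))
           for t in range(N_TILES)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(results))  # every tile swapped frame 42 together
```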
3.2.5 Interaction
A Qt-based interaction client runs on a separate node.
The interaction client provides a GUI that allows intuitive
and easy user interaction with the visualization cluster. The
user has precise control of camera position, transfer function,
color values, and other parameters. The interaction client
sends all user updates to the visualization cluster, and the
modifications appear on the tiled display in subsequently
rendered frames.
Figure 3: Network performance experiment results. Frame
buffer size is fixed at 6144x3072 pixels.
4. EXPERIMENTS
We perform two experiments to evaluate our system's per-
formance. The network bandwidth test validates the improve-
ment in bandwidth utilization with multi-channel streaming.
The weak scalability test shows the improvement in system
performance with multi-channel streaming at different scales.
4.1 Experimental Platform
We use computing and networking resources at the Ar-
gonne Leadership Computing Facility, Argonne National Lab-
oratory. The visualization cluster Cooley and the tiled display
Ocular are both located in the Theory and Computing Sciences
building and are connected to each other by a high speed
network.
4.1.1 Cooley
Cooley is a visualization cluster with a total of 126 com-
pute nodes; each node has two 2.4 GHz Intel Haswell E5-
2620 CPUs (6 cores per CPU, 12 cores total) and one NVIDIA
Tesla K80 dual-GPU card. Memory on Cooley is 384GB
RAM and 24GB GPU memory per node. Aggregate GPU
peak performance is over 293 teraflops double precision. Net-
work interconnect is FDR Infiniband. Each Cooley node has
an independent 10Gbps bandwidth ethernet connection to
one of three aggregating switches.
4.1.2 Ocular
Ocular is a display node with 8 graphics cards driving a
6x4 projector-based tiled display. Each tile has a resolution
of 1024x768 pixels. The full resolution for the tiled display is
6144x3072 pixels. Ocular has a 10Gbps ethernet over fiber
connection.
The 10Gbps connection between Ocular and Cooley is the
communication bottleneck.
4.2 Results
Our first experiment studies the network capacity by as-
sessing the quantitative improvement in aggregated bandwidth from
group streaming. We create a parallel streaming sender that
uses Celeritas and continuously streams an aggregate buffer
of 6144x3072 pixels in parallel. The streaming sender runs
on Cooley while our display client receives and displays the
buffer on Ocular. We vary the number of streams and mea-
sure the receiving frame rate and the corresponding band-
width utilization. The experimental results are shown in
Figure 3.
Figure 4: System performance experiment results.
We observe that the streaming frame rate increases with
the number of streams. As the number of streams increases
from 1 to 24, the streaming frame rate rises from around
4 FPS to 17 FPS; the equivalent bandwidth at 17 FPS
is around 7344 Mbps. The available bandwidth between
Cooley and Ocular is 10 Gbps. This result shows that group
streaming improves bandwidth utilization and that, with
24 streams, our high speed network can deliver a streaming
frame rate sufficient for real-time user interaction.
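The reported figure can be checked with a quick back-of-the-envelope calculation, assuming 3 bytes (RGB) per pixel and binary megabits:

```python
# Rough check of the reported equivalent bandwidth, assuming
# 3-byte RGB pixels and binary (2^20) megabits per second.
width, height, fps = 6144, 3072, 17
bytes_per_frame = width * height * 3
mbps = bytes_per_frame * fps * 8 / 2**20
print(round(mbps))  # 7344, matching the figure above
```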
Our second experiment tests the weak scalability of the visu-
alization system. We keep a constant data load of a 512³ voxel
volume on each GPU of the visualization cluster, while the
full image resolution remains constant at 6144x3072 pixels.
We run 4 cases, where each case doubles the total number of
working GPUs. We assign a single stream for each compos-
itor. Within each case, we run multiple samples, doubling
the number of compositors for each sample. We measure
the streaming frame rate at Ocular. The first few frames
are discarded while the network connection stabilizes. We
then take 300 frames and calculate the average frame rate
every 5 seconds over that period.
Experimental results are shown in Figure 4. Each line
draws the average frame rate and the maximum/minimum
frame rate observed during each case. For each case with a
fixed number of GPUs, we generally observe a performance
boost as we increase the number of compositors/streams, ex-
cept for samples where the number of streams/compositors
equals the number of GPUs.
for the last sample in each test is expected. Each Cooley
node has 2 GPUs. While all GPUs are used for render-
ing for all tests, we double the number of GPUs used for
compositing/streaming at each sample. Increasing the num-
ber of nodes used for compositing/streaming increases the
available bandwidth for compositing communication. For
early samples, only one GPU per node is used for composit-
ing/streaming, and we see an increase in performance. How-
ever, in the last sample both GPUs on each node are used for
compositing/streaming. In this case the two GPUs share the
available bandwidth on that node, so performance remains
relatively constant.
Moreover, the system performance does not reach the
frame rates achieved in our network experiments, indicating
that it is not the network capacity, or the sending and re-
ceiving components, but rather the rendering performance
that is the bottleneck. The overall scaling performance of vl3
was modeled and measured in [10], where the network com-
munication component of compositing emerged as the bot-
tleneck. In our current experiment, as we assign more com-
positors, we conclude that it is again this network communi-
cation that limits scalability.
5. CONCLUSIONS
Our system visualizes ultra high resolution images of large
scale simulation data on a large tiled display at nearly in-
teractive frame rates. Our experiments give very promising
results and show the potential of streaming ultra high res-
olution visualizations at an interactive frame rate. They verify
that combining multi-channel streaming with parallelism in
the visualization pipeline can practically raise the efficiency
and utilization of existing hardware. Our methodology and
design could be applied to existing parallel visualization sys-
tems to improve their capability of handling large data sets.
Meanwhile, the modular design of our distributed system
could be modified and extended to fit specific visualization
tasks.
The future work of this project is to improve the scala-
bility of this system by optimizing the performance of the
compositor within vl3. This will enable the visualization
of larger data sets and provide higher resolution images at
interactive frame rates. Exploring the application of our sys-
tem for collaboration among geographically distributed groups
would also be an interesting topic.
6. ACKNOWLEDGMENTS
This work was supported by the Office of Advanced Sci-
entific Computing Research, Office of Science, U.S. Depart-
ment of Energy, under Contract DE-AC02-06CH11357 in-
cluding the Scientific Discovery through Advanced Com-
puting (SciDAC) Institute for Scalable Data Management,
Analysis, and Visualization. This research has been funded
in part and used resources of the Argonne Leadership Com-
puting Facility at Argonne National Laboratory, which is
supported by the Office of Science of the U.S. Department
of Energy under contract DE-AC02-06CH11357.
7. REFERENCES
[1] S. Eilemann and R. Pajarola. Direct send compositing
for parallel sort-last rendering. In Proceedings of the
7th Eurographics conference on Parallel Graphics and
Visualization, pages 29–36. Eurographics Association,
2007.
[2] S. Habib, V. Morozov, N. Frontiere, H. Finkel,
A. Pope, and K. Heitmann. HACC: Extreme scaling and
performance across diverse architectures. In
Proceedings of the International Conference on High
Performance Computing, Networking, Storage and
Analysis, page 6. ACM, 2013.
[3] M. Hereld, J. Insley, E. C. Olson, M. E. Papka,
V. Vishwanath, M. L. Norman, and R. Wagner.
Exploring large data over wide area networks. In Large
Data Analysis and Visualization (LDAV), 2011 IEEE
Symposium on, pages 133–134. IEEE, 2011.
[4] B. Jeong, L. Renambot, R. Jagodic, R. Singh,
J. Aguilera, A. Johnson, and J. Leigh.
High-performance dynamic graphics streaming for
scalable adaptive graphics environment. In SC 2006
Conference, Proceedings of the ACM/IEEE, pages
24–24. IEEE, 2006.
[5] G. P. Johnson, G. D. Abram, B. Westing, P. Navrátil,
and K. Gaither. DisplayCluster: An interactive
visualization environment for tiled displays. In Cluster
Computing (CLUSTER), 2012 IEEE International
Conference on, pages 239–247. IEEE, 2012.
[6] J. Leigh, L. Renambot, A. Johnson, R. Jagodic,
H. Hur, E. Hofer, and D. Lee. Scalable adaptive
graphics middleware for visualization streaming and
collaboration in ultra resolution display environments.
In Proc. of Workshop on Ultrascale Visualization,
pages 47–54, 2008.
[7] T. Marrinan, J. Aurisano, A. Nishimoto,
K. Bharadwaj, V. Mateevitsi, L. Renambot, L. Long,
A. Johnson, and J. Leigh. SAGE2: A new approach for
data intensive collaboration using scalable resolution
shared displays. In Collaborative Computing:
Networking, Applications and Worksharing
(CollaborateCom), 2014 International Conference on,
pages 177–186. IEEE, 2014.
[8] S. Nam, S. Deshpande, V. Vishwanath, B. Jeong,
L. Renambot, and J. Leigh. Multi-application
inter-tile synchronization on ultra-high-resolution
display walls. In Proceedings of the first annual ACM
SIGMM conference on Multimedia systems, pages
145–156. ACM, 2010.
[9] T. Ni, G. S. Schmidt, O. G. Staadt, M. Livingston,
R. Ball, and R. May. A survey of large high-resolution
display technologies, techniques, and applications. In
Virtual Reality Conference, 2006, pages 223–236.
IEEE, 2006.
[10] S. Rizzi, M. Hereld, J. Insley, M. E. Papka, T. Uram,
and V. Vishwanath. Performance Modeling of vl3
Volume Rendering on GPU-Based Clusters. In
M. Amor and M. Hadwiger, editors, Eurographics
Symposium on Parallel Graphics and Visualization.
The Eurographics Association, 2014.
[11] S. Rizzi, M. Hereld, J. Insley, M. E. Papka, T. Uram,
and V. Vishwanath. Large-Scale Parallel Visualization
of Particle-Based Simulations using Point Sprites and
Level-Of-Detail. In C. Dachsbacher and P. Navrátil,
editors, Eurographics Symposium on Parallel Graphics
and Visualization. The Eurographics Association,
2015.
[12] V. Vishwanath. LambdaRAM: A High-Performance,
Multi-Dimensional, Distributed Cache Over Ultra-High-
Speed Networks. PhD thesis, University of Illinois at
Chicago, 2009.