Data analysts have to deal with an ever-growing amount of data resources. One way to make sense of this data is to extract features and use clustering algorithms to group items according to a similarity measure. Algorithm developers find it challenging to evaluate the performance of an algorithm because it is hard to identify the features that influence the clustering. Moreover, many algorithms can be trained using a semi-supervised approach, where human users provide ground truth samples by manually grouping single items. Hence, visualization techniques are needed that help data analysts evaluate big data clustering algorithms. In this context, Multidimensional Scaling (MDS) has become a prominent visualization tool. In this paper, we propose combining MDS with glyphs that provide a detailed view of the specific features involved in MDS. As a result, human users can understand, adjust, and ultimately improve clustering algorithms. We present a thorough glyph design grounded in a comprehensive survey of related work, and report the results of a controlled experiment in which participants solved data analysis tasks with both glyphs and a traditional textual display of data values.
Towards Glyph-based Visualizations for Big Data Clustering
1. VINCI 2017
The 10th International Symposium on Visual Information Communication and Interaction
Towards Glyph-based Visualizations for Big Data Clustering
AUG 15, 2017
Chair of Media Design, Technische Universität Dresden
Mandy Keck, Dietrich Kammer, Thomas Gründer, Thomas Thom, Martin Kleinsteuber, Alexander Maasch, Rainer Groh
2. Structure
Part 1: Problem Description, Related Work
Part 2: Data Set, Glyph Design
Part 3: User Study
Part 4: Lessons Learned, Future Work
3. Research Project
VANDA - Visual Analytics Interfaces for Big Data Environments
Data Analytics, Copyright Observation
Data Crawling, Content Exploration
Data Analytics and Text Mining for Smart Adaptive Learning Environments
Research on Human Computer Interaction and Information Visualization
Purchasing Platform between Businesses with Millions of Products
www.vanda-project.de
VINCI 2017 AUG 15, 2017 M. Keck et al. 3 | 22
4. Multidimensional Scaling [Torgerson 1952, Munzner 2014]
HD DATA: Item 1, Item 2, Item 3, ..., Item n, each described by Dimension 1, Dimension 2, Dimension 3, Dimension 4, Dimension 5, ..., Dimension n
2D DATA: Item 1, Item 2, Item 3, ..., Item n, each described by Dimension 1 and Dimension 2
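The reduction from high-dimensional items to 2D positions can be sketched with classical (Torgerson) MDS. The snippet below is a minimal illustration and not the implementation used in the project: it assumes Euclidean distances between items, and the function names (`euclidean`, `doubleCenter`, `dominantEigen`, `classicalMds`) are hypothetical. The two leading eigenvectors are found by plain power iteration with deflation, which is sufficient for small data sets.

```javascript
// Classical MDS sketch (assumption: Euclidean distances, small data sets).
// All function names are hypothetical and not taken from the slides.

function euclidean(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += (a[i] - b[i]) ** 2;
  return Math.sqrt(sum);
}

// Double-center the squared distance matrix: B = -1/2 * J * D^2 * J.
function doubleCenter(items) {
  const n = items.length;
  const d2 = items.map(a => items.map(b => euclidean(a, b) ** 2));
  const rowMean = d2.map(row => row.reduce((s, v) => s + v, 0) / n);
  const grandMean = rowMean.reduce((s, v) => s + v, 0) / n;
  return d2.map((row, i) =>
    row.map((v, j) => -0.5 * (v - rowMean[i] - rowMean[j] + grandMean)));
}

// Power iteration: dominant eigenpair of a symmetric matrix.
function dominantEigen(m, iterations = 500) {
  const multiply = v => m.map(row => row.reduce((s, x, j) => s + x * v[j], 0));
  let v = m.map((_, i) => 1 + i / m.length); // deterministic start vector
  const startNorm = Math.hypot(...v);
  v = v.map(x => x / startNorm);
  for (let it = 0; it < iterations; it++) {
    const w = multiply(v);
    const norm = Math.hypot(...w);
    if (norm < 1e-12) break; // nothing left to extract along v
    v = w.map(x => x / norm);
  }
  const mv = multiply(v);
  const value = v.reduce((s, x, i) => s + x * mv[i], 0); // Rayleigh quotient
  return { value, vector: v };
}

// Returns one [x, y] position per item.
function classicalMds(items) {
  let b = doubleCenter(items);
  const coords = items.map(() => []);
  for (let k = 0; k < 2; k++) {
    const { value, vector } = dominantEigen(b);
    const scale = Math.sqrt(Math.max(value, 0));
    vector.forEach((x, i) => coords[i].push(scale * x));
    // Deflate so the next pass finds the next eigenvector.
    b = b.map((row, i) => row.map((v, j) => v - value * vector[i] * vector[j]));
  }
  return coords;
}

// Usage: four items with three dimensions each, reduced to 2D positions.
const positions = classicalMds([[0, 0, 0], [3, 0, 0], [0, 4, 0], [3, 4, 0]]);
```

Items that are similar in the high-dimensional space end up close to each other in the 2D layout, which is what makes the layout usable as a canvas for placing glyphs.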
5. Visualization of Multi-dimensional Data Sets [Keim 2000]
Scatterplot (Geometric Technique) [w1]
Glyphs (Icon-based Technique) [w2]
Pixel-oriented Technique [Stefaner 2010]
6. Using Glyphs for Cluster Analysis
7. Glyph Design
Flower Glyph
A1: Length of each petal encodes a quantitative attribute
A2: Redundant encoding: attribute value is mapped to length and brightness for different levels of detail (LoD)
A3: Radius border to enhance the identification of maximum values
Star Plot
B1: Star plot: whisker plot with connected endpoints of each line
B2: Filling the resulting shape for color encoding
B3: Absolute axes to improve the identification of extreme values + transparency to enhance the visibility of the coordinate system
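The two encodings can be sketched as small geometry helpers. This is an illustrative sketch only, not the authors' d3.js code: `starPlotPoints` and `petalEncoding` are hypothetical names, the first axis is assumed to point upwards, and the direction of the brightness mapping (higher value gives a higher brightness factor) is an assumption that the slides do not specify.

```javascript
// Hypothetical helper: polygon vertices of a star plot for one data item.
// values are given on an absolute scale [0, maxValue]; radius is the axis length in pixels.
function starPlotPoints(values, maxValue, radius) {
  const step = (2 * Math.PI) / values.length;
  return values.map((value, i) => {
    const angle = i * step - Math.PI / 2; // assumption: first axis points upwards
    const clamped = Math.min(Math.max(value, 0), maxValue);
    const r = (clamped / maxValue) * radius;
    return [r * Math.cos(angle), r * Math.sin(angle)];
  });
}

// Hypothetical helper: redundant encoding of one flower petal.
// Assumption: brightness grows with the value; the slides only say that
// the value is mapped to both length and brightness.
function petalEncoding(value, maxValue, radius) {
  const t = Math.min(Math.max(value, 0), maxValue) / maxValue;
  return { length: t * radius, brightness: t };
}

// Usage: five attributes on the 0 to 10 scale, drawn with 40 px axes.
const outline = starPlotPoints([8, 2, 10, 0, 5], 10, 40);
const svgPoints = outline.map(p => p.join(",")).join(" "); // for an SVG <polygon>
```

Because every axis shares the same absolute maximum, a vertex that touches the outer radius always denotes a maximum value, which corresponds to design step B3.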
8. Data Set
Event data set with 15 attributes
6 attributes are selected for glyph design: price, popularity, time, distance, estimationmusic and category
Quantitative attributes are normalized to a value between 0 and 10
Categories are mapped to color
Glyph axes: price, popularity, time, estimationmusic, distance
Categories: entertainment, sports, education, band, tourism, beauty
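The preprocessing described above can be sketched as follows. The slides do not state which normalization was used, so min-max scaling is an assumption here, and the color values are placeholders rather than the palette of the original glyphs. `normalizeAttribute`, `categoryColors` and `colorFor` are hypothetical names.

```javascript
// Min-max normalization of one quantitative attribute to [0, targetMax].
// Assumption: the slides only say "normalized to a value between 0 and 10".
function normalizeAttribute(values, targetMax = 10) {
  const min = Math.min(...values);
  const max = Math.max(...values);
  if (max === min) return values.map(() => 0); // constant attribute: no spread to encode
  return values.map(v => ((v - min) / (max - min)) * targetMax);
}

// Placeholder palette: category names are from the slides, colors are invented.
const categoryColors = {
  entertainment: "#1f77b4",
  sports: "#ff7f0e",
  education: "#2ca02c",
  band: "#d62728",
  tourism: "#9467bd",
  beauty: "#8c564b",
};

function colorFor(category) {
  return categoryColors[category] || "#999999"; // fallback for unknown categories
}

// Usage: normalize a hypothetical price column and pick a glyph color.
const normalizedPrices = normalizeAttribute([5, 15, 25]);
const glyphColor = colorFor("tourism");
```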
9. Hypotheses
H1: Glyph-based visualizations reduce completion times in comparison tasks when compared to tabular displays
H2: Glyph-based visualizations reduce completion times to identify outliers and extreme values when compared to tabular displays
H3: Tabular display of data has the highest accuracy when compared to other visualization techniques
H4: Flower glyphs reduce completion times and increase accuracy when identifying extreme values
H5: Star plots reduce completion times and increase accuracy in comparison tasks
10. Test Setting
Implementation in JavaScript (jQuery, RequireJS, d3.js)
226 data items of the event data set
27" display with WQHD resolution in portrait orientation
12. Methodology
25 tasks per interface, divided into 5 task types:
Id_High: Identification of high extreme values. "Find an event with a value in popularity as high as possible!"
Id_Low: Identification of low extreme values. "Find an event with a price as low as possible!"
Cat_High: Identification of high extreme values in a specific category. "Find an event in the category tourism which has a high chance that music is played there!"
Cat_Low: Identification of low extreme values in a specific category. "Find an event in the category education that is very close to your city!"
Comp: Comparison of all 5 values to a provided example. "Find an event that is similar to the shown example!"
Example glyph axes: price, popularity, time, estimationmusic, distance
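13. Participants
Figure: participant overview with the labels TOTAL, AGE, USAGE (event data; very infrequent to very frequent) and EXPERIENCE (information visualization, glyph-based visualization; no experience to extensive experience)
Values as extracted: 14, 27, 41, 64, 14
Five-level distributions as extracted:
36.6%, 19.5%, 7.3%, 14.6%, 22.0%
17.0%, 7.3%, 36.6%, 29.3%, 9.8%
31.7%, 31.7%, 17.1%, 17.1%, 2.4%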
14. Results | Time
Figure: solution time in seconds (axis from 0 to 40) per task type (Id_High, Id_Low, Cat_High, Cat_Low, Comp) for TABLE, FLOWER and STAR
Main effects
Glyphs faster than Table, p < 0.001
Star plot faster than Table, p < 0.042
Pairwise comparison for tasks
Id_High: Glyphs faster than Table, Star faster than Flower
Cat_High, Cat_Low & Comp: Glyphs faster than Table
15. Results | Accuracy and Error Rate
Figure: accuracy (axis from 0 to 1.2) per task type (Id_High, Id_Low, Cat_High, Cat_Low, Comp) for TABLE, FLOWER and STAR
Main effects
Accuracy: no significant main effect for conditions, p = 0.504
Error Rate: no significant main effect for conditions, p = 0.122
Pairwise comparison for tasks
Id_Low: Table more accurate than Flower
Cat_Low: Table more accurate than Glyphs
16. Questionnaire
User Experience Questionnaire (UEQ) for each condition [Laugwitz et al. 2008]
26 adjective pairs, assigned to 6 factors: Attractiveness, Perspicuity, Efficiency, Dependability, Stimulation and Novelty
Each adjective pair uses a seven-stage scale (polarity is determined randomly)
Example pair: attractive / unattractive
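Because the polarity of each adjective pair is randomized, answers have to be re-oriented before they can be averaged per factor. The sketch below assumes the common UEQ convention of scoring each item from -3 (most negative) to +3 (most positive). `scoreItem` and `factorMean` are hypothetical helpers, and the assignment of the 26 items to the 6 factors is deliberately left out because it is not given in the slides.

```javascript
// Hypothetical helper: converts a seven-stage answer (1 = leftmost, 7 = rightmost)
// into a score from -3 to +3, taking the randomized polarity into account.
function scoreItem(answer, positiveTermOnLeft) {
  if (!Number.isInteger(answer) || answer < 1 || answer > 7) {
    throw new RangeError("answer must be an integer from 1 to 7");
  }
  return positiveTermOnLeft ? 4 - answer : answer - 4;
}

// Hypothetical helper: mean score of all items that belong to one factor.
function factorMean(scores) {
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

// Usage: "attractive / unattractive" has the positive term on the left,
// so an answer of 2 is a fairly positive rating.
const attractivenessScore = scoreItem(2, true);
```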
18. Results | Hypotheses
H1: Glyphs faster than Table for comparison tasks
Measure: Solution Time, Task: Comp
Pairwise comparison (p-value): Flower – Table < 0.001, Star – Table < 0.001
Significant values found: yes, Hypothesis rejected: no
H2: Glyphs faster than Table for extreme values
Measure: Solution Time
Pairwise comparison (p-value), Flower – Table / Star – Table:
Id_High: < 0.001 / < 0.001
Id_Low: 0.709 / < 0.001
Cat_High: < 0.001 / < 0.001
Cat_Low: < 0.001 / < 0.001
Significant values found: yes, Hypothesis rejected: no
H3: Table more accurate than Glyphs
Measure: Accuracy, Task: Overall
Pairwise comparison (p-value): Table – Flower 0.258, Table – Star 0.057
Significant values found: no, Hypothesis rejected: no
H4: Flowers more accurate and faster for extreme values
Measures: Solution Time, Accuracy
Pairwise comparison (p-value), Solution Time / Accuracy:
Id_High: 0.014 / 1.000
Id_Low: 1.000 / 0.140
Cat_High: 0.607 / 0.153
Cat_Low: 1.000 / 1.000
Significant values found: partially, Hypothesis rejected: partially
H5: Stars more accurate and faster for comparison tasks
Measures: Solution Time, Accuracy, Task: Comp
Pairwise comparison (p-value): Solution Time 0.084, Accuracy 1.000
Significant values found: no, Hypothesis rejected: no
19. Discussion
Missing flower petals could be perceived more easily than zero values on the axes in star plots
More difficult to identify if the petal encodes the maximum because of its curvature
When flower glyphs encode data items with low values in all dimensions, it was more difficult to assign the value to a specific dimension
Star plots with many high extreme values pop out more easily because of the larger surface area, so other star plots with just one high value were often disregarded
More difficult to identify high extreme values in star plots when adjacent dimensions exhibit zero values
22. References
W.S. Torgerson. 1952. Multidimensional scaling: I. Theory and method. Psychometrika 17 (1952), 401–419.
T. Munzner. 2014. Visualization Analysis and Design. A.K. Peters visualization series. ISBN 978-1-466-50891-0. http://www.cs.ubc.ca/~tmm/vadbook
Daniel A. Keim. 2000. Designing Pixel-Oriented Visualization Techniques: Theory and Applications. IEEE Trans. on Visualization and Computer Graphics 6, 1 (Jan. 2000), 59–78. DOI:https://doi.org/10.1109/2945.841121
M. Stefaner. 2010. The Design of "X by Y". In: Beautiful Visualization: Looking at Data through the Eyes of Experts. O'Reilly Media; 1 edition (July 1, 2010). ISBN 978-1449379865
B. Laugwitz, T. Held, and M. Schrepp. 2008. Construction and Evaluation of a User Experience Questionnaire. Springer Berlin Heidelberg, Berlin, Heidelberg, 63–76. DOI:https://doi.org/10.1007/978-3-540-89350-9_6
D. Kammer, M. Keck, M. Müller, T. Gründer, R. Groh. 2017. Exploring Big Data Landscapes with Elastic Displays Conference. Mensch & Computer 2017 - Workshop Begreifbare Interaktion, Oldenbourg Verlag, Regensburg, Germany, (in press)
w1 - The Antibiotic Abacus. Information is Beautiful. http://www.informationisbeautiful.net/visualizations/antibiotic-resistance/, retrieved on 04.08.2017
w2 - Stations & Lines. A visual catalog of every station in major rapid transit systems and the lines they serve. https://c82.net/work/?id=335, retrieved on 04.08.2017