
Towards Glyph-based Visualizations for Big Data Clustering



Data analysts have to deal with an ever-growing amount of data resources. One way to make sense of this data is to extract features and use clustering algorithms to group items according to a similarity measure. Algorithm developers are challenged when evaluating the performance of an algorithm, since it is hard to identify the features that influence the clustering. Moreover, many algorithms can be trained using a semi-supervised approach, where human users provide ground truth samples by manually grouping single items. Hence, visualization techniques are needed that help data analysts evaluate Big Data clustering algorithms. In this context, Multidimensional Scaling (MDS) has become a prominent visualization tool. In this paper, we propose combining MDS with glyphs that provide a detailed view of the specific features involved in MDS. As a consequence, human users can understand, adjust, and ultimately improve clustering algorithms. We present a thorough glyph design, which is founded on a comprehensive survey of related work, and report the results of a controlled experiment in which participants solved data analysis tasks with both glyphs and a traditional textual display of data values.

Published in: Data & Analytics


  1. VINCI 2017, The 10th International Symposium on Visual Information Communication and Interaction. Towards Glyph-based Visualizations for Big Data Clustering. AUG 15, 2017. Chair of Media Design, Technische Universität Dresden. Mandy Keck, Dietrich Kammer, Thomas Gründer, Thomas Thom, Martin Kleinsteuber, Alexander Maasch, Rainer Groh
  2. Structure. Part 1: Problem Description, Related Work. Part 2: Data Set, Glyph Design. Part 3: User Study. Part 4: Lessons Learned, Future Work
  3. Research Project VANDA - Visual Analytics Interfaces for Big Data Environments. Data Analytics, Copyright Observation; Data Crawling, Content Exploration; Data Analytics and Text Mining for Smart Adaptive Learning Environments; Research on Human Computer Interaction and Information Visualization; Purchasing Platform between Businesses with Millions of Products
  4. Multidimensional Scaling [Torgerson 1952, Munzner 2014]. [Diagram: HD DATA (Item 1 to Item n, each described by Dimension 1 to Dimension n) is projected to 2D DATA (Item 1 to Item n, each described by Dimension 1 and Dimension 2)]
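The projection on slide 4 can be illustrated with a minimal sketch of classical (Torgerson) MDS in plain Python. The slides do not say which MDS variant or implementation was used in the project, so the function names, the power-iteration eigensolver, and the toy items below are assumptions made only for this illustration.

```python
import math


def pairwise_distances(items):
    """Euclidean distance matrix for a list of equal-length feature vectors."""
    return [[math.dist(a, b) for b in items] for a in items]


def top_eigenpair(matrix, iterations=500):
    """Dominant eigenvalue and eigenvector of a symmetric PSD matrix (power iteration)."""
    n = len(matrix)
    # Fixed start vector; it must not be orthogonal to the dominant eigenvector.
    vec = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm < 1e-12:  # matrix is (numerically) zero
            return 0.0, [0.0] * n
        vec = [x / norm for x in nxt]
    mv = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
    value = sum(v * m for v, m in zip(vec, mv))  # Rayleigh quotient
    return value, vec


def classical_mds(dist, dims=2):
    """Embed items in `dims` dimensions from their pairwise distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double centering: B = -1/2 * J * D^2 * J
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    coords = [[] for _ in range(n)]
    for _ in range(dims):
        value, vec = top_eigenpair(b)
        scale = math.sqrt(max(value, 0.0))
        for i in range(n):
            coords[i].append(scale * vec[i])
        # Deflation: remove the component just extracted.
        b = [[b[i][j] - value * vec[i] * vec[j] for j in range(n)]
             for i in range(n)]
    return coords


# Toy data (hypothetical): four items described by five dimensions each.
items = [
    [0, 0, 1, 1, 1],
    [4, 0, 1, 1, 1],
    [0, 2, 1, 1, 1],
    [4, 2, 1, 1, 1],
]
distances = pairwise_distances(items)
embedding = classical_mds(distances)  # one [x, y] pair per item
```

Because the toy items differ in only two dimensions, the 2D embedding reproduces their pairwise distances; real high-dimensional data generally loses some distance information in the projection, which is the motivation for showing individual features in glyphs.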
  5. Visualization of Multi-dimensional Data Sets [Keim 2000]. Scatterplot (Geometric Technique) [w1]; Glyphs (Icon-based Technique) [w2]; Pixel-oriented Technique [Stefaner 2010]
  6. Using Glyphs for Cluster Analysis
  7. Glyph Design. Flower Glyph: (A1) Length of each petal encodes a quantitative attribute; (A2) Redundant encoding: attribute value is mapped to length and brightness for different LoD; (A3) Radius border to enhance the identification of maximum values. Star Plot: (B1) Star plot: whisker plot with connected endpoints of each line; (B2) Filling the resulting shape for color encoding; (B3) Absolute axes to improve the identification of extreme values, plus transparency to enhance the visibility of the coordinate system
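The two encodings on slide 7 can be sketched as plain geometry. This is only an illustration: the study itself was implemented in JavaScript with d3.js, and the function names, the angle convention, and the linear brightness mapping below are assumptions, not the authors' implementation.

```python
import math


def star_plot_vertices(values, max_value=10.0, radius=1.0):
    """Polygon vertices of a star plot (B1): one evenly spaced axis per attribute,
    each endpoint placed at a distance proportional to the attribute value."""
    k = len(values)
    vertices = []
    for i, value in enumerate(values):
        angle = 2 * math.pi * i / k  # assumed convention: axis 0 points along +x
        r = radius * value / max_value
        vertices.append((r * math.cos(angle), r * math.sin(angle)))
    return vertices


def flower_petals(values, max_value=10.0, radius=1.0):
    """Petal descriptions for a flower glyph: length (A1) and brightness (A2)
    both encode the same attribute value (redundant encoding)."""
    k = len(values)
    petals = []
    for i, value in enumerate(values):
        share = value / max_value
        petals.append({
            "angle": 2 * math.pi * i / k,
            "length": radius * share,
            # Assumption: linear mapping; the slides do not state the direction.
            "brightness": share,
        })
    return petals


# Hypothetical item with four normalized attributes (0 to 10).
vertices = star_plot_vertices([10, 0, 5, 0])
petals = flower_petals([10, 0, 5, 0])
```

Connecting the returned vertices in order and filling the polygon corresponds to steps B1 and B2; drawing a circle of radius `radius` around the petals would correspond to the radius border in A3.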
  8. Data Set. Event data set with 15 attributes. 6 attributes are selected for glyph design: price, popularity, time, distance, estimationmusic, and category. Quantitative attributes are normalized to a value between 0 and 10. Categories are mapped to color. Quantitative attributes shown: price, popularity, time, estimationmusic, distance. Categories shown: entertainment, sports, education, band, tourism, beauty
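The preprocessing described on slide 8 could look roughly like the sketch below. The slides only state that quantitative attributes are normalized to a value between 0 and 10 and that categories are mapped to color; the use of min-max scaling, the function names, and the hex colors are assumptions for illustration.

```python
def normalize_0_10(values):
    """Min-max scale a list of numbers to the range 0..10 (assumed method)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to scale, map everything to 0.
        return [0.0 for _ in values]
    return [10.0 * (v - lo) / (hi - lo) for v in values]


# Category names are taken from the slide; the colors are hypothetical.
CATEGORY_COLORS = {
    "entertainment": "#e41a1c",
    "sports": "#377eb8",
    "education": "#4daf4a",
    "band": "#984ea3",
    "tourism": "#ff7f00",
    "beauty": "#a65628",
}


def category_color(category):
    """Look up the fill color for an event category."""
    return CATEGORY_COLORS[category]


# Hypothetical raw prices for three events.
prices = [5.0, 12.5, 20.0]
normalized_prices = normalize_0_10(prices)
```

Scaling every quantitative attribute to the same range means petal lengths and star-plot axes are comparable across attributes within one glyph.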
  9. Hypotheses. H1: Glyph-based visualizations reduce completion times in comparison tasks when compared to tabular display. H2: Glyph-based visualizations reduce completion times to identify outliers and extreme values when compared to tabular displays. H3: Tabular display of data has the highest accuracy when compared to other visualization techniques. H4: Flower glyphs reduce completion times and increase accuracy when identifying extreme values. H5: Star plots reduce completion times and increase accuracy in comparison tasks
  10. Test Setting. Implementation in JavaScript (jQuery, RequireJS, d3.js); 226 data items of the event data set; 27" display with WQHD resolution in portrait orientation
  11. Conditions: Table, Flower, Star
  12. Methodology. 25 tasks per interface, divided into 5 task types. Id_High: Identification of high extreme values ("Find an event with a value in popularity as high as possible!"). Id_Low: Identification of low extreme values ("Find an event with a price as low as possible!"). Cat_High: Identification of high extreme values in a specific category ("Find an event in the category tourism which has a high chance that music is played there!"). Cat_Low: Identification of low extreme values in a specific category ("Find an event in the category education that is very close to your city!"). Comp: Comparison of all 5 values to a provided example ("Find an event that is similar to the shown example!"; example glyph with the axes price, popularity, time, estimationmusic, distance)
  13. Participants. [Infographic, values as extracted: TOTAL 41 (14, 27); AGE 64, 14; USAGE of EVENT DATA on a scale from very infrequent to very frequent; EXPERIENCE with INFORMATION VISUALIZATION and GLYPH-BASED VISUALIZATION on a scale from no experience to extensive experience; percentage rows 36.6%, 19.5%, 7.3%, 14.6%, 22.0% / 17.0%, 7.3%, 36.6%, 29.3%, 9.8% / 31.7%, 31.7%, 17.1%, 17.1%, 2.4%]
  14. Results | Time. [Chart: solution time in seconds (axis 0 to 40) per task type (Id_High, Id_Low, Cat_High, Cat_Low, Comp) for the conditions Table, Flower, and Star.] Main effects: Glyphs faster than Table, p < 0.001; Star plot faster than Table, p < 0.042. Pairwise comparison for tasks: Id_High: Glyphs faster than Table, Star faster than Flower. Cat_High, Cat_Low & Comp: Glyphs faster than Table
  15. Results | Accuracy and Error Rate. [Chart: accuracy (axis 0 to 1.2) per task type (Id_High, Id_Low, Cat_High, Cat_Low, Comp) for the conditions Table, Flower, and Star.] Main effects: Accuracy: no significant main effect for conditions, p = 0.504. Error Rate: no significant main effect for conditions, p = 0.122. Pairwise comparison for tasks: Id_Low: Table more accurate than Flower. Cat_Low: Table more accurate than Glyphs
  16. Questionnaire. User Experience Questionnaire (UEQ) for each condition [Laugwitz et al. 2008]. 26 adjective pairs, assigned to 6 factors: Attractiveness, Perspicuity, Efficiency, Dependability, Stimulation, and Novelty. Each adjective pair uses a seven-point scale (polarity is determined randomly), for example: attractive / unattractive
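The scoring of such a questionnaire can be sketched as follows, assuming the usual UEQ convention that each seven-point answer is mapped to a value from -3 to +3, with the sign flipped for pairs whose positive term is on the left. The item identifiers and the item-to-factor assignment in the example are hypothetical, apart from the attractive / unattractive pair shown on the slide.

```python
from statistics import mean


def item_score(answer, positive_on_right):
    """Map a 1..7 answer to -3..+3; flip the sign if the positive term is on the left."""
    if not 1 <= answer <= 7:
        raise ValueError("answer must be between 1 and 7")
    score = answer - 4
    return score if positive_on_right else -score


def factor_scores(answers, items):
    """Average the item scores per factor.

    answers: {item_id: answer on the 1..7 scale}
    items:   {item_id: (factor_name, positive_on_right)}
    """
    per_factor = {}
    for item_id, answer in answers.items():
        factor, positive_on_right = items[item_id]
        per_factor.setdefault(factor, []).append(
            item_score(answer, positive_on_right))
    return {factor: mean(scores) for factor, scores in per_factor.items()}


# Hypothetical item definitions; only the first pair appears on the slide.
items = {
    "attractive/unattractive": ("Attractiveness", False),
    "pair_2": ("Attractiveness", True),
    "pair_3": ("Novelty", True),
}
answers = {"attractive/unattractive": 2, "pair_2": 5, "pair_3": 7}
scores = factor_scores(answers, items)
```

Handling polarity per item is what allows the positive term to appear on either side of the scale while the factor means stay comparable across conditions.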
  17. Results | Questionnaire. [Chart: questionnaire ratings (axis -2.00 to 2.00) per factor (Attractiveness, Perspicuity, Efficiency, Dependability, Stimulation, Novelty) for the conditions Table, Flower, and Star.] Main effects: Glyphs rated better than Table, p < 0.001. Pairwise comparison for factors: Attractiveness, Efficiency, Stimulation: Glyphs rated better than Table. Novelty: Flower rated better than Star; Star rated better than Table
  18. Results | Hypotheses. Columns: Hypothesis, Measures, Hypothesis Description, Pairwise comparison (p-value), Task, Significant values found, Hypothesis rejected. H1: Comp; Solution Time; Glyphs faster than Table for comparison tasks; yes; no. H2: Id_High, Id_Low, Cat_High, Cat_Low; Solution Time; Glyphs faster than Table for extreme values; < 0.001, 0.709, < 0.001, < 0.001, < 0.001, < 0.001, < 0.001, < 0.001 (Flower – Table, Star – Table); yes; no. H3: Overall; Accuracy; Table more accurate than Glyphs; Table – Flower 0.258, Table – Star 0.057; no; no. H4: Flowers more accurate and faster for extreme values. H5: Stars more accurate and faster for comparison tasks. Remaining cells for H4 and H5 in extracted order: Solution Time, Accuracy; Id_High, Id_Low, Cat_High, Cat_Low, Comp; 0.014, 1.000, 0.607, 1.000; 1.000, 0.140, 0.153, 1.000; Flower – Table, Star – Table: < 0.001, < 0.001, 0.084; 1.000; partially, partially, no, no
  19. Discussion. Missing flower petals could be perceived more easily than zero values on the axes in star plots. It was more difficult to identify whether a petal encodes the maximum because of its curvature. When flower glyphs encode low values in all dimensions, it was more difficult to assign a value to a specific dimension. Star plots with many high extreme values pop out more easily because of their larger surface area, so other star plots with just one high value were often disregarded. It was more difficult to identify high extreme values in star plots when adjacent dimensions exhibit zero values
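The last two observations on slide 19 follow from the geometry of a filled star plot: the polygon consists of triangles spanned by neighboring axes, so each triangle's area is proportional to the product of two adjacent values. The sketch below computes that area; the function name and the sample values are assumptions for illustration only.

```python
import math


def star_plot_area(values):
    """Area of a star-plot polygon with evenly spaced axes.

    Each pair of neighboring axes spans a triangle with area
    0.5 * r_i * r_next * sin(2*pi / k), so the total is proportional
    to the sum of products of adjacent values.
    """
    k = len(values)
    if k < 3:
        return 0.0  # fewer than three axes cannot enclose an area
    wedge = 0.5 * math.sin(2 * math.pi / k)
    return wedge * sum(values[i] * values[(i + 1) % k] for i in range(k))


# Hypothetical glyphs with five attributes on the 0 to 10 scale.
all_high = star_plot_area([10, 10, 10, 10, 10])
two_adjacent_high = star_plot_area([10, 10, 0, 0, 0])
single_high = star_plot_area([10, 0, 0, 0, 0])
```

A single high value between two zero neighbors contributes no filled area at all, which is consistent with the observation that such star plots were often disregarded next to glyphs with many high values.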
  20. Improvements of Flower Glyphs and Star Plots
  21. Future Work [Kammer et al. 2017]
  22. References. W.S. Torgerson. 1952. Multidimensional scaling: I. Theory and method. Psychometrika 17 (1952), 401–419. T. Munzner. 2014. Visualization Analysis and Design. A.K. Peters visualization series. ISBN 978-1-466-50891-0. D.A. Keim. 2000. Designing Pixel-Oriented Visualization Techniques: Theory and Applications. IEEE Trans. on Visualization and Computer Graphics 6, 1 (Jan. 2000), 59–78. M. Stefaner. 2010. The Design of "X by Y". In: Beautiful Visualization: Looking at Data through the Eyes of Experts. O'Reilly Media; 1 edition (July 1, 2010). ISBN 978-1449379865. B. Laugwitz, T. Held, and M. Schrepp. 2008. Construction and Evaluation of a User Experience Questionnaire. Springer Berlin Heidelberg, Berlin, Heidelberg, 63–76. D. Kammer, M. Keck, M. Müller, T. Gründer, R. Groh. 2017. Exploring Big Data Landscapes with Elastic Displays Conference. Mensch & Computer 2017 - Workshop Begreifbare Interaktion, Oldenbourg Verlag, Regensburg, Germany (in press). [w1] The Antibiotic Abacus. Information is Beautiful. Retrieved on 04.08.2017. [w2] Stations & Lines. A visual catalog of every station in major rapid transit systems and the lines they serve. Retrieved on 04.08.2017