SlideShare a Scribd company logo
A Methodology for Classifying Visitors to an Amusement Park
VAST Challenge 2015 - Mini-Challenge 1
Gustavo Dejean *, Departamento de Computacion
Universidad Nacional del Oeste - San Antonio de Padua, Pcia. de Buenos Aires - Argentina.
ABSTRACT
The main contribution of this work is showing how to obtain
a classification of visitors to an amusement park by using
cluster analysis and visualization techniques. The selection
of variables for K-means algorithm and the results obtained
are visually analyzed in dispersion graphs according to their
Principal Components, in boxplots and in a Linear Model
so as to fine-tune a result that can explain differences
between the groups and similarities within each group.
Keywords: visual analysis, vast challenge, data mining,
clustering, K-means, cluster analysis.
1 INTRODUCTION
Frequently, clustering algorithms identity clusters that do not exist
in reality [I] or trivial clusters that are not relevant for
researcher's goals. That is why, it is important to be able to
visualize results. This is attained both with 2D and 3D diagrams
with their first Principal Components (PC), as well as with other
types of diagrams.
2 THEORY
2.1 Material
Three archives of moves were used, plus a fourth one that
completed Friday's missing records. Overall, they contain
26,021,945 records. Each record accounts for a move or a check­
in from a visitor at the park on a specific day and time. Archives
are included in [2]. Tools that were used include: Postgresql 8.4
database, IBM SPSS Statics v22; JMP II (SW), and Tableau 8.2.
2.2 Methods
The process used for reaching to a final cluster may be divided
into four stages, that were executed concurrently with different
intensities, according to the progress of the work. These four
stages were: I) Creation of new variables; 2) Creation of models
and visualization; 3) Principal Components (PC); 4) K-means
and cluster visualization.
2.2.1 Creation of New Variables
Approximately 64 new variables were created. Some of them
were instrumental for developing a Linear Model that enabled
visualizing and understanding the data and outcomes that were
obtained. Table 1 describes the main variables that were created.
2.2.2 Creation of Models and Visualization
The Linear Model (LM) exhibited in Figure 1 was instrumental
for visualizing the outcome of the final clustering.
* email: dejean2010@gmail.com
IEEE Conference on Visual Analytics Science and Technology 2015
October 25-30, Chicago, II, USA
978-1-4673-9783-4/15/$31.00 ©2015 IEEE
The LM shows a linear correlation between the number of moves
vs. the amount of check-ins in the games (for every ID), yielding
R2 = 0,848. An interesting observation of the LM is that, in
general, each point represents a group of visitors. That is, each
group of persons entering the park through the same door and at
the same time was represented in the LM by points that were
overlapping or very close to one another. It will always be
interesting visualizing any possible outcome from a clustering
process by using this Model.
Figure 1: Linear Model. counCol movement vs.
counCof
_
check-in.
151
152
By observing the outlier points in the LM, 39 IDs with zero
check-in in the games were identified. These IDs were grouped in
Cluster 0; and they were not taken into account when applying the
K-means algorithm nor in any other LM tunings, nor in the
analysis of Pc. In Cluster 0, this would be represented mainly by
the park employees. The LM without outliers remained with a R2
=0,854.
2.2.3 Principal Components
In the case of the study, as the dimensionality of the problem was
greater than three, the PC were used for reducing it and, thus, for
visualizing the outcomes in 2D or 3D graphs, by using the first
two or three PC.
2.2.4 K-means and cluster visualization
An important point in this stage is determining which is the
appropriate amount for visitor classification. The CCC criterion
was used: Sarle's Cubic Clustering Criterion [3] taking the first
peak with CCC >3. The other important point is selecting the
variables for the input of the K-means algorithm that may yield a
better grouping.
3 RESULTS
With the described procedure, for a subset of variables, a K = 6
was obtained. Thus, in the final result, 7 clusters are obtained.
(N.B. Bear in mind that cluster 0 had been created), see figure 2.
Figure 2: Distribution of each Cluster, Cluster a is not shown in
the histogram.
The interpretation of each group was done by using boxplots, as
shown in Figures 3. For example, when comparing the cluster, it
was discovered that cluster 1 and cluster 3 show preferences for
Thill Rides games and rejection for Kidder Rides. Similarly, with
the other variables, in general, relevant differences were obtained
between the clusters, thus accounting for the classification that
was obtained.
Another interesting visualization of the results is shown in
Figure 4, where the LM is used for visualizing each cluster. It
should be noted, for example, that it is easy to read and detect
which clusters have compulsive players and which have very low
profile players, among other features. Overall, each cluster has a
different straight line tangent. The number of groups in each
cluster was roughly obtained with a SQL query grouped per time
of entry and per entry point.
Figure 3: Preferences for Thill Rides and rejection for Kidder
Rides for cluster 1 and 3
Figure 4: Visualization of Clusters in the LM Rides
4 CONCLUSION
In the resulting classification, it was possible to verify the
existence of similarities between the visitors of each cluster and
the differences between different clusters, mainly using boxplots,
but also leveraging interpretation with the LM visualization.
ACKNOWLEDGMENTS
To the authorities of Universidad Nacional del Oeste and
specially to Mg Antonio Fotti and Lie, Ariel Aizemberg for
enabling the research task.
REFERENCES
[1] Johnson Dallas E., Metodos Multivariados Aplicados al
analisis de Datos, page 321. Mexico, International Thomson
Editores, 2000.
[2] http://www.vacommunity.org/2015+VAST+Challenge
%3A+MCI
[3] Sarle, W.S., The CuvicClustering Criterion, SAS Technical
Report A-I08, Cary NC: SAS Institute, Inc., 1983.

More Related Content

What's hot

Spme 2013 segmentation
Spme 2013 segmentationSpme 2013 segmentation
Spme 2013 segmentation
Qujiang Lei
 
Learning to Extrapolate Knowledge: Transductive Few-shot Out-of-Graph Link Pr...
Learning to Extrapolate Knowledge: Transductive Few-shot Out-of-Graph Link Pr...Learning to Extrapolate Knowledge: Transductive Few-shot Out-of-Graph Link Pr...
Learning to Extrapolate Knowledge: Transductive Few-shot Out-of-Graph Link Pr...
MLAI2
 
facility layout paper
 facility layout paper facility layout paper
facility layout paper
Saurabh Tiwary
 
Edge Representation Learning with Hypergraphs
Edge Representation Learning with HypergraphsEdge Representation Learning with Hypergraphs
Edge Representation Learning with Hypergraphs
MLAI2
 
Ijcga1
Ijcga1Ijcga1
Ijcga1
ijcga
 
Drobics, m. 2001: datamining using synergiesbetween self-organising maps and...
Drobics, m. 2001:  datamining using synergiesbetween self-organising maps and...Drobics, m. 2001:  datamining using synergiesbetween self-organising maps and...
Drobics, m. 2001: datamining using synergiesbetween self-organising maps and...
ArchiLab 7
 

What's hot (18)

D044042432
D044042432D044042432
D044042432
 
IRJET- An Efficient Reverse Converter for the Three Non-Coprime Moduli Set {4...
IRJET- An Efficient Reverse Converter for the Three Non-Coprime Moduli Set {4...IRJET- An Efficient Reverse Converter for the Three Non-Coprime Moduli Set {4...
IRJET- An Efficient Reverse Converter for the Three Non-Coprime Moduli Set {4...
 
Spme 2013 segmentation
Spme 2013 segmentationSpme 2013 segmentation
Spme 2013 segmentation
 
B.Sc.IT: Semester - VI (October - 2016) [IDOL - Revised Course | Question Paper]
B.Sc.IT: Semester - VI (October - 2016) [IDOL - Revised Course | Question Paper]B.Sc.IT: Semester - VI (October - 2016) [IDOL - Revised Course | Question Paper]
B.Sc.IT: Semester - VI (October - 2016) [IDOL - Revised Course | Question Paper]
 
Learning to Extrapolate Knowledge: Transductive Few-shot Out-of-Graph Link Pr...
Learning to Extrapolate Knowledge: Transductive Few-shot Out-of-Graph Link Pr...Learning to Extrapolate Knowledge: Transductive Few-shot Out-of-Graph Link Pr...
Learning to Extrapolate Knowledge: Transductive Few-shot Out-of-Graph Link Pr...
 
Isam2_v1_2
Isam2_v1_2Isam2_v1_2
Isam2_v1_2
 
07 dimensionality reduction
07 dimensionality reduction07 dimensionality reduction
07 dimensionality reduction
 
facility layout paper
 facility layout paper facility layout paper
facility layout paper
 
Deep MIML Network
Deep MIML NetworkDeep MIML Network
Deep MIML Network
 
Dynamic clustering algorithm using fuzzy c means
Dynamic clustering algorithm using fuzzy c meansDynamic clustering algorithm using fuzzy c means
Dynamic clustering algorithm using fuzzy c means
 
A New Key Stream Generator Based on 3D Henon map and 3D Cat map
A New Key Stream Generator Based on 3D Henon map and 3D Cat mapA New Key Stream Generator Based on 3D Henon map and 3D Cat map
A New Key Stream Generator Based on 3D Henon map and 3D Cat map
 
Edge Representation Learning with Hypergraphs
Edge Representation Learning with HypergraphsEdge Representation Learning with Hypergraphs
Edge Representation Learning with Hypergraphs
 
Ijcga1
Ijcga1Ijcga1
Ijcga1
 
Drobics, m. 2001: datamining using synergiesbetween self-organising maps and...
Drobics, m. 2001:  datamining using synergiesbetween self-organising maps and...Drobics, m. 2001:  datamining using synergiesbetween self-organising maps and...
Drobics, m. 2001: datamining using synergiesbetween self-organising maps and...
 
Analysis of CANADAIR CL-215 retractable landing gear.
Analysis of CANADAIR CL-215 retractable landing gear.Analysis of CANADAIR CL-215 retractable landing gear.
Analysis of CANADAIR CL-215 retractable landing gear.
 
Self-Organising Maps for Customer Segmentation using R - Shane Lynn - Dublin R
Self-Organising Maps for Customer Segmentation using R - Shane Lynn - Dublin RSelf-Organising Maps for Customer Segmentation using R - Shane Lynn - Dublin R
Self-Organising Maps for Customer Segmentation using R - Shane Lynn - Dublin R
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reduction
 
Information Hiding for “Color to Gray and back” with Hartley, Slant and Kekre...
Information Hiding for “Color to Gray and back” with Hartley, Slant and Kekre...Information Hiding for “Color to Gray and back” with Hartley, Slant and Kekre...
Information Hiding for “Color to Gray and back” with Hartley, Slant and Kekre...
 

Similar to A Methodology for Classifying Visitors to an Amusement Park VAST Challenge 2015 - Mini-Challenge 1

Dynamic approach to k means clustering algorithm-2
Dynamic approach to k means clustering algorithm-2Dynamic approach to k means clustering algorithm-2
Dynamic approach to k means clustering algorithm-2
IAEME Publication
 
FREE- REFERENCE IMAGE QUALITY ASSESSMENT FRAMEWORK USING METRICS FUSION AND D...
FREE- REFERENCE IMAGE QUALITY ASSESSMENT FRAMEWORK USING METRICS FUSION AND D...FREE- REFERENCE IMAGE QUALITY ASSESSMENT FRAMEWORK USING METRICS FUSION AND D...
FREE- REFERENCE IMAGE QUALITY ASSESSMENT FRAMEWORK USING METRICS FUSION AND D...
sipij
 
Image Segmentation Using Two Weighted Variable Fuzzy K Means
Image Segmentation Using Two Weighted Variable Fuzzy K MeansImage Segmentation Using Two Weighted Variable Fuzzy K Means
Image Segmentation Using Two Weighted Variable Fuzzy K Means
Editor IJCATR
 

Similar to A Methodology for Classifying Visitors to an Amusement Park VAST Challenge 2015 - Mini-Challenge 1 (20)

IRJET- Finding Dominant Color in the Artistic Painting using Data Mining ...
IRJET-  	  Finding Dominant Color in the Artistic Painting using Data Mining ...IRJET-  	  Finding Dominant Color in the Artistic Painting using Data Mining ...
IRJET- Finding Dominant Color in the Artistic Painting using Data Mining ...
 
IRJET- Crowd Density Estimation using Novel Feature Descriptor
IRJET- Crowd Density Estimation using Novel Feature DescriptorIRJET- Crowd Density Estimation using Novel Feature Descriptor
IRJET- Crowd Density Estimation using Novel Feature Descriptor
 
Application of pattern recognition techniques in banknote classification
Application of pattern recognition techniques in banknote classificationApplication of pattern recognition techniques in banknote classification
Application of pattern recognition techniques in banknote classification
 
Clustering techniques
Clustering techniquesClustering techniques
Clustering techniques
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithms
 
Dynamic approach to k means clustering algorithm-2
Dynamic approach to k means clustering algorithm-2Dynamic approach to k means clustering algorithm-2
Dynamic approach to k means clustering algorithm-2
 
Mat189: Cluster Analysis with NBA Sports Data
Mat189: Cluster Analysis with NBA Sports DataMat189: Cluster Analysis with NBA Sports Data
Mat189: Cluster Analysis with NBA Sports Data
 
IRJET- Crowd Density Estimation using Image Processing
IRJET- Crowd Density Estimation using Image ProcessingIRJET- Crowd Density Estimation using Image Processing
IRJET- Crowd Density Estimation using Image Processing
 
Proposed algorithm for image classification using regression-based pre-proces...
Proposed algorithm for image classification using regression-based pre-proces...Proposed algorithm for image classification using regression-based pre-proces...
Proposed algorithm for image classification using regression-based pre-proces...
 
Control chart pattern recognition using k mica clustering and neural networks
Control chart pattern recognition using k mica clustering and neural networksControl chart pattern recognition using k mica clustering and neural networks
Control chart pattern recognition using k mica clustering and neural networks
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 
FREE- REFERENCE IMAGE QUALITY ASSESSMENT FRAMEWORK USING METRICS FUSION AND D...
FREE- REFERENCE IMAGE QUALITY ASSESSMENT FRAMEWORK USING METRICS FUSION AND D...FREE- REFERENCE IMAGE QUALITY ASSESSMENT FRAMEWORK USING METRICS FUSION AND D...
FREE- REFERENCE IMAGE QUALITY ASSESSMENT FRAMEWORK USING METRICS FUSION AND D...
 
Finding Relationships between the Our-NIR Cluster Results
Finding Relationships between the Our-NIR Cluster ResultsFinding Relationships between the Our-NIR Cluster Results
Finding Relationships between the Our-NIR Cluster Results
 
FUZZY IMAGE SEGMENTATION USING VALIDITY INDEXES CORRELATION
FUZZY IMAGE SEGMENTATION USING VALIDITY INDEXES CORRELATIONFUZZY IMAGE SEGMENTATION USING VALIDITY INDEXES CORRELATION
FUZZY IMAGE SEGMENTATION USING VALIDITY INDEXES CORRELATION
 
Vivarana literature survey
Vivarana literature surveyVivarana literature survey
Vivarana literature survey
 
A Hybrid Data Clustering Approach using K-Means and Simplex Method-based Bact...
A Hybrid Data Clustering Approach using K-Means and Simplex Method-based Bact...A Hybrid Data Clustering Approach using K-Means and Simplex Method-based Bact...
A Hybrid Data Clustering Approach using K-Means and Simplex Method-based Bact...
 
Image Segmentation Using Two Weighted Variable Fuzzy K Means
Image Segmentation Using Two Weighted Variable Fuzzy K MeansImage Segmentation Using Two Weighted Variable Fuzzy K Means
Image Segmentation Using Two Weighted Variable Fuzzy K Means
 
A Study in Employing Rough Set Based Approach for Clustering on Categorical ...
A Study in Employing Rough Set Based Approach for Clustering  on Categorical ...A Study in Employing Rough Set Based Approach for Clustering  on Categorical ...
A Study in Employing Rough Set Based Approach for Clustering on Categorical ...
 
IMAGE SEGMENTATION AND ITS TECHNIQUES
IMAGE SEGMENTATION AND ITS TECHNIQUESIMAGE SEGMENTATION AND ITS TECHNIQUES
IMAGE SEGMENTATION AND ITS TECHNIQUES
 
Clustering of Big Data Using Different Data-Mining Techniques
Clustering of Big Data Using Different Data-Mining TechniquesClustering of Big Data Using Different Data-Mining Techniques
Clustering of Big Data Using Different Data-Mining Techniques
 

Recently uploaded

Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
StarCompliance.io
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
DilipVasan
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
Computer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage sComputer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage s
MAQIB18
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 

Recently uploaded (20)

Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
Using PDB Relocation to Move a Single PDB to Another Existing CDB
Using PDB Relocation to Move a Single PDB to Another Existing CDBUsing PDB Relocation to Move a Single PDB to Another Existing CDB
Using PDB Relocation to Move a Single PDB to Another Existing CDB
 
Slip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp ClaimsSlip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp Claims
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsWebinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
 
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
 
Computer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage sComputer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage s
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
Pre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxPre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptx
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 

A Methodology for Classifying Visitors to an Amusement Park VAST Challenge 2015 - Mini-Challenge 1

  • 1. A Methodology for Classifying Visitors to an Amusement Park VAST Challenge 2015 - Mini-Challenge 1 Gustavo Dejean *, Departamento de Computacion Universidad Nacional del Oeste - San Antonio de Padua, Pcia. de Buenos Aires - Argentina. ABSTRACT The main contribution of this work is showing how to obtain a classification of visitors to an amusement park by using cluster analysis and visualization techniques. The selection of variables for K-means algorithm and the results obtained are visually analyzed in dispersion graphs according to their Principal Components, in boxplots and in a Linear Model so as to fine-tune a result that can explain differences between the groups and similarities within each group. Keywords: visual analysis, vast challenge, data mining, clustering, K-means, cluster analysis. 1 INTRODUCTION Frequently, clustering algorithms identity clusters that do not exist in reality [I] or trivial clusters that are not relevant for researcher's goals. That is why, it is important to be able to visualize results. This is attained both with 2D and 3D diagrams with their first Principal Components (PC), as well as with other types of diagrams. 2 THEORY 2.1 Material Three archives of moves were used, plus a fourth one that completed Friday's missing records. Overall, they contain 26,021,945 records. Each record accounts for a move or a check­ in from a visitor at the park on a specific day and time. Archives are included in [2]. Tools that were used include: Postgresql 8.4 database, IBM SPSS Statics v22; JMP II (SW), and Tableau 8.2. 2.2 Methods The process used for reaching to a final cluster may be divided into four stages, that were executed concurrently with different intensities, according to the progress of the work. These four stages were: I) Creation of new variables; 2) Creation of models and visualization; 3) Principal Components (PC); 4) K-means and cluster visualization. 2.2.1 Creation of New Variables Approximately 64 new variables were created. Some of them were instrumental for developing a Linear Model that enabled visualizing and understanding the data and outcomes that were obtained. Table 1 describes the main variables that were created. 2.2.2 Creation of Models and Visualization The Linear Model (LM) exhibited in Figure 1 was instrumental for visualizing the outcome of the final clustering. * email: dejean2010@gmail.com IEEE Conference on Visual Analytics Science and Technology 2015 October 25-30, Chicago, II, USA 978-1-4673-9783-4/15/$31.00 ©2015 IEEE The LM shows a linear correlation between the number of moves vs. the amount of check-ins in the games (for every ID), yielding R2 = 0,848. An interesting observation of the LM is that, in general, each point represents a group of visitors. That is, each group of persons entering the park through the same door and at the same time was represented in the LM by points that were overlapping or very close to one another. It will always be interesting visualizing any possible outcome from a clustering process by using this Model. Figure 1: Linear Model. counCol movement vs. counCof _ check-in. 151
  • 2. 152 By observing the outlier points in the LM, 39 IDs with zero check-in in the games were identified. These IDs were grouped in Cluster 0; and they were not taken into account when applying the K-means algorithm nor in any other LM tunings, nor in the analysis of Pc. In Cluster 0, this would be represented mainly by the park employees. The LM without outliers remained with a R2 =0,854. 2.2.3 Principal Components In the case of the study, as the dimensionality of the problem was greater than three, the PC were used for reducing it and, thus, for visualizing the outcomes in 2D or 3D graphs, by using the first two or three PC. 2.2.4 K-means and cluster visualization An important point in this stage is determining which is the appropriate amount for visitor classification. The CCC criterion was used: Sarle's Cubic Clustering Criterion [3] taking the first peak with CCC >3. The other important point is selecting the variables for the input of the K-means algorithm that may yield a better grouping. 3 RESULTS With the described procedure, for a subset of variables, a K = 6 was obtained. Thus, in the final result, 7 clusters are obtained. (N.B. Bear in mind that cluster 0 had been created), see figure 2. Figure 2: Distribution of each Cluster, Cluster a is not shown in the histogram. The interpretation of each group was done by using boxplots, as shown in Figures 3. For example, when comparing the cluster, it was discovered that cluster 1 and cluster 3 show preferences for Thill Rides games and rejection for Kidder Rides. Similarly, with the other variables, in general, relevant differences were obtained between the clusters, thus accounting for the classification that was obtained. Another interesting visualization of the results is shown in Figure 4, where the LM is used for visualizing each cluster. It should be noted, for example, that it is easy to read and detect which clusters have compulsive players and which have very low profile players, among other features. Overall, each cluster has a different straight line tangent. The number of groups in each cluster was roughly obtained with a SQL query grouped per time of entry and per entry point. Figure 3: Preferences for Thill Rides and rejection for Kidder Rides for cluster 1 and 3 Figure 4: Visualization of Clusters in the LM Rides 4 CONCLUSION In the resulting classification, it was possible to verify the existence of similarities between the visitors of each cluster and the differences between different clusters, mainly using boxplots, but also leveraging interpretation with the LM visualization. ACKNOWLEDGMENTS To the authorities of Universidad Nacional del Oeste and specially to Mg Antonio Fotti and Lie, Ariel Aizemberg for enabling the research task. REFERENCES [1] Johnson Dallas E., Metodos Multivariados Aplicados al analisis de Datos, page 321. Mexico, International Thomson Editores, 2000. [2] http://www.vacommunity.org/2015+VAST+Challenge %3A+MCI [3] Sarle, W.S., The CuvicClustering Criterion, SAS Technical Report A-I08, Cary NC: SAS Institute, Inc., 1983.