Thresholded Hierarchical Itemset Clustering for Expert Explorations
Diana Zajac, Thomas Lux, Dr. Jacob Furst, Dr. Daniela Raicu
College of Computing and Digital Media, DePaul University
Summer 2015
Introduction
Traditional machine learning (ML) techniques can cluster datasets, but the
clusters they produce are often difficult to interpret. Noise, high
dimensionality, and complexity in the data make clustering harder and can
produce undesirable results. In addition, most clustering algorithms give no
explanation of what patterns they found among the data points, or of which
patterns the clusters were formed on. In an attempt to cluster
high-dimensional, complex, and noisy datasets while producing interpretable
results, we created an interactive user interface called THIC. THIC stands
for Thresholded Hierarchical Itemset Clustering, a name that describes the
method by which it clusters data. What makes THIC innovative is its ability
to modify the clustering algorithm with ‘expert’ feedback. Here an ‘expert’
is any outside source of information that can provide intuitive guidance
about which features the algorithm should cluster on.
Datasets
Figure 1 is part of the 2012 City Livability dataset, obtained with permission of The Economist
Intelligence Unit (EIU) from their collaboration with BuzzData.
As one example, an ‘expert’ who is well-traveled could instruct THIC to
group the cities they find “most homey” under one cluster, the cities they
find “most beautiful” under another, and so on. THIC will cluster the cities
based on the expert’s guidance, predict which clusters the cities the expert
has not yet visited may fit into, and then explain which city features are
most important in determining the clusters.
Other datasets we worked with included a large text corpus, lung cancer
data, and Chronic Fatigue Syndrome data.
Clustering Algorithms
K-Means:
K-Means clustering is an algorithm that forms k clusters based on the
distance of each data point from the cluster centers. It begins by treating
each case as a point (in City Livability, each city is a point) with the
features as dimensions: for n features there are n dimensions, so each point
has a coordinate (x1, x2, …, xn) given by its feature values. K-Means chooses
initial cluster centers and then iteratively reassigns points and moves the
centers until the assignments stop changing, which locally minimizes the
total squared distance from the points to their assigned centers.
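The iteration described above can be sketched in a few lines of plain Python. This is a minimal illustration of standard k-means (Lloyd's algorithm), not code from the poster; the function names, the random initialization, and the tiny two-feature dataset are assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, max_iter=100, seed=0):
    """Return (centers, labels) after iterating assignment and update steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: start from k random points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centers are stable
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels

# Made-up cases with two features each (not values from the EIU dataset).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = k_means(points, k=2)
print(labels)
```

The result depends on the initial centers in general, which is why k-means is often run several times with different seeds.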
K-Means with Feature Selection (KMFS):
KMFS uses feature selection algorithms to aid k-means clustering. Feature
selection is usually used to strip a dataset of irrelevant, corrupted, or
redundant features, thereby improving any analysis based on the remaining
features. KMFS selects features one by one, starting with those that create
the ‘best’ (most defined and separate) clusters, and continues to add
features until the clusters become ‘bad’ (overlapping and spread out).
Incorporating feature selection into k-means clustering allows k-means to
cluster the data and report to the user the most relevant features used.
KMFS gives the user an idea of what each cluster is based on (which features
‘trend’ in each cluster), but it describes cluster features in terms of
probabilities rather than with certainty, and it does not provide user
control.
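A greedy forward selection of this kind might look like the sketch below. The poster does not say how cluster quality is scored or when clusters count as ‘bad’, so the mean silhouette width and the `min_score` cutoff are assumptions chosen for illustration, as are all function names and the made-up data.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means_labels(points, k, max_iter=100, seed=0):
    """Compact Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette width; close to 1 means compact, well-separated clusters."""
    clusters = set(labels)
    if len(clusters) < 2:
        return -1.0  # a single cluster has no separation to measure
    total = 0.0
    for i, p in enumerate(points):
        dists = {c: [] for c in clusters}
        for j, q in enumerate(points):
            if j != i:
                dists[labels[j]].append(sq_dist(p, q) ** 0.5)
        own = dists.pop(labels[i])
        if not own:
            continue  # a singleton cluster contributes a silhouette of 0
        a = sum(own) / len(own)                         # mean distance within own cluster
        b = min(sum(d) / len(d) for d in dists.values())  # nearest other cluster
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def kmfs(rows, k, min_score=0.5):
    """Add the feature giving the best-scoring clustering each round;
    stop once even the best candidate scores below min_score (hypothetical rule)."""
    remaining = list(range(len(rows[0])))
    selected = []
    while remaining:
        candidates = []
        for f in remaining:
            pts = [tuple(row[g] for g in selected + [f]) for row in rows]
            candidates.append((silhouette(pts, k_means_labels(pts, k)), f))
        score, best = max(candidates)
        if score < min_score:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

# Made-up data: feature 0 splits the cases into two tight groups, feature 1 does not.
rows = [(0.0, 5.0), (0.1, 1.0), (0.2, 9.0), (5.0, 1.2), (5.1, 8.8), (5.2, 5.1)]
selected_features = kmfs(rows, k=2, min_score=0.8)
print(selected_features)
```

Features here are not rescaled; in practice features on different scales would usually be normalized before distances are compared.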
Why THIC is better:
• Expert-guided clustering
• Better data interpretability
• Many different possible results
• Provides a controllable tradeoff between optimal results and meaningful
results
• Doesn’t lose data dimensionality (no important information is lost to
feature selection)
• THIC’s philosophy is focused on helping a user understand and explore
datasets, find unseen patterns and correlations in them, and create
unconventional clusterings of data.
Group 1: High: Green Space, Sprawl
Group 2: Low: Sprawl, Culture and Environment, Infrastructure
Group 3: Low: Infrastructure; High: Green Space
Group 4: Low: Sprawl, Culture and Environment
Group 5: Low: Green Space, Sprawl
Group 6: Low: Green Space; High: Sprawl
Group 7: Low: Sprawl; High: Green Space
The City Livability dataset shown in Figure 1 is one of the datasets we used
in testing THIC. It is particularly interesting because of the opportunity
for ‘expert feedback’. For example, an expert may want to know what European
cities have in common: the expert would instruct THIC to group the European
cities under one cluster, and THIC would produce results explaining which
features all of those cities share.
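To make the idea of an expert-defined group concrete, here is a small sketch. The poster does not spell out THIC's algorithm, so this only illustrates one plausible reading of single-item itemsets: each feature is thresholded into a "Low" or "High" item, the group is described by the items all of its seed cases share, and other cases are matched against that description. The function names, thresholds, case names, and scores are all hypothetical.

```python
def to_items(case, thresholds):
    """Turn numeric features into items such as ('Green Space', 'High')."""
    items = set()
    for feature, value in case.items():
        low, high = thresholds[feature]
        if value <= low:
            items.add((feature, "Low"))
        elif value >= high:
            items.add((feature, "High"))
    return items

def describe_group(seed_names, cases, thresholds):
    """Items shared by every case the expert placed in the group.
    Expects at least one seed case."""
    item_sets = [to_items(cases[name], thresholds) for name in seed_names]
    return set.intersection(*item_sets)

def match_cases(shared_items, cases, thresholds, exclude=()):
    """Remaining cases whose items include all of the group's shared items."""
    if not shared_items:
        return []  # nothing in common, so there is nothing to match on
    return sorted(
        name for name, case in cases.items()
        if name not in exclude and shared_items <= to_items(case, thresholds)
    )

# Made-up cases and scores (not values from the EIU dataset).
cases = {
    "A": {"Green Space": 0.90, "Sprawl": 0.20},
    "B": {"Green Space": 0.80, "Sprawl": 0.50},
    "C": {"Green Space": 0.85, "Sprawl": 0.90},
    "D": {"Green Space": 0.10, "Sprawl": 0.15},
}
thresholds = {"Green Space": (0.3, 0.7), "Sprawl": (0.3, 0.7)}

expert_group = ["A", "B"]  # the cases the expert says belong together
shared = describe_group(expert_group, cases, thresholds)
matches = match_cases(shared, cases, thresholds, exclude=expert_group)
print(shared, matches)
```

The shared items double as the explanation of the group, which is the interpretability benefit the poster emphasizes.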
THIC
THIC is an interactive interface that allows users to import a numerical
dataset and cluster the data based on their own preferences, such as:
• Which features to include or exclude
• Which features to give higher priority (more weight)
• The sizes of groups
• Whether to make subgroups
• The number of groups
• Groups defined by chosen features
• The balance between optimal clustering and clustering that is meaningful
to the user
Acknowledgments
• Dr. Jacob Furst, PhD, 1998, professor, DePaul University, CDM
• Dr. Daniela Raicu, PhD, 2002, professor, DePaul University, CDM
• College of Computing and Digital Media, DePaul University
• Science Research Fellows
• DePauw University
Future Work
Although we completed THIC’s preliminary phase, there is still much to
improve. The current THIC implementation focuses on single-item itemsets,
because increasing the itemset size increases both the computation time and
the amount of overlap between groups. Another goal is to develop better
‘stopping criteria’ for the algorithm, which at the moment are based on group
overlap and minimal coverage. With better stopping criteria, expanding to
multi-item itemsets would be more feasible without contradicting the
philosophy of THIC.
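The two quantities behind the current stopping criteria can be computed as in the sketch below. The poster names group overlap and coverage but gives no definitions or thresholds, so the definitions used here (fraction of cases in at least one group, fraction of covered cases in more than one group), the `should_stop` rule, and its default limits are assumptions for illustration only.

```python
from collections import Counter

def coverage(groups, n_cases):
    """Fraction of all cases that belong to at least one group."""
    covered = set().union(*groups.values())
    return len(covered) / n_cases

def overlap(groups):
    """Fraction of covered cases that belong to more than one group."""
    counts = Counter(case for members in groups.values() for case in members)
    if not counts:
        return 0.0
    return sum(1 for n in counts.values() if n > 1) / len(counts)

def should_stop(groups, n_cases, min_coverage=0.9, max_overlap=0.2):
    """Hypothetical rule: stop once enough cases are covered,
    or once the groups overlap too much."""
    return coverage(groups, n_cases) >= min_coverage or overlap(groups) > max_overlap

# Made-up groups over five cases A..E; case C sits in both groups.
groups = {"g1": {"A", "B", "C"}, "g2": {"C", "D"}}
print(coverage(groups, 5), overlap(groups), should_stop(groups, 5))
```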
When completed, THIC will be able to provide meaningful information in
multiple domains, including but not limited to economics, medical sciences, and
statistical analysis.
THIC produces diverse results depending on the user’s preferences. The focus
of THIC is therefore not necessarily the ‘best’ clusters or groupings, but
results that aid in understanding a dataset, such as:
• Finding patterns that may not be evident without THIC (due to the size or
complexity of the dataset)
• Producing results by defining ‘known’ clusters and matching the rest of
the cases to them
• Describing relationships between different features, as well as between
different cases (in City Livability, the cases are the cities, and the
features are qualities such as pollution and quality of education)

THIC MedIX Summer 2015 Poster
