Hierarchical clustering
in Python & elsewhere
For @PyDataConf London, June 2015, by Frank Kelly
Data Scientist, Engineer @analyticsseo
@norhustla
Hierarchical
Clustering
Theory Practice Visualisation
Origins & definitions
Methods & considerations
Hierarchical theory
Metrics & performance
My use case
Python libraries
Example
Static
Interactive
Further ideas
All opinions expressed are my own
Who am I?
Attribution: www.alexmaclean.com
Clustering: a recap
Clustering is an unsupervised learning problem whereby we aim to group subsets of entities with one another, based on some notion of similarity.
"SLINK-Gaussian-data" by Chire - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons -
https://commons.wikimedia.org/wiki/File:SLINK-Gaussian-data.svg#/media/File:SLINK-Gaussian-data.svg
Origins
1930s:
Anthropology
&
Psychology
http://dienekes.blogspot.co.uk/2013/12/europeans-neolithic-farmers-mesolithic.html
Diverse applications
Attribution: stack overflow, wikipedia, scikit-learn.org, http://www.poolparty.biz/
Two
main
purposes
Exploratory analysis – standalone tool
(Data mining)
As a component of a supervised learning
pipeline (in which distinct classifiers or
regression models are trained for each
cluster).
(Machine Learning)
Clustering considerations
• Partitioning criteria (single / multi level)
• Separation (exclusive / non-exclusive)
• Clustering space (full-space / sub-space)
• Similarity measure (distance / connectivity)
Use case: search keywords
[Diagram: nodes labelled RD, P, CP, CD and KW, with regions marked “You”, “The competition!” and “Opportunity!”]
CD = Competing domains
CP = Competitor’s pages
RD = Ranking domain
P = Your page
KW = Keyword
….x 100,000 !!
Use case: search keywords
…so we have found 100,000 new KWs – now what?
How do we summarise and present these to a client?
Clients’ questions…
• Do search categories in general
align with my website structure?
• Which categories of opportunity
keywords have the highest
search volume, bring the most
visitors, revenue etc.?
• Which keywords are not
relevant?
Website-like structure
Requirements
• Need: visual insights;
structure
• Allow targeting of
problem in hand
• May develop into a
semi-supervised solution
Options for text clustering?
• High-dimensional and sparse data set
• Values correspond to word frequencies
• Recommended methods include: hierarchical clustering, k-means with an appropriate distance measure, topic modelling (LDA, LSI), co-clustering
Hierarchical Clustering
bringing structure
2 types
Agglomerative
Divisive
Deterministic algorithms!
Attribution: Wikipedia
Agglomerative
Start with many
“singleton” clusters
…
Merge 2 at a time
continuously
…
Build a hierarchy
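The agglomerative loop above can be sketched in a few lines. This is a deliberately naive illustration, not the method used in the talk: the function name `naive_agglomerative`, the toy points and the choice of single-link Euclidean distance are all assumptions made for the example.

```python
import numpy as np

def naive_agglomerative(points):
    """Merge the two closest clusters until only one remains (single link)."""
    clusters = [[i] for i in range(len(points))]  # start with singleton clusters
    merges = []                                   # the hierarchy, kept as a merge log
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters whose closest members are nearest to each other.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), float(d)))
        clusters[a] = clusters[a] + clusters[b]   # merge cluster b into cluster a
        del clusters[b]
    return merges

# Hypothetical toy data: two obvious pairs of points.
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.5]])
merges = naive_agglomerative(points)
for left, right, dist in merges:
    print(left, right, round(dist, 3))
```

Every pass re-checks all pairs of clusters, which is fine for four points but is exactly the cost that makes naive implementations scale badly.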
Divisive
Start with a huge “macro”
cluster
…
Iteratively split into 2
groups
…
Build a hierarchy
Agglomerative method:
Linkage types
• Single link: similarity between the two most similar elements, one from each cluster (nearest neighbour)
• Complete link: similarity between the two most dissimilar elements, one from each cluster
Attribution: https://www.coursera.org/course/clusteranalysis
Agglomerative method:
Linkage types
• Average link (average of similarity between all inter-cluster pairs); computationally expensive (Na*Nb)
• Trick: Centroid link (similarity between the centroids of the two clusters)
Attribution: https://www.coursera.org/course/clusteranalysis
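The linkage types can be compared with SciPy's `scipy.cluster.hierarchy.linkage`. A minimal sketch on made-up data follows; note that SciPy works with distances rather than similarities, and the two Gaussian blobs and the two-cluster cut are assumptions chosen only for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated toy blobs (made-up data, for illustration only).
X = np.vstack([rng.normal(0.0, 0.3, size=(10, 2)),
               rng.normal(5.0, 0.3, size=(10, 2))])

results = {}
for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method, metric="euclidean")  # (n-1) x 4 merge table
    labels = fcluster(Z, t=2, criterion="maxclust")    # cut the tree into 2 flat clusters
    results[method] = labels
    print(method, "last merge height:", round(float(Z[-1, 2]), 3))
```

On data this clean every linkage recovers the same two groups; the differences show up in the merge heights and on messier data.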
Ward’s criterion
• Minimise a function: total in-cluster variance
• As defined by, e.g.:
• Once merged, then the SSE will increase
(cluster becomes bigger) by:
https://en.wikipedia.org/wiki/Ward's_method
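The formulas on this slide did not survive as text. As a sketch of the standard statement of Ward's method (not necessarily the exact notation shown in the talk): the in-cluster SSE is the sum of squared distances from each point to its cluster centroid, and merging clusters A and B increases the total SSE by n_A * n_B / (n_A + n_B) * ||m_A - m_B||^2, where m is a centroid and n a cluster size. The check below uses hypothetical toy clusters.

```python
import numpy as np

def sse(cluster):
    """Sum of squared distances from each point to its cluster centroid."""
    centroid = cluster.mean(axis=0)
    return float(((cluster - centroid) ** 2).sum())

def ward_merge_cost(a, b):
    """Increase in total SSE caused by merging clusters a and b."""
    n_a, n_b = len(a), len(b)
    gap = a.mean(axis=0) - b.mean(axis=0)
    return (n_a * n_b) / (n_a + n_b) * float(gap @ gap)

# Hypothetical toy clusters.
a = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
b = np.array([[4.0, 4.0], [5.0, 4.0]])

# SSE after merging minus SSE before merging, computed the long way.
direct = sse(np.vstack([a, b])) - (sse(a) + sse(b))
print(direct, ward_merge_cost(a, b))
```

The two printed numbers should agree up to floating-point error, which is why the closed form is used to pick the cheapest merge without recomputing any SSE.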
Divisive clustering
• Top-down approach
• Criterion to split: Ward’s criterion
• Handling noise: Use a threshold to determine
the termination criteria
Attribution: https://www.coursera.org/course/clusteranalysis
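A top-down split can be sketched as recursive bisection. The version below is an assumption-laden illustration rather than the algorithm from the course: it swaps in scikit-learn's `KMeans` with two clusters as the splitting step (it minimises the same in-cluster SSE that Ward's criterion is built on), and the names `divisive`, `sse_threshold` and `min_size` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def sse(points):
    """Sum of squared distances to the centroid."""
    return float(((points - points.mean(axis=0)) ** 2).sum())

def divisive(points, indices=None, sse_threshold=5.0, min_size=2):
    """Recursively bisect; returns nested tuples whose leaves are lists of indices."""
    if indices is None:
        indices = np.arange(len(points))
    subset = points[indices]
    # Termination: the cluster is already tight enough, or too small to split.
    if len(indices) < 2 * min_size or sse(subset) <= sse_threshold:
        return indices.tolist()
    halves = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(subset)
    left, right = indices[halves == 0], indices[halves == 1]
    return (divisive(points, left, sse_threshold, min_size),
            divisive(points, right, sse_threshold, min_size))

rng = np.random.default_rng(0)
# Made-up data: two tight blobs far apart.
X = np.vstack([rng.normal(0.0, 0.3, size=(10, 2)),
               rng.normal(5.0, 0.3, size=(10, 2))])
tree = divisive(X)
print(tree)
```

The threshold plays the role of the termination criterion on the slide: raise it and the recursion stops earlier, lower it and noise gets split into ever smaller clusters.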
Similarity measures
This will certainly influence the shape of the
clusters!
• Numerical: Use a variation of the Minkowski distance (e.g. Manhattan / city block, Euclidean)
• Binary: Manhattan, Jaccard coefficient, Hamming
• Text: Cosine similarity.
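These measures are all available through `scipy.spatial.distance.pdist`. A small sketch on made-up vectors follows; SciPy returns distances, so for Jaccard and cosine the value is one minus the similarity.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Numerical data: two points in the plane.
numeric = np.array([[0.0, 0.0], [3.0, 4.0]])
manhattan = pdist(numeric, metric="cityblock")[0]   # Manhattan / city block
euclidean = pdist(numeric, metric="euclidean")[0]

# Binary data: presence/absence of four features.
binary = np.array([[1, 1, 0, 0], [1, 0, 1, 0]], dtype=bool)
jaccard = pdist(binary, metric="jaccard")[0]
hamming = pdist(binary, metric="hamming")[0]

# Text-like data: term counts pointing in the same direction.
text_counts = np.array([[2.0, 1.0, 0.0], [4.0, 2.0, 0.0]])
cosine = pdist(text_counts, metric="cosine")[0]

print(manhattan, euclidean, jaccard, hamming, cosine)
```

The last pair shows why cosine suits text: one vector is simply twice the other, so the angle between them is zero even though the counts differ.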
Cosine similarity
Represent a document by a bag of terms
Record the frequency of a particular term (word/ topic/ phrase)
If d1 and d2 are two term vectors,
…can thus calculate the similarity between them
Attribution: https://www.coursera.org/course/clusteranalysis
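The formula itself is missing from the extracted slide; the usual definition is cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||). A minimal sketch, assuming whitespace tokenisation and a tiny hand-picked vocabulary (both hypothetical):

```python
from collections import Counter
import numpy as np

def term_vector(text, vocabulary):
    """Bag of terms: how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return np.array([counts[term] for term in vocabulary], dtype=float)

def cosine_similarity(d1, d2):
    """Dot product of the two term vectors divided by the product of their lengths."""
    return float(d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2)))

vocabulary = ["cheap", "flights", "hotels", "london"]
d1 = term_vector("cheap flights london", vocabulary)
d2 = term_vector("cheap hotels london", vocabulary)
d3 = term_vector("hotels", vocabulary)

print(cosine_similarity(d1, d2))  # two of three terms shared
print(cosine_similarity(d1, d3))  # no terms shared
```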
Text clustering: preparations
Gather word documents = keyword phrases
Aggregate search words with URL “words”
Text preparation
• Add features where possible
o I added URL words to my word set
• Stem words
o Choose the right stemmer – too severe can be bad
• Stop words
o NLTK tokeniser
o scikit-learn TF-IDF tokeniser
• Low frequency cut-off
o 2 => words appearing less than twice in whole corpus
• High frequency cut-off
o 0.5 => words that appear in more than 50% of documents
• N-grams
o Single words, bi-grams, tri-grams
• Beware of foreign languages
o Separate datasets if possible
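Several of these preparation steps map onto parameters of scikit-learn's `TfidfVectorizer`. The sketch below uses made-up keyword phrases and is not the talk's actual pipeline; stemming is left out, and note that scikit-learn's `min_df` counts documents containing a term rather than occurrences in the whole corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical keyword phrases standing in for the real "documents".
keywords = [
    "buy cheap flights london",
    "buy cheap flights paris",
    "buy cheap hotels london",
    "buy luxury hotels paris",
    "car hire rome",
    "car hire madrid",
]

vectorizer = TfidfVectorizer(
    stop_words="english",   # built-in English stop word list
    min_df=2,               # low frequency cut-off: drop terms in fewer than 2 documents
    max_df=0.5,             # high frequency cut-off: drop terms in more than 50% of documents
    ngram_range=(1, 3),     # single words, bi-grams and tri-grams
)
X = vectorizer.fit_transform(keywords)   # sparse TF-IDF matrix, one row per keyword phrase

print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

In this toy corpus "rome" falls below the low cut-off and "buy" exceeds the high one, while the bi-gram "car hire" survives as a feature of its own.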
Dimensionality
• Get a sparse matrix
o Mostly zeros
• Reduce the number of dimensions
o PCA
o Spectral clustering
• The “curse” of dimensionality
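A sketch of the reduction step on a sparse matrix. It swaps in scikit-learn's `TruncatedSVD`, which is closely related to PCA but accepts sparse input directly; the random matrix below is a stand-in for a real term matrix, and two components are chosen only so the result could be plotted.

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Made-up sparse matrix: 100 "documents", 500 "terms", about 2% non-zero entries.
X = sparse.random(100, 500, density=0.02, format="csr", random_state=0)
print("fraction of non-zero entries:", X.nnz / (X.shape[0] * X.shape[1]))

svd = TruncatedSVD(n_components=2, random_state=0)
X_2d = svd.fit_transform(X)   # dense array with one 2-D point per document

print(X_2d.shape)
print("explained variance ratio:", svd.explained_variance_ratio_)
```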
Results: reduced dimensions
The
dendrogram
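A dendrogram can be built from a SciPy linkage table. The sketch below uses random points and hypothetical labels, and passes `no_plot=True` so that it only computes the leaf ordering; with matplotlib installed, dropping that argument draws the tree.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))                  # made-up data: 8 items, 3 features
labels = [f"kw{i}" for i in range(len(X))]   # hypothetical keyword labels

Z = linkage(X, method="ward")                # merge table used to draw the tree
tree = dendrogram(Z, labels=labels, no_plot=True)

print(tree["ivl"])   # leaf labels in the order they appear along the axis
```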
Assess the quality of your
clusters
• External (need reference labels): purity, completeness & homogeneity, Adjusted Rand index, Normalised Mutual Information
• Internal (no labels needed): e.g. silhouette coefficient
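Several of these scores are available in `sklearn.metrics`. A small sketch with made-up label vectors; all four functions used here compare a clustering against reference labels.

```python
from sklearn.metrics import (adjusted_rand_score, completeness_score,
                             homogeneity_score, normalized_mutual_info_score)

labels_true = [0, 0, 1, 1, 2, 2]     # hypothetical reference classes
labels_pred = [1, 1, 0, 0, 2, 2]     # same grouping, different cluster ids
labels_merged = [0, 0, 0, 0, 1, 1]   # classes 0 and 1 lumped into one cluster

ari = adjusted_rand_score(labels_true, labels_pred)
nmi = normalized_mutual_info_score(labels_true, labels_pred)
homogeneity = homogeneity_score(labels_true, labels_merged)
completeness = completeness_score(labels_true, labels_merged)

print(ari, nmi, homogeneity, completeness)
```

Renaming cluster ids does not change the scores, while the lumped clustering stays complete (each class sits in one cluster) but loses homogeneity (one cluster mixes two classes).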
Topic labelling
Hierarchical Clustering
Beyond Python (!?)
Life on the inside:
Elasticsearch
• Why not perform pre-processing and clustering
inside Elasticsearch?
• Document store
• TF-IDF and other
• Stop words
• Language specific analysers
Elasticsearch
- try it ! -
• https://www.elastic.co/
• NoSQL document store
• Aggregations and stats
• Fast, distributed
• Quick to set up
Document storage in ES
Lingo 3G algorithm
• Lingo 3G: Hierarchical clustering off-the-shelf
• Built-in part of speech (POS)
• User-defined word/synonym/label dictionaries
• Built-in stemmer / word inflection database
• Multi-lingual support, advanced tuning
• Commercial: costs attached
http://download.carrotsearch.com/lingo3g/manual/#section.es
http://project.carrot2.org/algorithms.html
Elasticsearch with
clustering – Utopia?
Carrot2’s Lingo3G in action :
http://search.carrot2.org/stable/search
Foamtree visualisation example
Visualisation of hierarchical structure possible for
large datasets via “lazy loading”
http://get.carrotsearch.com/foamtree/demo/demos/large.html
Limitations of hierarchical
clustering
• Can’t undo what’s done: the divisive method works on sub-clusters and cannot re-merge them; likewise, the agglomerative method never splits a cluster again once it is merged
• Every split or merge must be refined
• Methods may not scale well: checking all possible pairs makes the complexity grow quickly
There are extensions: BIRCH, CURE and
CHAMELEON
Thank you!
A decent introductory course to clustering:
https://www.coursera.org/course/clusteranalysis
Hierarchical (agglomerative) clustering in Python:
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
Recent (ish) relevant Kaggle challenge: https://www.kaggle.com/c/lshtc
Visualisation: http://carrotsearch.com/foamtree-overview
Clustering elsewhere (Lingo, Lingo3G) with
Carrot2: http://download.carrotsearch.com/
Elasticsearch: https://www.elastic.co/
Analytics SEO: http://www.analyticsseo.com/
Me: @norhustla / frank.kelly@cantab.net
Attribution: http://wynway.com/
Extra slide: Why work
inside the database?
1. Sharing data (management of)
Support concurrent access by multiple readers and writers
2. Data Model Enforcement
Make sure all applications see clean, organised data
3. Scale
Work with datasets too large to fit in memory (over a certain size,
need specialised algorithms to deal with the data -> bottleneck)
The database organises and exposes algorithms for you
conveniently
4. Flexibility
Use the data in new, unanticipated ways -> anticipate a broad set
of ways of accessing the data

Hierarchical clustering in Python and beyond

  • 1. Hierarchical clustering in Python & elsewhere. For @PyDataConf London, June 2015, by Frank Kelly, Data Scientist, Engineer @analyticsseo (@norhustla)
  • 2. Hierarchical Clustering: Theory, Practice, Visualisation. Origins & definitions; Methods & considerations; Hierarchical theory; Metrics & performance; My use case; Python libraries; Example; Static; Interactive; Further ideas. All opinions expressed are my own
  • 3. Who am I? All opinions expressed are my own
  • 5. Clustering is an unsupervised learning problem whereby we aim to group subsets of entities with one another based on some notion of similarity. Image: "SLINK-Gaussian-data" by Chire - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:SLINK-Gaussian-data.svg#/media/File:SLINK-Gaussian-data.svg
  • 7. Diverse applications Attribution: stack overflow, wikipedia, scikit-learn.org, http://www.poolparty.biz/
  • 8. Two main purposes: (1) Exploratory analysis as a standalone tool (data mining). (2) As a component of a supervised learning pipeline, in which distinct classifiers or regression models are trained for each cluster (machine learning).
  • 9. Clustering considerations: Partitioning criteria (single / multi level) • Separation (exclusive / non-exclusive) • Clustering space (full-space / sub-space) • Similarity measure (distance / connectivity)
  • 10. Use case: search keywords. Diagram labels: You, The competition!, Opportunity! Legend: CD = Competing domains; CP = Competitor’s pages; RD = Ranking domain; P = Your page; KW = Keyword
  • 12. Use case: search keywords. …so we have found 100,000 new KWs. Now what? How do we summarise and present these to a client?
  • 13. Clients’ questions… • Do search categories in general align with my website structure? • Which categories of opportunity keywords have the highest search volume, bring the most visitors, revenue etc.? • Which keywords are not relevant?
  • 15. Requirements • Need: visual insights; structure • Allow targeting of the problem in hand • May develop into a semi-supervised solution
  • 16. Options for text clustering? • High-dimensional and sparse data set • Values correspond to word frequencies • Recommended methods include: hierarchical clustering, K-means with an appropriate distance measure, topic modelling (LDA, LSI), co-clustering
  • 18. 2 types: Agglomerative and Divisive. Deterministic algorithms! Attribution: Wikipedia
  • 19. Agglomerative: start with many “singleton” clusters … merge 2 at a time continuously … build a hierarchy. Divisive: start with a huge “macro” cluster … iteratively split into 2 groups … build a hierarchy.
  • 20. Agglomerative method: Linkage types • Single (similarity between the two most similar elements, i.e. nearest neighbours) • Complete (similarity between the two most dissimilar elements) Attribution: https://www.coursera.org/course/clusteranalysis
  • 23. Agglomerative method: Linkage types • Average link (average of the similarity between all inter-cluster pairs); computationally expensive (Na*Nb) • Trick: Centroid link (similarity between the centroids of two clusters) Attribution: https://www.coursera.org/course/clusteranalysis
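
The linkage types above differ only in how the distance between two clusters is derived from the pairwise distances of their members. The following is a minimal standard-library sketch, not the speaker's code: the function names and the 1-D sample points are made up for illustration, and it uses distance rather than similarity (so "most similar" becomes "smallest distance"). For real data, SciPy's `scipy.cluster.hierarchy.linkage` provides these methods in optimised form.

```python
from itertools import combinations


def linkage_distance(a, b, method):
    """Distance between clusters a and b (lists of 1-D points)."""
    pair_dists = [abs(x - y) for x in a for y in b]  # all Na*Nb pairs
    if method == "single":      # closest pair (nearest neighbour)
        return min(pair_dists)
    if method == "complete":    # farthest pair
        return max(pair_dists)
    if method == "average":     # mean over all inter-cluster pairs
        return sum(pair_dists) / len(pair_dists)
    raise ValueError(f"unknown linkage method: {method}")


def agglomerate(points, method="single"):
    """Start from singletons, merge the two closest clusters until one is left.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [[p] for p in points]
    history = []
    while len(clusters) > 1:
        # Naive search over every pair of clusters (this is what scales badly).
        dist, i, j = min(
            (linkage_distance(clusters[i], clusters[j], method), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        history.append((clusters[i], clusters[j], dist))
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history


# Hypothetical sample data: five points on a line.
points = [1.0, 2.0, 4.0, 8.0, 13.0]
for method in ("single", "complete", "average"):
    heights = [dist for _, _, dist in agglomerate(points, method)]
    print(method, heights)
```

On this sample, single linkage chains the point 8.0 onto the growing cluster {1, 2, 4}, whereas complete and average linkage pair it with 13.0 first, which shows how the choice of linkage changes the hierarchy.
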
  • 24. Ward’s criterion • Minimise a function: total in-cluster variance • As defined by, e.g.: • Once merged, then the SSE will increase (cluster becomes bigger) by: https://en.wikipedia.org/wiki/Ward's_method
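
The formulas on this slide were images and are not present in the transcript. A standard statement of Ward's criterion, consistent with the Wikipedia page the slide cites, is sketched below. The symbols ($n_A$ for the size of cluster $A$, $m_A$ for its centroid) are notation chosen here and may differ from the slide's.

```latex
% Within-cluster sum of squared errors (SSE) for a cluster A with centroid m_A
\mathrm{SSE}(A) = \sum_{x_i \in A} \lVert x_i - m_A \rVert^2,
\qquad
m_A = \frac{1}{n_A} \sum_{x_i \in A} x_i

% Increase in total SSE when clusters A and B are merged
\Delta(A, B) = \mathrm{SSE}(A \cup B) - \mathrm{SSE}(A) - \mathrm{SSE}(B)
             = \frac{n_A \, n_B}{n_A + n_B} \, \lVert m_A - m_B \rVert^2
```

At each step the agglomerative procedure merges the pair of clusters with the smallest $\Delta(A, B)$, so the total in-cluster variance grows as slowly as possible.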
  • 25. Divisive clustering • Top-down approach • Criterion to split: Ward’s criterion • Handling noise: Use a threshold to determine the termination criteria Attribution: https://www.coursera.org/course/clusteranalysis
  • 26. Similarity measures. This will certainly influence the shape of the clusters! • Numerical: Use a variation of the Minkowski distance (e.g. city block / Manhattan, Euclidean) • Binary: Manhattan, Jaccard coefficient, Hamming • Text: Cosine similarity.
  • 27. Cosine similarity • Represent a document by a bag of terms • Record the frequency of a particular term (word / topic / phrase) • If d1 and d2 are two term vectors, …can thus calculate the similarity between them Attribution: https://www.coursera.org/course/clusteranalysis
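
As a rough illustration of the idea on this slide, the sketch below builds bag-of-terms vectors for short keyword phrases and computes the usual cosine similarity, cos(d1, d2) = (d1 · d2) / (‖d1‖ ‖d2‖). The helper names and the example phrases are hypothetical, and whitespace tokenisation is a simplifying assumption.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag of terms: map each lower-cased word to its frequency."""
    return Counter(text.lower().split())


def cosine_similarity(d1, d2):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(count * d2[term] for term, count in d1.items())
    norm1 = math.sqrt(sum(c * c for c in d1.values()))
    norm2 = math.sqrt(sum(c * c for c in d2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm1 * norm2)


# Hypothetical keyword phrases.
d1 = term_vector("cheap red running shoes")
d2 = term_vector("red running shoes sale")
d3 = term_vector("hotel deals london")

print(cosine_similarity(d1, d2))  # three shared terms out of four each
print(cosine_similarity(d1, d3))  # no shared terms
```

Because the measure depends only on the angle between the vectors, a long and a short document with the same term proportions score as highly similar, which is one reason it is a common choice for text.
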
  • 29. Gather word documents = keyword phrases
  • 30. Aggregate search words with URL “words”
  • 31. Text clustering: preparations • Add features where possible o I added URL words to my word set • Stem words o Choose the right stemmer: too severe can be bad • Stop words o NLTK tokeniser o scikit-learn TF-IDF tokeniser • Low frequency cut-off o 2 => drop words appearing fewer than twice in the whole corpus • High frequency cut-off o 0.5 => drop words that appear in more than 50% of documents • N-grams o Single words, bi-grams, tri-grams • Beware of foreign languages o Separate datasets if possible
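
The cut-offs and n-gram settings listed above can be sketched in plain Python as follows. This is an illustrative approximation rather than the speaker's pipeline: the stop-word list, function names and keyword phrases are invented, stemming is omitted, and n-grams are built after stop-word removal. In practice scikit-learn's `TfidfVectorizer` exposes comparable options (`min_df`, `max_df`, `ngram_range`, `stop_words`), although its `min_df` counts documents rather than total occurrences.

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not NLTK's list).
STOP_WORDS = {"the", "in", "for", "a", "of"}


def ngrams(tokens, max_n=3):
    """All contiguous n-grams (as strings) for n = 1..max_n."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def build_vocabulary(documents, min_count=2, max_doc_fraction=0.5, max_n=3):
    """Keep terms seen at least min_count times in the whole corpus and
    present in at most max_doc_fraction of the documents."""
    corpus_counts = Counter()  # total occurrences across the corpus
    doc_counts = Counter()     # number of documents containing the term
    for doc in documents:
        tokens = [t for t in doc.lower().split() if t not in STOP_WORDS]
        terms = ngrams(tokens, max_n)
        corpus_counts.update(terms)
        doc_counts.update(set(terms))
    n_docs = len(documents)
    return sorted(
        term
        for term, count in corpus_counts.items()
        if count >= min_count and doc_counts[term] / n_docs <= max_doc_fraction
    )


# Hypothetical keyword phrases standing in for the "documents".
keywords = [
    "cheap running shoes",
    "running shoes for women",
    "cheap hotels in london",
    "london hotels",
    "cheap flights",
    "buy cheap trail shoes",
]
print(build_vocabulary(keywords))
```

Here "cheap" is dropped by the high-frequency cut-off (it occurs in four of six phrases), while terms that occur only once are dropped by the low-frequency cut-off.
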
  • 33. Dimensionality • Get a sparse matrix o Mostly zeros • Reduce the number of dimensions o PCA o Spectral clustering • The “curse” of dimensionality
  • 38. Assess the quality of your clusters • Internal: Purity, completeness & homogeneity • External: Adjusted Rand index, Normalised Mutual Information (NMI)
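
To make one of these measures concrete, here is a small sketch of purity, assuming a hand-labelled reference category is available for each clustered item. The function name and the toy labels are hypothetical. Purity is the fraction of items that belong to the majority reference class of their cluster.

```python
from collections import Counter, defaultdict


def purity(cluster_labels, true_labels):
    """Fraction of items that share the majority true class of their cluster."""
    members = defaultdict(list)
    for cluster, truth in zip(cluster_labels, true_labels):
        members[cluster].append(truth)
    # For each cluster, count how many items carry its most common true label.
    majority_total = sum(
        Counter(truths).most_common(1)[0][1] for truths in members.values()
    )
    return majority_total / len(true_labels)


# Hypothetical cluster assignments and reference categories for 8 keywords.
clusters = [0, 0, 0, 1, 1, 1, 2, 2]
truth = ["shoes", "shoes", "hotels", "hotels", "hotels", "hotels", "flights", "shoes"]
print(purity(clusters, truth))
```

Purity rewards many small clusters (one cluster per item scores 1.0), so it is usually read alongside a measure such as the Adjusted Rand index that penalises over-splitting.
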
  • 41. Life on the inside: Elasticsearch • Why not perform pre-processing and clustering inside Elasticsearch? • Document store • TF-IDF and other • Stop words • Language-specific analysers
  • 42. Elasticsearch: try it! • https://www.elastic.co/ • NoSQL document store • Aggregations and stats • Fast, distributed • Quick to set up
  • 44. Lingo 3G algorithm • Lingo 3G: Hierarchical clustering off-the-shelf • Built-in part of speech (POS) • User-defined word/synonym/label dictionaries • Built-in stemmer / word inflection database • Multi-lingual support, advanced tuning • Commercial: costs attached http://download.carrotsearch.com/lingo3g/manual/#section.es http://project.carrot2.org/algorithms.html
  • 45. Elasticsearch with clustering: Utopia? • Carrot2’s Lingo3G in action: http://search.carrot2.org/stable/search • Foamtree visualisation example: visualisation of hierarchical structure is possible for large datasets via “lazy loading” http://get.carrotsearch.com/foamtree/demo/demos/large.html
  • 46. Limitations of hierarchical clustering • Can’t undo what’s done: a divisive method works on sub-clusters and cannot re-merge them, and an agglomerative method never splits a cluster again once it has been merged • Every split or merge must be refined • Methods may not scale well: checking all possible pairs drives the complexity up • There are extensions: BIRCH, CURE and CHAMELEON
  • 47. Thank you! • A decent introductory course to clustering: https://www.coursera.org/course/clusteranalysis • Hierarchical (agglomerative) clustering in Python: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html • Recent (ish) relevant Kaggle challenge: https://www.kaggle.com/c/lshtc • Visualisation: http://carrotsearch.com/foamtree-overview • Clustering elsewhere (Lingo, Lingo3G) with Carrot2: http://download.carrotsearch.com/ • Elasticsearch: https://www.elastic.co/ • Analytics SEO: http://www.analyticsseo.com/ • Me: @norhustla / frank.kelly@cantab.net • Attribution: http://wynway.com/
  • 48. Extra slide: Why work inside the database? 1. Sharing data (management of): support concurrent access by multiple readers and writers. 2. Data model enforcement: make sure all applications see clean, organised data. 3. Scale: work with datasets too large to fit in memory (over a certain size, specialised algorithms are needed to deal with the data, which becomes a bottleneck); the database organises and exposes algorithms for you conveniently. 4. Flexibility: use the data in new, unanticipated ways, so anticipate a broad set of ways of accessing the data.

Editor's Notes

  1. http://dienekes.blogspot.co.uk/2013/12/europeans-neolithic-farmers-mesolithic.html
  2. https://upload.wikimedia.org/wikipedia/commons/3/39/Swiss_complete.png