Hierarchical clustering
in Python & elsewhere
For @PyDataConf London, June 2015, by Frank Kelly

Clustering of data is an increasingly important task for many data scientists. This talk will explore the challenge of hierarchical clustering of text data for summarisation purposes. We'll take a look at some great solutions now available to Python users including the relevant Scikit Learn libraries, via Elasticsearch (with the carrot2 plugin), and check out visualisations from both approaches.

https://www.youtube.com/watch?v=KFs9pBAetOo

  1. Hierarchical clustering in Python & elsewhere. For @PyDataConf London, June 2015, by Frank Kelly, Data Scientist, Engineer @analyticsseo @norhustla
  2. Hierarchical Clustering: Theory, Practice, Visualisation. Origins & definitions; Methods & considerations; Hierarchical theory; Metrics & performance; My use case; Python libraries; Example; Static; Interactive; Further ideas. All opinions expressed are my own
  3. Who am I? All opinions expressed are my own
  4. Clustering: a recap. Attribution: www.alexmaclean.com
  5. Clustering is an unsupervised learning problem whereby we aim to group subsets of entities with one another based on some notion of similarity. Image: "SLINK-Gaussian-data" by Chire - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:SLINK-Gaussian-data.svg#/media/File:SLINK-Gaussian-data.svg
  6. Origins. 1930s: Anthropology & Psychology. http://dienekes.blogspot.co.uk/2013/12/europeans-neolithic-farmers-mesolithic.html
  7. Diverse applications. Attribution: Stack Overflow, Wikipedia, scikit-learn.org, http://www.poolparty.biz/
  8. Two main purposes
     • Exploratory analysis – standalone tool (Data mining)
     • As a component of a supervised learning pipeline, in which distinct classifiers or regression models are trained for each cluster (Machine Learning)
  9. Clustering considerations
     • Partitioning criteria (single / multi level)
     • Separation (exclusive / non-exclusive)
     • Clustering space (full-space / sub-space)
     • Similarity measure (distance / connectivity)
  10. Use case: search keywords. [Diagram of domains, pages and keywords, labelled "You", "The competition!" and "Opportunity!"] Legend: CD = Competing domains; CP = Competitor’s pages; RD = Ranking domain; P = Your page; KW = Keyword
  11. ….x 100,000 !!
  12. Use case: search keywords. …so we have found 100,000 new KWs – now what? How do we summarise and present these to a client?
  13. Clients’ questions…
     • Do search categories in general align with my website structure?
     • Which categories of opportunity keywords have the highest search volume, bring the most visitors, revenue etc.?
     • Which keywords are not relevant?
  14. Website-like structure
  15. Requirements
     • Need: visual insights; structure
     • Allow targeting of problem in hand
     • May develop into a semi-supervised solution
  16. Options for text clustering?
     • High-dimensional and sparse data set
     • Values correspond to word frequencies
     • Recommended methods include: hierarchical clustering, k-means with an appropriate distance measure, topic modelling (LDA, LSI), co-clustering
  17. Hierarchical Clustering: bringing structure
  18. 2 types: Agglomerative, Divisive. Deterministic algorithms! Attribution: Wikipedia
  19. Agglomerative: Start with many “singleton” clusters … Merge 2 at a time continuously … Build a hierarchy. Divisive: Start with a huge “macro” cluster … Iteratively split into 2 groups … Build a hierarchy
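
The agglomerative procedure on slide 19 can be sketched in a few lines of plain Python. This is only a naive illustration, not the implementation used in the talk: the function names, the sample points and the choice of single linkage with Euclidean distance are all assumptions made for the example.

```python
import math
from itertools import combinations

def single_link(a, b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerate(points, linkage=single_link):
    """Naive agglomerative clustering.

    Returns the merge history as a list of (cluster_a, cluster_b, distance).
    """
    clusters = [[p] for p in points]          # start with singleton clusters
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        d = linkage(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # i < j, so delete index j first to keep index i valid.
        del clusters[j]
        del clusters[i]
        clusters.append(merged)
    return merges

# Hypothetical 2-D points: two tight pairs and one outlier.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
merges = agglomerate(points)
for a, b, d in merges:
    print(f"merge {a} + {b} at distance {d:.3f}")
```

The merge history is the hierarchy: reading it bottom-up gives the order in which a dendrogram joins its branches. The loop re-evaluates every pair on every pass, which is fine for a sketch but scales poorly.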
  20. Agglomerative method: Linkage types
     • Single (similarity between the two most similar elements, based on nearest neighbour)
     • Complete (similarity between the two most dissimilar elements)
     Attribution: https://www.coursera.org/course/clusteranalysis
  21. Agglomerative method: Linkage types
     • Average link (average of similarity between all inter-cluster pairs). Computationally expensive (Na*Nb)
     • Trick: Centroid link (similarity between the centroids of two clusters)
     Attribution: https://www.coursera.org/course/clusteranalysis
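
The four linkage types on slides 20 and 21 can be compared side by side with a small sketch. It assumes Euclidean distance (so "most similar" means "closest"), and the two sample clusters are made up for illustration.

```python
import math
from itertools import product

def single_link(a, b):
    """Distance between the two closest elements (nearest neighbours)."""
    return min(math.dist(p, q) for p, q in product(a, b))

def complete_link(a, b):
    """Distance between the two most dissimilar (farthest) elements."""
    return max(math.dist(p, q) for p, q in product(a, b))

def average_link(a, b):
    """Mean distance over all len(a) * len(b) inter-cluster pairs."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def centroid_link(a, b):
    """Distance between the two cluster centroids (one distance, not Na*Nb)."""
    return math.dist(centroid(a), centroid(b))

# Hypothetical clusters of 2-D points.
a = [(0.0, 0.0), (0.0, 1.0)]
b = [(3.0, 0.0), (4.0, 0.0)]
for name, fn in [("single", single_link), ("complete", complete_link),
                 ("average", average_link), ("centroid", centroid_link)]:
    print(f"{name:9s}{fn(a, b):.3f}")
```

Average link touches every inter-cluster pair, which is the Na*Nb cost noted on the slide; centroid link needs only the two means.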
  22. Ward’s criterion
     • Minimise a function: total in-cluster variance
     • As defined by, e.g.:
     • Once merged, then the SSE will increase (cluster becomes bigger) by:
     https://en.wikipedia.org/wiki/Ward's_method
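
The formulas on slide 22 did not survive in this transcript. For reference, the standard textbook form of Ward's criterion is given below; the notation is an assumption and may differ from what the slide showed. Here $\mu_C$ is the centroid of cluster $C$ and $|C|$ its size.

```latex
% Sum of squared errors (within-cluster variance, unnormalised) of a cluster C
\mathrm{SSE}(C) = \sum_{x \in C} \lVert x - \mu_C \rVert^2,
\qquad
\mu_C = \frac{1}{|C|} \sum_{x \in C} x

% Increase in total SSE when clusters A and B are merged
\Delta(A, B)
  = \mathrm{SSE}(A \cup B) - \mathrm{SSE}(A) - \mathrm{SSE}(B)
  = \frac{|A|\,|B|}{|A| + |B|} \, \lVert \mu_A - \mu_B \rVert^2
```

At each step the agglomerative algorithm merges the pair $(A, B)$ with the smallest $\Delta(A, B)$, so the total in-cluster variance grows as slowly as possible.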
  23. Divisive clustering
     • Top-down approach
     • Criterion to split: Ward’s criterion
     • Handling noise: Use a threshold to determine the termination criteria
     Attribution: https://www.coursera.org/course/clusteranalysis
  24. Similarity measures. This will certainly influence the shape of the clusters!
     • Numerical: Use a Minkowski-type distance (e.g. Manhattan / city block, Euclidean)
     • Binary: Manhattan, Jaccard coefficient, Hamming
     • Text: Cosine similarity
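
As a rough illustration of the measures named on slide 24, the sketch below implements four of them in plain Python. The function names and sample vectors are made up for the example; the Jaccard function assumes binary (0/1) vectors.

```python
import math

def manhattan(u, v):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(u, v))

def euclidean(u, v):
    """Straight-line distance between two numerical vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def hamming(u, v):
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(u, v))

def jaccard(u, v):
    """Jaccard coefficient of two binary vectors: |intersection| / |union|."""
    both = sum(1 for x, y in zip(u, v) if x and y)
    either = sum(1 for x, y in zip(u, v) if x or y)
    return both / either if either else 1.0

print(manhattan((0, 0), (3, 4)))              # numerical data
print(euclidean((0, 0), (3, 4)))
print(hamming((1, 0, 1, 1), (1, 1, 0, 1)))    # binary data
print(jaccard((1, 0, 1, 1), (1, 1, 0, 1)))
```

Note that Jaccard here is a similarity (1.0 means identical), whereas the other three are distances (0 means identical), so they cannot be swapped into a clustering routine without converting one into the other.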
  25. Cosine similarity. Represent a document by a bag of terms. Record the frequency of a particular term (word / topic / phrase). If d1 and d2 are two term vectors, … can thus calculate the similarity between them. Attribution: https://www.coursera.org/course/clusteranalysis
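
The formula on slide 25 is missing from the transcript; the usual definition is cos(d1, d2) = (d1 · d2) / (‖d1‖ ‖d2‖). A minimal sketch over bags of terms follows, assuming whitespace tokenisation and raw term counts; the keyword phrases are invented for the example.

```python
import math
from collections import Counter

def bag_of_terms(phrase):
    """Represent a document as a bag of lower-cased terms with frequencies."""
    return Counter(phrase.lower().split())

def cosine_similarity(d1, d2):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(count * d2[term] for term, count in d1.items())
    norm1 = math.sqrt(sum(c * c for c in d1.values()))
    norm2 = math.sqrt(sum(c * c for c in d2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0          # convention chosen here for empty documents
    return dot / (norm1 * norm2)

d1 = bag_of_terms("cheap red shoes")
d2 = bag_of_terms("red shoes")
d3 = bag_of_terms("blue hats")
print(round(cosine_similarity(d1, d2), 4))   # shares two of three terms
print(cosine_similarity(d2, d3))             # no terms in common
```

Because the measure depends only on the angle between vectors, a long and a short document with the same term proportions score as identical, which is one reason it suits word-frequency data.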
  26. Gather word documents = keyword phrases
  27. Aggregate search words with URL “words”
  28. Text clustering: preparations
     • Add features where possible
       o I added URL words to my word set
     • Stem words
       o Choose the right stemmer – too severe can be bad
     • Stop words
       o NLTK tokeniser
       o Scikit learn TF-IDF tokeniser
     • Low frequency cut-off
       o 2 => words appearing less than twice in whole corpus
     • High frequency cut-off
       o 0.5 => words that appear in more than 50% of documents
     • N-grams
       o Single words, bi-grams, tri-grams
     • Beware of foreign languages
       o Separate datasets if possible
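
The frequency cut-offs and n-grams on slide 28 can be illustrated without any library. The sketch below is an assumption-laden stand-in, not the talk's code: it counts document frequency for both cut-offs (scikit-learn's `TfidfVectorizer` exposes similar knobs as `min_df`, `max_df` and `ngram_range`), it tokenises on whitespace, and the keyword phrases are invented.

```python
from collections import Counter

def ngrams(tokens, n_max=3):
    """All word n-grams of length 1..n_max, as space-joined strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def build_vocabulary(docs, min_df=2, max_df=0.5, n_max=3):
    """Keep terms found in at least min_df documents and in at most
    a max_df fraction of all documents."""
    doc_terms = [set(ngrams(doc.lower().split(), n_max)) for doc in docs]
    df = Counter(term for terms in doc_terms for term in terms)
    limit = max_df * len(docs)
    return sorted(t for t, count in df.items() if min_df <= count <= limit)

# Hypothetical keyword phrases.
docs = ["cheap red shoes", "red shoes online", "cheap blue hats", "buy shoes"]
vocabulary = build_vocabulary(docs)
print(vocabulary)
```

In this toy corpus "shoes" appears in three of four documents and is dropped by the high-frequency cut-off, while terms seen only once are dropped by the low-frequency cut-off; the bi-gram "red shoes" survives both.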
  29. Text preparation
  30. Dimensionality
     • Get a sparse matrix
       o Mostly zeros
     • Reduce the number of dimensions
       o PCA
       o Spectral clustering
     • The “curse” of dimensionality
  31. Results: reduced dimensions
  32. Results: reduced dimensions
  33. The dendrogram
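
The dendrogram image itself is not in the transcript. As a hedged sketch of how one might build such a tree in Python, the example below uses SciPy's `scipy.cluster.hierarchy` module rather than whatever the talk used; the data points, the Ward linkage and the two-cluster cut are all assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Hypothetical data: two well-separated groups of 2-D points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# Linkage matrix: one row per merge (the two merged ids, distance, new size).
Z = linkage(X, method="ward")

# Cut the tree so that at most two flat clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")

# no_plot=True returns the layout data without needing a plotting backend;
# drop it (with matplotlib installed) to draw the dendrogram.
tree = dendrogram(Z, no_plot=True)

print(Z.shape)
print(labels)
print(tree["ivl"])   # leaf labels in left-to-right dendrogram order
```

The height at which two branches join is the linkage distance of that merge, so cutting the tree at a given height is equivalent to choosing a number of clusters.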
  34. Assess the quality of your clusters
     • Internal: Purity, completeness & homogeneity
     • External: Adjusted Rand index, Normalised Information index
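
Of the measures on slide 34, purity is the simplest to compute by hand. The sketch below assumes reference labels are available to compare against; the function name and the sample assignments are invented for the example.

```python
from collections import Counter, defaultdict

def purity(cluster_ids, reference_labels):
    """Fraction of items that carry the majority reference label
    of the cluster they were assigned to."""
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, reference_labels):
        by_cluster[cluster][label] += 1
    majority_total = sum(max(counts.values())
                         for counts in by_cluster.values())
    return majority_total / len(reference_labels)

# Hypothetical assignment of six keywords to two clusters.
cluster_ids = [0, 0, 0, 1, 1, 1]
reference = ["shoes", "shoes", "hats", "hats", "hats", "hats"]
score = purity(cluster_ids, reference)
print(round(score, 3))
```

Purity rises towards 1.0 as clusters get smaller, and is trivially 1.0 when every item is its own cluster, which is why it is usually read alongside measures such as homogeneity, completeness or the adjusted Rand index.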
  35. Topic labelling
  36. Hierarchical Clustering Beyond Python (!?)
  37. Life on the inside: Elasticsearch
     • Why not perform pre-processing and clustering inside elasticsearch?
     • Document store
     • TF-IDF and other
     • Stop words
     • Language specific analysers
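
To make the pre-processing idea on slide 37 concrete, here is a sketch of index settings that push lower-casing, stop-word removal and stemming into Elasticsearch. It is written as a Python dict that would be serialised to JSON; the analyser name `keyword_phrases` is hypothetical, the filter choice is an assumption rather than the talk's configuration, and it has not been checked against any particular Elasticsearch version.

```python
import json

# Settings body one might send when creating an index (assumed layout).
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "keyword_phrases": {              # hypothetical analyser name
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",              # normalise case
                        "stop",                   # drop stop words
                        "porter_stem",            # stem remaining tokens
                    ],
                }
            }
        }
    }
}

body = json.dumps(index_settings, indent=2)
print(body)
```

With an analyser like this attached to a text field, the tokenising, stop-word and stemming steps happen at index time inside the document store, so the Python side only has to request the clustering.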
  38. Elasticsearch - try it! -
     • https://www.elastic.co/
     • NoSQL document store
     • Aggregations and stats
     • Fast, distributed
     • Quick to set up
  39. Document storage in ES
  40. Lingo 3G algorithm
     • Lingo 3G: Hierarchical clustering off-the-shelf
     • Built-in part of speech (POS)
     • User-defined word/synonym/label dictionaries
     • Built-in stemmer / word inflection database
     • Multi-lingual support, advanced tuning
     • Commercial: costs attached
     http://download.carrotsearch.com/lingo3g/manual/#section.es
     http://project.carrot2.org/algorithms.html
  41. Elasticsearch with clustering – Utopia?
     • Carrot2’s Lingo3G in action: http://search.carrot2.org/stable/search
     • Foamtree visualisation example. Visualisation of hierarchical structure is possible for large datasets via “lazy loading”: http://get.carrotsearch.com/foamtree/demo/demos/large.html
  42. Limitations of hierarchical clustering
     • Can’t undo what’s done: the divisive method works on sub-clusters and cannot re-merge them. The same is true for agglomerative: once merged, clusters are never split again
     • Every split or merge must be refined
     • Methods may not scale well: checking all possible pairs makes the complexity grow quickly
     There are extensions: BIRCH, CURE and CHAMELEON
  43. Thank you!
     • A decent introductory course to clustering: https://www.coursera.org/course/clusteranalysis
     • Hierarchical (agglomerative) clustering in Python: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
     • Recent (ish) relevant Kaggle challenge: https://www.kaggle.com/c/lshtc
     • Visualisation: http://carrotsearch.com/foamtree-overview
     • Clustering elsewhere (Lingo, Lingo3G) with Carrot2: http://download.carrotsearch.com/
     • Elasticsearch: https://www.elastic.co/
     • Analytics SEO: http://www.analyticsseo.com/
     • Me: @norhustla / frank.kelly@cantab.net
     Attribution: http://wynway.com/
  44. Extra slide: Why work inside the database?
     1. Sharing data (management of): support concurrent access by multiple readers and writers
     2. Data model enforcement: make sure all applications see clean, organised data
     3. Scale: work with datasets too large to fit in memory (over a certain size, specialised algorithms are needed to deal with the data -> bottleneck). The database organises and exposes algorithms for you conveniently
     4. Flexibility: use the data in new, unanticipated ways -> anticipate a broad set of ways of accessing the data