Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
From bag of texts
to bag of clusters
Paul Khudan Yevgen Terpil
pk@youscan.io jt@youscan.io
Map of ML mentions
Mar 2017, collected by YouScan
Map of ML mentions
конференция, meetup
Map of ML mentions
Приглашаем 13 мая на Data Science Lab…
конференция, meetup
Part 1
Classic approach
Word embeddings
Semantic representation of texts
1. Text (semi/un)supervised classification
2. Document retrieval
3. Topic insights
4. Text...
Requirements
• Vector representation is handy

• Descriptive (not distinctive) features

• Language/style/genre independen...
• Token-based methods, although char-based are more robust
• Preprocessing and unification
• Tokenization
• Lemmatization?
...
BoW, Tf-idf and more
• Bag of Words: one-hot encoding over the observed dictionary

• TF-IDF: ‘term frequency’ * ‘inverse ...
BoW, Tf-idf and more
- easily interpretable

- easy to implement

- parameters are straightforward
- not robust to languag...
ODS курс на хабре
Google купила kaggle
распознавание раковых
опухолей
яндекс крипта,
запросы женщин
Data Science Lab
TF-ID...
нейронная сеть
artificial intelligence
TF-IDF + SVD
deep learning
Clustering
1. K-means
2. Hierarchical clustering
3. Density Based Scan
K-means
• Separate all observations in K groups of equal variance

• Iteratively reassign cluster members for cluster memb...
Hierarchical clustering
• Build a hierarchy of clusters

• Bottom-up or top-down approach (agglomerative or divisive clust...
Density Based Scan
• Find areas of high density separated by areas of low density of samples

• Involves two parameters: e...
K-Means clusters
TF-IDF + SVD
Word embeddings
Word embeddings that capture semantics: word2vec family, fastText, GloVe
CBOW Skip-gram
Word embeddings
Word embeddings
Dimension-wise mean/sum/min/max over embeddings of words in text
Words Mover’s Distance
Word embeddings
- semantics is included

- moderately robust to language variance

- scales better, including OOV
- embedd...
ODS курс на хабре
Google купила kaggle
распознавание раковых
опухолей
яндекс крипта,
запросы женщин
Data Science Lab
Word2...
Word2Vec mean
покупка, инвестиции
TF-IDF + SVD
покупка, инвестиции
Sense clusters
Sense clusters
0 0.9 0 0 0.95 0 0.1
3000
еда времяовощи
картошка
• Find K cluster centers over 

target vocabulary embeddi...
Sense clusters
- semantics is now valuable(expressed by
concrete values in vectors)

- meaning now accumulates in text vec...
ODS курс на хабре
Google купила kaggle
распознавание раковых
опухолей
яндекс крипта,
запросы женщин
Data Science Lab
Word2...
покупка, инвестиции
Word2Sense mean
Doc2Vec
ODS курс на хабре
Google купила kaggle
яндекс крипта,
запросы женщин
Doc2Vec
Part 2
Alternatives
Deep learning
ODS курс на хабре
Google купила kaggle
распознавание раковых
опухолей
яндекс крипта,
запросы женщин
Data Science Lab
K-Mea...
Topic modeling
LDA
Google купила kaggle
ODS курс на хабре
Sequence-to-Sequence Models
document vector
Neural Machine Translation Text Summarization
Examples:
sentence vector
Objective
Skip Thought
word embedding
Objective
Fast Sent
Sentence representation
softmax
ODS курс на хабре Google купила kaggle
распознавание раковых
опухолей
яндекс крипта,
запросы женщин
Data Science Lab
Fast ...
покупка, инвестиции
Fast Sent
Fast Sent
конференция, meetup
Sequential Denoising Autoencoder (SDAE)
купил для исследователейGoogle
Google
Google купил для
исследователей
сервис
серви...
ODS курс на хабре
Google купила kaggle
яндекс крипта,
запросы женщин
Data Science Lab
SDAE
конференция, meetup
SDAE
Supervised evaluations
Learning Distributed Representations of Sentences from Unlabelled Data
Unsupervised (relatedness) evaluations
Learning Distributed Representations of Sentences from Unlabelled Data
Links
Learning Distributed Representations of Sentences from Unlabelled Data
http://www.aclweb.org/anthology/N16-1162
Fast...
Questions?
Upcoming SlideShare
Loading in …5
×

DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)

163 views

Published on

From bag of texts to bag of clusters
Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)
Мы рассмотрим современные подходы к кластеризации текстов и их визуализации. Начиная от классического K-means на TF-IDF и заканчивая Deep Learning репрезентациями текстов. В качестве практического примера, мы проанализируем набор сообщений из соц. сетей и попробуем найти основные темы обсуждения.
Все материалы: http://datascience.in.ua/report2017

Published in: Technology
  • Be the first to comment

  • Be the first to like this

DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)

  1. 1. From bag of texts to bag of clusters
  2. 2. Paul Khudan Yevgen Terpil pk@youscan.io jt@youscan.io
  3. 3. Map of ML mentions Mar 2017, collected by YouScan
  4. 4. Map of ML mentions конференция, meetup
  5. 5. Map of ML mentions Приглашаем 13 мая на Data Science Lab… конференция, meetup
  6. 6. Part 1 Classic approach Word embeddings
  7. 7. Semantic representation of texts 1. Text (semi/un)supervised classification 2. Document retrieval 3. Topic insights 4. Text similarity/relatedness
  8. 8. Requirements • Vector representation is handy
 • Descriptive (not distinctive) features
 • Language/style/genre independence
 • Robustness to language/speech variance (word- and phrase- level synonymy, word order, newly emerging words and entities)
  9. 9. • Token-based methods, although char-based are more robust • Preprocessing and unification • Tokenization • Lemmatization? Prerequisites
  10. 10. BoW, Tf-idf and more • Bag of Words: one-hot encoding over the observed dictionary
 • TF-IDF: ‘term frequency’ * ‘inverse document frequency’ for term weighting (include different normalization schemes)
 • Bag of n-grams: collocations carry more specific senses
 • Singular Value Decomposition (SVD) of the original term- document matrix (compression with less relevant information loss):
 ◦ resolves inter-document relations: similarity
 ◦ resolves inter-term relations: synonymy and polysemy
 ◦ reduces dimensionality

  11. 11. BoW, Tf-idf and more - easily interpretable
 - easy to implement
 - parameters are straightforward - not robust to language variance
 - scales badly
 - vulnerable to overfitting
 Pros Cons
  12. 12. ODS курс на хабре Google купила kaggle распознавание раковых опухолей яндекс крипта, запросы женщин Data Science Lab TF-IDF + SVD + TSNE
  13. 13. нейронная сеть artificial intelligence TF-IDF + SVD deep learning
  14. 14. Clustering 1. K-means 2. Hierarchical clustering 3. Density Based Scan
  15. 15. K-means • Separate all observations in K groups of equal variance
 • Iteratively reassign cluster members for cluster members mean to minimize the inertia: within- cluster sum of squared criterion

  16. 16. Hierarchical clustering • Build a hierarchy of clusters
 • Bottom-up or top-down approach (agglomerative or divisive clustering)
 • Various metrics for cluster dissimilarity • Cluster count and contents depends on chosen dissimilarity threshold
 Clusters: a, bc, def
  17. 17. Density Based Scan • Find areas of high density separated by areas of low density of samples
 • Involves two parameters: epsilon and minimum points
 • Epsilon sets the minimum distance for two points to be considered close enough
 Minimum points stand for the amount of mutually close points to be considered a new cluster
 

  18. 18. K-Means clusters TF-IDF + SVD
  19. 19. Word embeddings Word embeddings that capture semantics: word2vec family, fastText, GloVe CBOW Skip-gram
  20. 20. Word embeddings
  21. 21. Word embeddings Dimension-wise mean/sum/min/max over embeddings of words in text Words Mover’s Distance
  22. 22. Word embeddings - semantics is included
 - moderately robust to language variance
 - scales better, including OOV - embeddings source and quality?
 - vector relations (distance measures, separating planes) is what really means, not vector values
 - meaning degrades quickly on moderate-to-large texts - interpretation is a tedious work
 Pros Cons
  23. 23. ODS курс на хабре Google купила kaggle распознавание раковых опухолей яндекс крипта, запросы женщин Data Science Lab Word2Vec mean
  24. 24. Word2Vec mean покупка, инвестиции
  25. 25. TF-IDF + SVD покупка, инвестиции
  26. 26. Sense clusters
  27. 27. Sense clusters 0 0.9 0 0 0.95 0 0.1 3000 еда времяовощи картошка • Find K cluster centers over 
 target vocabulary embeddings
 • Calculate distances (cosine measure) to cluster centers for each vocabulary word, ignore relatively small ones
 • Use distances as new K-dimensional feature vector (word embedding)
 • Aggregate embeddings
 • Normalize?

  28. 28. Sense clusters - semantics is now valuable(expressed by concrete values in vectors)
 - meaning now accumulates in text vectors better
 - it is possible to retrofit clusters on sense interpretations for readability - inherited from word embeddings - chained complexity
 - additional parameters to fiddle with
 - vector length is higher (around 3k dimensions) -> bigger, cumbersome, heavier
 Pros Cons
  29. 29. ODS курс на хабре Google купила kaggle распознавание раковых опухолей яндекс крипта, запросы женщин Data Science Lab Word2Sense mean
  30. 30. покупка, инвестиции Word2Sense mean
  31. 31. Doc2Vec
  32. 32. ODS курс на хабре Google купила kaggle яндекс крипта, запросы женщин Doc2Vec
  33. 33. Part 2 Alternatives Deep learning
  34. 34. ODS курс на хабре Google купила kaggle распознавание раковых опухолей яндекс крипта, запросы женщин Data Science Lab K-Means representation
  35. 35. Topic modeling
  36. 36. LDA Google купила kaggle ODS курс на хабре
  37. 37. Sequence-to-Sequence Models document vector Neural Machine Translation Text Summarization Examples:
  38. 38. sentence vector Objective Skip Thought
  39. 39. word embedding Objective Fast Sent Sentence representation softmax
  40. 40. ODS курс на хабре Google купила kaggle распознавание раковых опухолей яндекс крипта, запросы женщин Data Science Lab Fast Sent
  41. 41. покупка, инвестиции Fast Sent
  42. 42. Fast Sent конференция, meetup
  43. 43. Sequential Denoising Autoencoder (SDAE) купил для исследователейGoogle Google Google купил для исследователей сервис сервис купил сервис для Delete word Swap bigram Corrupt sentence by p0 Є [0, 1] px Є [0, 1] and predict original sentence
  44. 44. ODS курс на хабре Google купила kaggle яндекс крипта, запросы женщин Data Science Lab SDAE
  45. 45. конференция, meetup SDAE
  46. 46. Supervised evaluations Learning Distributed Representations of Sentences from Unlabelled Data
  47. 47. Unsupervised (relatedness) evaluations Learning Distributed Representations of Sentences from Unlabelled Data
  48. 48. Links Learning Distributed Representations of Sentences from Unlabelled Data http://www.aclweb.org/anthology/N16-1162 FastSent, SDAE https://github.com/fh295/SentenceRepresentation Skip-Thought Vectors https://github.com/ryankiros/skip-thoughts Sense clusters https://servponomarev.livejournal.com/10604 https://habrahabr.ru/post/277563/
  49. 49. Questions?

×