From bag of texts to bag of clusters
Paul Khudan (pk@youscan.io), Yevgen Terpil (jt@youscan.io)
Map of ML mentions
Mar 2017, collected by YouScan
[Figure: map of ML mentions with the cluster "conference, meetup" highlighted; example mention: "We invite you to Data Science Lab on May 13…"]
Part 1
Classic approach
Word embeddings
Semantic representation of texts
1. Text (semi/un)supervised classification
2. Document retrieval
3. Topic insights
4. Text similarity/relatedness
Requirements
• Vector representation is handy

• Descriptive (not distinctive) features

• Language/style/genre independence

• Robustness to language/speech variance
(word- and phrase- level synonymy, word
order, newly emerging words and entities)
Prerequisites
• Token-based methods (char-based ones are more robust)
• Preprocessing and unification
• Tokenization
• Lemmatization?
BoW, Tf-idf and more
• Bag of Words: one-hot encoding over the observed dictionary

• TF-IDF: ‘term frequency’ * ‘inverse document frequency’ for
term weighting (includes various normalization schemes)

• Bag of n-grams: collocations carry more specific senses

• Singular Value Decomposition (SVD) of the original term-document
matrix (compression that discards the less relevant information):

◦ resolves inter-document relations: similarity

◦ resolves inter-term relations: synonymy and polysemy

◦ reduces dimensionality
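
The pipeline listed above can be sketched with scikit-learn. This is a minimal illustration, not the authors' code: the toy texts, the n-gram range and the number of SVD components are assumptions chosen only to keep the example small.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up for illustration only).
texts = [
    "google bought kaggle",
    "google acquires kaggle data science platform",
    "data science lab conference in may",
    "meetup and conference on data science",
]

# Bag of words and bag of bigrams, weighted by TF-IDF.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
tfidf = vectorizer.fit_transform(texts)   # sparse matrix: documents x terms

# Truncated SVD compresses the sparse term space into a few dense dimensions.
svd = TruncatedSVD(n_components=2, random_state=0)
dense = svd.fit_transform(tfidf)          # dense matrix: documents x 2

print(tfidf.shape, dense.shape)
```

On a real corpus the number of components would typically be in the hundreds; 2 is used here only because the toy corpus is tiny.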

BoW, Tf-idf and more
Pros
- easily interpretable
- easy to implement
- parameters are straightforward

Cons
- not robust to language variance
- scales badly
- vulnerable to overfitting
[Figure: TF-IDF + SVD + TSNE map of mentions. Labeled clusters: ODS course on Habr; Google bought Kaggle; cancer tumor recognition; Yandex Crypta, women's search queries; Data Science Lab]
[Figure: TF-IDF + SVD. Labeled clusters: neural network; artificial intelligence; deep learning]
Clustering
1. K-means
2. Hierarchical clustering
3. Density-based clustering (DBSCAN)
K-means
• Separate all observations into K groups of equal variance

• Iteratively reassign each observation to the nearest cluster mean and recompute the means, minimizing the inertia
(the within-cluster sum of squares)
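
A small sketch of the inertia criterion, assuming scikit-learn's `KMeans`. The two synthetic blobs are made up for illustration. The inertia reported by the model is recomputed by hand as the sum of squared distances from each point to its assigned cluster mean.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs (illustrative data only).
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.3, size=(50, 2)),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Inertia = within-cluster sum of squares: squared distance from every
# point to the mean of the cluster it was assigned to.
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = float(((points - assigned_centers) ** 2).sum())

print(kmeans.inertia_, manual_inertia)
```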

Hierarchical clustering
• Build a hierarchy of clusters

• Bottom-up or top-down approach (agglomerative or divisive clustering)

• Various metrics for cluster dissimilarity
• Cluster count and contents depend on the chosen
dissimilarity threshold

Clusters:
a, bc, def
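
The effect of the dissimilarity threshold can be sketched with SciPy's agglomerative clustering. The one-dimensional positions of the points a to f, the average-linkage choice and the threshold value are assumptions picked so that the cut reproduces the clusters a, bc, def.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

names = ["a", "b", "c", "d", "e", "f"]
# Hypothetical 1-D coordinates: a is isolated, b and c are close, d, e, f are close.
positions = np.array([[0.0], [3.0], [3.5], [8.0], [8.4], [8.9]])

# Bottom-up (agglomerative) merge tree with average-linkage dissimilarity.
merge_tree = linkage(positions, method="average", metric="euclidean")

# Cut the tree: merges above the dissimilarity threshold are not applied.
labels = fcluster(merge_tree, t=1.5, criterion="distance")

groups = {}
for name, label in zip(names, labels):
    groups.setdefault(int(label), []).append(name)
clusters = sorted(groups.values())
print(clusters)
```

Raising the threshold merges more of the tree into fewer clusters, while lowering it splits the groups further.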
Density-based clustering (DBSCAN)
• Find areas of high density separated by areas of low density of samples

• Involves two parameters: epsilon and minimum points

• Epsilon sets the maximum distance at which two points are still considered close to each other

• Minimum points is the number of mutually close points required to form a new cluster
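
A minimal sketch of the two parameters, assuming scikit-learn's `DBSCAN`, where epsilon is `eps` and minimum points is `min_samples`. The points are made up: two dense groups and one isolated point, which ends up labeled as noise (-1).

```python
import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1],   # dense group 1
    [5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1],   # dense group 2
    [10.0, 0.0],                                      # isolated point
])

# eps: neighborhood radius; min_samples: points needed (including the point
# itself) for a neighborhood to count as dense.
model = DBSCAN(eps=0.3, min_samples=3).fit(points)
labels = model.labels_
print(labels)
```

Unlike K-means, the number of clusters is not set in advance; it follows from the density parameters.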



[Figure: K-Means clusters over TF-IDF + SVD vectors]
Word embeddings
Word embeddings that capture semantics: word2vec family, fastText, GloVe
[Figure: CBOW and Skip-gram architectures]
Dimension-wise mean/sum/min/max over embeddings of words in text
Word Mover’s Distance
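
Dimension-wise pooling can be sketched in a few lines of NumPy. The two-dimensional toy embeddings and the helper name `pool_text` are hypothetical; in practice the vectors would come from a trained word2vec, fastText or GloVe model. Tokens without an embedding are skipped here, which is one possible way to handle OOV words.

```python
import numpy as np

POOLERS = {"mean": np.mean, "sum": np.sum, "min": np.min, "max": np.max}

def pool_text(tokens, embeddings, how="mean"):
    """Aggregate word vectors into one text vector, dimension by dimension."""
    known = [embeddings[token] for token in tokens if token in embeddings]
    if not known:
        raise ValueError("no token of the text has an embedding")
    return POOLERS[how](np.stack(known), axis=0)

# Hypothetical 2-D embeddings, for illustration only.
toy_embeddings = {
    "google": np.array([1.0, 0.0]),
    "bought": np.array([0.0, 1.0]),
    "kaggle": np.array([1.0, 1.0]),
}

tokens = ["google", "bought", "kaggle", "yesterday"]  # "yesterday" is OOV
mean_vector = pool_text(tokens, toy_embeddings, "mean")
max_vector = pool_text(tokens, toy_embeddings, "max")
print(mean_vector, max_vector)
```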
Word embeddings
Pros
- semantics is included
- moderately robust to language variance
- scales better, including OOV

Cons
- embeddings source and quality?
- vector relations (distance measures, separating planes) are what really matter, not the vector values
- meaning degrades quickly on moderate-to-large texts
- interpretation is tedious work
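[Figure: Word2Vec mean map of mentions. Labeled clusters: ODS course on Habr; Google bought Kaggle; cancer tumor recognition; Yandex Crypta, women's search queries; Data Science Lab]
[Figure: Word2Vec mean. Labeled cluster: purchase, investments]
[Figure: TF-IDF + SVD. Labeled cluster: purchase, investments]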
Sense clusters
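[Figure: example sparse sense vector for the word "картошка" (potato): 0 0.9 0 0 0.95 0 0.1, about 3000 dimensions; labeled sense clusters: "еда" (food), "время" (time), "овощи" (vegetables)]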
• Find K cluster centers over the target vocabulary embeddings

• Calculate the cosine measure between each vocabulary word and every
cluster center, ignoring relatively small values

• Use these values as a new K-dimensional feature
vector (word embedding)

• Aggregate embeddings

• Normalize?
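
The steps above can be sketched as follows. This is a rough illustration under several assumptions, not the authors' implementation: K-means is used to find the centers, cosine similarity serves as the cosine measure, values under a hypothetical `min_similarity` cutoff are zeroed, and texts are aggregated with a plain mean. The random word vectors stand in for real embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_sense_vectors(word_vectors, n_clusters, min_similarity=0.5, seed=0):
    """Map each word embedding to a K-dimensional vector of cluster affinities."""
    unit = word_vectors / np.linalg.norm(word_vectors, axis=1, keepdims=True)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(unit)
    centers = kmeans.cluster_centers_
    centers = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    similarity = unit @ centers.T                  # cosine similarity, words x K
    similarity[similarity < min_similarity] = 0.0  # ignore relatively small values
    return similarity

# Random stand-ins for real word embeddings (illustration only).
rng = np.random.default_rng(0)
vocabulary = ["potato", "carrot", "soup", "hour", "minute", "week"]
word_vectors = rng.normal(size=(len(vocabulary), 16))

sense_vectors = build_sense_vectors(word_vectors, n_clusters=3)

# Aggregate the sense vectors of a text into one text vector.
text = ["potato", "soup", "hour"]
rows = [vocabulary.index(token) for token in text]
text_vector = sense_vectors[rows].mean(axis=0)
print(text_vector)
```

With a real vocabulary, K would be in the thousands (the slides mention about 3k dimensions), so most entries of each sense vector are zero.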

Sense clusters
Pros
- semantics is now valuable (expressed by concrete values in vectors)
- meaning now accumulates in text vectors better
- it is possible to retrofit clusters on sense interpretations for readability

Cons
- inherited from word embeddings
- chained complexity
- additional parameters to fiddle with
- vector length is higher (around 3k dimensions) -> bigger, cumbersome, heavier
[Figure: Word2Sense mean map of mentions. Labeled clusters: ODS course on Habr; Google bought Kaggle; cancer tumor recognition; Yandex Crypta, women's search queries; Data Science Lab]
[Figure: Word2Sense mean. Labeled cluster: purchase, investments]
Doc2Vec
[Figure: Doc2Vec map of mentions. Labeled clusters: ODS course on Habr; Google bought Kaggle; Yandex Crypta, women's search queries]
Part 2
Alternatives
Deep learning
[Figure: K-Means representation. Labeled clusters: ODS course on Habr; Google bought Kaggle; cancer tumor recognition; Yandex Crypta, women's search queries; Data Science Lab]
Topic modeling
LDA
[Figure: labeled clusters: Google bought Kaggle; ODS course on Habr]
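Sequence-to-Sequence Models
Examples: Neural Machine Translation, Text Summarization
[Figure: sequence-to-sequence diagram; label: document vector]
Skip-Thought
[Figure: Skip-Thought diagram; labels: sentence vector, objective]
FastSent
[Figure: FastSent diagram; labels: word embedding, sentence representation, softmax, objective]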
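[Figure: FastSent map of mentions. Labeled clusters: ODS course on Habr; Google bought Kaggle; cancer tumor recognition; Yandex Crypta, women's search queries; Data Science Lab]
[Figure: FastSent. Labeled cluster: purchase, investments]
[Figure: FastSent. Labeled cluster: conference, meetup]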
Sequential Denoising Autoencoder (SDAE)
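Corrupt the sentence by
• Delete word, with probability p0 ∈ [0, 1]
• Swap bigram, with probability px ∈ [0, 1]
and predict the original sentence
[Figure: example with the sentence "Google купил сервис для исследователей" (Google bought a service for researchers), shown with a word deleted and with a bigram swapped]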
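
The noise step can be sketched as a small function. This is an illustrative reading of the slide, not the reference implementation: the function name, the use of non-overlapping bigrams and the seeded generator are assumptions. The autoencoder itself, which learns to reconstruct the original sentence from the corrupted one, is not shown.

```python
import random

def corrupt(tokens, p_delete, p_swap, rng):
    """Delete each word with probability p_delete, then swap each
    non-overlapping bigram with probability p_swap."""
    noisy = [token for token in tokens if rng.random() >= p_delete]
    for i in range(0, len(noisy) - 1, 2):
        if rng.random() < p_swap:
            noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]
    return noisy

sentence = ["Google", "bought", "a", "service", "for", "researchers"]
rng = random.Random(0)

unchanged = corrupt(sentence, p_delete=0.0, p_swap=0.0, rng=rng)
all_swapped = corrupt(sentence, p_delete=0.0, p_swap=1.0, rng=rng)
noisy = corrupt(sentence, p_delete=0.1, p_swap=0.1, rng=rng)
print(noisy)
```

The pair (noisy, sentence) is then one training example: the input is the corrupted sequence and the target is the original one.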
[Figure: SDAE map of mentions. Labeled clusters: ODS course on Habr; Google bought Kaggle; Yandex Crypta, women's search queries; Data Science Lab]
[Figure: SDAE. Labeled cluster: conference, meetup]
Supervised evaluations
[Table; source: Learning Distributed Representations of Sentences from Unlabelled Data]
Unsupervised (relatedness) evaluations
[Table; source: Learning Distributed Representations of Sentences from Unlabelled Data]
Links
Learning Distributed Representations of Sentences from Unlabelled Data
http://www.aclweb.org/anthology/N16-1162
FastSent, SDAE
https://github.com/fh295/SentenceRepresentation
Skip-Thought Vectors
https://github.com/ryankiros/skip-thoughts
Sense clusters
https://servponomarev.livejournal.com/10604
https://habrahabr.ru/post/277563/
Questions?

Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 

Recently uploaded (20)

Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

DataScience Lab 2017_From bag of texts to bag of clusters_Yevgen Terpil / Paul Khudan (Data Scientists / NLP Engineer at YouScan)

  • 1. From bag of texts to bag of clusters
  • 2. Paul Khudan Yevgen Terpil pk@youscan.io jt@youscan.io
  • 3. Map of ML mentions Mar 2017, collected by YouScan
  • 4. Map of ML mentions: conference, meetup
  • 5. Map of ML mentions: "We invite you to Data Science Lab on May 13…" (conference, meetup)
  • 7. Semantic representation of texts 1. Text (semi/un)supervised classification 2. Document retrieval 3. Topic insights 4. Text similarity/relatedness
  • 8. Requirements • Vector representation is handy
 • Descriptive (not distinctive) features
 • Language/style/genre independence
 • Robustness to language/speech variance (word- and phrase- level synonymy, word order, newly emerging words and entities)
  • 9. Prerequisites • Token-based methods, although char-based are more robust • Preprocessing and unification • Tokenization • Lemmatization?
  • 10. BoW, Tf-idf and more • Bag of Words: one-hot encoding over the observed dictionary
 • TF-IDF: 'term frequency' * 'inverse document frequency' for term weighting (includes different normalization schemes)
 • Bag of n-grams: collocations carry more specific senses
 • Singular Value Decomposition (SVD) of the original term-document matrix (compression that discards the less relevant information):
 ◦ resolves inter-document relations: similarity
 ◦ resolves inter-term relations: synonymy and polysemy
 ◦ reduces dimensionality
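
The pipeline on this slide can be sketched with scikit-learn. This is a minimal illustration, not the authors' code: the toy documents, the bigram range, and the choice of two latent dimensions are all assumptions made for the example.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical); real input would be tokenized, normalized mentions.
docs = [
    "google bought kaggle",
    "kaggle competition for data science",
    "data science lab conference",
    "machine learning meetup and conference",
]

# Bag of words plus bigrams, weighted by TF-IDF.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(docs)   # sparse matrix: documents x terms

# Truncated SVD compresses the term space into a few latent dimensions.
svd = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = svd.fit_transform(tfidf)   # dense matrix: documents x 2

print(tfidf.shape, doc_vectors.shape)
```

The dense `doc_vectors` are what would then be fed into clustering or a 2-D projection such as t-SNE.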

  • 11. BoW, Tf-idf and more
 Pros:
 - easily interpretable
 - easy to implement
 - parameters are straightforward
 Cons:
 - not robust to language variance
 - scales badly
 - vulnerable to overfitting
  • 12. TF-IDF + SVD + TSNE [map labels: ODS course on Habr; Google bought Kaggle; cancer tumor recognition; Yandex Crypta, women's queries; Data Science Lab]
  • 14. Clustering 1. K-means 2. Hierarchical clustering 3. Density Based Scan
  • 15. K-means • Separate all observations into K groups of equal variance
 • Iteratively reassign observations to the nearest cluster mean and recompute the means to minimize the inertia: the within-cluster sum-of-squares criterion
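
The reassign-and-recompute loop can be sketched in a few lines of NumPy. This is a plain Lloyd-style iteration written for illustration; the function names, the random initialization, and the toy points are assumptions, not part of the talk.

```python
import numpy as np

def nearest_center(points, centers):
    """Index of the closest center for every point (Euclidean distance)."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(points, k, n_iter=100, seed=0):
    """Return (centers, labels, inertia) for a simple k-means run."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest_center(points, centers)
        # Move each center to the mean of its members (keep it if empty).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    labels = nearest_center(points, centers)
    # Inertia: within-cluster sum of squared distances to the centers.
    inertia = float(((points - centers[labels]) ** 2).sum())
    return centers, labels, inertia

points = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
centers, labels, inertia = kmeans(points, k=2)
print(labels, inertia)
```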

  • 16. Hierarchical clustering • Build a hierarchy of clusters
 • Bottom-up or top-down approach (agglomerative or divisive clustering)
 • Various metrics for cluster dissimilarity
 • Cluster count and contents depend on the chosen dissimilarity threshold
 Clusters: a, bc, def
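
How the threshold changes the result can be shown with SciPy's agglomerative clustering. The six one-dimensional points below are invented so that they reproduce the "a, bc, def" grouping; average linkage and the two threshold values are likewise assumptions for the sketch.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical observations a, b, c, d, e, f on a line.
names = ["a", "b", "c", "d", "e", "f"]
points = np.array([[0.0], [4.0], [4.5], [10.0], [10.4], [11.0]])

# Bottom-up (agglomerative) tree using average-linkage dissimilarity.
tree = linkage(points, method="average", metric="euclidean")

# Cutting the tree at different dissimilarity thresholds changes the clusters.
fine = fcluster(tree, t=2.0, criterion="distance")
coarse = fcluster(tree, t=5.0, criterion="distance")

print(dict(zip(names, fine)))
print(dict(zip(names, coarse)))
```

With the lower threshold the cut yields three clusters (a, bc, def); with the higher one, a is absorbed into bc and two clusters remain.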
  • 17. Density Based Scan • Find areas of high density separated by areas of low density of samples
 • Involves two parameters: epsilon and minimum points
 • Epsilon sets the maximum distance at which two points are still considered close enough
 • Minimum points is the number of mutually close points required to form a new cluster
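
The two parameters map directly onto scikit-learn's `DBSCAN` (`eps` and `min_samples`). The points and parameter values below are made up for the sketch: two dense groups plus one isolated point, which is labelled as noise (`-1`).

```python
import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],   # dense area 1
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],   # dense area 2
    [2.5, 9.0],                                       # isolated point
])

# eps: neighbourhood radius; min_samples: points (including the point itself)
# needed inside that radius for a point to count as a core point.
model = DBSCAN(eps=0.3, min_samples=3)
labels = model.fit_predict(points)

print(labels)
```

Unlike K-means, the number of clusters is not given up front; it follows from the density parameters.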
 

  • 19. Word embeddings that capture semantics: word2vec family (CBOW, Skip-gram), fastText, GloVe
  • 21. Word embeddings: dimension-wise mean/sum/min/max over embeddings of words in text; Word Mover's Distance
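
Dimension-wise aggregation is simple enough to sketch directly. The three-dimensional vectors below are hypothetical stand-ins for real word2vec, fastText, or GloVe embeddings, and skipping out-of-vocabulary tokens is one possible policy rather than the authors' stated choice.

```python
import numpy as np

# Hypothetical toy embeddings; real ones would come from a trained model.
embeddings = {
    "google": np.array([0.9, 0.1, 0.0]),
    "bought": np.array([0.2, 0.8, 0.1]),
    "kaggle": np.array([0.8, 0.2, 0.1]),
}

REDUCERS = {"mean": np.mean, "sum": np.sum, "min": np.min, "max": np.max}

def text_vector(tokens, embeddings, how="mean"):
    """Aggregate word vectors dimension-wise; unknown tokens are skipped."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        dim = len(next(iter(embeddings.values())))
        return np.zeros(dim)
    return REDUCERS[how](np.vstack(vectors), axis=0)

tokens = "google bought kaggle today".split()
mean_vec = text_vector(tokens, embeddings, "mean")
max_vec = text_vector(tokens, embeddings, "max")
print(mean_vec, max_vec)
```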
  • 22. Word embeddings
 Pros:
 - semantics is included
 - moderately robust to language variance
 - scales better, including OOV
 Cons:
 - embeddings source and quality?
 - vector relations (distance measures, separating planes) are what carry the meaning, not the vector values
 - meaning degrades quickly on moderate-to-large texts
 - interpretation is tedious work
  • 23. Word2Vec mean [map labels: ODS course on Habr; Google bought Kaggle; cancer tumor recognition; Yandex Crypta, women's queries; Data Science Lab]
  • 25. TF-IDF + SVD [map label: purchase, investments]
  • 27. Sense clusters [figure: example sense vector with values 0, 0.9, 0, 0, 0.95, 0, 0.1 over 3000 dimensions; labels: food, time, vegetables, potato]
 • Find K cluster centers over target vocabulary embeddings
 • Calculate distances (cosine measure) to cluster centers for each vocabulary word, ignore relatively small ones
 • Use distances as new K-dimensional feature vector (word embedding)
 • Aggregate embeddings
 • Normalize?
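
The steps above can be sketched as follows. Everything concrete here is an assumption for illustration: the random toy embeddings, K = 2 (the slides mention around 3000), K-means as the way to find centers, the 0.5 cut-off for "relatively small" values, and reading the "cosine measure" as cosine similarity.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vocab = ["potato", "carrot", "bread", "hour", "minute", "week"]
# Hypothetical stand-in embeddings: two loose groups in 5 dimensions.
food = rng.normal(loc=[1, 0, 0, 0, 0], scale=0.1, size=(3, 5))
time = rng.normal(loc=[0, 1, 0, 0, 0], scale=0.1, size=(3, 5))
word_vectors = np.vstack([food, time])

# Step 1: find K cluster centers over the vocabulary embeddings.
K = 2
centers = KMeans(n_clusters=K, n_init=10, random_state=0).fit(word_vectors).cluster_centers_

def unit(m):
    """Scale rows to unit length so a dot product equals cosine similarity."""
    return m / np.linalg.norm(m, axis=-1, keepdims=True)

# Step 2: cosine similarity of every word to every center; drop small values.
sims = unit(word_vectors) @ unit(centers).T   # shape: (vocabulary, K)
sims[sims < 0.5] = 0.0                        # assumed cut-off

# Step 3: the K similarities become the word's new feature vector.
sense = dict(zip(vocab, sims))

def text_sense_vector(tokens):
    """Steps 4-5: sum the word sense vectors, then L2-normalize."""
    rows = [sense[t] for t in tokens if t in sense]
    if not rows:
        return np.zeros(K)
    total = np.sum(rows, axis=0)
    norm = np.linalg.norm(total)
    return total / norm if norm > 0 else total

doc = text_sense_vector(["potato", "bread", "week"])
print(doc)
```

Each dimension of `doc` now corresponds to one cluster, which is what makes the individual values interpretable.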

  • 28. Sense clusters
 Pros:
 - semantics is now valuable (expressed by concrete values in vectors)
 - meaning now accumulates in text vectors better
 - it is possible to retrofit clusters on sense interpretations for readability
 Cons:
 - inherited from word embeddings
 - chained complexity
 - additional parameters to fiddle with
 - vector length is higher (around 3k dimensions) -> bigger, cumbersome, heavier
  • 29. Word2Sense mean [map labels: ODS course on Habr; Google bought Kaggle; cancer tumor recognition; Yandex Crypta, women's queries; Data Science Lab]
  • 32. Doc2Vec [map labels: ODS course on Habr; Google bought Kaggle; Yandex Crypta, women's queries]
  • 34. K-Means representation [map labels: ODS course on Habr; Google bought Kaggle; cancer tumor recognition; Yandex Crypta, women's queries; Data Science Lab]
  • 36. LDA [map labels: Google bought Kaggle; ODS course on Habr]
  • 37. Sequence-to-Sequence Models (document vector). Examples: Neural Machine Translation, Text Summarization
  • 40. Fast Sent [map labels: ODS course on Habr; Google bought Kaggle; cancer tumor recognition; Yandex Crypta, women's queries; Data Science Lab]
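  • 43. Sequential Denoising Autoencoder (SDAE): corrupt the sentence by deleting words with probability p0 ∈ [0, 1] and swapping bigrams with probability px ∈ [0, 1], then predict the original sentence [figure: delete-word and swap-bigram examples on a Russian sentence built from the words "Google", "купил" (bought), "сервис" (service), "для" (for), "исследователей" (researchers)]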
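
The corruption step on this slide can be sketched with the standard library alone. The function name and the choice to swap non-overlapping bigrams after deletion are assumptions for the example; the autoencoder that learns to restore the original sentence is not shown.

```python
import random

def corrupt(tokens, p_delete, p_swap, rng):
    """Noise a token list: delete words, then swap adjacent pairs."""
    # Delete each word independently with probability p_delete.
    kept = [t for t in tokens if rng.random() >= p_delete]
    # Swap each non-overlapping bigram with probability p_swap.
    noisy = list(kept)
    for i in range(0, len(noisy) - 1, 2):
        if rng.random() < p_swap:
            noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]
    return noisy

sentence = "google bought a service for researchers".split()
rng = random.Random(0)
print(corrupt(sentence, p_delete=0.2, p_swap=0.3, rng=rng))
```

Training pairs are then (corrupted sentence, original sentence); setting both probabilities to 0 leaves the input untouched, which reduces the model to a plain sequential autoencoder.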
  • 44. SDAE [map labels: ODS course on Habr; Google bought Kaggle; Yandex Crypta, women's queries; Data Science Lab]
  • 46. Supervised evaluations (source: Learning Distributed Representations of Sentences from Unlabelled Data)
  • 47. Unsupervised (relatedness) evaluations (source: Learning Distributed Representations of Sentences from Unlabelled Data)
  • 48. Links
 • Learning Distributed Representations of Sentences from Unlabelled Data: http://www.aclweb.org/anthology/N16-1162
 • FastSent, SDAE: https://github.com/fh295/SentenceRepresentation
 • Skip-Thought Vectors: https://github.com/ryankiros/skip-thoughts
 • Sense clusters: https://servponomarev.livejournal.com/10604 https://habrahabr.ru/post/277563/