SlideShare a Scribd company logo
1 of 38
Download to read offline
BigARTM: Open Source Library for
Regularized Multimodal Topic Modeling
of Large Collections
Konstantin Vorontsov, Oleksandr Frei, Murat Apishev,
Peter Romov, Marina Dudarenko
Yandex • CC RAS • MIPT • HSE • MSU
Analysis of Images, Social Networks and Texts
Ekaterinburg • 9–11 April 2015
Contents
1 Theory
Probabilistic Topic Modeling
ARTM — Additive Regularization for Topic Modeling
Multimodal Probabilistic Topic Modeling
2 BigARTM implementation — http://bigartm.org
BigARTM: parallel architecture
BigARTM: time and memory performance
How to start using BigARTM
3 Experiments
ARTM for combining regularizers
Multi-ARTM for classification
Multi-ARTM for multi-language TM
Theory
BigARTM implementation — http://bigartm.org
Experiments
Probabilistic Topic Modeling
ARTM — Additive Regularization for Topic Modeling
Multimodal Probabilistic Topic Modeling
What is “topic”?
Topic is a special terminology of a particular domain area.
Topic is a set of coherent terms (words or phrases)
that often occur together in documents.
Formally, topic is a probability distribution over terms:
p(w|t) is (unknown) frequency of word w in topic t.
Document semantics is a probability distribution over topics:
p(t|d) is (unknown) frequency of topic t in document d.
Each document d consists of terms w1, w2, . . . , wnd
:
p(w|d) is (known) frequency of term w in document d.
When writing term w in document d author thinks about topic t.
Topic model tries to uncover latent topics from a text collection.
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 3 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
Probabilistic Topic Modeling
ARTM — Additive Regularization for Topic Modeling
Multimodal Probabilistic Topic Modeling
Goals and applications of Topic Modeling
Goals:
Uncover a hidden thematic structure of the text collection
Find a compressed semantic representation of each document
Applications:
Information retrieval for long-text queries
Semantic search in large scientific document collections
Revealing research trends and research fronts
Expert search
News aggregation
Recommender systems
Categorization, classification, summarization, segmentation
of texts, images, video, signals, social media
and many others
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 4 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
Probabilistic Topic Modeling
ARTM — Additive Regularization for Topic Modeling
Multimodal Probabilistic Topic Modeling
Probabilistic Topic Modeling: milestones and mainstream
1 PLSA — Probabilistic Latent Semantic Analysis (1999)
2 LDA — Latent Dirichlet Allocation (2003)
3 100s of PTMs based on Graphical Models & Bayesian Inference
David Blei. Probabilistic topic models // Communications of the ACM, 2012.
Vol. 55. No. 4. Pp. 77–84.
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 5 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
Probabilistic Topic Modeling
ARTM — Additive Regularization for Topic Modeling
Multimodal Probabilistic Topic Modeling
Generative Probabilistic Topic Model (PTM)
Topic model explains terms w in documents d by topics t:
p(w|d) =
t
p(w|t)p(t|d)
Ра а а а -а а а
а . М а а а а а а а
а а а а
GC- GA- а а а а а . На
а а а , а а а а а а
а ( а , а а а ) а а а
а. М а а а а а а а а а . О
а а а а , а
а а . Е а а а
( а а а а а).
‫)݀|ݐ(݌‬
‫:)ݐ|ݓ(݌‬
‫ݐ‬ଵ, … , ‫ݐ‬௡೏
‫ݓ‬ଵ, … , ‫ݓ‬௡೏
:
0.018 а а а
0.013
0.011 а
… … … …
0.023
0.016
0.009
… … … …
0.014 а
0.009
0.006 а
… … … …
‫ݐ‬ଵ ‫ݐ‬ଶଵ ‫ݐ‬ଷଵଷ ‫ݐ‬ସଷଵ‫ݐ‬ହଷଵ
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 6 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
Probabilistic Topic Modeling
ARTM — Additive Regularization for Topic Modeling
Multimodal Probabilistic Topic Modeling
PLSA: Probabilistic Latent Semantic Analysis [T. Hofmann 1999]
Given: D is a set (collection) of documents
W is a set (vocabulary) of terms
ndw = how many times term w appears in document d
Find: parameters φwt =p(w|t), θtd =p(t|d) of the topic model
p(w|d) =
t
φwtθtd .
The problem of log-likelihood maximization under
non-negativeness and normalization constraints:
d,w
ndw ln
t
φwtθtd → max
Φ,Θ
,
φwt 0,
w∈W
φwt = 1; θtd 0,
t∈T
θtd = 1.
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 7 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
Probabilistic Topic Modeling
ARTM — Additive Regularization for Topic Modeling
Multimodal Probabilistic Topic Modeling
Topic Modeling is an ill-posed inverse problem
Topic Modeling is the problem of stochastic matrix factorization:
p(w|d) =
t∈T
φwtθtd .
In matrix notation P
W ×D
= Φ
W ×T
· Θ
T×D
, where
P = p(w|d) W ×D
is known term–document matrix,
Φ = φwt W ×T
is unknown term–topic matrix, φwt =p(w|t),
Θ = θtd T×D
is unknown topic–document matrix, θtd =p(t|d).
Matrix factorization is not unique, the solution is not stable:
ΦΘ = (ΦS)(S−1
Θ) = Φ′
Θ′
for all S such that Φ′ = ΦS, Θ′ = S−1Θ are stochastic.
Then, regularization is needed to find appropriate solution.
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 8 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
Probabilistic Topic Modeling
ARTM — Additive Regularization for Topic Modeling
Multimodal Probabilistic Topic Modeling
ARTM: Additive Regularization of Topic Model
Additional regularization criteria Ri (Φ, Θ) → max, i = 1, . . . , n.
The problem of regularized log-likelihood maximization under
non-negativeness and normalization constraints:
d,w
ndw ln
t∈T
φwtθtd
log-likelihood L (Φ,Θ)
+
n
i=1
τi Ri (Φ, Θ)
R(Φ,Θ)
→ max
Φ,Θ
,
φwt 0;
w∈W
φwt = 1; θtd 0;
t∈T
θtd = 1
where τi > 0 are regularization coefficients.
Vorontsov K. V., Potapenko A. A. Tutorial on Probabilistic Topic Modeling:
Additive Regularization for Stochastic Matrix Factorization // AIST’2014,
Springer CCIS, 2014. Vol. 436. pp. 29–46.
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 9 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
Probabilistic Topic Modeling
ARTM — Additive Regularization for Topic Modeling
Multimodal Probabilistic Topic Modeling
ARTM: available regularizers
topic smoothing (equivalent to LDA)
topic sparsing
topic decorrelation
topic selection via entropy sparsing
topic coherence maximization
supervised learning for classification and regression
semi-supervised learning
using documents citation and links
modeling temporal topic dynamics
using vocabularies in multilingual topic models
and many others
Vorontsov K. V., Potapenko A. A. Additive Regularization of Topic Models //
Machine Learning. Special Issue “Data Analysis and Intelligent Optimization
with Applications”. Springer, 2014.
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 10 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
Probabilistic Topic Modeling
ARTM — Additive Regularization for Topic Modeling
Multimodal Probabilistic Topic Modeling
Multimodal Probabilistic Topic Modeling
Given a text document collection Probabilistic Topic Model finds:
p(t|d) — topic distribution for each document d,
p(w|t) — term distribution for each topic t.
Topics of documents
Words and keyphrases of topics
doc1:
doc2:
doc3:
doc4:
...
Text documents
Topic
Modeling
D
o
c
u
m
e
n
t
s
T
o
p
i
c
s
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 11 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
Probabilistic Topic Modeling
ARTM — Additive Regularization for Topic Modeling
Multimodal Probabilistic Topic Modeling
Multimodal Probabilistic Topic Modeling
Multimodal Topic Model finds topical distribution for terms p(w|t),
authors p(a|t), time p(y|t),
Topics of documents
Words and keyphrases of topics
doc1:
doc2:
doc3:
doc4:
...
Text documents
Topic
Modeling
D
o
c
u
m
e
n
t
s
T
o
p
i
c
s
Metadata:
Authors
Data Time
Conference
Organization
URL
etc.
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 12 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
Probabilistic Topic Modeling
ARTM — Additive Regularization for Topic Modeling
Multimodal Probabilistic Topic Modeling
Multimodal Probabilistic Topic Modeling
Multimodal Topic Model finds topical distribution for terms p(w|t),
authors p(a|t), time p(y|t), objects on images p(o|t),
Topics of documents
Words and keyphrases of topics
doc1:
doc2:
doc3:
doc4:
...
Text documents
Topic
Modeling
D
o
c
u
m
e
n
t
s
T
o
p
i
c
s
Metadata:
Authors
Data Time
Conference
Organization
URL
etc.
Images
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 13 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
Probabilistic Topic Modeling
ARTM — Additive Regularization for Topic Modeling
Multimodal Probabilistic Topic Modeling
Multimodal Probabilistic Topic Modeling
Multimodal Topic Model finds topical distribution for terms p(w|t),
authors p(a|t), time p(y|t), objects on images p(o|t),
linked documents p(d′|t),
Topics of documents
Words and keyphrases of topics
doc1:
doc2:
doc3:
doc4:
...
Text documents
Topic
Modeling
D
o
c
u
m
e
n
t
s
T
o
p
i
c
s
Metadata:
Authors
Data Time
Conference
Organization
URL
etc.
Images Links
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 14 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
Probabilistic Topic Modeling
ARTM — Additive Regularization for Topic Modeling
Multimodal Probabilistic Topic Modeling
Multimodal Probabilistic Topic Modeling
Multimodal Topic Model finds topical distribution for terms p(w|t),
authors p(a|t), time p(y|t), objects on images p(o|t),
linked documents p(d′|t), advertising banners p(b|t),
Topics of documents
Words and keyphrases of topics
doc1:
doc2:
doc3:
doc4:
...
Text documents
Topic
Modeling
D
o
c
u
m
e
n
t
s
T
o
p
i
c
s
Metadata:
Authors
Data Time
Conference
Organization
URL
etc.
Ads Images Links
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 15 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
Probabilistic Topic Modeling
ARTM — Additive Regularization for Topic Modeling
Multimodal Probabilistic Topic Modeling
Multimodal Probabilistic Topic Modeling
Multimodal Topic Model finds topical distribution for terms p(w|t),
authors p(a|t), time p(y|t), objects on images p(o|t),
linked documents p(d′|t), advertising banners p(b|t), users p(u|t),
Topics of documents
Words and keyphrases of topics
doc1:
doc2:
doc3:
doc4:
...
Text documents
Topic
Modeling
D
o
c
u
m
e
n
t
s
T
o
p
i
c
s
Metadata:
Authors
Data Time
Conference
Organization
URL
etc.
Ads Images Links
Users
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 16 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
Probabilistic Topic Modeling
ARTM — Additive Regularization for Topic Modeling
Multimodal Probabilistic Topic Modeling
Multimodal Probabilistic Topic Modeling
Multimodal Topic Model finds topical distribution for terms p(w|t),
authors p(a|t), time p(y|t), objects on images p(o|t),
linked documents p(d′|t), advertising banners p(b|t), users p(u|t),
and binds all these modalities into a single topic model.
Topics of documents
Words and keyphrases of topics
doc1:
doc2:
doc3:
doc4:
...
Text documents
Topic
Modeling
D
o
c
u
m
e
n
t
s
T
o
p
i
c
s
Metadata:
Authors
Data Time
Conference
Organization
URL
etc.
Ads Images Links
Users
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 17 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
Probabilistic Topic Modeling
ARTM — Additive Regularization for Topic Modeling
Multimodal Probabilistic Topic Modeling
Multi-ARTM: combining multimodality with regularization
M is the set of modalities
W m is a vocabulary of tokens of m-th modality, m ∈ M
W = W 1 ⊔ · · · ⊔ W M is a joint vocabulary of all modalities
The problem of multimodal regularized log-likelihood
maximization under non-negativeness and normalization constraints:
m∈M
λm
d∈D w∈W m
ndw ln
t∈T
φwtθtd
modality log-likelihood Lm(Φ,Θ)
+
n
i=1
τi Ri (Φ, Θ)
R(Φ,Θ)
→ max
Φ,Θ
,
φwt 0,
w∈W m
φwt = 1, m ∈ M; θtd 0,
t∈T
θtd = 1.
where λm > 0, τi > 0 are regularization coefficients.
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 18 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
Probabilistic Topic Modeling
ARTM — Additive Regularization for Topic Modeling
Multimodal Probabilistic Topic Modeling
Multi-ARTM: multimodal regularized EM-algorithm
EM-algorithm is a simple-iteration method for a system of equations
Theorem. The local maximum (Φ, Θ) satisfies the following
system of equations with auxiliary variables ptdw = p(t|d, w):
ptdw = norm
t∈T
φwtθtd ;
φwt = norm
w∈W m
nwt + φwt
∂R
∂φwt
; nwt =
d∈D
λm(w)ndw ptdw ;
θtd = norm
t∈T
ntd + θtd
∂R
∂θtd
; ntd =
w∈d
λm(w)ndw ptdw ;
where norm
t∈T
xt = max{xt ,0}
s∈T
max{xs ,0} is nonnegative normalization;
m(w) is the modality of the term w, so that w ∈ W m(w).
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 19 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
Probabilistic Topic Modeling
ARTM — Additive Regularization for Topic Modeling
Multimodal Probabilistic Topic Modeling
Fast online EM-algorithm for Multi-ARTM
Input: collection D split into batches Db, b = 1, . . . , B;
Output: matrix Φ;
1 initialize φwt for all w ∈ W , t ∈ T;
2 nwt := 0, ˜nwt := 0 for all w ∈ W , t ∈ T;
3 for all batches Db, b = 1, . . . , B
4 iterate each document d ∈ Db at a constant matrix Φ:
(˜nwt) := (˜nwt) + ProcessBatch (Db, Φ);
5 if (synchronize) then
6 nwt := nwt + ˜ndw for all w ∈ W , t ∈ T;
7 φwt := norm
w∈W m
nwt + φwt
∂R
∂φwt
for all w ∈W m, m∈M, t ∈T;
8 ˜nwt := 0 for all w ∈ W , t ∈ T;
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 20 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
Probabilistic Topic Modeling
ARTM — Additive Regularization for Topic Modeling
Multimodal Probabilistic Topic Modeling
Fast online EM-algorithm for Multi-ARTM
ProcessBatch iterates documents d ∈ Db at a constant matrix Φ.
matrix (˜nwt) := ProcessBatch (set of documents Db, matrix Φ)
1 ˜nwt := 0 for all w ∈ W , t ∈ T;
2 for all d ∈ Db
3 initialize θtd := 1
|T| for all t ∈ T;
4 repeat
5 ptdw := norm
t∈T
φwtθtd for all w ∈ d, t ∈ T;
6 ntd :=
w∈d
λm(w)ndw ptdw for all t ∈ T;
7 θtd := norm
t∈T
ntd + θtd
∂R
∂θtd
for all t ∈ T;
8 until θd converges;
9 ˜nwt := ˜nwt + λm(w)ndw ptdw for all w ∈ d, t ∈ T;
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 21 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
Probabilistic Topic Modeling
ARTM — Additive Regularization for Topic Modeling
Multimodal Probabilistic Topic Modeling
ARTM approach: benefits and restrictions
Benefits
Single EM-algorithm for many models and their combinations
PLSA, LDA, and 100s of PTMs are covered by ARTM
No complicated inference and graphical models
ARTM reduces barriers to entry into PTM research field
ARTM encourages any combinations of regularizers
Multi-ARTM encourages any combinations of modalities
Multi-ARTM is implemented in BigARTM open-source project
Under development (not really restrictions):
3-matrix factorization P = ΦΨΘ, e.g. Author-Topic Model
Further generalization of hypergraph-based Multi-ARTM
Adaptive optimization of regularization coefficients
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 22 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
BigARTM: parallel architecture
BigARTM: time and memory performance
How to start using BigARTM
The BigARTM project: main features
Parallel online Multi-ARTM framework
Open-source http://bigartm.org
Distributed storage of collection is possible
Built-in regularizers:
smoothing, sparsing, decorrelation, semi-supervised learning,
and many others coming soon
Built-in quality measures:
perplexity, sparsity, kernel contrast and purity,
and many others coming soon
Many types of PTMs can be implemented via Multi-ARTM:
multilanguage, temporal, hierarchical, multigram,
and many others
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 23 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
BigARTM: parallel architecture
BigARTM: time and memory performance
How to start using BigARTM
The BigARTM project: parallel architecture
Concurrent processing of batches
Simple single-threaded code for ProcessBatch
User controls when to update the model in online algorithm
Deterministic (reproducible) results from run to run
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 24 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
BigARTM: parallel architecture
BigARTM: time and memory performance
How to start using BigARTM
BigARTM vs Gensim vs Vowpal Wabbit
3.7M articles from Wikipedia, 100K unique words
procs train inference perplexity
BigARTM 1 35 min 72 sec 4000
Gensim.LdaModel 1 369 min 395 sec 4161
VowpalWabbit.LDA 1 73 min 120 sec 4108
BigARTM 4 9 min 20 sec 4061
Gensim.LdaMulticore 4 60 min 222 sec 4111
BigARTM 8 4.5 min 14 sec 4304
Gensim.LdaMulticore 8 57 min 224 sec 4455
procs = number of parallel threads
inference = time to infer θd for 100K held-out documents
perplexity is calculated on held-out documents.
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 25 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
BigARTM: parallel architecture
BigARTM: time and memory performance
How to start using BigARTM
Running BigARTM in Parallel
3.7M articles from Wikipedia, 100K unique words
Amazon EC2 c3.8xlarge (16 physical cores + hyperthreading)
No extra memory cost for adding more threads
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 26 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
BigARTM: parallel architecture
BigARTM: time and memory performance
How to start using BigARTM
How to start using BigARTM
1 Download links, tutorials, documentation:
http://bigartm.org
2 Linux: compile and start examples
Windows: start examples
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 27 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
BigARTM: parallel architecture
BigARTM: time and memory performance
How to start using BigARTM
How to start using BigARTM
1 Download links, tutorials, documentation:
http://bigartm.org
2 Linux: compile and start examples
Windows: start examples
BigARTM community:
1 Post questions in BigARTM discussion group:
https://groups.google.com/group/bigartm-users
2 Report bugs in BigARTM issue tracker:
https://github.com/bigartm/bigartm/issues
3 Contribute to BigARTM project via pull requests:
https://github.com/bigartm/bigartm/pulls
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 28 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
BigARTM: parallel architecture
BigARTM: time and memory performance
How to start using BigARTM
License and programming environment
Freely available for commercial usage (BSD 3-Clause license)
Cross-platform — Windows, Linux, Mac OS X (32 bit, 64 bit)
Simple command-line API — available now
Rich programming API in C++ and Python — available now
Rich programming API in C# and Java — coming soon
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 29 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
ARTM for combining regularizers
Multi-ARTM for classification
Multi-ARTM for multi-language TM
Combining Regularizers: experiment on 3.7M Wikipedia collection
Additive combination of 5 regularizers:
smoothing background (common lexis) topics B in Φ and Θ
sparsing domain-specific topics S = TB in Φ and Θ
decorrelation of topics in Φ
ΦW ×T ΘT×D
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 30 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
ARTM for combining regularizers
Multi-ARTM for classification
Multi-ARTM for multi-language TM
Combining Regularizers: experiment on 3.7M Wikipedia collection
Additive combination of 5 regularizers:
smoothing background (common lexis) topics B in Φ and Θ
sparsing domain-specific topics S = TB in Φ and Θ
decorrelation of topics in Φ
R(Φ, Θ) = + β1
t∈B w∈W
βw ln φwt + α1
d∈D t∈B
αt ln θtd
− β0
t∈S w∈W
βw ln φwt − α0
d∈D t∈S
αt ln θtd
− γ
t∈T s∈Tt w∈W
φwtφws
where β0, α0, β1, α1, γ are regularization coefficients.
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 31 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
ARTM for combining regularizers
Multi-ARTM for classification
Multi-ARTM for multi-language TM
Combining Regularizers: LDA vs ARTM models
P10k , P100k — hold-out perplexity (10K, 100K documents)
SΦ, SΘ — sparsity of Φ and Θ matrices (in %)
Ks , Kp, Kc — average topic kernel size, purity and contrast
Model P10k P100k SΦ SΘ Ks Kp Kc
LDA 3436 3801 0.0 0.0 873 0.533 0.507
ARTM 3577 3947 96.3 80.9 1079 0.785 0.731
Convergence of LDA (thin lines) and ARTM (bold lines)
1 · 106
2 · 106
3 · 106
0.34
0.52
0.69
0.87
1.04
1.22
·104
Perplexity
0
20
40
60
80
100
Sparsity
Perplexity Phi Theta
1 · 106
2 · 106
3 · 106
0
0.25
0.5
0.75
1
1.25
·103
Kernelsize
0
0.2
0.4
0.6
0.8
1
Purityandcontrast
Size Purity Contrast
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 32 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
ARTM for combining regularizers
Multi-ARTM for classification
Multi-ARTM for multi-language TM
EUR-Lex corpus
19 800 documents about European Union law
Two modalities: 21K words, 3 250 categories (class labels)
EUR-Lex is a “power-law dataset” with unbalanced classes:
Left: # unique labels with a given # documents per label
Right: # documents with a given # labels
Rubin T. N., Chambers A., Smyth P., Steyvers M. Statistical topic models for
multi-label document classification // Machine Learning, 2012, 88(1-2).
Pp. 157–208.
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 33 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
ARTM for combining regularizers
Multi-ARTM for classification
Multi-ARTM for multi-language TM
Multi-ARTM for classification
Regularizers:
Uniform smoothing for Θ
Uniform smoothing for word–topic matrix Φ1
Label regularization for class–topic matrix Φ2:
R(Φ2
) = τ
c∈W 2
ˆpc ln p(c) → max,
where
p(c) =
t∈T
φctp(t) is the model distribution of class c,
p(t) = nt
n can be easily estimated along EM iterations,
ˆpc is the empirical frequency of class c in the training data.
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 34 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
ARTM for combining regularizers
Multi-ARTM for classification
Multi-ARTM for multi-language TM
The comparative study of models on EUR-Lex classification task
DLDA (Dependency LDA) [Rubin 2012] is a nearest analog
of Multi-ARTM for classification among Bayesian Topic Models
Quality measures [Rubin 2012]:
AUC-PR (%, ⇑) — Area under precision-recall curve
AUC (%, ⇑) — Area under ROC curve
OneErr (%, ⇓) — One error (most ranked label is not relevant)
IsErr (%, ⇓) — Is error (no perfect classification)
Results:
|T|opt AUC-PR AUC OneErr IsErr
Multi-ARTM 10 000 51.3 98.0 29.1 95.5
DLDA [Rubin 2012] 200 49.2 98.2 32.0 97.2
SVM 43.5 97.5 31.6 98.1
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 35 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
ARTM for combining regularizers
Multi-ARTM for classification
Multi-ARTM for multi-language TM
Multi-language ARTM
We consider languages as modalities in Multi-ARTM.
Collection of 216 175 Russian–English Wikipedia articles pairs.
Top 10 words with p(w|t) probabilities (in %):
Topic 68 Topic 79
research 4.56 институт 6.03 goals 4.48 матч 6.02
technology 3.14 университет 3.35 league 3.99 игрок 5.56
engineering 2.63 программа 3.17 club 3.76 сборная 4.51
institute 2.37 учебный 2.75 season 3.49 фк 3.25
science 1.97 технический 2.70 scored 2.72 против 3.20
program 1.60 технология 2.30 cup 2.57 клуб 3.14
education 1.44 научный 1.76 goal 2.48 футболист 2.67
campus 1.43 исследование 1.67 apps 1.74 гол 2.65
management 1.38 наука 1.64 debut 1.69 забивать 2.53
programs 1.36 образование 1.47 match 1.67 команда 2.14
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 36 / 38
Theory
BigARTM implementation — http://bigartm.org
Experiments
ARTM for combining regularizers
Multi-ARTM for classification
Multi-ARTM for multi-language TM
Multi-language ARTM
Collection of 216 175 Russian–English Wikipedia articles pairs.
Top 10 words with p(w|t) probabilities (in %):
Topic 88 Topic 251
opera 7.36 опера 7.82 windows 8.00 windows 6.05
conductor 1.69 оперный 3.13 microsoft 4.03 microsoft 3.76
orchestra 1.14 дирижер 2.82 server 2.93 версия 1.86
wagner 0.97 певец 1.65 software 1.38 приложение 1.86
soprano 0.78 певица 1.51 user 1.03 сервер 1.63
performance 0.78 театр 1.14 security 0.92 server 1.54
mozart 0.74 партия 1.05 mitchell 0.82 программный 1.08
sang 0.70 сопрано 0.97 oracle 0.82 пользователь 1.04
singing 0.69 вагнер 0.90 enterprise 0.78 обеспечение 1.02
operas 0.68 оркестр 0.82 users 0.78 система 0.96
All |T| = 400 topics were reviewed by an independent assessor,
and he successfully interpreted 396 topics.
Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 37 / 38
Conclusions
ARTM (Additive Regularization for Topic Modeling)
is a general framework, which makes topic models
easy to design, to infer, to explain, and to combine.
Multi-ARTM is a further generalization of ARTM
for multimodal topic modeling
BigARTM is an open source project for parallel online
topic modeling of large text collections.
http://bigartm.org
Join BigARTM community!

More Related Content

What's hot

Programming with Semantic Broad Data
Programming with Semantic Broad DataProgramming with Semantic Broad Data
Programming with Semantic Broad DataSteffen Staab
 
Information theoritic analysis of entity dynamics on the linked open data cloud
Information theoritic analysis of entity dynamics on the linked open data cloudInformation theoritic analysis of entity dynamics on the linked open data cloud
Information theoritic analysis of entity dynamics on the linked open data cloudMOVING Project
 
Bytewise Approximate Match: Theory, Algorithms and Applications
Bytewise Approximate Match:  Theory, Algorithms and ApplicationsBytewise Approximate Match:  Theory, Algorithms and Applications
Bytewise Approximate Match: Theory, Algorithms and ApplicationsLiwei Ren任力偉
 
Topic model, LDA and all that
Topic model, LDA and all thatTopic model, LDA and all that
Topic model, LDA and all thatZhibo Xiao
 
SWiM – A wiki for collaborating on mathematical ontologies
SWiM – A wiki for collaborating on mathematical ontologiesSWiM – A wiki for collaborating on mathematical ontologies
SWiM – A wiki for collaborating on mathematical ontologiesChristoph Lange
 
New Approaches to Interactive Multimedia Content Retrieval from different Sou...
New Approaches to Interactive Multimedia Content Retrieval from different Sou...New Approaches to Interactive Multimedia Content Retrieval from different Sou...
New Approaches to Interactive Multimedia Content Retrieval from different Sou...Grupo HULAT
 
Linked Open (Geo)Data and the Distributed Ontology Language – a perfect match
Linked Open (Geo)Data and the Distributed Ontology Language – a perfect matchLinked Open (Geo)Data and the Distributed Ontology Language – a perfect match
Linked Open (Geo)Data and the Distributed Ontology Language – a perfect matchChristoph Lange
 
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksTopic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksLeonardo Di Donato
 
TopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxTopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxKalpit Desai
 
Text categorization
Text categorizationText categorization
Text categorizationKU Leuven
 
Instance-Based Ontological Knowledge Acquisition
Instance-Based Ontological Knowledge AcquisitionInstance-Based Ontological Knowledge Acquisition
Instance-Based Ontological Knowledge AcquisitionLihua Zhao
 
Mid-Ontology Learning from Linked Data @JIST2011
Mid-Ontology Learning from Linked Data @JIST2011Mid-Ontology Learning from Linked Data @JIST2011
Mid-Ontology Learning from Linked Data @JIST2011Lihua Zhao
 
Latent Dirichlet Allocation
Latent Dirichlet AllocationLatent Dirichlet Allocation
Latent Dirichlet AllocationMarco Righini
 
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning TrackConformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning TrackBhaskar Mitra
 
Data Science and Analytics Brown Bag
Data Science and Analytics Brown BagData Science and Analytics Brown Bag
Data Science and Analytics Brown BagDataTactics
 
NLP on Hadoop: A Distributed Framework for NLP-Based Keyword and Keyphrase Ex...
NLP on Hadoop: A Distributed Framework for NLP-Based Keyword and Keyphrase Ex...NLP on Hadoop: A Distributed Framework for NLP-Based Keyword and Keyphrase Ex...
NLP on Hadoop: A Distributed Framework for NLP-Based Keyword and Keyphrase Ex...Paolo Nesi
 
Mapping Lo Dto Proton Revised [Compatibility Mode]
Mapping Lo Dto Proton Revised [Compatibility Mode]Mapping Lo Dto Proton Revised [Compatibility Mode]
Mapping Lo Dto Proton Revised [Compatibility Mode]Mariana Damova, Ph.D
 

What's hot (20)

Programming with Semantic Broad Data
Programming with Semantic Broad DataProgramming with Semantic Broad Data
Programming with Semantic Broad Data
 
Information theoritic analysis of entity dynamics on the linked open data cloud
Information theoritic analysis of entity dynamics on the linked open data cloudInformation theoritic analysis of entity dynamics on the linked open data cloud
Information theoritic analysis of entity dynamics on the linked open data cloud
 
Bytewise Approximate Match: Theory, Algorithms and Applications
Bytewise Approximate Match:  Theory, Algorithms and ApplicationsBytewise Approximate Match:  Theory, Algorithms and Applications
Bytewise Approximate Match: Theory, Algorithms and Applications
 
Topic model, LDA and all that
Topic model, LDA and all thatTopic model, LDA and all that
Topic model, LDA and all that
 
SWiM – A wiki for collaborating on mathematical ontologies
SWiM – A wiki for collaborating on mathematical ontologiesSWiM – A wiki for collaborating on mathematical ontologies
SWiM – A wiki for collaborating on mathematical ontologies
 
New Approaches to Interactive Multimedia Content Retrieval from different Sou...
New Approaches to Interactive Multimedia Content Retrieval from different Sou...New Approaches to Interactive Multimedia Content Retrieval from different Sou...
New Approaches to Interactive Multimedia Content Retrieval from different Sou...
 
Linked Open (Geo)Data and the Distributed Ontology Language – a perfect match
Linked Open (Geo)Data and the Distributed Ontology Language – a perfect matchLinked Open (Geo)Data and the Distributed Ontology Language – a perfect match
Linked Open (Geo)Data and the Distributed Ontology Language – a perfect match
 
Topics Modeling
Topics ModelingTopics Modeling
Topics Modeling
 
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksTopic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
 
TopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxTopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptx
 
Text categorization
Text categorizationText categorization
Text categorization
 
Instance-Based Ontological Knowledge Acquisition
Instance-Based Ontological Knowledge AcquisitionInstance-Based Ontological Knowledge Acquisition
Instance-Based Ontological Knowledge Acquisition
 
Mid-Ontology Learning from Linked Data @JIST2011
Mid-Ontology Learning from Linked Data @JIST2011Mid-Ontology Learning from Linked Data @JIST2011
Mid-Ontology Learning from Linked Data @JIST2011
 
Latent Dirichlet Allocation
Latent Dirichlet AllocationLatent Dirichlet Allocation
Latent Dirichlet Allocation
 
Lecture20 xing
Lecture20 xingLecture20 xing
Lecture20 xing
 
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning TrackConformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
 
Data Science and Analytics Brown Bag
Data Science and Analytics Brown BagData Science and Analytics Brown Bag
Data Science and Analytics Brown Bag
 
Text categorization
Text categorizationText categorization
Text categorization
 
NLP on Hadoop: A Distributed Framework for NLP-Based Keyword and Keyphrase Ex...
NLP on Hadoop: A Distributed Framework for NLP-Based Keyword and Keyphrase Ex...NLP on Hadoop: A Distributed Framework for NLP-Based Keyword and Keyphrase Ex...
NLP on Hadoop: A Distributed Framework for NLP-Based Keyword and Keyphrase Ex...
 
Mapping Lo Dto Proton Revised [Compatibility Mode]
Mapping Lo Dto Proton Revised [Compatibility Mode]Mapping Lo Dto Proton Revised [Compatibility Mode]
Mapping Lo Dto Proton Revised [Compatibility Mode]
 

Viewers also liked

MailingConf-2011 Mail.Ru (English)
MailingConf-2011 Mail.Ru (English)MailingConf-2011 Mail.Ru (English)
MailingConf-2011 Mail.Ru (English)Sergey Martynov
 
NCSCA Fall 2013 final10_09
NCSCA Fall 2013 final10_09NCSCA Fall 2013 final10_09
NCSCA Fall 2013 final10_09Chuck Small
 
Торговая площадка морских сервисов
Торговая площадка морских сервисовТорговая площадка морских сервисов
Торговая площадка морских сервисовMikhail Velkin
 
С.Есин, IBS. ИТ-стратегия вуза или путь на пик технологий.
С.Есин, IBS. ИТ-стратегия вуза или путь на пик технологий.С.Есин, IBS. ИТ-стратегия вуза или путь на пик технологий.
С.Есин, IBS. ИТ-стратегия вуза или путь на пик технологий.IBS
 
"Выбраться из спама - как повысить CTR рассылки без потери активности". Андре...
"Выбраться из спама - как повысить CTR рассылки без потери активности". Андре..."Выбраться из спама - как повысить CTR рассылки без потери активности". Андре...
"Выбраться из спама - как повысить CTR рассылки без потери активности". Андре...Badoo Development
 
современные технологии космической съемки для оптимизации принятия управленче...
современные технологии космической съемки для оптимизации принятия управленче...современные технологии космической съемки для оптимизации принятия управленче...
современные технологии космической съемки для оптимизации принятия управленче...LAZOVOY
 

Viewers also liked (6)

MailingConf-2011 Mail.Ru (English)
MailingConf-2011 Mail.Ru (English)MailingConf-2011 Mail.Ru (English)
MailingConf-2011 Mail.Ru (English)
 
NCSCA Fall 2013 final10_09
NCSCA Fall 2013 final10_09NCSCA Fall 2013 final10_09
NCSCA Fall 2013 final10_09
 
Торговая площадка морских сервисов
Торговая площадка морских сервисовТорговая площадка морских сервисов
Торговая площадка морских сервисов
 
С.Есин, IBS. ИТ-стратегия вуза или путь на пик технологий.
С.Есин, IBS. ИТ-стратегия вуза или путь на пик технологий.С.Есин, IBS. ИТ-стратегия вуза или путь на пик технологий.
С.Есин, IBS. ИТ-стратегия вуза или путь на пик технологий.
 
"Выбраться из спама - как повысить CTR рассылки без потери активности". Андре...
"Выбраться из спама - как повысить CTR рассылки без потери активности". Андре..."Выбраться из спама - как повысить CTR рассылки без потери активности". Андре...
"Выбраться из спама - как повысить CTR рассылки без потери активности". Андре...
 
современные технологии космической съемки для оптимизации принятия управленче...
современные технологии космической съемки для оптимизации принятия управленче...современные технологии космической съемки для оптимизации принятия управленче...
современные технологии космической съемки для оптимизации принятия управленче...
 

Similar to Open Source Library for Regularized Multimodal Topic Modeling of Large Collections

Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...IRJET Journal
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningAnubhav Jain
 
Project Proposal Topics Modeling (Ir)
Project Proposal    Topics Modeling (Ir)Project Proposal    Topics Modeling (Ir)
Project Proposal Topics Modeling (Ir)Svitlana volkova
 
Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401butest
 
Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...
Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...
Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...AIST
 
MetaScience: Holistic Approach for Research Modeling and Analysis
MetaScience: Holistic Approach for Research Modeling and AnalysisMetaScience: Holistic Approach for Research Modeling and Analysis
MetaScience: Holistic Approach for Research Modeling and AnalysisJordi Cabot
 
A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modellingcsandit
 
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGA TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGcscpconf
 
SWiM – A Semantic Wiki for Mathematical Knowledge Management
SWiM – A Semantic Wiki for Mathematical Knowledge ManagementSWiM – A Semantic Wiki for Mathematical Knowledge Management
SWiM – A Semantic Wiki for Mathematical Knowledge ManagementChristoph Lange
 
Using Computer as a Research Assistant in Qualitative Research
Using Computer as a Research Assistant in Qualitative ResearchUsing Computer as a Research Assistant in Qualitative Research
Using Computer as a Research Assistant in Qualitative ResearchJoshuaApolonio1
 
Kdd 2014 tutorial bringing structure to text - chi
Kdd 2014 tutorial   bringing structure to text - chiKdd 2014 tutorial   bringing structure to text - chi
Kdd 2014 tutorial bringing structure to text - chiBarbara Starr
 
Text mining and analytics v6 - p1
Text mining and analytics   v6 - p1Text mining and analytics   v6 - p1
Text mining and analytics v6 - p1Dave King
 
Invited Talk: Early Detection of Research Topics
Invited Talk: Early Detection of Research Topics Invited Talk: Early Detection of Research Topics
Invited Talk: Early Detection of Research Topics Angelo Salatino
 
Multidimensional Patterns of Disturbance in Digital Social Networks
Multidimensional Patterns of Disturbance in Digital Social NetworksMultidimensional Patterns of Disturbance in Digital Social Networks
Multidimensional Patterns of Disturbance in Digital Social NetworksDimitar Denev
 
Automatic Classification of Springer Nature Proceedings with Smart Topic Miner
Automatic Classification of Springer Nature Proceedings with Smart Topic MinerAutomatic Classification of Springer Nature Proceedings with Smart Topic Miner
Automatic Classification of Springer Nature Proceedings with Smart Topic MinerFrancesco Osborne
 
kantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptkantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptbutest
 

Similar to Open Source Library for Regularized Multimodal Topic Modeling of Large Collections (20)

Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
 
Eric Smidth
Eric SmidthEric Smidth
Eric Smidth
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data mining
 
Project Proposal Topics Modeling (Ir)
Project Proposal    Topics Modeling (Ir)Project Proposal    Topics Modeling (Ir)
Project Proposal Topics Modeling (Ir)
 
Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401
 
Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...
Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...
Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...
 
MetaScience: Holistic Approach for Research Modeling and Analysis
MetaScience: Holistic Approach for Research Modeling and AnalysisMetaScience: Holistic Approach for Research Modeling and Analysis
MetaScience: Holistic Approach for Research Modeling and Analysis
 
A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modelling
 
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGA TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
 
Uk seminar
Uk seminarUk seminar
Uk seminar
 
SWiM – A Semantic Wiki for Mathematical Knowledge Management
SWiM – A Semantic Wiki for Mathematical Knowledge ManagementSWiM – A Semantic Wiki for Mathematical Knowledge Management
SWiM – A Semantic Wiki for Mathematical Knowledge Management
 
qualitative.ppt
qualitative.pptqualitative.ppt
qualitative.ppt
 
Using Computer as a Research Assistant in Qualitative Research
Using Computer as a Research Assistant in Qualitative ResearchUsing Computer as a Research Assistant in Qualitative Research
Using Computer as a Research Assistant in Qualitative Research
 
Kdd 2014 tutorial bringing structure to text - chi
Kdd 2014 tutorial   bringing structure to text - chiKdd 2014 tutorial   bringing structure to text - chi
Kdd 2014 tutorial bringing structure to text - chi
 
Text mining and analytics v6 - p1
Text mining and analytics   v6 - p1Text mining and analytics   v6 - p1
Text mining and analytics v6 - p1
 
Invited Talk: Early Detection of Research Topics
Invited Talk: Early Detection of Research Topics Invited Talk: Early Detection of Research Topics
Invited Talk: Early Detection of Research Topics
 
Multidimensional Patterns of Disturbance in Digital Social Networks
Multidimensional Patterns of Disturbance in Digital Social NetworksMultidimensional Patterns of Disturbance in Digital Social Networks
Multidimensional Patterns of Disturbance in Digital Social Networks
 
Automatic Classification of Springer Nature Proceedings with Smart Topic Miner
Automatic Classification of Springer Nature Proceedings with Smart Topic MinerAutomatic Classification of Springer Nature Proceedings with Smart Topic Miner
Automatic Classification of Springer Nature Proceedings with Smart Topic Miner
 
kantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptkantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.ppt
 
Text mining
Text miningText mining
Text mining
 

More from AIST

Alexey Mikhaylichenko - Automatic Detection of Bone Contours in X-Ray Images
Alexey Mikhaylichenko - Automatic Detection of Bone Contours in X-Ray  ImagesAlexey Mikhaylichenko - Automatic Detection of Bone Contours in X-Ray  Images
Alexey Mikhaylichenko - Automatic Detection of Bone Contours in X-Ray ImagesAIST
 
Алена Ильина и Иван Бибилов, GoTo - GoTo школы, конкурсы и хакатоны
Алена Ильина и Иван Бибилов, GoTo - GoTo школы, конкурсы и хакатоныАлена Ильина и Иван Бибилов, GoTo - GoTo школы, конкурсы и хакатоны
Алена Ильина и Иван Бибилов, GoTo - GoTo школы, конкурсы и хакатоныAIST
 
Станислав Кралин, Сайтсофт - Связанные открытые данные федеральных органов ис...
Станислав Кралин, Сайтсофт - Связанные открытые данные федеральных органов ис...Станислав Кралин, Сайтсофт - Связанные открытые данные федеральных органов ис...
Станислав Кралин, Сайтсофт - Связанные открытые данные федеральных органов ис...AIST
 
Павел Браславский,Velpas - Velpas: мобильный визуальный поиск
Павел Браславский,Velpas - Velpas: мобильный визуальный поискПавел Браславский,Velpas - Velpas: мобильный визуальный поиск
Павел Браславский,Velpas - Velpas: мобильный визуальный поискAIST
 
Евгений Цымбалов, Webgames - Методы машинного обучения для задач игровой анал...
Евгений Цымбалов, Webgames - Методы машинного обучения для задач игровой анал...Евгений Цымбалов, Webgames - Методы машинного обучения для задач игровой анал...
Евгений Цымбалов, Webgames - Методы машинного обучения для задач игровой анал...AIST
 
Александр Москвичев, EveResearch - Алгоритмы анализа данных в маркетинговых и...
Александр Москвичев, EveResearch - Алгоритмы анализа данных в маркетинговых и...Александр Москвичев, EveResearch - Алгоритмы анализа данных в маркетинговых и...
Александр Москвичев, EveResearch - Алгоритмы анализа данных в маркетинговых и...AIST
 
Петр Ермаков, HeadHunter - Модерация резюме: от людей к роботам. Машинное обу...
Петр Ермаков, HeadHunter - Модерация резюме: от людей к роботам. Машинное обу...Петр Ермаков, HeadHunter - Модерация резюме: от людей к роботам. Машинное обу...
Петр Ермаков, HeadHunter - Модерация резюме: от людей к роботам. Машинное обу...AIST
 
Иосиф Иткин, Exactpro - TBA
Иосиф Иткин, Exactpro - TBAИосиф Иткин, Exactpro - TBA
Иосиф Иткин, Exactpro - TBAAIST
 
Nikolay Karpov - Evolvable Semantic Platform for Facilitating Knowledge Exchange
Nikolay Karpov - Evolvable Semantic Platform for Facilitating Knowledge ExchangeNikolay Karpov - Evolvable Semantic Platform for Facilitating Knowledge Exchange
Nikolay Karpov - Evolvable Semantic Platform for Facilitating Knowledge ExchangeAIST
 
George Moiseev - Classification of E-commerce Websites by Product Categories
George Moiseev - Classification of E-commerce Websites by Product CategoriesGeorge Moiseev - Classification of E-commerce Websites by Product Categories
George Moiseev - Classification of E-commerce Websites by Product CategoriesAIST
 
Elena Bruches - The Hybrid Approach to Part-of-Speech Disambiguation
Elena Bruches - The Hybrid Approach to Part-of-Speech DisambiguationElena Bruches - The Hybrid Approach to Part-of-Speech Disambiguation
Elena Bruches - The Hybrid Approach to Part-of-Speech DisambiguationAIST
 
Marina Danshina - The methodology of automated decryption of znamenny chants
Marina Danshina - The methodology of automated decryption of znamenny chantsMarina Danshina - The methodology of automated decryption of znamenny chants
Marina Danshina - The methodology of automated decryption of znamenny chantsAIST
 
Edward Klyshinsky - The Corpus of Syntactic Co-occurences: the First Glance
Edward Klyshinsky - The Corpus of Syntactic Co-occurences: the First GlanceEdward Klyshinsky - The Corpus of Syntactic Co-occurences: the First Glance
Edward Klyshinsky - The Corpus of Syntactic Co-occurences: the First GlanceAIST
 
Galina Lavrentyeva - Anti-spoofing Methods for Automatic Speaker Verification...
Galina Lavrentyeva - Anti-spoofing Methods for Automatic Speaker Verification...Galina Lavrentyeva - Anti-spoofing Methods for Automatic Speaker Verification...
Galina Lavrentyeva - Anti-spoofing Methods for Automatic Speaker Verification...AIST
 
Kaytoue Mehdi - Finding duplicate labels in behavioral data: an application f...
Kaytoue Mehdi - Finding duplicate labels in behavioral data: an application f...Kaytoue Mehdi - Finding duplicate labels in behavioral data: an application f...
Kaytoue Mehdi - Finding duplicate labels in behavioral data: an application f...AIST
 
Valeri Labunets - The bichromatic excitable Schrodinger metamedium
Valeri Labunets - The bichromatic excitable Schrodinger metamediumValeri Labunets - The bichromatic excitable Schrodinger metamedium
Valeri Labunets - The bichromatic excitable Schrodinger metamediumAIST
 
Valeri Labunets - Fast multiparametric wavelet transforms and packets for ima...
Valeri Labunets - Fast multiparametric wavelet transforms and packets for ima...Valeri Labunets - Fast multiparametric wavelet transforms and packets for ima...
Valeri Labunets - Fast multiparametric wavelet transforms and packets for ima...AIST
 
Alexander Karkishchenko - Threefold Symmetry Detection in Hexagonal Images Ba...
Alexander Karkishchenko - Threefold Symmetry Detection in Hexagonal Images Ba...Alexander Karkishchenko - Threefold Symmetry Detection in Hexagonal Images Ba...
Alexander Karkishchenko - Threefold Symmetry Detection in Hexagonal Images Ba...AIST
 
Artyom Makovetskii - An Efficient Algorithm for Total Variation Denoising
Artyom Makovetskii - An Efficient Algorithm for Total Variation DenoisingArtyom Makovetskii - An Efficient Algorithm for Total Variation Denoising
Artyom Makovetskii - An Efficient Algorithm for Total Variation DenoisingAIST
 
Olesia Kushnir - Reflection Symmetry of Shapes Based on Skeleton Primitive Ch...
Olesia Kushnir - Reflection Symmetry of Shapes Based on Skeleton Primitive Ch...Olesia Kushnir - Reflection Symmetry of Shapes Based on Skeleton Primitive Ch...
Olesia Kushnir - Reflection Symmetry of Shapes Based on Skeleton Primitive Ch...AIST
 

More from AIST (20)

Alexey Mikhaylichenko - Automatic Detection of Bone Contours in X-Ray Images
Alexey Mikhaylichenko - Automatic Detection of Bone Contours in X-Ray  ImagesAlexey Mikhaylichenko - Automatic Detection of Bone Contours in X-Ray  Images
Alexey Mikhaylichenko - Automatic Detection of Bone Contours in X-Ray Images
 
Алена Ильина и Иван Бибилов, GoTo - GoTo школы, конкурсы и хакатоны
Алена Ильина и Иван Бибилов, GoTo - GoTo школы, конкурсы и хакатоныАлена Ильина и Иван Бибилов, GoTo - GoTo школы, конкурсы и хакатоны
Алена Ильина и Иван Бибилов, GoTo - GoTo школы, конкурсы и хакатоны
 
Станислав Кралин, Сайтсофт - Связанные открытые данные федеральных органов ис...
Станислав Кралин, Сайтсофт - Связанные открытые данные федеральных органов ис...Станислав Кралин, Сайтсофт - Связанные открытые данные федеральных органов ис...
Станислав Кралин, Сайтсофт - Связанные открытые данные федеральных органов ис...
 
Павел Браславский,Velpas - Velpas: мобильный визуальный поиск
Павел Браславский,Velpas - Velpas: мобильный визуальный поискПавел Браславский,Velpas - Velpas: мобильный визуальный поиск
Павел Браславский,Velpas - Velpas: мобильный визуальный поиск
 
Евгений Цымбалов, Webgames - Методы машинного обучения для задач игровой анал...
Евгений Цымбалов, Webgames - Методы машинного обучения для задач игровой анал...Евгений Цымбалов, Webgames - Методы машинного обучения для задач игровой анал...
Евгений Цымбалов, Webgames - Методы машинного обучения для задач игровой анал...
 
Александр Москвичев, EveResearch - Алгоритмы анализа данных в маркетинговых и...
Александр Москвичев, EveResearch - Алгоритмы анализа данных в маркетинговых и...Александр Москвичев, EveResearch - Алгоритмы анализа данных в маркетинговых и...
Александр Москвичев, EveResearch - Алгоритмы анализа данных в маркетинговых и...
 
Петр Ермаков, HeadHunter - Модерация резюме: от людей к роботам. Машинное обу...
Петр Ермаков, HeadHunter - Модерация резюме: от людей к роботам. Машинное обу...Петр Ермаков, HeadHunter - Модерация резюме: от людей к роботам. Машинное обу...
Петр Ермаков, HeadHunter - Модерация резюме: от людей к роботам. Машинное обу...
 
Иосиф Иткин, Exactpro - TBA
Иосиф Иткин, Exactpro - TBAИосиф Иткин, Exactpro - TBA
Иосиф Иткин, Exactpro - TBA
 
Nikolay Karpov - Evolvable Semantic Platform for Facilitating Knowledge Exchange
Nikolay Karpov - Evolvable Semantic Platform for Facilitating Knowledge ExchangeNikolay Karpov - Evolvable Semantic Platform for Facilitating Knowledge Exchange
Nikolay Karpov - Evolvable Semantic Platform for Facilitating Knowledge Exchange
 
George Moiseev - Classification of E-commerce Websites by Product Categories
George Moiseev - Classification of E-commerce Websites by Product CategoriesGeorge Moiseev - Classification of E-commerce Websites by Product Categories
George Moiseev - Classification of E-commerce Websites by Product Categories
 
Elena Bruches - The Hybrid Approach to Part-of-Speech Disambiguation
Elena Bruches - The Hybrid Approach to Part-of-Speech DisambiguationElena Bruches - The Hybrid Approach to Part-of-Speech Disambiguation
Elena Bruches - The Hybrid Approach to Part-of-Speech Disambiguation
 
Marina Danshina - The methodology of automated decryption of znamenny chants
Marina Danshina - The methodology of automated decryption of znamenny chantsMarina Danshina - The methodology of automated decryption of znamenny chants
Marina Danshina - The methodology of automated decryption of znamenny chants
 
Edward Klyshinsky - The Corpus of Syntactic Co-occurences: the First Glance
Edward Klyshinsky - The Corpus of Syntactic Co-occurences: the First GlanceEdward Klyshinsky - The Corpus of Syntactic Co-occurences: the First Glance
Edward Klyshinsky - The Corpus of Syntactic Co-occurences: the First Glance
 
Galina Lavrentyeva - Anti-spoofing Methods for Automatic Speaker Verification...
Galina Lavrentyeva - Anti-spoofing Methods for Automatic Speaker Verification...Galina Lavrentyeva - Anti-spoofing Methods for Automatic Speaker Verification...
Galina Lavrentyeva - Anti-spoofing Methods for Automatic Speaker Verification...
 
Kaytoue Mehdi - Finding duplicate labels in behavioral data: an application f...
Kaytoue Mehdi - Finding duplicate labels in behavioral data: an application f...Kaytoue Mehdi - Finding duplicate labels in behavioral data: an application f...
Kaytoue Mehdi - Finding duplicate labels in behavioral data: an application f...
 
Valeri Labunets - The bichromatic excitable Schrodinger metamedium
Valeri Labunets - The bichromatic excitable Schrodinger metamediumValeri Labunets - The bichromatic excitable Schrodinger metamedium
Valeri Labunets - The bichromatic excitable Schrodinger metamedium
 
Valeri Labunets - Fast multiparametric wavelet transforms and packets for ima...
Valeri Labunets - Fast multiparametric wavelet transforms and packets for ima...Valeri Labunets - Fast multiparametric wavelet transforms and packets for ima...
Valeri Labunets - Fast multiparametric wavelet transforms and packets for ima...
 
Alexander Karkishchenko - Threefold Symmetry Detection in Hexagonal Images Ba...
Alexander Karkishchenko - Threefold Symmetry Detection in Hexagonal Images Ba...Alexander Karkishchenko - Threefold Symmetry Detection in Hexagonal Images Ba...
Alexander Karkishchenko - Threefold Symmetry Detection in Hexagonal Images Ba...
 
Artyom Makovetskii - An Efficient Algorithm for Total Variation Denoising
Artyom Makovetskii - An Efficient Algorithm for Total Variation DenoisingArtyom Makovetskii - An Efficient Algorithm for Total Variation Denoising
Artyom Makovetskii - An Efficient Algorithm for Total Variation Denoising
 
Olesia Kushnir - Reflection Symmetry of Shapes Based on Skeleton Primitive Ch...
Olesia Kushnir - Reflection Symmetry of Shapes Based on Skeleton Primitive Ch...Olesia Kushnir - Reflection Symmetry of Shapes Based on Skeleton Primitive Ch...
Olesia Kushnir - Reflection Symmetry of Shapes Based on Skeleton Primitive Ch...
 

Recently uploaded

Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...Pooja Nehwal
 
Exploring protein-protein interactions by Weak Affinity Chromatography (WAC) ...
Exploring protein-protein interactions by Weak Affinity Chromatography (WAC) ...Exploring protein-protein interactions by Weak Affinity Chromatography (WAC) ...
Exploring protein-protein interactions by Weak Affinity Chromatography (WAC) ...Salam Al-Karadaghi
 
Philippine History cavite Mutiny Report.ppt
Philippine History cavite Mutiny Report.pptPhilippine History cavite Mutiny Report.ppt
Philippine History cavite Mutiny Report.pptssuser319dad
 
NATIONAL ANTHEMS OF AFRICA (National Anthems of Africa)
NATIONAL ANTHEMS OF AFRICA (National Anthems of Africa)NATIONAL ANTHEMS OF AFRICA (National Anthems of Africa)
NATIONAL ANTHEMS OF AFRICA (National Anthems of Africa)Basil Achie
 
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...henrik385807
 
Open Source Camp Kubernetes 2024 | Running WebAssembly on Kubernetes by Alex ...
Open Source Camp Kubernetes 2024 | Running WebAssembly on Kubernetes by Alex ...Open Source Camp Kubernetes 2024 | Running WebAssembly on Kubernetes by Alex ...
Open Source Camp Kubernetes 2024 | Running WebAssembly on Kubernetes by Alex ...NETWAYS
 
Microsoft Copilot AI for Everyone - created by AI
Microsoft Copilot AI for Everyone - created by AIMicrosoft Copilot AI for Everyone - created by AI
Microsoft Copilot AI for Everyone - created by AITatiana Gurgel
 
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdfCTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdfhenrik385807
 
George Lever - eCommerce Day Chile 2024
George Lever -  eCommerce Day Chile 2024George Lever -  eCommerce Day Chile 2024
George Lever - eCommerce Day Chile 2024eCommerce Institute
 
OSCamp Kubernetes 2024 | Zero-Touch OS-Infrastruktur für Container und Kubern...
OSCamp Kubernetes 2024 | Zero-Touch OS-Infrastruktur für Container und Kubern...OSCamp Kubernetes 2024 | Zero-Touch OS-Infrastruktur für Container und Kubern...
OSCamp Kubernetes 2024 | Zero-Touch OS-Infrastruktur für Container und Kubern...NETWAYS
 
Russian Call Girls in Kolkata Vaishnavi 🤌 8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls in Kolkata Vaishnavi 🤌  8250192130 🚀 Vip Call Girls KolkataRussian Call Girls in Kolkata Vaishnavi 🤌  8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls in Kolkata Vaishnavi 🤌 8250192130 🚀 Vip Call Girls Kolkataanamikaraghav4
 
SBFT Tool Competition 2024 - CPS-UAV Test Case Generation Track
SBFT Tool Competition 2024 - CPS-UAV Test Case Generation TrackSBFT Tool Competition 2024 - CPS-UAV Test Case Generation Track
SBFT Tool Competition 2024 - CPS-UAV Test Case Generation TrackSebastiano Panichella
 
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024eCommerce Institute
 
Work Remotely with Confluence ACE 2.pptx
Work Remotely with Confluence ACE 2.pptxWork Remotely with Confluence ACE 2.pptx
Work Remotely with Confluence ACE 2.pptxmavinoikein
 
Open Source Camp Kubernetes 2024 | Monitoring Kubernetes With Icinga by Eric ...
Open Source Camp Kubernetes 2024 | Monitoring Kubernetes With Icinga by Eric ...Open Source Camp Kubernetes 2024 | Monitoring Kubernetes With Icinga by Eric ...
Open Source Camp Kubernetes 2024 | Monitoring Kubernetes With Icinga by Eric ...NETWAYS
 
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )Pooja Nehwal
 
OSCamp Kubernetes 2024 | SRE Challenges in Monolith to Microservices Shift at...
OSCamp Kubernetes 2024 | SRE Challenges in Monolith to Microservices Shift at...OSCamp Kubernetes 2024 | SRE Challenges in Monolith to Microservices Shift at...
OSCamp Kubernetes 2024 | SRE Challenges in Monolith to Microservices Shift at...NETWAYS
 
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779Delhi Call girls
 
OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...
OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...
OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...NETWAYS
 

Recently uploaded (20)

Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
 
Exploring protein-protein interactions by Weak Affinity Chromatography (WAC) ...
Exploring protein-protein interactions by Weak Affinity Chromatography (WAC) ...Exploring protein-protein interactions by Weak Affinity Chromatography (WAC) ...
Exploring protein-protein interactions by Weak Affinity Chromatography (WAC) ...
 
Philippine History cavite Mutiny Report.ppt
Philippine History cavite Mutiny Report.pptPhilippine History cavite Mutiny Report.ppt
Philippine History cavite Mutiny Report.ppt
 
NATIONAL ANTHEMS OF AFRICA (National Anthems of Africa)
NATIONAL ANTHEMS OF AFRICA (National Anthems of Africa)NATIONAL ANTHEMS OF AFRICA (National Anthems of Africa)
NATIONAL ANTHEMS OF AFRICA (National Anthems of Africa)
 
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
 
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
 
Open Source Camp Kubernetes 2024 | Running WebAssembly on Kubernetes by Alex ...
Open Source Camp Kubernetes 2024 | Running WebAssembly on Kubernetes by Alex ...Open Source Camp Kubernetes 2024 | Running WebAssembly on Kubernetes by Alex ...
Open Source Camp Kubernetes 2024 | Running WebAssembly on Kubernetes by Alex ...
 
Microsoft Copilot AI for Everyone - created by AI
Microsoft Copilot AI for Everyone - created by AIMicrosoft Copilot AI for Everyone - created by AI
Microsoft Copilot AI for Everyone - created by AI
 
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdfCTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
 
George Lever - eCommerce Day Chile 2024
George Lever -  eCommerce Day Chile 2024George Lever -  eCommerce Day Chile 2024
George Lever - eCommerce Day Chile 2024
 
OSCamp Kubernetes 2024 | Zero-Touch OS-Infrastruktur für Container und Kubern...
OSCamp Kubernetes 2024 | Zero-Touch OS-Infrastruktur für Container und Kubern...OSCamp Kubernetes 2024 | Zero-Touch OS-Infrastruktur für Container und Kubern...
OSCamp Kubernetes 2024 | Zero-Touch OS-Infrastruktur für Container und Kubern...
 
Russian Call Girls in Kolkata Vaishnavi 🤌 8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls in Kolkata Vaishnavi 🤌  8250192130 🚀 Vip Call Girls KolkataRussian Call Girls in Kolkata Vaishnavi 🤌  8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls in Kolkata Vaishnavi 🤌 8250192130 🚀 Vip Call Girls Kolkata
 
SBFT Tool Competition 2024 - CPS-UAV Test Case Generation Track
SBFT Tool Competition 2024 - CPS-UAV Test Case Generation TrackSBFT Tool Competition 2024 - CPS-UAV Test Case Generation Track
SBFT Tool Competition 2024 - CPS-UAV Test Case Generation Track
 
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
 
Work Remotely with Confluence ACE 2.pptx
Work Remotely with Confluence ACE 2.pptxWork Remotely with Confluence ACE 2.pptx
Work Remotely with Confluence ACE 2.pptx
 
Open Source Camp Kubernetes 2024 | Monitoring Kubernetes With Icinga by Eric ...
Open Source Camp Kubernetes 2024 | Monitoring Kubernetes With Icinga by Eric ...Open Source Camp Kubernetes 2024 | Monitoring Kubernetes With Icinga by Eric ...
Open Source Camp Kubernetes 2024 | Monitoring Kubernetes With Icinga by Eric ...
 
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )
 
OSCamp Kubernetes 2024 | SRE Challenges in Monolith to Microservices Shift at...
OSCamp Kubernetes 2024 | SRE Challenges in Monolith to Microservices Shift at...OSCamp Kubernetes 2024 | SRE Challenges in Monolith to Microservices Shift at...
OSCamp Kubernetes 2024 | SRE Challenges in Monolith to Microservices Shift at...
 
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
 
OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...
OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...
OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...
 

Open Source Library for Regularized Multimodal Topic Modeling of Large Collections

  • 1. BigARTM: Open Source Library for Regularized Multimodal Topic Modeling of Large Collections Konstantin Vorontsov, Oleksandr Frei, Murat Apishev, Peter Romov, Marina Dudarenko Yandex • CC RAS • MIPT • HSE • MSU Analysis of Images, Social Networks and Texts Ekaterinburg • 9–11 April 2015
  • 2. Contents 1 Theory Probabilistic Topic Modeling ARTM — Additive Regularization for Topic Modeling Multimodal Probabilistic Topic Modeling 2 BigARTM implementation — http://bigartm.org BigARTM: parallel architecture BigARTM: time and memory performance How to start using BigARTM 3 Experiments ARTM for combining regularizers Multi-ARTM for classification Multi-ARTM for multi-language TM
  • 3. Theory BigARTM implementation — http://bigartm.org Experiments Probabilistic Topic Modeling ARTM — Additive Regularization for Topic Modeling Multimodal Probabilistic Topic Modeling What is “topic”? Topic is a special terminology of a particular domain area. Topic is a set of coherent terms (words or phrases) that often occur together in documents. Formally, topic is a probability distribution over terms: p(w|t) is (unknown) frequency of word w in topic t. Document semantics is a probability distribution over topics: p(t|d) is (unknown) frequency of topic t in document d. Each document d consists of terms w1, w2, . . . , wnd : p(w|d) is (known) frequency of term w in document d. When writing term w in document d author thinks about topic t. Topic model tries to uncover latent topics from a text collection. Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 3 / 38
  • 4. Theory BigARTM implementation — http://bigartm.org Experiments Probabilistic Topic Modeling ARTM — Additive Regularization for Topic Modeling Multimodal Probabilistic Topic Modeling Goals and applications of Topic Modeling Goals: Uncover a hidden thematic structure of the text collection Find a compressed semantic representation of each document Applications: Information retrieval for long-text queries Semantic search in large scientific document collections Revealing research trends and research fronts Expert search News aggregation Recommender systems Categorization, classification, summarization, segmentation of texts, images, video, signals, social media and many others Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 4 / 38
  • 5. Theory BigARTM implementation — http://bigartm.org Experiments Probabilistic Topic Modeling ARTM — Additive Regularization for Topic Modeling Multimodal Probabilistic Topic Modeling Probabilistic Topic Modeling: milestones and mainstream 1 PLSA — Probabilistic Latent Semantic Analysis (1999) 2 LDA — Latent Dirichlet Allocation (2003) 3 100s of PTMs based on Graphical Models & Bayesian Inference David Blei. Probabilistic topic models // Communications of the ACM, 2012. Vol. 55. No. 4. Pp. 77–84. Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 5 / 38
  • 6. Theory BigARTM implementation — http://bigartm.org Experiments Probabilistic Topic Modeling ARTM — Additive Regularization for Topic Modeling Multimodal Probabilistic Topic Modeling Generative Probabilistic Topic Model (PTM) Topic model explains terms w in documents d by topics t: p(w|d) = t p(w|t)p(t|d) Ра а а а -а а а а . М а а а а а а а а а а а GC- GA- а а а а а . На а а а , а а а а а а а ( а , а а а ) а а а а. М а а а а а а а а а . О а а а а , а а а . Е а а а ( а а а а а). ‫)݀|ݐ(݌‬ ‫:)ݐ|ݓ(݌‬ ‫ݐ‬ଵ, … , ‫ݐ‬௡೏ ‫ݓ‬ଵ, … , ‫ݓ‬௡೏ : 0.018 а а а 0.013 0.011 а … … … … 0.023 0.016 0.009 … … … … 0.014 а 0.009 0.006 а … … … … ‫ݐ‬ଵ ‫ݐ‬ଶଵ ‫ݐ‬ଷଵଷ ‫ݐ‬ସଷଵ‫ݐ‬ହଷଵ Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 6 / 38
  • 7. Theory BigARTM implementation — http://bigartm.org Experiments Probabilistic Topic Modeling ARTM — Additive Regularization for Topic Modeling Multimodal Probabilistic Topic Modeling PLSA: Probabilistic Latent Semantic Analysis [T. Hofmann 1999] Given: D is a set (collection) of documents W is a set (vocabulary) of terms ndw = how many times term w appears in document d Find: parameters φwt =p(w|t), θtd =p(t|d) of the topic model p(w|d) = t φwtθtd . The problem of log-likelihood maximization under non-negativeness and normalization constraints: d,w ndw ln t φwtθtd → max Φ,Θ , φwt 0, w∈W φwt = 1; θtd 0, t∈T θtd = 1. Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 7 / 38
  • 8. Theory BigARTM implementation — http://bigartm.org Experiments Probabilistic Topic Modeling ARTM — Additive Regularization for Topic Modeling Multimodal Probabilistic Topic Modeling Topic Modeling is an ill-posed inverse problem Topic Modeling is the problem of stochastic matrix factorization: p(w|d) = t∈T φwtθtd . In matrix notation P W ×D = Φ W ×T · Θ T×D , where P = p(w|d) W ×D is known term–document matrix, Φ = φwt W ×T is unknown term–topic matrix, φwt =p(w|t), Θ = θtd T×D is unknown topic–document matrix, θtd =p(t|d). Matrix factorization is not unique, the solution is not stable: ΦΘ = (ΦS)(S−1 Θ) = Φ′ Θ′ for all S such that Φ′ = ΦS, Θ′ = S−1Θ are stochastic. Then, regularization is needed to find appropriate solution. Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 8 / 38
  • 9. Theory BigARTM implementation — http://bigartm.org Experiments Probabilistic Topic Modeling ARTM — Additive Regularization for Topic Modeling Multimodal Probabilistic Topic Modeling ARTM: Additive Regularization of Topic Model Additional regularization criteria Ri (Φ, Θ) → max, i = 1, . . . , n. The problem of regularized log-likelihood maximization under non-negativeness and normalization constraints: d,w ndw ln t∈T φwtθtd log-likelihood L (Φ,Θ) + n i=1 τi Ri (Φ, Θ) R(Φ,Θ) → max Φ,Θ , φwt 0; w∈W φwt = 1; θtd 0; t∈T θtd = 1 where τi > 0 are regularization coefficients. Vorontsov K. V., Potapenko A. A. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization // AIST’2014, Springer CCIS, 2014. Vol. 436. pp. 29–46. Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 9 / 38
  • 10. Theory BigARTM implementation — http://bigartm.org Experiments Probabilistic Topic Modeling ARTM — Additive Regularization for Topic Modeling Multimodal Probabilistic Topic Modeling ARTM: available regularizers topic smoothing (equivalent to LDA) topic sparsing topic decorrelation topic selection via entropy sparsing topic coherence maximization supervised learning for classification and regression semi-supervised learning using documents citation and links modeling temporal topic dynamics using vocabularies in multilingual topic models and many others Vorontsov K. V., Potapenko A. A. Additive Regularization of Topic Models // Machine Learning. Special Issue “Data Analysis and Intelligent Optimization with Applications”. Springer, 2014. Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 10 / 38
  • 11. Theory BigARTM implementation — http://bigartm.org Experiments Probabilistic Topic Modeling ARTM — Additive Regularization for Topic Modeling Multimodal Probabilistic Topic Modeling Multimodal Probabilistic Topic Modeling Given a text document collection Probabilistic Topic Model finds: p(t|d) — topic distribution for each document d, p(w|t) — term distribution for each topic t. Topics of documents Words and keyphrases of topics doc1: doc2: doc3: doc4: ... Text documents Topic Modeling D o c u m e n t s T o p i c s Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 11 / 38
  • 12. Theory BigARTM implementation — http://bigartm.org Experiments Probabilistic Topic Modeling ARTM — Additive Regularization for Topic Modeling Multimodal Probabilistic Topic Modeling Multimodal Probabilistic Topic Modeling Multimodal Topic Model finds topical distribution for terms p(w|t), authors p(a|t), time p(y|t), Topics of documents Words and keyphrases of topics doc1: doc2: doc3: doc4: ... Text documents Topic Modeling D o c u m e n t s T o p i c s Metadata: Authors Data Time Conference Organization URL etc. Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 12 / 38
  • 13. Theory BigARTM implementation — http://bigartm.org Experiments Probabilistic Topic Modeling ARTM — Additive Regularization for Topic Modeling Multimodal Probabilistic Topic Modeling Multimodal Probabilistic Topic Modeling Multimodal Topic Model finds topical distribution for terms p(w|t), authors p(a|t), time p(y|t), objects on images p(o|t), Topics of documents Words and keyphrases of topics doc1: doc2: doc3: doc4: ... Text documents Topic Modeling D o c u m e n t s T o p i c s Metadata: Authors Data Time Conference Organization URL etc. Images Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 13 / 38
  • 14. Theory BigARTM implementation — http://bigartm.org Experiments Probabilistic Topic Modeling ARTM — Additive Regularization for Topic Modeling Multimodal Probabilistic Topic Modeling Multimodal Probabilistic Topic Modeling Multimodal Topic Model finds topical distribution for terms p(w|t), authors p(a|t), time p(y|t), objects on images p(o|t), linked documents p(d′|t), Topics of documents Words and keyphrases of topics doc1: doc2: doc3: doc4: ... Text documents Topic Modeling D o c u m e n t s T o p i c s Metadata: Authors Data Time Conference Organization URL etc. Images Links Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 14 / 38
  • 15. Theory BigARTM implementation — http://bigartm.org Experiments Probabilistic Topic Modeling ARTM — Additive Regularization for Topic Modeling Multimodal Probabilistic Topic Modeling Multimodal Probabilistic Topic Modeling Multimodal Topic Model finds topical distribution for terms p(w|t), authors p(a|t), time p(y|t), objects on images p(o|t), linked documents p(d′|t), advertising banners p(b|t), Topics of documents Words and keyphrases of topics doc1: doc2: doc3: doc4: ... Text documents Topic Modeling D o c u m e n t s T o p i c s Metadata: Authors Data Time Conference Organization URL etc. Ads Images Links Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 15 / 38
  • 16. Theory BigARTM implementation — http://bigartm.org Experiments Probabilistic Topic Modeling ARTM — Additive Regularization for Topic Modeling Multimodal Probabilistic Topic Modeling Multimodal Probabilistic Topic Modeling Multimodal Topic Model finds topical distribution for terms p(w|t), authors p(a|t), time p(y|t), objects on images p(o|t), linked documents p(d′|t), advertising banners p(b|t), users p(u|t), Topics of documents Words and keyphrases of topics doc1: doc2: doc3: doc4: ... Text documents Topic Modeling D o c u m e n t s T o p i c s Metadata: Authors Data Time Conference Organization URL etc. Ads Images Links Users Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 16 / 38
  • 17. Theory BigARTM implementation — http://bigartm.org Experiments Probabilistic Topic Modeling ARTM — Additive Regularization for Topic Modeling Multimodal Probabilistic Topic Modeling Multimodal Probabilistic Topic Modeling Multimodal Topic Model finds topical distribution for terms p(w|t), authors p(a|t), time p(y|t), objects on images p(o|t), linked documents p(d′|t), advertising banners p(b|t), users p(u|t), and binds all these modalities into a single topic model. Topics of documents Words and keyphrases of topics doc1: doc2: doc3: doc4: ... Text documents Topic Modeling D o c u m e n t s T o p i c s Metadata: Authors Data Time Conference Organization URL etc. Ads Images Links Users Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 17 / 38
  • 18. Theory BigARTM implementation — http://bigartm.org Experiments Probabilistic Topic Modeling ARTM — Additive Regularization for Topic Modeling Multimodal Probabilistic Topic Modeling Multi-ARTM: combining multimodality with regularization M is the set of modalities W m is a vocabulary of tokens of m-th modality, m ∈ M W = W 1 ⊔ · · · ⊔ W M is a joint vocabulary of all modalities The problem of multimodal regularized log-likelihood maximization under non-negativeness and normalization constraints: m∈M λm d∈D w∈W m ndw ln t∈T φwtθtd modality log-likelihood Lm(Φ,Θ) + n i=1 τi Ri (Φ, Θ) R(Φ,Θ) → max Φ,Θ , φwt 0, w∈W m φwt = 1, m ∈ M; θtd 0, t∈T θtd = 1. where λm > 0, τi > 0 are regularization coefficients. Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 18 / 38
  • 19. Theory BigARTM implementation — http://bigartm.org Experiments Probabilistic Topic Modeling ARTM — Additive Regularization for Topic Modeling Multimodal Probabilistic Topic Modeling Multi-ARTM: multimodal regularized EM-algorithm EM-algorithm is a simple-iteration method for a system of equations Theorem. The local maximum (Φ, Θ) satisfies the following system of equations with auxiliary variables ptdw = p(t|d, w): ptdw = norm t∈T φwtθtd ; φwt = norm w∈W m nwt + φwt ∂R ∂φwt ; nwt = d∈D λm(w)ndw ptdw ; θtd = norm t∈T ntd + θtd ∂R ∂θtd ; ntd = w∈d λm(w)ndw ptdw ; where norm t∈T xt = max{xt ,0} s∈T max{xs ,0} is nonnegative normalization; m(w) is the modality of the term w, so that w ∈ W m(w). Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 19 / 38
  • 20. Theory BigARTM implementation — http://bigartm.org Experiments Probabilistic Topic Modeling ARTM — Additive Regularization for Topic Modeling Multimodal Probabilistic Topic Modeling Fast online EM-algorithm for Multi-ARTM Input: collection D split into batches Db, b = 1, . . . , B; Output: matrix Φ; 1 initialize φwt for all w ∈ W , t ∈ T; 2 nwt := 0, ˜nwt := 0 for all w ∈ W , t ∈ T; 3 for all batches Db, b = 1, . . . , B 4 iterate each document d ∈ Db at a constant matrix Φ: (˜nwt) := (˜nwt) + ProcessBatch (Db, Φ); 5 if (synchronize) then 6 nwt := nwt + ˜ndw for all w ∈ W , t ∈ T; 7 φwt := norm w∈W m nwt + φwt ∂R ∂φwt for all w ∈W m, m∈M, t ∈T; 8 ˜nwt := 0 for all w ∈ W , t ∈ T; Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 20 / 38
  • 21. Theory BigARTM implementation — http://bigartm.org Experiments Probabilistic Topic Modeling ARTM — Additive Regularization for Topic Modeling Multimodal Probabilistic Topic Modeling Fast online EM-algorithm for Multi-ARTM ProcessBatch iterates documents d ∈ Db at a constant matrix Φ. matrix (˜nwt) := ProcessBatch (set of documents Db, matrix Φ) 1 ˜nwt := 0 for all w ∈ W , t ∈ T; 2 for all d ∈ Db 3 initialize θtd := 1 |T| for all t ∈ T; 4 repeat 5 ptdw := norm t∈T φwtθtd for all w ∈ d, t ∈ T; 6 ntd := w∈d λm(w)ndw ptdw for all t ∈ T; 7 θtd := norm t∈T ntd + θtd ∂R ∂θtd for all t ∈ T; 8 until θd converges; 9 ˜nwt := ˜nwt + λm(w)ndw ptdw for all w ∈ d, t ∈ T; Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 21 / 38
  • 22. Theory BigARTM implementation — http://bigartm.org Experiments Probabilistic Topic Modeling ARTM — Additive Regularization for Topic Modeling Multimodal Probabilistic Topic Modeling ARTM approach: benefits and restrictions Benefits Single EM-algorithm for many models and their combinations PLSA, LDA, and 100s of PTMs are covered by ARTM No complicated inference and graphical models ARTM reduces barriers to entry into PTM research field ARTM encourages any combinations of regularizers Multi-ARTM encourages any combinations of modalities Multi-ARTM is implemented in BigARTM open-source project Under development (not really restrictions): 3-matrix factorization P = ΦΨΘ, e.g. Author-Topic Model Further generalization of hypergraph-based Multi-ARTM Adaptive optimization of regularization coefficients Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 22 / 38
  • 23. Theory BigARTM implementation — http://bigartm.org Experiments BigARTM: parallel architecture BigARTM: time and memory performance How to start using BigARTM The BigARTM project: main features Parallel online Multi-ARTM framework Open-source http://bigartm.org Distributed storage of collection is possible Built-in regularizers: smoothing, sparsing, decorrelation, semi-supervised learning, and many others coming soon Built-in quality measures: perplexity, sparsity, kernel contrast and purity, and many others coming soon Many types of PTMs can be implemented via Multi-ARTM: multilanguage, temporal, hierarchical, multigram, and many others Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 23 / 38
  • 24. Theory BigARTM implementation — http://bigartm.org Experiments BigARTM: parallel architecture BigARTM: time and memory performance How to start using BigARTM The BigARTM project: parallel architecture Concurrent processing of batches Simple single-threaded code for ProcessBatch User controls when to update the model in online algorithm Deterministic (reproducible) results from run to run Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 24 / 38
  • 25. Theory BigARTM implementation — http://bigartm.org Experiments BigARTM: parallel architecture BigARTM: time and memory performance How to start using BigARTM BigARTM vs Gensim vs Vowpal Wabbit 3.7M articles from Wikipedia, 100K unique words procs train inference perplexity BigARTM 1 35 min 72 sec 4000 Gensim.LdaModel 1 369 min 395 sec 4161 VowpalWabbit.LDA 1 73 min 120 sec 4108 BigARTM 4 9 min 20 sec 4061 Gensim.LdaMulticore 4 60 min 222 sec 4111 BigARTM 8 4.5 min 14 sec 4304 Gensim.LdaMulticore 8 57 min 224 sec 4455 procs = number of parallel threads inference = time to infer θd for 100K held-out documents perplexity is calculated on held-out documents. Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 25 / 38
  • 26. Theory BigARTM implementation — http://bigartm.org Experiments BigARTM: parallel architecture BigARTM: time and memory performance How to start using BigARTM Running BigARTM in Parallel 3.7M articles from Wikipedia, 100K unique words Amazon EC2 c3.8xlarge (16 physical cores + hyperthreading) No extra memory cost for adding more threads Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 26 / 38
  • 27. Theory BigARTM implementation — http://bigartm.org Experiments BigARTM: parallel architecture BigARTM: time and memory performance How to start using BigARTM How to start using BigARTM 1 Download links, tutorials, documentation: http://bigartm.org 2 Linux: compile and start examples Windows: start examples Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 27 / 38
  • 28. Theory BigARTM implementation — http://bigartm.org Experiments BigARTM: parallel architecture BigARTM: time and memory performance How to start using BigARTM How to start using BigARTM 1 Download links, tutorials, documentation: http://bigartm.org 2 Linux: compile and start examples Windows: start examples BigARTM community: 1 Post questions in BigARTM discussion group: https://groups.google.com/group/bigartm-users 2 Report bugs in BigARTM issue tracker: https://github.com/bigartm/bigartm/issues 3 Contribute to BigARTM project via pull requests: https://github.com/bigartm/bigartm/pulls Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 28 / 38
  • 29. Theory BigARTM implementation — http://bigartm.org Experiments BigARTM: parallel architecture BigARTM: time and memory performance How to start using BigARTM License and programming environment Freely available for commercial usage (BSD 3-Clause license) Cross-platform — Windows, Linux, Mac OS X (32 bit, 64 bit) Simple command-line API — available now Rich programming API in C++ and Python — available now Rich programming API in C# and Java — coming soon Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 29 / 38
  • 30. Theory BigARTM implementation — http://bigartm.org Experiments ARTM for combining regularizers Multi-ARTM for classification Multi-ARTM for multi-language TM Combining Regularizers: experiment on 3.7M Wikipedia collection Additive combination of 5 regularizers: smoothing background (common lexis) topics B in Φ and Θ sparsing domain-specific topics S = TB in Φ and Θ decorrelation of topics in Φ ΦW ×T ΘT×D Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 30 / 38
  • 31. Theory BigARTM implementation — http://bigartm.org Experiments ARTM for combining regularizers Multi-ARTM for classification Multi-ARTM for multi-language TM Combining Regularizers: experiment on 3.7M Wikipedia collection Additive combination of 5 regularizers: smoothing background (common lexis) topics B in Φ and Θ sparsing domain-specific topics S = TB in Φ and Θ decorrelation of topics in Φ R(Φ, Θ) = + β1 t∈B w∈W βw ln φwt + α1 d∈D t∈B αt ln θtd − β0 t∈S w∈W βw ln φwt − α0 d∈D t∈S αt ln θtd − γ t∈T s∈Tt w∈W φwtφws where β0, α0, β1, α1, γ are regularization coefficients. Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 31 / 38
  • 32. Theory BigARTM implementation — http://bigartm.org Experiments ARTM for combining regularizers Multi-ARTM for classification Multi-ARTM for multi-language TM Combining Regularizers: LDA vs ARTM models P10k , P100k — hold-out perplexity (10K, 100K documents) SΦ, SΘ — sparsity of Φ and Θ matrices (in %) Ks , Kp, Kc — average topic kernel size, purity and contrast Model P10k P100k SΦ SΘ Ks Kp Kc LDA 3436 3801 0.0 0.0 873 0.533 0.507 ARTM 3577 3947 96.3 80.9 1079 0.785 0.731 Convergence of LDA (thin lines) and ARTM (bold lines) 1 · 106 2 · 106 3 · 106 0.34 0.52 0.69 0.87 1.04 1.22 ·104 Perplexity 0 20 40 60 80 100 Sparsity Perplexity Phi Theta 1 · 106 2 · 106 3 · 106 0 0.25 0.5 0.75 1 1.25 ·103 Kernelsize 0 0.2 0.4 0.6 0.8 1 Purityandcontrast Size Purity Contrast Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 32 / 38
  • 33. Theory BigARTM implementation — http://bigartm.org Experiments ARTM for combining regularizers Multi-ARTM for classification Multi-ARTM for multi-language TM EUR-Lex corpus 19 800 documents about European Union law Two modalities: 21K words, 3 250 categories (class labels) EUR-Lex is a “power-law dataset” with unbalanced classes: Left: # unique labels with a given # documents per label Right: # documents with a given # labels Rubin T. N., Chambers A., Smyth P., Steyvers M. Statistical topic models for multi-label document classification // Machine Learning, 2012, 88(1-2). Pp. 157–208. Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 33 / 38
  • 34. Theory BigARTM implementation — http://bigartm.org Experiments ARTM for combining regularizers Multi-ARTM for classification Multi-ARTM for multi-language TM Multi-ARTM for classification Regularizers: Uniform smoothing for Θ Uniform smoothing for word–topic matrix Φ1 Label regularization for class–topic matrix Φ2: R(Φ2 ) = τ c∈W 2 ˆpc ln p(c) → max, where p(c) = t∈T φctp(t) is the model distribution of class c, p(t) = nt n can be easily estimated along EM iterations, ˆpc is the empirical frequency of class c in the training data. Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 34 / 38
  • 35. Theory BigARTM implementation — http://bigartm.org Experiments ARTM for combining regularizers Multi-ARTM for classification Multi-ARTM for multi-language TM The comparative study of models on EUR-Lex classification task DLDA (Dependency LDA) [Rubin 2012] is a nearest analog of Multi-ARTM for classification among Bayesian Topic Models Quality measures [Rubin 2012]: AUC-PR (%, ⇑) — Area under precision-recall curve AUC (%, ⇑) — Area under ROC curve OneErr (%, ⇓) — One error (most ranked label is not relevant) IsErr (%, ⇓) — Is error (no perfect classification) Results: |T|opt AUC-PR AUC OneErr IsErr Multi-ARTM 10 000 51.3 98.0 29.1 95.5 DLDA [Rubin 2012] 200 49.2 98.2 32.0 97.2 SVM 43.5 97.5 31.6 98.1 Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 35 / 38
  • 36. Theory BigARTM implementation — http://bigartm.org Experiments ARTM for combining regularizers Multi-ARTM for classification Multi-ARTM for multi-language TM Multi-language ARTM We consider languages as modalities in Multi-ARTM. Collection of 216 175 Russian–English Wikipedia articles pairs. Top 10 words with p(w|t) probabilities (in %): Topic 68 Topic 79 research 4.56 институт 6.03 goals 4.48 матч 6.02 technology 3.14 университет 3.35 league 3.99 игрок 5.56 engineering 2.63 программа 3.17 club 3.76 сборная 4.51 institute 2.37 учебный 2.75 season 3.49 фк 3.25 science 1.97 технический 2.70 scored 2.72 против 3.20 program 1.60 технология 2.30 cup 2.57 клуб 3.14 education 1.44 научный 1.76 goal 2.48 футболист 2.67 campus 1.43 исследование 1.67 apps 1.74 гол 2.65 management 1.38 наука 1.64 debut 1.69 забивать 2.53 programs 1.36 образование 1.47 match 1.67 команда 2.14 Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 36 / 38
  • 37. Theory BigARTM implementation — http://bigartm.org Experiments ARTM for combining regularizers Multi-ARTM for classification Multi-ARTM for multi-language TM Multi-language ARTM Collection of 216 175 Russian–English Wikipedia articles pairs. Top 10 words with p(w|t) probabilities (in %): Topic 88 Topic 251 opera 7.36 опера 7.82 windows 8.00 windows 6.05 conductor 1.69 оперный 3.13 microsoft 4.03 microsoft 3.76 orchestra 1.14 дирижер 2.82 server 2.93 версия 1.86 wagner 0.97 певец 1.65 software 1.38 приложение 1.86 soprano 0.78 певица 1.51 user 1.03 сервер 1.63 performance 0.78 театр 1.14 security 0.92 server 1.54 mozart 0.74 партия 1.05 mitchell 0.82 программный 1.08 sang 0.70 сопрано 0.97 oracle 0.82 пользователь 1.04 singing 0.69 вагнер 0.90 enterprise 0.78 обеспечение 1.02 operas 0.68 оркестр 0.82 users 0.78 система 0.96 All |T| = 400 topics were reviewed by an independent assessor, and he successfully interpreted 396 topics. Konstantin Vorontsov (voron@yandex-team.ru) BigARTM: Open Source Topic Modeling 37 / 38
  • 38. Conclusions ARTM (Additive Regularization for Topic Modeling) is a general framework, which makes topic models easy to design, to infer, to explain, and to combine. Multi-ARTM is a further generalization of ARTM for multimodal topic modeling BigARTM is an open source project for parallel online topic modeling of large text collections. http://bigartm.org Join BigARTM community!