This document provides an overview of hierarchical topic modeling. It begins with background on text summarization and topic modeling. Some key concepts in topic modeling like latent semantic analysis and probabilistic latent semantic indexing (PLSI) are introduced. Popular topic models like latent Dirichlet allocation (LDA) and hierarchical topic models using the Chinese restaurant process are described. Gibbs sampling is discussed as a method for parameter estimation in topic models. The document concludes with examples of hierarchical topic modeling and information on the author's related work.
This is an introduction to topic modeling, covering tf-idf, LSA, pLSA, LDA, EM, and some related material. There are surely some mistakes; please correct them with your wisdom. Thank you~
1. (Hierarchical) Topic Modeling
Yueshen Xu (lecturer)
ysxu@xidian.edu.cn / xuyueshen@163.com
Data and Knowledge Engineering Research Center
Xidian University
Text Mining & NLP & ML
2. Software Engineering, 2016/12/29
Outline
Background
Some Concepts
Topic Modeling
Probabilistic Latent Semantic Indexing (PLSI)
Latent Dirichlet Allocation (LDA)
Hierarchical Topic Modeling
Chinese Restaurant Process (CRP)
What I do
Supplement & Reference
Keywords: topic modeling, hierarchical topic modeling, probabilistic graphical model, Bayesian model
Basics, not state-of-the-art
4. Background
Text Summarization
Document Summarization
What do these docs (or this doc) talk about?
Review Summarization
What do these consumers care about or complain about?
Short Text/Tweets Summarization
What are people discussing?
Basic requirement: automatic, applicable, explainable → Topic Modeling
5. Some Concepts
Latent Semantic Analysis (LSA) and topic modeling draw on many neighboring fields: Text Mining, Data Mining, Natural Language Processing, Computational Linguistics, Machine Translation, Information Retrieval, Dimension Reduction, and Machine Learning.
[Venn diagram: these fields overlap, with LSA / Topic Model at the intersection]
Topic modeling: to learn the latent topics from a corpus/document
6. Topic Modeling
An example in Chinese, from my doctoral thesis (translated below):
Doc1: Continue to implement a prudent monetary policy; keep it appropriately flexible with timely pre-adjustments and fine-tuning; coordinate with supply-side structural reform; and make comprehensive use of quantity-based, price-based, and other monetary policy tools.

Doc2: In terms of personnel, this reform goes far beyond troop reduction in scale; it is a structural reform and a key step in modernizing the military's organizational structure.

Doc3: The dollar's status as the leading international currency remains irreplaceable for the foreseeable future; the only way forward is to push global governance in a more balanced direction. IMF Managing Director Lagarde, in a recent speech at the University of Maryland, called on international governance reform to recognize the growing importance of emerging economies.

Doc4: After being "weaned" from their parent universities, independent colleges may face growing pains in branding and enrollment, but after national and provincial guidelines encouraging private capital to enter education were issued, some independent colleges decisively cut the "umbilical cord" to their parent universities and struck out on their own.
Corpus = {Doc1, Doc2, Doc3, Doc4}
8. Topic Modeling
A topic is a word cluster: a group of words, clustered not randomly but meaningfully (not necessarily semantically)
Models
Parametric models
Latent Semantic Indexing (LSI)
PLSI; Latent Dirichlet Allocation (LDA)
Non-parametric models (Dirichlet Process)
(Nested) Chinese Restaurant Process
Indian Buffet Process
Pitman-Yor Process
9. Topic Modeling
pLSI Model
[Graphical model: documents d1 … dM connect to latent topics z1 … zK, which connect to words w1 … wN, via the probabilities p(d), p(z|d), p(w|z)]
Assumptions:
Pairs (d, w) are assumed to be generated independently
Conditioned on z, w is generated independently of d
Words in a document are exchangeable
Documents are exchangeable
Latent topics z are independent
The generative process:
p(d, w) = p(d) p(w|d) = p(d) Σ_{z∈Z} p(w, z|d) = p(d) Σ_{z∈Z} p(w|z) p(z|d)
Here p(z|d) and p(w|z) are both multinomial distributions; the structure resembles one layer of a 'deep neural network'.
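The factorization above can be sketched numerically. The matrices below are made-up toy values (2 documents, 2 topics, 3 words), not learned parameters:

```python
import numpy as np

# Toy pLSI factorization: p(d, w) = p(d) * sum_z p(w|z) p(z|d)
p_d = np.array([0.5, 0.5])              # p(d) over 2 documents
p_z_given_d = np.array([[0.8, 0.2],     # p(z|d): one row per document,
                        [0.3, 0.7]])    # one column per latent topic
p_w_given_z = np.array([[0.6, 0.1],     # p(w|z): one row per word,
                        [0.3, 0.2],     # one column per topic
                        [0.1, 0.7]])

# Joint p(d, w): a documents-by-words matrix that sums to 1
p_dw = p_d[:, None] * (p_z_given_d @ p_w_given_z.T)
```

Training pLSI amounts to fitting `p_z_given_d` and `p_w_given_z` (typically with EM) so that `p_dw` matches the observed document-word counts.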
10. Topic Modeling
Latent Dirichlet Allocation (LDA)
David M. Blei, Andrew Y. Ng, Michael I. Jordan
Hierarchical Bayesian model; Bayesian pLSI
[Plate diagram: α → θ → z → w ← β; the inner plate repeats over the N words of a document, the outer plate over the M documents]
Generative process of LDA:
Choose N ~ Poisson(ξ)
For each document d = {w1, w2, …, wN}:
  Choose θ ~ Dir(α)
  For each of the N words wn in d:
  a) Choose a topic zn ~ Multinomial(θ)
  b) Choose a word wn from p(wn | zn, β), a multinomial distribution conditioned on zn
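The generative process above can be sketched directly. The vocabulary size, topic count, and hyperparameter values below are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

V, K = 6, 2                    # vocabulary size and topic count (assumed)
alpha, eta, xi = 0.5, 0.5, 10  # hyperparameter values (assumed)
beta = rng.dirichlet([eta] * V, size=K)   # per-topic word distributions

def generate_document():
    n = max(1, rng.poisson(xi))           # N ~ Poisson(xi)
    theta = rng.dirichlet([alpha] * K)    # theta ~ Dir(alpha)
    doc = []
    for _ in range(n):
        z = rng.choice(K, p=theta)        # z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])      # w_n ~ p(w_n | z_n, beta)
        doc.append(int(w))
    return doc

doc = generate_document()
```

Inference inverts this process: given only the words, it recovers θ for each document and β for each topic.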
11. Topic Modeling
Parameter Estimation
Variational Inference (+EM): complex, rarely used
'I want to know a distribution, but I don't know it yet, so I find a similar distribution (a tight upper or lower bound)'
Measured by K-L divergence (information gain)
Gibbs Sampling (MCMC, Markov Chain Monte Carlo)
'I want to know a distribution, but I don't know it yet, so I find a way to generate its samples'
About 300 lines of code for LDA; not complex, but solid
The chain converges to the stationary distribution: lim_{n→∞} π0 P^n = π for any initial distribution π0, where π = {π(1), π(2), …, π(j), …, π(|S|)} satisfies π = πP (equivalently, every row of lim_{n→∞} P^n equals π)
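The limit above can be checked numerically for a small ergodic chain. The transition matrix P here is an arbitrary illustrative example:

```python
import numpy as np

# An arbitrary ergodic 2-state transition matrix (illustrative example)
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
pi = np.array([1.0, 0.0])   # an arbitrary starting distribution pi_0

for _ in range(200):        # compute pi_0 P^n for large n
    pi = pi @ P

# The limit is the stationary distribution: pi = pi P
assert np.allclose(pi, pi @ P)
assert np.allclose(pi, [0.8, 0.2])  # analytic solution of pi = pi P with sum(pi) = 1
```

Gibbs sampling for LDA exploits exactly this: the sampler is a Markov chain whose stationary distribution is the posterior over topic assignments.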
13. Hierarchical Topic Modeling
Chinese Restaurant Process (Dirichlet Process)
A restaurant has an infinite number of tables, and customers (words) enter sequentially. The i-th customer (θi) sits at an occupied table ϕk with probability nk / (i − 1 + γ), where nk is the number of customers already at that table, or opens a new table with probability γ / (i − 1 + γ).
Each table ϕk is a cluster. Clustering is roughly half of unsupervised learning: clustering itself, topic modeling (two-layer clustering), hierarchical concept building, collaborative filtering, similarity computation, …
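The seating rule can be simulated in a few lines. The function name and the choice of 100 customers with γ = 1.0 are illustrative:

```python
import random

def crp_seating(n_customers, gamma, seed=0):
    """Simulate CRP seating: customer i joins occupied table k with
    probability n_k / (i - 1 + gamma), or opens a new table with
    probability gamma / (i - 1 + gamma)."""
    rng = random.Random(seed)
    tables = []                              # tables[k] = customers at table k
    for i in range(1, n_customers + 1):
        r = rng.uniform(0, i - 1 + gamma)
        acc = 0.0
        for k, n_k in enumerate(tables):
            acc += n_k
            if r < acc:
                tables[k] += 1               # join an existing table
                break
        else:
            tables.append(1)                 # open a new table
    return tables

tables = crp_seating(100, gamma=1.0)
```

The rich-get-richer effect is visible in the result: a few large tables and a long tail of small ones, with the number of tables growing only logarithmically in the number of customers.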
14. Hierarchical Topic Modeling
The generative process (nested CRP); focus on the insight:
1. Let c1 be the root restaurant (only one table)
2. For each level l ∈ {2, …, L}: draw a table from restaurant c(l−1) using the CRP, and set cl to be the restaurant referred to by that table
3. Draw an L-dimensional topic proportion vector θ ~ Dir(α)
4. For each word wn: draw z ∈ {1, …, L} ~ Mult(θ), then draw wn from the topic associated with restaurant cz
[Plate diagram (hLDA): a path of restaurants c1, c2, …, cL drawn with concentration γ; topic proportions θ from α; assignment zm,n and word wm,n for each of the N words in each of the M documents; topics βk. The nesting is like a Matryoshka (Russian) doll]
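Steps 1–2 above (drawing a path through nested CRPs) can be sketched as follows; the dict-of-counts tree representation and function name are illustrative assumptions:

```python
import random

def ncrp_path(tree, gamma, L, rng):
    """Draw one root-to-leaf path of depth L through nested CRPs.
    tree maps a path tuple to the per-child customer counts of that
    restaurant; it is updated in place as customers arrive."""
    path = ()                                    # level 1: the root restaurant
    for _ in range(2, L + 1):                    # levels 2..L
        counts = tree.setdefault(path, [])
        r = rng.uniform(0, sum(counts) + gamma)
        acc = 0.0
        for k, n_k in enumerate(counts):
            acc += n_k
            if r < acc:
                counts[k] += 1                   # follow an existing table
                path += (k,)
                break
        else:
            counts.append(1)                     # open a new child restaurant
            path += (len(counts) - 1,)
    return path

rng = random.Random(0)
tree = {}
paths = [ncrp_path(tree, gamma=1.0, L=3, rng=rng) for _ in range(50)]
```

Documents that share path prefixes share the more general topics near the root, which is what produces the learned topic hierarchy.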
15. Hierarchical Topic Modeling
Examples
[Example topic hierarchy (word clusters learned from a chemistry corpus):
root: analysis, obtain, base, system, concentration
thermal, polymer, acid, property, diamine
activity, compound, acid, derivative, active
compound, ligand, group, investigate, synergistic
reaction, derivative, yield, synthesis, microwave
assay, food, quality, content, analysis
decoction, component, radix, quality, constituent
compound, activity, synthesize, salt, derivative
antioxidant, activity, extract, inhibitory, flavonoid
interaction, cation, metal, energy, solution]
16. What I do
Topic-specific opinion mining
Goal: automatically learn which groups of aspects people like or dislike, how they like them, and why
Methods: topic models (LDA), Dirichlet processes, Gibbs sampling, etc.
Collaborative recommendation
Goal: automatically learn which groups of products people like or dislike, how they like them, and why
Methods: matrix factorization, gradient descent, regularization norms, etc.
Common basics: Bayesian inference (MLE, MAP, PGM)
17. Supplement
Some supplements
Probabilistic Graphical Models: Bayesian networks drawn with plates and circles
Generative model vs. discriminative model, for the posterior p(θ|X) with data X:
Generative model: p(θ|X) ∝ p(X|θ)p(θ)
- Naïve Bayes, GMM, pLSA, LDA, HMM, HDP, …: typically unsupervised learning
Discriminative model: p(θ|X) modeled directly
- LR, KNN, SVM, Boosting, Decision Tree: supervised learning
These can also be represented by graphical models
18. Reference
My previous tutorials/notes (written at ZJU/UIC/Netease/ITRZJU as a Ph.D. student)
‘Topic modeling (an introduction)’
‘Non-parametric Bayesian learning in discrete data’
‘The research of topic modeling in text mining’
‘Matrix factorization with user generated content’
…, etc
Website
You can download all slides of mine
http://web.xidian.edu.cn/ysxu/teach.html
http://liu.cs.uic.edu/yueshenxu/
http://www.slideshare.net/obamaxys2011
https://www.researchgate.net/profile/Yueshen_Xu
19. Reference
• David M. Blei, et al. Latent Dirichlet Allocation. JMLR, 2003
• Yee Whye Teh. Dirichlet Processes: Tutorial and Practical Course, 2007
• Yee Whye Teh, Michael I. Jordan, et al. Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 2006
• David M. Blei. Probabilistic Topic Models. Communications of the ACM, 2012
• David M. Blei, et al. The Nested Chinese Restaurant Process and Bayesian Nonparametric Inference of Topic Hierarchies. Journal of the ACM, 2010
• Gregor Heinrich. Parameter Estimation for Text Analysis, 2008
• T. S. Ferguson. A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics, 1973
• Martin J. Wainwright and Michael I. Jordan. Graphical Models, Exponential Families, and Variational Inference
• Rick Durrett. Probability: Theory and Examples, 2010
• Christopher Bishop. Pattern Recognition and Machine Learning, 2007
• Vasilis Vryniotis. DatumBox: The Dirichlet Process Mixture Model, 2014