LDA : latent Dirichlet Allocation (Fairies NLP Series) - Korean Ver.

LDA
Text mining that anyone can follow
LDA hands-on practice and summary
- Fairies NLP Series
*Github: /sanghan1990
*E-mail:sanghan1990@gmail.com

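Data types
Data comes in many types. You might think text mining is simple because the data is just text, but in practice there is a lot to preprocess.
Fairies: This is not deep learning, so you would expect the modeling to run quickly. Besides the time spent on preprocessing, the package cannot use a GPU, so it takes a long time!
Conclusion and insights
Explain concretely what came out of it.
Fairies: With unsupervised learning no category labels are needed. After training, the machine splits the text into as many topics as you ask for, so you can see the topics within the news or text and the main words of each.
Future
I learned a lot from this, so I want to turn to something else. What is there?
Fairies: Whether the documents are large or small, we can think about ways to find the topics the machine reports and explain them quickly and intuitively!
Project progress
Two parts! Open-source data that you can use too, and field data.
Which packages were used, how was the time schedule planned, and what were the trials and errors?
Fairies: Please explain it in detail at the code level!! (Ah.. I was going to get by with just talking about it.)
Glossary (appendix)
We called it text mining, but in the end it was machine learning, so the various related terms should be organized!
Fairies: Explaining terms one at a time often drifts from the main point, so I gathered them in one place.
What is LDA
You should tell us what the Latent in LDA means, what D and A stand for, and what kinds of projects it suits!
Fairies: I wrote it down in my own words, qualitatively and in a way that is easy to understand.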

The word "latent" is used because a topic is an abstract entity.

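Latent Dirichlet Allocation
• NLP – LDA (What is NLP? See the glossary)
• Known in Korean as 잠재 디리클레 할당 (latent Dirichlet allocation)
• A method that groups (clusters) documents using the topics latent in them
• Uses latent variables.
• Unsupervised learning
• The Dirichlet distribution used in LDA assigns probability only to topic vectors whose elements are all positive and sum to 1.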
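To make the last bullet concrete, here is a minimal NumPy sketch that draws a few topic-proportion vectors from a Dirichlet distribution. The number of topics (3) and the concentration value (0.5) are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Symmetric concentration parameter for k = 3 topics (illustrative value).
alpha = np.full(3, 0.5)

# Draw 5 topic-proportion vectors; each row is one possible document.
theta = rng.dirichlet(alpha, size=5)

print(theta.round(3))
print(theta.sum(axis=1))  # every row sums to 1, up to floating point error
```

Each row can be read as "what fraction of this document belongs to each topic".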

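Use cases
• Classifying age group, gender, and region from informal documents such as tweets.
After selecting features and classifying with the probability distributions
predicted by LDA, the study obtained acceptable prediction accuracy:
72% for age, 75% for gender, and 43% for region.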

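Use cases
• LDA was applied in order to handle data that covers a broad range of topics
and is large in volume.
• A study analyzed Twitter data with LDA to identify when topics change
and what patterns they follow.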

Use cases (negative)
LDA is a poor method made popular by the marketing genius
of some academics who have built their careers on it. It
entirely ignores complicated and important aspects of
linguistics to describe a rather unbelievable generative
process of text that usually doesn’t yield anything meaningful,
surprising, and insightful. If you have ever used it, you will
know what I am talking about.

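Learning Process
1. Specify the number of topics k that you want to find.
2. Randomly assign each word to one of the k topics.
3. For each document d and each word w in it, compute
p(topic t | document d) and p(word w | topic t).
4. Choose a new topic t for the word according to
p(topic t | document d) * p(word w | topic t).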
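The reassignment loop above can be sketched as a toy collapsed Gibbs sampler. This is a teaching sketch under stated assumptions, not how scikit-learn trains LDA (scikit-learn uses variational Bayes). The function name `gibbs_lda`, the smoothing values `alpha` and `beta`, and the tiny corpus of word ids are all hypothetical.

```python
import numpy as np

def gibbs_lda(docs, n_topics, vocab_size, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler; docs is a list of lists of word ids."""
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), n_topics))    # words in doc d assigned to topic t
    topic_word = np.zeros((n_topics, vocab_size))  # times word w is assigned to topic t
    topic_total = np.zeros(n_topics)               # total words assigned to topic t

    # Step 2: give every word a random topic and record the counts.
    assignments = []
    for d, doc in enumerate(docs):
        z = rng.integers(n_topics, size=len(doc))
        assignments.append(z)
        for w, t in zip(doc, z):
            doc_topic[d, t] += 1
            topic_word[t, w] += 1
            topic_total[t] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = assignments[d][i]
                # Take the current word out of the counts.
                doc_topic[d, t] -= 1
                topic_word[t, w] -= 1
                topic_total[t] -= 1
                # Step 3: smoothed, unnormalized p(topic | document) and p(word | topic).
                p_topic_doc = doc_topic[d] + alpha
                p_word_topic = (topic_word[:, w] + beta) / (topic_total + vocab_size * beta)
                # Step 4: sample a new topic in proportion to the product.
                p = p_topic_doc * p_word_topic
                t = rng.choice(n_topics, p=p / p.sum())
                assignments[d][i] = t
                doc_topic[d, t] += 1
                topic_word[t, w] += 1
                topic_total[t] += 1
    return doc_topic, topic_word

# Four tiny documents over a six-word vocabulary (word ids 0..5).
docs = [[0, 1, 0, 2], [3, 4, 3, 5], [0, 2, 1, 1], [4, 5, 3, 3]]
doc_topic, topic_word = gibbs_lda(docs, n_topics=2, vocab_size=6)
print(doc_topic)
```

Each row of `doc_topic` counts how many words of that document ended up in each topic, so the rows always sum to the document lengths.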

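Put simply, in terms of formulas,
a document is just being expressed in terms of topics
1. Poisson distribution: choose N ~ Poisson(ξ) to decide the N words
that go into the document.
2. Dirichlet distribution: choose theta ~ Dir(alpha) to decide the mix of topics
from the set of k topics, e.g. 1/3 goes to topic 1 and 1/3 to topic 2.
3. Generate each word from the multinomial word distribution of its topic.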
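The three steps can be simulated directly with NumPy. This is a rough sketch under assumptions: the six-word vocabulary, k = 3, the prior values, and the helper name `generate_document` are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["game", "player", "movie", "actor", "stock", "price"]  # toy vocabulary
k = 3
alpha = np.full(k, 0.5)                            # Dirichlet prior on topic proportions
beta = rng.dirichlet(np.ones(len(vocab)), size=k)  # one word distribution per topic

def generate_document(xi=8):
    n_words = rng.poisson(xi)                      # 1. N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)                   # 2. theta ~ Dir(alpha)
    words = []
    for _ in range(n_words):
        z = rng.choice(k, p=theta)                 # pick a topic according to theta
        w = rng.choice(len(vocab), p=beta[z])      # 3. pick a word from that topic
        words.append(vocab[w])
    return theta, words

theta, words = generate_document()
print(theta.round(2), words)
```

Training LDA runs this story in reverse: given only the words, it estimates `theta` for each document and `beta` for each topic.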

Sklearn – LDA()
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
learning_method : 'batch' | 'online', default='online'
Method used to update `components_`. Only used in the fit method. In general, if the data size is large, the online update will be
much faster than the batch update. The default learning method is going to be changed to 'batch' in the 0.20 release.
(This text comes from the documentation at the time of the slides; in scikit-learn 0.20 and later the default is 'batch'.)
Valid options:
'batch': Batch variational Bayes method. Use all training data in each EM update.
Old `components_` will be overwritten in each iteration.
'online': Online variational Bayes method. In each EM update, use a mini-batch of training data
to update the `components_` variable incrementally. The learning rate is controlled by the
`learning_decay` and the `learning_offset` parameters.
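As a small illustration of the two options, the sketch below fits the same toy corpus with both learning methods. The four sentences, the 2-topic setting, and `batch_size=2` are assumptions made for the example. It also assumes a scikit-learn release that uses the `n_components` parameter name.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical); real corpora are far larger.
docs = [
    "the game was fun and the players were great",
    "players love this game and its graphics",
    "the movie had a great actor and a dull plot",
    "this actor made the movie worth watching",
]
X = CountVectorizer().fit_transform(docs)

# 'batch': every EM update uses all training documents.
batch_lda = LatentDirichletAllocation(
    n_components=2, learning_method="batch", random_state=0
).fit(X)

# 'online': every EM update uses a mini-batch; learning_decay and
# learning_offset control the learning rate.
online_lda = LatentDirichletAllocation(
    n_components=2, learning_method="online",
    learning_decay=0.7, learning_offset=10.0,
    batch_size=2, random_state=0,
).fit(X)

print(batch_lda.components_.shape, online_lda.components_.shape)
```

Both models expose the same `components_` array (topics by vocabulary terms); only the update schedule differs.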

Project 1) Movie review dataset (aclImdb)
1. Data preprocessing (omitted)
2. Bag of words, stop words
3. LDA
4. Print out & sort topics
Packages used: sklearn (LatentDirichletAllocation), NumPy, mglearn (print_topics)

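2. Bag of words, stop words
• Remove words that appear in more than 15% of the documents (max_df=.15)
• Build a bag-of-words model over the 10,000 most frequent words (max_features=10000)
from sklearn.feature_extraction.text import CountVectorizer
# text_train: the list of raw review strings from the aclImdb training set
vect = CountVectorizer(max_features=10000, max_df=.15)
X = vect.fit_transform(text_train)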
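To see what these two parameters do, here is a minimal sketch on a four-sentence toy corpus. The sentences and the thresholds (`max_df=0.5`, `max_features=5`) are assumptions chosen so the effect is visible on such a small input.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the movie was great",
    "the movie was boring",
    "the plot was thin",
    "the acting was superb",
]

# max_df=0.5 drops terms found in more than 50% of the documents
# (the slide uses 0.15 on a much larger corpus);
# max_features=5 keeps only the 5 most frequent remaining terms.
vect = CountVectorizer(max_df=0.5, max_features=5)
X = vect.fit_transform(docs)

print(sorted(vect.vocabulary_))
print(X.shape)
```

"the" and "was" occur in every document, so they are removed as corpus-specific stop words, while "movie" (in exactly half of the documents) is kept.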

3. LDA (parameters from sklearn)
• We build the model and transform the data in one step.
• Computing the transform takes some time, and we can save time by doing both at once.
• n_components = 10 (named n_topics in older scikit-learn releases): train the topic
model with 10 topics. Changing the number of topics changes all of the topics.
• Use the batch learning method.
• max_iter: default 10; increase max_iter for better model performance.
# LDA
from sklearn.decomposition import LatentDirichletAllocation
# n_topics was renamed to n_components in scikit-learn 0.19
lda = LatentDirichletAllocation(n_components=10, learning_method="batch",
                                max_iter=25, random_state=0)
document_topics = lda.fit_transform(X)

10 topics over the 10,000 words

import numpy as np
# For each topic (a row in components_), sort the features in ascending order.
# Invert rows with [:, ::-1] to make the sorting descending.
sorting = np.argsort(lda.components_, axis=1)[:, ::-1]
# Get the feature names from the vectorizer
# (get_feature_names() was replaced by get_feature_names_out() in newer scikit-learn):
feature_names = np.array(vect.get_feature_names_out())

mglearn package
Prints the 10 topics.
(topics per chunk)

Looking at the results, Topic (N):
0: action
1: provocative
2: historical
3: TV series
4: general words
5: film terms (weak)
6: horror
7: film terms (weak)
8: family
9: film terms (weak)

Project 2) 54-news
Analysis framework
1. Data type
2. Parameters by package
3. Preprocessing
Packages used: sklearn(LatentDirichletAlloca

Dataset: one of the 54 categories: Game
200 sentences
150~200 words per sentence
35,181 words in total
Figure: one sentence (192 words)

2. Parameters by package - Parameter
1) Vectorizer: TF-IDF / CountVectorizer
• Param: max_iter / no. of features
2) Korean BOW packages (extract nouns):
• 1) Kkma
• 2) Hannanum
• 3) Twitter
3) LDA: Main
• 1) Learning method
• 2) n_components: no. of topics

3. Preprocessing
Count vectorization
TF-IDF
Hannanum package
Twitter package
Kkma package

Batch_method_counter(max_feature=100)_kkma_topic5

When only a small number of words is extracted, numeric tokens appear first in the sorted output.

Batch_method_tfidf_twitter_topic5

Only 100 of the dataset's 35,181 words are extracted with CountVectorizer.

Batch_method_counter(max_feature=100)_twitter_topic5

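• TextRank, which has recently remained popular in the topic modeling field,
has many implementations in Python.
• We are looking for R source code built on Korean data, morphological
analyzers, and so on.
• LDA appears to have been studied in more detail in R.
• R has seen more LDA work, but work on Korean datasets is rare.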

LDAGibbs 5 TopicProbabilities (30corpus)

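Ideas
Split the articles into 10 groups.
Review the articles one by one (using company resources).
Feeding in different articles worked well.
When a new article is fed in, the model indicates which topic it belongs to.
This can serve as a measure of accuracy.
Consider how to evaluate performance.
Explain the results clearly and persuasively.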
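The idea of feeding a new article to a trained model can be sketched with `transform`. This is only an illustration under assumptions: the training sentences, the new article, and the 2-topic setting are invented, and reading the most probable topic as "the" topic is one simple choice among several.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy training articles (hypothetical).
train_docs = [
    "new game release players online levels",
    "players enjoy game graphics and levels",
    "stock market price falls as investors sell",
    "investors watch stock price and market news",
]

vect = CountVectorizer(stop_words="english")
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(vect.fit_transform(train_docs))

# A new, unseen article: reuse the fitted vectorizer, then infer its topic mix.
new_article = ["the new game adds more players and levels"]
topic_probs = lda.transform(vect.transform(new_article))[0]

print(topic_probs.round(3), "-> topic", int(np.argmax(topic_probs)))
```

Comparing these assignments with a human reading of each article is one way to turn the idea into an accuracy figure.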

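Insights
- High-frequency words are close to stop words.
- We can find out whether some topics look negative and whether some consist
of words full of praise.
- It is interesting that the comments are opinions on a specific movie, or are
written to justify or emphasize a rating.
- A topic model such as LDA helps interpret a text corpus whether or not
labels are available.
- LDA is a probabilistic algorithm, so changing random_state changes the
result. Take care.
- Results should be judged conservatively: because this is unsupervised learning,
it is not good to draw conclusions without directly reading the documents in question.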

LDA Overall Process by CODE
You won’t visit my github

Data Preprocessing – Extract Noun

Best LDA Model from Grid search

(View)Grid-search_compareLDAmodel
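The slides show the grid search only as screenshots, so the following is a guess at its general shape rather than the author's actual code. It searches over the number of topics with `GridSearchCV`, which by default ranks models by `LatentDirichletAllocation.score` (an approximate log-likelihood). The corpus, the candidate values, and `cv=2` are assumptions.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

# Toy corpus (hypothetical).
docs = [
    "zombie horror night blood scream",
    "horror ghost scream dark night",
    "family kids disney animation fun",
    "kids family fun song animation",
    "blood zombie dark ghost horror",
    "disney song family kids fun",
]
X = CountVectorizer(stop_words="english").fit_transform(docs)

search = GridSearchCV(
    LatentDirichletAllocation(learning_method="batch", random_state=0),
    param_grid={"n_components": [2, 3]},  # candidate numbers of topics
    cv=2,
)
search.fit(X)

print(search.best_params_)
best_lda = search.best_estimator_
```

A higher held-out log-likelihood does not guarantee more interpretable topics, so the chosen model is still worth inspecting by eye.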

LDA-quickmodel3-mglearn-5topics

LDA-quickmodel-log-likelihood,perplexity
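The two metrics named in this slide title are available directly on a fitted scikit-learn model. A minimal sketch, assuming a toy corpus and a 2-topic model rather than the author's data:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical).
docs = [
    "new game release players online levels",
    "players enjoy game graphics and levels",
    "stock market price falls as investors sell",
    "investors watch stock price and market news",
]
X = CountVectorizer(stop_words="english").fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

log_likelihood = lda.score(X)   # approximate log-likelihood; higher is better
perplexity = lda.perplexity(X)  # lower is better

print("log-likelihood:", log_likelihood)
print("perplexity:", perplexity)
```

Both numbers are most useful for comparing models on the same data, ideally on documents held out from training.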

LDA_topic100 – Vectorizer to LDA

Conclusion
In fact, it is such a non-interpretable model that there are papers
written in how to analyze and interpret its output. One such paper
compared analyzing the resultant topics from LDA to reading tea
leaves!
A senior tenured researcher I know performed LDA on text and
then used the learned topics to assign each document to a single
topic, effectively using the LDA mixed-membership model for
publishability when all he eventually wanted was a clustering
model of the text documents.
Use cases of LDA are littered with such incidents where people
used LDA or its variants to seem knowledgeable and smart when
something simpler was far more suitable for their research task.

Conclusion
• Finding latent topics in a large corpus of
documents
• Massive automatic movies indexation from
subtitles.
• Topical stock quote motions.
• Modeling musical influence
• Behavior mining of Internet users.
