Distributed Representations of Sentences and Documents
(2014) Quoc Le, Tomas Mikolov
Presenter: Keunbong Kwak (곽근봉)
Why I chose this paper
• The cold start problem in recommendation engines
• Content tagging
• Document / content similarity
• Content embedding
References
Lucy Park's PyCon Korea talk
https://www.lucypark.kr/docs/2015-pyconkr/#1
Ratsgo's blog
https://ratsgo.github.io/natural%20language%20processing/2017/03/08/word2vec/
PyData 2017 talk
https://www.youtube.com/watch?v=zFScws0mb7M
Overview
Doc2Vec, a follow-up to Word2Vec
• Proposes a new embedding method for sentence analysis
• Reuses the ideas behind Word2Vec
• Two variants: PV-DM & PV-DBOW
Problem definition: what this paper set out to solve
• An embedding that captures the characteristics of sentences, paragraphs, and documents
• Better classification performance through that embedding
Word2Vec
In deep-learning-based NLP: train word vectors from the positions where words appear (skip-gram).
[Figure: skip-gram diagram. The center word at position t predicts each context word within an m-word window, i.e. P(w_{t-2}|w_t), P(w_{t-1}|w_t), P(w_{t+1}|w_t), P(w_{t+2}|w_t), illustrated on the example sentence "저는 … 관심이 많습니다".]
Training result: semantic and syntactic information about words is captured!
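To make the skip-gram setup concrete, here is a minimal sketch (my own illustration, not code from the slides) of how (center, context) training pairs are generated from a tokenized sentence; the window size m = 2 and whitespace tokenization are assumptions.

```python
# Minimal sketch: generating skip-gram (center, context) training pairs.
# Window size m and whitespace tokenization are illustrative assumptions.

def skipgram_pairs(tokens, m=2):
    """Yield (center, context) pairs within an m-word window."""
    pairs = []
    for t, center in enumerate(tokens):
        for offset in range(-m, m + 1):
            c = t + offset
            if offset != 0 and 0 <= c < len(tokens):
                pairs.append((center, tokens[c]))
    return pairs

sentence = "나는 배가 고파서 밥을 먹었다".split()
for center, context in skipgram_pairs(sentence, m=2):
    print(f"P({context} | {center})")  # training maximizes these probabilities
```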
Model: PV-DM (Distributed Memory)
[Figure: PV-DM architecture. The paragraph vector is combined with the context word vectors to predict the next word.]
Model: PV-DM (example)
"나는 배가 고파서 밥을 먹었다" ("I was hungry, so I ate.")

Paragraph Dictionary
ID | Paragraph
1  | 나는 배가 고파서 밥을 먹었다

Word Dictionary
ID | Word
1  | 나는
2  | 배가
3  | 고파서
4  | 밥을
5  | 먹었다
Model: PV-DM (example)
"나는 배가 고파서 밥을 먹었다"

Paragraph Embedding
ID | Paragraph Embedding
1  | [0.5, 0.41, 0.55 …]

Word Embedding
ID | Word Embedding
1  | [0.2, 0.11, 0.55 …]
2  | [0.9, 0.41, 0.75 …]
3  | [0.4, 0.15, 0.53 …]
4  | [0.3, 0.78, 0.48 …]
5  | [0.6, 0.23, 0.12 …]
Model: PV-DM (example)
"나는 배가 고파서 밥을 먹었다"

Step | Input                                              | Label
1    | [나는 배가 고파서 밥을 먹었다, 나는, 배가, 고파서]  | 밥을
2    | [나는 배가 고파서 밥을 먹었다, 배가, 고파서, 밥을]  | 먹었다
Model: PV-DM (example)
"나는 배가 고파서 밥을 먹었다"

Step | Input            | Label
1    | [d1, w1, w2, w3] | w4
2    | [d1, w2, w3, w4] | w5
Model: PV-DM (example)
"나는 배가 고파서 밥을 먹었다"

Step | Input                                                                                 | Label
1    | [[0.5, 0.41, 0.55 …], [0.2, 0.11, 0.55 …], [0.9, 0.41, 0.75 …], [0.4, 0.15, 0.53 …]]  | [0,0,0,0,1,0]
2    | [[0.5, 0.41, 0.55 …], [0.9, 0.41, 0.75 …], [0.4, 0.15, 0.53 …], [0.3, 0.78, 0.48 …]]  | [0,0,0,0,0,1]
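Below is a minimal sketch (my own illustration, not the slides' code) of a single PV-DM training step on the toy example: the paragraph vector is concatenated with the three context word vectors, a softmax over the vocabulary predicts the next word, and the paragraph vector, word vectors, and softmax weights are all updated. The dimensions, learning rate, and use of a full softmax are simplifying assumptions (the paper uses hierarchical softmax for speed).

```python
import numpy as np

# PV-DM sketch on the toy example. Indices are 0-based here:
# 나는=0, 배가=1, 고파서=2, 밥을=3, 먹었다=4 (the tables above use 1-based IDs).
rng = np.random.default_rng(0)
dim, vocab = 4, 5
D = rng.normal(size=(1, dim))           # paragraph embedding matrix (1 paragraph)
W = rng.normal(size=(vocab, dim))       # word embedding matrix
U = rng.normal(size=(4 * dim, vocab))   # softmax weights over [d; w_a; w_b; w_c]

def train_step(doc_id, context_ids, target_id, lr=0.1):
    """One PV-DM step: predict the next word from paragraph + context vectors."""
    x = np.concatenate([D[doc_id]] + [W[i] for i in context_ids])  # input layer
    logits = x @ U
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                  # softmax over the vocabulary
    grad_logits = probs.copy()
    grad_logits[target_id] -= 1.0         # d(cross-entropy) / d(logits)
    grad_x = U @ grad_logits              # gradient w.r.t. the input layer
    U[:] -= lr * np.outer(x, grad_logits) # update softmax weights
    D[doc_id] -= lr * grad_x[:dim]        # update the paragraph vector
    for k, i in enumerate(context_ids):   # update the context word vectors
        W[i] -= lr * grad_x[(k + 1) * dim:(k + 2) * dim]

train_step(0, [0, 1, 2], 3)  # Step 1: [d1, 나는, 배가, 고파서] -> 밥을
train_step(0, [1, 2, 3], 4)  # Step 2: [d1, 배가, 고파서, 밥을] -> 먹었다
```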
PV-DM Inference Stage
At inference time, the word vectors and softmax weights learned during training are frozen; an unseen paragraph gets a fresh paragraph vector, which is learned by gradient descent on the same next-word prediction task.
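At inference only the new paragraph's vector is trained; a simplified sketch under the same assumptions as above:

```python
import numpy as np

# PV-DM inference sketch: W (word vectors) and U (softmax weights) are frozen;
# only the new paragraph vector d is updated. Dimensions are assumptions.
rng = np.random.default_rng(1)
dim, vocab = 4, 5
W = rng.normal(size=(vocab, dim))        # frozen word embeddings (from training)
U = rng.normal(size=(4 * dim, vocab))    # frozen softmax weights (from training)
d = rng.normal(size=dim)                 # fresh vector for the unseen paragraph

def infer_step(context_ids, target_id, lr=0.1):
    x = np.concatenate([d] + [W[i] for i in context_ids])
    logits = x @ U
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    probs[target_id] -= 1.0               # gradient of cross-entropy w.r.t. logits
    d[:] -= lr * (U @ probs)[:dim]        # gradient step on d only

for _ in range(10):                       # a few passes over the paragraph's windows
    infer_step([0, 1, 2], 3)              # [w1, w2, w3] -> w4
    infer_step([1, 2, 3], 4)              # [w2, w3, w4] -> w5
print(d)                                  # the inferred paragraph vector
```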
Model: PV-DBOW (Distributed Bag of Words)
[Figure: PV-DBOW architecture. The paragraph vector alone is used to predict words sampled from the paragraph, ignoring word order in the input.]
Model: PV-DBOW (example)
"나는 배가 고파서 밥을 먹었다"

Step | Input                        | Label
1    | 나는 배가 고파서 밥을 먹었다 | 나는
2    | 나는 배가 고파서 밥을 먹었다 | 배가
3    | 나는 배가 고파서 밥을 먹었다 | 고파서
4    | 나는 배가 고파서 밥을 먹었다 | 밥을
5    | 나는 배가 고파서 밥을 먹었다 | 먹었다
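A minimal sketch of the PV-DBOW setup (my illustration, with assumed dimensions): the paragraph vector alone is the input, and it has to predict each word sampled from the paragraph.

```python
import numpy as np

# PV-DBOW sketch: the paragraph vector alone predicts each word of the paragraph.
vocab = {"나는": 0, "배가": 1, "고파서": 2, "밥을": 3, "먹었다": 4}
paragraph = "나는 배가 고파서 밥을 먹었다".split()

rng = np.random.default_rng(2)
dim = 4
d = rng.normal(size=dim)                  # paragraph vector (the only input)
U = rng.normal(size=(dim, len(vocab)))    # softmax weights over the vocabulary

for step, word in enumerate(paragraph, start=1):
    logits = d @ U
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Training would raise probs[vocab[word]] by gradient descent on d and U.
    print(f"Step {step}: predict {word!r} from the paragraph vector, "
          f"p = {probs[vocab[word]]:.3f}")
```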
Experiments and results
[Result tables from the paper: Paragraph Vector outperforms bag-of-words and the other reported baselines on the Stanford Sentiment Treebank and IMDB sentiment classification tasks.]
Other findings
• PV-DM generally performs better than PV-DBOW
• Combining PV-DM and PV-DBOW by concatenation works better than summing them (see the sketch below)
• A window size of 5~12 generally works well
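As a rough illustration of the concatenation point, assuming two already-trained Gensim Doc2Vec models (one with dm=1 for PV-DM, one with dm=0 for PV-DBOW); the helper function is mine, not from the slides:

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec  # imported for type context only

def combined_vector(model_dm: Doc2Vec, model_dbow: Doc2Vec, tokens):
    """Concatenate the PV-DM and PV-DBOW vectors of one document.

    model_dm (trained with dm=1) and model_dbow (dm=0) are assumed to be
    already-trained models. Concatenation keeps both representations side
    by side, whereas summing would mix the two vector spaces into one.
    """
    return np.concatenate([model_dm.infer_vector(tokens),
                           model_dbow.infer_vector(tokens)])

# e.g. combined_vector(model_dm, model_dbow, "나는 배가 고파서 밥을 먹었다".split())
```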
Try it out
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
Using Gensim
[The original slides showed code screenshots following the Gensim Doc2Vec notebook linked above.]
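A minimal sketch of the Gensim Doc2Vec workflow that the linked notebook walks through; the toy corpus and hyperparameters here are illustrative assumptions, not the notebook's exact code.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus; in practice this would be the Lee corpus from the linked notebook.
documents = [
    "나는 배가 고파서 밥을 먹었다",
    "오늘 날씨가 좋아서 산책을 했다",
]
corpus = [TaggedDocument(words=doc.split(), tags=[i])
          for i, doc in enumerate(documents)]

# dm=1 selects PV-DM; dm=0 would select PV-DBOW.
model = Doc2Vec(vector_size=50, window=5, min_count=1, epochs=40, dm=1)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Inference for an unseen document: only its paragraph vector is learned.
new_vec = model.infer_vector("배가 고파서 빵을 먹었다".split())

# Most similar training documents (gensim 4.x; older versions use model.docvecs).
print(model.dv.most_similar([new_vec], topn=2))
```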
Conclusion
• Documents can be embedded with essentially the same idea as Word2Vec
• The resulting performance is quite good
• Easy to put to use
Thank you.