<Little Big Data #1>
summatic@scatterlab.co.kr
1
• 

(@ , 2016. 1~)

• 

(2016. 8~)

• 

(2018. 5~)
!2
:
:
:• 

• (?) 

• B

•
.

•
.

•
.

•
.

• 

• ( , id )
.

• ( , , )
.
!3
• Intro

• 

• 

• 

• 

• 

• Preprocessing

• Word Embedding

• Document Similarity

•
!4
Intro
• 

• 

• “ ” -> “ " -> “ ” .

• “ ” .

• .

• .

• 

• .

• .
!6
-
• Hell 

• .

• 

• 

• 

• 

•
< >
- ?
- ? / ? ?
< > , ,
< > , ,
< >
< > , , ,
!7
• 

• 

•
-
< >
/ / ? / / ? / ? / 

< >
- (X) -> (O)
- ? (X) -> ? (O)
- ? (X) -> ? (O)
- (X) -> (O)
< >
-
-
!8
- preprocess
• Data Science 

• Garbage in, Garbage out

• , preprocess
.

• preprocess ?
!10
Preprocessing
Preprocessing -
• 

• preprocess (POS¹ tagger)
.

• : 

• KoNLPy²

• 

• , ,
1) POS: Part of speech

2) http://konlpy-ko.readthedocs.io/ko/v0.4.3/
!12
Preprocessing - ( )
• . ?
 • _NP _MAG _VV _ECE 

_VXA _EFN ._SF _MAG 

_VV _EFQ ?_SF
• 
 • _NP _MAG _NNG 

_XSV _ECE
• . 
 • _NNG _VA _ECD _VV 

_EFN ._SF _MAG _VV 

_ECE _NNG _XSV _ECS
< > < >
!13
Preprocessing - ( )
• . ?
 • _UN _JKS _MAG _MAG 

_VV _ECE _NNG _MAG 

_MAG _VV _ECS ?_SF
• 
 • _NP _NNG _NNG 

_JKM _VV
• . 
 • _NNG _VA _ECD _NP 

_UN ._SF _MAG _VV _ECE
_MAG _VV _ECS _EMO
< > < >
!15
Preprocessing -
• 

• ( , corpus)


• (corpus)

•
!17
Source: https://ko.wikipedia.org/wiki/
Preprocessing -
• Sejong Corpus

• National Institute of the Korean Language, 1998-2007.

• 

• (..)
!18
Source: https://ithub.korean.go.kr/user/guide/corpus/guide1.do
• preprocess

• normalize( )

• preprocessing

• 

• tokenizing
< >
count(“ ”) < count(“ ?”) , “ ” .
Preprocessing -
!19
Preprocessing - Tokenizing
• Tokenizing: 

• token , .

• , token 

• “ ” “ ” tokenizing
.
!20
< >
before tokenizing:
.
after tokenizing:
/ / / / / / / / / / / / / / / /
/ / / / .
• 

• 

• c1c2..cn-1 cn c1..cn 

•
Preprocessing - Tokenizing(Cohesion Probability)
!21
< >
“ ” “ ” .
Source: https://ratsgo.github.io/from%20frequency%20to%20semantics/2017/05/05/cohesion/
Preprocessing - Tokenizing(Cohesion Probability)
• ) 

• = +
!22
substring count
- count( ) = 20000
- count( ) = 1500
- count( ) = 1200
- count( ) = 30
- count( ) = 15
cohesion probability
- CP( ) = 0.2738
- CP( ) = 0.3914
- CP( ) = 0.1968
- CP( ) = 0.2371
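The Korean substrings on this slide did not survive extraction, but the listed counts and CP values are consistent with the formula CP(c1..cn) = (count(c1..cn) / count(c1)) ** (1/n). The sketch below is a reading inferred from those numbers, not a confirmed implementation from the talk; the strings "a", "ab", ... are hypothetical placeholders for the lost prefixes.

```python
def cohesion_probability(counts, word):
    """Cohesion of `word`: (count(word) / count(first char)) ** (1 / len(word)).

    The 1/n exponent is inferred from the numbers on the slide.
    """
    n = len(word)
    if n < 2:
        raise ValueError("cohesion needs at least two characters")
    return (counts[word] / counts[word[0]]) ** (1.0 / n)


# Placeholder strings stand in for the Korean prefixes lost from the slide;
# the counts themselves are the ones listed on the slide.
counts = {"a": 20000, "ab": 1500, "abc": 1200, "abcd": 30, "abcde": 15}

prefixes = ["ab", "abc", "abcd", "abcde"]
for prefix in prefixes:
    print(prefix, round(cohesion_probability(counts, prefix), 4))

# The prefix with the highest cohesion is the most plausible word boundary.
best = max(prefixes, key=lambda p: cohesion_probability(counts, p))
print("best split:", best)
```

The computed values agree with the slide's CP figures to about three decimal places, and the three-character prefix scores highest.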
Preprocessing - Tokenizing
• Cohesion probability .

• .

• [ 2017] NLP - 

• 

• https://www.slideshare.net/kimhyunjoonglovit/pycon2017-koreannlp

• 

• https://github.com/lovit/soynlp
!23
Word Embedding
Word Embedding - Word2Vec
• vector .

• word embedding word representation .

• word2vec

• You shall know a word by the company it keeps (Firth, J. R. 1957:11)
!25
Word Embedding - Word2Vec
• word2vec OOV
.

• OOV(Out-of-vocabulary): (=dictionary ) vocabulary
vector 

• training input vocabulary OOV
, inference .

• inference : 

•


• ( , )
, dictionary .
!26
• word2vec 

• word2vec:

• 

• fasttext: 

• where G_w is the set of n-grams appearing in w

• subword
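The formulas on this slide were lost in extraction. For reference, the scoring functions that the cited paper (Bojanowski et al., 2016) contrasts are given below; that the slide showed exactly these two is an assumption. Here u_w and v_c are the word and context vectors, z_g is the vector of n-gram g, and G_w is the set of n-grams appearing in w.

```latex
% skip-gram (word2vec): one vector per word
s(w, c) = u_w^{\top} v_c

% subword model (fastText): a word is the sum of its n-gram vectors
s(w, c) = \sum_{g \in \mathcal{G}_w} z_g^{\top} v_c
```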
Word Embedding - Fasttext
!27
< >
w: Alpaca
n grams of w (n=3) = <Al, Alp, lpa, pac, aca, ca>
Source: Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2016). Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.
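The Alpaca example can be reproduced with a few lines. This is a minimal sketch of character n-gram extraction with the `<` and `>` boundary marks, not fastText's actual code; the function name `char_ngrams` is made up for illustration.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of `word` after wrapping it in < and >."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


print(char_ngrams("Alpaca"))
```

In the paper the model uses a range of n-gram lengths and also keeps the whole wrapped word as one extra unit; the sketch covers only a single n.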
Word Embedding - Fasttext +
• fasttext .

• (character) subword 

• subword 

• , OOV .
!28
< >
subwords( ) = < , , , , >
< >
= _ _ _
subwords( ) = < , _, _ , …, >
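The slide's own example was lost, so the following sketch only illustrates the general idea of jamo-level subwords: decompose each Hangul syllable into initial, medial and final jamo using standard Unicode arithmetic, then take character n-grams over the result. The `_` placeholder for a missing final consonant and the choice n=3 are assumptions suggested by the surviving `= _ _ _` fragment; the example words are mine, not the speaker's.

```python
CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"            # 19 initial consonants
JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"        # 21 vowels
JONG = "_ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ"  # "_" = no final


def to_jamo(text):
    """Decompose precomposed Hangul syllables (U+AC00..U+D7A3) into jamo."""
    out = []
    for ch in text:
        idx = ord(ch) - 0xAC00
        if 0 <= idx < 11172:
            out.append(CHO[idx // 588] + JUNG[(idx % 588) // 28] + JONG[idx % 28])
        else:
            out.append(ch)  # leave non-Hangul characters untouched
    return "".join(out)


def jamo_ngrams(word, n=3):
    """Character n-grams over the jamo string, with < and > boundary marks."""
    wrapped = f"<{to_jamo(word)}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


print(to_jamo("한글"))
print(jamo_ngrams("가"))
```

Because every syllable expands to exactly three symbols, an unseen word still shares many jamo n-grams with known words, which is what lets a subword model assign it a vector.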
Word Embedding - Fasttext
•
!29
- , 0.8590
- , 0.8465
- , 0.8180
- , 0.8055
- , 0.8018
- , 0.8017
- , 0.8007
- , 0.7983
- , 0.7972
- , 0.7948
- , 0.9022
- , 0.8986
- , 0.8887
- , 0.8866
- , 0.8567
- , 0.8498
- , 0.8474
- , 0.8413
- , 0.8335
- , 0.8191
Word Embedding - Fasttext
• 

• 

• Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2016). Enriching word
vectors with subword information. arXiv preprint arXiv:1607.04606.

• 

• https://github.com/facebookresearch/fastText

• https://radimrehurek.com/gensim/models/fasttext.html

• https://github.com/summatic/hangul_jamo_fasttext
!30
Sentence Similarity
Sentence Similarity
• document
.

• document short sentence .

• word embedding vector embedding
cosine similarity .
!32
< >
sim( , ?)
Sentence Similarity - BOW + Word Embedding
• word vector 

• doc2vec 

• word embedding 

• word embedding ?

• word embedding 

• !=
!34
- similarity( , ) = 0.9011
- similarity( , ) = 0.8839
- similarity( , ) = 0.9707
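One common reading of "BOW + word embedding" is to average the word vectors of a sentence and compare the averages with cosine similarity; whether the talk used exactly this is an assumption. The toy two-dimensional vectors below are hypothetical and chosen so that "love" and "hate" lie close together, as antonyms often do in distributional embeddings.

```python
import math


def average_vector(tokens, embeddings):
    """Mean of the embeddings of the tokens that are in the vocabulary."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        raise ValueError("no token has an embedding")
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy embeddings, not taken from the talk.
toy = {"I": [1.0, 0.0], "love": [0.9, 0.4], "hate": [0.8, 0.6], "you": [0.0, 1.0]}

s1 = average_vector("I love you".split(), toy)
s2 = average_vector("I hate you".split(), toy)
print(cosine(s1, s2))
```

With these vectors the two sentences come out as highly similar even though they mean opposite things, which illustrates why a high embedding similarity is not the same as equal meaning.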
Sentence Similarity - RNN
• sentence embedding RNN (LSTM, Bi-RNN, GRU ) .

• RNN language modeling

• “ .” <-> “ ”


• sequence embedding .

• .. “ ” “ ” embedding .

• “?”
!35
Sentence Similarity - Term vector
• vector embedding
embedding .

• embedding term vector 

• one hot encoding .

• term vector cosine similarity, edit distance
.
!36
< >
- I love you, you love me
- {“I”: 1, “love”: 2, “you”: 2, “me”: 1}
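The term vector in the example is a plain word count. A minimal sketch using the standard library (the function name `term_vector` is made up for illustration):

```python
import re
from collections import Counter


def term_vector(sentence):
    """Count each word in the sentence; punctuation is dropped."""
    return Counter(re.findall(r"\w+", sentence))


print(dict(term_vector("I love you, you love me")))
```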
Sentence Similarity - Term vector
• term vector 

• . 

• 

• pair1 pair2 ?
!38
< >
pair1: I love you <-> I like you
pair2: I love you <-> I hate you
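The weakness of plain term vectors shows up as soon as the two pairs are scored. The sketch below computes cosine similarity over word counts; the helper names are hypothetical.

```python
import math
from collections import Counter


def term_vector(sentence):
    return Counter(sentence.split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


base = term_vector("I love you")
pair1 = cosine(base, term_vector("I like you"))
pair2 = cosine(base, term_vector("I hate you"))
print(round(pair1, 3), round(pair2, 3))
```

Both pairs share two of three words, so they get the same score: the count vector cannot tell that "like" is closer to "love" than "hate" is.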
Sentence Similarity - ESA Similarity
• ESA: Explicit Semantic Analysis

• (=word vector) 

• cosine similarity

• ESA similarity
!39
Sentence 1: I love you
Sentence 2: I like you

similarity   I     love   you
I            1     0.2    0.5
like         0.3   0.9    0.4
you          0.5   0.4    1
max          1     0.9    1
Sentence Similarity - ESA Similarity
• ESA: Explicit Semantic Analysis

• (=word vector) 

• cosine similarity

• ESA similarity
!40
Sentence 1: I love you
Sentence 2: I hate you

similarity   I     love   you
I            1     0.2    0.5
hate         0.3   0.5    0.4
you          0.5   0.4    1
max          1     0.5    1
Sentence Similarity - ESA Similarity
• ESA: Explicit Semantic Analysis

• I love you 

• .
!41
          I like you   I hate you
cosine    0.667        0.667
ESA       0.967        0.833
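The ESA figures above can be reproduced from the two word-similarity tables by taking, for each word of "I love you", its best match in the other sentence and averaging those maxima. This is a reading of the slides' numbers, not necessarily the exact formulation in Song & Roth (2015); the function name is hypothetical.

```python
def esa_similarity(sim):
    """Average of the column maxima of a word-by-word similarity matrix.

    Rows are the words of the compared sentence, columns the words of
    "I love you", as laid out on the slides.
    """
    col_max = [max(col) for col in zip(*sim)]
    return sum(col_max) / len(col_max)


# Word similarities copied from the slides (rows: I, like/hate, you).
like = [[1, 0.2, 0.5], [0.3, 0.9, 0.4], [0.5, 0.4, 1]]
hate = [[1, 0.2, 0.5], [0.3, 0.5, 0.4], [0.5, 0.4, 1]]

print(round(esa_similarity(like), 3), round(esa_similarity(hate), 3))
```

Unlike the count-based cosine, this score separates the two pairs because the word-level similarity of "like" to "love" is higher than that of "hate".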
Sentence Similarity - ESA Similarity
• .

• 

• Song, Y., & Roth, D. (2015). Unsupervised sparse vector densification for short
text similarity. In Proceedings of the 2015 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language
Technologies (pp. 1275-1280).

• 

• ( )
!42
• preprocessing 80% 

• Zipf’s law

• corpus ,


• ( ) .


• 

• 

• , count based


• unlabeled data label 

• label insight
!44
WE WANT YOU!
- End of Document -
46

<Little Big Data #1> Machine Learning with Korean Chat Data (edited so the Korean text is visible)
