Introduction to Text Analysis with Python
Terence Huang
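How do we combine text with a data model?
• Traditional data models use
₋ Numeric features: height, weight, …
₋ Categorical features: gender, ethnicity, …
$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_m x_m$
• How do we bring text into the model?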
"With all this stuff going down at the moment with
MJ i've started listening to his music, watching the
odd documentary here and there, watched The Wiz
and watched Moonwalker again. Maybe i just want
to get a certain insight into this guy who i thought
was really cool in the eighties just to maybe make up
my mind whether he is guilty or innocent.
Moonwalker is part biography, part feature film
which i remember going to see at the cinema when it
was originally released. Some of it has subtle
messages about MJ's feeling towards the press and
also the obvious message of drugs are bad m'kay.
<br/><br/>..."
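M1: put the whole text into a single slot
$f(x_1, \dots, x_m) = \beta_0 + \beta_1 x_1 + \cdots + \beta_m x_m + \gamma_1 \cdot \mathrm{TEXT}$
This slot is very different from the others. How would it be computed?
M2: spread the text across many slots
$f(x_1, \dots, x_m) = \beta_0 + \beta_1 x_1 + \cdots + \beta_m x_m + \gamma_1 \cdot t_1 + \cdots + \gamma_n \cdot t_n$
Where do these slots come from?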
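Outline
• Goals
₋ Explain where the slots (features) come from → what each feature means
₋ Show how to do it in Python
| Feature extraction method | Meaning |
|---|---|
| BOW, Bag of words | Count/proportion of each word in a document |
| TF-IDF | How distinctive each word is within a document |
| Word2vec | Each word as a vector that reflects the contexts it appears in |
| Doc2vec | Each document's meaning as a vector |
| LDA, Latent Dirichlet Allocation | Proportion of each topic in a document |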
BOW, Bag of words
• Meaning: the count/proportion of each word in a document
• Uses: comparing word-usage habits, detecting plagiarism, …
BOW, Bag of words
• How it is computed: count the number of times each meaningful word appears in each document, after removing stop words.
"With all this stuff going down at the moment with MJ i've
started listening to his music, watching the odd
documentary here and there, watched The Wiz and
watched Moonwalker again. Maybe i just want to get a
certain insight into this guy who i thought was really cool
in the eighties just to maybe make up my mind whether he
is guilty or innocent. Moonwalker is part biography, part
feature film which i remember going to see at the cinema
when it was originally released. Some of it has subtle
messages about MJ's feeling towards the press and also the
obvious message of drugs are bad m'kay. <br/><br/>..."
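Document 1
| Feature | Count | Variable |
|---|---|---|
| also | 2 | $t_1$ |
| bad | 3 | $t_2$ |
| film | 2 | $t_3$ |
| get | 1 | $t_4$ |
| like | 3 | $t_5$ |
| made | 1 | $t_6$ |
| make | 1 | $t_7$ |
| … | … | ⋮ |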
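The counting step above can be sketched in plain Python. This is a minimal illustration, not the deck's own code: the `STOP_WORDS` set is a tiny hypothetical list and the regular-expression tokenizer is an assumption chosen for brevity.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "and", "at", "i", "is", "it", "of", "the", "to", "which"}

def bag_of_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into words, drop stop words, count the rest."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

doc = ("Moonwalker is part biography, part feature film which i remember "
       "going to see at the cinema.")
counts = bag_of_words(doc)
print(counts.most_common(3))
```

Each key of `counts` plays the role of one feature slot, and its value is the number placed in that slot for this document.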
BOW, Bag of words
• How it is computed: there are many documents, and each document has its own set of meaningful words.
Document 1: also 2, bad 3, film 2, get 1, like 3, made 1, …
Document 2: film 1, films 2, like 1, made 1, man 1, many 2, …
Document 3: also 1, film 4, get 1, good 1, movie 1, one 2, …
Document 4: even 3, film 1, made 1, movie 1, much 1, time 1, …
BOW, Bag of words
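• How it is computed: keep only the words that are most frequent across all documents.
| | film ($t_1$) | movie ($t_2$) | one | like | good | even ($t_6$) | … |
|---|---|---|---|---|---|---|---|
| Document 1 | 2 | 3 | 4 | 3 | 0 | 0 | … |
| Document 2 | 1 | 1 | 2 | 1 | 0 | 0 | … |
| Document 3 | 4 | 1 | 2 | 0 | 1 | 0 | … |
| Document 4 | 1 | 1 | 0 | 0 | 0 | 3 | … |
| … | … | … | … | … | … | … | … |
| Document 1000 | 0 | 2 | 0 | 3 | 1 | 2 | … |
| Total | 1653 | 1630 | 1088 | 769 | 664 | 501 | … |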
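A small sketch of how such a document-by-word table could be built: total the counts over all documents, keep the most frequent words as columns, then count each column word per document. The toy documents and the name `document_term_matrix` are assumptions made for illustration only.

```python
from collections import Counter

# Toy, already tokenized documents (hypothetical data).
docs = {
    "Document 1": ["film", "film", "movie", "like"],
    "Document 2": ["film", "movie", "movie", "one"],
    "Document 3": ["film", "good"],
}

def document_term_matrix(docs, max_features):
    """Use the max_features most frequent words overall as the columns."""
    totals = Counter()
    for tokens in docs.values():
        totals.update(tokens)
    vocab = [word for word, _ in totals.most_common(max_features)]
    matrix = {name: [tokens.count(word) for word in vocab]
              for name, tokens in docs.items()}
    return vocab, matrix

vocab, matrix = document_term_matrix(docs, max_features=2)
print(vocab)
for name, row in matrix.items():
    print(name, row)
```

Every document now has the same columns, so the rows can be fed to a model as the features $t_1, \dots, t_n$.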
BOW, Bag of words
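• How it is computed: keep only the words that are most frequent across all documents; the counts can also be expressed as proportions of each document.
| | film | movie | one | like | good | even | … | Total |
|---|---|---|---|---|---|---|---|---|
| Document 1 | 2% | 3% | 4% | 3% | 0% | 0% | … | 100% |
| Document 2 | 1% | 1% | 2% | 1% | 0% | 0% | … | 100% |
| Document 3 | 4% | 1% | 2% | 0% | 1% | 0% | … | 100% |
| Document 4 | 1% | 1% | 0% | 0% | 0% | 3% | … | 100% |
| … | … | … | … | … | … | … | … | 100% |
| Document 1000 | 0% | 2% | 0% | 3% | 1% | 2% | … | 100% |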
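Turning counts into proportions is a row-wise normalization. A minimal sketch, assuming each row holds the counts of one document; the example row is made up.

```python
def to_proportions(counts):
    """Divide each count by the row total so that the row sums to 1 (100%)."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]  # an empty document stays all zeros
    return [c / total for c in counts]

row = [2, 1, 1]  # hypothetical word counts for one document
proportions = to_proportions(row)
print([f"{p:.0%}" for p in proportions])
```

Proportions make documents of different lengths comparable, which raw counts do not.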
BOW, Bag of words - Implement
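• Package
• from sklearn.feature_extraction.text import CountVectorizer
• Input
• docs = ["word1-1 … word1-m1", "word2-1 … word2-m2", …, "wordn-1 … wordn-mn"] (one string per document; the default analyzer expects raw text, not lists of tokens)
• Set up the bag-of-words tool
• vectorizer = CountVectorizer(analyzer = "word", tokenizer = None, preprocessor = None, stop_words = None, max_features = 5000)
₋ analyzer = "word", tokenizer = None: the vectorizer tokenizes the text for you
₋ stop_words: controls stop-word removal; None removes nothing, "english" uses the built-in English list
₋ max_features = 5000: keep only the 5000 most frequent words
• Get the features
• features = vectorizer.fit_transform(docs)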
TFIDF, Term frequency–inverse document frequency
• Meaning: how distinctive each word is within a document, measured relative to the other documents
• Use: finding the keywords that best represent a document
TFIDF, Term frequency–inverse document frequency
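• How it is computed: finding such keywords requires two ingredients
₋ TF, Term Frequency
• How often the word appears in this document
₋ IDF, Inverse Document Frequency
• Based on how many of the documents contain the word; the fewer documents, the higher the IDF
• The larger the TF-IDF value, the better the word serves as a keyword for the document
$\mathrm{TFIDF} = \mathrm{TF} \times \mathrm{IDF} = n_w^d \times \log_2 \frac{N}{N_w}$
$n_w^d$: number of times word $w$ appears in document $d$
$N$: total number of documents
$N_w$: number of documents that contain word $w$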
TFIDF, Term frequency–inverse document frequency
• Example: Reuters Corpus (10788 documents)
• One news item from the corpus:
• GRAIN SHIPS LOADING AT PORTLAND
There were three grain ships loading and two ships were waiting to load at Portland , according to the
Portland Merchants Exchange .
• $\mathrm{TFIDF}_{\text{portland}} = 3 \times \log_2 \frac{10788}{28} \approx 25.769$; 28 of the 10788 documents contain "portland"
• $\mathrm{TFIDF}_{\text{to}} = 2 \times \log_2 \frac{10788}{6944} \approx 1.271$; 6944 of the 10788 documents contain "to"
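The two values in this example can be reproduced directly from the formula. A minimal sketch; the function name `tfidf` and its parameter names are chosen here for illustration.

```python
import math

def tfidf(term_count, n_docs, n_docs_with_term):
    """TF-IDF as defined on the slide: n_w^d * log2(N / N_w)."""
    return term_count * math.log2(n_docs / n_docs_with_term)

portland = tfidf(term_count=3, n_docs=10788, n_docs_with_term=28)
to = tfidf(term_count=2, n_docs=10788, n_docs_with_term=6944)
print(round(portland, 3), round(to, 3))
```

The rare word "portland" scores far higher than the common word "to", even though both appear only a few times in the article.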
TFIDF, Term frequency–inverse document frequency - Implement
Word2vec
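• Meaning: each word is represented as a vector that reflects the contexts in which it appears
• Uses: similar words, bilingual translation, analogies (e.g. "king - man + woman = queen")
| Word | Previous encoding: ID | Previous encoding: one-hot | Word2vec |
|---|---|---|---|
| dog | Id222 | 0,1,0,0,0,0,0,0 | 0.12,0.13,0.01,0.01,0.01 |
| cat | Id357 | 0,0,0,0,0,0,1,0 | 0.12,0.12,0.01,0.01,0.01 |
| car | Id358 | 0,0,0,0,0,0,0,1 | 0.01,0.01,0.33,0.40,0.25 |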
Word2vec
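• How it is computed: Skip-Gram
₋ e.g. Doc1 = I have a cat. I love cat and dog.
₋ 7 distinct words, 8 pairs
Example:
| Pair | * i have | i have a | have a cat | ⋯ | cat and dog |
|---|---|---|---|---|---|
| Word | i | have | a | ⋯ | and |
| Context | [*, have] | [i, a] | [have, cat] | ⋯ | [cat, dog] |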
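The (word, context) pairs can be generated with a sliding window. This sketch assumes a window of one word on each side, a `*` padding token at the start, and that sentence boundaries are ignored, which is what reproduces the 7 distinct words and 8 pairs quoted for Doc1.

```python
def context_pairs(tokens, pad="*"):
    """Return (word, [left neighbour, right neighbour]) for each window of three."""
    padded = [pad] + tokens
    pairs = []
    for i in range(1, len(padded) - 1):
        pairs.append((padded[i], [padded[i - 1], padded[i + 1]]))
    return pairs

tokens = "i have a cat i love cat and dog".split()
pairs = context_pairs(tokens)
vocab = sorted(set(tokens))

print(len(vocab), "distinct words,", len(pairs), "pairs")
for word, context in pairs[:3]:
    print(word, context)
```

Each pair becomes one training example: the word is the input and its context words are the targets.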
Word2vec
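• How it is computed: Skip-Gram
₋ e.g. have a cat → a, [have, cat]
[Figure: skip-gram network. The one-hot vector of the word "a" (0,0,1,0,0,0,0) passes through a layer of weighted sums (∑) that produces the resulting word vector (0.1, …, 0.2). A second layer of weighted sums produces one output per context position, such as (0.1, 0.9, 0.0, 0.0, 0.1, 0.3, 0.2) and (0.1, 0.1, 0.0, 0.0, 0.1, 0.4, 0.2). Training makes these outputs as close as possible to the one-hot vectors of the context words "have" (0,1,0,0,0,0,0) and "cat" (0,0,0,1,0,0,0).]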
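A rough sketch of the forward pass in this picture, using NumPy with random, untrained weights. The vocabulary order, the 5-dimensional vector size, and the softmax output are assumptions made for illustration; real Word2vec training also updates the weights and uses faster approximations.

```python
import numpy as np

vocab = ["i", "have", "a", "cat", "love", "and", "dog"]
index = {word: k for k, word in enumerate(vocab)}
V, D = len(vocab), 5  # vocabulary size, word-vector size (assumed)

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))   # row k is the vector of word k
W_out = rng.normal(scale=0.1, size=(D, V))  # maps a word vector to word scores

def one_hot(word):
    vec = np.zeros(V)
    vec[index[word]] = 1.0
    return vec

def forward(center):
    """One-hot input -> word vector -> probability of each word as context."""
    hidden = one_hot(center) @ W_in          # picks out the row of W_in
    scores = hidden @ W_out
    exp = np.exp(scores - scores.max())      # numerically stable softmax
    return hidden, exp / exp.sum()

hidden, probs = forward("a")
# Cross-entropy against the two context words; training would minimize this.
loss = -sum(np.log(probs[index[w]]) for w in ["have", "cat"])
print(hidden.round(3), probs.round(3), round(float(loss), 3))
```

After training, the rows of `W_in` are the word vectors; `hidden` is the vector for "a".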
Word2vec - Implement
• Package
• from gensim
Doc2vec
• Meaning: each document's meaning, represented as a vector
Doc2vec
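• How it is computed: reuse the Word2vec results and average the vectors of the meaningful words in the document
Document 1: I have a cat. I love cat and dog.
[Figure: each meaningful word in Document 1 is replaced by its Word2vec vector, e.g. cat → (0.12, 0.12, 0.01, 0.01, 0.01) and dog → (0.12, 0.13, 0.01, 0.01, 0.01); the other vectors shown are (0.01, 0.01, 0.01, 0.01, 0.95) and (0.01, 0.01, 0.01, 0.95, 0.01). The vectors are summed, shown on the slide as (0.39, 0.40, 0.06, 1.00, 1.00), and divided by 6 to obtain the document vector.]
What if a document has more than one meaning?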
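The averaging approach can be sketched in a few lines. This is the simple mean of word vectors described here, not gensim's `Doc2Vec` model; the word vectors are the illustrative values from the Word2vec table, and words without a vector are treated as not meaningful and skipped.

```python
# Illustrative word vectors taken from the Word2vec table.
word_vectors = {
    "dog": [0.12, 0.13, 0.01, 0.01, 0.01],
    "cat": [0.12, 0.12, 0.01, 0.01, 0.01],
    "car": [0.01, 0.01, 0.33, 0.40, 0.25],
}

def document_vector(tokens, vectors):
    """Average the vectors of the tokens that have one."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token in the document has a vector")
    dim = len(known[0])
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]

doc = "i have a cat i love cat and dog".split()
doc_vec = document_vector(doc, word_vectors)
print([round(x, 4) for x in doc_vec])
```

Because the result is a single vector, a document that mixes several subjects gets blurred into one point.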
LDA, Latent Dirichlet Allocation
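• Meaning: the proportion of the document taken up by each topic
Documents:
₋ The William Randolph Hearst Foundation will give $1.25 million to Lincoln Center, Metropolitan …
₋ ''Our board felt that we had a real opportunity to make a mark on the future of …
[Figure: Documents → LDA → topic proportions. Each document is mapped to its proportions over the topics Arts, Budgets, Children, and Education; the two examples shown are (0.2, 0.1, 0.4, 0.3) and (0.3, 0.5, 0.1, 0.1).]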
LDA, Latent Dirichlet Allocation
• Meaning: the proportion of the document taken up by each topic
[Figure: Documents → LDA → Topics, for the same two example documents. Each topic is a weighted mix of words:]
₋ Arts: 0.461 * music + 0.134 * movie + …
₋ Budgets: 0.211 * tax + 0.035 * money + …
₋ Children: 0.309 * children + 0.241 * family + …
₋ Education: 0.217 * school + 0.222 * teacher + …
LDA, Latent Dirichlet Allocation
• Meaning: the proportion of the document taken up by each topic
[Figure: the two previous pictures combined. From the documents, LDA produces both the topics (Arts, Budgets, Children, Education, each a weighted mix of words such as 0.461 * music + 0.134 * movie + …) and each document's topic proportions, (0.2, 0.1, 0.4, 0.3) and (0.3, 0.5, 0.1, 0.1) for the two example documents.]
LDA, Latent Dirichlet Allocation - Implement
• Package
• from gensim
Outline
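• Goals
₋ Explain where the slots (features) come from → what each feature means
₋ Show how to do it in Python
| Feature extraction method | Meaning |
|---|---|
| BOW, Bag of words | Count/proportion of each word in a document |
| TF-IDF | How distinctive each word is within a document |
| Word2vec | Each word as a vector that reflects the contexts it appears in |
| Doc2vec | Each document's meaning as a vector |
| LDA, Latent Dirichlet Allocation | Proportion of each topic in a document |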
Q&A
Thanks