2. How do we combine text into a data model?
• Traditional data models
₋ Numeric: height, weight, …
₋ Categorical: sex, ethnicity, …
• How do we combine text?
$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_m x_m$
"With all this stuff going down at the moment with
MJ i've started listening to his music, watching the
odd documentary here and there, watched The Wiz
and watched Moonwalker again. Maybe i just want
to get a certain insight into this guy who i thought
was really cool in the eighties just to maybe make up
my mind whether he is guilty or innocent.
Moonwalker is part biography, part feature film
which i remember going to see at the cinema when it
was originally released. Some of it has subtle
messages about MJ's feeling towards the press and
also the obvious message of drugs are bad m'kay.
<br/><br/>..."
$f(x_1, \dots, x_m) = \beta_0 + \beta_1 x_1 + \cdots + \beta_m x_m + \gamma_1 \cdot \text{TEXT}$
$f(x_1, \dots, x_m) = \beta_0 + \beta_1 x_1 + \cdots + \beta_m x_m + \gamma_1 \cdot t_1 + \cdots + \gamma_n \cdot t_n$
M1: stuff the whole text into a single cell
M2: split the text into many separate cells
The single TEXT cell is very different from the others; how do we compute with it?
And where do the cells $t_1, \dots, t_n$ come from?!
4. BOW, Bag of words
• Meaning: the count / proportion of each word appearing in the text
• Uses: comparing word-usage habits, plagiarism detection, …
5. BOW, Bag of words
• Computation: count the number of times each meaningful word appears in each document, after removing stop words.
"With all this stuff going down at the moment with MJ i've
started listening to his music, watching the odd
documentary here and there, watched The Wiz and
watched Moonwalker again. Maybe i just want to get a
certain insight into this guy who i thought was really cool
in the eighties just to maybe make up my mind whether he
is guilty or innocent. Moonwalker is part biography, part
feature film which i remember going to see at the cinema
when it was originally released. Some of it has subtle
messages about MJ's feeling towards the press and also the
obvious message of drugs are bad m'kay. <br/><br/>..."
Word counts for Document 1 (each count becomes a feature $t_1, t_2, \dots, t_7, \dots$):

word  count
also  2
bad   3
film  2
get   1
like  3
made  1
make  1
…     …
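The counting step above can be sketched with `collections.Counter`; the stop-word list here is a tiny made-up subset for illustration, not a real stop-word list:

```python
from collections import Counter

# Illustrative stop words (a made-up subset, not sklearn's full English list)
STOP_WORDS = {"with", "all", "this", "at", "the", "to", "his", "and", "i", "a", "of"}

text = "With all this stuff going down at the moment with MJ i've started listening to his music"
words = [w for w in text.lower().split() if w not in STOP_WORDS]
counts = Counter(words)

print(counts["with"])   # 0: stop word removed
print(counts["music"])  # 1
```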
6. BOW, Bag of words
• Computation: we have many documents, and each document has its own set of meaningful words.

Document 1:  also 2, bad 3, film 2, get 1, like 3, made 1, …
Document 2:  film 1, films 2, like 1, made 1, man 1, many 2, …
Document 3:  also 1, film 4, get 1, good 1, movie 1, one 2, …
Document 4:  even 3, film 1, made 1, movie 1, much 1, time 1, …
7. BOW, Bag of words
• Computation: use the most frequent words across all documents.

                film  movie  one  like  good  even  …
Document 1        2     3     4     3     0     0   …
Document 2        1     1     2     1     0     0   …
Document 3        4     1     2     0     1     0   …
Document 4        1     1     0     0     0     3   …
⋮
Document 1000     0     2     0     3     1     2   …
(total)        1653  1630  1088   769   664   501   …

Each column becomes a feature $t_1, t_2, \dots, t_6, \dots$
8. BOW, Bag of words
• Computation: use the most frequent words across all documents, and convert each row to proportions so it sums to 100%.

                film  movie  one  like  good  even  …  (row sum)
Document 1       2%    3%    4%    3%    0%    0%   …    100%
Document 2       1%    1%    2%    1%    0%    0%   …    100%
Document 3       4%    1%    2%    0%    1%    0%   …    100%
Document 4       1%    1%    0%    0%    0%    3%   …    100%
⋮
Document 1000    0%    2%    0%    3%    1%    2%   …    100%
9. BOW, Bag of words - Implement
• Package
• from sklearn.feature_extraction.text import CountVectorizer
• Input
• docs = [doc1_text, doc2_text, …, docn_text]  (one raw string per document)
• Setting up the bag-of-words tool
• vectorizer = CountVectorizer(analyzer = "word", tokenizer = None, preprocessor = None, stop_words = "english",
max_features = 5000)
• Get features
• features = vectorizer.fit_transform(docs)

CountVectorizer tokenizes the text and removes stop words; max_features = 5000 keeps only the 5000 most frequent words.
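A minimal runnable version of the steps above; the two toy documents are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: one raw string per document
docs = [
    "I watched the film and the film was good",
    "the movie was a good movie",
]

# stop_words="english" drops common words such as "the" and "was";
# max_features keeps only the most frequent remaining words
vectorizer = CountVectorizer(analyzer="word", stop_words="english", max_features=5000)
features = vectorizer.fit_transform(docs)  # sparse document-term count matrix

print(sorted(vectorizer.vocabulary_))  # learned word columns
print(features.toarray())              # per-document counts
```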
12. TFIDF, Term frequency–inverse document frequency
• Example: Reuters Corpus (10788 articles)
• One of the news stories:
• GRAIN SHIPS LOADING AT PORTLAND
There were three grain ships loading and two ships were waiting to load at Portland , according to the Portland Merchants Exchange .
• $\mathrm{TFIDF}_{\text{portland}} = 3 \times \log_2 \frac{10788}{28} \approx 25.769$  (portland appears in 28 of the 10788 articles)
• $\mathrm{TFIDF}_{\text{to}} = 2 \times \log_2 \frac{10788}{6944} \approx 1.271$  (to appears in 6944 of the 10788 articles)
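The two values on the slide can be reproduced directly from the formula (raw term count times log2 of the inverse document frequency):

```python
import math

def tfidf(tf, n_docs, df):
    """TF-IDF as used on the slide: term count times log2(total docs / docs containing the term)."""
    return tf * math.log2(n_docs / df)

# Reuters example: 10788 articles in total
print(round(tfidf(3, 10788, 28), 3))    # "portland": 3 occurrences, in 28 articles -> 25.769
print(round(tfidf(2, 10788, 6944), 3))  # "to": 2 occurrences, in 6944 articles -> 1.271
```

The frequent word "to" gets a much smaller score than the rare word "portland", which is exactly the point of the IDF weighting.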
14. Word2vec
• Meaning: represent each word as a vector, letting the surrounding context shape each word's meaning
• Uses: similar words, bilingual translation, analogies (ex. "king - man + woman = queen")

          previous encodings             Word2vec
dog  Id222  0,1,0,0,0,0,0,0   0.12,0.13,0.01,0.01,0.01
cat  Id357  0,0,0,0,0,0,1,0   0.12,0.12,0.01,0.01,0.01
car  Id358  0,0,0,0,0,0,0,1   0.01,0.01,0.33,0.40,0.25
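The "similar words" use can be checked with cosine similarity on the toy vectors from the table above: dog and cat come out nearly identical, while dog and car do not.

```python
import math

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Word2vec vectors from the table above
dog = [0.12, 0.13, 0.01, 0.01, 0.01]
cat = [0.12, 0.12, 0.01, 0.01, 0.01]
car = [0.01, 0.01, 0.33, 0.40, 0.25]

print(cosine(dog, cat))  # close to 1: similar words
print(cosine(dog, car))  # much smaller: unrelated words
```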
15. Word2vec
• Computation: Skip-Gram
₋ ex. Doc1 = I have a cat. I love cat and dog.
₋ 7 distinct words, 8 (word, context) pairs

Example pairs:
word     i          have    a            ⋯  and
context  [*, have]  [i, a]  [have, cat]  ⋯  [cat, dog]
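Generating these pairs can be sketched as below, using a window of one word on each side and padding missing edge neighbours with "*" as in the slide's example (the exact pair count depends on how sentence boundaries and edges are handled):

```python
def skipgram_pairs(tokens, pad="*"):
    """For each word, pair it with its left and right neighbours (window size 1).
    Missing neighbours at the edges are padded with "*"."""
    padded = [pad] + tokens + [pad]
    return [(padded[i], [padded[i - 1], padded[i + 1]]) for i in range(1, len(padded) - 1)]

tokens = "i have a cat i love cat and dog".split()
pairs = skipgram_pairs(tokens)
print(pairs[0])  # ('i', ['*', 'have'])
print(pairs[2])  # ('a', ['have', 'cat'])
```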
16. Word2vec
• Computation: Skip-Gram
₋ ex. have a cat → a, [have, cat]

[Figure: a small neural network. The one-hot vector of the center word "a" is fed in; the hidden layer holds the learned word vector (e.g. 0.1, …, 0.2); the network's output probabilities are trained to match the one-hot vectors of the context words "have" and "cat" as closely as possible.]
17. Word2vec
• Meaning: represent each word as a vector, letting the surrounding context shape each word's meaning
• Uses: similar words, bilingual translation, analogies (ex. "king - man + woman = queen")

          previous encodings             Word2vec
dog  Id222  0,1,0,0,0,0,0,0   0.12,0.13,0.01,0.01,0.01
cat  Id357  0,0,0,0,0,0,1,0   0.12,0.12,0.01,0.01,0.01
car  Id358  0,0,0,0,0,0,0,1   0.01,0.01,0.33,0.40,0.25
20. Doc2vec
• Computation: reuse the word2vec result and average the vectors of the document's meaningful words

Document 1 = "I have a cat. I love cat and dog."
Look up the word2vec vector of each meaningful word (e.g. cat = 0.12,0.12,0.01,0.01,0.01; dog = 0.12,0.13,0.01,0.01,0.01), sum them, and divide by the number of meaningful words:

Document vector = (0.39, 0.40, 0.06, 1.00, 1.00) / 6
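The averaging step is just a component-wise mean of the word vectors; the two vectors below are the cat and dog vectors from the earlier table, standing in for a full document:

```python
def doc_vector(word_vectors):
    """Average the word2vec vectors of a document's meaningful words."""
    n = len(word_vectors)
    dims = len(word_vectors[0])
    return [sum(v[d] for v in word_vectors) / n for d in range(dims)]

# Word2vec vectors of the meaningful words of a tiny document
vectors = [
    [0.12, 0.12, 0.01, 0.01, 0.01],  # cat
    [0.12, 0.13, 0.01, 0.01, 0.01],  # dog
]
print(doc_vector(vectors))  # component-wise mean: [0.12, 0.125, 0.01, 0.01, 0.01]
```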
But what if a document's meaning is not just one thing?
21. LDA, Latent Dirichlet Allocation
• Meaning: the proportion of the text that each topic accounts for

[Figure: two example documents fed through LDA, with topics Arts, Budgets, Children, Education. "The William Randolph Hearst Foundation will give $1.25 million to Lincoln Center, Metropolitan …" comes out as roughly Arts 0.2, Budgets 0.1, Children 0.4, Education 0.3; "'Our board felt that we had a real opportunity to make a mark on the future of …" comes out as roughly Arts 0.3, Budgets 0.5, Children 0.1, Education 0.1.]
22. LDA, Latent Dirichlet Allocation
• Meaning: the proportion of the text that each topic accounts for

[Figure: LDA maps the same two example documents onto four topics, where each topic is a weighted mix of words:]
Arts:      0.461 * music + 0.134 * movie + …
Budgets:   0.211 * tax + 0.035 * money + …
Children:  0.309 * children + 0.241 * family + …
Education: 0.217 * school + 0.222 * teacher + …
23. LDA, Latent Dirichlet Allocation
• Meaning: the proportion of the text that each topic accounts for

[Figure: combining the two views. Each topic is a weighted mix of words (Arts: 0.461 * music + 0.134 * movie + …; Budgets: 0.211 * tax + 0.035 * money + …; Children: 0.309 * children + 0.241 * family + …; Education: 0.217 * school + 0.222 * teacher + …), and each document is a mix of topics (the Hearst Foundation document: roughly Arts 0.2, Budgets 0.1, Children 0.4, Education 0.3; the board document: roughly Arts 0.3, Budgets 0.5, Children 0.1, Education 0.1).]
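The pipeline above (documents → word counts → per-document topic proportions) can be sketched with sklearn's LatentDirichletAllocation; the four tiny documents and the choice of two topics are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny made-up corpus with two rough themes (arts vs. budgets)
docs = [
    "music movie art music concert",
    "tax money budget tax spending",
    "movie art music film",
    "money tax budget funding",
]

counts = CountVectorizer().fit_transform(docs)  # bag-of-words counts
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)          # per-document topic proportions

print(doc_topics.shape)        # (4, 2): 4 documents x 2 topics
print(doc_topics.sum(axis=1))  # each row sums to 1, like the slide's proportions
```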