MM text team
蔡捷恩
莊文立
溫鈺瑋
2015@Delta Research Center
Fully automatic F/T matrix
analysis from patent data
蔡捷恩
Function/Technology Matrix
Using keyword “ ”
“The Patent-Classification Technology/Function Matrix - A Systematic Method for Design Around”, Cheng et al. Mar-2013, CSIR
Problem reduction
• Detecting problem/solution pairs in a patent document
“Automatic Discovery of Technology Trends from Patent”, Y. Kim et al. 2009, ACMSAC
Problem term detection
• Step1. finding key frames
• Step2. feature extraction
– Unsupervised feature
– Supervised feature
• Step3. classifier training
Step1. key frames detection
• We define key frames to be “ ”
Step2 – unsupervised feature
(language model)
• The model:
Maximum likelihood estimation (MLE)
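The model formula did not survive in the slide text. As a minimal sketch of what maximum likelihood estimation means for a language model, assuming a plain unigram model (the paper's actual model may differ), each probability is simply a relative frequency. The function name `unigram_mle` and the toy corpus are hypothetical.

```python
from collections import Counter

def unigram_mle(tokens):
    """Maximum likelihood estimate of a unigram model: P(w) = count(w) / N."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

# Hypothetical toy corpus; real input would be tokenized patent text.
corpus = "the method solves the problem of overfitting".split()
model = unigram_mle(corpus)
```

A key frame could then be scored by the probabilities of its words under such a model and the score used as an unsupervised feature.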
Step2 – supervised feature
(linguistic model)
• Based on part-of-speech (POS) statistics from labeled patents
Step2 – supervised feature
(linguistic model)
• The model:
Delta function = 1 only when the current key frame
matches the given pattern
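The delta function can be sketched as an indicator feature. This assumes that "matches" means the key frame's POS tag sequence equals the pattern exactly; the tag names and the two patterns are hypothetical, not taken from the paper.

```python
def delta(key_frame_tags, pattern):
    """Indicator feature: 1 if the key frame's tag sequence equals the pattern, else 0."""
    return 1 if tuple(key_frame_tags) == tuple(pattern) else 0

def pattern_features(key_frame_tags, patterns):
    """One binary feature per pattern, in the order the patterns are given."""
    return [delta(key_frame_tags, p) for p in patterns]

# Hypothetical POS patterns; in practice they would be learned from labeled patents.
patterns = [("VB", "DT", "NN"), ("NN", "IN", "NN")]
features = pattern_features(["NN", "IN", "NN"], patterns)
```

Each pattern contributes one 0/1 entry, so the supervised feature vector has as many dimensions as there are patterns.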
Step3. classifier training
• Simply concatenate the features mentioned above => LIBSVM
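As a sketch of the concatenation step, the helper below joins the feature groups of one key frame and writes them in LIBSVM's sparse text format (`<label> <index>:<value> ...`, with indices starting at 1). The helper name `to_libsvm_line` and the example values are hypothetical.

```python
def to_libsvm_line(label, *feature_groups):
    """Concatenate feature groups and format one example as a LIBSVM input line."""
    features = [value for group in feature_groups for value in group]
    # LIBSVM indices start at 1; zero-valued features can be left out (sparse format).
    pairs = [f"{i}:{v:g}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([str(label)] + pairs)

# One unsupervised score followed by two binary pattern features (hypothetical values).
line = to_libsvm_line(+1, [0.25], [0, 1])
```

A file containing one such line per key frame is the kind of input that LIBSVM's `svm-train` tool reads.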
Solution term detection
• Step1. key frame detection
• Step2. feature extraction
– Unsupervised feature
– Supervised feature: based on problem terms
• Step3. classifier training
Problems
• Lack of labeled data => the linguistic model proposed in the paper seems general enough => adopt it directly, with Porter stemming
Further improvement
• Coreference resolution
– “the method solves the problem of overfitting.”
• Semantic-based clustering
– Okapi BM25: “The Probabilistic Relevance Framework: BM25 and Beyond”, Robertson et al., 2009
– Word vector: “Efficient Estimation of Word Representations in Vector Space”, T. Mikolov, ICLR, 2013
– Document vector: “Distributed Representations of Words and Phrases and their Compositionality”, NIPS, 2013
In my opinion: Okapi BM25 > word vector > document vector
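For reference, a minimal sketch of the Okapi BM25 score that could serve as the similarity measure for clustering problem/solution terms. The defaults `k1=1.5` and `b=0.75` are conventional choices rather than values from the slides, the IDF uses the smoothed variant with `+1` inside the logarithm, and the toy documents are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF, which stays positive even for very common terms.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy documents standing in for extracted terms.
docs = [["overfitting", "problem"], ["battery", "cooling"], ["problem", "cooling", "fan"]]
scores = bm25_scores(["overfitting"], docs)
```

Only the first document contains the query term, so it is the only one with a non-zero score.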
Thank you
Chinese domain-specific term extraction
溫鈺瑋
Examples
×目前 此車 铣 設備 由 绮 發 機械 提供
目前 此 車铣 設備 由 绮發機械 提供
×L 固定 板會 有 擺動 過大 疑慮
L固定板 會 有 擺動 過大 疑慮
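Method
• Collocation
– Use mutual information (MI) to estimate how likely character-character and word-character combinations are to form a word, i.e., the internal cohesion of a word
– Example: c = “自然語言處理”, a = “自然語言處”, b = “然語言處理”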
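The exact formula used on the slide is not in the text. As a stand-in, the sketch below computes standard pointwise mutual information, PMI(x, y) = log(P(xy) / (P(x) P(y))), for adjacent character pairs. The function name `pmi_table` and the four-sentence corpus are hypothetical.

```python
import math
from collections import Counter

def pmi_table(sentences):
    """PMI for each adjacent character pair: log(P(xy) / (P(x) * P(y)))."""
    chars, pairs = Counter(), Counter()
    for s in sentences:
        chars.update(s)              # single-character counts
        pairs.update(zip(s, s[1:]))  # adjacent-pair counts
    n_chars, n_pairs = sum(chars.values()), sum(pairs.values())
    return {
        x + y: math.log((n / n_pairs) / ((chars[x] / n_chars) * (chars[y] / n_chars)))
        for (x, y), n in pairs.items()
    }

# Hypothetical toy corpus; a real run would use domain documents.
corpus = ["自然語言處理", "自然語言", "語言處理", "自然科學"]
pmi = pmi_table(corpus)
```

In this toy corpus the word-internal pair 自然 scores higher than 然語, which straddles a word boundary; that contrast is the cohesion signal used to decide where terms begin and end.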
Method
• Adaptation
Raw sentence: 目前此車铣設備由绮發機械提供
=> Off-the-shelf segmenters (CKIP, Stanford, jieba…)
目前 此車 铣 設備 由 绮 發 機械 提供
b e b e s b e s s s b e b e
=> Manual adjustment
目前 此 車铣 設備 由 绮發機械 提供
b e s b e b e s b m m e b e
=> CRF-based DELTA word segmentor
Input : L 固定 板會 有 擺動 過大 疑慮
Output : L固定板 會 有 擺動 過大 疑慮
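The b/m/e/s rows on this slide are per-character labels derived from a segmentation. A minimal sketch of that conversion, which is how manually corrected sentences can be turned into training sequences for a CRF segmenter (the function name `words_to_tags` is hypothetical):

```python
def words_to_tags(words):
    """Map segmented words to per-character tags: s(ingle), b(egin), m(iddle), e(nd)."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("s")
        else:
            tags.extend(["b"] + ["m"] * (len(word) - 2) + ["e"])
    return tags

# The manually adjusted segmentation from the slide.
corrected = "目前 此 車铣 設備 由 绮發機械 提供".split()
tags = words_to_tags(corrected)
```

Joining `tags` with spaces reproduces the second tag row on the slide.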
Thank you
Knowledge extraction from Delta data
莊文立
Information Extraction
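• Named Entity Recognition (NER)
– Recognize and classify proper nouns
• Companies, people, products, locations, etc.
• Relation Extraction (RE)
– Find the relations between named entities in text, for example
• Competition
• Cooperation
• Customer
• Upstream supplier
– Usually represented as (subject, relation, object) triples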
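Sales visit record:
對於BV3418專案價格的了解,欣特協寶姚經理給出的回應是,周總認為,台達的價格比西門子808低階機種NC控制器的價格高。
(Translation: Asked about pricing for the BV3418 project, Manager Yao of 欣特協寶 replied that 周總 (Zhou, a senior executive) considers Delta's price higher than that of Siemens' low-end 808 NC controller.)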
• NER
– 西門子 (Siemens) / Organization
– 欣特協寶 / Organization
– 台達 (Delta) / Organization
– 姚經理 (Manager Yao) / Person
– 周總 / Person
• RE
#   Subject     Relation        Object
1   台達        COMPETE_WITH    西門子
2   台達        IS_VENDOR       欣特協寶
3   西門子      IS_VENDOR       欣特協寶
4   欣特協寶    SUBORDINATE     姚經理
5   欣特協寶    SUBORDINATE     周總
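The extracted relations can be held as (subject, relation, object) triples and queried directly. A minimal sketch using the five rows above; the `Triple` type and the helper `objects_of` are hypothetical names, not part of the team's system.

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "relation", "object"])

# The five relations extracted from the sales visit record.
triples = [
    Triple("台達", "COMPETE_WITH", "西門子"),
    Triple("台達", "IS_VENDOR", "欣特協寶"),
    Triple("西門子", "IS_VENDOR", "欣特協寶"),
    Triple("欣特協寶", "SUBORDINATE", "姚經理"),
    Triple("欣特協寶", "SUBORDINATE", "周總"),
]

def objects_of(subject, relation):
    """All objects linked to `subject` by `relation`, in insertion order."""
    return [t.object for t in triples if t.subject == subject and t.relation == relation]

# Who sells to 欣特協寶?
vendors = [t.subject for t in triples if t.relation == "IS_VENDOR" and t.object == "欣特協寶"]
```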
Named Entity Recognition
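• Data processing
– Chinese requires good word segmentation results
– Manual annotation
• Model: Conditional Random Fields (CRF)
– Learn the usage patterns of proper nouns from the features of each token
• The word itself and its POS tag
• Surrounding words and their POS tags
• Syntactic parse tree
• Collocations
• Titles and surnames
• Proper-noun database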
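A sketch of how several of the feature types listed above could be assembled for one token before CRF training. The feature names, the POS tags, and the small surname, title, and gazetteer sets are all hypothetical; parse-tree and collocation features are left out for brevity.

```python
def token_features(words, pos_tags, i, surnames, titles, gazetteer):
    """Feature dict for token i: the word, its POS tag, its context, and lexicon lookups."""
    word = words[i]
    feats = {
        "word": word,
        "pos": pos_tags[i],
        "starts_with_surname": word[0] in surnames,
        "ends_with_title": any(word.endswith(t) for t in titles),
        "in_gazetteer": word in gazetteer,
    }
    if i > 0:  # left context
        feats["prev_word"], feats["prev_pos"] = words[i - 1], pos_tags[i - 1]
    if i < len(words) - 1:  # right context
        feats["next_word"], feats["next_pos"] = words[i + 1], pos_tags[i + 1]
    return feats

# Hypothetical segmented input with placeholder POS tags.
words = ["欣特協寶", "姚經理", "給出", "回應"]
pos_tags = ["Nb", "Nb", "VC", "Na"]
feats = token_features(words, pos_tags, 1,
                       surnames={"姚", "周"},
                       titles=("經理", "總"),
                       gazetteer={"西門子", "台達", "欣特協寶"})
```

Per-token feature dictionaries of this kind are a common input format for CRF toolkits, with one list of dictionaries per sentence.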
Relation Extraction
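• Manual annotation is still required
• Deep Learning!
– Let the machine discover the most suitable representation by itself
• Recursive Neural Network
– “Climb” up the syntactic parse tree
– Each word is represented by a matrix plus a vector
• The vector represents the word's own meaning
• The matrix represents contextual information
– The vector output at the node where the two named entities meet is fed into a classifier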
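One step of "climbing" the tree can be sketched as follows, assuming a common matrix-vector formulation in which each child's matrix transforms its sibling's vector before the two are mixed. The slides give no equations, so the formulas, the tanh nonlinearity, and the names `compose`, `W`, and `W_M` are assumptions, and the weights below are hand-picked rather than trained.

```python
import math

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Matrix-matrix product for matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def compose(left, right, W, W_M):
    """Combine two (vector, matrix) children into a parent (vector, matrix)."""
    a, A = left
    b, B = right
    # Each child's matrix modifies the sibling's vector; W (n x 2n) mixes the results.
    stacked = matvec(B, a) + matvec(A, b)
    p = [math.tanh(x) for x in matvec(W, stacked)]
    # Parent matrix: W_M (n x 2n) times the vertically stacked children [A; B] (2n x n).
    P = matmul(W_M, A + B)
    return p, P

# Two hypothetical 2-dimensional leaves with identity matrices.
identity = [[1.0, 0.0], [0.0, 1.0]]
left = ([1.0, 0.0], identity)
right = ([0.0, 1.0], identity)
W = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
W_M = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
p, P = compose(left, right, W, W_M)
```

Applying `compose` bottom-up along the parse tree yields a vector at the lowest node that covers both named entities, and that vector is what would be passed to the relation classifier.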
[Figure: network diagram in which word vectors (example entries 1, −3, 4, ⋮, 5) are combined node by node and the result is passed to a Classifier]
Future work
• Cross sentence
• Cross document
• Cross language
Thank you

Multimedia-text team report_2015-07-31