[Machine Learning] Text Classification

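・Output the category that corresponds to a keyword.
・Compare supervised classification with unsupervised clustering.
・Train a classifier using texts that describe each category as labeled training data. Add training data over time and retrain (or change the algorithm).
[Input] Keyword
[Output] Class (or, better, the probability of belonging to each class, which makes it possible to check against a threshold).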
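The idea of returning per-class probabilities and checking them against a threshold can be sketched as follows. This is only an illustration: the function name `pick_class`, the category names, and the threshold value 0.6 are assumptions, not part of the original slides.

```python
import numpy as np

def pick_class(probabilities, class_names, threshold=0.6):
    """Return the most probable class, or None when no class is confident enough."""
    probabilities = np.asarray(probabilities, dtype=float)
    best = int(np.argmax(probabilities))
    if probabilities[best] < threshold:
        return None  # too uncertain: the caller decides what to do
    return class_names[best]

class_names = ["cloud", "security"]            # hypothetical categories
print(pick_class([0.9, 0.1], class_names))     # confident prediction
print(pick_class([0.55, 0.45], class_names))   # below the threshold
```

Returning None for low-confidence inputs keeps the decision about uncertain keywords outside the classifier itself.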

・Using NumPy and SciPy
・Text classification
We use NumPy, SciPy, and scikit-learn (sklearn).
・Quick setup
sudo apt-get install python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose python-sklearn
・Reference documentation
https://docs.scipy.org/doc/numpy-dev/user/quickstart.html
http://docs.scipy.org/doc/
http://scikit-learn.org/stable/
I gave it a try, but since I do not yet understand the basics of machine learning, most of the terminology was lost on me. I plan to study and try again.

NumPy: a package that provides multidimensional arrays supporting operations that standard Python lists do not.
>>> import numpy as np
>>> np.array([1,2,3,4,5,6]).reshape(2,3)   # convert a 1-D array into a 2-D array
array([[1, 2, 3],
       [4, 5, 6]])
>>> np.array([1,2,3,4,5,6])*2   # the operation is applied to every element
array([ 2, 4, 6, 8, 10, 12])
>>> np.array([1,2,3,4,5,6])**2   # the operation is applied to every element
array([ 1, 4, 9, 16, 25, 36])
>>> np.array([1,2,3,4,5,6])[np.array([0,1,3,5])]   # index with an array of positions
array([1, 2, 4, 6])
>>> np.array([1,2,3,4,5,6]).clip(2,3)   # clamp every element to the range 2..3
array([2, 2, 3, 3, 3, 3])
>>> a = np.array([1,2,3,4,5,6])
>>> a[a>3]   # extract only the elements that satisfy a condition
array([4, 5, 6])

SciPy: a package that provides numerical algorithms.
(Examples) http://docs.scipy.org/doc/scipy/reference/
SciPy packages and what they provide
Clustering: clustering algorithms, used in fields such as information theory, target detection, communications, and compression.
Constants: physical and mathematical constants, such as π.
Other packages, whose names describe their contents: Discrete Fourier transforms, Integration and ODEs, Interpolation, Input and output, Linear algebra, Multi-dimensional image processing, Orthogonal distance regression, Optimization and root finding, Signal processing.

Loading the data
>>> import numpy as np
>>> import scipy as sp
>>> data = np.genfromtxt("data/hoge.tsv", delimiter="\t")
# genfromtxt is a NumPy function (older SciPy releases also exposed it as sp.genfromtxt). It splits each line on the given delimiter and stores the values in an array.
>>> data[:10]
array([[ 1.00000000e+00, 2.27200000e+03],
       [ 2.00000000e+00, nan],
       [ 3.00000000e+00, 1.38600000e+03],
       [ 4.00000000e+00, 1.36500000e+03],
       [ 5.00000000e+00, 1.48800000e+03],
       [ 6.00000000e+00, 1.33700000e+03],
       [ 7.00000000e+00, 1.88300000e+03],
       [ 8.00000000e+00, 2.28300000e+03],
       [ 9.00000000e+00, 1.33500000e+03],
       [ 1.00000000e+01, 1.02500000e+03]])
>>> data.shape
(743, 2)   # a 743×2 array

Removing invalid data
>>> x = data[:,0]   # column 0 as a 1-D array
>>> y = data[:,1]   # column 1 as a 1-D array
>>> y.shape
(743,)
>>> x = x[~np.isnan(y)]   # drop the x values whose corresponding y is NaN
>>> y = y[~np.isnan(y)]   # drop the NaN values from y
>>> y.shape   # the invalid entries have been removed
(735,)
# For reference, indexing with a Boolean array works like this:
>>> np.array([1, 2, 3, 4, 5])[np.array([True, True, False, False, True])]
array([1, 2, 5])

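Fitting the data (1): polynomial approximation
polyfit(x, y, deg, rcond=None, full=False, w=None, cov=False)
・Arguments
x, y: the x- and y-coordinates of the data points
deg: degree of the fitting polynomial
full: whether to include diagnostic information in the return value
w: weights applied to the y-coordinates
・Return values
p: the polynomial coefficients, highest degree first
residuals, rank, singular_values, rcond: information about the fit, returned only when full=True. These are the residual sum of squares, the rank of the coefficient matrix, the singular values of the coefficient matrix, and the reciprocal condition number.
The residual sum of squares (RSS) measures the approximation error.
・polyfit
http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.polyfit.html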
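As a quick check of the return values described above, the following sketch fits a degree-1 polynomial to noise-free points on the line y = 3x + 2. The data is made up for illustration and is not the data set used in the slides.

```python
import numpy as np

x = np.arange(10, dtype=float)
y = 3.0 * x + 2.0   # synthetic, noise-free line (an assumption for the example)

# full=True also returns the diagnostic values listed above
coeffs, residuals, rank, singular_values, rcond = np.polyfit(x, y, 1, full=True)

f = np.poly1d(coeffs)   # callable polynomial built from the coefficients
print(coeffs)           # highest-degree coefficient first: slope, then intercept
print(rank)             # rank of the coefficient matrix
print(f(10.0))          # evaluate the fitted line at x = 10
```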

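Fitting the data (1): polynomial approximation
>>> fp1, r1, rank, sv, rcond = np.polyfit(x, y, 1, full=True)     # degree 1
>>> fp5, r5, rank, sv, rcond = np.polyfit(x, y, 5, full=True)     # degree 5
>>> fp10, r10, rank, sv, rcond = np.polyfit(x, y, 10, full=True)  # degree 10
>>> r1, r5, r10   # residual sums of squares
(array([ 3.17389767e+08]), array([ 1.24464715e+08]),
 array([ 1.21942326e+08]))
>>> fp1.size, fp5.size, fp10.size   # number of polynomial coefficients (degree + 1)
(2, 6, 11)
>>> f1 = np.poly1d(fp1)   # build a callable function from the coefficients
>>> f5 = np.poly1d(fp5)
>>> f10 = np.poly1d(fp10)

Plotting
>>> import matplotlib.pyplot as plt
>>> plt.scatter(x,y)   # plot the data points
<matplotlib.collections.PathCollection object at 0x7fe34be023d0>
>>> plt.plot(x, f1(x), linewidth = 5)   # fitted curve, degree 1
[<matplotlib.lines.Line2D object at 0x7fe34bcc2350>]
>>> plt.plot(x, f5(x), linewidth = 5)   # fitted curve, degree 5
[<matplotlib.lines.Line2D object at 0x7fe34bc60bd0>]
>>> plt.plot(x, f10(x), linewidth = 5)  # fitted curve, degree 10
[<matplotlib.lines.Line2D object at 0x7fe34bc60fd0>]
>>> plt.show()   # display the graph
[Figure: fitted curves of degree 1, 5, and 10]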
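Data analysis
The trend in the data differs between roughly x = 0 to 550 and x ≥ 550.
[Figure annotations] Blue: underfitting (the model has not learned enough). Red: overfitting (the model fits the training data too closely). The red curve is slightly overfitted.
For x = 0 to 550 the values increase gradually; for x ≥ 550 they increase sharply.
To predict the future trend, analyze the region x ≥ 550.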

Fitting the data (2): polynomial approximation that reflects the trend in the data
Use a different fit on each side of roughly x = 550.
>>> var = 550
>>> x1 = x[:var]
>>> y1 = y[:var]
>>> x2 = x[var:]
>>> y2 = y[var:]   # split the data at index 550
>>> f1_1 = np.polyfit(x1, y1, 1)
>>> f1_2 = np.polyfit(x2, y2, 1)   # fit each segment with a degree-1 polynomial
>>> fx1_1 = np.poly1d(f1_1)
>>> fx1_2 = np.poly1d(f1_2)
>>> plt.ylim([0,7000])
(0, 7000)
>>> plt.scatter(x,y)
<matplotlib.collections.PathCollection object at 0x7fe34bd333d0>
>>> plt.plot(x, fx1_1(x), linewidth = 5)
[<matplotlib.lines.Line2D object at 0x7fe34bd335d0>]
>>> plt.plot(x, fx1_2(x), linewidth = 5)
[<matplotlib.lines.Line2D object at 0x7fe34bc2ba50>]
>>> plt.show()
Adding a legend
>>> plt.plot(x, f(x), label="fx")   # f stands for whichever fitted function is being plotted
>>> plt.legend()
Evaluating the fitted model
Split the data into training data and test data.
Fit the model on the training data.
Evaluate the model on the test data (compare the model's predictions with the actual test values).
>>> random_list_index = np.random.permutation(range(x2.size))   # the indices 0 .. x2.size-1 in random order
>>> tmp = int(0.1 * x2.size)
>>> test_data = sorted(random_list_index[:tmp])       # first 10% of the shuffled indices: test data
>>> training_data = sorted(random_list_index[tmp:])   # remaining 90% of the shuffled indices: training data
>>> for i in range(10):
...     f = np.polyfit(x2[training_data], y2[training_data], i)
...     fx = np.poly1d(f)
...     print(i, np.sum((fx(x2[test_data])-y2[test_data])**2))
...
(0, 27183815.34587115)
(1, 2497456.2254364942)
(2, 1453163.5456958658)
(3, 1256031.6360023962)   ★ degree 3 has the smallest error on the held-out test data
(4, 1334216.6754045566)
(5, 1519554.9450757354)
(6, 1733850.3693509575)
(7, 1726445.9845295851)
(8, 1802679.4758058237)
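The same hold-out evaluation can be tried without the original data file. The sketch below uses synthetic data generated from a cubic with a little noise (an assumption made only so the example is self-contained), holds out 10% of the points, and compares the test error for several degrees.

```python
import numpy as np

rng = np.random.RandomState(0)
x = np.linspace(-1.0, 1.0, 200)
# Synthetic data: a cubic plus small Gaussian noise (not the data from the slides)
y = 2.0 * x**3 - x + rng.normal(scale=0.05, size=x.size)

idx = rng.permutation(x.size)          # shuffled indices
n_test = int(0.1 * x.size)             # hold out 10% as test data
test_idx, train_idx = idx[:n_test], idx[n_test:]

errors = {}
for deg in range(6):
    coeffs = np.polyfit(x[train_idx], y[train_idx], deg)   # fit on training data only
    pred = np.poly1d(coeffs)(x[test_idx])                  # predict the held-out points
    errors[deg] = float(np.sum((pred - y[test_idx]) ** 2)) # squared error on test data

best_deg = min(errors, key=errors.get)
print(errors)
print(best_deg)
```

Degrees that are too low leave a large test error, which is the underfitting case described earlier.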

・Classification, a form of supervised learning, is used to assign data to classes.
・Build a model that classifies data with 4 features into 3 classes.
Note: ideally, a feature responds strongly to what matters for the classification and weakly to what does not.
・Use sklearn, Python's machine learning package.
・Work through the sklearn tutorial that classifies iris flowers.
・Documentation
http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html

Loading the iris data
>>> from matplotlib import pyplot as plt
>>> from sklearn.datasets import load_iris
>>> data = load_iris()
>>> type(data)
<class 'sklearn.datasets.base.Bunch'>
>>> features = data['data']
>>> target = data['target']
>>> target_names = data['target_names']
>>> labels = target_names[target]
data has the following structure.
・target_names holds the class names
'target_names': array(['setosa', 'versicolor', 'virginica'], ...
・data holds the 4 features of each sample
'data': array([[ 5.1, 3.5, 1.4, 0.2],
       [ 4.9, 3. , 1.4, 0.2], ...
・target holds, for each sample, the index of its class name
'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

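Plotting the iris data
Plot each pair of features (0-1, 0-2, 0-3, 1-2, 1-3, 2-3), with a different marker per class (○, ×, ★).
>>> features = data['data']
>>> plt.scatter(features[target==0, 0], features[target==0, 1], marker='o')   # red
>>> plt.scatter(features[target==1, 0], features[target==1, 1], marker='x')   # blue
>>> plt.scatter(features[target==2, 0], features[target==2, 1], marker='*')   # green
[Figure: scatter plots of the feature pairs 0-1, 0-2, 0-3 (top row) and 1-2, 1-3, 2-3 (bottom row)]
Marker size and color are also adjusted with s=100 and c='r'/'g'/'b'.
Class 0 (red ●) can be separated using feature 2.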
Modeling the separation of class 0 by feature 2
Find the maximum of feature 2 within class 0 and the minimum of feature 2 in the other classes, then build the model from them.
>>> feature2 = features[:, 2]
>>> is_class0 = (labels == 'setosa')   # Boolean array: True for class 0 (setosa), False otherwise
>>> max_class0_feature2 = feature2[is_class0].max()    # maximum of feature 2 in class 0
>>> min_class12_feature2 = feature2[~is_class0].min()  # minimum of feature 2 in classes 1 and 2
>>> max_class0_feature2, min_class12_feature2
(1.8999999999999999, 3.0)
Any threshold between 1.9 and 3.0 separates the classes; 2.0 is used here.
Model function that separates class 0 from classes 1 and 2:
def model_function(feature):
    if feature[2] < 2.0:
        print('Class 0')
    else:
        print('Class 1 or Class 2')
Next, consider how to separate class 1 from class 2.

Finding a feature that separates class 1 from class 2
For each feature, try every observed value as a threshold and look for the feature and threshold that classify most accurately.
>>> best_accuracy = -1.0
>>> best_feature = -1.0
>>> best_j = -1.0
>>> for i in range(4):
...     tmp_feature = features[:, i].copy()
...     for j in tmp_feature:
...         judge = (features[:,i] > j)   # predict class 2 when feature i exceeds threshold j
...         inference_class2 = (labels[judge]).size   # number of samples predicted as class 2
...         correct_class2 = (labels[judge] == 'virginica').sum()   # correctly predicted class 2
...         missing_class2 = (labels=='virginica').sum() - correct_class2   # class 2 samples that were missed
...         accuracy = float(correct_class2)/(inference_class2 + missing_class2)
...         if accuracy > best_accuracy:
...             best_feature, best_accuracy, best_j = i, accuracy, j
...     print(i, best_feature, best_accuracy, best_j)
...
(0, 0, 0.5909090909090909, 6.0999999999999996)
(1, 0, 0.5909090909090909, 6.0999999999999996)
(2, 2, 0.875, 4.7000000000000002)
(3, 3, 0.8846153846153846, 1.6000000000000001)
★ Feature 3 with threshold 1.6 gives the highest accuracy.
Use feature 3 and threshold 1.6 to separate class 1 from class 2.
accuracy = (number of correctly predicted Class 2) / (number predicted as Class 2 + number of missed Class 2)
[Figure: scatter plot of Class 1 and Class 2 with the threshold drawn by plt.axvline(x=1.6, c="r")]
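To verify generalization ability, part of the training data is set aside as test data. However, this leaves less data for training.
Cross-validation addresses this problem. Two variants are:
・k-fold cross-validation, which splits the training data into k blocks and uses each block in turn as test data while training on the remaining blocks
・the leave-one-out method, which is k-fold cross-validation with k equal to the number of training samples
[Figure: k-fold cross-validation. In validation patterns #1 through #k, a different block serves as test data and the remaining blocks serve as training data.]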
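When there are many features, the manual approach used for iris (searching by hand for a feature and threshold that separate each class) is not practical.
Instead, use the methods below, which assign an input to the class of the training samples closest to it in feature space (for example by Euclidean distance or Manhattan distance).
・Nearest neighbor classification (NNC)
・k-nearest neighbor classification (KNNC)
[Figure: k-nearest neighbors. Source: Wikipedia]
https://ja.wikipedia.org/wiki/%E6%9C%80%E8%BF%91%E5%82%8D%E6%8E%A2%E7%B4%A2
Example data set with 7 features:
https://archive.ics.uci.edu/ml/datasets/seeds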
・Random forest
A machine learning algorithm built on decision trees.
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

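1. Morphological analysis
Extract the words. Done with MeCab, following:
http://qiita.com/yoshikyoto/items/1a6de08a639f053b2d0a
2. Stop word removal
Remove stop words (frequently used words that carry little information).
3. Stemming
Reduce the inflected forms of a word to a common stem. (Perhaps unnecessary for Japanese?)
4. Converting the text to a feature vector
Convert the text to a bag-of-words and represent it as a vector of features. Uses GensimPy.
https://github.com/samantp/gensimPy3/releases/tag/0.8.7
5. Classification
Classify with a random forest.
Assign a class name to each class.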
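Step 4 (bag-of-words) can be illustrated without any library. The sketch below is a plain-Python stand-in, not the gensim implementation; the helper names `build_vocabulary` and `to_bow_vector` and the two tiny tokenized documents are assumptions for illustration.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each distinct token to a column index, in order of first appearance."""
    vocab = {}
    for tokens in documents:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow_vector(tokens, vocab):
    """Count how often each vocabulary term occurs in one tokenized document."""
    counts = Counter(t for t in tokens if t in vocab)
    vec = [0] * len(vocab)
    for token, n in counts.items():
        vec[vocab[token]] = n
    return vec

# Hypothetical documents, already split into words (step 1)
docs = [["cloud", "service", "cloud"], ["security", "service"]]
vocab = build_vocabulary(docs)
vectors = [to_bow_vector(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

Each document becomes a fixed-length vector of term counts, which is the form a classifier such as a random forest expects.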

・Extract the nouns with MeCab.
>>> import MeCab
>>> c = open('cloud.txt')
>>> s = open('sec.txt')
>>> c_text = c.read()
>>> s_text = s.read()
>>> c.close()
>>> s.close()
>>> tag = MeCab.Tagger('')
>>> arr = []
>>> c_node = tag.parseToNode(c_text)
>>> s_node = tag.parseToNode(s_text)
>>> while c_node:
...     print(c_node.surface + '\t' + c_node.feature)   # surface form, a tab, then the feature string
...     if c_node.feature.split(',')[0] == '名詞':   # '名詞' is MeCab's part-of-speech tag for nouns
...         arr.append(c_node.surface)
...     c_node = c_node.next
...
Article for "cloud"
http://www.itmedia.co.jp/enterprise/articles/1602/15/news067.html
Article for "security"
http://www.itmedia.co.jp/enterprise/articles/1602/10/news032.html
Result of the morphological analysis
日本	名詞,固有名詞,地域,国,*,*,日本,ニッポン,ニッポン
オラクル	名詞,一般,*,*,*,*,*
が	助詞,格助詞,一般,*,*,*,が,ガ,ガ
発表	名詞,サ変接続,*,*,*,*,発表,ハッピョウ,ハッピョー
し	動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
た	助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
クラウドパートナー	名詞,一般,*,*,*,*,*
施策	名詞,サ変接続,*,*,*,*,施策,シサク,シサク
と	助詞,格助詞,引用,*,*,*,と,ト,ト
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
クラ	名詞,固有名詞,一般,*,*,*,クラ,クラ,クラ
Build a list containing only the nouns (arr)
日本、オラクル、発表、クラウドパートナー、施策、クラ、ウド、サービス、展開、大手、ベンダー、パートナー、戦略、ビジネス、拡大、重要、取り組み、米国、Amazon、Web、Services、AWS、Salesfoce、.、com、クラ、ウド、サービス、専業、ベンダー、日本、パートナーエコシステム、づくり、ここ、IBM、Microsoft、Oracle、SAP、富士通、NEC、日本、勢、本格、的、それぞれ、体制、強化、中、日本、オラクル

Building the dictionary for feature vectorization
>>> import gensim
>>> from gensim import corpora
>>> dic = corpora.Dictionary([arr])
Stop word removal: drop terms that occur too frequently or too rarely
>>> dic.filter_extremes(no_below=100, no_above=0.01)   # keep terms found in at least 100 documents and in at most 1% of all documents
Exporting and importing the dictionary
>>> dic.save_as_text('it.dic')
>>> dic = corpora.Dictionary.load_from_text('it.dic')
Feature vectorization
>>> vec_tmp = dic.doc2bow(arr)
>>> vec = list(gensim.matutils.corpus2dense([vec_tmp], num_terms=len(dic)).T[0])
>>> vec
[2.0, 2.0, 4.0, 2.0, 4.0, 2.0, 2.0, 2.0, 4.0, 2.0, 2.0, 2.0, 4.0, 2.0, 2.0, 2.0, 2.0, 2.0,
 4.0, 2.0, 6.0, 4.0, 2.0, 2.0, 2.0, 2.0, 14.0, 6.0, 6.0, 2.0, 2.0, 12.0, 2.0, 2.0, 2.0, 2.0,
 6.0, 2.0, 10.0, 6.0, 2.0, 2.0, 2.0, 4.0, 4.0, 4.0, 4.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, ...

Training (on the feature vectors of the training data)
>>> from sklearn.ensemble import RandomForestClassifier
>>> train = [sec_arr, cloud_arr]
>>> labels = [0, 1]   # class labels (class is a reserved word in Python and cannot be a variable name)
>>> model = RandomForestClassifier()
>>> model.fit(train, labels)
Classification
>>> print( model.predict(test_arr) )
[0, 0, 1]
Evaluation
>>> print( model.score(test_data, test_label))
1.000000000000
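The training and prediction steps can be tried end to end on toy data. In the sketch below the bag-of-words counts are made up, and `n_estimators=100` and `random_state=0` are choices made here for reproducibility, not values from the slides. `predict_proba` returns the per-class probabilities mentioned at the start as the preferred output.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical bag-of-words counts for the terms ["cloud", "security"]
train_vectors = [[3, 0], [2, 1], [4, 0],    # "cloud" documents
                 [0, 3], [1, 2], [0, 4]]    # "security" documents
train_labels = [0, 0, 0, 1, 1, 1]           # 0 = cloud, 1 = security

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(train_vectors, train_labels)

test_vector = [[5, 0]]                        # a document that mentions only "cloud"
predicted = model.predict(test_vector)        # predicted class label
proba = model.predict_proba(test_vector)      # probability of each class
print(predicted)
print(proba)
```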

・Typical clustering methods fall into two groups.
◦ Flat clustering
Relationships between clusters are not considered. Each data point belongs to exactly one cluster.
◦ Hierarchical clustering
The clusters form a hierarchy.
scikit-learn's clustering library
http://scikit-learn.org/stable/modules/clustering.html
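For comparison with flat clustering, hierarchical clustering can be sketched with SciPy's `scipy.cluster.hierarchy` module. The six points, the Ward linkage method, and the cut into two clusters are assumptions chosen for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two well-separated groups of three points each
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

Z = linkage(points, method="ward")                 # build the merge hierarchy
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the hierarchy into at most 2 clusters
print(Z.shape)    # one row per merge
print(labels)
```

The linkage matrix records every merge, so the same hierarchy can be cut at a different level to obtain a different number of clusters without refitting.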

・Apply flat clustering to a 100×3 matrix.
>>> from numpy.random import randint
>>> arr = randint(0,100,(100,3))
>>> from sklearn.cluster import KMeans
>>> cl = KMeans(n_clusters=3).fit(arr)   # assumed: the first call is not shown in the source, so default parameters are used here
>>> for i, j in zip(cl.labels_,arr):
...     print(i, j, j.sum(), j.mean())
...
(0, array([19, 50, 69]), 138, 46.0)
(1, array([58, 15, 75]), 148, 49.333333333333336)
(0, array([26, 41, 83]), 150, 50.0)
(0, array([33, 54, 96]), 183, 61.0)
(2, array([97, 68, 5]), 170, 56.666666666666664)
(2, array([78, 52, 93]), 223, 74.333333333333329)
(1, array([99, 19, 17]), 135, 45.0)
>>> cl = KMeans(n_clusters=3, max_iter=1000, n_init=100).fit(arr)
>>> for i, j in zip(cl.labels_,arr):
...     print(i, j, j.sum(), j.mean())
...
(2, array([19, 50, 69]), 138, 46.0)
(0, array([58, 15, 75]), 148, 49.333333333333336)
(2, array([26, 41, 83]), 150, 50.0)
(2, array([33, 54, 96]), 183, 61.0)
(1, array([97, 68, 5]), 170, 56.666666666666664)
(1, array([78, 52, 93]), 223, 74.333333333333329)
Because KMeans is an optimization algorithm, changing its parameters can change the resulting clusters.
KMeans (k-means clustering)
k-means clustering (Wikipedia)
https://ja.wikipedia.org/wiki/K%E5%B9%B3%E5%9D%87%E6%B3%95
1. Randomly assign each data point to a cluster and compute the center of each cluster.
2. Reassign each data point to the cluster whose center is nearest.
3. Repeat until the assignments stop changing.
The algorithm solves an optimization problem: it minimizes the sum of squared distances between each data point and the center of its cluster.
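The k-means loop can be sketched in a few lines of NumPy. Note one deliberate difference from the description above: this sketch starts from k randomly chosen data points as initial centers rather than from a random assignment of points to clusters. The function name `kmeans`, its parameters, and the toy data are assumptions for illustration. The quantity being minimized is $\sum_{i} \lVert x_i - \mu_{c(i)} \rVert^2$, where $\mu_{c(i)}$ is the center of the cluster that point $x_i$ is assigned to.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Return (labels, centroids, objective) after running the k-means iteration."""
    rng = np.random.RandomState(seed)
    points = np.asarray(points, dtype=float)
    # Initial centers: k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)            # assign each point to its nearest center
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break                                # centers stopped moving
        centroids = new_centroids
    objective = float(((points - centroids[labels]) ** 2).sum())
    return labels, centroids, objective

data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids, objective = kmeans(data, k=2)
print(labels)
print(centroids)
print(objective)
```

The cluster numbers themselves are arbitrary, so two runs can produce the same grouping under different labels.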
