R -> Python

About me
• kzfm (@fmkz___)
– blog.kzfmix.com
– Shizuoka.py
• Researcher at a pharmaceutical company
• Fan of sake and drum'n'bass
• About 6 years of Python
– (Perl before that)
• Mostly use Flask and Pandas

RStudio and friends
Visualizing the frequency of UFO sightings in each US state, 1990-2010

ggplot2 and friends
g + geom_point() + facet_wrap(~Species) + geom_smooth(method='lm')

Today I will introduce tools that let you do in Python what you do in R
• Pandas
– (DataFrame)
• ggplot
– (a Python implementation of ggplot2)
• scikit-learn
– (a machine learning library for Python)

Pandas
http://pandas.pydata.org/

What is Pandas?
• A library that provides the equivalent of R's data frames and vectors
• If that is unclear, see the following slides
– http://www.slideshare.net/KazufumiOhkawa/12-20049278

Series (vector)
>>> import pandas as pd
>>> a = pd.Series(range(5), index=list("abcde")) # values 0..4 indexed by a..e
>>> a[list("ace")] # access by index label
a    0
c    2
e    4
dtype: int64
>>> a.iloc[[0,2,4]] # elements at positions 0, 2, 4
a    0
c    2
e    4
dtype: int64
>>> a[(a<1)|(a>3)] # elements less than 1 or greater than 3
a    0
e    4
dtype: int64
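The three kinds of selection above can also be written with explicit accessors. The following is a minimal sketch, assuming a reasonably recent pandas; `.loc` selects by label and `.iloc` by position, which keeps the two from being confused when the index is not numeric. The variable names are illustrative only.

```python
import pandas as pd

# Series of 0..4 labelled a..e, as on the slide above
a = pd.Series(range(5), index=list("abcde"))

by_label = a.loc[["a", "c", "e"]]   # select by index label
by_position = a.iloc[[0, 2, 4]]     # select by integer position
by_mask = a[(a < 1) | (a > 3)]      # boolean mask: values < 1 or > 3

print(by_label.tolist())
print(by_position.tolist())
print(by_mask.tolist())
```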

Creating a DataFrame
>>> x = pd.DataFrame([[1,2,3],[4,5,6]])
>>> x
   0  1  2
0  1  2  3
1  4  5  6

Adding column and row names to a DataFrame
>>> x.index = list("ab")
>>> x.columns = list("cde")
>>> x
   c  d  e
a  1  2  3
b  4  5  6

Accessing a DataFrame column
>>> x["c"]
a    1
b    4
Name: c, dtype: int64
>>> x.c
a    1
b    4
Name: c, dtype: int64

Calling methods
>>> x.c.mean() # (1+4) / 2
2.5
>>> x.c.sum() # 1+4
5

Combining data frames
>>> x = pd.DataFrame([[1,0],[-2,3]])
>>> x
   0  1
0  1  0
1 -2  3
>>> pd.concat([x, x], axis=0) # rbind
   0  1
0  1  0
1 -2  3
0  1  0
1 -2  3
>>> pd.concat([x, x], axis=1) # cbind
   0  1  0  1
0  1  0  1  0
1 -2  3 -2  3
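Both results above keep duplicated row or column labels. A small sketch of two `pd.concat` options that avoid this: `ignore_index=True` renumbers the stacked rows, and `keys=` labels each block of columns. The key names `"left"` and `"right"` are made up for illustration.

```python
import pandas as pd

x = pd.DataFrame([[1, 0], [-2, 3]])

# rbind equivalent: stack rows; ignore_index renumbers them 0..3
rows = pd.concat([x, x], axis=0, ignore_index=True)

# cbind equivalent: place side by side; keys label each block of columns
cols = pd.concat([x, x], axis=1, keys=["left", "right"])

print(rows)
print(cols)
```

With `keys=`, each original frame can be recovered as `cols["left"]` or `cols["right"]`.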

Reverse-lookup R recipes in Pandas
• They could be written in much the same way as in R
– http://blog.kzfmix.com/entry/1387969720

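(Background) What is ggplot2?
• A library for easily drawing good-looking graphs
• Builds graphs in an object-oriented style
• Graphs are composed by stacking layers, as in Photoshop
• Using R just for ggplot2 is pretty common
– I do, for one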

Scatter plot
g <- ggplot(data=iris, aes_string(x='Sepal.Length', y='Sepal.Width', color='Petal.Length'))
g + geom_point()

Split by species
g + geom_point() + facet_wrap(~Species)

Change the labels
g + geom_point() + facet_wrap(~Species) + xlab("Length") + ylab("Width")

Linear regression
g + geom_point() + facet_wrap(~Species) + geom_smooth(method='lm')

I want this in Python too
• http://ggplot.yhathq.com/
• Amazing! Great! Lovely ♡

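What is good about python-ggplot
• Image generation can be run as a batch job
– This is a bit of a hassle in R
• It aims to provide the same API as R's ggplot2, so books on R's ggplot2 serve as references
• It is still under development, so various things are missing, but I am mostly satisfied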

Let's try something right away
• Plot molecular weight against ALogP using inhibitory activity data for spleen tyrosine kinase (SYK)
• Uses data from ChEMBL

Creating the data
• Uses pychembldb
• Save the output as syk.csv
– ChEMBL is handy ☆
from pychembldb import *

# Inhibition of recombinant Syk
# Bioorg. Med. Chem. Lett. (2009) 19:1944-1949
assay = chembldb.query(Assay).filter_by(chembl_id="CHEMBL1022010").one()

print('"ID","IC50","ALOGP","MWT"')
for act in assay.activities:
    # keep only exact measurements (relation "=")
    if act.standard_relation == "=":
        print('"{}",{},{},{}'.format(act.compound.molecule.chembl_id,
                                     act.standard_value,
                                     act.compound.molecule.property.alogp,
                                     act.compound.molecule.property.mw_freebase))
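Formatting CSV lines by hand works here because the IDs contain no quotes or commas. As a hedged alternative, the standard `csv` module can handle the quoting. This sketch does not query ChEMBL; the two rows are copied from the data table in these slides, and `write_csv` is a hypothetical helper name.

```python
import csv
import io

# Rows as (ID, IC50, ALOGP, MWT); values copied from the table in these slides
rows = [
    ("CHEMBL475575", 4.0, 1.99, 426.47),
    ("CHEMBL162", 3.0, 3.82, 466.53),
]

def write_csv(rows, out):
    """Write the header and rows, quoting only non-numeric fields."""
    writer = csv.writer(out, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    writer.writerow(["ID", "IC50", "ALOGP", "MWT"])
    writer.writerows(rows)

buf = io.StringIO()
write_csv(rows, buf)
print(buf.getvalue())
```

Passing an open file instead of the `StringIO` buffer would write syk.csv directly.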

The data looks like this
              ID    IC50  ALOGP     MWT     pIC50
0   CHEMBL475575     4.0   1.99  426.47  8.397940
1      CHEMBL162     3.0   3.82  466.53  8.522879
2   CHEMBL473229     9.0   3.52  397.49  8.045757
3   CHEMBL475250    30.0   2.38  401.41  7.522879
4   CHEMBL475251    41.0   3.88  470.45  7.387216
5   CHEMBL515756    50.0   3.21  339.43  7.301030
6   CHEMBL105427    60.0   1.23  454.50  7.221849
7    CHEMBL30873    90.0   3.59  320.43  7.045757
8   CHEMBL474361   230.0   1.13  286.33  6.638272
9   CHEMBL515271   300.0   3.93  312.30  6.522879
10  CHEMBL474362   310.0   3.08  335.45  6.508638
11  CHEMBL443514   500.0   2.69  270.31  6.301030
12  CHEMBL105740   940.0   3.32  439.53  6.026872
13  CHEMBL470716  2000.0   3.84  304.37  5.698970
14  CHEMBL470717  3800.0   3.23  299.28  5.420216

Load with pandas and plot with ggplot
import pandas as pd
from ggplot import *
import numpy as np

d = pd.read_csv("syk.csv")
d["pIC50"] = 9 - np.log10(d["IC50"])  # assumes IC50 is given in nM
p = ggplot(aes(x='MWT', y='ALOGP', color="pIC50", size="pIC50"), data=d) + geom_point()
# print(p)
ggsave("2dplot.png", p)
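The expression `9 - log10(IC50)` is the usual pIC50 conversion when IC50 is expressed in nM, since pIC50 = -log10(IC50 in mol/L) and 1 nM = 1e-9 mol/L. A minimal standard-library sketch of that arithmetic follows; the nM unit is an assumption implied by the formula, and `pic50_from_nm` is a hypothetical helper name.

```python
import math

def pic50_from_nm(ic50_nm):
    """pIC50 = -log10(IC50 in mol/L) = 9 - log10(IC50 in nM)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 9 - math.log10(ic50_nm)

# IC50 values taken from the table in these slides
for ic50 in (4.0, 30.0, 2000.0):
    print(ic50, round(pic50_from_nm(ic50), 6))
```

A lower IC50 gives a higher pIC50, so larger values mean more potent compounds.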

Histograms too
p = ggplot(aes(x="pIC50"), data=d) + geom_histogram()
ggsave("hist.png", p)

(Bonus) Time series data
p = (ggplot(aes(x='Date', y='nw'), data=d) +
     geom_point(color='lightblue') +
     stat_smooth(span=.15, color='black', se=True) +
     ggtitle("Simple Diet") +
     xlab("Date") +
     ylab("Weight"))

ggplot covers visualization
Next up: machine learning

What we use today
• PCA
• RandomForest
• Cross-validation <- handy!

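This time: an example of working with RDKit
• Fetch the data with pychembldb
• Compute fingerprints with RDKit
• Then, with scikit-learn:
– Clustering
• Use PCA to get a picture of the chemistry space
– Building a predictive model
• RandomForest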

pychembldb
from pychembldb import *

# Inhibition of recombinant Syk
# Bioorg. Med. Chem. Lett. (2009) 19:1944-1949
assay = chembldb.query(Assay).filter_by(chembl_id="CHEMBL1022010").one()

for act in assay.activities:
    if act.standard_relation == "=":
        # each SDF record is a molfile followed by a "$$$$" line
        print(act.compound.molecule.structure.molfile + "\n$$$$")
Build an SDF from the SYK data used earlier
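An SD file is a sequence of molfile blocks, each terminated by a line containing `$$$$`. The sketch below shows only that framing, writing to a file-like object rather than relying on shell redirection. The `write_sdf` helper is hypothetical, and the strings are placeholders standing in for real molfile text.

```python
import io

def write_sdf(molblocks, out):
    """Write each molfile block followed by the SDF record delimiter '$$$$'."""
    for block in molblocks:
        out.write(block.rstrip("\n") + "\n$$$$\n")

# Placeholder strings standing in for real molfile text
blocks = ["molblock-1\nM  END", "molblock-2\nM  END\n"]
buf = io.StringIO()
write_sdf(blocks, buf)
print(buf.getvalue())
```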

Getting a picture of the chemistry space

PCA
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs
from sklearn.decomposition import PCA
from ggplot import *
import numpy as np
import pandas as pd

suppl = Chem.SDMolSupplier('syk.sdf')
fps = []
for mol in suppl:
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)  # Morgan fingerprint, radius 2
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, arr)  # fills arr with the fingerprint bits
    fps.append(arr)
Compute Morgan fingerprints and convert them to NumPy arrays so that scikit-learn can handle them

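PCA
pca = PCA(n_components=2)
# project each compound onto the first two principal components
# (pca.components_ would instead give the per-bit loadings)
v = pca.fit_transform(np.array(fps))
d = pd.DataFrame(v, columns=["PCA1", "PCA2"])
g = (ggplot(aes(x="PCA1", y="PCA2"), data=d) + geom_point(color="lightblue") +
     xlab("PCA1") + ylab("PCA2"))
ggsave("pca.png", g)
Compute PCA up to the second principal component and plot the first component on X and the second on Y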
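When plotting a PCA result it helps to distinguish per-sample scores from per-feature loadings. The following sketch runs on random 0/1 data standing in for fingerprints (not real compounds) and shows the shapes of both: `fit_transform` returns one coordinate pair per row, while `components_` holds one weight per input column.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random 0/1 matrix standing in for fingerprints: 20 "compounds" x 64 bits
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(20, 64)).astype(float)

pca = PCA(n_components=2)
scores = pca.fit_transform(X)  # one (PC1, PC2) coordinate per row of X
loadings = pca.components_     # one weight per input column, per component

print(scores.shape)
print(loadings.shape)
print(pca.explained_variance_ratio_)
```

To see where each compound sits in chemistry space, the scores are the values to plot.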

RandomForest
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import pandas as pd

d = pd.read_csv("syk.csv")
d["pIC50"] = 9 - np.log10(d["IC50"])
d["ACT"] = d.pIC50.apply(lambda x: 1 if x > 8 else 0)
Compute pIC50 from the SYK data used earlier and treat compounds with pIC50 above 8 as active
(1: active, 0: inactive)

RandomForest
suppl = Chem.SDMolSupplier('syk.sdf')
fps = []
for mol in suppl:
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    fps.append(arr)
Read the SDF with RDKit, compute Morgan fingerprints, and convert them to NumPy arrays for use with scikit-learn

RandomForest
x_train, x_test, y_train, y_test = train_test_split(
    np.array(fps), d["ACT"].values, test_size=0.4, random_state=0)
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

rf = RandomForestClassifier(n_estimators=100, random_state=1123)
rf.fit(x_train, y_train)
print("predict\n", rf.predict(x_test))
print("\nresult\n", y_test)
Split the data into training and test sets, build a RandomForest model, and test it
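The code above evaluates the model on a single hold-out split. As a hedged sketch of k-fold cross-validation, the example below uses `cross_val_score` from `sklearn.model_selection`. The data is synthetic (random bits with a label derived from the first two columns), standing in for fingerprints and activity labels, so the scores say nothing about the SYK data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for fingerprints and activity labels (not real data)
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(100, 32))
y = X[:, 0] | X[:, 1]  # label depends on the first two bits only

rf = RandomForestClassifier(n_estimators=100, random_state=1123)
scores = cross_val_score(rf, X, y, cv=5)  # accuracy on each of 5 folds

print(scores)
print(scores.mean())
```

Averaging over folds gives a less split-dependent estimate than one train/test partition, which matters for small data sets.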

Result
predict
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
result
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

Machine learning in Python is recommended too!
