Introduction to Pandas with Python

Programming Collective Intelligence
Study Group Kickoff Meeting

A Super-Introduction to Pandas

@gepuro

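About me
● Atsushi Hayakawa (@gepuro)
● Fourth-year undergraduate, Department of Systems Engineering, The University of Electro-Communications
● Major: reliability engineering
● Favorite languages: Python, R, AWK
● Hobbies: fireworks, text mining, anime
● Active at: UEC, MMA, iAnalysis, DBCLS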

What is Pandas?
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
(from http://pandas.pydata.org/)

pandas is a high-performance library that makes data structures and data analysis tools really easy to use from Python. (a very loose translation)

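The role of Pandas
● It lets you move smoothly from data wrangling to modeling.
● It lets you do in Python the kinds of operations you can do on data frames in R.
● Pandas itself does not implement many statistical models, so it has to be combined with other libraries.
● pandas has an interface to R, through integration with Rpy2.
● And so on.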

Information about Pandas
● The official documentation at http://pandas.pydata.org/pandas-docs/stable/ is well organized and easy to read.
● In this study group, I hope to mainly introduce the material found there.

Installing Pandas
From source:
git clone git://github.com/pydata/pandas.git
cd pandas
python setup.py install

For Ubuntu users:
apt-get install python-pandas
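After installing by either route, a quick check along the following lines can confirm that pandas and its required NumPy dependency import correctly. This is only a minimal sketch; the `versions` dict is an illustrative name, not something from the slides.

```python
import numpy as np
import pandas as pd

# Collect the installed versions of pandas and its core dependency.
versions = {"pandas": pd.__version__, "numpy": np.__version__}

for name, version in versions.items():
    print(name, version)
```

If either import fails, the installation step above did not complete for the Python interpreter you are running.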

Pandas dependencies
Dependencies
NumPy: 1.6.1 or higher
python-dateutil 1.5

Optional dependencies
SciPy: miscellaneous statistical functions
PyTables: necessary for HDF5-based storage
matplotlib: for plotting
scikits.statsmodels: needed for parts of pandas.stats
pytz: needed for time zone support with date_range


Data structures
Dimensions  Name        Description
1           Series      1D labeled homogeneously-typed array
1           TimeSeries  Series with index containing datetimes
2           DataFrame   General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns
3           Panel       General 3D labeled, also size-mutable array
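To make the table concrete, here is a small sketch showing that a Series holds a single dtype while each DataFrame column can have its own. The data and variable names are made up for illustration. Recent pandas releases no longer ship TimeSeries or Panel, so the sketch sticks to Series and DataFrame.

```python
import pandas as pd

# A Series is one-dimensional and holds a single dtype.
s = pd.Series([1.5, 2.5, 3.5], index=["a", "b", "c"])

# A DataFrame is two-dimensional; each column may have its own dtype.
df = pd.DataFrame(
    {"name": ["x", "y", "z"], "count": [1, 2, 3], "score": [0.1, 0.2, 0.3]},
    index=["a", "b", "c"],
)

print(s.ndim, s.dtype)
print(df.ndim)
print(df.dtypes)
```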

Using Series
Data types that can be passed to Series
● a Python dict

● an ndarray

● a scalar value (like 5)

In [52]: pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
Out[52]:
a -0.904244
b 0.870734
c -0.217093
d 0.123815
e 0.356112

In [53]: s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])

In [54]: s.index
Out[54]: Index([a, b, c, d, e], dtype=object)
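The three input types listed above can be sketched as follows. The values and variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# From a Python dict: the keys become the index labels.
from_dict = pd.Series({"a": 1.0, "b": 2.0, "c": 3.0})

# From an ndarray: pass the labels explicitly with index=.
from_array = pd.Series(np.array([10, 20, 30]), index=["a", "b", "c"])

# From a scalar: the value is repeated for every index label.
from_scalar = pd.Series(5, index=["a", "b", "c"])

print(from_dict)
print(from_array)
print(from_scalar)
```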


Calculations with Series
Vectorized calculations are easy too!
It feels just like working in R!

In [66]: s + s
Out[66]:
a 0.388344
b 3.670871
c 1.306869
d -0.237199
e 3.168135

In [67]: s * 2
Out[67]:
a 0.388344
b 3.670871
c 1.306869
d -0.237199
e 3.168135

In [68]: np.log(s)
Out[68]:
a -1.639012
b 0.607282
c -0.425513
d NaN
e 0.459996
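Because the session above uses random numbers, it cannot be reproduced exactly. The following sketch repeats the same three operations on fixed, made-up values, and shows why one entry of the log output is NaN: the logarithm of a negative number is undefined.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, -1.0], index=["a", "b", "c"])

doubled = s * 2   # element-wise, same result as s + s
summed = s + s

# np.log is applied element-wise; the log of a negative number is NaN.
# errstate only silences NumPy's warning about the invalid value.
with np.errstate(invalid="ignore"):
    logged = np.log(s)

print(doubled.equals(summed))
print(logged)
```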

Using DataFrame
Data types that can be passed to DataFrame
● Dict of 1D ndarrays, lists, dicts, or Series

● 2-D numpy.ndarray

● Structured or record ndarray

● A Series

● Another DataFrame

In [9]: d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
             'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

In [10]: df = pd.DataFrame(d)

In [11]: df
Out[11]:
   one  two
a    1    1
b    2    2
c    3    3
d  NaN    4
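The session above builds a DataFrame from a dict of Series. A few of the other input types from the list can be sketched like this; the data and variable names are invented for illustration.

```python
import numpy as np
import pandas as pd

# From a dict of lists: the keys become column names.
df_from_lists = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]},
    index=["a", "b", "c"],
)

# From a 2-D ndarray: supply row and column labels explicitly.
df_from_array = pd.DataFrame(
    np.arange(6).reshape(3, 2),
    index=["a", "b", "c"],
    columns=["one", "two"],
)

# From a Series: the Series name becomes the column name.
df_from_series = pd.DataFrame(pd.Series([1.0, 2.0], index=["a", "b"], name="one"))

print(df_from_lists)
print(df_from_array)
print(df_from_series)
```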

Calculations with DataFrame
As with Series, the familiar operations just work.

In [17]: df + df
Out[17]:
   one  two
a    2    2
b    4    4
c    6    6
d  NaN    8

In [18]: df + 1
Out[18]:
   one  two
a    2    2
b    3    3
c    4    4
d  NaN    5

Adding a column to a DataFrame
You can add a column the same way you would work with a dict (associative array).

In [21]: df
Out[21]:
   one  two
a    1    1
b    2    2
c    3    3
d  NaN    4

In [22]: df["three"] = df["one"] * df["two"]

In [23]: df
Out[23]:
   one  two  three
a    1    1      1
b    2    2      4
c    3    3      9
d  NaN    4    NaN
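The dict analogy extends beyond adding columns. As a small sketch with made-up data, membership tests and `del` also behave the way they do on a dict, operating on column names:

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [1.0, 2.0, 3.0]},
    index=["a", "b", "c"],
)

# Assigning to a new key adds a column, as with a dict.
df["three"] = df["one"] * df["two"]

# Membership tests look at the column names.
has_three = "three" in df

# del removes a column, again like a dict.
del df["two"]

print(has_three)
print(list(df.columns))
```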

Accessing a DataFrame
Access by column:

In [46]: df["one"]
Out[46]:
a 1
b 2
c 3
d NaN
Name: one

Access by row:

In [47]: df.xs("a")
Out[47]:
one 1
two 1
three 1
Name: a
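In current pandas, label-based row access is commonly written with `df.loc`, which also accepts a row and column label together. The sketch below, on made-up data, shows it next to the `xs` call used above.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]},
    index=["a", "b", "c"],
)

column = df["one"]         # access by column label -> Series indexed by row
row = df.loc["a"]          # access by row label -> Series indexed by column
row_xs = df.xs("a")        # the xs call returns the same row
cell = df.loc["b", "two"]  # a single value by row and column label

print(row)
print(cell)
```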

Handy DataFrame methods
Summary statistics can be computed directly on a DataFrame.

In [91]: df.mean()
Out[91]:
one 2.000000
two 2.500000
three 4.666667

In [92]: df.max()
Out[92]:
one 3
two 4
three 9

In [93]: df.var()
Out[93]:
one 1.000000
two 1.666667
three 16.333333

In [100]: df.apply(lambda x: x.max() - x.min())
Out[100]:
one 2
two 3
three 8
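Two details of these methods are easy to miss: `apply` passes each column to the function by default (and each row with `axis=1`), and `var` computes the sample variance, dividing by n - 1. A minimal sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [1.0, 2.0, 4.0]},
    index=["a", "b", "c"],
)

# Default axis=0: the function receives each column as a Series.
col_range = df.apply(lambda x: x.max() - x.min())

# axis=1: the function receives each row instead.
row_range = df.apply(lambda x: x.max() - x.min(), axis=1)

# var() uses the sample variance (ddof=1) by default.
col_var = df.var()

print(col_range)
print(row_range)
print(col_var)
```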

Sorting a DataFrame
Sorting is easy too. You can also pass a list to by to sort on multiple columns.

In [121]: df.sort_index(by="one")
Out[121]:
   one  two  three
a    1    1      1
b    2    2      4
c    3    3      9
d  NaN    4    NaN

In [122]: df.sort_index(by="one", ascending=False)
Out[122]:
   one  two  three
d  NaN    4    NaN
c    3    3      9
b    2    2      4
a    1    1      1
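The `by=` argument of `sort_index` reflects the pandas version used in this session. In current pandas releases, sorting by column values is done with `sort_values`. A minimal sketch with made-up data, including a list passed to `by`:

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [2.0, 1.0, 2.0], "two": [1.0, 3.0, 0.0]},
    index=["a", "b", "c"],
)

# Sort by a single column, descending.
by_one_desc = df.sort_values(by="one", ascending=False)

# Sort by several columns: ties in "one" are broken by "two".
by_one_two = df.sort_values(by=["one", "two"])

print(by_one_desc)
print(by_one_two)
```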

pandas and scikit-learn
gepuro@ubuntu:~$ cat hoge.csv
a,b
1,1
2,3
4,6
1,3
2,2
1,1

Reference: http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#example-linear-model-plot-ols-py
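Only the input file survives on this slide. As one possible way to connect the two libraries, the sketch below loads the same data with pandas and fits an ordinary least squares line with scikit-learn's `LinearRegression`. This is not the original slide code: treating `a` as the explanatory variable and `b` as the target is an assumption, and the CSV is read from an in-memory string rather than from hoge.csv so that the example is self-contained.

```python
import io

import pandas as pd
from sklearn.linear_model import LinearRegression

# Same contents as hoge.csv, held in memory so the example runs anywhere.
csv_text = "a,b\n1,1\n2,3\n4,6\n1,3\n2,2\n1,1\n"
df = pd.read_csv(io.StringIO(csv_text))

# Assumption: predict column "b" from column "a".
X = df[["a"]]  # 2-D feature matrix (a one-column DataFrame)
y = df["b"]    # 1-D target (a Series)

model = LinearRegression()
model.fit(X, y)

slope = model.coef_[0]
intercept = model.intercept_
print(slope, intercept)
```

Selecting `df[["a"]]` with double brackets keeps the features two-dimensional, which is the shape scikit-learn estimators expect.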

Thank you for your attention!
