3. What is pandas?
pandas is an open source, BSD-licensed
library providing high-performance, easy-to-
use data structures and data analysis tools
for the Python programming language.
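(from http://pandas.pydata.org/)
In other words, pandas is a high-performance library that makes data structures and data analysis tools really easy to use in Python. (A very loose paraphrase.)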
7. pandas dependencies
Dependencies
NumPy: 1.6.1 or higher
python-dateutil 1.5
Optional dependencies
SciPy: miscellaneous statistical functions
PyTables: necessary for HDF5-based storage
matplotlib: for plotting
scikits.statsmodels: needed for parts of pandas.stats
pytz: needed for time zone support with date_range
8. Data structures
Dimensions  Name        Description
1           Series      1D labeled homogeneously-typed array
1           TimeSeries  Series with index containing datetimes
2           DataFrame   General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns
3           Panel       General 3D labeled, also size-mutable array
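To make the table concrete, here is a minimal sketch that checks the dimensionality of a Series and a DataFrame and builds a Series indexed by datetimes. The variable names and sample values are made up for illustration. As far as I know, newer pandas releases no longer ship Panel or a separate TimeSeries class, so only Series and DataFrame are shown.

```python
import pandas as pd

# 1D, labeled, one dtype for all values
s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])

# A Series whose index holds datetimes (what the table calls a TimeSeries)
ts = pd.Series([1.0, 2.0, 3.0], index=pd.date_range("2013-01-01", periods=3))

# 2D, labeled, and each column may have its own dtype
df = pd.DataFrame({"num": [1, 2], "text": ["x", "y"]})

print(s.ndim, df.ndim)
print(type(ts.index).__name__)  # DatetimeIndex
print(df.dtypes)
```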
9. Using Series
Data types a Series can be built from:
● a Python dict
● an ndarray
● a scalar value (like 5)
In [52]: pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
Out[52]:
a -0.904244
b 0.870734
c -0.217093
d 0.123815
e 0.356112
In [53]: s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
In [54]: s.index
Out[54]: Index([a, b, c, d, e], dtype=object)
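The session above only uses an ndarray. A minimal sketch of all three input types from the list follows; the variable names and values are illustrative, not from the original slides.

```python
import numpy as np
import pandas as pd

# From a dict: the keys become the index labels.
from_dict = pd.Series({"a": 1.0, "b": 2.0, "c": 3.0})

# From an ndarray: labels are passed explicitly through index=.
from_array = pd.Series(np.arange(3, dtype=float), index=["a", "b", "c"])

# From a scalar: the value is repeated once per index label.
from_scalar = pd.Series(5, index=["a", "b", "c"])

print(from_dict)
print(from_array)
print(from_scalar)
```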
10. Computing with Series
In [66]: s + s
Out[66]:
a 0.388344
b 3.670871
c 1.306869
d -0.237199
e 3.168135
Vectorized arithmetic is easy too, and it feels just like working in R!
In [67]: s * 2
Out[67]:
a 0.388344
b 3.670871
c 1.306869
d -0.237199
e 3.168135
In [68]: np.log(s)
Out[68]:
a -1.639012
b 0.607282
c -0.425513
d NaN
e 0.459996
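The same operations can be reproduced with fixed values instead of random ones. This is a sketch with made-up numbers; it shows that s + s and s * 2 agree element by element, and that np.log returns NaN for the negative entry, just like row d above.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.5, 1.0, -2.0, 4.0], index=["a", "b", "c", "d"])

summed = s + s    # element-wise addition, matched on index labels
doubled = s * 2   # scalar multiplication applied to every element

# NumPy ufuncs work on a Series and keep its index.
# The log of a negative number is NaN; errstate silences the warning.
with np.errstate(invalid="ignore"):
    logged = np.log(s)

print(summed)
print(doubled)
print(logged)
```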
11. Using DataFrame
Data types a DataFrame can be built from:
● Dict of 1D ndarrays, lists, dicts, or Series
● 2-D numpy.ndarray
● Structured or record ndarray
● A Series
● Another DataFrame
In [9]: d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
             'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
In [10]: df = pd.DataFrame(d)
In [11]: df
Out[11]:
one two
a 1 1
b 2 2
c 3 3
d NaN 4
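Besides a dict of Series, a few of the other input types listed above can be sketched as follows. The column names x and y and the sample values are assumptions for illustration. Note how the dict-of-Series case aligns on the index and fills the missing cell with NaN, as in the output above.

```python
import numpy as np
import pandas as pd

# Dict of Series: rows are the union of the indexes, gaps become NaN.
from_series = pd.DataFrame({
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
})

# Dict of lists: all lists must have the same length.
from_lists = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

# 2-D ndarray: column labels are supplied separately.
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["x", "y"])

# Another DataFrame: the constructor accepts it directly.
from_frame = pd.DataFrame(from_lists)

print(from_series)
print(from_array)
```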
12. Computing with DataFrame
In [17]: df + df
Out[17]:
one two
a 2 2
b 4 4
c 6 6
d NaN 8
As with Series, the familiar arithmetic operations work here too.
In [18]: df + 1
Out[18]:
one two
a 2 2
b 3 3
c 4 4
d NaN 5
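A self-contained version of the arithmetic above, as a sketch. It rebuilds the same df as slide 11 and makes the NaN behavior explicit: a missing cell stays missing after both df + df and df + 1.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
})

doubled = df + df   # element-wise, aligned on row and column labels
shifted = df + 1    # the scalar is applied to every cell

print(doubled)
print(shifted)
```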
13. Adding a column to a DataFrame
In [21]: df
Out[21]:
one two
a 1 1
b 2 2
c 3 3
d NaN 4
A column can be added the same way you would set a key in a dict (associative array).
In [22]: df["three"] = df["one"] * df["two"]
In [23]: df
Out[23]:
one two three
a 1 1 1
b 2 2 4
c 3 3 9
d NaN 4 NaN
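The dict analogy goes a bit further than assignment. The sketch below rebuilds the same df and shows that membership tests and del also behave like they do on a dict, with column names acting as keys.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
})

# Assigning to a new key creates the column.
df["three"] = df["one"] * df["two"]
has_three = "three" in df          # membership tests look at column names

# del removes a column, again like a dict key.
trimmed = df.copy()
del trimmed["two"]

print(df)
print(list(trimmed.columns))
```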
15. Handy DataFrame methods
In [91]: df.mean()
Out[91]:
one 2.000000
two 2.500000
three 4.666667
In [92]: df.max()
Out[92]:
one 3
two 4
three 9
In [93]: df.var()
Out[93]:
one 1.000000
two 1.666667
three 16.333333
In [100]: df.apply(lambda x: x.max() - x.min())
Out[100]:
one 2
two 3
three 8
Summary statistics can be computed directly on a DataFrame.
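The numbers above can be reproduced with the sketch below, which rebuilds the three-column df from slide 13. Two details are worth spelling out: these methods work column by column and skip NaN by default, and var() uses the sample variance (dividing by n - 1).

```python
import pandas as pd

df = pd.DataFrame({
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
})
df["three"] = df["one"] * df["two"]

means = df.mean()       # per-column mean, NaN cells are skipped
maxima = df.max()       # per-column maximum
variances = df.var()    # per-column sample variance (ddof=1)

# apply() calls the function once per column; here it computes the range.
ranges = df.apply(lambda col: col.max() - col.min())

print(means)
print(variances)
print(ranges)
```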
16. Sorting a DataFrame
In [121]: df.sort_index(by="one")
Out[121]:
one two three
a 1 1 1
b 2 2 4
c 3 3 9
d NaN 4 NaN
Sorting is easy, too. Passing a list to by sorts on several columns at once.
In [122]: df.sort_index(by="one", ascending=False)
Out[122]:
one two three
d NaN 4 NaN
c 3 3 9
b 2 2 4
a 1 1 1
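The session above uses the pandas version current when these slides were written. In newer pandas releases, sorting by column values is done with sort_values rather than sort_index(by=...), so the sketch below uses sort_values as the assumed equivalent. One visible difference: sort_values puts NaN rows last by default, even in descending order, whereas the output above lists row d first.

```python
import pandas as pd

df = pd.DataFrame({
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
})
df["three"] = df["one"] * df["two"]

ascending = df.sort_values(by="one")
descending = df.sort_values(by="one", ascending=False)

# A list for by= sorts on several columns, in the order given.
multi = df.sort_values(by=["two", "one"], ascending=[False, True])

# na_position="first" moves the NaN row to the top instead.
nan_first = df.sort_values(by="one", ascending=False, na_position="first")

print(descending)
print(nan_first)
```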