Pandas presentation

Introduction to Data Analysis with Pandas
Hiroyuki Sannomiya

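Recap of last month
Last month I went all the way to Tokyo to attend the seminar
"Introduction to Python for Data Analysis".
At the end of last month I gave a talk at Sugoi Hiroshima with Python on
list literals and basic syntax, both of which are important for data
analysis in Python.
→ This time I cover data preprocessing with Pandas and visualization
with matplotlib.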

Checking the data
Data is often retrieved from a database with SQL, but this time we
read CSV files with Pandas.
# Using the read_csv and head functions
import pandas
customers = pandas.read_csv('customers.csv')
print(customers[['customer_id', 'registration_date']].head())
  customer_id    registration_date
0    IK152942  2019-01-01 00:25:33
1    TS808488  2019-01-01 01:13:45
2    AS834628  2019-01-01 02:00:14
3    AS345469  2019-01-01 04:48:22
4    GD892565  2019-01-01 04:54:51
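A minimal sketch of the same read_csv / head pattern that runs without the CSV file. The in-memory text below is a stand-in for customers.csv (an assumption for illustration; it reuses the rows shown above), and io.StringIO lets read_csv treat it like a file.

```python
import io
import pandas

# Stand-in for customers.csv (assumption: only these two columns are shown)
csv_text = """customer_id,registration_date
IK152942,2019-01-01 00:25:33
TS808488,2019-01-01 01:13:45
AS834628,2019-01-01 02:00:14
AS345469,2019-01-01 04:48:22
GD892565,2019-01-01 04:54:51
"""

customers = pandas.read_csv(io.StringIO(csv_text))

# head() returns the first 5 rows by default; head(n) returns the first n rows
first_three = customers[['customer_id', 'registration_date']].head(3)
print(first_three)
```

Passing a number to head is handy when five rows are more than the slide or terminal can show.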

Loading the other data
import pandas
items = pandas.read_csv('items.csv')
print(items.head())
  item_id    item_name  item_price
0    I001  MediaPlayer       50000
1    I002   SmartPhone       85000
2    I003     LaptopPC      120000
3    I004    DesktopPC      180000
4    I005     GamingPC      210000

import pandas
log_1 = pandas.read_csv('log_1.csv')
print(log_1.head())
        log_id   price payment_date
1  T0000000114   50000 2019-02-01 01
2  T0000000115  120000 2019-02-01 02
3  T0000000116  210000 2019-02-01 02
4  T0000000117  170000 2019-02-01 04

import pandas
log_detail_1 = pandas.read_csv('log_detail_1.csv')
print(log_detail_1.head())
   detail_id       log_id item_id  quantity
0          0  T0000000113    I005         1
1          1  T0000000114    I001         1
2          2  T0000000115    I003         1
3          3  T0000000116    I005         1
4          4  T0000000117    I002         2

Printing the number of records
# -*- coding: utf-8 -*-
import pandas
customers = pandas.read_csv('customers.csv') # customer data
items = pandas.read_csv('items.csv') # item data
log_1 = pandas.read_csv('log_1.csv') # purchase data
log_detail_1 = pandas.read_csv('log_detail_1.csv') # purchase detail data
print('Customer data:', len(customers), 'records')
print('Item data:', len(items), 'records')
print('Purchase data:', len(log_1), 'records')
print('Purchase detail data:', len(log_detail_1), 'records')
Customer data: 5000 records
Item data: 5 records
Purchase data: 5000 records
Purchase detail data: 5000 records

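Preprocessing
The goal is to join the purchase data with the purchase detail data.
First, the purchase data, which is split into parts 1 and 2, is combined
with the concat function. Combining data vertically like this is called
a union.
# -*- coding: utf-8 -*-
import pandas
log_1 = pandas.read_csv('log_1.csv')
log_2 = pandas.read_csv('log_2.csv') # assumed file name, following log_1.csv
# ignore_index=True renumbers the index from 0.
logs = pandas.concat([log_1, log_2], ignore_index=True)
print(f'{len(log_1)} + {len(log_2)} ={len(logs)}')
log_detail_1 = pandas.read_csv('log_detail_1.csv')
log_detail_2 = pandas.read_csv('log_detail_2.csv') # assumed file name, following log_detail_1.csv
log_details = pandas.concat([log_detail_1, log_detail_2], ignore_index=True)
print(f'{len(log_detail_1)} + {len(log_detail_2)} ={len(log_details)}')
5000 + 1786 =6786
5000 + 2144 =7144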
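To see what ignore_index=True changes, here is a small sketch with two hypothetical miniature frames (the log_id values and prices are made up for illustration). Without the option each frame keeps its own index, so labels repeat; with it the result is renumbered from 0.

```python
import pandas

# Hypothetical miniature versions of log_1 and log_2
log_1 = pandas.DataFrame({'log_id': ['T1', 'T2', 'T3'],
                          'price': [50000, 120000, 210000]})
log_2 = pandas.DataFrame({'log_id': ['T4', 'T5'],
                          'price': [85000, 180000]})

# Default: the original index labels are kept, so 0 and 1 appear twice
kept = pandas.concat([log_1, log_2])

# ignore_index=True: the combined rows are renumbered 0..n-1
renumbered = pandas.concat([log_1, log_2], ignore_index=True)

print(list(kept.index))
print(list(renumbered.index))
```

Duplicate index labels make label-based selection such as loc[0] return several rows, which is usually not what you want after a union.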

Joining the purchase data and the purchase detail data
import pandas
logs = pandas.concat([log_1, log_2], ignore_index=True)
log_details = pandas.concat([log_detail_1, log_detail_2], ignore_index=True)
# Left join: keep every row of log_details and attach the matching logs columns
merge_data = pandas.merge(
    log_details,
    logs[['log_id', 'payment_date', 'customer_id']],
    on='log_id',
    how='left')
print(merge_data.head())

The logs rows whose log_id matches were added to the log_detail data.
Combining data horizontally like this is called a join.
   detail_id       log_id item_id  quantity
0          0  T0000000113    I005         1
1          1  T0000000114    I001         1
2          2  T0000000115    I003         1
3          3  T0000000116    I005         1
4          4  T0000000117    I002         2
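A small sketch of how how='left' behaves, using hypothetical frames (all values are made up). Every row of the left frame survives; where no key matches on the right, the added columns are filled with NaN. An inner join is shown for contrast.

```python
import pandas

# Hypothetical detail rows; 'T9' has no matching row in logs
log_details = pandas.DataFrame({
    'detail_id': [0, 1, 2],
    'log_id': ['T1', 'T1', 'T9'],
    'item_id': ['I005', 'I001', 'I003'],
})
logs = pandas.DataFrame({
    'log_id': ['T1', 'T2'],
    'customer_id': ['C1', 'C2'],
})

# Left join: all 3 detail rows are kept, unmatched ones get NaN
left = pandas.merge(log_details, logs, on='log_id', how='left')

# Inner join: only rows whose log_id exists on both sides are kept
inner = pandas.merge(log_details, logs, on='log_id', how='inner')

print(left)
print(inner)
```

Because a left join never drops detail rows, the row count of the result is a quick sanity check that the key on the right side is unique.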

Joining the customer data and the item data
merge_data = pandas.merge(merge_data, customers, on='customer_id', how='left')
merge_data = pandas.merge(merge_data, items, on='item_id', how='left')
print(merge_data[['detail_id', 'log_id', 'item_id', 'customer_id']].head())

   detail_id       log_id item_id  customer
0          0  T0000000113    I005  PL563502
1          1  T0000000114    I001   HD67801
2          2  T0000000115    I003   HD29812
3          3  T0000000116    I005  IK452215
4          4  T0000000117    I002  PL542865

Computing the price and adding it as a new column
# price per detail row = unit price of the item * quantity purchased
merge_data['price'] = merge_data['item_price'] * merge_data['quantity']
print(merge_data[['detail_id', 'log_id', 'item_id', 'customer_id', 'price']].head())

   detail_id       log_id item_id  customer
0          0  T0000000113    I005  PL563502
1          1  T0000000114    I001   HD67801
2          2  T0000000115    I003   HD29812
3          3  T0000000116    I005  IK452215
4          4  T0000000117    I002  PL542865

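Checking the data
# The total over the detail rows should equal the total of the original purchase data
if merge_data['price'].sum() == logs['price'].sum():
    print('The totals match.')
else:
    print('The totals do not match.')
The totals match.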
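The price calculation and the check can be tried on a few hypothetical rows (the log ids and the logs totals below are made up so that they agree). The sketch also compares the totals per log_id, which is a stricter check than comparing grand totals, since two errors that cancel out would pass the grand-total test.

```python
import pandas

# Hypothetical miniature data
items = pandas.DataFrame({'item_id': ['I001', 'I002'],
                          'item_price': [50000, 85000]})
details = pandas.DataFrame({'detail_id': [0, 1, 2],
                            'log_id': ['T1', 'T1', 'T2'],
                            'item_id': ['I001', 'I002', 'I001'],
                            'quantity': [1, 2, 1]})
logs = pandas.DataFrame({'log_id': ['T1', 'T2'],
                         'price': [220000, 50000]})

merged = pandas.merge(details, items, on='item_id', how='left')
merged['price'] = merged['item_price'] * merged['quantity']

# Grand-total check, as in the slide
totals_match = merged['price'].sum() == logs['price'].sum()

# Stricter check: compare the totals for each log_id
per_log = merged.groupby('log_id')['price'].sum()
expected = logs.set_index('log_id')['price']
per_log_match = (per_log == expected).all()

print(totals_match, per_log_match)
```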

Checking for missing values
print(merge_data.isnull().sum())
detail_id             0
log_id                0
item_id               0
quantity              0
payment_date          0
customer_id           0
customer_name         0
registration_date     0
customer_name_kana    0
email                 0
gender                0
age                   0
birth                 0
pref                  0
item_name             0
item_price            0
price                 0
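Since the real data has no missing values, the effect of isnull().sum() is easier to see on a hypothetical frame that does have gaps (values made up for illustration). isnull() marks each cell as True or False, and sum() counts the True cells per column.

```python
import pandas

# Hypothetical frame with two missing cells
df = pandas.DataFrame({
    'customer_id': ['C1', 'C2', 'C3'],
    'age': [20, None, 35],
    'gender': ['Male', 'Female', None],
})

# Number of missing values per column
missing = df.isnull().sum()
print(missing)

# Rows that contain at least one missing value
rows_with_gaps = df[df.isnull().any(axis=1)]
print(rows_with_gaps)
```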

Analysis
Let's check the summary statistics.
print(merge_data.describe())
         detail_id     quantity      age
count  7144.000000  7144.000000  7144.00
mean   3571.500000     1.199888  50.2656
std    2062.439494     0.513647  17.1903
min       0.000000     1.000000  20.0000
25%    1785.750000     1.000000  36.0000
50%    3571.500000     1.000000  50.0000
75%    5357.250000     1.000000  65.0000
max    7143.000000     4.000000  80.0000

Aggregating by month
print(merge_data.dtypes)
detail_id              int64
log_id                object
item_id               object
quantity               int64
payment_date          object
customer_id           object
customer_name         object
registration_date     object
customer_name_kana    object
email                 object
gender                object
age                    int64
birth                 object
pref                  object
item_name             object
item_price             int64
price                  int64

Adding a year-month column for monthly analysis
Columns shown as object hold string data. We convert payment_date to
the datetime type, which is better suited to date handling.
merge_data['payment_date'] = pandas.to_datetime(merge_data['payment_date'])
merge_data['payment_month'] = merge_data['payment_date'].dt.strftime('%Y%m')
print(merge_data[['payment_date', 'payment_month']].head())
         payment_date payment_month
0 2019-02-01 01:36:57        201902
1 2019-02-01 01:37:23        201902
2 2019-02-01 02:34:19        201902
3 2019-02-01 02:47:23        201902
4 2019-02-01 04:33:46        201902
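The conversion can be tried on its own with a couple of date strings (the second one is a made-up value for illustration). Note that strftime returns strings again, so payment_month is a text key for grouping rather than a date.

```python
import pandas

df = pandas.DataFrame({
    'payment_date': ['2019-02-01 01:36:57', '2019-03-15 12:00:00'],
})

# Parse the strings into datetime values
df['payment_date'] = pandas.to_datetime(df['payment_date'])

# The .dt accessor exposes datetime methods; '%Y%m' gives e.g. '201902'
df['payment_month'] = df['payment_date'].dt.strftime('%Y%m')

print(df.dtypes)
print(df)
```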

Aggregating by the values of a column
Use the groupby function.
# Select the price column before summing so that only price is aggregated
print(merge_data.groupby('payment_month')['price'].sum())
payment_month
201902    160185000
201903    160370000
201904    160510000
201905    155420000
201906    164030000
201907    170620000
Name: price, dtype: int64
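A minimal groupby sketch on hypothetical rows (values made up). It selects the column first and then sums; summing the whole frame and picking the column afterwards can fail on recent pandas versions when the frame also contains datetime columns.

```python
import pandas

# Hypothetical sales rows
sales = pandas.DataFrame({
    'payment_month': ['201902', '201902', '201903'],
    'item_name': ['MediaPlayer', 'LaptopPC', 'SmartPhone'],
    'price': [50000, 120000, 85000],
})

# Rows with the same payment_month are grouped, then price is summed per group
monthly = sales.groupby('payment_month')['price'].sum()
print(monthly)
```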

Detailed data per item
# Group by two keys; select the columns before summing
print(merge_data.groupby(['payment_month', 'item_name'])[['price', 'quantity']].sum())
                              price  qua
payment_month item_name
201902        DesktopPC    31140000  173
              GamingPC     59850000  285
              LaptopPC     19800000  165
              MediaPlayer  24150000  483
              SmartPhone   25245000  297
201903        DesktopPC    25740000  143
              GamingPC     64050000  305
              LaptopPC     19080000  159
201904        DesktopPC    24300000  135
              GamingPC     64890000  309
              LaptopPC     21960000  183
201905        DesktopPC    25920000  144
              GamingPC     58800000  280
              LaptopPC     20520000  171
201906        DesktopPC    28800000  160
              GamingPC     63420000  302
              LaptopPC     21840000  182
201907        DesktopPC    26100000  145
              GamingPC     71610000  341
              LaptopPC     19440000  162

The pivot_table function
print(pandas.pivot_table(
    merge_data,
    index='item_name',
    columns='payment_month',
    values=['price', 'quantity'],
    aggfunc='sum'))
                  price
payment_month    201902    201903  20190
item_name
DesktopPC      31140000  25740000  24300
GamingPC       59850000  64050000  64890
LaptopPC       19800000  19080000  21960
MediaPlayer    24150000  26000000  25900
SmartPhone     25245000  25500000  23460
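A small pivot_table sketch on hypothetical rows (values made up). index becomes the row labels, columns becomes the column labels, and aggfunc is applied to values within each cell. fill_value=0 is an addition here, not part of the slide code: it replaces combinations with no rows by 0 instead of NaN.

```python
import pandas

# Hypothetical sales rows; SmartPhone has no sale in 201903
sales = pandas.DataFrame({
    'payment_month': ['201902', '201902', '201902', '201903'],
    'item_name': ['LaptopPC', 'LaptopPC', 'SmartPhone', 'LaptopPC'],
    'price': [120000, 120000, 85000, 120000],
})

table = pandas.pivot_table(
    sales,
    index='item_name',
    columns='payment_month',
    values='price',
    aggfunc='sum',
    fill_value=0)
print(table)
```

The result holds the same totals as a two-key groupby, laid out as a grid that is easier to read and to plot.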

Visualization
# Months as rows and items as columns, so that each column can be drawn as one line
graph_data = pandas.pivot_table(
    merge_data,
    index='payment_month',
    columns='item_name',
    values='price',
    aggfunc='sum')
print(graph_data)
item_name      DesktopPC  GamingPC  Lapt
payment_month
201902          31140000  59850000  1980
201903          25740000  64050000  1908
201904          24300000  64890000  2196
201905          25920000  58800000  2052
201906          28800000  63420000  2184
201907          26100000  71610000  1944

Line chart
import matplotlib.pyplot as plt
# Draw one line per item
for item_name in graph_data.keys():
    plt.plot(graph_data[item_name].index, graph_data[item_name].values, label=item_name)
plt.legend()
plt.savefig('images/oresen.png')
plt.show()
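A self-contained sketch of the same line chart. It uses two months of the LaptopPC and GamingPC figures from the tables above, matplotlib's Axes interface instead of the plt functions, and the non-interactive Agg backend so it also runs where no display is available; these three choices are assumptions for illustration, not the slide's method.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend: no window is opened
import matplotlib.pyplot as plt
import pandas

# Two months of the pivoted sales figures (months as rows, items as columns)
graph_data = pandas.DataFrame(
    {'LaptopPC': [19800000, 19080000],
     'GamingPC': [59850000, 64050000]},
    index=pandas.Index(['201902', '201903'], name='payment_month'))

fig, ax = plt.subplots()
# One line per item; the month strings are used as categorical x positions
for item_name in graph_data.columns:
    ax.plot(graph_data.index, graph_data[item_name], label=item_name)
ax.set_xlabel('payment_month')
ax.set_ylabel('price')
ax.legend()
```

Calling fig.savefig with a file name would write the chart to disk, as plt.savefig does in the slide code.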

Pie chart
# Count the rows for each gender value
counts = merge_data['gender'].value_counts()
print(counts)
plt.pie(counts, labels=counts.keys(), startangle=-90)
plt.legend()
plt.savefig('images/piechart.png')
plt.show()
Male      3596
Female    3548
Name: gender, dtype: int64

Bar chart
# Total quantity sold per item
sales_volumes = merge_data.groupby('item_name')['quantity'].sum()
print(sales_volumes)
for item_name in sales_volumes.keys():
    plt.bar(item_name, sales_volumes[item_name], label=item_name)
plt.legend()
plt.savefig('images/barchart.png')
plt.show()
item_name
DesktopPC       900
GamingPC       1822
LaptopPC       1022
MediaPlayer    3043
