Pandas: A Powerful Tool for Data Processing and Analysis

Download today's slides here ☞

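尼斯
Intern at 智庫驅動
Third-year student, Department of Information Management, National Taiwan University of Science and Technology
Teaching assistant, Programming Research Club, National Taiwan University of Science and Technology
President, Computer and Information Club, National Fengshan Senior High School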

Fewer Loops, More Efficient
Pandas
by 尼斯

Fewer Loops: Series Arithmetic as an Example

Series (sequence)
• A collection data type
• Ordered
• Index

index | values
小明 | 56
小華 | 70
小花 | 42
小綠 | 58
小橘 | NaN

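Integer-based indexing

import numpy as np
import pandas as pd

# Without an explicit index, a Series is labeled 0, 1, 2, ...
height = pd.Series([150, 175, 143, 174, 158])
weight = pd.Series([56, 70, 42, 58, np.nan])
# Element-wise arithmetic with no loop; NaN propagates to the result.
bmi = weight / (height / 100) ** 2

0    24.888889
1    22.857143
2    20.538902
3    19.157088
4          NaN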
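To make the "fewer loops" point concrete, here is a minimal sketch that computes the same BMI values twice: once with the vectorized expression from the slide and once with an explicit Python loop. The helper name `bmi_loop` is hypothetical and does not appear in the slides.

```python
import math

import numpy as np
import pandas as pd

height = pd.Series([150, 175, 143, 174, 158])
weight = pd.Series([56, 70, 42, 58, np.nan])

# Vectorized: one expression, applied element by element and aligned on the index.
bmi_vectorized = weight / (height / 100) ** 2


def bmi_loop(heights, weights):
    """Hypothetical loop-based equivalent, shown only for comparison."""
    result = []
    for h, w in zip(heights, weights):
        result.append(w / (h / 100) ** 2)
    return result


bmi_looped = bmi_loop(height, weight)

# Both approaches agree, including the NaN produced by the missing weight.
print(np.allclose(bmi_vectorized, bmi_looped, equal_nan=True))
print(math.isnan(bmi_looped[4]))
```

The loop version needs four lines to do what the Series expression does in one, and it returns a plain list that has lost the index.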

More Efficient: List vs Series

List vs Series

import numpy as np
import pandas as pd

sample_series = pd.Series(np.random.sample(1000000))
sample_list = list(np.random.sample(1000000))

%timeit sample_series+sample_series
1000 loops, best of 3: 1.04 ms per loop

%timeit [i+i for i in sample_list]
10 loops, best of 3: 136 ms per loop

[Bar chart: Series 1.04 ms vs list 136 ms per loop]
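`%timeit` only works inside IPython or Jupyter. As a rough sketch, the same comparison can be run from a plain script with the standard-library `timeit` module. The array size `n` is reduced here (an assumption, to keep the run short), and the timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

n = 100_000  # smaller than the slide's 1,000,000 so the script finishes quickly
repeats = 10

sample_series = pd.Series(np.random.sample(n))
sample_list = list(np.random.sample(n))

# Total time for `repeats` runs of each approach.
series_time = timeit.timeit(lambda: sample_series + sample_series, number=repeats)
list_time = timeit.timeit(lambda: [i + i for i in sample_list], number=repeats)

print(f"Series: {series_time / repeats * 1000:.3f} ms per loop")
print(f"list:   {list_time / repeats * 1000:.3f} ms per loop")
```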

Import Data from Multiple Sources
• csv
• excel
• sql
• json
• html
• …
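A minimal sketch of three of the readers listed above, using in-memory data so it runs without network access. The sample data and the table name `people` are made up for illustration. `pd.read_excel` and `pd.read_html` follow the same pattern but need optional parser packages installed, so they are left out here.

```python
import io
import sqlite3

import pandas as pd

# CSV: accepts a file path, a URL, or a file-like object.
csv_df = pd.read_csv(io.StringIO("name,weight\nA,56\nB,70\n"))

# JSON: a list of records becomes one row per record.
json_text = '[{"name": "A", "weight": 56}, {"name": "B", "weight": 70}]'
json_df = pd.read_json(io.StringIO(json_text))

# SQL: an in-memory SQLite database stands in for a real one.
conn = sqlite3.connect(":memory:")
csv_df.to_sql("people", conn, index=False)
sql_df = pd.read_sql("SELECT name, weight FROM people", conn)
conn.close()

print(csv_df.shape, json_df.shape, sql_df.shape)
```

Whatever the source, the result is the same kind of object, a DataFrame, so the rest of the analysis does not depend on where the data came from.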

Import Data from a CSV

url = 'http://opendata.epa.gov.tw/ws/Data/AQX/?format=csv&ndctype=CSV&ndcnid=6074'
airPollution = pd.read_csv(url, encoding="utf-8-sig")

  | SiteName | County | PSI | MajorPollutant | Status | SO2 | CO | O3 | PM10 | PM2.5 | …
0 | 麥寮 | 雲林縣 | 70 | 懸浮微粒 | 普通 | 1.1 | 0.39 | 36 | 75 | 32 | …
1 | 關山 | 臺東縣 | 34 | NaN | 良好 | 1.1 | NaN | 41 | 18 | 11 | …
2 | 馬公 | 澎湖縣 | 0 | NaN | NaN | 1.9 | 0.32 | 52 | 57 | 27 | …
3 | 金門 | 金門縣 | 67 | 懸浮微粒 | 普通 | 5.0 | 0.33 | 56 | 89 | 30 | …
4 | 馬祖 | 連江縣 | 54 | 懸浮微粒 | 普通 | 3.6 | 0.31 | 62 | 64 | 37 | …

(In the data, 懸浮微粒 means suspended particulates, 普通 means moderate, and 良好 means good.)
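The slide does not say why `encoding="utf-8-sig"` was chosen. One plausible reason (an assumption) is that the file starts with a UTF-8 byte order mark, which this codec strips before parsing. The sketch below simulates such a file in memory with made-up bytes rather than fetching the real URL.

```python
import io

import pandas as pd

# Bytes as a server might send them: a UTF-8 byte order mark, then the CSV text.
raw = b"\xef\xbb\xbf" + "SiteName,County,PSI\n麥寮,雲林縣,70\n".encode("utf-8")

# "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
df = pd.read_csv(io.BytesIO(raw), encoding="utf-8-sig")

print(df.columns.tolist())
```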

DataFrame

  | SiteName | County | PSI | MajorPollutant | Status | SO2 | CO | O3 | PM10 | …
0 | 麥寮 | 雲林縣 | 70 | 懸浮微粒 | 普通 | 1.1 | 0.39 | 36 | 75 | …
1 | 關山 | 臺東縣 | 34 | NaN | 良好 | 1.1 | NaN | 41 | 18 | …
2 | 馬公 | 澎湖縣 | 0 | NaN | NaN | 1.9 | 0.32 | 52 | 57 | …
3 | 金門 | 金門縣 | 67 | 懸浮微粒 | 普通 | 5.0 | 0.33 | 56 | 89 | …
4 | 馬祖 | 連江縣 | 54 | 懸浮微粒 | 普通 | 3.6 | 0.31 | 62 | 64 | …
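One way to read the table above is as a set of columns that share a single index, where each column is a Series. The short sketch below builds a small DataFrame by hand (three rows and three columns copied from the table) to show that relationship.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "SiteName": ["麥寮", "關山", "馬公"],
    "PSI": [70, 34, 0],
    "CO": [0.39, np.nan, 0.32],
})

# Selecting one column by name returns a Series.
psi = df["PSI"]
print(type(psi).__name__)

# The Series keeps the DataFrame's row index.
print(psi.index.equals(df.index))
```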

Select Series from a DataFrame

airPollution['psi2'] = (
    airPollution['PM10'] +
    airPollution['SO2'] +
    airPollution['CO'] +
    airPollution['O3'] +
    airPollution['NO2']
)

  | SiteName | County | PSI | MajorPollutant | Status | SO2 | CO | O3 | PM10 | ... | psi2
0 | 麥寮 | 雲林縣 | 70 | 懸浮微粒 | 普通 | 1.1 | 0.39 | 36 | 75 | … | 122.09
1 | 關山 | 臺東縣 | 34 | NaN | 良好 | 1.1 | NaN | 41 | 18 | … | NaN
2 | 馬公 | 澎湖縣 | 0 | NaN | NaN | 1.9 | 0.32 | 52 | 57 | … | 113.82
3 | 金門 | 金門縣 | 67 | 懸浮微粒 | 普通 | 5.0 | 0.33 | 56 | 89 | … | 156.33
4 | 馬祖 | 連江縣 | 54 | 懸浮微粒 | 普通 | 3.6 | 0.31 | 62 | 64 | … | 133.31
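Row 1 of the result is NaN because its CO value is missing, and `+` propagates NaN. The sketch below reproduces that behavior on a three-row sample copied from the table (without NO2, whose values are not shown on the slide) and contrasts it with `DataFrame.sum`, which skips NaN by default. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "PM10": [75, 18, 57],
    "SO2": [1.1, 1.1, 1.9],
    "CO": [0.39, np.nan, 0.32],
    "O3": [36, 41, 52],
})
cols = ["PM10", "SO2", "CO", "O3"]

# Chained addition, as on the slide: any NaN operand makes the row NaN.
chained = df["PM10"] + df["SO2"] + df["CO"] + df["O3"]

# Row-wise sum with the same NaN behavior as chained addition.
strict = df[cols].sum(axis=1, skipna=False)

# Row-wise sum with the default skipna=True: missing values count as 0.
lenient = df[cols].sum(axis=1)

print(pd.DataFrame({"chained": chained, "strict": strict, "lenient": lenient}))
```

Which behavior is right depends on the analysis: a partial sum can look like a real measurement, so the strict form is the safer default when a missing component should invalidate the total.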

Data Cleaning: Filling Missing Values as an Example

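Filling Missing Values as an Example

url = 'http://www.society.taichung.gov.tw/section/index-1.asp?Parser=99,16,257,,,,3807,589,,,,42,,3'
careCenter = pd.read_html(url)[1]

  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | …
0 | 臺中市政府核准立案老人福利機構名冊更新日期: ...... | NaN | NaN | NaN | NaN | NaN | NaN | …
1 | 行政區域 | 機構名稱 | 負責人 | 地址 | 電話 | 床型/床數 | 目前收容人數 | …
2 | 中區 | 台中市私立溫興長期照顧... | 賴興建 | 臺中市中區雙十路一段19巷6號 | 04-22230938 | 養護15床 (含插2管7床) | 養護12人 | …

(Row 0 is the title of the source list: the Taichung City Government register of licensed elderly welfare institutions, with its last-updated date. Row 1 holds the real column names: 行政區域 = district, 機構名稱 = institution name, 負責人 = person in charge, 地址 = address, 電話 = phone, 床型/床數 = bed type / bed count, 目前收容人數 = current number of residents.)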

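Filling Missing Values as an Example

# Promote row 1 to column names, then drop the title row and the header row.
# The original slide used careCenter.ix[1]; .ix has since been removed from pandas.
careCenter.columns = careCenter.iloc[1]
careCenter = careCenter[2:]

  | 行政區域 | 機構名稱 | 負責人 | 地址 | 電話 | 床型/床數 | 目前收容人數 | …
2 | 中區 | 台中市私立溫興長期照顧中心(養護型) | 賴興建 | 臺中市中區雙十路一段19巷6號 | 04-22230938 | 養護15床 (含插2管7床) | 養護12人 | …
3 | NaN | 台中市私立麗安老人長期照顧中心(養護型) | 張軒維 | 臺中市中區成功路341號3、4樓 | 04-22250345 | 長照16 養護32 (含插2管15床) | 長照5人 養護32人 | …
4 | 東區 | 臺中市私立惠群老人養護中心 | 莊俐貞 | 臺中市東區干城里自由路三段276號5樓 | 04-22130557 | 養護47床 (含插2管23床) | 養護43人 | …
5 | NaN | 台中市私立德康老人長期照… | 林桂連 | 臺中市東區東勢里玉皇街63號3~5樓 | 04-2215-4171 | 養護49床 (含插2管24床) | 養護41人 | …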
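The header-promotion step can be tried on a small stand-in table. This is a sketch with made-up English data, not the real page; it uses `.iloc`, which selects by position, and resets the index afterwards (an optional extra step that the slides do not perform).

```python
import numpy as np
import pandas as pd

# Stand-in for a scraped table: a title row, a header row, then the data.
raw = pd.DataFrame([
    ["Report title", np.nan, np.nan],
    ["district", "name", "beds"],
    ["Central", "A", 15],
    [np.nan, "B", 32],
])

header = raw.iloc[1]                  # second row holds the real column names
table = raw.iloc[2:].copy()           # everything below it is data
table.columns = header.tolist()
table = table.reset_index(drop=True)  # optional: renumber rows from 0

print(table)
```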

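Filling Missing Values as an Example

# Forward fill: each missing district takes the last non-missing value above it.
# The slide wrote .fillna(method='ffill'); .ffill() is the equivalent call in current pandas.
careCenter['行政區域'] = careCenter['行政區域'].ffill()
# See http://pandas.pydata.org/pandas-docs/stable/missing_data.html

  | 行政區域 | 機構名稱 | 負責人 | 地址 | 電話 | 床型/床數 | 目前收容人數 | …
2 | 中區 | 台中市私立溫興長期照顧中心(養護型) | 賴興建 | 臺中市中區雙十路一段19巷6號 | 04-22230938 | 養護15床 (含插2管7床) | 養護12人 | …
3 | 中區 | 台中市私立麗安老人長期照顧中心(養護型) | 張軒維 | 臺中市中區成功路341號3、4樓 | 04-22250345 | 長照16 養護32 (含插2管15床) | 長照5人 養護32人 | …
4 | 東區 | 臺中市私立惠群老人養護中心 | 莊俐貞 | 臺中市東區干城里自由路三段276號5樓 | 04-22130557 | 養護47床 (含插2管23床) | 養護43人 | …
5 | 東區 | 台中市私立德康老人長期照… | 林桂連 | 臺中市東區東勢里玉皇街63號3~5樓 | 04-2215-4171 | 養護49床 (含插2管24床) | 養護41人 | …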
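Forward fill suits this table because the source page lists the district only on the first row of each group. A minimal sketch on a hand-made Series shows the behavior, including the edge case where there is nothing above to copy from.

```python
import numpy as np
import pandas as pd

district = pd.Series(["中區", np.nan, "東區", np.nan])

# Each NaN is replaced by the nearest non-missing value above it.
filled = district.ffill()
print(filled.tolist())

# A leading NaN has no earlier value to copy, so it stays missing.
leading = pd.Series([np.nan, "中區", np.nan]).ffill()
print(leading.tolist())
```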

Pattern Matching: Extracting Data with Regular Expressions

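Extracting Data with Regular Expressions

  | 行政區域 | 機構名稱 | 負責人 | 地址 | 電話 | 床型/床數 | 目前收容人數 | …
2 | 中區 | 台中市私立溫興長期照顧中心(養護型) | 賴興建 | 臺中市中區雙十路一段19巷6號 | 04-22230938 | 養護15床 (含插2管7床) | 養護12人 | …
3 | 中區 | 台中市私立麗安老人長期照顧中心(養護型) | 張軒維 | 臺中市中區成功路341號3、4樓 | 04-22250345 | 長照16 養護32 (含插2管15床) | 長照5人 養護32人 | …

import re

def 養護(t):
    # Return the number that follows 養護 (nursing care) in t, or None if there is none.
    # The slide text lost the backslash: the pattern must be \d+, not d+.
    match = re.findall(r'養護(\d+)', t)
    if match:
        return int(match[0])
    else:
        return None

careCenter['目前收容人數（養護）'] = careCenter['目前收容人數'].apply(養護)
careCenter['床型/床數（養護）'] = careCenter['床型/床數'].apply(養護)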
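As an alternative sketch, and not the approach used on the slide, pandas can apply the same pattern to a whole column with `Series.str.extract`, which avoids writing a helper function. The first two strings below come from the table; the third is a made-up value with no 養護 entry, added to show what happens when the pattern does not match.

```python
import pandas as pd

beds = pd.Series([
    "養護15床(含插2管7床)",
    "長照16 養護32(含插2管15床)",
    "長照20床",  # hypothetical row without a 養護 count
])

# One capture group with expand=False returns a Series of the captured text.
extracted = beds.str.extract(r"養護(\d+)", expand=False)

# Convert the captured digits to numbers; rows without a match stay NaN.
counts = pd.to_numeric(extracted)
print(counts.tolist())
```

Because unmatched rows become NaN, the resulting column is floating point rather than integer.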

Extracting Data with Regular Expressions

  | 行政區域 | 機構名稱 | 負責人 | … | 床型/床數 | 目前收容人數 | … | 床型/床數（養護） | 目前收容人數（養護）
2 | 中區 | 台中市私立溫興長期照顧中心(養護型) | 賴興建 | … | 養護15床 (含插2管7床) | 養護12人 | … | 15 | 12
3 | 中區 | 台中市私立麗安老人長期照顧中心(養護型) | 張軒維 | … | 長照16 養護32 (含插2管15床) | 長照5人 養護32人 | … | 32 | 32
4 | 東區 | 臺中市私立惠群老人養護中心 | 莊俐貞 | … | 養護47床 (含插2管23床) | 養護43人 | … | 47 | 43
5 | 東區 | 台中市私立德康…心(養護型) | 林桂連 | … | 養護49床 (含插2管24床) | 養護41人 | … | 49 | 41
6 | 東區 | 臺中市私立敬馨…心(養護型) | 吳祐萱 | … | 養護49床 (含插2管24床) | 養護4人 | … | 49 | 4

Once the data is cleaned up, the analysis work is only just beginning

Pandas and Its Amazing Companions
Pandas Ecosystem
• bokeh / ggplot2 / matplotlib: data visualization
• Statsmodels: statistical modeling
• Scikit-learn: machine learning
Other
• gensim / jieba: natural language processing

Other Benefits of Using Pandas
• Python is an easy language to learn
• Python is a general-purpose language
• Solving the "Two-Language" Problem
• Not just data analysis: you can build data products too

The End
Thanks for listening!
