12. Application - Python + Pandas

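From Big Data to Machine (Deep) Learning
Big Data Technology Learned Through Practice (실무로 배우는 빅데이터 기술)
➢ Using Python + Pandas
☆ Extension: Part 12 ☆
Kim Kang-won (김강원)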

What is Pandas?
A Python library that makes data analysis quick and convenient
※ Look up the basic concepts and features of Pandas online before starting.

Pilot Project Extension (1/2)
Python & Pandas

Pilot Project Extension (2/2)
From page 337 of the book… (revised edition)
Data exploration and analysis

Extension Exercise
Using Python + Pandas
Data Exploration and Analysis

➢ Obtain the data for exploration and analysis
Run FileZilla (FTP) > connect to Server02 > download the file
- Source path: /home/pilot-pjt/mahout-data/classification/input
- File to download: classification_dataset.txt
- Save location: D://data/
Step-1
➢ Set up and launch the Python environment
Windows Start menu > run Anaconda Prompt
> activate py35
> conda install seaborn
> jupyter notebook
Step-2
➢ Create a Jupyter Notebook
Jupyter Home > New > Python 3
Step-3

➢ Import libraries
import os.path
import pandas as pd
import numpy as np
import tensorflow as tf  # not used in the steps shown here
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
from sklearn import preprocessing  # provides MinMaxScaler for the normalization step
Step-4
➢ Load the analysis data
# Column names are passed explicitly through names=.
df = pd.read_csv("D://data/classification_dataset.txt", names=[
    "sex", "age", "marriage", "region", "job",
    "car_capacity", "car_year", "car_model",
    "tire_fl", "tire_fr", "tire_bl", "tire_br",
    "light_fl", "light_fr", "light_bl", "light_br",
    "engine_s", "break_s", "battery_s", "result"])
df.head()
Step-5
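The load step passes `names=` to `read_csv`, which only works as intended if the file has no header row (an assumption here; the slides do not show the raw file). The sketch below illustrates that behavior on a made-up in-memory sample with a shortened, illustrative column list, and adds an optional existence check for the real path. The sample rows and the `columns` list are hypothetical.

```python
import io
import os.path

import pandas as pd

# Hypothetical two-row sample standing in for the real file; the values are made up.
sample = io.StringIO("2000,2012,A,정상\n1500,2003,C,비정상\n")

# Shortened, illustrative column list; the walkthrough passes all 20 names.
columns = ["car_capacity", "car_year", "engine_s", "result"]

# names= assigns the labels and, with no header= given, treats line 1 as data.
sample_df = pd.read_csv(sample, names=columns)
print(sample_df.shape)

# For the real file, a quick existence check gives a clearer message on a path typo.
data_path = "D://data/classification_dataset.txt"
if not os.path.exists(data_path):
    print("File not found:", data_path)
```

If the real file did contain a header line, it would show up as the first data row, so `df.head()` is worth a glance after loading.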
➢ Exclude columns not used in the analysis
df = df.drop(['sex', 'age', 'marriage', 'region', 'job', 'car_model'], axis=1)
df.head()
Step-6

➢ Convert the label values
# '비정상' (abnormal) becomes 1 and '정상' (normal) becomes 0.
df.loc[df.result == '비정상', 'result'] = 1
df.loc[df.result == '정상', 'result'] = 0
df.head()
Step-7
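Assigning numbers into a string column with `.loc` typically leaves the column with `object` dtype (depending on the pandas version), in which case `describe()` reports count/unique/top/freq rather than numeric statistics. As an alternative sketch, not the method used in the slides, the labels can be mapped and cast to integers in one pass. The `demo` frame and `label_map` name are hypothetical.

```python
import pandas as pd

# Hypothetical labels; the real column comes from the data file.
demo = pd.DataFrame({"result": ["정상", "비정상", "정상"]})

# '비정상' (abnormal) -> 1, '정상' (normal) -> 0, in one pass.
label_map = {"비정상": 1, "정상": 0}
demo["result"] = demo["result"].map(label_map).astype(int)

print(demo["result"].tolist())  # [0, 1, 0]
print(demo["result"].dtype)
```

Any label missing from `label_map` becomes NaN, and `astype(int)` then raises an error, which surfaces unexpected values early.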
➢ Explore the label variable
print("# Abnormal")
print(df.result[df.result == 1].describe())
print("====================================")
print("# Normal")
print(df.result[df.result == 0].describe())
Step-8
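For checking how the two classes are balanced, `value_counts` gives a more compact view than two `describe()` calls. This is a small sketch on made-up labels, offered as a complement to the step above rather than a replacement for it.

```python
import pandas as pd

# Hypothetical labels after conversion: 1 = abnormal, 0 = normal.
labels = pd.Series([0, 0, 1, 0, 1, 0, 0, 0], name="result")

counts = labels.value_counts()                # rows per class
shares = labels.value_counts(normalize=True)  # fraction of rows per class

print(counts[1], counts[0])  # 2 6
print(shares[1])             # 0.25
```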
➢ Explore the abnormal data
# Distribution of car_year among records labeled abnormal (result == 1).
f, ax1 = plt.subplots(1, 1, sharex=True, figsize=(12, 4))
ax1.hist(df.car_year[df.result == 1], bins=50)
ax1.set_title('Bad Condition')
plt.ylabel('Number of Records')
plt.show()
Step-9

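➢ Data conversion: engine / brake
# Convert the grades A, B, C to the numeric scores 0, 50, 100.
# ('break_s' is the column name used throughout for the brake status.)
df.loc[df.engine_s == 'A', 'engine_s'] = 0
df.loc[df.engine_s == 'B', 'engine_s'] = 50
df.loc[df.engine_s == 'C', 'engine_s'] = 100
df.loc[df.break_s == 'A', 'break_s'] = 0
df.loc[df.break_s == 'B', 'break_s'] = 50
df.loc[df.break_s == 'C', 'break_s'] = 100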
Step-10
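The six assignments in the grade conversion can also be expressed as one lookup table applied to both columns. The sketch below assumes the only grades present are A, B, and C, and it checks that assumption explicitly. The `demo` frame and `grade_scores` name are hypothetical.

```python
import pandas as pd

# Hypothetical grades; column names follow the walkthrough (including 'break_s').
demo = pd.DataFrame({"engine_s": ["A", "C", "B"], "break_s": ["B", "A", "C"]})

# Same scores as in the walkthrough: A -> 0, B -> 50, C -> 100.
grade_scores = {"A": 0, "B": 50, "C": 100}
for col in ["engine_s", "break_s"]:
    demo[col] = demo[col].map(grade_scores)
    # map() yields NaN for any grade missing from the table, so verify none slipped in.
    assert not demo[col].isna().any(), "unmapped grade in " + col

print(demo["engine_s"].tolist())  # [0, 100, 50]
print(demo["break_s"].tolist())   # [50, 0, 100]
```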
➢ Normalize the data
# Scale every column to the range [0, 1] with min-max scaling.
dValues = df.values
min_max_scaler = preprocessing.MinMaxScaler()
dValues_scaled = min_max_scaler.fit_transform(dValues)
df = pd.DataFrame(dValues_scaled, columns=[
    "car_capacity", "car_year", "tire_fl", "tire_fr",
    "tire_bl", "tire_br", "light_fl", "light_fr", "light_bl", "light_br",
    "engine_s", "break_s", "battery_s", "result"])
Step-11
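With its default settings, `MinMaxScaler` rescales each column as (x - min) / (max - min). The sketch below reproduces that arithmetic in plain pandas on made-up values so the effect of the normalization step is easy to see. The `demo` frame is hypothetical.

```python
import pandas as pd

# Hypothetical values; the real frame has 14 columns at this point.
demo = pd.DataFrame({"car_year": [2000, 2005, 2010], "engine_s": [0, 50, 100]})

# Min-max scaling per column: (x - min) / (max - min), giving values in [0, 1].
scaled = (demo - demo.min()) / (demo.max() - demo.min())

print(scaled["car_year"].tolist())  # [0.0, 0.5, 1.0]
print(scaled["engine_s"].tolist())  # [0.0, 0.5, 1.0]
```

One difference to keep in mind: a column with a single constant value produces NaN in this pandas version of the formula, whereas `MinMaxScaler` guards against the zero range.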

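➢ Assess the influence of each variable (feature)
# The first 13 columns are features; the last column ('result') is the label.
v_features = df.iloc[:, 0:13].columns
# One subplot row per feature.
plt.figure(figsize=(12, len(v_features) * 4))
gs = gridspec.GridSpec(len(v_features), 1)
for i, cn in enumerate(df[v_features]):
    ax = plt.subplot(gs[i])
    # Red: abnormal (result == 1); default color: normal (result == 0).
    # distplot is deprecated in newer seaborn releases, which offer histplot instead.
    sns.distplot(df[cn][df.result == 1], bins=50, color='red')
    sns.distplot(df[cn][df.result == 0], bins=50)
    ax.set_xlabel('')
    ax.set_title('histogram of feature: ' + str(cn))
plt.show()
Step-12
➢ Exclude low-influence variables
df = df.drop(['tire_fl', 'tire_fr', 'tire_bl', 'tire_br',
              'light_fl', 'light_fr', 'light_bl', 'light_br'], axis=1)
df.head()
Step-13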
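As a numeric cross-check on which features separate the abnormal and normal classes, the per-class means of the normalized features can be compared. This is a rough sketch that is not part of the slides: a gap in means is only a crude proxy for influence and can miss differences in the shape of the distributions. The `demo` data is made up.

```python
import pandas as pd

# Hypothetical normalized data; 'result' is 1.0 for abnormal and 0.0 for normal.
demo = pd.DataFrame({
    "car_year":  [0.1, 0.2, 0.9, 0.8],
    "battery_s": [0.5, 0.5, 0.5, 0.5],
    "result":    [0.0, 0.0, 1.0, 1.0],
})

# Mean of each feature within each class, then the absolute gap between classes.
class_means = demo.groupby("result").mean()
gap = (class_means.loc[1.0] - class_means.loc[0.0]).abs().sort_values(ascending=False)

print(gap.index.tolist())  # ['car_year', 'battery_s']
```

Features near the bottom of `gap` behave similarly in both classes on average, which makes them candidates for exclusion once the histograms have been checked as well.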
➢ Save the notebook
File > Rename > "12.Python&Pandas Usage"
Step-14

Big Data Technology Learned Through Practice (실무로 배우는 빅데이터 기술)
Extension Part 12: Using Python + Pandas
[ Lecture materials ]
➢ Video: YouTube www.youtube.com
➢ Practice document: SlideShare www.slideshare.net
