System Capstone Design, Team 1

System Capstone Design: Team #1
201601376 김찬조
201601380 박경수
201601381 박재서
201601388 이석진
201802884 노성만

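Competition
• A competition hosted by DACON
• Goal: develop an algorithm that predicts how delinquent a user's credit-card payments are, based on credit-card user data
• Permitted tools: R or Python (Google Colab)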

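Development environment (Google Colab)
# Cloud-based Jupyter notebook
# Deep learning libraries come preinstalled → highly accessible
# Expensive GPUs can be used for free
# Integrates with GitHub and similar services, so source code can be developed freely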

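Data
train.csv, test.csv
- Data provided by DACON
• train.csv contains 20 variables, including index and credit
• test.csv contains 19 variables; the credit variable is excluded
→ Identify the correlations between variables and their importance in train.csv, then predict credit
→ Save the predicted credit to the sample_submission file and submit it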

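Evaluation metric: LogLoss
# Log Loss
The metric directly reflects the probability values predicted by the model.
The probability assigned to the correct class is passed through the negative log function, and the transformed value is the score.
-> The purpose is to impose a larger penalty the more wrong a prediction is.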

# Log Loss
Model   Correct answers   Probability of the correct answer   Log Loss
A       1                 0.99                                -log(0.99) = 0.01005
B       1                 0.2                                 -log(0.2)  = 1.6094
(log is the natural logarithm)

# Log Loss
Actual labels (one-hot)
Sample   0      1      2
1        1      0      0
2        0      0      1
3        0      1      0
4        0      0      1

Predicted values
Sample   0      1      2
1        0.99   0.01   0
2        0.3    0.1    0.6
3        0.4    0.5    0.1
4        0.2    0.2    0.8
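The multi-class log loss for the two tables above can be sketched as follows. This is a minimal NumPy illustration, not the competition's scoring code: the function name multiclass_log_loss and the clipping constant eps are assumptions, and the predicted values are used exactly as they appear on the slide (row 4 does not sum to 1 and is not renormalized here).

```python
import numpy as np

# One-hot actual labels and predicted values copied from the slide.
y_true = np.array([[1, 0, 0],
                   [0, 0, 1],
                   [0, 1, 0],
                   [0, 0, 1]], dtype=float)
y_pred = np.array([[0.99, 0.01, 0.0],
                   [0.3, 0.1, 0.6],
                   [0.4, 0.5, 0.1],
                   [0.2, 0.2, 0.8]])


def multiclass_log_loss(y_true, y_pred, eps=1e-15):
    """Return (mean loss, per-sample loss); eps is an assumed clipping constant."""
    # Clip so that log(0) never occurs for classes predicted with probability 0.
    clipped = np.clip(y_pred, eps, 1 - eps)
    # Only the probability assigned to the true class contributes to each row.
    per_sample = -np.sum(y_true * np.log(clipped), axis=1)
    return per_sample.mean(), per_sample


loss, per_sample = multiclass_log_loss(y_true, y_pred)
print(per_sample)  # penalty of each sample
print(loss)        # average over the four samples
```

Sample 1, where the correct class received 0.99, contributes almost nothing, while sample 3 (0.5 on the correct class) contributes the most.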

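Variable descriptions
# DAYS_BIRTH
• The values are negative, and the variance of age is large.

# DAYS_EMPLOYED
• Takes both positive and negative values: negative if the person is employed, positive if the person is not.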
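A small sketch of how these two columns could be turned into readable quantities. The rows below are made-up values, the derived column names (age_years, is_employed, employed_years) are hypothetical, and the code assumes that a negative value counts days backward from a reference date.

```python
import pandas as pd

# Hypothetical rows; only the column names DAYS_BIRTH and DAYS_EMPLOYED come from the slides.
df = pd.DataFrame({
    "DAYS_BIRTH": [-13899, -11380, -19087],
    "DAYS_EMPLOYED": [-4709, -1540, 365243],
})

# Assumption: negative values count days backward, so the sign is flipped first.
df["age_years"] = (-df["DAYS_BIRTH"]) // 365
# Negative DAYS_EMPLOYED means employed, positive means not employed (as stated on the slide).
df["is_employed"] = df["DAYS_EMPLOYED"] < 0
# Positive (not employed) values are clipped to 0 years of employment.
df["employed_years"] = (-df["DAYS_EMPLOYED"]).clip(lower=0) // 365

print(df)
```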

# Data preprocessing
• Gender = sex ['F', 'M']
• Car = car ownership ['N', 'Y']
• Reality = real-estate ownership ['N', 'Y']
• Child_num = number of children

• income_type = ['Commercial associate', 'Working', 'State servant', 'Pensioner', 'Student']
• edu_type = ['Higher education', 'Secondary / secondary special', 'Incomplete higher', 'Lower secondary', 'Academic degree']
• family_type = ['Married', 'Civil marriage', 'Separated', 'Single / not married', 'Widow']
• house_type = ['Municipal apartment', 'House / apartment', 'With parents', 'Co-op apartment', 'Rented apartment', 'Office apartment']

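An anomaly found in the data
Some rows with different index values have identical values in every variable except begin_month and credit.
→ This means that one person opened several cards.
1) Analyzing without removing the duplicated rows
• If most people have the same credit across their rows: begin_month carries no information
• If a person's credit differs across rows: begin_month can be regarded as having a large effect
2) Analyzing after removing the duplicated rows
• 14,358 rows are removed → only 12,099 rows remain
• In addition, it is unclear how begin_month should be handled when rows are removed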
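The duplicate check described above can be sketched with pandas. The miniature table is made up for illustration (the real train.csv has 20 variables), and the names person_cols, dup_mask, and cards_per_person are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of train.csv: rows 0 and 1 describe the same person.
train = pd.DataFrame({
    "index": [0, 1, 2, 3],
    "gender": ["F", "F", "M", "F"],
    "income_total": [202500.0, 202500.0, 450000.0, 157500.0],
    "DAYS_BIRTH": [-13899, -13899, -19087, -15088],
    "begin_month": [-6.0, -22.0, -5.0, -37.0],
    "credit": [1.0, 2.0, 2.0, 0.0],
})

# Compare rows on every variable except index, begin_month and credit.
person_cols = [c for c in train.columns if c not in ("index", "begin_month", "credit")]

# True for every row that repeats an earlier person.
dup_mask = train.duplicated(subset=person_cols, keep="first")
n_removed = int(dup_mask.sum())
deduplicated = train[~dup_mask]

# Number of rows (cards) per distinct person.
cards_per_person = train.groupby(person_cols).size()

print(n_removed, len(deduplicated))
print(cards_per_person)
```

Counting dup_mask shows how many rows option 2) would discard before deciding whether losing the begin_month information is acceptable.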

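Algorithm: LGBM
# Algorithm
What is LGBM (LightGBM)?
LightGBM uses leaf-wise tree growth: it keeps splitting the leaf whose split reduces the loss the most (max delta loss), instead of growing the tree level by level.
It is recommended when the data has more than 10,000 rows.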
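To make the leaf-wise idea concrete, here is a toy best-first regression tree on one-dimensional data. It is only a sketch of the growth strategy, not LightGBM's implementation: the functions best_split and grow_leaf_wise are hypothetical, and "gain" is assumed to be the reduction in squared error.

```python
import heapq
import numpy as np


def best_split(x, y, idx):
    """Return (gain, threshold) of the best split of the samples in idx."""
    order = idx[np.argsort(x[idx])]
    xs, ys = x[order], y[order]
    total_sse = np.sum((ys - ys.mean()) ** 2)
    best_gain, best_thr = 0.0, None
    for i in range(1, len(ys)):
        if xs[i] == xs[i - 1]:
            continue  # cannot split between identical feature values
        left, right = ys[:i], ys[i:]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        gain = total_sse - sse
        if gain > best_gain:
            best_gain, best_thr = gain, (xs[i] + xs[i - 1]) / 2
    return best_gain, best_thr


def grow_leaf_wise(x, y, max_leaves):
    """Best-first growth: always split the leaf whose best split has the largest gain."""
    root = np.arange(len(x))
    gain, thr = best_split(x, y, root)
    heap = [(-gain, 0, root, thr)]  # max-heap via negated gain; the counter breaks ties
    leaves, counter = [], 1
    while heap and len(heap) + len(leaves) < max_leaves:
        _, _, idx, thr = heapq.heappop(heap)
        if thr is None:  # this leaf cannot be improved any further
            leaves.append(idx)
            continue
        for child in (idx[x[idx] <= thr], idx[x[idx] > thr]):
            g, t = best_split(x, y, child)
            heapq.heappush(heap, (-g, counter, child, t))
            counter += 1
    leaves.extend(item[2] for item in heap)
    return leaves


x = np.arange(8, dtype=float)
y = np.array([0, 0, 0, 0, 0, 0, 5, 10], dtype=float)
leaves = grow_leaf_wise(x, y, max_leaves=3)
print([leaf.tolist() for leaf in leaves])
```

With three leaves allowed, both splits are spent on the side where the targets vary, and the six constant samples stay in one leaf. A level-wise tree would spend part of its budget splitting that constant side as well.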

Model - LGBM
# Drop unused columns
drop_cols = ["FLAG_MOBIL", "child_num", "email"]
train = train.drop(columns=drop_cols)
test = test.drop(columns=drop_cols)
# Encode the binary categorical columns as 0/1
for df in (train, test):
    df['gender'] = df['gender'].replace(['F', 'M'], [0, 1])
    df['car'] = df['car'].replace(['N', 'Y'], [0, 1])
    df['reality'] = df['reality'].replace(['N', 'Y'], [0, 1])

# Assign higher values to higher education levels
edu_levels = ['Lower secondary', 'Secondary / secondary special', 'Incomplete higher',
              'Higher education', 'Academic degree']
train['edu_type'] = train['edu_type'].replace(edu_levels, [1, 2, 3, 4, 5])
test['edu_type'] = test['edu_type'].replace(edu_levels, [1, 2, 3, 4, 5])
# Collect the string-typed columns and fit a one-hot encoder on them
from sklearn.preprocessing import OneHotEncoder
object_col = []
for col in train.columns:
    if train[col].dtype == 'object':
        object_col.append(col)
enc = OneHotEncoder()
enc.fit(train.loc[:, object_col])

# One-hot encode train and test with the fitted encoder
# (get_feature_names was removed in scikit-learn 1.2; get_feature_names_out replaces it)
train_onehot_df = pd.DataFrame(enc.transform(train.loc[:, object_col]).toarray(),
                               columns=enc.get_feature_names_out(object_col))
train.drop(object_col, axis=1, inplace=True)
train = pd.concat([train, train_onehot_df], axis=1)
test_onehot_df = pd.DataFrame(enc.transform(test.loc[:, object_col]).toarray(),
                              columns=enc.get_feature_names_out(object_col))
test.drop(object_col, axis=1, inplace=True)
test = pd.concat([test, test_onehot_df], axis=1)
# Standardize the large-valued columns (subtract the mean, divide by the standard deviation)
train['DAYS_BIRTH'] = (train['DAYS_BIRTH'] - train['DAYS_BIRTH'].mean()) / train['DAYS_BIRTH'].std()
train['DAYS_EMPLOYED'] = (train['DAYS_EMPLOYED'] - train['DAYS_EMPLOYED'].mean()) / train['DAYS_EMPLOYED'].std()
train['begin_month'] = (train['begin_month'] - train['begin_month'].mean()) / train['begin_month'].std()
……

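StratifiedKFold is used so that every fold has a similar class distribution, and LGBM is then trained on the folds.
Training stops early if there is no improvement for 30 rounds.
Five folds were trained and saved.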
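The stratified split can be sketched with scikit-learn as follows. The labels are a made-up imbalanced sample of the three credit classes, and the model fitting itself is left out, so this only illustrates how StratifiedKFold keeps the class proportions equal across the five folds.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels for the three credit classes (0, 1, 2).
y = np.array([0] * 10 + [1] * 20 + [2] * 70)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_class_counts = []
for fold, (train_idx, valid_idx) in enumerate(skf.split(X, y)):
    # Class counts inside the validation part of this fold.
    counts = np.bincount(y[valid_idx], minlength=3)
    fold_class_counts.append(counts.tolist())
    # A model would be fit on X[train_idx] and validated on X[valid_idx] here.
    print(fold, counts)
```

Every validation fold keeps the 10 : 20 : 70 ratio of the full label set, which a plain random split does not guarantee.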

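Algorithm: DNN (Deep Neural Network)
# Algorithm
What is a DNN (Deep Neural Network)?
A neural network with several hidden layers between the input layer and the output layer. It is mainly used for classification and numeric prediction.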
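As an illustration of "several hidden layers between the input and output layers", here is a forward pass written in plain NumPy. The layer sizes 31, 450, 200, and 3 follow the Keras model shown later in the slides; the weights are random placeholders, so the output is only meant to show the data flow, not a trained prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Input of 31 features, two hidden layers, 3 output classes.
layer_sizes = [31, 450, 200, 3]
# Hypothetical small random weights; a real model would learn these.
weights = [rng.uniform(-0.05, 0.05, size=(m, n))
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]


def relu(z):
    return np.maximum(z, 0.0)


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row maximum for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def forward(x):
    """Pass a batch through the hidden layers (ReLU) and the output layer (softmax)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)
    return softmax(h @ weights[-1] + biases[-1])


batch = rng.random((4, 31))  # four hypothetical samples
probs = forward(batch)
print(probs)
```

Each row of probs holds three class probabilities that sum to 1, which is the form the LogLoss metric expects.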

Model - DNN
# Min-max scale DAYS_BIRTH, DAYS_EMPLOYED, begin_month and family_size (family_size shown here)
import numpy as np
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(family_size_1)
family_size_1 = scaler.transform(family_size_1)
# Extract a column, then one-hot encode it
onehot_encoder = preprocessing.OneHotEncoder()
Label = family_type_1
print(Label.shape)
Label = np.array(Label)
Label = Label.reshape(-1, 1)  # reshape returns a new array, so the result is assigned back
Onehot_Label = onehot_encoder.fit_transform(Label)
Onehot_family_type_1_Label = Onehot_Label.toarray()
# Stack the extracted arrays into one feature matrix
x_2 = np.hstack([Onehot_gender_1_Label, Onehot_car_1_Label, Onehot_reality_1_Label,
                 Onehot_income_type_1_Label, Onehot_edu_type_1_Label, Onehot_family_type_1_Label,
                 Onehot_house_type_1_Label, family_size_1])
x = np.column_stack([x_2, x_1]).astype(float)  # np.float was removed in NumPy 1.24; use float
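What the scaler and the encoder compute can be written out directly in NumPy. This is a sketch on made-up values, not the team's pipeline: it spells out the min-max formula and builds the one-hot columns by hand before stacking them.

```python
import numpy as np

# Hypothetical values for one numeric and one categorical column.
family_size = np.array([[2.0], [3.0], [1.0], [4.0]])
family_type = np.array(["Married", "Single / not married", "Married", "Widow"])

# Min-max scaling: (x - min) / (max - min), which maps the column onto [0, 1].
col_min = family_size.min(axis=0)
col_max = family_size.max(axis=0)
scaled = (family_size - col_min) / (col_max - col_min)

# One-hot encoding: one column per distinct category, in sorted order.
categories, codes = np.unique(family_type, return_inverse=True)
onehot = np.eye(len(categories))[codes]

# Stack the encoded and scaled columns side by side into one feature matrix.
x = np.hstack([onehot, scaled]).astype(float)
print(categories)
print(x)
```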

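# Modeling & Apply
# Build the model (the first 21,160 rows are used as the initial training input)
# Assumed import path; the slides do not show the imports
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
model = Sequential()
model.add(Dense(450, input_dim=31, kernel_initializer="uniform", activation='relu'))
model.add(Dense(200, kernel_initializer="uniform", activation='relu'))
model.add(Dense(3, kernel_initializer="uniform", activation='softmax'))
model.summary()
# Split the data: the first 21,160 rows for training, the rest for validation
# (np.float was removed in NumPy 1.24; the built-in float is used instead)
train_x = x[:21160, :].astype(float)
val_x = x[21160:, :].astype(float)
train_y = train_credit[:21160, :].astype(float)
val_y = train_credit[21160:, :].astype(float)
test_x = test.astype(float)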

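# Modeling & Apply
# Training: stop when val_loss has not improved for 10 epochs
# Assumed import path; the slides do not show the imports
from tensorflow.keras.callbacks import EarlyStopping
early_stopping = EarlyStopping(monitor='val_loss', patience=10, verbose=1, mode='auto')
hist = model.fit(train_x, train_y, validation_data=(val_x, val_y), epochs=1000, batch_size=64,
                 callbacks=[early_stopping])
# Predict the class probabilities for the test set
prob = model.predict(test_x)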
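The patience rule behind early stopping can be sketched in a few lines of plain Python. The function early_stopping_epoch and the loss values are hypothetical; the sketch only mirrors the idea of stopping once the validation loss has failed to improve for "patience" consecutive epochs.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training stops, or None if it never stops."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses: the best value is reached at epoch 3.
losses = [0.90, 0.85, 0.83, 0.84, 0.86, 0.85]
stopped_at = early_stopping_epoch(losses, patience=3)
print(stopped_at)
```

The same rule applies to the LGBM run, where training stops after 30 rounds without improvement.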

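# What is Catboost?
CatBoost is short for "Categorical Boost". It is an open-source machine learning library developed by Yandex.
It is used by other organizations, including CERN, Cloudflare and Careem taxi, for search, recommendation systems, personal assistants, self-driving cars, weather forecasting and many other tasks.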

• Great quality without parameter tuning / Categorical features support
Categorical features normally have to be preprocessed, for example with one-hot encoding, before they can be used. Catboost converts them automatically, without any extra work by the user.
• Improved accuracy
A new gradient-boosting scheme reduces overfitting when the model is built.
• Fast prediction
Training takes longer than with other GBDT libraries, but prediction is about 13-16 times faster.
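One idea behind CatBoost's automatic handling of categorical features is the ordered target statistic: each row is encoded using only the targets of rows that come before it in a random permutation. The sketch below is a simplified illustration for a binary target, not CatBoost's actual implementation; the function name and the prior and weight parameters are assumptions.

```python
import random


def ordered_target_statistics(categories, targets, prior=0.5, weight=1.0, seed=0):
    """Encode each category value from the targets of earlier rows in a random order."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)  # random permutation of the rows
    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for i in order:
        cat = categories[i]
        s, n = sums.get(cat, 0.0), counts.get(cat, 0)
        # Smoothed mean of the targets seen so far; the row's own target is not used.
        encoded[i] = (s + weight * prior) / (n + weight)
        sums[cat] = s + targets[i]
        counts[cat] = n + 1
    return encoded


# Hypothetical categorical column and binary target.
house_type = ["House / apartment", "With parents", "House / apartment", "House / apartment"]
target = [1, 0, 1, 0]
encoded = ordered_target_statistics(house_type, target)
print(encoded)
```

Because a row never sees its own target, the encoding leaks less label information than a plain per-category mean, which is one way overfitting is reduced.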

Model - Catboost
# Handle missing values
train.fillna('NaN', inplace=True)
test.fillna('NaN', inplace=True)
train = train.drop(index=14900)  # drop(14900, 0) with a positional axis is rejected by pandas 2.x
# Remove rows whose family_size is greater than 7
train = train[(train['family_size'] <= 7)]
train = train.reset_index(drop=True)
# Positive values are treated as unemployed and set to 0
train['DAYS_EMPLOYED'] = train['DAYS_EMPLOYED'].map(lambda x: 0 if x > 0 else x)
test['DAYS_EMPLOYED'] = test['DAYS_EMPLOYED'].map(lambda x: 0 if x > 0 else x)

# Convert negative values to positive values
feats = ['DAYS_BIRTH', 'begin_month', 'DAYS_EMPLOYED']
for feat in feats:
    train[feat] = np.abs(train[feat])
    test[feat] = np.abs(test[feat])
# Drop unneeded columns (the slide assigned the result to "Train", which left train unchanged)
train = train.drop(columns=["FLAG_MOBIL", "child_num"])
test = test.drop(columns=["FLAG_MOBIL", "child_num"])

for df in [train, test]:
    # Derived from DAYS_BIRTH: Age
    df['Age'] = df['DAYS_BIRTH'] // 365
    # Derived from DAYS_EMPLOYED:
    #   EMPLOYED (years of service),
    #   DAYS_EMPLOYED_m (month of employment),
    #   DAYS_EMPLOYED_w (week of employment, the n-th week of the employment year;
    #                    the formula below takes the week count modulo 4)
    df['EMPLOYED'] = df['DAYS_EMPLOYED'] // 365
    df['DAYS_EMPLOYED_m'] = np.floor(df['DAYS_EMPLOYED'] / 30) - ((np.floor(df['DAYS_EMPLOYED'] / 30) / 12).astype(int) * 12)
    df['DAYS_EMPLOYED_w'] = np.floor(df['DAYS_EMPLOYED'] / 7) - ((np.floor(df['DAYS_EMPLOYED'] / 7) / 4).astype(int) * 4)
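The two floor-based expressions are a modulo in disguise. The sketch below checks this on a few made-up, non-negative day counts: the month feature equals the 30-day month count modulo 12, and the week feature equals the 7-day week count modulo 4.

```python
import numpy as np

# Hypothetical non-negative day counts (the slides take absolute values beforehand).
days = np.array([0, 45, 400, 4709])

# Expressions as written on the slide.
month_slide = np.floor(days / 30) - ((np.floor(days / 30) / 12).astype(int) * 12)
week_slide = np.floor(days / 7) - ((np.floor(days / 7) / 4).astype(int) * 4)

# Equivalent modulo form.
month_mod = (days // 30) % 12
week_mod = (days // 7) % 4

print(month_slide, month_mod)
print(week_slide, week_mod)
```

The modulo form is shorter, and it makes visible that the week feature cycles every four weeks rather than running over a full year.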

    # ability: income / (days lived + days employed)
    df['ability'] = df['income_total'] / (df['DAYS_BIRTH'] + df['DAYS_EMPLOYED'])
    # income_mean: income_total per family member
    df['income_mean'] = df['income_total'] / df['family_size']
    # ID: concatenate the column values to identify each unique person
    # (begin_month is excluded because one person may have opened several cards)
    df['ID'] = (
        df['income_total'].astype(str) + '_' +
        df['DAYS_BIRTH'].astype(str) + '_' + df['DAYS_EMPLOYED'].astype(str) + '_' +
        df['work_phone'].astype(str) + '_' + df['phone'].astype(str) + '_' +
        df['email'].astype(str) + '_' + df['family_size'].astype(str) + '_' +
        df['gender'].astype(str) + '_' + df['car'].astype(str) + '_' +
        df['reality'].astype(str) + '_' + df['income_type'].astype(str) + '_' +
        df['edu_type'].astype(str) + '_' + df['family_type'].astype(str) + '_' +
        df['house_type'].astype(str) + '_' + df['occyp_type'].astype(str)
    )
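The effect of the ID key can be seen on a tiny made-up table. This sketch uses only three of the columns and a hypothetical helper column cards_per_id; it shows that rows differing only in begin_month receive the same ID.

```python
import pandas as pd

# Hypothetical rows: the first two differ only in begin_month.
df = pd.DataFrame({
    "gender": ["F", "F", "M"],
    "income_total": [202500.0, 202500.0, 450000.0],
    "DAYS_BIRTH": [13899, 13899, 19087],
    "begin_month": [6.0, 22.0, 5.0],
})

# Join the person-level columns into one string key, leaving begin_month out.
id_cols = ["gender", "income_total", "DAYS_BIRTH"]
df["ID"] = df[id_cols].astype(str).agg("_".join, axis=1)

# Number of rows (cards) that share each ID.
df["cards_per_id"] = df.groupby("ID")["ID"].transform("count")
print(df[["ID", "begin_month", "cards_per_id"]])
```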

for df in [train, test]:
    df['income_total'] = df['income_total'] / 10000
# Encode the categorical variables with an ordinal encoder; ID is cast to an integer after encoding
# (the import of OrdinalEncoder is not shown on the slide; it is presumably the category_encoders one)
encoder = OrdinalEncoder(categorical_feature)
train[categorical_feature] = encoder.fit_transform(train[categorical_feature], train['credit'])
test[categorical_feature] = encoder.transform(test[categorical_feature])
train['ID'] = train['ID'].astype('int64')
test['ID'] = test['ID'].astype('int64')

# Model settings
n_est = 2000
seed = 42
n_fold = 15
n_class = 3
target = 'credit'
X = train.drop(target, axis=1)
y = train[target]
X_test = test

# Feature importance
• The newly derived variable 'ID' shows a very high importance

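# Conclusion
• Catboost outperformed LGBM because the 'ID' variable was created and cat_feature was tuned
• DNN is not well suited because the data is structured (tabular)
• Creating new variables from the existing data and understanding the model in depth are important
# Final conclusion
<Catboost>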
