No55 tokyo r_presentation

55th Study Meeting @ Tokyo (#TokyoR), 30/07/2016
I Want to Choose the Cutoff Value Myself
~ Survival Analysis and the ROC Curve
@fuuuumin314

About me
• ID: @fuuuumin314
• Fifth-year university student
• How I use R: mainly as a data analysis tool in my lab
• R experience: 1 year (beginner)
• TokyoR attendance: second time

Note
• I taught myself statistics, machine learning, and R, so I would appreciate it if you pointed out any mistakes or possible improvements

Data used in this talk
• Melanoma data {MASS}
• Survival data on 205 Danish patients with malignant melanoma
• Survival time, age, sex, tumour thickness (mm), presence of ulceration

Image source: www.skincancer.org

Packages
• library(ggplot2) ← the usual choice for plotting
• library(dplyr) ← the usual choice for preprocessing
• library(MASS) ← data, stepwise AIC
• library(survival) ← survival analysis
• library(survminer) ← survival plots
• library(survivalROC) ← ROC for survival analysis

Preprocessing
## reading and preparing data
df <- Melanoma
df$status <- as.factor(df$status)
df$ulcer <- as.factor(df$ulcer)
df1 <- df %>%
  ## excluding cases that died from other causes (status == 3)
  dplyr::filter(!(status == 3)) %>%
  ## event indicator: alive = 0, died from melanoma = 1
  dplyr::mutate(cens = ifelse(status == 1, 1, 0))
df1$cens <- as.integer(df1$cens)

head(df1)
> head(df1)
  time status sex age year thickness ulcer cens
1   35      2   1  41 1977      1.34     0    0
2  185      1   1  52 1965     12.08     1    1
3  204      1   1  28 1971      4.84     1    1
4  210      1   1  77 1972      5.16     1    1
5  232      1   1  49 1968     12.88     1    1
6  279      1   0  68 1971      7.41     1    1
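As a quick sanity check on the recoding, the sketch below rebuilds the analysis set with base R and counts rows and events. It assumes only the MASS package; the names `dat` and `event` are illustrative and do not appear in the slides. The counts should agree with the n and number of events that coxph reports later.

```r
library(MASS)  # provides the Melanoma data

# status: 1 = died from melanoma, 2 = alive, 3 = died from other causes
dat <- subset(Melanoma, status != 3)
dat$event <- as.integer(dat$status == 1)

nrow(dat)          # number of patients kept
sum(dat$event)     # number of melanoma deaths (events)
table(dat$status)  # remaining status codes
```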

Univariate analysis: code
## Kaplan-Meier curves by ulceration
ulcer_fit <- survfit(Surv(time, cens) ~ ulcer, data = df1)
ulcer_p <- ggsurvplot(fit = ulcer_fit,
                      risk.table = TRUE,
                      pval = TRUE,
                      conf.int = TRUE)
ulcer_p

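Multivariate analysis: Cox proportional hazards model
$h(t \mid x) = h_0(t)\exp(\beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k)$
• Estimates by what factor the risk of the event (the hazard ratio) changes, at each time point, when a covariate $x_k$ increases by 1
• Assumes that the hazard ratio is constant over time
→ checked by residual analysis (omitted)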
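The residual analysis omitted here is commonly done with scaled Schoenfeld residuals. The following is a minimal sketch, assuming `cox.zph()` from the survival package and the same covariates as the model fitted in this talk; `dat`, `event`, `fit` and `zph` are illustrative names, not the author's code.

```r
library(MASS)
library(survival)

dat <- subset(Melanoma, status != 3)
dat$event <- as.integer(dat$status == 1)
dat$ulcer <- factor(dat$ulcer)

fit <- coxph(Surv(time, event) ~ sex + age + thickness + ulcer, data = dat)

# Test of proportional hazards based on scaled Schoenfeld residuals;
# small p-values suggest that a covariate's effect changes over time
zph <- cox.zph(fit)
print(zph)

# One panel per covariate: a roughly flat smooth supports the assumption
# plot(zph)
```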

Multivariate analysis: code
## multivariate
cox_fit <- coxph(Surv(time, cens) ~ sex + age + thickness + ulcer,
                 data = df1)
## variable selection (stepwise, by AIC)
cox_fit2 <- stepAIC(cox_fit)
summary(cox_fit2)

Multivariate analysis: result
> summary(cox_fit2, digits = 3)
Call:
coxph(formula = Surv(time, cens) ~ sex + age + thickness + ulcer,
    data = df1)
  n= 191, number of events= 57
             coef exp(coef) se(coef)     z Pr(>|z|)
sex       0.45794   1.58082  0.26832 1.707 0.087873 .
age       0.01429   1.01439  0.00834 1.713 0.086634 .
thickness 0.11137   1.11781  0.03725 2.990 0.002790 **
ulcer1    1.13193   3.10162  0.30961 3.656 0.000256 ***
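To read this table, exp(coef) is the hazard ratio for a one-unit increase in the covariate. The sketch below recomputes the hazard ratios from the printed coefficients, adds approximate 95% Wald intervals (coef ± 1.96 × se), and shows the ratio for a 2 mm increase in thickness. The numbers are copied from the output above; the 2 mm step and the variable names are arbitrary illustrations.

```r
# Coefficients and standard errors copied from the summary output
coef_hat <- c(sex = 0.45794, age = 0.01429, thickness = 0.11137, ulcer1 = 1.13193)
se_hat   <- c(sex = 0.26832, age = 0.00834, thickness = 0.03725, ulcer1 = 0.30961)

# Hazard ratios with approximate 95% Wald confidence limits
hr    <- exp(coef_hat)
lower <- exp(coef_hat - qnorm(0.975) * se_hat)
upper <- exp(coef_hat + qnorm(0.975) * se_hat)
round(cbind(HR = hr, lower = lower, upper = upper), 3)

# Hazard ratio for a 2 mm increase in thickness: exp(2 * beta)
hr_2mm <- exp(2 * coef_hat[["thickness"]])
hr_2mm
```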

So, how well does thickness predict outcome?
> range(df1$thickness)
[1] 0.10 17.42
p <- ggplot(data = df1, aes(x = thickness)) +
  geom_histogram(binwidth = 1.0)
p <- p + ggtitle("histogram of thickness by 1.0 mm")
p <- p + theme_bw()
p

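About the ROC curve (1)
Value  True class
16     T
15     T
14     F
13     T
12     T
11     T
10     F
9      T
8      T
8      T
8      T
8      F
7      F
6      T
5      F
For example, if values of 11 or more are called positive:
           Truly T  Truly F
Positive   5        1
Negative   5        4
(From Prof. Haruhiko Okumura's website)
Sensitivity = proportion of truly T cases that test positive
= 5 / (5 + 5) = 50%
Specificity = proportion of truly F cases that test negative
= 4 / (1 + 4) = 80%
1 - specificity = false positive rate (FP)
The ROC curve is the plot of sensitivity against the false positive rate (1 - specificity) as this cut-off value is varied.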
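The worked example can be reproduced in a few lines of base R. This sketch uses the toy values from the slide (not the melanoma data) and treats a value at or above the cut-off as positive; the variable names are illustrative.

```r
# Toy data from the slide: marker values and true class (TRUE = T, FALSE = F)
value <- c(16, 15, 14, 13, 12, 11, 10, 9, 8, 8, 8, 8, 7, 6, 5)
truth <- c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE,
           TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)

cutoff   <- 11
positive <- value >= cutoff

# 2 x 2 table: rows = test result, columns = true class
table(positive = positive, truth = truth)

sensitivity <- sum(positive & truth) / sum(truth)      # 5 / (5 + 5)
specificity <- sum(!positive & !truth) / sum(!truth)   # 4 / (1 + 4)
c(sensitivity = sensitivity, specificity = specificity)
```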

About the ROC curve (2)
• This talk uses a time-dependent version of the ordinary ROC curve
• For details, see
Heagerty, P. J., Lumley, T., Pepe, M. S. (2000). Time-dependent ROC Curves for Censored Survival Data and a Diagnostic Marker. Biometrics, 56, 337-344.
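For orientation, the time-dependent definitions in that paper can be written as follows, where $X$ is the marker, $T$ the survival time, $c$ a cut-off and $t$ the prediction time. The symbols are chosen here for illustration and may differ from the paper's notation.

```latex
\begin{aligned}
D(t) &= \mathbf{1}\{T \le t\} \\
\text{sensitivity}(c, t) &= P\bigl(X > c \mid D(t) = 1\bigr) \\
\text{specificity}(c, t) &= P\bigl(X \le c \mid D(t) = 0\bigr)
\end{aligned}
```

Plotting sensitivity(c, t) against 1 - specificity(c, t) over all c gives the ROC curve at time t, and the area under that curve is AUC(t).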

ROC curve: code
## cutoff of 5 years OS
cutoff <- 365*5
AUC_OS_5y <- survivalROC(Stime = df1$time,
                         status = df1$cens,
                         marker = df1$thickness,
                         predict.time = cutoff,
                         method = "KM")
plot(AUC_OS_5y$FP, AUC_OS_5y$TP, type="l", xlim=c(0,1), ylim=c(0,1),
     xlab=paste("FP", "\n", "AUC = ", round(AUC_OS_5y$AUC,3)),
     ylab="TP", main="MM thickness OS, Method = KM \n cutoff = 5 years")
abline(0,1)

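Choosing the cut-off value
Ideal point: sensitivity = 1, specificity = 1
Choose the cut-off whose point on the ROC curve is the shortest distance from it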
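The closest-to-corner rule can be illustrated on the toy data from the ROC slide before applying it to the melanoma data. This is a sketch under stated assumptions: a value at or above the cut-off counts as positive, and `roc`, `best_cut` and the column names are illustrative. The Youden index (TP - FP) is included as a common alternative criterion.

```r
# Toy data from the ROC slide (not the melanoma data)
value <- c(16, 15, 14, 13, 12, 11, 10, 9, 8, 8, 8, 8, 7, 6, 5)
truth <- c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE,
           TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)

# Sensitivity (TP) and false positive rate (FP) at every candidate cut-off
cuts <- sort(unique(value))
roc <- data.frame(
  cut = cuts,
  TP  = sapply(cuts, function(k) mean(value[truth] >= k)),
  FP  = sapply(cuts, function(k) mean(value[!truth] >= k))
)

# Distance from the ideal corner (TP = 1, FP = 0) and Youden index
roc$distance <- sqrt((1 - roc$TP)^2 + roc$FP^2)
roc$youden   <- roc$TP - roc$FP

best_cut <- roc$cut[which.min(roc$distance)]
best_cut
roc[order(roc$distance), ]
```

Minimizing the squared distance, as the talk's code does, selects the same cut-off because the square root is monotone.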

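Choosing the cut-off value: code
OS_5y_df <- cbind.data.frame(AUC_OS_5y$cut.values,
                             AUC_OS_5y$TP,
                             AUC_OS_5y$FP)
colnames(OS_5y_df) <- c("cut_values", "TP", "FP")
OS_5y_df <- OS_5y_df %>%
  ## squared distance from the ideal corner (TP = 1, FP = 0)
  dplyr::mutate(distance = ((1-TP)^2 + (FP)^2)) %>%
  ## Youden index, shown in the result table
  dplyr::mutate(Youden_index = (TP - FP))
distance <- OS_5y_df %>%
  dplyr::arrange(distance)
round(head(distance), 3)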

Choosing the cut-off value: result
> round(head(distance), 3)
  cut_values    TP    FP distance Youden_index
1       2.26 0.765 0.310    0.151        0.455
2       2.34 0.743 0.310    0.162        0.433
3       2.10 0.786 0.344    0.164        0.442
4       1.94 0.808 0.358    0.165        0.451
5       3.22 0.634 0.179    0.166        0.455
6       3.06 0.655 0.234    0.173        0.421

Cross validation (4-folds)
• Randomly assign the cases to four groups
set.seed(55)
df_sample <- df1 %>%
  dplyr::mutate(group = sample(x = c(1:4), size = dim(df1)[1], replace = TRUE))
> df_sample$group <- as.factor(df_sample$group)
> summary(df_sample$group)
 1  2  3  4
44 51 42 54
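Sampling the labels with replacement gives folds of unequal size (44 to 54 above). One alternative, sketched here as an assumption rather than the method used in the talk, is to shuffle a repeated 1:4 label vector so that fold sizes differ by at most one. `n` is set to 191, the number of cases in df1; `fold` and `k` are illustrative names.

```r
set.seed(55)
n <- 191   # number of cases in df1
k <- 4     # number of folds

# Repeat 1..k up to length n, then shuffle the labels
fold <- sample(rep(seq_len(k), length.out = n))
table(fold)
```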

Cross validation (4-folds): result
Four panels, one per fold:
Cutoff = 1.94, AUC = 0.736
Cutoff = 2.26, AUC = 0.796
Cutoff = 2.26, AUC = 0.762
Cutoff = 2.26, AUC = 0.797

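Summary
• The ROC curve is useful for choosing a cut-off value
• However, the threshold may vary across cross-validation folds
• ROC curves can also be used with survival data, and there is a package for it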

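Reflections
• I could not show how the AUC changes with the prediction time
→ iAnalysis ～おとうさんの解析日記～ "[R program]時間依存性ROC曲線法" (time-dependent ROC curve method)
• The ROC curve here is only a univariate analysis, so it cannot guarantee predictive performance in the multivariate setting
→ Nomogram (Tokyo.R #46, "Cox比例ハザードモデルとその周辺", on the Cox proportional hazards model and related topics)
→ Regularization, e.g. "Nomograms for High-Dimensional Data"
• Next time I would like to scrape the data myself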
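One possible way to look at the first point is to repeat the survivalROC call over several prediction times and collect the AUC. This is a rough sketch, not the approach of the linked post: the prediction times in `years` are arbitrary, and `dat`, `event` and `auc_by_year` are illustrative names. It assumes the survivalROC package is installed.

```r
library(MASS)
library(survivalROC)

dat <- subset(Melanoma, status != 3)
dat$event <- as.integer(dat$status == 1)

# Prediction times in years, chosen only for illustration
years <- c(2, 4, 6, 8, 10)

# AUC of thickness as a marker at each prediction time (KM method)
auc_by_year <- sapply(years, function(y) {
  survivalROC(Stime = dat$time, status = dat$event,
              marker = dat$thickness,
              predict.time = 365 * y, method = "KM")$AUC
})
data.frame(years = years, AUC = round(auc_by_year, 3))
```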

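References
• Past Tokyo R slides
• 金明哲, "Rによるデータサイエンス" (Data Science with R)
• Qiita: "dplyrを使いこなす！" (Mastering dplyr!)
• Website of the Department of Nephrology, Geriatric Medicine and Nephrology, Osaka University Graduate School of Medicine
• R-bloggers: "Survival plots have never been so informative"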

ROC curve: code
## training set: all groups except group 1
df_1 <- df_sample %>%
  dplyr::filter(!(group == 1))
## cutoff of 5 years OS
cutoff <- 365*5
AUC_OS_5y <- survivalROC(Stime = df_1$time, status = df_1$cens,
                         marker = df_1$thickness, predict.time = cutoff,
                         method = "KM")
OS_5y_df <- cbind.data.frame(AUC_OS_5y$cut.values, AUC_OS_5y$TP,
                             AUC_OS_5y$FP)
colnames(OS_5y_df) <- c("cut_values", "TP", "FP")
OS_5y_df <- OS_5y_df %>%
  dplyr::mutate(distance = ((1-TP)^2 + (FP)^2)) %>%
  dplyr::mutate(Youden_index = (TP - FP))
distance <- OS_5y_df %>%
  dplyr::arrange(distance)
## cut-off with the smallest distance
cut_value <- distance[1, 1]
cut_value
## held-out set: group 1, dichotomized at the chosen cut-off
df_cut <- df_sample %>%
  dplyr::filter(group == 1) %>%
  dplyr::mutate(thick_dicot = ifelse(thickness >= cut_value, 1, 0))
## univariate by thickness
## low:0, high:1
thick_fit <- survfit(Surv(time, cens) ~ thick_dicot, data = df_cut)
thick_fit
thick_p <- ggsurvplot(fit = thick_fit,
                      risk.table = TRUE,
                      pval = TRUE,
                      conf.int = TRUE)
thick_p
