
AI 바이오 (2_3일차).pdf

Bioinformatics using statistical learning, machine learning, and deep learning.
Day 2 and Day 3 materials from a 12-day course, focusing on statistical analysis.
Meta-analysis for medical data handling is included.


  1. 1. AI-Bio Convergence Professional Course, 2022-8~10, 윤형기 (hky@openwith.net), Day 2
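  2. 2. Schedule (topic: details)
     • Day 1
       – Greetings and course introduction: greetings; survey of attendees and their goals for the course
       – Overview of medicine/bio (technology/industry): trends in medical/bio technology and industry
       – Foundation technology (1-1) Python and analysis packages: analysis tools (1) (Python, SciPy, NumPy/pandas)
     • Day 2
       – Foundation technology (1-2) R and statistical analysis: analysis tools (2) (R and statistics)
       – Applied biostatistics (1): bioinformatics and ANOVA, multivariate analysis, etc.
     • Day 3
       – Applied biostatistics (2): meta-analysis
       – Genome analysis (Omics) (1): genome analysis; transcriptome analysis
     • Day 4
       – Genome analysis (Omics) (2): epigenome analysis; proteome analysis
       – Next-generation sequencing: GenBank and NCBI data; VCF data analysis, NGS data processing, etc.
     • Day 5
       – Foundation technology (3) Machine learning (1): modeling methodology (model concepts and cross-validation); supervised learning algorithms (linear models, classification)
       – Foundation technology (3) Machine learning (2): unsupervised learning algorithms (clustering, association analysis, etc.)
     • Day 6
       – Supervised learning and bioinformatics applications: predictive models for medical data; linear models and classification of healthcare data
       – Unsupervised learning and bioinformatics applications: association analysis of clinical data; comorbidity analysis
     • Module groups: understanding the medical/bio domain; healthcare datasets and biostatistics; genome analysis; bio data and machine learning
  3. 3. Schedule, continued (topic: details)
     • Day 7
       – Foundation technology (4) Deep learning (1): neural network training and deep learning models
       – Foundation technology (3) Deep learning (2): TensorFlow, PyTorch
     • Day 8
       – Deep learning and bioinformatics applications: healthcare simulation using Bi-LSTM; skin disease identification using deep learning
       – Ontologies and bioinformatics applications: Semantic Web and ontologies; bioinformatics applications of ontologies
     • Day 9
       – Foundation technology (3) Image processing: overview of image processing and computer vision
       – Medical image analysis (1): segmentation; image registration
     • Day 10
       – Medical image analysis (2): electrocardiogram (ECG); rendering and surface models; MRI
     • Day 11
       – Foundation technology (4) Bioinformatics and computational chemistry: overview of computational chemistry
       – Drug discovery (1): target identification; reagent and assay development; ADME (absorption, distribution, metabolism, excretion); toxicology and machine learning applications
     • Day 12
       – Foundation technology (5) GAN: GAN (Generative Adversarial Networks) and VAE
       – Drug discovery and GAN: recommending drug candidates using generative models
       – Wrap-up: summary
     • Module groups: medical image analysis; drug analysis and drug design; bio data and deep learning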
  4. 4. Foundation Technology (1-2): R and Statistical Analysis
  5. 5. Basic Statistics
  6. 6. • Unit I: Overview – 1. Overview and descriptive statistics – 2. Probability theory and Bayesian methods • Unit II: Data analysis by number of variables – 3. Univariate/bivariate/multivariate • Unit III: Distributions and sampling – 4. Discrete and continuous distributions – 5. Sampling and sampling distributions • Unit IV: Parameter estimation – 6. Estimation (one/two populations) – 7. Hypothesis testing
  7. 7. UNIT I: Overview 1. Basic concepts and descriptive statistics 2. Probability theory and Bayesian methods
  8. 8. 1. Basic Concepts and Descriptive Statistics • 1.1 Statistical concepts – Everything varies - Heterogeneity is universal - “whether the degree of variation is statistically meaningful (significant)” – Statistical vs. practical significance • Statistical significance: differences in group means are not likely due to sampling error. • Practical (or clinical) significance
  9. 9. • Statistical concepts (2) – Everything varies • Heterogeneity is universal - “whether the degree of variation is statistically meaningful (significant)” – Statistical vs. practical significance • Statistical significance – differences in group means are not likely due to sampling error. • Practical (or clinical) significance – Practical significance asks the larger question about differences – “Are the differences between samples big enough to have real meaning?” – Generally assessed with a measure of effect size – 2 categories: » Difference measures » Variance-accounted-for measures
  10. 10. • 1.2 Descriptive Statistics – (1) Central tendency: ungrouped data • Mode, mean, median • Percentile, quantile/quartile – (2) Variability: ungrouped data • Range & IQR (interquartile range) • MAD (mean absolute deviation) • Variance, standard deviation • Population vs. sample variance and standard deviation • Unbiased estimator • Z-score • Coefficient of variation (CV)
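The measures listed on this slide can be tried out on a small sample. The following is a minimal sketch using only Python's standard `statistics` module; the data values are made up for illustration and are not from the course material.

```python
import statistics

# Hypothetical sample (illustrative values only)
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# (1) Central tendency
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles

# (2) Variability
data_range = max(data) - min(data)
iqr = q3 - q1
mad = sum(abs(x - mean) for x in data) / len(data)  # mean absolute deviation
pop_var = statistics.pvariance(data)   # population variance: divide by N
samp_var = statistics.variance(data)   # sample variance: divide by n - 1 (unbiased)
samp_sd = statistics.stdev(data)       # sample standard deviation
z_scores = [(x - mean) / samp_sd for x in data]
cv = samp_sd / mean                    # coefficient of variation

print(mean, median, mode, pop_var, round(samp_var, 4), mad, round(cv, 4))
```

The only difference between `pvariance` and `variance` is the divisor (N versus n - 1), which is the population vs. sample distinction named on the slide.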
  11. 11. – (3) Measures of Shape • Skewness – Coefficient of Skewness • Kurtosis • Box-and-Whisker Plots
  12. 12. – (4) Measures of Association • Correlation – Pearson product-moment correlation coefficient – Spearman correlation coefficient – Kendall tau-b correlation coefficient
  13. 13. 2. Probability Theory and Bayesian Methods • 2.1 Basic concepts – Experiment, (elementary) event, and sample space • Independent events, unions, intersections • MECE (Mutually Exclusive, Collectively Exhaustive) – Marginal, union, and joint probability; for mutually exclusive events, $P(X \cap Y) = 0$
  14. 14. – Counting Possibilities • mn counting rule: $m \times n$ • Sampling from a population with replacement: $N^n$ possibilities • Combinations (sampling from a population without replacement): ${}_N C_n = \dfrac{N!}{n!\,(N-n)!}$ • Expected value and variance – (a) Expected value – (b) Variance • Geometric mean
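The counting rules and the expected value and variance of a discrete variable can be checked numerically. This is a small sketch with hypothetical numbers (the values of `m`, `n_options`, `N`, and `n` are chosen only for illustration); the fair six-sided die is used as the example distribution.

```python
import math
from fractions import Fraction

# mn counting rule: m choices followed by n choices (hypothetical numbers)
m, n_options = 3, 4
outcomes = m * n_options

# Sampling n items from a population of N (hypothetical numbers)
N, n = 5, 2
with_replacement = N ** n              # N^n ordered samples, repeats allowed
without_replacement = math.comb(N, n)  # N! / (n! (N - n)!)

# Expected value and variance of a fair six-sided die
p = Fraction(1, 6)
faces = range(1, 7)
expected = sum(x * p for x in faces)
variance = sum((x - expected) ** 2 * p for x in faces)

print(outcomes, with_replacement, without_replacement, expected, variance)
```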
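  15. 15. • 2.2 Conditional Probability and Bayes' Rule – Law of conditional probability • $P(X \mid Y) = \dfrac{P(X \cap Y)}{P(Y)} = \dfrac{P(X)\,P(Y \mid X)}{P(Y)}$ – Test for independence: » $P(X \mid Y) = P(X)$ and $P(Y \mid X) = P(Y)$ – Bayes' Rule • $P(X_i \mid Y) = \dfrac{P(X_i)\,P(Y \mid X_i)}{P(X_1)\,P(Y \mid X_1) + P(X_2)\,P(Y \mid X_2) + \cdots + P(X_n)\,P(Y \mid X_n)}$ – Odds • For a die, the probability of rolling a 2 or a 3 is 1/3, so the odds are 2/4 = 1/2 (two favorable outcomes against four unfavorable ones)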
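Bayes' rule and the odds calculation can be sketched in a few lines. The diagnostic-test numbers below (1% prevalence, 95% sensitivity, 5% false-positive rate) are hypothetical and serve only to exercise the formula; the helper names `bayes_posterior` and `odds` are illustrative.

```python
from fractions import Fraction

def bayes_posterior(priors, likelihoods):
    """Return P(X_i | Y) for every i, given P(X_i) and P(Y | X_i)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)  # P(Y), the denominator of Bayes' rule
    return [j / evidence for j in joint]

# Hypothetical example: X_1 = disease, X_2 = no disease, Y = positive test
priors = [0.01, 0.99]
likelihoods = [0.95, 0.05]
posterior = bayes_posterior(priors, likelihoods)

def odds(p):
    """Convert a probability into odds in favor: p / (1 - p)."""
    return p / (1 - p)

# Die example from the slide: P(2 or 3) = 1/3, so odds = 1/2
die_odds = odds(Fraction(1, 3))

print(posterior, die_odds)
```

Even with a fairly accurate test, the posterior probability of disease stays low here because the prior is small, which is the usual motivation for writing the denominator out in full.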
  16. 16. UNIT II: Data Visualization by Number of Variables 3. Univariate / Bivariate / Multivariate
  17. 17. • 4.1 Overview • Random variable • = a variable that contains the outcomes of a chance experiment • 4.2 Shape of discrete distributions – Mean or expected value • = long-run average of occurrences – Variance and standard deviation • 4.3 Binomial distribution – Binomial formula – Mean and standard deviation of the binomial distribution • 4.4 Poisson distribution – Law of improbable events; λ = long-run average
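As a sketch of the binomial formula and the Poisson distribution named above, the probability mass functions can be written directly from their textbook definitions. The parameter values (`n = 10`, `p = 0.3`, `lam = 2.0`) are illustrative assumptions.

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam), lam = long-run average."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

n, p = 10, 0.3
binom_mean = n * p                      # mean of the binomial
binom_sd = math.sqrt(n * p * (1 - p))   # standard deviation of the binomial

# The mean recovered from the pmf agrees with n * p
mean_from_pmf = sum(x * binomial_pmf(x, n, p) for x in range(n + 1))

lam = 2.0
poisson_total = sum(poisson_pmf(x, lam) for x in range(60))

print(binom_mean, round(binom_sd, 4), round(mean_from_pmf, 4), round(poisson_total, 6))
```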
  18. 18. • 4.5 Hypergeometric Distribution – Overview • = the probability distribution that arises when sampling without replacement from a finite population – Use it instead of the binomial distribution when: • (i) Sampling is done without replacement. • (ii) n ≥ 5% of N
  19. 19. (Continuous distributions) • 4.6 Uniform distribution • 4.7 Normal distribution – Overview • Gaussian distribution • Probability density function of the normal distribution – Standardized normal distribution • z score = number of standard deviations that a value x lies above or below the mean • z distribution • 4.8 Normal approximation to the binomial distribution – Rules of thumb: • About 99.7% of the values under a normal curve lie within 3 s.d. of the mean • n·p > 5 and n·q > 5 – Correcting for continuity • Converting a discrete distribution into a continuous distribution.
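The normal approximation with a continuity correction can be compared against the exact binomial probability. This is a sketch under assumed values (`n = 60`, `p = 0.3`, cutoff 20), chosen so that n·p and n·q both exceed 5.

```python
import math
from statistics import NormalDist

n, p = 60, 0.3          # n*p = 18 > 5 and n*q = 42 > 5
q = 1 - p
cutoff = 20

# Exact P(X <= 20) from the binomial formula
exact = sum(math.comb(n, x) * p ** x * q ** (n - x) for x in range(cutoff + 1))

# Normal approximation with mu = n*p, sigma = sqrt(n*p*q);
# the continuity correction evaluates the cdf at 20.5 instead of 20
approx_dist = NormalDist(mu=n * p, sigma=math.sqrt(n * p * q))
approx = approx_dist.cdf(cutoff + 0.5)

# Empirical rule: share of a normal distribution within 3 s.d. of the mean
z = NormalDist()
within_3_sd = z.cdf(3) - z.cdf(-3)

print(round(exact, 4), round(approx, 4), round(within_3_sd, 4))
```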
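  20. 20. • 4.9 Exponential Distribution – Inter-arrival times of random arrivals • = distribution of the time between random occurrences • cf. Poisson distribution = random occurrences over some interval – Probabilities of the exponential distribution
  21. 21. • 4.10 $\chi^2$ distribution • 4.11 Lognormal distribution – a distribution whose logarithm is normally distributed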
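  22. 22. Relationships among distributions
  23. 23. 5. Sampling and Sampling Distributions • 5.1 Sampling methods • 5.2 Sampling distribution of $\bar{x}$ – Central limit theorem • $\mu_{\bar{x}} = \mu$ • $\sigma_{\bar{x}} = \sigma / \sqrt{n}$ – z formula for sample means – Sampling from a finite population • 5.3 Sampling distribution of $\hat{p}$
  24. 24. UNIT IV: Parameter Estimation 6. Estimation 7. Hypothesis testing 8. Analysis of variance and experimental design
  25. 25. 6. Estimation • Confidence interval estimation (single population) – Confidence interval using the z statistic (single population, σ known) • Point estimation • 100(1-α)% confidence interval to estimate μ (σ known) • Finite correction factor • When the sample size is small – So far, mainly n ≥ 30 – Even when n < 30, the z formula applies if the population is normally distributed – In short: use z when the sample is large (central limit theorem), or when it is small but the population is normal (σ known)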
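A z-based confidence interval for μ with σ known can be sketched as below. The function name `z_confidence_interval`, the sample numbers, and the optional finite-population argument `N` are illustrative assumptions; the interval used is the standard form, sample mean ± z·σ/√n, with the finite correction factor √((N - n)/(N - 1)) applied when `N` is given.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95, N=None):
    """100(1 - alpha)% confidence interval for mu when sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # critical z value
    se = sigma / math.sqrt(n)                 # standard error of the mean
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))    # finite correction factor
    return sample_mean - z * se, sample_mean + z * se

# Hypothetical numbers: mean 100, sigma 15, n = 36
low, high = z_confidence_interval(100.0, 15.0, 36)

# Same sample drawn from a finite population of 500 units
low_fp, high_fp = z_confidence_interval(100.0, 15.0, 36, N=500)

print(round(low, 3), round(high, 3), round(low_fp, 3), round(high_fp, 3))
```

The finite correction always narrows the interval, since sampling a noticeable share of a finite population leaves less uncertainty about its mean.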
  26. 26. – Confidence interval using the t statistic (single population, σ unknown) • Apply the t distribution when the population is normal but its standard deviation is unknown. – The distribution differs with the sample size. – Assumption of the t statistic: the population is normally distributed » If the population is not normal, use nonparametric techniques – Property of the t statistic: robust • Confidence interval for the population mean using the t statistic – Estimating the population proportion
  27. 27. 7. Hypothesis Testing (single population) • 7.1 Overview – Good and bad hypotheses • “a good hypothesis is a falsifiable hypothesis.” - Karl Popper • Absence of evidence is not evidence of absence. – Null hypothesis • ‘nothing is happening’, e.g. the slope of the relationship is zero. – Alternative hypothesis
  28. 28. Statistical Hypothesis Testing • Step 1: State the null hypothesis (H0) • Step 2: State the alternative hypothesis • Step 3: Set α • Step 4: Collect data
     | Decision | In reality H0 is TRUE | In reality H0 is FALSE |
     |---|---|---|
     | Accept H0 | Correct | Type II error (β = probability of Type II error) |
     | Reject H0 | Type I error (α = probability of Type I error) | Correct |
  29. 29. • Step 5: Calculate a test statistic • Fcalculated • Step 6: Construct Acceptance / Rejection regions • Step 7: Based on steps 5 & 6, draw a conclusion about H0 • If Fcalculated from data > Fα, then you are in the Rejection region and you can reject H0 with (1-α) level of confidence.
  30. 30. – Rejection and nonrejection regions – Type I and Type II errors
  31. 31. • 7.2 Hypothesis test for the population mean using the z statistic (σ known) – z test for a single mean – Test for the mean of a finite population – Hypothesis testing using the p-value • p-value = observed significance level – defines the smallest value of α for which H0 can be rejected. • “H0 can be rejected only when α is greater than p” – Hypothesis testing using the critical value method • Rejecting H0 using p-values
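The p-value method and the critical value method lead to the same decision, which a short sketch can confirm. The sample numbers (mean 104 against a hypothesized mean of 100, σ = 15, n = 36, α = 0.05) are hypothetical, and the test is assumed to be two-sided.

```python
import math
from statistics import NormalDist

# Hypothetical two-sided test of H0: mu = 100 with sigma known
sample_mean, mu0, sigma, n, alpha = 104.0, 100.0, 15.0, 36, 0.05

std_normal = NormalDist()
z = (sample_mean - mu0) / (sigma / math.sqrt(n))   # test statistic

# p-value method: reject H0 when alpha is greater than the p-value
p_value = 2 * (1 - std_normal.cdf(abs(z)))
reject_by_p = p_value < alpha

# Critical value method: reject H0 when |z| exceeds the critical z
z_critical = std_normal.inv_cdf(1 - alpha / 2)
reject_by_cv = abs(z) > z_critical

print(round(z, 3), round(p_value, 4), reject_by_p, reject_by_cv)
```

With these numbers z = 1.6 falls inside the nonrejection region, so both methods fail to reject H0.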
  32. 32. • p values vs. Effect sizes – p values • are calculated on the assumption that the H0 is true. • p values are about the size of the test statistic. • = an estimate of the probability that a value of the test statistic, or a value more extreme than this, could have occurred by chance when the null hypothesis is true. – Effect sizes • = measure of strength of a phenomenon – r2, regression coefficients, … → magic criteria
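  33. 33. • 7.3 Hypothesis test for the population mean using the t statistic (σ unknown) – (…) • z test of a population proportion – Hypothesis testing using the critical value method • Rejecting H0 using p-values • 7.4 Hypothesis tests about a proportion – […] • Using the p-value • Using the critical value method
  34. 34. • 7.5 Hypothesis tests about a variance • Table $\chi^2$ vs. observed $\chi^2$ • H0 can also be tested by the critical value method. • Substitute the critical $\chi^2$ value for α in place of the observed $\chi^2$ value and solve for $s^2$ → critical sample variance ($s_c^2$) • 7.6 Type II errors
  35. 35. Regression Analysis / Linear Models
  36. 36. Regression Analysis • Overview – An equation relating a single numeric D.V. (the value to be predicted) to one or more numeric I.V.s (predictors). – "regression" = process of fitting lines to data (Galton) – Uses: • Numeric prediction • Also hypothesis testing, checking whether model assumptions hold, etc. • Applies to a variety of models – SLR – MLR – GLM • Link functions • Logistic regression, Poisson regression, …
  37. 37. • Simple Regression Analysis – Correlation and simple regression – OLS (ordinary least squares) – Determining the equation of the regression line • deterministic model: $y = \beta_0 + \beta_1 x$ • probabilistic model: $y = \beta_0 + \beta_1 x + \varepsilon$
  38. 38. • Residual analysis
  39. 39. – Standard error of the estimate • For error analysis, use the standard error of the estimate instead of computing each residual (= the estimation error for an individual data point). – SSE (error sum of squares) – A better measure: the standard error of the estimate ($s_e$) = the standard deviation of the residuals in the regression model – (Empirical rule for the normal distribution: about 68% of values lie within μ ± 1σ and 95% within μ ± 2σ.) » Regression analysis also assumes that, for a given x, the error terms are normally distributed. » Since the error terms are normally distributed, $s_e$ is the s.d. of the errors, and the average error is 0: • 68% of the error values (residuals) should be within 0 ± 1$s_e$ • 95% of the error values (residuals) should be within 0 ± 2$s_e$. – $s_e$ provides a single measure of the magnitude of errors in the model. – It is also used to identify outliers (e.g. outside ±2$s_e$ or ±3$s_e$).
  40. 40. – Coefficient of determination • $R^2$ = how much of the variability of the D.V. (y) is explained by the I.V. (x) » $r^2 = 0$ … $r^2 = 1$ – Variability of the D.V. (y) measured as a sum of squares: $SS_{yy}$ » Dividing each term of $SS_{yy} = SSR + SSE$ by $SS_{yy}$ gives $1 = SSR/SS_{yy} + SSE/SS_{yy}$ – $r^2 = SSR/SS_{yy}$ is the proportion of y variability explained by the regression model • Relationship between r and $R^2$ – $R^2 = (r)^2$: in simple regression the coefficient of determination is the square of the correlation coefficient – Hypothesis test for the slope of the regression model & testing the overall model • Slope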
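The simple regression quantities from the last few slides (OLS line, SSE, standard error of the estimate, and the coefficient of determination) can be computed by hand on a toy data set. The `x` and `y` values below are invented for illustration; the formulas are the usual least-squares ones.

```python
import math

# Hypothetical data (illustrative values only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_yy = sum((yi - y_bar) ** 2 for yi in y)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# OLS estimates of the slope and intercept
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar

y_hat = [b0 + b1 * xi for xi in x]
residuals = [yi - yh for yi, yh in zip(y, y_hat)]

sse = sum(e ** 2 for e in residuals)           # error sum of squares
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)   # regression sum of squares
s_e = math.sqrt(sse / (n - 2))                 # standard error of the estimate

r = ss_xy / math.sqrt(ss_xx * ss_yy)           # Pearson correlation
r_squared = 1 - sse / ss_yy                    # coefficient of determination

print(round(b0, 4), round(b1, 4), round(s_e, 4), round(r_squared, 4))
```

The decomposition $SS_{yy} = SSR + SSE$ and the identity $R^2 = r^2$ both hold up to floating-point rounding, which makes them convenient sanity checks for a hand-written fit.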
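  41. 41. Multiple Regression Analysis • Multiple regression model with two independent variables (first order) – $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$ – The constant and coefficients are estimated from the sample: $\hat{y} = b_0 + b_1 x_1 + b_2 x_2$ → response surface / response plane • Significance tests for the regression model and its coefficients – <Analysis of the adequacy of the regression model> – Testing the overall model • Simple regression: t test of the slope of the regression line to see if it is ≠ 0 (i.e., whether the I.V. contributes significantly to predicting the D.V.) • Multiple regression: an analogous test makes use of the F statistic.
  42. 42. – Significance tests for the regression coefficients • t test for each regression coefficient – H0: β1 = 0, H0: β2 = 0, …, H0: βk = 0 – Ha: β1 ≠ 0, Ha: β2 ≠ 0, …, Ha: βk ≠ 0 – Degrees of freedom for each individual coefficient test = n - k - 1. – Residuals, standard error of the estimate, and $R^2$ • Residual (= error of the regression model) – Uses: outlier detection, checking the assumptions of regression analysis • SSE and the standard error of the estimate – = the spread of the points in the scatter plot around the best-fit line – = indicates how widely the actual y values are distributed around $\hat{y}$ (the regression line) – $SSE = \sum (y - \hat{y})^2$ • Regression assumption (error terms normally distributed with mean 0) + empirical rule (roughly 68% of residuals within ±1$s_e$, 95% within ±2$s_e$) → the standard error of the estimate is useful for measuring how well the model fits the data.
  43. 43. Key Issues • (1) Is there a relationship between the response and the predictors? – Hypothesis test • H0: β1 = β2 = ··· = βp = 0 • Ha: at least one βj is non-zero. – Compute the F-statistic: $F = \dfrac{(TSS - RSS)/p}{RSS/(n - p - 1)}$ • where $TSS = \sum (y_i - \bar{y})^2$ and $RSS = \sum (y_i - \hat{y}_i)^2$. – If H0 is true (= no relationship between the response and the predictors), then F is close to 1. – If Ha is true, then $E\{(TSS - RSS)/p\} > \sigma^2$, so we expect F > 1.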
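The overall F-statistic follows directly from TSS, RSS, the sample size n, and the number of predictors p. The sketch below uses invented sums of squares (`tss = 100`, `rss = 40`, `n = 23`, `p = 2`) and the helper name `f_statistic` is an illustrative choice; it also shows the equivalent form in terms of $R^2$.

```python
def f_statistic(tss, rss, n, p):
    """Overall F-statistic for H0: beta_1 = ... = beta_p = 0."""
    return ((tss - rss) / p) / (rss / (n - p - 1))

# Hypothetical sums of squares from a fitted model
tss, rss, n, p = 100.0, 40.0, 23, 2

f_value = f_statistic(tss, rss, n, p)

# The same statistic written in terms of R^2 = 1 - RSS/TSS
r_squared = 1 - rss / tss
f_from_r2 = (r_squared / p) / ((1 - r_squared) / (n - p - 1))

print(f_value, round(f_from_r2, 6))
```

A value far above 1, as here, is evidence against H0; judging significance still requires comparing it with the F distribution on p and n - p - 1 degrees of freedom.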
  44. 44. • (2) Deciding on the important variables – Variable selection • Mallow's $C_p$ • Akaike information criterion (AIC) • Bayesian information criterion (BIC) • Adjusted $R^2$ – However, there are $2^p$ candidate models, so use: • Forward selection • Backward selection • Mixed selection
  45. 45. • (3) Model Fit – In SLR, $R^2$ = the square of the correlation between the response and the predictor – In MLR, it equals $\mathrm{Cor}(Y, \hat{Y})^2$ – Property of the fitted linear model: it maximizes this correlation among all possible linear models. – The p-value quantifies how much an added variable improves $R^2$ – Definition of RSE: $RSE = \sqrt{RSS/(n - p - 1)}$ • Hence models with more variables can have a higher RSE if the decrease in RSS is small relative to the increase in p.
  46. 46. • (4) Predictions • Even if the true values of β0, β1, ..., βp were known, perfect prediction would be impossible because of random error (the irreducible error). – Confidence interval – Prediction interval
  47. 47. R
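  48. 48. R • Aspects of the R language – R as a mathematical/statistical analysis tool – R as a programming language – R as a visualization tool • R and AI/deep learning – Machine learning and predictive analysis – Keras with R • Cheatsheets – https://iqss.github.io/dss-workshops/R/Rintro/base-r-cheat-sheet.pdf – https://www.rstudio.com/resources/cheatsheets/
  49. 49. • R packages • Bio-specific packages
  50. 50. Analysis of Variance (ANOVA)
  51. 51. Basic Concepts of ANOVA • Basic concepts • Analysis of variance: tests whether means differ • Experimental design: how to collect the data • Factor – the object directly manipulated in the experiment – by number of factors: one-way, two-way, or multi-way layout • Level = a condition of the factor under which the experiment is run • Treatment = a level of a factor • Response value = the value obtained as data after the experiment is run – [Assumptions required for ANOVA] • Independence: the sample observations at each level are independent of one another • Normality: the observations are normally distributed • Homogeneity of variance: the population variances are equal – Diagnostic tests • Residual analysis
  52. 52. • Example: lifetime data for LED lamps produced by three lamp manufacturers – Examples of factor, level, and response. By number of factors: ▪ One factor: one-way layout – Completely randomized design (CRD) – With k levels and r replications per level, the whole experiment is divided into k × r runs, which are assigned at random using a random number table, drawing lots, etc. ▪ Two factors: two-way layout ▪ Three or more factors: multi-way layout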
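  53. 53. Principle of ANOVA • Basic principle – $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$ (the group means are all equal) – $H_a: \mu_i \neq \mu_j$ for some $i \neq j$ (at least two means differ) – Example: one-way layout data with k levels and $n_i$ replications at level i – Table on the next page: • $Y_{ij}$ denotes the j-th observation at the i-th level of factor A. • The structural model for one-way layout data $Y_{ij}$ is: • $Y_{ij} = \mu_i + e_{ij}$, $i = 1, 2, \ldots, k$, $j = 1, 2, \ldots, n_i$ • $\mu_i$ = mean of the i-th level of factor A • The $e_{ij}$ are mutually independent and normally distributed with mean 0 and variance $\sigma^2$.
  54. 54. Levels of the factor (columns) and replications of the experiment (random samples, rows):
     | Replication | $A_1$ | $A_2$ | … | $A_i$ | … | $A_k$ |
     |---|---|---|---|---|---|---|
     | 1 | $Y_{11}$ | $Y_{21}$ | … | $Y_{i1}$ | … | $Y_{k1}$ |
     | 2 | $Y_{12}$ | $Y_{22}$ | … | $Y_{i2}$ | … | $Y_{k2}$ |
     | … | … | … | … | … | … | … |
     | j | $Y_{1j}$ | $Y_{2j}$ | … | $Y_{ij}$ | … | $Y_{kj}$ |
     | … | … | … | … | … | … | … |
     | last | $Y_{1n_1}$ | $Y_{2n_2}$ | … | $Y_{in_i}$ | … | $Y_{kn_k}$ |
     | Mean | $\bar{Y}_{1.}$ | $\bar{Y}_{2.}$ | … | $\bar{Y}_{i.}$ | … | $\bar{Y}_{k.}$ |
     Mean of the i-th level: $\bar{Y}_{i.} = \dfrac{1}{n_i}\sum_{j=1}^{n_i} Y_{ij}$. Grand mean: $\bar{Y}_{..} = \dfrac{1}{N}\sum_{i=1}^{k}\sum_{j=1}^{n_i} Y_{ij}$, where $N = n_1 + n_2 + \cdots + n_k$.
  55. 55. • The deviation $Y_{ij} - \bar{Y}_{..}$ of each observation $Y_{ij}$ from the grand mean $\bar{Y}_{..}$ can be split into two parts. • $Y_{ij} - \bar{Y}_{..} = (Y_{ij} - \bar{Y}_{i.}) + (\bar{Y}_{i.} - \bar{Y}_{..})$ • Squaring both sides and summing over all i and j, the cross-product term is 0, so: – $\sum_{i=1}^{k}\sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{..})^2 = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{i.})^2 + \sum_{i=1}^{k}\sum_{j=1}^{n_i} (\bar{Y}_{i.} - \bar{Y}_{..})^2$ • $\sum_{i=1}^{k}\sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{..})^2$ measures how far the observations are spread from the grand mean; it is called the total sum of squares and written SST. • $\sum_{i=1}^{k}\sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{i.})^2$ is the within-group (within-level) variation, measuring how far the observations are spread around the mean of their own level; it is called the error sum of squares and written SSE. • $\sum_{i=1}^{k}\sum_{j=1}^{n_i} (\bar{Y}_{i.} - \bar{Y}_{..})^2$ is the between-group (between-level) variation, measuring how far each level mean lies from the grand mean; it is called the treatment sum of squares and written SSA.
  56. 56. – The sum of squared deviations over all observations can be expressed as follows.
     | | SST | = SSE | + SSA |
     |---|---|---|---|
     | Name | Total sum of squares | Error sum of squares | Treatment sum of squares |
     | Variation | Total | Within groups (levels) | Between groups (levels) |
     | Degrees of freedom | $N - 1$ | $\sum_{i=1}^{k} (n_i - 1) = N - k$ | $k - 1$ |
     – MS (mean square) = each sum of squares divided by its degrees of freedom: $MSA = \dfrac{SSA}{k-1}$, $MSE = \dfrac{SSE}{N-k}$ – The larger the share of SST that is due to the treatment effect SSA, the larger the differences among the population means. – As SSA grows, $SSA/SSE$ grows. The test uses the F statistic, which under the null hypothesis follows an F distribution with $k-1$ numerator and $N-k$ denominator degrees of freedom. – $F = \dfrac{SSA/(k-1)}{SSE/(N-k)} = \dfrac{MSA}{MSE}$
  57. 57. – ANOVA table
     | Source | Sum of squares | Degrees of freedom | Mean square | F |
     |---|---|---|---|---|
     | Between groups | SSA | $k-1$ | $MSA = \dfrac{SSA}{k-1}$ | $F = \dfrac{MSA}{MSE}$ |
     | Within groups | SSE | $N-k$ | $MSE = \dfrac{SSE}{N-k}$ | |
     | Total | SST | $N-1$ | | |
     – MS (mean square) = each sum of squares divided by its degrees of freedom. – The higher the share of SST due to the treatment effect SSA, the more likely it is that the population means differ. – The test statistic is F; when H0 is true it follows an F distribution with $k-1$ numerator and $N-k$ denominator degrees of freedom.
  58. 58. – The following shortcut formulas are convenient when computing the ANOVA table: ◼ $N = n_1 + n_2 + \cdots + n_k$ ◼ Sum of all observations: $T = \sum_{i=1}^{k}\sum_{j=1}^{n_i} Y_{ij}$ ◼ Sum of all observations at level i: $T_i = \sum_{j=1}^{n_i} Y_{ij}$ ◼ CT (correction term) $= \dfrac{T^2}{N}$ – With these, SST, SSA, and SSE are easy to obtain: – $SST = \sum_{i=1}^{k}\sum_{j=1}^{n_i} Y_{ij}^2 - CT$ – $SSA = \sum_{i=1}^{k} \dfrac{T_i^2}{n_i} - CT$ – $SSE = \sum_{i=1}^{k}\sum_{j=1}^{n_i} Y_{ij}^2 - \sum_{i=1}^{k} \dfrac{T_i^2}{n_i} = SST - SSA$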
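The shortcut formulas can be followed step by step on a small one-way layout. The three groups below are invented data (k = 3 levels, 4 replications each), and the dictionary keys `A1` to `A3` simply mirror the level names used in the table.

```python
# Hypothetical one-way layout: k = 3 levels of factor A
groups = {
    "A1": [10.0, 12.0, 11.0, 13.0],
    "A2": [14.0, 15.0, 13.0, 16.0],
    "A3": [9.0, 8.0, 10.0, 9.0],
}

k = len(groups)
N = sum(len(obs) for obs in groups.values())
T = sum(sum(obs) for obs in groups.values())   # sum of all observations
CT = T ** 2 / N                                # correction term

# Shortcut formulas
SST = sum(y ** 2 for obs in groups.values() for y in obs) - CT
SSA = sum(sum(obs) ** 2 / len(obs) for obs in groups.values()) - CT
SSE = SST - SSA

# Cross-check SSE against its definition (spread around each level mean)
SSE_direct = sum(
    (y - sum(obs) / len(obs)) ** 2 for obs in groups.values() for y in obs
)

MSA = SSA / (k - 1)
MSE = SSE / (N - k)
F = MSA / MSE

print(round(SST, 4), round(SSA, 4), round(SSE, 4), round(F, 4))
```

To finish the test, F is compared with the critical value of the F distribution on k - 1 and N - k degrees of freedom, which requires a table or a statistics library and is left out of this sketch.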
  59. 59. – (1) – (2) – (3)
  60. 60. • ANOVA – Critical value of F
  61. 61. Multiple Comparisons • Overview – When the F test in ANOVA rejects the null hypothesis, the factor-level means are compared pairwise, two at a time, to test which treatment means differ by a statistically significant amount. • Methods – Bonferroni method – Duncan method – Tukey method – Least significant difference (LSD) method
  62. 62. • Tukey Test for Pairwise Mean Comparisons – Step 1: Compute Tukey’s w value – Step 2: Rank the means, calculate differences
  63. 63. • 기타의 Pairwise 평균 비교방법 – to compare all possible means, two-at-a-time, as t-tests. • Unlike an ordinary two sample t-test, however, the method does rely on the experiment–wide error (the MSE). • standard error for the difference between two treatment means (𝑠 ത 𝑑 or SE) • Fisher’s Protected Least Significant Difference (LSD).
  64. 64. • Tukey
  65. 65. • Bonferroni
  66. 66. • Scheffe
  67. 67. • Dunnett. Comparisons significant at the 0.05 level are indicated by ***.
     | Fertilizer comparison | Difference between means | Simultaneous 95% confidence limits (lower) | Simultaneous 95% confidence limits (upper) | Significant |
     |---|---|---|---|---|
     | F3 - Control | 8.200 | 5.638 | 10.762 | *** |
     | F1 - Control | 7.600 | 5.038 | 10.162 | *** |
     | F2 - Control | 4.867 | 2.305 | 7.429 | *** |
  68. 68. Contrast Analysis • Purpose: analysis over a broader scope – e.g. treatment level groups or testing of trends prompting regression modeling to express the response vs. treatment relationship with treatment as a numerical predictor • 1-factor ANOVA: linear contrast as a linear combination of the treatment means such that numerical coefficients add to 0
  69. 69. Extensions • Multi-Factor ANOVA • Random Effects and Mixed Models • Experimental Design • ANCOVA
  70. 70. MVA (Multivariate Analysis)
  71. 71. (Multivariate) Random Variables
  72. 72. Univariate Random Variables • Discrete Random Variable – Random variable X(ξ) • = a single-valued real function that assigns a real number (the value of X(ξ)) to each sample point ξ of S.
  73. 73. • Continuous RV and PDF • Expected Value – = mean
  74. 74. • Normal Random Variable – Normality is Preserved by Linear Transformations – Standard Normal random variable
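  75. 75. • Distribution Functions – (Cumulative) distribution function (cdf) – Properties of $F_X(x)$ – Determining probabilities from the distribution function
  76. 76. • Discrete Random Variable and PMF – = the probability that a discrete random variable X takes a particular value x, i.e. P(X = x); also called the probability function or frequency function • Properties – (1) $P(X = x) = f(x) > 0$ if x is in the support S – (2) $\sum_{x \in S} f(x) = 1$ – (3) $P(X \in A) = \sum_{x \in A} f(x)$ • Types – Finite, countably infinite • PMF in the multivariate case – Joint probability distribution
  77. 77. • Continuous Random Variable and PDF – When X is a continuous random variable, the PDF is the function f(x) that describes its continuous probability distribution – $f_X(x)$ is the PDF of the continuous random variable X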
  78. 78. • Joint PDF of Multiple RV – Marginal PDF – Joint PDF • Conditioning • Bayes' Rule
  79. 79. • Univariate Moment – = a specific quantitative measure of the shape of a function – e.g. mean, variance
     | Ordinal | Raw moment | Central moment | Normalised moment | Raw cumulant | Standardised cumulant |
     |---|---|---|---|---|---|
     | 1 | Mean | 0 | 0 | Mean | N/A |
     | 2 | - | Variance | 1 | Variance | 1 |
     | 3 | - | - | Skewness | - | Skewness |
     | 4 | - | - | (Non-excess or historical) kurtosis | - | Excess kurtosis |
     | 5 | - | - | Hyperskewness | - | - |
     | 6 | - | - | Hypertailedness | - | - |
     | 7 | - | - | - | - | - |
  80. 80. • Covariance and Correlation Coefficient of Bivariate Random Variables – Population covariance – Sample covariance
  81. 81. • Correlation – Population correlation of two r.v. x and y – Sample correlation • rxy is related to the cosine of the angle between two vectors.
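The remark that $r_{xy}$ is related to the cosine of an angle can be made concrete: the sample correlation equals the cosine of the angle between the two mean-centered data vectors. The sketch below checks this on invented data.

```python
import math

# Hypothetical paired observations (illustrative values only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
dx = [xi - x_bar for xi in x]   # centered x vector
dy = [yi - y_bar for yi in y]   # centered y vector

# Sample covariance and standard deviations (divisor n - 1)
s_xy = sum(a * b for a, b in zip(dx, dy)) / (n - 1)
s_x = math.sqrt(sum(a * a for a in dx) / (n - 1))
s_y = math.sqrt(sum(b * b for b in dy) / (n - 1))
r_xy = s_xy / (s_x * s_y)       # sample correlation

# Cosine of the angle between the centered vectors
dot = sum(a * b for a, b in zip(dx, dy))
cos_theta = dot / (math.sqrt(sum(a * a for a in dx)) * math.sqrt(sum(b * b for b in dy)))

print(round(s_xy, 4), round(r_xy, 4), round(cos_theta, 4))
```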
  82. 82. Multivariate Random Variables • Bivariate Random Variables – Two r.v.'s X and Y are defined on the sample space S of a random experiment. – → (X, Y) = bivariate r.v. (or 2-D random vector). – → Range space of bivariate r.v. (X, Y) is denoted by $R_{XY}$ & defined by – If r.v.'s X & Y are discrete r.v.'s, then (X, Y) is a discrete bivariate r.v. – If X & Y are continuous r.v.'s, then (X, Y) is a continuous bivariate r.v. – If one of X and Y is discrete while the other is continuous, then (X, Y) is called a mixed bivariate r.v.
  83. 83. Mean Vectors for Multivariate RVs • Sample – Let y represent a random vector of p variables measured on a sampling unit (subject or object). – If there are n individuals in the sample, the n observation vectors are denoted by $y_1, y_2, \ldots, y_n$, where • Population
  84. 84. Covariance Matrices for Multivariate RVs • Sample covariance matrix S • Population covariance matrix
  85. 85. Covariance & Correlation Coefficient – (k, n)th moment of a bivariate r.v. (X, Y) is defined by – If n = 0, we obtain kth moment of X, and if k = 0, we obtain the nth moment of Y. – If (X, Y) is a discrete bivariate r.v., then
  86. 86. – Similarly – If (X, Y) is a continuous bivariate r.v., then
  87. 87. – (1, 1)th joint moment of (X, Y) is the correlation of X and Y. • If E(XY) = 0, then we say that X and Y are orthogonal. • The covariance of X and Y, Cov(X, Y) or σXY, is defined by – If Cov(X, Y) = 0, then we say that X and Y are uncorrelated.
  88. 88. – If X and Y are independent, then they are uncorrelated, but the converse is not true in general; – the fact that X and Y are uncorrelated does not, in general, imply that they are independent. – The correlation coefficient, denoted by $\rho(X, Y)$ or $\rho_{XY}$, is defined by – Correlation coefficient of X and Y is a measure of linear dependence between X and Y.
  89. 89. • Correlation Matrices – Correlation matrix → covariance matrix, and vice versa.
  90. 90. Matrices for Subsets of Variables • Mean vector & covariance for subsets of variables – 2 subsets – 3+ subsets
  91. 91. N-Variate Random Variables • Concept – An n-tuple of r.v.'s $(X_1, X_2, \ldots, X_n)$ is an n-variate r.v. (n-D r.v.) if each $X_i$, i = 1, 2, ..., n, associates a real number with every sample point ξ ∈ S. Thus, an n-variate r.v. is simply a rule associating an n-tuple of real numbers with every ξ ∈ S. – Let $(X_1, \ldots, X_n)$ be an n-variate r.v. on S. Then its joint cdf is
  92. 92. Special Distributions • Multinomial Distribution • Bivariate Normal Distribution
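  93. 93. Multivariate Extensions of ANOVA
  94. 94. • Conceptual Models
     Choice of method by variable type:
     | | Independent variable: metric | Independent variable: non-metric |
     |---|---|---|
     | Dependent variable: metric | Regression analysis | ANOVA |
     | Dependent variable: non-metric | Discriminant analysis | $\chi^2$ |
     Choice of method by number of variables:
     | | 1 independent variable | 2+ independent variables |
     |---|---|---|
     | 1 dependent variable | One-way ANOVA | Factorial ANOVA |
     | 2+ dependent variables | (One-way) MANOVA | Factorial (two-way/multi-way) MANOVA |
  95. 95. ANOVA model – $Y_{ij}$ = observation for subject j in group i – $n_i$ = number of subjects in group i – $N = n_1 + n_2 + \cdots + n_g$ • Assumptions – $E(Y_{ij}) = \mu_i$ – $\mathrm{var}(Y_{ij}) = \sigma^2$ – Independence – Normality • Under H0: $F \sim F_{g-1,\,N-g}$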
  96. 96. Multi-Factor ANOVA • Factorial or Crossed Treatment Design – In Multi-factor experiments combinations of treatments are applied to experimental units. – In a factorial design, each level of every treatment is combined with each level of all other treatments. • With the addition of crossed factors the number of experimental units increases very quickly and so tough decisions have to be made regarding the number of treatments and the number of levels of each treatment.
  97. 97. Figure panels (main effects without interaction): • No effect of Factor A, a small effect of Factor B (if there were no effect of Factor B the two lines would coincide), and no interaction between Factor A and Factor B. • Large effect of Factor A, small effect of Factor B, and no interaction. • No effect of Factor A, larger effect of Factor B, and no interaction. • Large effect of Factor A, a large effect of Factor B, and no interaction.
  98. 98. Figure panels (with interaction): • No effect of Factor A, no effect of Factor B, but an interaction between A and B. • Large effect of Factor A, no effect of Factor B, with a slight interaction. • No effect of Factor A, a large effect of Factor B, with a very large interaction. • An effect of Factor A, a large effect of Factor B, with a large interaction.
  99. 99. Additive Model (No Interaction) • In factorial design we first look at the interactions for significance. – If interaction is not significant, we can drop the interaction term from our model, and we end up with an additive model. • For a two-factor factorial, the model we initially consider is: • Note that interaction term (αβ)ij is a multiplicative term. • If interaction is found to be non-significant, then model reduces to: – Here we can see that response variable is simply a function of adding the effects of the factors.
  100. 100. Crossed and Nested Factors • Single-factor studies • Multifactor studies Crossed Factors and Nested Factors - Chemical Yield
  101. 101. • Crossed - Nested Designs – Multi-factor studies can involve treatment combinations • → some are crossed with other factors, and some are nested within other factors. – Statistical model • contains both crossed and nested effects: – ANOVA table
     | Source | df |
     |---|---|
     | Factor A | a - 1 |
     | Factor B(A) | a(b - 1) |
     | Factor C | c - 1 |
     | AC | (a - 1)(c - 1) |
     | BC | a(b - 1)(c - 1) |
     | Error | abc(n - 1) |
     | Total | nabc - 1 |
  102. 102. ANCOVA • Concept – Evaluates whether the means of a dependent variable are equal across different groups while statistically controlling the effects of other variables that are not of primary interest (covariates) • Used to control a variable (covariate): – used when we suspect that the variance of the dependent variable is not solely explained by the group variable – (1) Systematic bias → when the members of each group were not selected randomly, which biases the test results – (2) Within-group error SS → variance due to individual differences among subjects in a group
  103. 103. • Steps in ANCOVA
  104. 104. • How Covariance Analysis Reduces Error (a) Error Variability with Single-factor Analysis of Variance Model (b) Error Variability with Covariance Analysis Model
  105. 105. • Covariance Model
  106. 106. • Comparison of Treatment Effects: Treatment Regression Lines with Covariance Model
  107. 107. MANOVA – $Y_{ijk}$ = observation for variable k from subject j in group i. – Assumptions • Data from group i have a common mean vector $\mu_i$ • Data from all groups have a common covariance matrix Σ. • Independence: the subjects are independently sampled. • Normality: the data are multivariate normally distributed. – Ha: $\mu_i \neq \mu_j$ for at least one i ≠ j
  108. 108. • mean vector for treatment i: • mean vector for block j: • grand mean vector: • Total Sum of Squares and Cross Products Matrix. H = Treatment SSCP matrix; B = Block SSCP matrix; E = Error SSCP matrix.
  109. 109. • (k,l)th element of Treatment SSCP matrix H – If k = l, is treatment SS for k, and measures variation among treatments. – If k ≠ l, this measures how k and l vary together across treatments. • (k,l)th element of Block SSCP matrix B is – For k = l, is block SS for k, and measures variation among blocks. – For k ≠ l, this measures how variables k and l vary together. • (k,l)th element of the Error SSCP matrix E is – For k = l, is the error SS for k, and measures variability within treatment and block combinations of variable k. – For k ≠ l, this measures association or dependence between k and l
  110. 110. • Notations – Sample Mean Vector – Grand Mean Vector • is comprised of grand means for each p variables – Total SSCP
  111. 111. • Two Types of MANOVAs – (1) One-Way MANOVA (One group variable) – (2) Factorial MANOVA (more than one group variable)
  112. 112. • One-Way MANOVA (One group variable) – (EX) Student grades of 4 countries: H0: μCan = μUS = μMex = μPan – Calculate F approximations of 4 statistics and look up F-table
  113. 113. • Factorial MANOVA (more than one group variable) – 3 Types of Sum of Squares • (1) Type I SS → for Balanced data • (2) Type II SS → Most powerful when no significant interaction terms • (3) Type III SS → when there is a significant interaction term – H0: • (1) Group variable A/B do not significantly influence the means of the outcome variables • (2) The interaction of group variables A and B do not significantly influence mean of outcome variables
  114. 114. • MANOVA table
     | Source | d.f. | SSP |
     |---|---|---|
     | Blocks | b - 1 | B |
     | Treatments | a - 1 | H |
     | Error | (a - 1)(b - 1) | E |
     | Total | ab - 1 | T |
  115. 115. • MANOVA Assumptions – (1) Independent observations – (2) Normality • Test using the Shapiro-Wilk test – (3) Equal variance-covariance matrices between groups • Test using Box's M test
  116. 116. • Test statistic – Wilks' Lambda: to test H0: treatment mean vectors are equal, • reject H0 if Wilks' lambda is small (close to zero). – Hotelling-Lawley Trace • reject H0 if this test statistic is large. – Pillai Trace • reject H0 if this test statistic is large. – Roy's Maximum Root: largest eigenvalue of $HE^{-1}$ • reject the null hypothesis if this test statistic is large.
  117. 117. Effect Size • Partial $\eta^2$ values – % of variance explained by the group variable (i.v.) • Partial $\eta^2 = 1 - \Lambda^{1/S}$ • $S = \min(P, df_{effect})$ – P = number of dependent variables – $df_{effect}$ = d.f. for the effect tested (independent variable) • One-way MANOVA – Example: Baumann Education Data - group variable: Education • Λ = 0.63202 • $S = \min(P = 3, df_{effect} = 2) = 2$ • Partial $\eta^2 = 1 - 0.63202^{1/2} \approx 0.205$ – 20.5% of the variance in the grades on the 3 tests taken by the students is due to the difference in education style
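The effect-size calculation on this slide is a one-line formula. The sketch below reproduces the slide's own example (Λ = 0.63202, P = 3, effect d.f. = 2); only the function name `partial_eta_squared` is an illustrative choice.

```python
def partial_eta_squared(wilks_lambda, n_dependent, df_effect):
    """Partial eta squared from Wilks' lambda: 1 - lambda ** (1 / S)."""
    s = min(n_dependent, df_effect)   # S = min(P, df_effect)
    return 1 - wilks_lambda ** (1 / s)

# Values from the slide's Baumann education example
eta2 = partial_eta_squared(0.63202, n_dependent=3, df_effect=2)

print(round(eta2, 3))
```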
  118. 118. RBD: 2-way MANOVA • Within randomized block designs, we have two factors: – Blocks, and – Treatments • An RBD with a treatments and b blocks is constructed in 2 steps: – The experimental units (the units to which our treatments are going to be applied) are partitioned into b blocks, each comprised of a units. – Treatments are randomly assigned to the experimental units in such a way that each treatment appears once in each block. • In general, the blocks are partitioned so that: – Units within blocks are as uniform as possible. – Differences between blocks are as large as possible.
  119. 119. 2-way MANOVA Additive Model • Assumptions – Error vectors εij have zero population mean; – Error vectors εij have common variance-covariance matrix Σ — (the usual assumption of a homogeneous variance-covariance matrix) – Error vectors εij are independently sampled; – Error vectors εij are sampled from a multivariate normal distribution; – No block by treatment interaction. This means that the effect of the treatment is not affected by, or does not depend on the block. • Treatment mean vector for treatment i: H = treatment SSCP B = Block SSCP E = Error SSCP
  120. 120. MANCOVA • Concept – $\bar{Y}_{j(adj)} = \bar{Y}_j - b_w(\bar{X}_j - \bar{X})$ – $\bar{Y}_{j(adj)}$ = adjusted d.v. mean in group j (j = 1, 2, ..., up to the total number of groups) – $\bar{Y}_j$ = d.v. mean in group j before adjustment – $b_w$ = common regression coefficient in the entire sample – $\bar{X}_j$ = mean of the covariate for group j – $\bar{X}$ = covariate mean for the entire sample – $H_0: \bar{Y}_{1(adj)} = \bar{Y}_{2(adj)} = \cdots = \bar{Y}_{j(adj)}$
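The adjustment formula shifts each group mean by the amount the covariate predicts. A minimal sketch with invented numbers follows; the function name `adjusted_mean`, the group labels, and all values are hypothetical.

```python
def adjusted_mean(group_mean_y, b_w, group_mean_x, grand_mean_x):
    """Covariate-adjusted group mean: Y_j(adj) = Y_j - b_w * (X_j - X)."""
    return group_mean_y - b_w * (group_mean_x - grand_mean_x)

# Hypothetical summary statistics
b_w = 0.8               # common regression coefficient
grand_mean_x = 10.0     # covariate mean over the entire sample
group_stats = {         # group: (d.v. mean, covariate mean)
    "group 1": (50.0, 12.0),
    "group 2": (47.0, 8.0),
}

adjusted = {
    name: adjusted_mean(y_mean, b_w, x_mean, grand_mean_x)
    for name, (y_mean, x_mean) in group_stats.items()
}

print(adjusted)
```

In this example the unadjusted means differ by 3 points, but after removing the covariate's contribution the two adjusted means are nearly equal, which is the situation the null hypothesis on the slide describes.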
  121. 121. Assumptions • For ANOVA – (1) Observations are independent of each other – (2) Population variances of the groups are equal – (3) The dependent variable is normally distributed • ANCOVA assumptions include the assumptions of ANOVA plus: – (4) Continuous dependent variables and a membership-exclusive (fixed) independent group variable – (5) Linear relationship between dependent variables – (6) The covariate is related to the dependent variable, not to the group variable – (7) Regression lines for the groups are parallel (check by introducing an interaction term of group variable and covariate) – (8) Homoscedasticity of regression slopes (check by introducing the MSE from separate group regressions)
  122. 122. Hands-on practice
  123. 123. AI-Bio Convergence Professional Course, 2022-8~10, 윤형기 (hky@openwith.net), Day 3
  126. 126. 메타분석 (META-ANALYSIS)
  127. 127. 개요 • 메타분석이란? – an “analysis of analyses” (Glass 1976) - to combine, summarize and interpret all available evidence pertaining to a clearly defined research field or research question (Lipsey and Wilson 2001). – 목적 = the statistical synthesis of the data • 배경 – (1) Traditional/Narrative Reviews. • narrative reviews by experts → biases – (2) Systematic Reviews • try to summarize evidence using clearly defined and transparent rules, assessing the validity of evidence using predefined standards and present a synthesis of outcomes in a systematic way. – (3) Meta-Analyses. • aim to combine results from previous studies in a quantitative way. • quantify the effect of a medication, the prevalence of a disease, or the correlation between two properties, across all studies
  128. 128. • Major bibliographic databases
    – Core databases
      • PubMed: openly accessible database of the US National Library of Medicine. Primarily contains biomedical research.
      • PsycInfo: database of the American Psychological Association. Primarily covers research in the social and behavioral sciences.
      • Cochrane Central Register of Controlled Trials (CENTRAL): openly accessible database of the Cochrane Collaboration. Primarily covers health-related topics.
      • Embase: database of biomedical research maintained by the scientific publisher Elsevier. Requires a license.
      • ProQuest International Bibliography of the Social Sciences: database of social science research. Requires a license.
      • Education Resources Information Center (ERIC): openly accessible database on education research.
  129. 129. – Citation databases
      • Web of Science: interdisciplinary citation database maintained by Clarivate Analytics. Requires a license.
      • Scopus: interdisciplinary citation database maintained by Elsevier. Requires a license.
      • Google Scholar: openly accessible citation database maintained by Google. Has only limited search and reference retrieval functionality.
    – Dissertations
      • ProQuest Dissertations: database of dissertations. Requires a license.
    – Study registries
      • WHO International Clinical Trials Registry Platform (ICTRP): openly accessible database of clinical trial registrations worldwide. Can be used to identify studies that have not (yet) been published.
      • OSF Registries: openly accessible interdisciplinary database of study registrations. Can be used to identify studies that have not (yet) been published.
  130. 130. • Fields of use
    – Medicine, psychology, criminology, business, …
  • Main method
    – Meta-analyses of effect sizes
  • Pitfalls
    – "Apples and Oranges" problem
    – "Garbage In, Garbage Out" problem
    – "File Drawer" problem
    – "Researcher Agenda" problem
  131. 131. Basic usage
  Figure: High-dose versus standard-dose of statins (adapted from Cannon et al., 2006).
  132. 132. • Effect size
    – Compute an effect size for each study, assess the consistency of the effect across studies, and compute a summary effect.
    – The effect size can represent any relationship between two variables, such as the impact of an intervention (for example, a medical treatment).
    – Here, a risk ratio < 1.0 means the risk was lower in the high-dose group.
  • Precision
    – The effect size for each study is bounded by a confidence interval, reflecting the precision with which the effect size has been estimated in that study.
  • Study weights
    – The size of each square reflects the weight assigned to the corresponding study when the summary effect is computed.
    – A study's weight follows from its precision. Since precision is driven primarily by sample size, the studies can be thought of as being weighted by sample size.
  • p-values
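The effect size, precision, and weight of a single study can be illustrated with a minimal sketch. It assumes a two-arm study summarized by event counts and uses the standard large-sample variance of ln(RR). The function name and the counts are hypothetical and are not taken from Cannon et al.

```python
import math

def risk_ratio(events_trt, n_trt, events_ctl, n_ctl, z=1.96):
    """Risk ratio, 95% CI and inverse-variance weight for one study.

    Hypothetical helper: arguments are event counts and group sizes
    for the treated (e.g. high-dose) and control (standard-dose) arms.
    """
    rr = (events_trt / n_trt) / (events_ctl / n_ctl)
    log_rr = math.log(rr)
    # Large-sample variance of ln(RR)
    var_log_rr = 1 / events_trt - 1 / n_trt + 1 / events_ctl - 1 / n_ctl
    se = math.sqrt(var_log_rr)
    # The interval is built on the log scale, then back-transformed
    lower = math.exp(log_rr - z * se)
    upper = math.exp(log_rr + z * se)
    # More precise studies (smaller variance) receive more weight
    weight = 1 / var_log_rr
    return rr, lower, upper, weight

# Made-up counts: 40/1000 events under high dose, 50/1000 under standard dose
rr, lower, upper, weight = risk_ratio(40, 1000, 50, 1000)
print(f"RR = {rr:.2f}, 95% CI = ({lower:.2f}, {upper:.2f}), weight = {weight:.1f}")
```

With these counts the risk ratio is about 0.80, below 1.0, so the risk is lower in the treated arm. The confidence interval still includes 1.0, which shows why a single study is often too imprecise on its own.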
  133. 133. • Fixed effects vs. random effects
    – Under the fixed-effect model, we assume that all studies in the analysis share the same true effect size, and the summary effect is our estimate of this common effect size.
    – Under the random-effects model, we assume that the true effect size varies from study to study, and the summary effect is our estimate of the mean of the distribution of effect sizes.
  • Precision
    – The location of the diamond represents the summary effect size, while its width reflects the precision of the estimate.
    – The precision addresses the accuracy of the summary effect as an estimate of the true effect.
  • p-values
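The difference between the two models can be sketched with inverse-variance pooling. The slide does not name an estimator for the between-study variance; this sketch assumes the DerSimonian-Laird method, and all function names and inputs are hypothetical.

```python
def fixed_effect(effects, variances):
    """Inverse-variance weighted mean and its variance (fixed-effect model)."""
    weights = [1 / v for v in variances]
    summary = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return summary, 1 / sum(weights)

def dersimonian_laird_tau2(effects, variances):
    """Between-study variance tau^2 (DerSimonian-Laird, an assumed choice)."""
    weights = [1 / v for v in variances]
    mean, _ = fixed_effect(effects, variances)
    q = sum(w * (y - mean) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = sum(weights) - sum(w * w for w in weights) / sum(weights)
    return max(0.0, (q - df) / c)  # truncated at zero

def random_effects(effects, variances):
    """Summary effect when each study has its own true effect."""
    tau2 = dersimonian_laird_tau2(effects, variances)
    # Each study's variance is inflated by tau^2 before weighting
    return fixed_effect(effects, [v + tau2 for v in variances])

# Made-up log risk ratios and their within-study variances
log_rr = [-0.30, -0.10, -0.25, 0.05]
var = [0.04, 0.02, 0.05, 0.03]
print("fixed :", fixed_effect(log_rr, var))
print("random:", random_effects(log_rr, var))
```

When the estimated between-study variance is zero the two models coincide. Otherwise the random-effects summary has the larger variance, which corresponds to a wider diamond.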
  134. 134. • Heterogeneity of effect sizes
    – The treatment effect is usually NOT consistent across all studies.
    – Task: assess the dispersion of effect sizes from study to study.
    – If the effect size is consistent, focus on the summary effect and note that it is robust across the domain of studies included in the analysis.
    – If the effect size varies modestly, report the summary effect but note that the true effect in any given study could be somewhat lower or higher than this value.
    – If the effect varies substantially from one study to the next, shift attention from the summary effect to the dispersion itself.
  135. 135. – Raw (unstandardized) mean difference D • Computing D from studies that use independent groups
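The formulas for this slide did not survive in the transcript. As a sketch, the usual textbook computation for two independent groups is shown below, assuming the two groups share a common population variance (pooled standard deviation). The function name and the numbers are illustrative.

```python
import math

def raw_mean_difference(mean1, sd1, n1, mean2, sd2, n2):
    """Raw mean difference D, its variance and standard error.

    Assumes independent groups with equal population variances.
    """
    d = mean1 - mean2
    # Pooled within-group variance
    s2_pooled = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    var_d = (n1 + n2) / (n1 * n2) * s2_pooled
    return d, var_d, math.sqrt(var_d)

# Illustrative summary statistics for a treated and a control group
D, var_D, se_D = raw_mean_difference(103, 5.5, 50, 100, 4.5, 50)
print(f"D = {D:.1f}, V_D = {var_D:.2f}, SE_D = {se_D:.3f}")  # D = 3.0, V_D = 1.01
```

D stays on the original measurement scale, so it is only meaningful to pool when every study uses the same instrument.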
  136. 136. – Standardized mean difference, d and g • If studies use different instruments to assess the outcome, then the scale of measurement will differ from study to study and it would not be meaningful to combine raw mean differences. In such cases, use standardized mean difference • Computing d and g from studies that use independent groups
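The formulas for d and g are also missing from the transcript. A minimal sketch for independent groups follows, assuming Cohen's d with a pooled standard deviation and the usual small-sample correction factor J for Hedges' g. The function name and the numbers are illustrative.

```python
import math

def standardized_mean_difference(mean1, sd1, n1, mean2, sd2, n2):
    """Return (d, V_d, g, V_g) for two independent groups."""
    df = n1 + n2 - 2
    s_within = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    d = (mean1 - mean2) / s_within
    var_d = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))
    # d is slightly biased upward in small samples; J shrinks it
    j = 1 - 3 / (4 * df - 1)
    g = j * d
    var_g = j ** 2 * var_d
    return d, var_d, g, var_g

d, var_d, g, var_g = standardized_mean_difference(103, 5.5, 50, 100, 4.5, 50)
print(f"d = {d:.4f} (V = {var_d:.4f}), g = {g:.4f} (V = {var_g:.4f})")
```

With 50 participants per group the correction is small (g is about 0.592 against d of about 0.597). It matters more when group sizes are in the single digits or low teens.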
  137. 137. • Response ratios
    – Response ratios are analyzed in log units.
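Working in log units can be sketched as follows: compute the log response ratio and its variance, run the analysis on that scale, then exponentiate to report on the ratio scale. The sketch assumes independent groups and a pooled standard deviation supplied by the caller. The function name and the numbers are illustrative.

```python
import math

def log_response_ratio(mean1, n1, mean2, n2, sd_pooled):
    """Return (ln R, variance of ln R) for two independent groups.

    Both means must be positive for the ratio to be defined.
    """
    log_r = math.log(mean1) - math.log(mean2)
    var_log_r = sd_pooled ** 2 * (1 / (n1 * mean1 ** 2) + 1 / (n2 * mean2 ** 2))
    return log_r, var_log_r

log_r, var_log_r = log_response_ratio(61.515, 10, 51.015, 10, 19.475)
se = math.sqrt(var_log_r)
# Back-transform the estimate and its 95% interval to the ratio scale
ratio = math.exp(log_r)
ci = (math.exp(log_r - 1.96 * se), math.exp(log_r + 1.96 * se))
print(f"ln R = {log_r:.4f}, V = {var_log_r:.4f}, R = {ratio:.3f}, 95% CI = {ci}")
```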
  138. 138. • Examples of effect sizes
    – Effect sizes based on means
      • Raw (unstandardized) mean difference (D): based on studies with independent groups, or with matched groups or pre-post designs
      • Standardized mean difference (d or g): based on studies with independent groups, or with matched groups or pre-post designs
      • Response ratios (R): based on studies with independent groups
    – Effect sizes based on binary data
      • Risk ratio (RR): based on studies with independent groups
      • Odds ratio (OR): based on studies with independent groups
      • Risk difference (RD): based on studies with independent groups
    – Effect sizes based on correlational data
      • Correlation (r): based on studies with one group
  Source: Michael Borenstein et al., Introduction to Meta-Analysis, 2009
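For the binary and correlational entries in this list, the sketch below shows commonly used computations: the log odds ratio and risk difference from a 2×2 table, and Fisher's z transformation of a correlation. The function names, the cell labels a, b, c, d, and the numbers are hypothetical.

```python
import math

def log_odds_ratio(a, b, c, d):
    """ln(OR) and its variance; a, b = events/non-events in group 1,
    c, d = events/non-events in group 2."""
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d

def risk_difference(a, b, c, d):
    """Risk difference and its variance for the same 2x2 layout."""
    n1, n2 = a + b, c + d
    return a / n1 - c / n2, a * b / n1 ** 3 + c * d / n2 ** 3

def fisher_z(r, n):
    """Fisher's z for a correlation r from n subjects, and its variance."""
    return 0.5 * math.log((1 + r) / (1 - r)), 1 / (n - 3)

# Made-up 2x2 table: 5 of 100 events under treatment, 10 of 100 under control
log_or, var_log_or = log_odds_ratio(5, 95, 10, 90)
rd, var_rd = risk_difference(5, 95, 10, 90)
z, var_z = fisher_z(0.5, 100)
print(f"OR = {math.exp(log_or):.4f}, RD = {rd:.2f}, z = {z:.4f}")
```

Odds ratios, like risk ratios, are pooled on the log scale, and correlations on the z scale. Risk differences are pooled directly.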
  139. 139. OMICS
  140. 140. Life science and omics
  • Cell
  • Central dogma
  • Sequencing?
    – The operation of determining the precise order of nucleotides in a given DNA molecule, used to determine the sequence of individual genes, full chromosomes, or the entire genome of an organism.
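Once a nucleotide sequence is known, the central dogma (DNA to RNA to protein) can be followed in a few lines. This is a toy sketch: it assumes the input is the coding strand, so transcription reduces to replacing T with U, and the codon table is deliberately partial. All names are hypothetical.

```python
# Partial codon table for illustration only; "*" marks a stop codon
CODON_TABLE = {
    "AUG": "M", "GCC": "A", "UUU": "F", "GGC": "G",
    "UAA": "*", "UAG": "*", "UGA": "*",
}

def transcribe(coding_dna):
    """DNA coding strand -> mRNA (T replaced by U)."""
    return coding_dna.upper().replace("T", "U")

def translate(mrna):
    """mRNA -> protein, reading codons from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")  # X = not in toy table
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

mrna = transcribe("ATGGCCTTTTAA")
protein = translate(mrna)
print(mrna, "->", protein)  # AUGGCCUUUUAA -> MAF
```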
  141. 141. • Omics (from https://en.wikipedia.org/wiki/Omics)
    – Aims at the collective characterization and quantification of pools of biological molecules that translate into the structure, function, and dynamics of an organism or organisms.
  • Main research areas of computational biology and bioinformatics
    – 1. Genomics (and genetics): the study of the structure, function, and mapping of genomes.
    – 2. Transcriptomics: the study of the transcriptome, the sum of an organism's RNA transcripts.
  142. 142. – 3. Proteomics: the study of proteins.
      • Transcription produces messenger RNA (mRNA), which serves as the template for protein synthesis through translation. Hence the proteins produced depend on which genes are transcribed into mRNA.
      • 1. Applications of proteomics in drug discovery
      • 2. Protein folding
      • 3. Protein structure prediction
      • 4. Protein-protein interaction networks
    – 4. Metabolomics: the study of metabolites, the molecules produced by metabolism within tissues and cells.
      • Researchers identify and quantify metabolites using different analytical methods and interpret the data. Subfields of metabolomics include metabonomics and exometabolomics.
      • 1. Metabolic reprogramming
      • 2. Mass spectrometry strategies
      • 3. Identification of biomarkers
  143. 143. – 5. Phylogenetics: the study of how species evolved and what relationships exist within groups of organisms.
      • Relationships are determined using phylogenetic inference methods applied to DNA sequencing data or morphology, and are summarized as a phylogenetic tree.
      • 1. Inferring phylogenetic trees
      • 2. Phylogenetic networks
      • 3. Bayesian phylogenetics
      • 4. Phylogenetic model selection
      • 5. Evolutionary models
  144. 144. – 6. Systems biology: uses mathematical models and simulations.
      • 1. Gene regulatory networks
      • 2. Modelling metabolic interactions
      • 3. Modelling protective mechanisms induced by antibiotics
      • 4. Studying cell signalling pathways
  145. 145. Hands-on practice
