Workshop 2011

Population Structure Analysis
using STRUCTURE software

Chang Bum Hong

kt Bioinformatics TF, hongiiv@gmail.com, twitter @hongiiv, hongiiv.tistory.com

Permissions: you are free to blog or live-blog about this presentation as long as you attribute the work to its authors
Friday, August 12, 11

Genetic test

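In general, when alcohol is consumed it is converted into acetaldehyde (a toxic substance that
flushes the face, makes the heart pound, and causes vomiting), which ALDH then breaks down
into acetate, which is harmless to the body. The ALDH2 gene is involved in breaking down
acetaldehyde as soon as even a small amount is produced, and people fall into three types
depending on their genotype.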


23andMe


Northwestern Europe

Southeastern Europe


HGDP(Human Genome Diversity Project)



PASNP(Pan-Asian SNP Consortium)



East Asia - Public genotype data
Dataset           SNPs        Individuals   Populations
PASNP (a)         54,794      1,928         75
HGDP              2,834~      1,056         52
HapMap            1,481,135   1,397         11
SGVP (b)          268,667     292           3
Korean            58,625      159           10
China(Yanbian)    58,625      16            1
Japan(Kobe)       58,625      5             1
Korea-Japan       58,625      6             1
Vietnam           58,625      16            1
Korean-Vietnam    58,625      8             1
Cambodia          58,625      16            1
Mongol            58,625      16            1
a. Pan-Asian SNP Consortium (http://www4a.biotec.or.th/PASNP)
b. Singapore Genome Variation Project (http://www.nus-cme.org.sg/SGVP)


Korean Data
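[Figure: map of the 10 Korean sampling regions (YeonCheon, PyeongChang, JeCheon, CheonAn,
GyeongJu, GimJe, GoRyeong, UlSan, NaJu, JeJu), grouped on the map as MW, SW, and SE, with
15 or 16 individuals per region]
• Participants: average age over 70 years, long settlement
• Genotyping: Affymetrix 50K Xba, 58,960 SNPs
• Other populations: China(Yanbian), Japan(Kobe), Korea-Japan, Vietnam, Korean-Vietnam,
Cambodia, Mongol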


Missing genotype individuals
[Figure: two panels of missing genotypes per individual before QC (58,960 SNPs), one for
All Asian and one for Korean; labeled samples: GimJe, GoRyeong, GyeongJu]


Relatedness between the 153 Korean (10 regions) individuals
[Figure: PCA plot with points labeled by region: YeonCheon, PyeongChang, JeCheon, CheonAn,
GyeongJu, UlSan, GimJe, GoRyeong, NaJu, JeJu]
PCA analysis using 46,559 autosomal SNP markers (n=153, Korean)

PCA analysis of East Asian descent
[Figure: PCA plot with groups labeled Mongol, Yanbian, Kobe, JPT-HapMap, Jeju, CHB-HapMap,
Vietnam, Cambodia, Korea-Vietnam, Korea-Japan]
Illustration of the geographic correspondence of ethnic group locations

Relationship between Eigenvector values and Latitude
[Figure: scatter plot with a fitted line; axis ticks shown at 47.81, 39.98, 37.53, 14.72]
R^2 = 0.8621
y = 36.65 + 166.33x
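The fitted line on this slide can be read as a simple predictor. The sketch below assumes that x is an individual's eigenvector value and y is latitude in degrees, which is how the slide title and the tick values read; the function name is made up for illustration, and only the two coefficients come from the slide.

```python
# Coefficients copied from the slide: y = 36.65 + 166.33x.
INTERCEPT = 36.65
SLOPE = 166.33


def predicted_latitude(eigenvector_value):
    """Latitude (degrees) predicted by the slide's fitted line.

    Assumption: x is the eigenvector value and y is latitude.
    """
    return INTERCEPT + SLOPE * eigenvector_value


if __name__ == "__main__":
    # Illustrative eigenvector values only, not data from the study.
    for x in (-0.10, 0.00, 0.05):
        print(f"x = {x:+.2f} -> predicted latitude {predicted_latitude(x):.2f}")
```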


STRUCTURE software
• A model-based clustering method (Pritchard et al. 2000)

• Free software
(http://pritch.bsd.uchicago.edu/structure.html)

• Bayesian approach (MCMC: Markov Chain Monte Carlo)

• Detects the underlying genetic populations among a set of individuals genotyped at multiple
markers

• Computes the proportion of the genome of an individual originating from each inferred
population (quantitative clustering method)


Input data
• A matrix where the data for individuals are in rows and the loci
are in columns
• n consecutive rows hold the data for each individual of an n-
ploid species
• Integers should be used for coding genotypes
• Missing data should be indicated by a number (e.g. -1) which
doesn’t occur elsewhere in the data
• The data file should be a text file (.txt), not an Excel file (.xls), for
running STRUCTURE


Input format
1 row per individual (both alleles of each locus on the same row)
MarkerName...
Label PopID Flag Location Genotype...

genotype (1,2,5)
AA = 11
AB = 12
BB = 22
missing = 55

Information of user-defined populations
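Label: a unique ID for each individual; numbers or letters are both fine (e.g., CEPH1334.10)
PopID: a unique number for the population the individual belongs to, assigned by the user (e.g., 5 for Chinese (CHB), 1 for European (CEU))
Flag: whether the PopID information is used when running STRUCTURE (1 = use, 0 = do not use)
Location: location information for the individual, assigned by the user (e.g., 1 for East Asia (EAS), 2 for Europe (EURA))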
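The input layout described above can be sketched as a small converter. This is a minimal illustration, not the author's script: it assumes each genotype is written as two whitespace-separated allele integers on one row per individual, using the slide's coding (A = 1, B = 2, missing = 5). The marker names, the second sample label, and the function names are hypothetical.

```python
# Allele codes taken from the slide: AA = 1 1, AB = 1 2, BB = 2 2, missing = 5 5.
GENOTYPE_CODES = {"AA": (1, 1), "AB": (1, 2), "BB": (2, 2), None: (5, 5)}


def structure_row(label, pop_id, flag, location, genotypes):
    """Build one line: Label PopID Flag Location, then two alleles per locus."""
    fields = [str(label), str(pop_id), str(flag), str(location)]
    for genotype in genotypes:
        allele1, allele2 = GENOTYPE_CODES[genotype]
        fields += [str(allele1), str(allele2)]
    return " ".join(fields)


def structure_file(marker_names, individuals):
    """Marker-name header line followed by one row per individual."""
    lines = [" ".join(marker_names)]
    for individual in individuals:
        lines.append(structure_row(*individual))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = structure_file(
        ["rs0001", "rs0002", "rs0003"],  # hypothetical marker names
        [
            ("CEPH1334.10", 1, 1, 2, ["AA", "AB", None]),
            ("SAMPLE_CHB_01", 5, 1, 1, ["BB", "BB", "AB"]),  # hypothetical label
        ],
    )
    print(text, end="")
```

Writing the result to a plain .txt file keeps it compatible with the earlier note that the data file should be text rather than Excel.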

Input format (cont.)


Running STRUCTURE from a graphical
interface, Front End


Importing input data into a project


Importing input data into a project (cont.)


Configuring a parameter set


Configuring a parameter set (cont.)

Length of Burnin Period: how long to run the simulation before collecting data, to minimize the
effect of the starting configuration (the number of iterations allowed for convergence)
Number of MCMC Reps after Burnin: how long to run the simulation after burnin to get
accurate parameter estimates
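Outside the Front End, the same two settings live in the mainparams file as BURNIN and NUMREPS. The sketch below renders a few `#define` lines in that style. The numeric values are placeholders rather than recommendations, NUMINDS and NUMLOCI are invented for the example, and a real mainparams file contains more entries than shown here.

```python
def render_mainparams(settings):
    """Render settings as '#define NAME VALUE' lines, one per entry."""
    return "\n".join(f"#define {name} {value}" for name, value in settings.items()) + "\n"


# Placeholder values for illustration only; choose run lengths from your own
# convergence checks. NUMINDS and NUMLOCI here are hypothetical.
EXAMPLE_SETTINGS = {
    "MAXPOPS": 3,        # number of populations K assumed for this run
    "BURNIN": 10000,     # length of burnin period
    "NUMREPS": 20000,    # number of MCMC reps after burnin
    "NUMINDS": 153,
    "NUMLOCI": 1000,
    "PLOIDY": 2,
    "MISSING": 5,        # missing-allele code used on the input-format slide
    "ONEROWPERIND": 1,   # both alleles of a locus on the same row
    "LABEL": 1,
    "POPDATA": 1,
    "POPFLAG": 1,
    "LOCDATA": 1,
    "MARKERNAMES": 1,
}

if __name__ == "__main__":
    print(render_mainparams(EXAMPLE_SETTINGS), end="")
```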

Configuring a parameter set (cont.)


Running STRUCTURE: a single run


Running STRUCTURE: a single run (cont.)


Running STRUCTURE: a batch run


Running STRUCTURE: a batch run (cont.)


Ln P(D): estimated log probability of the data for each K


Analysis of genome-wide SNP data
• For very large data sets, the runtime of structure using default
settings may become impractically slow

• use reduced data sets (e.g., pruned)
• get accurate results using much shorter runs than default
(e.g., small values of NUMREPS)

• download the source code and compile it on your machine
(using a 64-bit machine)

• use the command-line version of structure
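For the command-line route, a batch over several values of K with replicate runs can be prepared as a list of commands. The sketch below only builds the command lines and does not execute them. It assumes the executable is on the PATH as `structure` and that mainparams and extraparams sit in the working directory; the input file name, output directory, and seed scheme are hypothetical.

```python
def structure_commands(k_values, replicates, infile="project_data.txt",
                       outdir="results", base_seed=1000):
    """Build one argument list per (K, replicate) pair.

    Options used: -m/-e parameter files, -K number of populations,
    -i input file, -o output file, -D random seed.
    """
    commands = []
    for k in k_values:
        for rep in range(1, replicates + 1):
            commands.append([
                "structure",
                "-m", "mainparams",
                "-e", "extraparams",
                "-K", str(k),
                "-i", infile,
                "-o", f"{outdir}/K{k}_rep{rep}",
                "-D", str(base_seed + 100 * k + rep),  # distinct seed per run
            ])
    return commands


if __name__ == "__main__":
    for command in structure_commands(range(1, 4), replicates=2):
        print(" ".join(command))
```

Each list can be passed to `subprocess.run` once the paths have been checked, which makes it easy to spread the runs across several cores or machines.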


An example of MCMC convergence


Inference of true K
(number of population)

• The log likelihood for each K, Ln P(D) = L(K)
• Two approaches to determine the best K
• Use of L(K): when K is approaching a true value, L(K) plateaus
and has high variance between runs
• Use of an ad hoc quantity (∆K): calculated based on the
second order rate of change of the likelihood. The ∆K
shows a clear peak at the true value of K
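The ∆K idea can be sketched in a few lines. The slide does not give a formula, so this assumes the commonly used form from Evanno et al. (2005): ∆K = |L(K+1) - 2L(K) + L(K-1)| / sd(L(K)), with L(K) averaged over replicate runs and the standard deviation taken across those runs. The Ln P(D) values below are invented for illustration.

```python
from statistics import mean, stdev


def delta_k(lnpd_by_k):
    """Return {K: delta K} for every K that has both K-1 and K+1 available.

    lnpd_by_k maps K to a list of Ln P(D) values from replicate runs.
    Needs at least two runs per K and a non-zero spread between runs.
    """
    mean_l = {k: mean(values) for k, values in lnpd_by_k.items()}
    result = {}
    for k in sorted(lnpd_by_k):
        if k - 1 in mean_l and k + 1 in mean_l:
            second_diff = abs(mean_l[k + 1] - 2 * mean_l[k] + mean_l[k - 1])
            result[k] = second_diff / stdev(lnpd_by_k[k])
    return result


if __name__ == "__main__":
    # Hypothetical Ln P(D) values from three runs per K (not real results).
    runs = {
        1: [-1001.0, -1000.0, -999.0],
        2: [-901.0, -900.0, -899.0],
        3: [-700.5, -700.0, -699.5],
        4: [-705.0, -695.0, -685.0],
        5: [-710.0, -690.0, -670.0],
    }
    scores = delta_k(runs)
    best_k = max(scores, key=scores.get)
    print(scores, "peak at K =", best_k)
```

∆K is undefined for the smallest and largest K tested, so the range of K should extend at least one step beyond the values of interest.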


Simulation Result

Q-matrix
the estimated membership of each individual in each subpopulation
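One simple way to read a Q-matrix is to take, for each individual, the subpopulation with the largest membership value and note whether that value is high enough to call the assignment clear. The sketch below assumes each row is a list of proportions summing to about 1; the 0.8 cut-off and the sample labels are arbitrary choices for illustration, not values from the slides.

```python
def assign_clusters(q_matrix, threshold=0.8):
    """Map each label to (cluster number starting at 1, clearly_assigned flag).

    q_matrix maps an individual's label to its membership proportions,
    one value per inferred subpopulation.
    """
    assignments = {}
    for label, proportions in q_matrix.items():
        best = max(range(len(proportions)), key=lambda k: proportions[k])
        assignments[label] = (best + 1, proportions[best] >= threshold)
    return assignments


if __name__ == "__main__":
    # Hypothetical membership proportions for K = 3 (not real results).
    q = {
        "ind01": [0.95, 0.03, 0.02],
        "ind02": [0.10, 0.85, 0.05],
        "ind03": [0.45, 0.40, 0.15],  # admixed: no cluster reaches the cut-off
    }
    for label, (cluster, clear) in assign_clusters(q).items():
        print(label, "cluster", cluster, "clear" if clear else "admixed")
```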


Simulation Result (cont.)


Enjoy running STRUCTURE


We may not always be able to know the TRUE value of
K, but we should aim for the smallest value of K
that captures the major structure in the data

Pritchard et al. (2000)

