population structure software: detailed analysis method

Structure Software
Pritchard Lab, Stanford University
(http://pritch.bsd.uchicago.edu/structure.html

An overview of STRUCTURE
 Pritchard, Stephens & Donnelly (2000a),
 Extended by Falush, Stephens and Pritchard (2003a, 2007).
 Updated version 2.3.4 (2012)
 It is a model-based clustering method for assuming
population structure
Applications:
 Demonstrating the presence of population structure
 Identifying distinct genetic populations,
 Assigning individuals to populations.
 Identifying migrants and admixed individuals
 Population genetics, forensic analysis & human genetics

An overview of STRUCTURE (cont.)
 One of the most widely used population analysis tools
 Detects the underlying genetic population among a set of
individuals genotyped at multiple markers
 Computes the proportion of the genome of an individual
originating from each inferred population (quantitative
clustering method)
 Web based, freely available software. Not require special
packages to install and to run the operations.
 A simple binary data is enough for basic run

Software impact during 2019-20

Terminology to understand
 K= no. of assumed populations,
 ΔK,
 Burn in period – to minimize effect of starting configuration.
 MCMC-Markov Chain Monte Carlo (Bayesian approach)
 Expected heterozygosity:
Fundamental measure of genetic variation in a population, based
on the probability of heterozygous genotypes in different sub groups
 Population differentiation(fst)
explains the summary of genetic differentiation among groups/ population
 Membership probability
 Q matrix

Input file format
Option1: each individual has ,for each locus, one allele in different rows
For SNP
Option2: each individual has ,for each locus, one allele in different columns
• A binary matrix where the data for individuals are in rows, the loci are in
column
• n consecutive rows have the data for each individual of n-ploid species
• Integer should be used for coding genotype
• Missing data should be indicated by a number which doesn’t occur
elsewhere in the data (e.g. -1 or -9 )
• The data file should be a text file (.txt) for running STRUCTURE

Two
consecutive
rows for alleles
Input file format
This has to be convert into txt. format

Running STRUCTURE from a graphical interface, Front End
Project
functioning
details
Shows the out put and
fictional set parameters
here
Program running

Importing input data into a project

Importing input data into a project (cont.)

Configuring a parameter set
Length of Burn in Period: how long to run the simulation before
collecting data to minimize the effect of the starting configuration
Number of MCMC runs after burn in
how long to run the simulation after burn in to get accurate parameter
estimates

Configuring a parameter set (cont.)
No admixture:
assumes that each individual has its origin in only one of the K populations
Admixture model:
This model considers that individuals can have an admixed ancestry such
that each individual can inherit a fraction of their genome from ancestors
in population k.
Linkage model:
The linkage model has the same characteristics as the admixture model but
it considers that all the alleles in the same region (linkage group) have an
origin in the same ancestral population

Running STRUCTURE
Details of the parameters set
for run

Running STRUCTURE (cont.)
No of iterations:

Representing the results graphically

different types in representation
C)
C=Tree plot representation
at K=4

Fst:
summary of genetic
differentiation among groups

Inference of true K
(K=number of populations)
The log likelihood for each K, Ln P(D) = L(K)
Two approaches to determine the best K
1. Use of L(K): When K is approaching a true value,
L(K) plateaus (or continues increasing slightly)
and has high variance between runs (Rosenberg et
al. 2001).
Nonparametric test (Wilcoxon test)
2. Use of an ad hoc quantity (ΔK): Calculated based
on the second order rate of change of the
likelihood (ΔK) (Evanno et al. 2005). The ΔK
shows a clear peak at the true value of K.
ΔK = m([L’’K])/s[L(K)]

Inference of best K using the delta K method

Structure Harvester
 This is a web based python program. Developed by Earl&Von holdt 2012.
 Based on Evano Method 2005.
 Out put file of STRUCTURE can be taken as input (in Zip file format only)
 Gives the best K value based on ΔK method
Conditions for better performance
 Needs to run at least 3 K values with minimum 3 replications
 Sample Standard deviation of the log likelihood values across all K values is
non zero
 Small number of replications may violate result

Conclusion & Remarks
 Simple and effective software for assuming the populations in
a unknown set of samples using molecular data (allelic
variation)
 The burn in period is not clearly explained on which basis it
will create the starting point for simulation
 Both burn in period and MCMC length has to be standardize
based on different analysis
 Overall with less input data we can create high impact of
representation which is widely using

There is no one standard analysis parameter in
STRUCTURE—the data and the study objectives will
influence the choice of the most appropriate
parameter— and precaution should be used to avoid
over estimating the actual population structure
present in complex data.
Thank you

population structure software: detailed analysis method

More Related Content

Similar to population structure software: detailed analysis method

Recently uploaded

population structure software: detailed analysis method