Structure Software
Pritchard Lab, Stanford University
(http://pritch.bsd.uchicago.edu/structure.html
An overview of STRUCTURE
 Pritchard, Stephens & Donnelly (2000a),
 Extended by Falush, Stephens and Pritchard (2003a, 2007).
 Updated version 2.3.4 (2012)
 It is a model-based clustering method for assuming
population structure
Applications:
 Demonstrating the presence of population structure
 Identifying distinct genetic populations,
 Assigning individuals to populations.
 Identifying migrants and admixed individuals
 Population genetics, forensic analysis & human genetics
An overview of STRUCTURE (cont.)
 One of the most widely used population analysis tools
 Detects the underlying genetic population among a set of
individuals genotyped at multiple markers
 Computes the proportion of the genome of an individual
originating from each inferred population (quantitative
clustering method)
 Web based, freely available software. Not require special
packages to install and to run the operations.
 A simple binary data is enough for basic run
Software impact during 2019-20
Terminology to understand
 K= no. of assumed populations,
 ΔK,
 Burn in period – to minimize effect of starting configuration.
 MCMC-Markov Chain Monte Carlo (Bayesian approach)
 Expected heterozygosity:
Fundamental measure of genetic variation in a population, based
on the probability of heterozygous genotypes in different sub groups
 Population differentiation(fst)
explains the summary of genetic differentiation among groups/ population
 Membership probability
 Q matrix
Input file format
Option1: each individual has ,for each locus, one allele in different rows
For SNP
Option2: each individual has ,for each locus, one allele in different columns
• A binary matrix where the data for individuals are in rows, the loci are in
column
• n consecutive rows have the data for each individual of n-ploid species
• Integer should be used for coding genotype
• Missing data should be indicated by a number which doesn’t occur
elsewhere in the data (e.g. -1 or -9 )
• The data file should be a text file (.txt) for running STRUCTURE
Two
consecutive
rows for alleles
Input file format
This has to be convert into txt. format
Running STRUCTURE from a graphical interface, Front End
Project
functioning
details
Shows the out put and
fictional set parameters
here
Program running
Importing input data into a project
Importing input data into a project (cont.)
Configuring a parameter set
Length of Burn in Period: how long to run the simulation before
collecting data to minimize the effect of the starting configuration
Number of MCMC runs after burn in
how long to run the simulation after burn in to get accurate parameter
estimates
Configuring a parameter set (cont.)
No admixture:
assumes that each individual has its origin in only one of the K populations
Admixture model:
This model considers that individuals can have an admixed ancestry such
that each individual can inherit a fraction of their genome from ancestors
in population k.
Linkage model:
The linkage model has the same characteristics as the admixture model but
it considers that all the alleles in the same region (linkage group) have an
origin in the same ancestral population
Running STRUCTURE
Details of the parameters set
for run
Running STRUCTURE (cont.)
No of iterations:
Representing the results graphically
different types in representation
C)
C=Tree plot representation
at K=4
Fst:
summary of genetic
differentiation among groups
Summery of Structure
Inference of true K
(K=number of populations)
The log likelihood for each K, Ln P(D) = L(K)
Two approaches to determine the best K
1. Use of L(K): When K is approaching a true value,
L(K) plateaus (or continues increasing slightly)
and has high variance between runs (Rosenberg et
al. 2001).
Nonparametric test (Wilcoxon test)
2. Use of an ad hoc quantity (ΔK): Calculated based
on the second order rate of change of the
likelihood (ΔK) (Evanno et al. 2005). The ΔK
shows a clear peak at the true value of K.
ΔK = m([L’’K])/s[L(K)]
Inference of best K using the delta K method
Structure Harvester
 This is a web based python program. Developed by Earl&Von holdt 2012.
 Based on Evano Method 2005.
 Out put file of STRUCTURE can be taken as input (in Zip file format only)
 Gives the best K value based on ΔK method
Conditions for better performance
 Needs to run at least 3 K values with minimum 3 replications
 Sample Standard deviation of the log likelihood values across all K values is
non zero
 Small number of replications may violate result
Conclusion & Remarks
 Simple and effective software for assuming the populations in
a unknown set of samples using molecular data (allelic
variation)
 The burn in period is not clearly explained on which basis it
will create the starting point for simulation
 Both burn in period and MCMC length has to be standardize
based on different analysis
 Overall with less input data we can create high impact of
representation which is widely using
There is no one standard analysis parameter in
STRUCTURE—the data and the study objectives will
influence the choice of the most appropriate
parameter— and precaution should be used to avoid
over estimating the actual population structure
present in complex data.
Thank you

population structure software: detailed analysis method

  • 1.
    Structure Software Pritchard Lab,Stanford University (http://pritch.bsd.uchicago.edu/structure.html
  • 2.
    An overview ofSTRUCTURE  Pritchard, Stephens & Donnelly (2000a),  Extended by Falush, Stephens and Pritchard (2003a, 2007).  Updated version 2.3.4 (2012)  It is a model-based clustering method for assuming population structure Applications:  Demonstrating the presence of population structure  Identifying distinct genetic populations,  Assigning individuals to populations.  Identifying migrants and admixed individuals  Population genetics, forensic analysis & human genetics
  • 3.
    An overview ofSTRUCTURE (cont.)  One of the most widely used population analysis tools  Detects the underlying genetic population among a set of individuals genotyped at multiple markers  Computes the proportion of the genome of an individual originating from each inferred population (quantitative clustering method)  Web based, freely available software. Not require special packages to install and to run the operations.  A simple binary data is enough for basic run
  • 4.
  • 5.
    Terminology to understand K= no. of assumed populations,  ΔK,  Burn in period – to minimize effect of starting configuration.  MCMC-Markov Chain Monte Carlo (Bayesian approach)  Expected heterozygosity: Fundamental measure of genetic variation in a population, based on the probability of heterozygous genotypes in different sub groups  Population differentiation(fst) explains the summary of genetic differentiation among groups/ population  Membership probability  Q matrix
  • 6.
    Input file format Option1:each individual has ,for each locus, one allele in different rows For SNP Option2: each individual has ,for each locus, one allele in different columns • A binary matrix where the data for individuals are in rows, the loci are in column • n consecutive rows have the data for each individual of n-ploid species • Integer should be used for coding genotype • Missing data should be indicated by a number which doesn’t occur elsewhere in the data (e.g. -1 or -9 ) • The data file should be a text file (.txt) for running STRUCTURE
  • 7.
    Two consecutive rows for alleles Inputfile format This has to be convert into txt. format
  • 8.
    Running STRUCTURE froma graphical interface, Front End Project functioning details Shows the out put and fictional set parameters here Program running
  • 9.
    Importing input datainto a project
  • 10.
    Importing input datainto a project (cont.)
  • 11.
    Configuring a parameterset Length of Burn in Period: how long to run the simulation before collecting data to minimize the effect of the starting configuration Number of MCMC runs after burn in how long to run the simulation after burn in to get accurate parameter estimates
  • 12.
    Configuring a parameterset (cont.) No admixture: assumes that each individual has its origin in only one of the K populations Admixture model: This model considers that individuals can have an admixed ancestry such that each individual can inherit a fraction of their genome from ancestors in population k. Linkage model: The linkage model has the same characteristics as the admixture model but it considers that all the alleles in the same region (linkage group) have an origin in the same ancestral population
  • 13.
    Running STRUCTURE Details ofthe parameters set for run
  • 14.
  • 15.
  • 16.
    different types inrepresentation C) C=Tree plot representation at K=4
  • 17.
  • 18.
  • 19.
    Inference of trueK (K=number of populations) The log likelihood for each K, Ln P(D) = L(K) Two approaches to determine the best K 1. Use of L(K): When K is approaching a true value, L(K) plateaus (or continues increasing slightly) and has high variance between runs (Rosenberg et al. 2001). Nonparametric test (Wilcoxon test) 2. Use of an ad hoc quantity (ΔK): Calculated based on the second order rate of change of the likelihood (ΔK) (Evanno et al. 2005). The ΔK shows a clear peak at the true value of K. ΔK = m([L’’K])/s[L(K)]
  • 20.
    Inference of bestK using the delta K method
  • 21.
    Structure Harvester  Thisis a web based python program. Developed by Earl&Von holdt 2012.  Based on Evano Method 2005.  Out put file of STRUCTURE can be taken as input (in Zip file format only)  Gives the best K value based on ΔK method Conditions for better performance  Needs to run at least 3 K values with minimum 3 replications  Sample Standard deviation of the log likelihood values across all K values is non zero  Small number of replications may violate result
  • 22.
    Conclusion & Remarks Simple and effective software for assuming the populations in a unknown set of samples using molecular data (allelic variation)  The burn in period is not clearly explained on which basis it will create the starting point for simulation  Both burn in period and MCMC length has to be standardize based on different analysis  Overall with less input data we can create high impact of representation which is widely using
  • 23.
    There is noone standard analysis parameter in STRUCTURE—the data and the study objectives will influence the choice of the most appropriate parameter— and precaution should be used to avoid over estimating the actual population structure present in complex data. Thank you