Presented at Evolution 2013, June 24; describes an approach to teaching population genetics at the upper-undergraduate/beginning-graduate level, using simulations based in R and incorporating available large genomic data sets.
Bayesian modelling and computation for Raman spectroscopy (Matt Moores)
Raman spectroscopy can be used to identify molecules by the characteristic scattering of light from a laser. Each Raman-active dye label has a unique spectral signature, comprising the locations and amplitudes of its peaks. The Raman spectrum is discretised into a multivariate observation that is highly collinear, so it lends itself to a reduced-rank representation. We introduce a sequential Monte Carlo (SMC) algorithm to separate this signal into a series of peaks plus a smoothly varying baseline, corrupted by additive white noise. By incorporating this representation into a Bayesian functional regression, we can quantify the relationship between dye concentration and peak intensity. We also estimate the model evidence using SMC to investigate long-range dependence between peaks. These methods have been implemented as an R package using RcppEigen and OpenMP.
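As a rough illustration of the signal model described in the abstract (peaks plus smooth baseline plus white noise), the sketch below simulates a spectrum from Lorentzian peaks; the wavenumber grid, peak locations, amplitudes, and widths are invented for the example and are not from the talk.

```python
import numpy as np

def lorentzian(x, loc, amp, width):
    """Single Lorentzian peak, a common line shape for Raman bands."""
    return amp * width**2 / ((x - loc)**2 + width**2)

# Hypothetical wavenumber grid and peak parameters (illustration only)
wavenumbers = np.linspace(200, 1800, 1000)
peaks = [(520, 1.0, 8.0), (1001, 2.5, 6.0), (1600, 1.2, 10.0)]  # (location, amplitude, width)

signal = sum(lorentzian(wavenumbers, *p) for p in peaks)
baseline = 0.5 + 1e-4 * (wavenumbers - 200)          # smoothly varying baseline
rng = np.random.default_rng(0)
observed = signal + baseline + rng.normal(0, 0.05, wavenumbers.size)  # additive white noise
```

The inference task in the abstract is the inverse of this: recover the peak locations and amplitudes, and the baseline, from `observed`.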
Precomputation for SMC-ABC with undirected graphical models (Matt Moores)
This document presents a method for improving the scalability of approximate Bayesian computation (ABC) for latent graphical models like the hidden Potts model used in image analysis. It does this by pre-computing an auxiliary model that approximates the relationship between model parameters and summary statistics, avoiding the need to simulate pseudo-data during ABC model fitting. Experimental results on both simulated and satellite image data show the method reduces ABC runtime from weeks to hours while maintaining accuracy of parameter estimates.
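The offline/online split described above can be sketched in miniature. Here the expensive simulator is replaced by a cheap stand-in (noisy `tanh`), which is not the Potts model; the point is only the structure: fit a surrogate on a parameter grid offline, then run ABC against the surrogate with no pseudo-data simulation.

```python
import numpy as np

rng = np.random.default_rng(1)

def expensive_summary(theta, n=1000):
    """Stand-in simulator: summary statistic of pseudo-data at parameter theta."""
    return rng.normal(np.tanh(theta), 0.1, size=n).mean()

# Offline phase: evaluate the simulator on a parameter grid, fit a surrogate
grid = np.linspace(0.0, 3.0, 31)
summaries = np.array([expensive_summary(t) for t in grid])
surrogate = lambda theta: np.interp(theta, grid, summaries)

# Online ABC phase: compare against the surrogate, with no simulation per draw
observed_summary = np.tanh(1.2)
proposals = rng.uniform(0.0, 3.0, 50000)
accepted = proposals[np.abs(surrogate(proposals) - observed_summary) < 0.02]
```

All 50,000 online comparisons cost only an interpolation each; the simulator runs only 31 times, which is the source of the weeks-to-hours speed-up claimed.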
- The document summarizes Matthew Moores' PhD research on developing Bayesian computational methods for spatial analysis of medical and satellite images.
- The objectives are to develop a generative image model incorporating prior information, implement it computationally efficiently, and apply it to radiotherapy and remote sensing data.
- Challenges include intractable likelihoods, which are addressed through approximate Bayesian computation and sequential Monte Carlo with pre-computation.
- The research aims to classify pixels in medical and satellite images according to tissue type or land use by incorporating informative priors.
This document provides an overview of ABC methodology and applications. It begins with examples from population genetics and econometrics that are well-suited for ABC. It then describes the basic ABC algorithm for Bayesian inference using simulation: specifying prior distributions, simulating data under different parameter values, and accepting simulations that best match the observed data. Indirect inference is also discussed as a method for choosing informative summary statistics for ABC. The document traces the origins of ABC to population genetics models from the late 1990s and highlights ongoing contributions from that field to ABC methodology.
Bayesian Non-parametric Models for Data Science using PyMC (MLReview)
This document provides an overview of Bayesian non-parametric models using Gaussian processes in PyMC3. It discusses the motivation for using GPs, their properties, and how they can be built and fitted in PyMC3. Examples are provided on modeling salmon recruitment data, coal mining disasters, measles outbreak data, and a multidimensional spatial dataset. Scaling issues with GPs are also addressed through sparse approximations. The PyMC3 team that develops these methods is acknowledged.
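The GP regression machinery behind such models can be written in a few lines of linear algebra. This is a generic numpy sketch of the posterior mean and variance under a squared-exponential kernel, not the PyMC3 API, and the data here are synthetic, not the salmon or coal-mining examples.

```python
import numpy as np

def rbf(xa, xb, ell=1.0, eta=1.0):
    """Squared-exponential covariance, a common default for GP examples."""
    d = xa[:, None] - xb[None, :]
    return eta**2 * np.exp(-0.5 * (d / ell) ** 2)

# Tiny synthetic data set (illustrative only)
X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.sin(X)
Xs = np.array([1.5])     # test location
sigma = 0.1              # observation noise

K = rbf(X, X) + sigma**2 * np.eye(len(X))
Ks = rbf(Xs, X)
alpha = np.linalg.solve(K, y)
post_mean = Ks @ alpha                                    # GP posterior mean at Xs
post_var = rbf(Xs, Xs) - Ks @ np.linalg.solve(K, Ks.T)    # GP posterior variance
```

The O(n^3) solve against `K` is exactly the scaling bottleneck that the sparse approximations mentioned above are designed to avoid.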
The document discusses Approximate Bayesian Computation (ABC), a simulation-based method for conducting Bayesian inference when the likelihood function is intractable or unavailable. ABC works by simulating data from the model, accepting simulations that are close to the observed data based on a distance measure and tolerance level. This provides samples from an approximation of the posterior distribution. The document provides examples that motivate ABC and outlines the basic ABC algorithm. It also discusses extensions and improvements to the standard ABC method.
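The basic rejection algorithm outlined above fits in a few lines. A minimal sketch for inferring a normal mean, with a uniform prior, the sample mean as summary statistic, and an arbitrary tolerance chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(2.0, 1.0, size=100)   # "observed" data with unknown mean
s_obs = data.mean()                     # summary statistic

# ABC rejection: draw from the prior, simulate pseudo-data, and keep draws
# whose pseudo-data summary falls within a tolerance of the observed summary
theta = rng.uniform(-5, 5, 100000)                               # prior draws
pseudo = rng.normal(theta[:, None], 1.0, (100000, 100)).mean(axis=1)
eps = 0.05                                                       # tolerance
posterior_sample = theta[np.abs(pseudo - s_obs) < eps]
```

Shrinking `eps` tightens the approximation to the true posterior at the cost of fewer accepted draws, which is the trade-off the extensions mentioned above aim to improve.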
This document discusses approximate Bayesian computation (ABC) techniques for performing Bayesian inference when the likelihood function is not available in closed form. It covers the basic ABC algorithm and discusses challenges with high-dimensional data. It also summarizes recent advances in ABC that incorporate nonparametric regression, reproducing kernel Hilbert spaces, and neural networks to help address these challenges.
Multiplicative Decompositions of Stochastic Distributions and Their Applicat... (Toshiyuki Shimono)
Toshiyuki Shimono presents theorems and propositions about multiplicative decompositions of stochastic distributions and their applications. Some key points:
- A theorem shows that for any positive v1 and v2, the odds that v1·x1 exceeds v2·x2 are v1 : v2, relating to the Bradley-Terry model of preferences.
- It is shown that the Student's t distribution with 2 degrees of freedom can be represented as a product of independent distributions in various ways using the F and t distributions.
- A theorem gives a multiplicative decomposition of the uniform distribution on [0,1] involving the uniform distribution on [δ,1] and a
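The Bradley-Terry identity in the first bullet can be checked by Monte Carlo. The summary does not state the distribution of x1 and x2; standard exponential variates are assumed here, for which P(v1·x1 > v2·x2) = v1/(v1+v2) holds exactly.

```python
import numpy as np

# Monte Carlo check of P(v1*x1 > v2*x2) = v1/(v1+v2), assuming x1, x2 ~ Exp(1)
rng = np.random.default_rng(7)
v1, v2 = 2.0, 1.0
x1 = rng.exponential(1.0, 500000)
x2 = rng.exponential(1.0, 500000)
estimate = np.mean(v1 * x1 > v2 * x2)   # should be near v1/(v1+v2) = 2/3
```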
The document summarizes Approximate Bayesian Computation (ABC). It discusses how ABC provides a way to approximate Bayesian inference when the likelihood function is intractable or too computationally expensive to evaluate directly. ABC works by simulating data under different parameter values and accepting simulations that are close to the observed data according to a distance measure and tolerance level. Key points discussed include:
- ABC provides an approximation to the posterior distribution by sampling from simulations that fall within a tolerance of the observed data.
- Summary statistics are often used to reduce the dimension of the data and improve the signal-to-noise ratio when applying the tolerance criterion.
- Random forests can help select informative summary statistics and provide semi-automated ABC
Theory to consider an inaccurate testing and how to determine the prior proba... (Toshiyuki Shimono)
I presented a mathematical theory of an inaccurate medical testing method. The theory can account for both the case where testing resources are limited and the case where they are not. One implication is that a "negative proof" may not function well; another is that excessively high specificity and accuracy are required for a meaningful diagnosis unless the test is deployed carefully.
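The specificity requirement can be made concrete with Bayes' theorem. The numbers below are hypothetical, chosen to show that for a rare condition even a 99%-accurate test yields mostly false positives:

```python
def posterior_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    return sensitivity * prevalence / p_pos

# At 0.1% prevalence, 99% specificity leaves the posterior near 9%,
# while 99.99% specificity pushes it above 90%
low = posterior_positive(0.001, 0.99, 0.99)
high = posterior_positive(0.001, 0.99, 0.9999)
```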
The document describes a new method called component-wise approximate Bayesian computation (ABC) that combines ABC with Gibbs sampling. It aims to improve ABC's ability to efficiently explore parameter spaces when the number of parameters is large. The method works by alternating sampling from each parameter's ABC posterior conditional distribution given current values of other parameters and the observed data. The method is proven to converge to a stationary distribution under certain assumptions, especially for hierarchical models where conditional distributions are often simplified. Numerical experiments on toy examples demonstrate the method can provide a better approximation of the true posterior than vanilla ABC.
The document describes Approximate Bayesian Computation (ABC), a technique for performing Bayesian inference when the likelihood function is intractable or impossible to evaluate directly. ABC works by simulating data under different parameter values, and accepting simulations that are close to the observed data according to a distance measure and tolerance level. ABC provides an approximation to the posterior distribution that improves as the tolerance level decreases and more informative summary statistics are used. The document discusses the ABC algorithm, properties of the exact ABC posterior distribution, and challenges in selecting appropriate summary statistics.
1. The document discusses approximate Bayesian computation (ABC), a technique used when the likelihood function is intractable. ABC works by simulating parameters from the prior and simulating data, rejecting simulations that are not close to the observed data based on a tolerance level.
2. Random forests can be used in ABC to select informative summary statistics from a large set of possibilities and estimate parameters. The random forests classify simulations as accepted or rejected based on the summaries, implicitly selecting important summaries.
3. Calibrating the tolerance level in ABC is important but difficult, as it determines how close simulations must be to the observed data. Methods discussed include using quantiles of prior predictive simulations or asymptotic convergence properties.
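Point 2 above can be sketched with scikit-learn (assumed available): fit a random forest mapping candidate summaries to the parameter, then read off variable importances. The toy data make only the first summary informative; the actual ABC-RF methodology is more involved than this.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy ranking of candidate summary statistics: only the first one
# actually depends on the parameter
rng = np.random.default_rng(0)
theta = rng.uniform(0, 1, 2000)
informative = theta + rng.normal(0, 0.1, 2000)
noise1 = rng.normal(0, 1, 2000)
noise2 = rng.normal(0, 1, 2000)
summaries = np.column_stack([informative, noise1, noise2])

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(summaries, theta)
ranking = rf.feature_importances_   # the informative statistic should dominate
```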
This document discusses several perspectives and solutions to Bayesian hypothesis testing. It outlines issues with Bayesian testing such as the dependence on prior distributions and difficulties interpreting Bayesian measures like posterior probabilities and Bayes factors. It discusses how Bayesian testing compares models rather than identifying a single true model. Several solutions to challenges are discussed, like using Bayes factors which eliminate the dependence on prior model probabilities but introduce other issues. The document also discusses testing under specific models like comparing a point null hypothesis to alternatives. Overall it presents both Bayesian and frequentist views on hypothesis testing and some of the open controversies in the field.
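The prior dependence of Bayes factors mentioned above has a classic closed-form illustration: testing a point null on a normal mean. The prior scale `tau` on the alternative is a free choice, and inflating it drives the Bayes factor toward the null (the Jeffreys-Lindley effect); the specific numbers below are illustrative.

```python
import numpy as np

def bf01(xbar, n, sigma=1.0, tau=1.0):
    """Bayes factor for H0: mu = 0 vs H1: mu ~ N(0, tau^2), normal data."""
    def norm_pdf(x, var):
        return np.exp(-0.5 * x**2 / var) / np.sqrt(2 * np.pi * var)
    # Marginal of xbar is N(0, sigma^2/n) under H0, N(0, sigma^2/n + tau^2) under H1
    return norm_pdf(xbar, sigma**2 / n) / norm_pdf(xbar, sigma**2 / n + tau**2)

b_narrow = bf01(0.1, 100, tau=1.0)    # moderate support for H0
b_wide = bf01(0.1, 100, tau=10.0)     # same data, much stronger support for H0
```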
This document discusses various methods for estimating normalizing constants that arise when evaluating integrals numerically. It begins by noting there are many computational methods for approximating normalizing constants across different communities. It then lists the topics that will be covered in the upcoming workshop, including discussions on estimating constants using Monte Carlo methods and Bayesian versus frequentist approaches. The document provides examples of estimating normalizing constants using Monte Carlo integration, reverse logistic regression, and Xiao-Li Meng's maximum likelihood estimation approach. It concludes by discussing some of the challenges in bringing a statistical framework to constant estimation problems.
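A minimal instance of the Monte Carlo approach discussed above: estimate the normalizing constant of an unnormalized density by importance sampling. The target exp(-x^4) and the standard normal proposal are chosen for illustration; the true constant is Gamma(1/4)/2 ≈ 1.813.

```python
import numpy as np

# Importance-sampling estimate of Z = integral of exp(-x^4) dx
rng = np.random.default_rng(3)
x = rng.normal(0, 1, 1000000)                       # draws from the proposal q
unnorm = np.exp(-x**4)                              # unnormalized target density
proposal = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi) # proposal density q(x)
Z_hat = np.mean(unnorm / proposal)                  # E_q[p_unnorm / q] = Z
```

Reverse logistic regression and Meng's bridge-type estimators, mentioned above, can be seen as statistically principled ways of combining such ratio estimates.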
ABC stands for approximate Bayesian computation. It is a method for performing Bayesian inference when the likelihood function is intractable or impossible to evaluate directly. ABC produces samples from an approximate posterior distribution by simulating parameter and summary statistic values that match the observed summary statistics within a tolerance level. The choice of summary statistics is important but difficult, as there is typically no sufficient statistic. Several strategies have been developed for selecting good summary statistics, including using random forests or the Lasso to evaluate and select from a large set of potential summaries.
Multiple estimators for Monte Carlo approximations (Christian Robert)
This document discusses multiple estimators that can be used to approximate integrals using Monte Carlo simulations. It begins by introducing concepts like multiple importance sampling, Rao-Blackwellisation, and delayed acceptance that allow combining multiple estimators to improve accuracy. It then discusses approaches like mixtures as proposals, global adaptation, and nonparametric maximum likelihood estimation (NPMLE) that frame Monte Carlo estimation as a statistical estimation problem. The document notes various advantages of the statistical formulation, like the ability to directly estimate simulation error from the Fisher information. Overall, the document presents an overview of different techniques for combining Monte Carlo simulations to obtain more accurate integral approximations.
Cubic convolution interpolation is a new technique for resampling discrete data that has several desirable features for image processing. It can be performed efficiently on a digital computer. The cubic convolution interpolation function converges uniformly to the function being interpolated as the sampling increment approaches zero, achieving third-order accuracy. The paper derives the one-dimensional cubic convolution interpolation function and shows how it can be extended separably to two dimensions for interpolating image data.
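The one-dimensional kernel derived in the paper can be sketched directly. This uses the standard piecewise-cubic form with a = -1/2, the parameter value that yields the third-order accuracy stated above; the boundary handling here (sample replication) is a simplification.

```python
import numpy as np

def keys_kernel(s, a=-0.5):
    """Cubic convolution kernel; a = -1/2 gives the third-order-accurate variant."""
    s = np.abs(s)
    out = np.zeros_like(s)
    m1 = s <= 1
    m2 = (s > 1) & (s < 2)
    out[m1] = (a + 2) * s[m1]**3 - (a + 3) * s[m1]**2 + 1
    out[m2] = a * (s[m2]**3 - 5 * s[m2]**2 + 8 * s[m2] - 4)
    return out

def cubic_interp(samples, x):
    """Interpolate equally spaced samples (positions 0..n-1, spacing 1) at x."""
    n = len(samples)
    k = int(np.floor(x))
    offsets = np.arange(k - 1, k + 3)                # 4 nearest sample positions
    idx = np.clip(offsets, 0, n - 1)                 # replicate samples at the edges
    return float(np.sum(samples[idx] * keys_kernel(x - offsets)))
```

With a = -1/2 the interpolant reproduces quadratics exactly, which is one way of seeing the third-order convergence claim.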
This document describes a new method called component-wise approximate Bayesian computation (ABCG or ABC-Gibbs) that combines approximate Bayesian computation (ABC) with Gibbs sampling. ABCG aims to more efficiently explore parameter spaces when the number of parameters is large. It works by alternately sampling each parameter from its ABC-approximated conditional distribution given current values of other parameters. The document provides theoretical analysis showing ABCG converges to a stationary distribution under certain conditions. It also presents examples demonstrating ABCG can better separate estimates from the prior compared to simple ABC, especially for hierarchical models.
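The alternating structure described above can be sketched on a toy two-parameter normal model. Each sweep updates one parameter by rejection ABC with the other held fixed; the priors, tolerances, and summaries here are invented for the sketch and this is only the spirit of ABC-Gibbs, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.normal(1.0, 2.0, 200)          # observed data: mean 1, sd 2
s_mean, s_sd = data.mean(), data.std()

def abc_update(simulate, s_obs, prior_draws, eps):
    """One component update: first prior draw whose summary is within eps."""
    for theta in prior_draws:
        if abs(simulate(theta) - s_obs) < eps:
            return theta
    return prior_draws[-1]   # fall back if nothing is accepted (very unlikely)

mu, sigma = 0.0, 1.0
chain = []
for _ in range(200):
    # Update mu given the current sigma, then sigma given the new mu
    mu = abc_update(lambda m: rng.normal(m, sigma, 200).mean(), s_mean,
                    rng.uniform(-5, 5, 500), eps=0.3)
    sigma = abc_update(lambda s: rng.normal(mu, s, 200).std(), s_sd,
                       rng.uniform(0.1, 5.0, 500), eps=0.3)
    chain.append((mu, sigma))
chain = np.array(chain)
```

Because each conditional update involves a one-dimensional comparison, the tolerance can be kept small even though the joint parameter space is larger, which is the efficiency argument made above.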
The document discusses using random forests for approximate Bayesian computation (ABC) model choice. It proposes:
1. Using random forests to infer a model from summary statistics, as random forests can handle a large number of statistics and find efficient combinations.
2. Replacing estimates of posterior model probabilities, which are poorly approximated, with posterior predictive expected losses to evaluate models.
3. An example comparing MA(1) and MA(2) time series models using two autocorrelations as summaries, noting that the models are nested and that random forests perform similarly to other methods on small problems.
ESL 4.4.3-4.5: Logistic Regression (contd.) and Separating Hyperplane (Shinichi Tamura)
Presentation material for a reading club on The Elements of Statistical Learning by Hastie et al.
The sections cover:
- Properties of logistic regression compared with least squares fitting
- Differences between logistic regression and linear discriminant analysis
- Rosenblatt's perceptron algorithm
- Derivation of the optimal separating hyperplane, which forms the basis for SVM
-------------------------------------------------------------------------
Presentation slides (entirely in English) for our lab's reading group on The Elements of Statistical Learning by Hastie et al.
My assigned sections were:
- Properties of logistic regression viewed by analogy with least squares
- Comparison of logistic regression and linear discriminant analysis
- Rosenblatt's perceptron algorithm
- Derivation of the optimal separating hyperplane, the basis for SVM
"reflections on the probability space induced by moment conditions with impli... (Christian Robert)
This document discusses using moment conditions to perform Bayesian inference when the likelihood function is intractable or unknown. It outlines some approaches that have been proposed, including approximating the likelihood using empirical likelihood or pseudo-likelihoods. However, these approaches do not guarantee the same consistency as a true likelihood. Alternative approximate Bayesian methods are also discussed, such as Approximate Bayesian Computation, Integrated Nested Laplace Approximation, and variational Bayes. The empirical likelihood method constructs a likelihood from generalized moment conditions, but its use in Bayesian inference requires further analysis of consistency in each application.
This document discusses approximate Bayesian computation (ABC), a technique for performing Bayesian inference when the likelihood function is intractable or impossible to evaluate directly. ABC works by simulating data under different parameter values and accepting simulations that match the observed data closely according to some distance measure and tolerance level. The document outlines the basic ABC algorithm and discusses some advances in ABC, including modifying the proposal distribution to increase efficiency, viewing it as a conditional density estimation problem to allow for larger tolerances, and including the tolerance level in the inferential framework. It also provides examples of applying ABC to problems like inferring the number of socks in a drawer from an observation and simulating the outcome of a historical naval battle.
This document provides an introduction to Approximate Bayesian Computation (ABC), a likelihood-free method for approximating posterior distributions when the likelihood function is unavailable or computationally intractable. It describes the ABC rejection sampling algorithm and key concepts like tolerance levels, distance functions, summary statistics, and improvements like ABC-MCMC and ABC-SMC. ABC is presented as an alternative to traditional Bayesian inference methods for models where direct likelihood evaluation is impossible or too expensive.
This document discusses approximate Bayesian computation (ABC) for model choice between multiple models. It introduces the ABC algorithm for model choice, which approximates the posterior probabilities of models given the data by simulating parameters from the prior and accepting simulations based on the distance between simulated and observed sufficient statistics. Issues with choosing sufficient statistics that apply to all models are discussed. The document also examines the limiting behavior of the ABC approximation to the Bayes factor as the tolerance approaches 0 and infinity. It notes that discrepancies can arise if sufficient statistics are not cross-model sufficient. An example comparing Poisson and geometric models demonstrates this.
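The Poisson-versus-geometric comparison mentioned above can be sketched with the mean as the only summary. Priors and tolerance here are invented for the sketch; note that both models can match the sample mean, which is exactly the insufficiency problem the document highlights.

```python
import numpy as np

# ABC model choice between Poisson(lam) and Geometric(p), mean as summary
rng = np.random.default_rng(11)
obs = rng.poisson(3.0, 100)
s_obs = obs.mean()

eps, accepted = 0.1, []
for _ in range(20000):
    m = rng.integers(2)                        # uniform prior over the two models
    if m == 0:
        lam = rng.exponential(5.0)             # illustrative prior on lam
        s = rng.poisson(lam, 100).mean()
    else:
        p = rng.uniform(0.01, 1.0)             # illustrative prior on p
        s = rng.geometric(p, 100).mean() - 1   # numpy geometric has support 1, 2, ...
    if abs(s - s_obs) < eps:
        accepted.append(m)
post_prob_poisson = 1 - np.mean(accepted)      # fraction of accepted draws with m = 0
```

Even with Poisson-generated data, the approximated posterior does not decisively favor the Poisson model, because the mean is not cross-model sufficient.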
Delayed acceptance for Metropolis-Hastings algorithms (Christian Robert)
The document proposes a delayed acceptance method for accelerating Metropolis-Hastings algorithms. It begins with a motivating example of non-informative inference for mixture models where computing the prior density is costly. It then introduces the delayed acceptance approach which splits the acceptance probability into pieces that are evaluated sequentially, avoiding computing the full acceptance ratio each time. It validates that the delayed acceptance chain is reversible and provides bounds on its spectral gap and asymptotic variance compared to the original chain. Finally, it discusses optimizing the delayed acceptance approach by considering the expected square jump distance and cost per iteration to maximize efficiency.
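The sequential splitting of the acceptance probability can be sketched as follows. The two log-density factors here are toy stand-ins (the "expensive" one is cheap in reality); the structure shows how the costly factor is only evaluated on proposals that survive the first stage.

```python
import numpy as np

rng = np.random.default_rng(13)

def cheap_logpost(x):        # e.g. the likelihood part, fast to compute
    return -0.5 * x**2

def expensive_logfactor(x):  # stands in for a costly factor such as the prior density
    return -0.1 * x**4

x, chain, expensive_calls = 0.0, [], 0
for _ in range(5000):
    y = x + rng.normal(0, 1)
    # Stage 1: accept/reject on the cheap factor only
    if np.log(rng.uniform()) < cheap_logpost(y) - cheap_logpost(x):
        # Stage 2: correct with the expensive factor, evaluated only on survivors
        expensive_calls += 1
        if np.log(rng.uniform()) < expensive_logfactor(y) - expensive_logfactor(x):
            x = y
    chain.append(x)
```

The product of the two stage-wise acceptance probabilities defines a valid (reversible) chain targeting the full posterior, which is the reversibility result validated in the document.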
Exploring temporal graph data with Python: a study on tensor decomposition o... (André Panisson)
Tensor decompositions have gained a steadily increasing popularity in data mining applications. Data sources from sensor networks and Internet-of-Things applications promise a wealth of interaction data that can be naturally represented as multidimensional structures such as tensors. For example, time-varying social networks collected from wearable proximity sensors can be represented as 3-way tensors. By representing this data as tensors, we can use tensor decomposition to extract community structures with their structural and temporal signatures.
The current standard framework for working with tensors, however, is Matlab. We will show how tensor decompositions can be carried out using Python, how to obtain latent components and how they can be interpreted, and what are some applications of this technique in the academy and industry. We will see a use case where a Python implementation of tensor decomposition is applied to a dataset that describes social interactions of people, collected using the SocioPatterns platform. This platform was deployed in different settings such as conferences, schools and hospitals, in order to support mathematical modelling and simulation of airborne infectious diseases. Tensor decomposition has been used in these scenarios to solve different types of problems: it can be used for data cleaning, where time-varying graph anomalies can be identified and removed from data; it can also be used to assess the impact of latent components in the spreading of a disease, and to devise intervention strategies that are able to reduce the number of infection cases in a school or hospital. These are just a few examples that show the potential of this technique in data mining and machine learning applications.
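A minimal pure-numpy version of the decomposition discussed above: rank-R CP/PARAFAC by alternating least squares for a 3-way tensor. This is a generic sketch, not the SocioPatterns implementation, and it omits the non-negativity constraints often used for interpretability.

```python
import numpy as np

def khatri_rao(X, Y):
    """Column-wise Khatri-Rao product (rows of X vary slowly)."""
    return (X[:, None, :] * Y[None, :, :]).reshape(-1, X.shape[1])

def unfold(T, mode):
    """Mode-n matricization of a 3-way tensor (C-order columns)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def cp_als(T, rank, n_iter=300, seed=0):
    """Rank-R CP/PARAFAC decomposition by alternating least squares."""
    rng = np.random.default_rng(seed)
    A, B, C = (rng.standard_normal((s, rank)) for s in T.shape)
    for _ in range(n_iter):
        # Each factor is the least-squares solution with the other two fixed
        A = unfold(T, 0) @ np.linalg.pinv(khatri_rao(B, C)).T
        B = unfold(T, 1) @ np.linalg.pinv(khatri_rao(A, C)).T
        C = unfold(T, 2) @ np.linalg.pinv(khatri_rao(A, B)).T
    return A, B, C
```

For a time-varying contact network stored as a (node, node, time) tensor, the columns of the three factors would correspond to the community structures and their temporal activity profiles described above.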
This document discusses using moment conditions to perform Bayesian inference when the likelihood function is intractable or unknown. It outlines some approaches that have been proposed, including approximating the likelihood using empirical likelihood or pseudo-likelihoods. However, these approaches do not guarantee the same consistency as a true likelihood. Alternative approximative Bayesian methods are also discussed, such as Approximate Bayesian Computation, Integrated Nested Laplace Approximation, and variational Bayes. The empirical likelihood method constructs a likelihood from generalized moment conditions, but its use in Bayesian inference requires further analysis of consistency in each application.
This document discusses approximate Bayesian computation (ABC), a technique for performing Bayesian inference when the likelihood function is intractable or impossible to evaluate directly. ABC works by simulating data under different parameter values and accepting simulations that match the observed data closely according to some distance measure and tolerance level. The document outlines the basic ABC algorithm and discusses some advances in ABC, including modifying the proposal distribution to increase efficiency, viewing it as a conditional density estimation problem to allow for larger tolerances, and including the tolerance level in the inferential framework. It also provides examples of applying ABC to problems like inferring the number of socks in a drawer from an observation and simulating the outcome of a historical naval battle.
This document provides an introduction to Approximate Bayesian Computation (ABC), a likelihood-free method for approximating posterior distributions when the likelihood function is unavailable or computationally intractable. It describes the ABC rejection sampling algorithm and key concepts like tolerance levels, distance functions, summary statistics, and improvements like ABC-MCMC and ABC-SMC. ABC is presented as an alternative to traditional Bayesian inference methods for models where direct likelihood evaluation is impossible or too expensive.
This document discusses approximate Bayesian computation (ABC) for model choice between multiple models. It introduces the ABC algorithm for model choice, which approximates the posterior probabilities of models given the data by simulating parameters from the prior and accepting simulations based on the distance between simulated and observed sufficient statistics. Issues with choosing sufficient statistics that apply to all models are discussed. The document also examines the limiting behavior of the ABC approximation to the Bayes factor as the tolerance approaches 0 and infinity. It notes that discrepancies can arise if sufficient statistics are not cross-model sufficient. An example comparing Poisson and geometric models demonstrates this.
Delayed acceptance for Metropolis-Hastings algorithmsChristian Robert
The document proposes a delayed acceptance method for accelerating Metropolis-Hastings algorithms. It begins with a motivating example of non-informative inference for mixture models where computing the prior density is costly. It then introduces the delayed acceptance approach which splits the acceptance probability into pieces that are evaluated sequentially, avoiding computing the full acceptance ratio each time. It validates that the delayed acceptance chain is reversible and provides bounds on its spectral gap and asymptotic variance compared to the original chain. Finally, it discusses optimizing the delayed acceptance approach by considering the expected square jump distance and cost per iteration to maximize efficiency.
Exploring temporal graph data with Python: a study on tensor decomposition o...André Panisson
Tensor decompositions have gained a steadily increasing popularity in data mining applications. Data sources from sensor networks and Internet-of-Things applications promise a wealth of interaction data that can be naturally represented as multidimensional structures such as tensors. For example, time-varying social networks collected from wearable proximity sensors can be represented as 3-way tensors. By representing this data as tensors, we can use tensor decomposition to extract community structures with their structural and temporal signatures.
The current standard framework for working with tensors, however, is Matlab. We will show how tensor decompositions can be carried out using Python, how to obtain latent components and how they can be interpreted, and what are some applications of this technique in the academy and industry. We will see a use case where a Python implementation of tensor decomposition is applied to a dataset that describes social interactions of people, collected using the SocioPatterns platform. This platform was deployed in different settings such as conferences, schools and hospitals, in order to support mathematical modelling and simulation of airborne infectious diseases. Tensor decomposition has been used in these scenarios to solve different types of problems: it can be used for data cleaning, where time-varying graph anomalies can be identified and removed from data; it can also be used to assess the impact of latent components in the spreading of a disease, and to devise intervention strategies that are able to reduce the number of infection cases in a school or hospital. These are just a few examples that show the potential of this technique in data mining and machine learning applications.
The document introduces two approaches to chemical prediction: quantum simulation based on density functional theory and machine learning based on data. It then discusses using graph-structured neural networks for chemical prediction on datasets like QM9. It presents Neural Fingerprint (NFP) and Gated Graph Neural Network (GGNN) models for predicting molecular properties from graph-structured data. Chainer Chemistry is introduced as a library for chemical and biological machine learning that implements these graph convolutional networks.
This document summarizes Frances Kuo's work applying Quasi-Monte Carlo (QMC) methods to solve partial differential equations (PDEs) with random coefficients. It introduces a motivating example of modeling groundwater flow with uncertainty in porous medium properties. It then provides an overview of QMC methods, including advantages over Monte Carlo, construction techniques like lattice rules, and three theoretical settings for applying QMC to PDEs with random coefficients. The document outlines Kuo's collaborations applying QMC to problems with uniform and lognormal random coefficients under these different settings.
In the presence of relevant physical observations, one can usually calibrate a computer model, and even estimate systematic discrepancies of the model from reality. Estimating and quantifying the uncertainty in this model discrepancy can lead to reliable predictions - so long as the prediction "is similar to" the available physical observations. Exactly how to define "similar" has proven difficult in many applications. Clearly it depends on how well the computational model captures the relevant physics in the system, as well as how portable the model discrepancy is in going from the available physical data to the prediction. This talk will discuss these concepts using computational models ranging from simple to very complex.
Relaxation methods for the matrix exponential on large networksDavid Gleich
My talk from the Stanford ICME seminar series on doing network analysis and link prediction using the a fast algorithm for the matrix exponential on graph problems.
The document discusses higher dimensional reconstruction techniques for medical imaging. It begins with a review of 3D reconstruction and the mathematical modeling using the Radon transform. It then describes 4D, 5D and 6D reconstruction which add additional dimensions of time, cardiac phase and respiration to the 3D spatial reconstruction. Higher dimensional reconstruction allows for dynamic imaging but introduces additional challenges from increased noise, bias and variance. The document explores techniques like iterative reconstruction to address these issues and improve accuracy. Real patient data is used to demonstrate the reconstruction methods.
Revisiting the fundamental concepts and assumptions of statistics ppsD Dutta Roy
Dr. Debdulal Dutta Roy gave a pre-conference workshop on revisiting fundamental statistics concepts and R-codes. He discussed key statistics assumptions like data being free-floating and associated, and having manifest and latent content. Dr. Roy also covered various R functions for data cleaning, classification, and prediction including histograms, density plots, boxplots, and ANOVA. The workshop aimed to refresh understanding of core statistics principles and their application using R coding.
This document discusses tensor decomposition with Python. It begins by explaining what tensor decomposition and factorization are, and how they can be used to represent multi-dimensional datasets and perform dimensionality reduction. It then discusses matrix and tensor factorization methods like NMF, topic modeling, and CP/PARAFAC decomposition. The remainder of the document provides examples of tensor decomposition using Python tools and libraries, and discusses applications to analyzing temporal network and sensor data.
This document proposes an improved particle swarm optimization (PSO) algorithm for data clustering that incorporates Gauss chaotic map. PSO is often prone to premature convergence, so the proposed method uses Gauss chaotic map to generate random sequences that substitute the random parameters in PSO, providing more exploration of the search space. The algorithm is tested on six real-world datasets and shown to outperform K-means, standard PSO, and other hybrid clustering algorithms. The key aspects of the proposed GaussPSO method and experimental results demonstrating its effectiveness are described.
This document discusses dynamics of structures with uncertainties. It begins with an introduction to stochastic single degree of freedom systems and how natural frequency variability can be modeled using probability distributions. It then discusses how to extend this approach to stochastic multi degree of freedom systems using stochastic finite element formulations and modal projections. Key challenges with statistical overlap of eigenvalues are noted. The document provides mathematical models of equivalent damping in stochastic systems and examples of stochastic frequency response functions.
We consider the problem of model estimation in episodic Block MDPs. In these MDPs, the decision maker has access to rich observations or contexts generated from a small number of latent states. We are interested in estimating the latent state decoding function (the mapping from the observations to latent states) based on data generated under a fixed behavior policy. We derive an information-theoretical lower bound on the error rate for estimating this function and present an algorithm approaching this fundamental limit. In turn, our algorithm also provides estimates of all the components of the MDP.
We apply our results to the problem of learning near-optimal policies in the reward-free setting. Based on our efficient model estimation algorithm, we show that we can infer a policy converging (as the number of collected samples grows large) to the optimal policy at the best possible asymptotic rate. Our analysis provides necessary and sufficient conditions under which exploiting the block structure yields improvements in the sample complexity for identifying near-optimal policies. When these conditions are met, the sample complexity in the minimax reward-free setting is improved by a multiplicative factor $n$, where $n$ is the number of contexts.
Efficient Simulations for Contamination of Groundwater Aquifers under Uncerta...Alexander Litvinenko
1. Solved time-dependent density driven flow problem with uncertain porosity and permeability in 2D and 3D
2. Computed propagation of uncertainties in porosity into the mass fraction.
3. Computed the mean, variance, exceedance probabilities, quantiles, risks.
4. Such QoIs as the number of fingers, their size, shape, propagation time can be unstable
5. For moderate perturbations, our gPCE surrogate results are similar to qMC results.
6. Used highly scalable solver on up to 800 computing nodes,
Complex models in ecology: challenges and solutionsPeter Solymos
This document discusses complex models in ecology and solutions for Bayesian analysis of complex hierarchical models. It introduces data cloning as a method that allows using Bayesian Markov chain Monte Carlo tools for frequentist inference on complex models. Data cloning replicates the data to increase the effective sample size, improving mixing and reducing the need for long runs. The document also discusses using high-performance computing to parallelize MCMC for faster inference on complex models through techniques like distributing chains across nodes.
This document discusses algorithms for predictive modeling, including logistic regression. It presents a medical dataset containing measurements of heart patients and whether they survived. Logistic regression is applied to predict survival using maximum likelihood estimation. Numerical optimization techniques like BFGS and Fisher's algorithm are discussed for maximum likelihood estimation of logistic regression. Iteratively reweighted least squares is also presented as an alternative approach.
The partitioning of an ordered prognostic factor is important in order to obtain several groups having heterogeneous survivals in medical research. For this purpose, a binary split has often been used once or recursively. We propose the use of a multi-way split in order to afford an optimal set of cut-off points. In practice, the number of groups ($K$) may not be specified in advance. Thus, we also suggest finding an optimal $K$ by a resampling technique. The algorithm was implemented into an \proglang{R} package that we called \pkg{kaps}, which can be used conveniently and freely. It was illustrated with a toy dataset, and was also applied to a real data set of colorectal cancer cases from the Surveillance Epidemiology and End Results.
The document discusses triangular norm (t-norm) based kernel functions and their application to kernel k-means clustering. It introduces common kernel functions and describes how t-norms can be used to create new kernel functions. Several parameterized and non-parameterized t-norm based kernel functions are presented. The document then details experiments applying various kernel functions including t-norm kernels to four datasets, evaluating the results using adjusted rand index scores. The best performing kernels for each dataset are identified, with some t-norm kernels performing comparably or better than traditional kernels.
This document provides an introduction to statistics and probability. It discusses descriptive statistics such as measures of central tendency and dispersion. It also discusses inferential statistics and concepts of probability such as random variables and probability distributions including binomial, Poisson, normal and exponential distributions. Examples are provided to illustrate calculating probabilities using these distributions for traffic-related scenarios such as route choice probabilities. Graphical representations of data like histograms and scatter plots are also demonstrated.
Compressed learning for time series classification學翰 施
This document proposes a compressed learning framework for time series classification using sparse envelope representations. It introduces compressed sensing concepts and describes creating a sparse envelope for time series by thresholding around the mean and standard deviation. A classification framework is developed using linear SVMs in the compressed domain. Experimental results on benchmark datasets demonstrate effectiveness of the envelope representations compared to state-of-the-art methods, as well as efficiency gains from compression. Real-world case studies on smart home applications show promising identification performance from envelope-based classifiers on sensor time series data.
Similar to Teaching Population Genetics with R (20)
How to Fix the Import Error in the Odoo 17Celine George
An import error occurs when a program fails to import a module or library, disrupting its execution. In languages like Python, this issue arises when the specified module cannot be found or accessed, hindering the program's functionality. Resolving import errors is crucial for maintaining smooth software operation and uninterrupted development processes.
How to Manage Your Lost Opportunities in Odoo 17 CRMCeline George
Odoo 17 CRM allows us to track why we lose sales opportunities with "Lost Reasons." This helps analyze our sales process and identify areas for improvement. Here's how to configure lost reasons in Odoo 17 CRM
हिंदी वर्णमाला पीपीटी, hindi alphabet PPT presentation, hindi varnamala PPT, Hindi Varnamala pdf, हिंदी स्वर, हिंदी व्यंजन, sikhiye hindi varnmala, dr. mulla adam ali, hindi language and literature, hindi alphabet with drawing, hindi alphabet pdf, hindi varnamala for childrens, hindi language, hindi varnamala practice for kids, https://www.drmullaadamali.com
Executive Directors Chat Leveraging AI for Diversity, Equity, and InclusionTechSoup
Let’s explore the intersection of technology and equity in the final session of our DEI series. Discover how AI tools, like ChatGPT, can be used to support and enhance your nonprofit's DEI initiatives. Participants will gain insights into practical AI applications and get tips for leveraging technology to advance their DEI goals.
This presentation was provided by Steph Pollock of The American Psychological Association’s Journals Program, and Damita Snow, of The American Society of Civil Engineers (ASCE), for the initial session of NISO's 2024 Training Series "DEIA in the Scholarly Landscape." Session One: 'Setting Expectations: a DEIA Primer,' was held June 6, 2024.
Walmart Business+ and Spark Good for Nonprofits.pdfTechSoup
"Learn about all the ways Walmart supports nonprofit organizations.
You will hear from Liz Willett, the Head of Nonprofits, and hear about what Walmart is doing to help nonprofits, including Walmart Business and Spark Good. Walmart Business+ is a new offer for nonprofits that offers discounts and also streamlines nonprofits order and expense tracking, saving time and money.
The webinar may also give some examples on how nonprofits can best leverage Walmart Business+.
The event will cover the following::
Walmart Business + (https://business.walmart.com/plus) is a new shopping experience for nonprofits, schools, and local business customers that connects an exclusive online shopping experience to stores. Benefits include free delivery and shipping, a 'Spend Analytics” feature, special discounts, deals and tax-exempt shopping.
Special TechSoup offer for a free 180 days membership, and up to $150 in discounts on eligible orders.
Spark Good (walmart.com/sparkgood) is a charitable platform that enables nonprofits to receive donations directly from customers and associates.
Answers about how you can do more with Walmart!"
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...PECB
Denis is a dynamic and results-driven Chief Information Officer (CIO) with a distinguished career spanning information systems analysis and technical project management. With a proven track record of spearheading the design and delivery of cutting-edge Information Management solutions, he has consistently elevated business operations, streamlined reporting functions, and maximized process efficiency.
Certified as an ISO/IEC 27001: Information Security Management Systems (ISMS) Lead Implementer, Data Protection Officer, and Cyber Risks Analyst, Denis brings a heightened focus on data security, privacy, and cyber resilience to every endeavor.
His expertise extends across a diverse spectrum of reporting, database, and web development applications, underpinned by an exceptional grasp of data storage and virtualization technologies. His proficiency in application testing, database administration, and data cleansing ensures seamless execution of complex projects.
What sets Denis apart is his comprehensive understanding of Business and Systems Analysis technologies, honed through involvement in all phases of the Software Development Lifecycle (SDLC). From meticulous requirements gathering to precise analysis, innovative design, rigorous development, thorough testing, and successful implementation, he has consistently delivered exceptional results.
Throughout his career, he has taken on multifaceted roles, from leading technical project management teams to owning solutions that drive operational excellence. His conscientious and proactive approach is unwavering, whether he is working independently or collaboratively within a team. His ability to connect with colleagues on a personal level underscores his commitment to fostering a harmonious and productive workplace environment.
Date: May 29, 2024
Tags: Information Security, ISO/IEC 27001, ISO/IEC 42001, Artificial Intelligence, GDPR
-------------------------------------------------------------------------------
Find out more about ISO training and certification services
Training: ISO/IEC 27001 Information Security Management System - EN | PECB
ISO/IEC 42001 Artificial Intelligence Management System - EN | PECB
General Data Protection Regulation (GDPR) - Training Courses - EN | PECB
Webinars: https://pecb.com/webinars
Article: https://pecb.com/article
-------------------------------------------------------------------------------
For more information about PECB:
Website: https://pecb.com/
LinkedIn: https://www.linkedin.com/company/pecb/
Facebook: https://www.facebook.com/PECBInternational/
Slideshare: http://www.slideshare.net/PECBCERTIFICATION
How to Build a Module in Odoo 17 Using the Scaffold MethodCeline George
Odoo provides an option for creating a module by using a single line command. By using this command the user can make a whole structure of a module. It is very easy for a beginner to make a module. There is no need to make each file manually. This slide will show how to create a module using the scaffold method.
বাংলাদেশের অর্থনৈতিক সমীক্ষা ২০২৪ [Bangladesh Economic Review 2024 Bangla.pdf] কম্পিউটার , ট্যাব ও স্মার্ট ফোন ভার্সন সহ সম্পূর্ণ বাংলা ই-বুক বা pdf বই " সুচিপত্র ...বুকমার্ক মেনু 🔖 ও হাইপার লিংক মেনু 📝👆 যুক্ত ..
আমাদের সবার জন্য খুব খুব গুরুত্বপূর্ণ একটি বই ..বিসিএস, ব্যাংক, ইউনিভার্সিটি ভর্তি ও যে কোন প্রতিযোগিতা মূলক পরীক্ষার জন্য এর খুব ইম্পরট্যান্ট একটি বিষয় ...তাছাড়া বাংলাদেশের সাম্প্রতিক যে কোন ডাটা বা তথ্য এই বইতে পাবেন ...
তাই একজন নাগরিক হিসাবে এই তথ্য গুলো আপনার জানা প্রয়োজন ...।
বিসিএস ও ব্যাংক এর লিখিত পরীক্ষা ...+এছাড়া মাধ্যমিক ও উচ্চমাধ্যমিকের স্টুডেন্টদের জন্য অনেক কাজে আসবে ...
Pengantar Penggunaan Flutter - Dart programming language1.pptx
Teaching Population Genetics with R
1. A Simulation-Based Approach to
Teaching Population Genetics:
R as a Teaching Platform
Bruce J. Cochrane
Department of Zoology/Biology
Miami University
Oxford OH
2. Two Time Points
• 1974
o Lots of Theory
o Not much Data
o Allozymes Rule
• 2013
o Even More Theory
o Lots of Data
o Sequences, -omics, ???
3. The Problem
• The basic approach hasn’t changed, e.g.
o Hardy Weinberg
o Mutation
o Selection
o Drift
o Etc.
• Much of it is deterministic
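As one concrete instance of what "deterministic" means here (an illustrative sketch with made-up fitness values, not code from the course), the classic one-locus viability-selection recursion can be iterated in a few lines of R:

```r
# One generation of allele-frequency change under viability selection.
# Standard deterministic recursion; the fitnesses below are illustrative.
delta_p <- function(p, w11, w12, w22) {
  q <- 1 - p
  wbar <- p^2 * w11 + 2 * p * q * w12 + q^2 * w22   # mean fitness
  p_next <- (p^2 * w11 + p * q * w12) / wbar        # p after selection
  p_next - p
}

# Iterate 100 generations from p = 0.01 with selection favouring allele A
p <- 0.01
for (gen in 1:100) p <- p + delta_p(p, w11 = 1.0, w12 = 0.9, w22 = 0.8)
round(p, 3)
```

Given the starting frequency and fitnesses, the trajectory is completely determined — which is exactly the property the numerical approach below replaces with random variables.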
4. And
• There is little initial connection with real data
o The world seems to revolve around A and a
• At least in my hands, it doesn’t work
5. The Alternative
• Take a numerical (as opposed to analytical) approach
• Focus on understanding random variables and distributions
• Incorporate “big data”
• Introduce current approaches – coalescence, Bayesian
Analysis, etc. – in this context
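As an illustration of that numerical focus (my sketch, not taken from the slides): genetic drift is nothing more than repeated binomial sampling of gametes, which makes it a natural first simulation for students.

```r
# Genetic drift as repeated binomial sampling: 50 replicate populations
# of size N, followed for 100 generations from p = 0.5
set.seed(1)
N <- 50; gens <- 100; reps <- 50
p <- matrix(0.5, nrow = gens, ncol = reps)
for (t in 2:gens) {
  p[t, ] <- rbinom(reps, 2 * N, p[t - 1, ]) / (2 * N)  # sample 2N gametes
}
matplot(p, type = "l", lty = 1, col = rgb(0, 0, 0, 0.3),
        xlab = "Generation", ylab = "Allele frequency p")
```

Each replicate wanders (and some fix or are lost), so the class sees the distribution of outcomes rather than a single deterministic curve.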
6. Why R?
• Open Source
• Platform-independent (Windows, Mac, Linux)
• Object oriented
• Facile Graphics
• Web-oriented
• Packages available for specialized functions
7. Where We are Going
• The Basics – Distributions, chi-square and the Hardy Weinberg
Equilibrium
• Simulating the Ewens-Watterson Distribution
• Coalescence and summary statistics
• What works and what doesn’t
13. Calculating chi-squared
The function
chixw <- function(obs, exp, df = 1) {
  chi <- sum((obs - exp)^2 / exp)
  pr <- 1 - pchisq(chi, df)
  c(chi, pr)
}
A sample function call
obs <-c(315,108,101,32)
z <-sum(obs)/16
exp <-c(9*z,3*z,3*z,z)
chixw(obs,exp,3)
The output
chi-square = 0.47
probability(<.05) = 0.93
deg. freedom = 3
14. Basic Hardy Weinberg Calculations
The Biallelic Case
Sample input
obs <-c(13,35,70)
hw(obs)
Output
[1] "p= 0.2585 q= 0.7415"
obs exp
[1,] 13 8
[2,] 35 45
[3,] 70 65
[1] "chi squared = 5.732 p = 0.017 with 1 d. f."
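The `hw()` call above is a course-supplied function not shown on the slide. A minimal version might look like the sketch below; note that it gives a slightly larger chi-squared (≈6.04) than the slide's 5.732, because the slide evidently rounds the expected counts to integers before computing the statistic.

```r
# A minimal sketch of a Hardy-Weinberg test function like hw() above.
# obs = genotype counts c(AA, Aa, aa)
hw_sketch <- function(obs) {
  n <- sum(obs)
  p <- (2 * obs[1] + obs[2]) / (2 * n)   # frequency of allele A
  q <- 1 - p
  exp <- n * c(p^2, 2 * p * q, q^2)      # expected genotype counts
  chi <- sum((obs - exp)^2 / exp)
  pr <- 1 - pchisq(chi, df = 1)          # 3 classes - 1 - 1 estimated allele freq
  list(p = p, q = q, expected = round(exp), chi.sq = chi, p.value = pr)
}
hw_sketch(c(13, 35, 70))
```

Run on the slide's data it reproduces p = 0.2585 and the expected counts 8, 45, 65.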
15. Illustrating With Ternary Plots
library(HardyWeinberg)
dat <- HWData(100, 100)
gdist <- dat$Xt  # create a variable with the working data
HWTernaryPlot(gdist, hwcurve=TRUE,addmarkers=FALSE,region=0,vbounds=FALSE,axis=2,
vertexlab=c("0","","1"),main="Theoretical Relationship",cex.main=1.5)
16. Access to Data
• Direct access of data
o HapMap
o Dryad
o Others
• Manipulation and visualization within R
• Preparation for export (e.g. Genalex)
19. And Determining the Number of Outliers
nsnps <- length(hwdist)
quant <-quantile(hwdist,c(.025,.975))
low <-length(hwdist[hwdist<quant[1]])
high <-length(hwdist[hwdist>quant[2]])
accept <-nsnps-low-high
low; accept; high
[1] 982
[1] 37330
[1] 976
20. Sampling and Plotting Deviation from Hardy Weinberg
chr21.poly <-na.omit(chr21.sum) #remove all NA's (fixed SNPs)
chr21.samp <-sample(nrow(chr21.poly),1000, replace=FALSE)
plot(chr21.poly$z.HWE[chr21.samp])
21. Plotting F for Randomly Sampled Markers
chr21.sub <-chr21.poly[chr21.samp,]
Hexp <- 2*chr21.sub$MAF*(1-chr21.sub$MAF)
Fi <- 1-(chr21.sub$P.AB/Hexp)
plot(Fi,xlab="Locus",ylab="F")
23. The Ewens- Watterson Test
• Based on Ewens (1977) derivation of the theoretical
equilibrium distribution of allele frequencies under the
infinite allele model.
• Uses expected homozygosity (Σp²) as test statistic
• Compares observed homozygosity in sample to expected
distribution in n random simulations
• Observed data are
o N=number of samples
o k= number of alleles
o Allele Frequency Distribution
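The simulation step of this test can be sketched directly in base R. The following is an illustrative sketch of mine, not the course's `Ewens()` function: it draws neutral allele configurations from the Ewens sampling formula via the Hoppe urn, rejects runs that do not yield exactly k alleles, and tunes θ so that the expected allele count equals k.

```r
# Sketch of the Ewens-Watterson simulation: draw neutral allele-count
# configurations with the Hoppe urn, keep those with exactly k alleles,
# and record homozygosity F = sum(p_i^2).
hoppe_sample <- function(n, theta) {
  alleles <- integer(0)
  for (i in 1:n) {
    if (runif(1) < theta / (theta + i - 1)) {
      alleles <- c(alleles, length(unique(alleles)) + 1L)      # novel allele
    } else {
      alleles <- c(alleles, alleles[sample.int(length(alleles), 1)])  # copy
    }
  }
  tabulate(alleles)  # counts per allele
}

# Choose theta so that the expected number of alleles E[K] equals k
tune_theta <- function(n, k)
  uniroot(function(th) sum(th / (th + 0:(n - 1))) - k, c(1e-3, 100))$root

ewens_F_dist <- function(n, k, theta, nsim = 200) {
  Fs <- numeric(0)
  while (length(Fs) < nsim) {
    cnt <- hoppe_sample(n, theta)
    if (length(cnt) == k) Fs <- c(Fs, sum((cnt / n)^2))  # condition on k
  }
  Fs
}

# Toy data: 20 gene copies, 4 alleles
set.seed(42)
obs <- c(12, 5, 2, 1)
Fobs <- sum((obs / sum(obs))^2)
Fnull <- ewens_F_dist(sum(obs), length(obs), tune_theta(sum(obs), length(obs)))
mean(Fnull >= Fobs)  # fraction of neutral samples at least as homozygous
</imports>
```

The rejection step is the simplest (if not the most efficient) way to condition on the observed number of alleles.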
24. Classic Data (Keith et al., 1985)
• Xdh in D. pseudoobscura, analyzed by sequential
electrophoresis
• 89 samples, 15 distinct alleles
25. Testing the Data
1. Input the Data
Xdh <- c(52,9,8,4,4,2,2,1,1,1,1,1,1,1,1) # vector of allele numbers
k <- length(Xdh) # number of alleles = k
n <- sum(Xdh)    # number of samples = n
2. Calculate Expected Homozygosity
Fx <-fhat(Xdh)
3. Run the Analysis
Ewens(n,k,Fx)
27. With Newer (and more complete) Data
Lactase Haplotypes in European and African Populations
1. Download data for Lactase gene from HapMap (CEU, YRI)
o 25 SNPs
o 48,000 KB
2. Determine numbers of haplotypes and frequencies for each
3. Apply the Ewens-Watterson test to each.
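Step 2 can be sketched in a few lines, assuming the phased genotypes have already been read into a character matrix (here a hypothetical `haps`, one row per chromosome and one column per SNP; the slides do not show the download and parsing code):

```r
# Hypothetical phased data: 40 chromosomes x 25 SNPs (random placeholder
# standing in for the HapMap download)
set.seed(1)
haps <- matrix(sample(c("A", "G"), 40 * 25, replace = TRUE), nrow = 40)

# Collapse each row into a haplotype string and tabulate frequencies
hap_str <- apply(haps, 1, paste, collapse = "")
counts <- sort(table(hap_str), decreasing = TRUE)  # haplotype spectrum
k <- length(counts)  # number of distinct haplotypes
n <- sum(counts)     # number of chromosomes sampled
```

The resulting `counts`, `k` and `n` feed into the same Ewens-Watterson machinery as the Xdh allele counts above.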
29. Some Basic Statistics from Sequence Data
library(seqinr)
library(pegas)
dat <-read.fasta(file="./Data/FGB.fas")
#additional code needed to rearrange data
sites <-seg.sites(dat.dna)
nd <-nuc.div(dat.dna)
taj <-tajima.test(dat.dna)
length(sites); nd;taj$D
[1] 23
[1] 0.007561061
[1] -0.7759744
Intron sequences, 433 nucleotides each
from Peters JL, Roberts TE, Winker K, McCracken KG (2012)
PLoS ONE 7(2): e31972. doi:10.1371/journal.pone.0031972
30. Coalescence I – A Bunch of Trees
trees <-read.tree("http://dl.dropbox.com/u/9752688/ZOO%20422P/R/msfiles/tree.1.txt")
plot(trees[1:9],layout=9)
32. Coalescence III – Summary Statistics
system("./ms 50 1000 -s 10 -L | ./sample_stats >samp.ss")
# 1000 simulations of 50 samples, with number of sites set to 10
ss.out <-read_ss("samp.ss")
head(ss.out)
        pi  S         D   thetaH         H
1 1.825306 10 -0.521575 2.419592 -0.594286
2 2.746939 10  0.658832 2.518367  0.228571
3 3.837551 10  2.055665 3.631837  0.205714
4 2.985306 10  0.964128 2.280000  0.705306
5 1.577959 10 -0.838371 5.728163 -4.150204
6 2.991020 10  0.971447 3.539592 -0.548571
33. Coalescence IV – Distribution of Summary Statistics
hist(ss.out$D,main="Distribution of Tajima's D (N=1000)",xlab="D")
abline(v=mean(ss.out$D),col="blue")
abline(v=quantile(ss.out$D,c(.025,.975)),col="red")
34. Other Uses
• Data Manipulation
o Conversion of HapMap Data for use elsewhere (e.g. Genalex)
o Other data sources via APIs (e.g. package rdryad)
• Other Analyses
o Hierarchical F statistics (hierfstat)
o Haplotype networking (pegas)
o Phylogenetics (ape, phyclust, others)
o Approximate Bayesian Computation (abc)
• Access for students
o Scripts available via LMS
o Course specific functions can be accessed (source("http://db.tt/A6tReYEC"))
o Notes with embedded code in HTML (Rstudio, knitr)
36. Challenges
• Some coding required
• Data Structures are a challenge
• Packages are heterogeneous
• Students resist coding
37. Nevertheless
• Fundamental concepts can be easily visualized graphically
• Real data can be incorporated from the outset
• It takes students from fundamental concepts to real-world
applications and analyses
For Further information:
cochrabj@miamioh.edu
Functions
http://db.tt/A6tReYEC