Teaching Population Genetics with R

Presented at Evolution 2013, June 24; describes an approach to teaching population genetics at the upper undergraduate/beginning graduate level, using simulations based in R and incorporating available large genomic data sets.

Published in: Education, Technology

  1. A Simulation-Based Approach to Teaching Population Genetics: R as a Teaching Platform
     Bruce J. Cochrane
     Department of Zoology/Biology
     Miami University
     Oxford OH
  2. Two Time Points
     • 1974
       o Lots of Theory
       o Not much Data
       o Allozymes Rule
     • 2013
       o Even More Theory
       o Lots of Data
       o Sequences, -omics, ???
  3. The Problem
     • The basic approach hasn’t changed, e.g.
       o Hardy Weinberg
       o Mutation
       o Selection
       o Drift
       o Etc.
     • Much of it is deterministic
  4. And
     • There is little initial connection with real data
       o The world seems to revolve around A and a
     • At least in my hands, it doesn’t work
  5. The Alternative
     • Take a numerical (as opposed to analytical) approach
     • Focus on understanding random variables and distributions
     • Incorporate “big data”
     • Introduce current approaches – coalescence, Bayesian Analysis, etc. – in this context
  6. Why R?
     • Open Source
     • Platform-independent (Windows, Mac, Linux)
     • Object oriented
     • Facile Graphics
     • Web-oriented
     • Packages available for specialized functions
  7. Where We are Going
     • The Basics – Distributions, chi-square and the Hardy Weinberg Equilibrium
     • Simulating the Ewens-Watterson Distribution
     • Coalescence and summary statistics
     • What works and what doesn’t
  8. The RStudio Interface
  9. The Normal Distribution
         dat.norm <- rnorm(1000)
         hist(dat.norm, freq=FALSE, ylim=c(0,.5))
         curve(dnorm(x,0,1), add=TRUE, col="red")
         mean(dat.norm)
         var(dat.norm)
         > mean(dat.norm)
         [1] 0.003546691
         > var(dat.norm)
         [1] 1.020076
  10. Sample Size and Cutoff Values
          n <- c(10,30,100,1000)
          res <- sapply(n, ndist)
          colnames(res) = n
          res
          > res
                       10        30       100      1000
          2.5%  -1.110054 -1.599227 -1.713401 -1.981675
          97.5%  2.043314  1.679208  1.729095  1.928852
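The slide calls `ndist` without showing it; it is presumably one of the course-specific functions. A minimal sketch of what such a helper might look like is given here, assuming it draws `n` standard normal values and returns their empirical 2.5% and 97.5% quantiles. The body below is a guess for illustration, not the author's code.

```r
# Hypothetical stand-in for the course function ndist (assumption):
# draw n standard normal values and return the empirical 2.5% and 97.5% cutoffs.
ndist <- function(n) {
  quantile(rnorm(n), c(.025, .975))
}

set.seed(1)                 # fixed seed so the run is repeatable
n <- c(10, 30, 100, 1000)
res <- sapply(n, ndist)     # one column of cutoffs per sample size
colnames(res) <- n
res
```

As the sample size grows, the two rows should settle near the theoretical cutoffs of about -1.96 and 1.96.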
  11. What is chi-square All About?
          xsq <- rchisq(10000,1)
          hist(xsq, main="Chi Square Distribution, N=10000, 1 d. f", xlab="Value")
          p05 <- quantile(xsq,.95)
          abline(v=p05, col="red")
          p05
               95%
          3.867886
  12. Simple Generation of Critical Values
          d <- 1:10
          chicrit <- qchisq(.95,d)
          chitab <- cbind(d,chicrit)
          chitab
                 d   chicrit
           [1,]  1  3.841459
           [2,]  2  5.991465
           [3,]  3  7.814728
           [4,]  4  9.487729
           [5,]  5 11.070498
           [6,]  6 12.591587
           [7,]  7 14.067140
           [8,]  8 15.507313
           [9,]  9 16.918978
          [10,] 10 18.307038
  13. Calculating chi-squared
      The function
          chixw <- function(obs,exp,df=1){
            chi <- sum((obs-exp)^2/exp)   # chi-square statistic
            pr <- 1-pchisq(chi,df)        # upper-tail probability
            c(chi,pr)
          }
      A sample function call
          obs <- c(315,108,101,32)
          z <- sum(obs)/16
          exp <- c(9*z,3*z,3*z,z)
          chixw(obs,exp,3)
      The output
          chi-square = 0.47
          probability(<.05) = 0.93
          deg. freedom = 3
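As a cross-check on the hand-written function, the same 9:3:3:1 test can be run with base R's `chisq.test`, passing the expected proportions through its `p` argument. This is a sketch for comparison only, not part of the original slides; the variable name `chk` is arbitrary.

```r
# Cross-check with base R: goodness-of-fit test against 9:3:3:1 proportions
obs <- c(315, 108, 101, 32)
chk <- chisq.test(obs, p = c(9, 3, 3, 1) / 16)

chk$statistic   # chi-square statistic
chk$parameter   # degrees of freedom (number of classes - 1)
chk$p.value     # upper-tail probability
```

The statistic and probability should agree with the slide's output (about 0.47 and 0.93 with 3 degrees of freedom).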
  14. Basic Hardy Weinberg Calculations
      The Biallelic Case
      Sample input
          obs <- c(13,35,70)
          hw(obs)
      Output
          [1] "p= 0.2585 q= 0.7415"
               obs exp
          [1,]  13   8
          [2,]  35  45
          [3,]  70  65
          [1] "chi squared = 5.732 p = 0.017 with 1 d. f."
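The body of `hw` is not shown on the slide. The sketch below is one way the biallelic calculation could be written, under two assumptions: genotype counts are ordered AA, AB, BB, and expected counts are rounded to whole numbers before the chi-square is computed (rounding is what reproduces the slide's 5.732; unrounded expected counts give a slightly larger value). The name `hw_sketch` is hypothetical.

```r
# Hypothetical sketch of a biallelic Hardy-Weinberg test (not the author's hw()).
# Assumes obs = c(AA, AB, BB) genotype counts.
hw_sketch <- function(obs) {
  n <- sum(obs)
  p <- (2 * obs[1] + obs[2]) / (2 * n)          # frequency of allele A
  q <- 1 - p
  expd <- round(n * c(p^2, 2 * p * q, q^2))     # rounded expected counts (assumption)
  chi <- sum((obs - expd)^2 / expd)
  pr <- pchisq(chi, df = 1, lower.tail = FALSE) # 3 classes - 1 - 1 estimated parameter
  list(p = p, q = q, exp = expd, chi = chi, pr = pr)
}

out <- hw_sketch(c(13, 35, 70))
out
```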
  15. Illustrating With Ternary Plots
          library(HardyWeinberg)
          dat <- (HWData(100,100))
          gdist <- dat$Xt #create a variable with the working data
          HWTernaryPlot(gdist, hwcurve=TRUE, addmarkers=FALSE, region=0,
                        vbounds=FALSE, axis=2, vertexlab=c("0","","1"),
                        main="Theoretical Relationship", cex.main=1.5)
  16. Access to Data
      • Direct access of data
        o HapMap
        o Dryad
        o Others
      • Manipulation and visualization within R
      • Preparation for export (e.g. Genalex)
  17. Direct Access of HapMap Data
          library(chopsticks)
          chr21 <- read.HapMap.data("http://hapmap.ncbi.nlm.nih.gov/downloads/genotypes/latest_phaseII_ncbi_b36/fwd_strand/non-redundant/genotypes_chr21_YRI_r24_nr.b36_fwd.txt.gz")
          chr21.sum <- summary(chr21$snp.data)
          head(chr21.sum)
                     Calls Call.rate        MAF      P.AA      P.AB       P.BB     z.HWE
          rs885550      90 1.0000000 0.09444444 0.8111111 0.1888889 0.00000000 0.9894243
          rs1468022     90 1.0000000 0.00000000 0.0000000 0.0000000 1.00000000        NA
          rs169758      90 1.0000000 0.31666667 0.4000000 0.5666667 0.03333333 2.9349509
          rs150482      89 0.9888889 0.00000000 0.0000000 0.0000000 1.00000000        NA
          rs12627229    89 0.9888889 0.00000000 0.0000000 0.0000000 1.00000000        NA
          rs9982283     90 1.0000000 0.05555556 0.0000000 0.1111111 0.88888889 0.5580490
  18. Distribution of Hardy Weinberg Deviation on Chromosome 22 Markers
  19. And Determining the Number of Outliers
          nsnps <- length(hwdist)
          quant <- quantile(hwdist,c(.025,.975))
          low <- length(hwdist[hwdist<quant[1]])
          high <- length(hwdist[hwdist>quant[2]])
          accept <- nsnps-low-high
          low; accept; high
          [1] 982
          [1] 37330
          [1] 976
  20. Sampling and Plotting Deviation from Hardy Weinberg
          chr21.poly <- na.omit(chr21.sum) #remove all NAs (fixed SNPs)
          chr21.samp <- sample(nrow(chr21.poly),1000, replace=FALSE)
          plot(chr21.poly$z.HWE[chr21.samp])
  21. Plotting F for Randomly Sampled Markers
          chr21.sub <- chr21.poly[chr21.samp,]
          Hexp <- 2*chr21.sub$MAF*(1-chr21.sub$MAF)
          Fi <- 1-(chr21.sub$P.AB/Hexp)
          plot(Fi,xlab="Locus",ylab="F")
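To make the F calculation concrete, here is the same arithmetic worked for a single marker, using the `MAF` and `P.AB` values printed for rs169758 in the `head(chr21.sum)` output. The variable names are illustrative only.

```r
# Worked example for one SNP (rs169758), values copied from the summary table
maf  <- 0.31666667            # minor allele frequency
p_ab <- 0.5666667             # observed heterozygote frequency

hexp <- 2 * maf * (1 - maf)   # expected heterozygosity under Hardy-Weinberg
Fi   <- 1 - p_ab / hexp       # F = 1 - Hobs/Hexp
Fi
```

A negative F, as here, indicates more heterozygotes than Hardy-Weinberg proportions predict.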
  22. Additional Information
          head(chr21$snp.support)
                     dbSNPalleles Assignment Chromosome Position Strand
          rs885550            C/T        C/T      chr21  9887804      +
          rs1468022           C/T        C/T      chr21  9887958      +
          rs169758            C/T        C/T      chr21  9928786      +
          rs150482            A/G        A/G      chr21  9932218      +
          rs12627229          C/T        C/T      chr21  9935312      +
          rs9982283           C/T        C/T      chr21  9935844      +
  23. The Ewens-Watterson Test
      • Based on Ewens (1977) derivation of the theoretical equilibrium distribution of allele frequencies under the infinite allele model.
      • Uses expected homozygosity (Σp²) as test statistic
      • Compares observed homozygosity in sample to expected distribution in n random simulations
      • Observed data are
        o N = number of samples
        o k = number of alleles
        o Allele Frequency Distribution
  24. Classic Data (Keith et al., 1985)
      • Xdh in D. pseudoobscura, analyzed by sequential electrophoresis
      • 89 samples, 15 distinct alleles
  25. Testing the Data
      1. Input the Data
          Xdh <- c(52,9,8,4,4,2,2,1,1,1,1,1,1,1,1) # vector of allele numbers
          k <- length(Xdh) # number of alleles = k
          n <- sum(Xdh) # number of samples = n
      2. Calculate Expected Homozygosity
          Fx <- fhat(Xdh)
      3. Run the Analysis
          Ewens(n,k,Fx)
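`fhat` and `Ewens` are course functions whose bodies are not shown. For the test statistic alone, a minimal sketch follows, assuming `fhat` simply computes the sum of squared allele frequencies (Σp²) described on the Ewens-Watterson slide. The name `fhat_sketch` is hypothetical, and no attempt is made here to reproduce the simulation step in `Ewens`.

```r
# Hypothetical stand-in for fhat (assumption): homozygosity as sum of squared frequencies
fhat_sketch <- function(counts) {
  freqs <- counts / sum(counts)   # allele frequencies from allele counts
  sum(freqs^2)
}

Xdh <- c(52, 9, 8, 4, 4, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1)
k <- length(Xdh)    # number of distinct alleles
n <- sum(Xdh)       # number of sampled genes
Fx <- fhat_sketch(Xdh)
c(n = n, k = k, Fx = Fx)
```

The counts give n = 89 and k = 15, matching the figures quoted for the Keith et al. data.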
  26. The Result
  27. With Newer (and more complete) Data
      Lactase Haplotypes in European and African Populations
      1. Download data for Lactase gene from HapMap (CEU, YRI)
         o 25 SNPs
         o 48,000 KB
      2. Determine numbers of haplotypes and frequencies for each
      3. Apply Ewens-Watterson test to each.
  28. The Results
          par(mfrow=c(2,1))
          pops <- c("ceu","yri")
          sapply(pops,hapE)
      [Figure panels: CEU, YRI]
  29. Some Basic Statistics from Sequence Data
          library(seqinr)
          library(pegas)
          dat <- read.fasta(file="./Data/FGB.fas")
          #additional code needed to rearrange data
          sites <- seg.sites(dat.dna)
          nd <- nuc.div(dat.dna)
          taj <- tajima.test(dat.dna)
          length(sites); nd; taj$D
          [1] 23
          [1] 0.007561061
          [1] -0.7759744
      Intron sequences, 433 nucleotides each, from Peters JL, Roberts TE, Winker K, McCracken KG (2012) PLoS ONE 7(2): e31972. doi:10.1371/journal.pone.0031972
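To show what the first two statistics measure, the sketch below computes the number of segregating sites and a per-site nucleotide diversity by hand on a made-up three-sequence alignment, using base R only. The alignment and variable names are invented for illustration, and the calculation is a plain mean of pairwise differences, which may not match every detail of the `pegas` implementations.

```r
# Toy alignment (invented data): 3 sequences, 5 sites
aln <- rbind(s1 = c("A", "C", "G", "T", "A"),
             s2 = c("A", "C", "G", "A", "A"),
             s3 = c("A", "T", "G", "A", "A"))

# Segregating sites: columns with more than one distinct base
seg <- which(apply(aln, 2, function(col) length(unique(col)) > 1))

# Nucleotide diversity: mean proportion of differing sites over all sequence pairs
pairs <- combn(nrow(aln), 2)
pi_hat <- mean(apply(pairs, 2, function(ix) mean(aln[ix[1], ] != aln[ix[2], ])))

length(seg); pi_hat
```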
  30. Coalescence I – A Bunch of Trees
          trees <- read.tree("http://dl.dropbox.com/u/9752688/ZOO%20422P/R/msfiles/tree.1.txt")
          plot(trees[1:9],layout=9)
  31. Coalescence II - MRCA
          msout.1.txt <- system("./ms 10 1000 -t .1 -L", intern=TRUE)
          ms.1 <- read.ms.output(msout.1.txt)
          hist(ms.1$times[,1],main="MRCA, Theta=0.1",xlab="4N")
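For readers without the `ms` binary, the distribution of the time to the MRCA can be approximated directly in base R. This is a substitute sketch, not the slide's method: it assumes the standard neutral coalescent with constant population size, where the waiting time while k lineages remain is exponential with rate k(k-1) when time is measured in units of 4N generations.

```r
# Base-R sketch of TMRCA under the standard coalescent (swap-in for ms, assumption)
set.seed(42)
n_samp <- 10                  # sample size, as in "./ms 10 1000"
k <- n_samp:2                 # number of lineages during each coalescent interval

# Each replicate: sum of exponential waiting times, rate k(k-1) in 4N units
tmrca <- replicate(1000, sum(rexp(length(k), rate = k * (k - 1))))

hist(tmrca, main = "Simulated MRCA times", xlab = "4N")
mean(tmrca)                   # theory: 1 - 1/n_samp
```

The theoretical mean in these units is 1 - 1/n, so the simulated mean for ten samples should fall close to 0.9.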
  32. Coalescence III – Summary Statistics
          system("./ms 50 1000 -s 10 -L | ./sample_stats >samp.ss")
          # 1000 simulations of 50 samples, with number of sites set to 10
          ss.out <- read_ss("samp.ss")
          head(ss.out)
                  pi  S         D   thetaH         H
          1 1.825306 10 -0.521575 2.419592 -0.594286
          2 2.746939 10  0.658832 2.518367  0.228571
          3 3.837551 10  2.055665 3.631837  0.205714
          4 2.985306 10  0.964128 2.280000  0.705306
          5 1.577959 10 -0.838371 5.728163 -4.150204
          6 2.991020 10  0.971447 3.539592 -0.548571
  33. Coalescence IV – Distribution of Summary Statistics
          hist(ss.out$D,main="Distribution of Tajima's D (N=1000)",xlab="D")
          abline(v=mean(ss.out$D),col="blue")
          abline(v=quantile(ss.out$D,c(.025,.975)),col="red")
  34. Other Uses
      • Data Manipulation
        o Conversion of HapMap Data for use elsewhere (e.g. Genalex)
        o Other data sources via APIs (e.g. package rdryad)
      • Other Analyses
        o Hierarchical F statistics (hierfstat)
        o Haplotype networking (pegas)
        o Phylogenetics (ape, phyclust, others)
        o Approximate Bayesian Computation (abc)
      • Access for students
        o Scripts available via LMS
        o Course specific functions can be accessed (source("http://db.tt/A6tReYEC"))
        o Notes with embedded code in HTML (RStudio, knitr)
  35. Sample HTML Rendering
  36. Challenges
      • Some coding required
      • Data Structures are a challenge
      • Packages are heterogeneous
      • Students resist coding
  37. Nevertheless
      • Fundamental concepts can be easily visualized graphically
      • Real data can be incorporated from the outset
      • It takes students from fundamental concepts to real-world applications and analyses
      For Further information: cochrabj@miamioh.edu
      Functions: http://db.tt/A6tReYEC