Workshop 2011
Transcript

  • 1. Population Structure Analysis using STRUCTURE software. Chang Bum Hong, kt Bioinformatics TF, hongiiv@gmail.com, twitter @hongiiv, hongiiv.tistory.com. Permissions: you are free to blog or live-blog about this presentation as long as you attribute the work to its authors. Friday, August 12, 11
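  • 2. Genetic test: When alcohol is consumed, it is generally converted into acetaldehyde (a toxic substance that flushes the face, makes the heart pound, and causes vomiting), which ALDH then breaks down into acetic acid (acetate), which is harmless to the body. The ALDH2 gene is involved in breaking down acetaldehyde as soon as any is produced, and individuals fall into three types depending on their genotype.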
  • 3. 23andMe
  • 4. Northwestern Europe, Southeastern Europe
  • 5. HGDP (Human Genome Diversity Project)
  • 6. PASNP (Pan-Asian SNP Consortium)
  • 7. East Asia - Public genotype data
        Data set           SNP          Individual   Population
        PASNP (a)          54,794       1,928        75
        HGDP               2,834~       1,056        52
        HapMap             1,481,135    1,397        11
        SGVP (b)           268,667      292          3
        Korean             58,625       159          10
        China(Yanbian)     58,625       16           1
        Japan(Kobe)        58,625       5            1
        Korea-Japan        58,625       6            1
        Vietnam            58,625       16           1
        Korean-Vietnam     58,625       8            1
        Cambodia           58,625       16           1
        Mongol             58,625       16           1
        a. Pan-Asian SNP Consortium (http://www4a.biotec.or.th/PASNP)
        b. Singapore Genome Variation Project (http://www.nus-cme.org.sg/SGVP)
  • 8. Korean Data: 10 regions (YeonCheon, PyeongChang, JeCheon, CheonAn, GyeongJu, UlSan, GimJe, GoRyeong, NaJu, JeJu); map counts: 16 at nine locations, 15 at one; average >70 years old, long settlement; Affymetrix 50K Xba, 58,960 SNPs. Map labels: MW, SW, SE. Other samples listed: China(Yanbian), Japan(Kobe), Korea-Japan, Vietnam, Korean-Vietnam, Cambodia, Mongol
  • 9. Missing genotype individuals [figure: two panels, All Asian and Korean, each before QC with 58,960 SNPs; labeled regions: GimJe, GoRyeong, GyeongJu]
  • 10. Relatedness between the 153 Korean individuals (10 regions): PCA analysis using 46,559 autosomal SNP markers (n=153, Korean). Regions: YeonCheon, PyeongChang, JeCheon, CheonAn, GyeongJu, UlSan, GimJe, GoRyeong, NaJu, JeJu
  • 11. PCA analysis of East Asian descent: illustration of geographic correspondence of ethnic group locations. Plot labels: Mongol, Yanbian, Kobe, JPT-HapMap, Jeju, CHB-HapMap, Vietnam, Cambodia, Korea-Vietnam, Korea-Japan
  • 12. Relationship between Eigenvector values and Latitude: R² = 0.8621, y = 36.65 + 166.33x. Values labeled on the plot: 47.81, 39.98, 37.53, 14.72
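The fitted line on slide 12 can be evaluated directly. The sketch below assumes, since the slide does not say so explicitly, that x is an eigenvector (principal component) value and y is latitude in degrees; the function names are made up for illustration.

```python
# Fitted line reported on the slide: y = 36.65 + 166.33x (R^2 = 0.8621).
# Assumption: x is an eigenvector value, y is latitude in degrees.
INTERCEPT = 36.65
SLOPE = 166.33


def predicted_latitude(eigenvector_value: float) -> float:
    """Latitude predicted by the fitted line for one eigenvector value."""
    return INTERCEPT + SLOPE * eigenvector_value


def eigenvector_for_latitude(latitude: float) -> float:
    """Invert the fitted line: the x at which it predicts `latitude`."""
    return (latitude - INTERCEPT) / SLOPE


# Values labeled on the plot, mapped back onto the fitted line.
for latitude in (47.81, 39.98, 37.53, 14.72):
    x = eigenvector_for_latitude(latitude)
    print(f"latitude {latitude:6.2f} -> eigenvector value {x:+.4f}")
```

With R² = 0.8621 the line explains most, but not all, of the variation, so real samples will scatter around these values.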
  • 13. STRUCTURE software • A model-based clustering method (Pritchard et al. 2000) • Free software (http://pritch.bsd.uchicago.edu/structure.html) • Bayesian approach (MCMC: Markov Chain Monte Carlo) • Detects the underlying genetic populations among a set of individuals genotyped at multiple markers • Computes the proportion of the genome of an individual originating from each inferred population (quantitative clustering method)
  • 14. Input data • A matrix where the data for individuals are in rows and the loci are in columns • n consecutive rows hold the data for each individual of an n-ploid species • Integers should be used for coding genotypes • Missing data should be indicated by a number (e.g. -1) which doesn't occur elsewhere in the data • The data file for running STRUCTURE should be a text file (.txt), not an Excel file (.xls)
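The rules on slide 14 can be checked mechanically before a file is loaded. The following is a minimal sketch, not part of STRUCTURE itself: it assumes each row starts with an individual label followed only by integer allele codes, and the function name `check_structure_matrix` is hypothetical.

```python
from typing import List, Tuple


def check_structure_matrix(
    lines: List[str], ploidy: int = 2, missing: int = -1
) -> Tuple[int, int]:
    """Validate a genotype matrix and return (individuals, missing values).

    Assumes: first column is a label, all other columns are integer alleles.
    Raises ValueError on the first problem found.
    """
    rows = [line.split() for line in lines if line.strip()]
    if not rows:
        raise ValueError("no data rows")
    if len(rows) % ploidy != 0:
        raise ValueError("row count is not a multiple of the ploidy")
    width = len(rows[0])
    missing_count = 0
    for number, row in enumerate(rows, start=1):
        if len(row) != width:
            raise ValueError(f"row {number}: expected {width} columns")
        for value in row[1:]:
            try:
                allele = int(value)
            except ValueError:
                raise ValueError(
                    f"row {number}: genotype {value!r} is not an integer"
                ) from None
            if allele == missing:
                missing_count += 1
    # The n rows of one individual must be consecutive and share a label.
    for start in range(0, len(rows), ploidy):
        labels = {row[0] for row in rows[start:start + ploidy]}
        if len(labels) != 1:
            raise ValueError(f"rows {start + 1}-{start + ploidy}: labels differ")
    return len(rows) // ploidy, missing_count


example = [
    "ind1 1 2 -1",
    "ind1 1 1 2",
    "ind2 2 2 1",
    "ind2 1 2 1",
]
print(check_structure_matrix(example))  # (2, 1)
```

Passing `ploidy=1` covers files that use a single row per individual.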
  • 15. Input format: 1 consecutive rows for alleles. Header row: MarkerName... Data rows: Label PopID Flag Location Genotype... Genotype coding (1,2,5): AA = 11, AB = 12, BB = 22, missing = 55. Information of user-defined populations: Label: each individual's unique ID; any numbers or letters are fine (e.g., CEPH1334.10). PopID: a unique number for the ethnic group the individual belongs to, assigned by the user (e.g., 5 for Chinese (CHB), 1 for Europeans (CEU)). Flag: whether to use this PopID information when running STRUCTURE (1 = use, 2 = do not use). Location: the individual's location information, assigned by the user (e.g., 1 for East Asia (EAS), 2 for Europe (EURA))
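As an illustration of the layout on slide 15, the sketch below turns genotype calls into rows using the slide's coding (AA = 11, AB = 12, BB = 22, missing = 55). The helper names and the sample individuals are hypothetical; only the column order and the codes come from the slide.

```python
from typing import List, Optional, Sequence

# Coding taken from the slide; None stands for a missing call.
GENOTYPE_CODES = {"AA": 11, "AB": 12, "BB": 22}
MISSING_CODE = 55


def encode_individual(
    label: str,
    pop_id: int,
    flag: int,
    location: int,
    genotypes: Sequence[Optional[str]],
) -> str:
    """Return one row: Label PopID Flag Location Genotype..."""
    codes = []
    for call in genotypes:
        if call is None:
            codes.append(MISSING_CODE)
        else:
            # Sort the two letters so that "BA" is treated as "AB".
            codes.append(GENOTYPE_CODES["".join(sorted(call))])
    fields = [label, pop_id, flag, location, *codes]
    return " ".join(str(field) for field in fields)


def build_input(marker_names: Sequence[str], rows: List[str]) -> str:
    """Join the marker-name header row and the individual rows."""
    return "\n".join([" ".join(marker_names), *rows]) + "\n"


row = encode_individual("CEPH1334.10", 1, 1, 2, ["AA", "BA", None, "BB"])
print(row)  # CEPH1334.10 1 1 2 11 12 55 22
print(build_input(["rs1", "rs2", "rs3", "rs4"], [row]), end="")
```

An unknown call such as "AC" raises `KeyError` rather than being silently written as missing, which seems the safer default for a sketch like this.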
  • 16. Input format (cont.)
  • 17. Running STRUCTURE from a graphical interface, Front End
  • 18. Importing input data into a project
  • 19. Importing input data into a project (cont.)
  • 20. Importing input data into a project (cont.)
  • 21. Importing input data into a project (cont.)
  • 22. Importing input data into a project (cont.)
  • 23. Importing input data into a project (cont.)
  • 24. Importing input data into a project (cont.)
  • 25. Importing input data into a project (cont.)
  • 26. Configuring a parameter set
  • 27. Configuring a parameter set (cont.) Length of Burnin Period: how long to run the simulation before collecting data, to minimize the effect of the starting configuration (the number of iterations until convergence to the objective function). Number of MCMC Reps after Burnin: how long to run the simulation after burnin to get accurate parameter estimates
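To see why the burn-in period matters, here is a toy random-walk Metropolis sampler. It is not STRUCTURE's sampler, only a sketch of the same idea: the chain starts far from the target (a standard normal, chosen here purely for illustration), the first `burnin` steps are thrown away, and the next `numreps` steps are averaged.

```python
import math
import random


def metropolis_mean(
    burnin: int, numreps: int, start: float = 50.0, seed: int = 1
) -> float:
    """Estimate the mean of a standard normal with a Metropolis chain.

    The first `burnin` steps are discarded; the next `numreps` are averaged.
    """
    rng = random.Random(seed)
    x = start
    total = 0.0
    for step in range(burnin + numreps):
        proposal = x + rng.uniform(-1.0, 1.0)
        # Log density of N(0, 1) is -x^2 / 2 up to a constant.
        log_accept = (x * x - proposal * proposal) / 2.0
        if log_accept >= 0 or rng.random() < math.exp(log_accept):
            x = proposal
        if step >= burnin:
            total += x
    return total / numreps


# Same number of collected steps, with and without a burn-in period.
print("no burn-in:  ", metropolis_mean(burnin=0, numreps=2000))
print("with burn-in:", metropolis_mean(burnin=1000, numreps=2000))
```

Without burn-in the estimate is pulled toward the arbitrary starting value of 50; with burn-in it should sit close to the true mean of 0. The two settings play the same roles as Length of Burnin Period and Number of MCMC Reps after Burnin on the slide.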
  • 28. Configuring a parameter set (cont.)
  • 29. Configuring a parameter set (cont.)
  • 30. Configuring a parameter set (cont.)
  • 31. Running STRUCTURE: a single run
  • 32. Running STRUCTURE: a single run (cont.)
  • 33. Running STRUCTURE: a batch run
  • 34. Running STRUCTURE: a batch run (cont.)
  • 35. Ln P(D): Estimated probability of Ks
  • 36.
  • 37. Analysis of genome-wide SNP data • For very large data sets, the runtime of structure using default settings may become impractically slow • reduced data sets (ex, pruned) • get accurate results using much shorter runs than default (ex, small values of NUMREPS) • download the source code and compile it on your machine (using 64-bit machine) • use the command-line version of structure
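A batch of command-line runs over several values of K could be prepared along the following lines. This is a sketch under assumptions: the executable is taken to be called `structure` and to accept `-K`, `-i`, `-o`, `-m` and `-e` as described in the STRUCTURE documentation, so check these against your installed version. The file names and the output naming scheme are hypothetical, and the script only builds and prints the commands without running them.

```python
import shlex
from typing import List


def structure_commands(
    input_file: str,
    max_k: int,
    replicates: int,
    out_prefix: str = "run",
    binary: str = "structure",       # assumed executable name
    mainparams: str = "mainparams",  # assumed parameter file names
    extraparams: str = "extraparams",
) -> List[List[str]]:
    """Build one command per (K, replicate) pair, K = 1..max_k."""
    commands = []
    for k in range(1, max_k + 1):
        for rep in range(1, replicates + 1):
            output = f"{out_prefix}_K{k}_rep{rep}"
            commands.append([
                binary,
                "-m", mainparams,
                "-e", extraparams,
                "-K", str(k),
                "-i", input_file,
                "-o", output,
            ])
    return commands


for command in structure_commands("genotypes.txt", max_k=3, replicates=2):
    print(" ".join(shlex.quote(part) for part in command))
```

Each printed line can be pasted into a shell or handed to a job scheduler; several replicates per K are needed later to judge the variance of Ln P(D) between runs.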
  • 38. An example of MCMC convergence
  • 39. Inference of true K (number of populations) • The log likelihood for each K, Ln P(D) = L(K) • Two approaches to determine the best K • Use of L(K): when K is approaching a true value, L(K) plateaus and has high variance between runs • Use of an ad hoc quantity (∆K): calculated based on the second order rate of change of the likelihood. The ∆K shows a clear peak at the true value of K
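The ∆K idea on slide 39 can be sketched in a few lines. The version below assumes the commonly used form ∆K = |L(K+1) − 2L(K) + L(K−1)| / sd(L(K)), with L(K) taken as the mean Ln P(D) over replicate runs and sd as the standard deviation across those runs. The Ln P(D) values are made up for illustration and do not come from the slides.

```python
from statistics import mean, stdev
from typing import Dict, List


def delta_k(lnpd: Dict[int, List[float]]) -> Dict[int, float]:
    """Compute Delta K for every K that has both neighbours K-1 and K+1.

    `lnpd` maps K to the Ln P(D) values of its replicate runs.
    """
    means = {k: mean(values) for k, values in lnpd.items()}
    result = {}
    for k in sorted(lnpd):
        # Undefined at the smallest and largest K tested.
        if k - 1 not in means or k + 1 not in means or len(lnpd[k]) < 2:
            continue
        spread = stdev(lnpd[k])
        if spread == 0:
            continue  # avoid dividing by zero when all runs agree exactly
        second_difference = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        result[k] = second_difference / spread
    return result


# Hypothetical Ln P(D) values, three replicate runs per K.
runs = {
    1: [-1000.0, -1002.0, -998.0],
    2: [-900.0, -901.0, -899.0],
    3: [-890.0, -905.0, -880.0],
    4: [-895.0, -870.0, -910.0],
}
scores = delta_k(runs)
best_k = max(scores, key=scores.get)
print(scores, "-> best K:", best_k)
```

In this made-up example L(K) rises sharply from K = 1 to K = 2 and then plateaus with growing variance between runs, so ∆K peaks at K = 2, which is the pattern both approaches on the slide describe.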
  • 40.
  • 41. Simulation Result: Q-matrix, the degree to which an individual belongs to a subpopulation
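A Q-matrix has one row per individual and one column per inferred population, and each row sums to 1. The sketch below reads such a matrix and labels each individual with its dominant population; the 0.8 cutoff for calling an individual unadmixed is an arbitrary illustrative choice, not a value from the slides, and the matrix entries are invented.

```python
from typing import List, Optional


def assign_clusters(
    q_matrix: List[List[float]], threshold: float = 0.8
) -> List[Optional[int]]:
    """Return the 1-based dominant population per individual.

    Individuals whose largest membership is below `threshold` get None,
    meaning they look admixed under this (hypothetical) cutoff.
    """
    assignments: List[Optional[int]] = []
    for number, row in enumerate(q_matrix, start=1):
        if abs(sum(row) - 1.0) > 1e-6:
            raise ValueError(f"row {number} does not sum to 1")
        best = max(range(len(row)), key=row.__getitem__)
        assignments.append(best + 1 if row[best] >= threshold else None)
    return assignments


q = [
    [0.95, 0.05],  # mostly population 1
    [0.10, 0.90],  # mostly population 2
    [0.55, 0.45],  # admixed
]
print(assign_clusters(q))  # [1, 2, None]
```

Hard assignment like this discards information; the bar plots STRUCTURE produces show the full membership proportions instead.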
  • 42. Simulation Result (cont.)
  • 43. Enjoy running STRUCTURE
  • 44. We may not always be able to know the TRUE value of K, but we should aim for the smallest value of K that captures the major structure in the data. Pritchard et al. (2000)