Analyzing Exome Data with KNIME

A presentation I gave on analyzing exome data with the KNIME workbench (http://www.knime.org).

Published in: Technology

  1. 1. Analysing Exome Data with KNIME. Pierre Lindenbaum, PhD, UMR915 – Institut du thorax, Nantes, France. @yokofakun, http://plindenbaum.blogspot.com
  2. 2. 2 exomes sequenced
  3. 3. 1st case: for a given mutation we expect [m/m] in one sample and not([m/m]) in the other.
  4. 4. Files
  5. 5. 42 columns. One record:

          $1 Position.hg19 : 142653
          $2 chrom : chr10
          $3 sample.ID : sample1
          $4 rs.name : rsXXXX
          $5 hapmap_ref_other :
          $6 X1000Genome.obs :
          $7 X1000Genome.desc :
          $8 Freq.HTZ.ExomesV1 : 0
          $9 Freq.Hom.ExomesV1 : 0
          $10 A : 0
          $11 C : 5
          $12 G : 0
          $13 T : 3
          $14 modified_call : CT
          $15 total : 9
          $16 used : 8
          $17 score : 18.30:12.00
          $18 reference : C
          $19 type : SNP_het1
          $20 Gene.name : ROXAN
          $21 Gene.start : 143652
          $22 Gene.end : 293700
          $23 strand : -
          $24 nbre.exon : 11
          $25 refseq : NR_0090
          $26 typeannot : 3-UTR
          $27 type.pos :
          $28 index.cdna :
          $29 index.prot :
          $30 Taille.cdna : 1769
          $31 Intron.start :
          $32 Intron.end :
          $33 codon.wild :
          $34 aa.wild :
          $35 codon.mut :
          $36 aa.mut :
          $37 cds.wild :
          $38 cds.mut :
          $39 prot.wild :
          $40 prot.mut :
          $41 mirna : no
          $42 region.splice : !N/A
  6. 6. (Same record as slide 5.) Highlighted: Genomic Position
  7. 7. (Same record as slide 5.) Highlighted: Sample Name
  8. 8. (Same record as slide 5.) Highlighted: RS## number
  9. 9. (Same record as slide 5.) Highlighted: Ref. & Alt. alleles
  10. 10. (Same record as slide 5.) Highlighted: Gene
  11. 11. (Same record as slide 5.) Highlighted: Prediction
  12. 12. (Same record as slide 5.) Highlighted: Homo/Hetero zygote
  13. 13. http://www.knime.org
  14. 15. Our workflow:
  15. 17. Read the data
  16. 18. Rename both “Sample” columns
  17. 19. Remove the sequences (save memory/speed)
  18. 20. Expect “not (snp_diff.*)” for
  19. 21. Expect “snp_diff.*” for
  20. 22. Merge data. Two columns: “SAMPLE_WILD” & “SAMPLE_MUTATED”
  21. 23. Highlight low quality
  22. 24. Remove low quality
  23. 25. Must be located in a gene
  24. 26. Remove if known rs#
  25. 27. Remove if synonymous mutation
  26. 28. Remove wild allele from Alt. (cleanup)
  27. 29. Group by Gene
  28. 31. Keep mutations carried by both samples
  29. 32. Group by Gene Name & Visualize
  30. 34. Retrieve the SNPs for each Gene.
  31. 36. bash version...

          # fields are assumed to be tab-separated
          #remove rs
          #in gene
          #remove the low qualities
          #keep SNP_diff
          #only the non-synonymous or stop
          #remove DNA & prot sequences
          #order by GENE
          gunzip -c AllChrom.exome.snp.pool.new.annotation.gz |
              awk -F '\t' '{if(substr($4,1,2)!="rs") print;}' |
              awk -F '\t' '{if($20!="") print;}' |
              awk -F '\t' '{if(index($19,"douteux")==0) print;}' |
              awk -F '\t' '{if(index($19,"_diff")!=0) print;}' |
              awk -F '\t' '{if(index($26,"nonsense")!=0 || index($26,"missense")!=0) print;}' |
              cut -d $'\t' -f 1-27 |
              sort -t $'\t' -k20,20 > _jeter1.txt

          #extract wild exome
          #remove rs
          #remove SNP_diff
          #in gene
          #order by gene
          gunzip -c AllChrom.exome.snp.u2437.new.annotation.gz |
              awk -F '\t' '{if(substr($4,1,2)!="rs") print;}' |
              awk -F '\t' '{if(index($19,"douteux")==0) print;}' |
              awk -F '\t' '{if(index($19,"_diff")==0) print;}' |
              awk -F '\t' '{if($20!="") print;}' |
              cut -d $'\t' -f 1-27 |
              sort -t $'\t' -k20,20 > _jeter3.txt

          #join wild & mutated data by gene
          #check wild sample has no mutation in the pair of mutated snps
          #remove wild data
          join -t $'\t' -1 20 -2 20 _jeter1.txt _jeter3.txt |
              awk -F '\t' '{if($3==$29 && int($2) == int($28) ) print;}' |
              cut -d $'\t' -f 1 |
              sort | uniq

          rm _jeter*.txt
  32. 37. 2nd case: compound heterozygous. In one gene: SNP1: [m/+], SNP2: [m/+]
  33. 38. The workflow:
  34. 39. Read the [m] & [+] files (mutated sample, wild sample)
  35. 40. Remove cDNA & protein sequences
  36. 41. Remove the SNPs having a rs#
  37. 42. Keep the heterozygous mutations
  38. 43. Remove poor quality
  39. 44. Keep the non-synonymous mutations
  40. 45. Create a new column: = chrom + "_" + position;
  41. 46. Rename the columns 'sample-id' (will generate two distinct columns after joining)
  42. 47. Left join on the column 'chrom_col'
  43. 48. Keep the mutations that were NOT part of the wild sample.
  44. 49. Cleanup, remove some columns.
  45. 50. Duplicate the table to create two lists of SNPs (5' & 3').
  46. 51. Join both tables on gene name.
  47. 52. Keep the SNPs having: pos(snp1) < pos(snp2)
  48. 53. Display the results
  49. 54. bash version...

          # fields are assumed to be tab-separated
          #remove rs
          #only keep the 'SNP_het'
          #remove the low qualities
          #remove SNP_het*
          #only the non-synonymous or stop
          #remove DNA & prot sequences
          #add chrom_position flag
          #sort
          gunzip -c AllChrom.exome.snp.pool.new.annotation.gz |
              awk -F '\t' '{if(substr($4,1,2)!="rs") print;}' |
              awk -F '\t' '{if(index($19,"douteux")==0) print;}' |
              awk -F '\t' '{if(index($19,"_het")!=0) print;}' |
              awk -F '\t' '{if(index($26,"nonsense")!=0 || index($26,"missense")!=0) print;}' |
              cut -d $'\t' -f 1-27 |
              awk -F '\t' '{printf("%s_%s\t%s\n",$2,$1,$0);}' |
              sort -t $'\t' -k1,1 > _jeter1.txt

          #get all distinct chrom_pos in file
          cut -d $'\t' -f 1 _jeter1.txt | sort -t $'\t' -k1,1 | uniq > _jeter2.txt

          #extract wild exome
          #keep chrom,position
          #add chrom_position flag
          #sort
          gunzip -c AllChrom.exome.snp.u2437.new.annotation.gz |
              cut -d $'\t' -f 1,2 |
              awk -F '\t' '{printf("%s_%s\n",$2,$1);}' |
              sort -t $'\t' -k 1,1 | uniq > _jeter3.txt

          #get [m] chrom_pos not in [+] chrom_pos set
          comm -2 -3 _jeter2.txt _jeter3.txt > _jeter4.txt

          #join uniq [m] chrom_pos & mutated data
          #remove chrom_pos
          #order by gene
          join -t $'\t' --check-order -1 1 -2 1 _jeter1.txt _jeter4.txt |
              cut -d $'\t' -f 2- |
              sort -t $'\t' -k 20 > _jeter5.txt

          #join to self using key= "gene name"
          #only keep if first mutation in same gene/chromosome and pos1< pos2
          #keep some columns
          join -t $'\t' -j 20 _jeter5.txt _jeter5.txt |
              awk -F '\t' '{if($3==$29 && int($2) < int($28) ) print;}' |
              cut -d $'\t' -f 1,2,3,20,26,28,46,52 > _jeter6.txt

          #extract gene names
          cut -d $'\t' -f 1 _jeter6.txt | sort | uniq

          rm _jeter[12345].txt
  50. 55. Last step... http://en.wikipedia.org/wiki/File:Nobel_Prize.png
  51. 56. Thanks. Remember: you should learn how to use the Unix command line...
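
The pairing steps of the second case (duplicate the SNP table, join on gene name, keep pos(snp1) < pos(snp2)) can also be sketched as a single awk pass instead of a self-join. This is only a minimal sketch on a toy three-column layout (gene, chrom, position), not the 42-column annotation file; the function name `het_pairs` and the input layout are assumptions made for illustration.

```shell
# Hypothetical helper: list candidate compound-heterozygous SNP pairs.
# Input (stdin): tab-separated lines "gene<TAB>chrom<TAB>position",
# one already-filtered heterozygous SNP per line (toy layout, assumed).
# Output: "gene<TAB>chrom<TAB>pos1<TAB>pos2" with pos1 < pos2.
het_pairs() {
  tab=$(printf '\t')
  # group the SNPs by gene and chromosome, positions in numeric order
  sort -t "$tab" -k1,1 -k2,2 -k3,3n |
  awk -F '\t' '
    # a new gene or chromosome starts a new group
    $1 != gene || $2 != chrom { gene = $1; chrom = $2; n = 0 }
    {
      # pair the current SNP with every earlier SNP of the same group
      for (i = 1; i <= n; i++)
        if (pos[i] < $3 + 0) print gene "\t" chrom "\t" pos[i] "\t" $3
      pos[++n] = $3 + 0
    }'
}

# Toy input: three SNPs in GENE_A, a single SNP in GENE_B.
printf 'GENE_A\tchr1\t200\nGENE_B\tchr2\t50\nGENE_A\tchr1\t100\nGENE_A\tchr1\t300\n' | het_pairs
```

A gene with a single SNP produces no line, so `cut -f 1 | sort | uniq` on the output would give the candidate gene names, as in the last step of the bash version.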

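The `$N name : value` view of one record shown on the "Files" slides is handy for finding the column numbers used in the awk filters. A minimal sketch of how such a view could be produced, assuming the annotation file is tab-separated and starts with a header line (neither is stated on the slides); `show_record` is a hypothetical name.

```shell
# Hypothetical helper: print the first data row of a tab-separated table
# (header on line 1, assumed) as one "$N name : value" line per column.
show_record() {
  awk -F '\t' '
    # line 1: remember the column names
    NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; ncol = NF; next }
    # line 2: print index, name and value for each column, then stop
    NR == 2 { for (i = 1; i <= ncol; i++) printf "$%d %s : %s\n", i, name[i], $i; exit }
  '
}

# Toy input with three of the columns from the slides.
printf 'Position.hg19\tchrom\tsample.ID\n142653\tchr10\tsample1\n' | show_record
```

On the real data this would be fed from something like `gunzip -c` on the annotation file, provided the header assumption holds.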