"Mon make à moi", (tout sauf Galaxy)

Video available at: http://videos.rennes.inria.fr/ReNaBI-GO2013/indexPierreLindenbaum.html

11th edition of the ReNaBI-Grand Ouest platform meetings, a day devoted to workflows in bioinformatics.



  1. 1. Mon Make à moi. Pierre Lindenbaum PhD UMR1087 – Institut du thorax @yokofakun pierre.lindenbaum@univ-nantes.fr http://plindenbaum.blogspot.com
  2. 2. http://commons.wikimedia.org/wiki/File:Calendar-leapyeardate.jpg
  3. 3. “Institut du Thorax” (INSERM/UMR1087)
  4. 4. - Align Reads - Sort BAM - Call variations - Annotation
  5. 5. Know Your Audience http://www.flickr.com/photos/9479603@N02/4143361191
  6. 6. http://www.biostars.org/p/50034/
  7. 7. - Align Reads - Sort BAM - Call variations - Annotation
  8. 8. http://www.flickr.com/photos/eole/380316678/
  9. 9. http://www.flickr.com/photos/dannywartnaby/78944310/
  10. 10. http://upload.wikimedia.org/wikipedia/commons/thumb/5/50/KL_AMD_Athlon_64_X2_Brisbane.jpg/602px-KL_AMD_Athlon_64_X2_Brisbane.jpg
  11. 11. void *fun( void *ptr );
          pthread_create( &thread1, NULL, fun, (void *) &data1 );
  12. 12. Usage:   bwa aln [options] <prefix> <in.fq>
          Options: -o INT    maximum number or fraction of gap opens [1]
                   -i INT    do not put an indel within INT bp towards the ends [5]
                   -d INT    maximum occurrences for extending a long deletion [10]
                   -l INT    seed length [32]
                   -k INT    maximum differences in the seed [2]
                   -m INT    maximum entries in the queue [2000000]
                   -t INT    number of threads [1]
                   -M INT    mismatch penalty [3]
                   -O INT    gap open penalty [11]
                   -E INT    gap extension penalty [4]
                   -R INT    stop searching when there are >INT equally best hits [30]
                   -q INT    quality threshold for read trimming down to 35bp [0]
                   -f FILE   file to write output to instead of stdout
                   -B INT    length of barcode
                   -L        log-scaled gap penalty for long deletions
  13. 13. http://www.flickr.com/photos/eole/380316678/
  14. 14. GNU PARALLEL http://www.gnu.org/software/parallel/
  15. 15. GNU parallel is a shell tool for executing jobs in parallel using one or more computers.
  16. 16. GNU parallel can often be used as a substitute for xargs or cat | bash.
  17. 17. 01F.fastq.gz 01R.fastq.gz 02F.fastq.gz 02R.fastq.gz toy.fa ex1.fa
  18. 18. $ parallel bwa aln -f {1//}/{2/.}_{1/.}.sai {2} {1} ::: examples/01_F.fastq.gz examples/01_R.fastq.gz examples/02_F.fastq.gz examples/02_R.fastq.gz ::: examples/toy.fa examples/ex1.fa
  19. 19. bwa aln -f examples/toy_01_F.fastq.sai examples/toy.fa examples/01_F.fastq.gz
          bwa aln -f examples/ex1_01_F.fastq.sai examples/ex1.fa examples/01_F.fastq.gz
          bwa aln -f examples/toy_01_R.fastq.sai examples/toy.fa examples/01_R.fastq.gz
          bwa aln -f examples/ex1_01_R.fastq.sai examples/ex1.fa examples/01_R.fastq.gz
          bwa aln -f examples/toy_02_F.fastq.sai examples/toy.fa examples/02_F.fastq.gz
          bwa aln -f examples/ex1_02_F.fastq.sai examples/ex1.fa examples/02_F.fastq.gz
          bwa aln -f examples/toy_02_R.fastq.sai examples/toy.fa examples/02_R.fastq.gz
          bwa aln -f examples/ex1_02_R.fastq.sai examples/ex1.fa examples/02_R.fastq.gz
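The output names listed above follow from GNU parallel's replacement strings: `{//}` is the directory of an argument and `{/.}` is its basename without the last extension. The sketch below reproduces the first generated command with plain shell parameter expansion. The helper names `dir_of` and `base_noext` are hypothetical and are not part of GNU parallel.

```shell
# Hypothetical helpers that mimic two GNU parallel replacement strings.
dir_of() {          # like {//}: directory part of the path
    dirname "$1"
}
base_noext() {      # like {/.}: basename with the last extension removed
    b=${1##*/}
    printf '%s\n' "${b%.*}"
}

fq=examples/01_F.fastq.gz   # plays the role of argument {1}
ref=examples/toy.fa         # plays the role of argument {2}

# Same name that {1//}/{2/.}_{1/.}.sai expands to
sai="$(dir_of "$fq")/$(base_noext "$ref")_$(base_noext "$fq").sai"
echo "bwa aln -f $sai $ref $fq"
```

Only the last extension is stripped, which is why `.fastq` stays in the `.sai` file name.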
  20. 20. 1.fastq.gz 2.fastq.gz 3.fastq.gz (...) 9999.fastq.gz toy.fa
  21. 21. $ find ./ -name "*.fastq.gz" | parallel --verbose bwa aln -f {/.}.sai toy.fa {}
  22. 22. 1F.fastq.gz 1R.fastq.gz 2F.fastq.gz 2R.fastq.gz (...) 9999F.fastq.gz 9999R.fastq.gz toy.fa
  23. 23. $ find examples/ -name "*.fastq.gz" | sort | paste -- - - | parallel --colsep '\t' bwa mem examples/ex1.fa {1} {2} ">" {1//}/{1/.}_{2/.}.sam
          bwa mem examples/ex1.fa examples/01_F.fastq.gz examples/01_R.fastq.gz > examples/01_F.fastq_01_R.fastq.sam
          bwa mem examples/ex1.fa examples/02_F.fastq.gz examples/02_R.fastq.gz > examples/02_F.fastq_02_R.fastq.sam
          bwa mem examples/ex1.fa examples/03_F.fastq.gz examples/03_R.fastq.gz > examples/03_F.fastq_03_R.fastq.sam
          bwa mem examples/ex1.fa examples/04_F.fastq.gz examples/04_R.fastq.gz > examples/04_F.fastq_04_R.fastq.sam
          bwa mem examples/ex1.fa examples/05_F.fastq.gz examples/05_R.fastq.gz > examples/05_F.fastq_05_R.fastq.sam
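The pairing step in the pipeline above relies on `sort` placing each forward file right before its reverse mate, and on `paste - -` joining every two consecutive lines with a tab. A minimal sketch with empty placeholder files, assuming the `NN_F` / `NN_R` naming seen in the output (the temporary directory and file names are made up for the illustration):

```shell
# Build a throwaway directory with paired FASTQ names (empty placeholder files).
tmp=$(mktemp -d)
for i in 01 02 03; do
    : > "$tmp/${i}_F.fastq.gz"
    : > "$tmp/${i}_R.fastq.gz"
done

# sort puts each _F file right before its _R mate;
# 'paste - -' then joins every two consecutive lines with a tab.
pairs=$(find "$tmp" -name "*.fastq.gz" | sort | paste - -)

# Print one pair per line, without the temporary directory prefix.
printf '%s\n' "$pairs" | sed "s|$tmp/||g"
rm -rf "$tmp"
```

Each resulting line has two tab-separated columns, which is what `--colsep '\t'` splits into `{1}` and `{2}`.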
  24. 24. * copy the BAMs to the remote server,
          * print the working directory,
          * sort the BAMs with samtools,
          * fetch the sorted BAM,
          * clean up on the server side,
          * use the '~/tmp/' directory.
          $ parallel --workdir /home/user/tmp -S user@host --trc {/.}_s.bam pwd "&&" samtools sort {} {/.}_s ::: file1.bam file2.bam file3.bam
  25. 25. http://www.flickr.com/photos/eole/380316678/
  26. 26. BUILD AUTOMATION
  27. 27. “software that automatically builds files, often executables or libraries, from basic elements such as source code”
  28. 28. “software that automatically builds files, often executables or libraries, from basic elements such as source code”
  29. 29. MOPPE mini chest of drawers, birch plywood
  30. 30. http://www.ruffus.org.uk/
  31. 31. https://bitbucket.org/johanneskoester/snakemake/wiki/Home
  32. 32. rule complex_conversion:
              input:
                  "{dataset}/inputfile"
              output:
                  "{dataset}/file.{group}.txt"
              shell:
                  "somecommand --group {wildcards.group} < {input} > {output}"
  33. 33. 1977
  34. 34. TARGET : DEPENDENCY STATEMENTS
  35. 35.  :  bobthebuilder >
  36. 36. seq1.rna : seq1.dna
              tr "T" "U" < seq1.dna > seq1.rna
  37. 37. all.rna : seq1.rna seq2.rna
              cat seq1.rna seq2.rna > all.rna
          seq1.rna : seq1.dna
              tr "T" "U" < seq1.dna > seq1.rna
          seq2.rna : seq2.dna
              tr "T" "U" < seq2.dna > seq2.rna
  38. 38. $ make
          tr "T" "U" < seq1.dna > seq1.rna
          tr "T" "U" < seq2.dna > seq2.rna
          cat seq1.rna seq2.rna > all.rna
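The two `.rna` rules differ only by file name, so in GNU make they could be collapsed into one pattern rule. The sketch below is an illustration and not taken from the slides: it writes such a Makefile into a temporary directory (the sequences `ATGC` and `TTAA` are made-up inputs) and runs `make` only if it is installed. It uses GNU make features: the `%` pattern and the automatic variables `$<`, `$^` and `$@`.

```shell
# Sketch: write a Makefile that uses a GNU make pattern rule, then run it.
work=$(mktemp -d)
cd "$work" || exit 1
echo "ATGC" > seq1.dna      # made-up input sequences
echo "TTAA" > seq2.dna

# Recipe lines must start with a TAB; printf '\t' makes that explicit.
{
    printf 'all.rna: seq1.rna seq2.rna\n'
    printf '\tcat $^ > $@\n'
    printf '%%.rna: %%.dna\n'
    printf '\ttr "T" "U" < $< > $@\n'
} > Makefile

# Run make only when it is available on this machine.
if command -v make >/dev/null 2>&1; then
    make
    cat all.rna
fi
```

With this form, adding `seq3.dna` only requires listing `seq3.rna` as a dependency of `all.rna`.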
  39. 39. v
  40. 40. $ rm seq1.rna
          $ make
          tr "T" "U" < seq1.dna > seq1.rna
          cat seq1.rna seq2.rna > all.rna
  41. 41. $ touch seq1.dna
          $ make
          tr "T" "U" < seq1.dna > seq1.rna
          cat seq1.rna seq2.rna > all.rna
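The two runs above illustrate how make decides what to rerun: a target is rebuilt when it is missing or when one of its dependencies has a more recent modification time. A minimal sketch of that test in shell follows. `needs_rebuild` is a hypothetical helper, not a make feature, and the `-nt` operator is an extension supported by bash, dash and most `test` implementations rather than strict POSIX.

```shell
# Sketch of the rule make applies: rebuild when the target is missing
# or older than at least one dependency.
needs_rebuild() {
    target=$1; shift
    [ -e "$target" ] || return 0               # missing target: rebuild
    for dep in "$@"; do
        [ "$dep" -nt "$target" ] && return 0   # dependency is newer: rebuild
    done
    return 1                                   # up to date
}

work=$(mktemp -d)
touch -t 202001010000 "$work/seq1.dna"   # old source file
touch -t 202001020000 "$work/seq1.rna"   # target built one day later

needs_rebuild "$work/seq1.rna" "$work/seq1.dna" && echo "rebuild" || echo "up to date"
touch "$work/seq1.dna"                   # source modified just now
needs_rebuild "$work/seq1.rna" "$work/seq1.dna" && echo "rebuild" || echo "up to date"
```

The first call reports that the target is up to date and the second one asks for a rebuild, mirroring what happens after `touch` on a source file.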
  42. 42. http://www.flickr.com/photos/eole/380316678/
  43. 43. https://github.com/lindenb/jvarkit/wiki/SplitBam
  44. 44. Option '-j' Specify number of jobs
  45. 45. http://www.flickr.com/photos/eole/380316678/
  46. 46. NAME qmake - distributed parallel make, scheduling by Sun Grid Engine. SYNTAX qmake [ options ] -- [ gmake options ] DESCRIPTION Qmake is a parallel, distributed make utility. Scheduling of the parallel make tasks is done by Sun Grid Engine. It is based on gmake (GNU make), version 3.78.1. Both Sun Grid Engine and gmake command line options can be specified. They are separated by "--".
  47. 47. queuename                      qtype resv/used/tot. load_avg arch          states
          ---------------------------------------------------------------------------------
          all.q@node01                   BIP   0/2/64         14.82    lx24-amd64
           951872 0.55500 bash       lindenb      r     11/25/2013 09:48:28     1
           953823 0.55500 bash       lindenb      r     11/25/2013 12:16:17     1
          ---------------------------------------------------------------------------------
          all.q@node02                   BIP   0/10/64        13.36    lx24-amd64
           951870 0.55500 qmake      lindenb      r     11/25/2013 09:48:13     1
           951876 0.55500 qmake      lindenb      r     11/25/2013 09:48:35     1
           953821 0.55500 bash       lindenb      r     11/25/2013 12:15:32     1
           953825 0.55500 bash       lindenb      r     11/25/2013 12:18:02     1
           953829 0.55500 bash       lindenb      r     11/25/2013 12:32:02     1
           953899 0.55500 bash       lindenb      r     11/25/2013 13:21:47     1
           953902 0.55500 bash       lindenb      r     11/25/2013 13:22:02     1
           953904 0.55500 bash       lindenb      r     11/25/2013 13:22:02     1
           953915 0.55500 bash       lindenb      r     11/25/2013 13:26:47     1
           953933 0.55500 bash       lindenb      r     11/25/2013 13:29:02     1
          ---------------------------------------------------------------------------------
          all.q@node03                   BIP   0/7/64         15.01    lx24-amd64
           951871 0.55500 bash       lindenb      r     11/25/2013 09:48:28     1
           951875 0.55500 bash       lindenb      r     11/25/2013 09:48:28     1
           953833 0.55500 bash       lindenb      r     11/25/2013 12:37:02     1
           953870 0.55500 bash       lindenb      r     11/25/2013 13:12:17     1
           953872 0.55500 bash       lindenb      r     11/25/2013 13:12:17     1
           953873 0.55500 bash       lindenb      r     11/25/2013 13:12:32     1
           953875 0.55500 bash       lindenb      r     11/25/2013 13:12:32     1
  48. 48. $ wc -l Makefile 951499
  49. 49. https://en.wikipedia.org/wiki/File:The_Scream.jpg
  50. 50. PERL ? AWK ?
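One low-tech answer to the question on this slide is to generate the Makefile from a sample sheet with awk. The sketch below is only an illustration under stated assumptions, not the approach the talk settles on: the tab-separated sheet layout, the sample and file names, the `ref.fa` reference and the `bwa mem` recipe are all hypothetical.

```shell
# Hypothetical sample sheet: sample name, forward FASTQ, reverse FASTQ (tab-separated).
sheet=$(mktemp)
printf 'Sample1\tS1_R1.fastq.gz\tS1_R2.fastq.gz\n' >  "$sheet"
printf 'Sample2\tS2_R1.fastq.gz\tS2_R2.fastq.gz\n' >> "$sheet"

# Emit an 'all' target listing every SAM file, then one rule per sample.
# Inside the awk string, $^ and $@ are literal text meant for make.
makefile=$(awk -F '\t' '
    { targets = targets " " $1 ".sam"
      rules = rules sprintf("%s.sam: %s %s\n\tbwa mem ref.fa $^ > $@\n", $1, $2, $3) }
    END { printf "all:%s\n%s", targets, rules }
' "$sheet")

printf '%s\n' "$makefile"
rm -f "$sheet"
```

This works for flat lists, but it becomes awkward once samples have several lanes and splits, which is where a structured description of the run starts to pay off.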
  51. 51. (...)
          <sample name="Sample1">
            <sequences>
              <pair lane="7" sample-index="ACGTATCA">
                <fastq index="1" path="dir/Sample1_ACGTATCA_L007_R1_001.fastq.gz"/>
                <fastq index="2" path="dir/Sample1_ACGTATCA_L007_R2_001.fastq.gz"/>
              </pair>
            </sequences>
          </sample>
          <sample name="Sample2">
            <sequences>
              <pair lane="7" sample-index="CGCATACA">
                <fastq index="1" path="dir/Sample2_CGCATACA_L007_R1_001.fastq.gz"/>
                <fastq index="2" path="dir/Sample2_CGCATACA_L007_R2_001.fastq.gz"/>
              </pair>
              <pair lane="8" sample-index="CGCATACA">
                <fastq index="1" path="dir/Sample2_CGCATACA_L008_R1_001.fastq.gz"/>
                <fastq index="2" path="dir/Sample2_CGCATACA_L008_R2_001.fastq.gz"/>
              </pair>
            </sequences>
          </sample>
          (...)
  52. 52. https://en.wikipedia.org/wiki/File:The_Scream.jpg
  53. 53. $ find . -name "*.fastq.gz" | java -jar dist/illuminadir.jar | xmllint --format
          <?xml version="1.0" encoding="UTF-8"?>
          <illumina>
            <!---L OFF-->
            <directory>
              <samples>
                <sample name="30VGKM1">
                  <pair id="p156" md5="d979e28873b8528d423476bb1d1bcf6c" lane="7" index="CTTGTA" split="2">
                    <fastq side="1" path="Sample_30VGKM1/30VGKM1_CTTGTA_L007_R1_002.fastq.gz" file-size="341099974"/>
                    <fastq side="2" path="Sample_30VGKM1/30VGKM1_CTTGTA_L007_R2_002.fastq.gz" file-size="343445605"/>
                  </pair>
                  <pair id="p157" md5="bdad27279ddc84b83def03f3cbf64b2e" lane="6" index="CTTGTA" split="2">
                    <fastq side="1" path="Sample_30VGKM1/30VGKM1_CTTGTA_L006_R1_002.fastq.gz" file-size="339183535"/>
                    <fastq side="2" path="Sample_30VGKM1/30VGKM1_CTTGTA_L006_R2_002.fastq.gz" file-size="342322377"/>
                  </pair>
                  <pair id="p158" md5="8c87db2789f900acce8c60a45e7f58d8" lane="8" index="CTTGTA" split="2">
                    <fastq side="1" path="Sample_30VGKM1/30VGKM1_CTTGTA_L008_R1_002.fastq.gz" file-size="339680334"/>
                    <fastq side="2" path="Sample_30VGKM1/30VGKM1_CTTGTA_L008_R2_002.fastq.gz" file-size="342787126"/>
                  </pair>
                  <pair id="p159" md5="8e3ba7606a9689b616c20c905d5cfcd4" lane="7" index="CTTGTA" split="3">
                    <fastq side="1" path="Sample_30VGKM1/30VGKM1_CTTGTA_L007_R1_003.fastq.gz" file-size="339814054"/>
          (...)
  54. 54. https://github.com/lindenb/jvarkit/wiki/Illuminadir
          [{"directory":"RUN62_XFC2DM8ACXX/data","samples":[{"sample":"SAMPLE1","files":[
          {"md5pair":"cd4b436ce7aff4cf669d282c6d9a7899","lane":8,"index":"ATCACG","split":2,"forward":{"md5filename":"3369c3457d6603f06379b654cb78e696","path":"20131001_SNL149_0062_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/SAMPLE1_ATCACG_L008_R1_002.fastq.gz","side":1,"file-size":359046311},"reverse":{"md5filename":"832039fa00b5f40108848e48eb437e0b","path":"20131001_SNL149_0062_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/SAMPLE1_ATCACG_L008_R2_002.fastq.gz","side":2,"file-size":359659451}},
          {"md5pair":"b3050fa3307e63ab9790b0e263c5d240","lane":8,"index":"ATCACG","split":3,"forward":{"md5filename":"091727bb6b300e463c3d708e157436ab","path":"20131001_SNL149_0062_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/SAMPLE1_ATCACG_L008_R1_003.fastq.gz","side":1,"file-size":206660736},"reverse":{"md5filename":"20235ef4ec8845515beb4e13da34b5d3","path":"20131001_SNL149_0062_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/SAMPLE1_ATCACG_L008_R2_003.fastq.gz","side":2,"file-size":206715143}},
          {"md5pair":"9f7ee49e87d01610372c43ab928939f6","lane":8,"index":"ATCACG","split":1,"forward":{"md5filename":"54cb2fd33edd5c2e787287ccf1595952","path":"20131001_SNL149_0062_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/SAMPLE1_ATCACG_L008_R1_001.fastq.gz","side":1,"file-size":354530831},"reverse":{"md5filename":"e937cbdf32020074e50d3332c67cf6b3","path":"20131001_SNL149_0062_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/SAMPLE1_ATCACG_L008_R2_001.fastq.gz","side":2,"file-size":356908963}},
          {"md5pair":"0697846a504158eef523c0f4ede85288","lane":7,"index":"ATCACG","split":2,"forward":{"md5filename":"
  55. 55. <xsl:apply-templates select="." mode="sai"/>:<xsl:apply-templates selec
              mkdir -p $(dir $@)
              @$(call timebegindb,$@,sai)
              @$(call sizedb,$&lt;)
              $(BWA) aln <xsl:choose>
                  <xsl:when test="/project/properties/property[@key='bwa.aln.option
                      <xsl:value-of select="/project/properties/property[@key='bwa.
                  </xsl:when>
                  <xsl:otherwise>
                      <!-- no bwa.aln option $(BWA.aln.options) -->
                  </xsl:otherwise>
              </xsl:choose> -f $@ ${REF} $&lt;
              @$(call timeenddb,$@,sai)
              @$(call sizedb,$@)
              @$(call notempty,$@)
  56. 56. sampleList: ["Riri", "Fifi", "Loulou", 1234]
          <ul>
          #foreach( $sample in $sampleList )
            <li>$sample</li>
          #end
          </ul>
  57. 57. #set( $mapper = "bwa" )
          #if( $mapper == "bowtie" )
            bowtie align fastq
          #elseif( $mapper == "bowtie2" )
            bowtie2 align fastq
          #else
            bwa aln fastq
          #end
  58. 58. #include( "one.txt","two.txt","three.txt" ) #parse( "one.vm" )
  59. 59. #macro( fasta $name $seq )
          >$name
          $seq
          #end
          #fasta( "EcoR1", "GAATTC" )
          #fasta( $rec.name, $rec.seq )
  60. 60. #set( $genotypers = ["samtools", "gatk", "freebayes"] )
          #foreach( $genotyper in ${genotypers} )
          $(OUTDIR)/VCF/variations.${genotyper}.vcf.gz : #foreach( $sample in ${project.sample} ) #sample_final_bam( ${sample} ) #end
            #if( $genotyper == "samtools" )
              #call_with_samtools_mpileup()
            #elseif( $genotyper == "gatk" )
              #call_with_gatk()
            #elseif( $genotyper == "freebayes" )
              #call_with_freebayes()
            #end
          #end
          #end
  61. 61. I want a GUI
  62. 62. Did we try fixing the maximum number of SNPs per gene?
          Did you include the results from Polyphen?
          Can you test with a minimal DEPTH=20?
          Did we ever find this gene?
          And with a minimal number of affected samples per SNP = 2?
          And with a minimal number of affected samples per gene = 4?
          Can you remove all the known SNPs?
  63. 63. http://www.flickr.com/photos/ohm17/162622755
  64. 64. Command line Command Line
  65. 65. “command line” A PhD candidate
  66. 66. http://en.wikipedia.org/wiki/File:Knime.jpg
  67. 67. Conclusion: Make is good.
  68. 68. make: http://www.gnu.org/software/make/
          qmake: http://gridscheduler.sourceforge.net/htmlman/htmlman1/qmake.html
          gnu-parallel: http://www.gnu.org/software/parallel/
          apache-velocity: http://velocity.apache.org/
          jvarkit: https://github.com/lindenb/jvarkit
          jsvelocity: https://github.com/lindenb/jsvelocity
          knime4bio: http://code.google.com/p/knime4bio/
  69. 69. Thanks to: Laetitia Duboscq-Bidot, Mathieu Le-Neue, Audrey Bihouée, Eric Charpentier, Edouard Hirchaud, Flogiane Simoget, Solena le Scouarnec, Raluca Teusan, Richard Redon, and the Biostars.org community
