Coding & Best Practice in Programming in the NGS era

A talk I gave at the SeqAhead Scientific Meeting 2014 "NGS Data after the Gold Rush", May 7th 2014

Published in: Science, Technology, Business

  1. 1. Coding & Best Practice in Programming: Why it matters so much in the NGS era. Lex Nederbragt, Norwegian Sequencing Centre and Centre for Evolutionary and Ecological Synthesis. lex.nederbragt@ibv.uio.no, @lexnederbragt
  2. 2. Who am I @lexnederbragt flxlexblog.wordpress.com
  3. 3. How I became a bioinformatician
  4. 4. 2007: a grant. GS FLX from Roche/454, Genome Analyzer from Solexa/Illumina? Let’s try them out!
  5. 5. Specimen: Planktothrix rubescens NIVA CYA 98; Cyanobacteria (blue-green algae)
  6. 6. Planktothrix: half a million reads, average length 260 nt; 10 million reads, 33 nucleotides each; Perl
  7. 7. Planktothrix assembly (Newbler, SHARCGS): half a million reads, average length 260 nt; 10 million reads, 33 nucleotides each
  8. 8. Atlantic cod genome project: 850 million bases (Mbp); ‘wild-caught’; GS FLX from Roche/454
  9. 9. Atlantic cod genome project phase 1
  10. 10. Cod genome project phase 2 From Wikimedia commons, user Sagar Joshi
  11. 11. In summary From flickr, user lesterpubliclibrary
  12. 12. Challenges in the next-generation sequencing era
  13. 13. High-throughput sequencing. Phase 1: more is better. Phase 2: smaller is better. Phase 3: single-molecule. Phase 4: nanopores.
  14. 14. Democratization of sequencing. MinION: 512 nanopores, 150 Mb/hour, up to 6 hours, $900
  15. 15. Sequencing cost [figure; platform labels: 454 & polony, Solexa & SOLiD, HiSeq, HiSeq X Ten, GAII]. End of the gold rush? Thanks to Matt Clark (TGAC), modified from http://bit.ly/1iiajcS
  16. 16. More more more: Data, Software [background image: DNA sequence]. Image credits: Mathias Bigge, Ricordisamoa, others (Wikimedia Commons)
  17. 17. Software: constant stream of new software. 88 short-read mappers (http://wwwdev.ebi.ac.uk/fg/hts_mappers)
  18. 18. Software: constant stream of new software. Installation; judging quality. Images: http://neidetcher.com/ubuntu_package_dependency.html; Wikimedia Commons, user Thebestofall007
  19. 19. Do we need to be worried?
  20. 20. Do we need to be worried? Self-taught bioinformaticians, lots of data, lots of software: recipe for disaster? [background image: DNA sequence]
  21. 21. Correctness of results http://www.it.bton.ac.uk/staff/je/java/jewl/tutorial/tutorial.html
  22. 22. Reproducibility doi:10.1038/sj.embor.7401143 A reproducibility crisis?
  23. 23. Reproducibility and reusability http://upload.wikimedia.org/wikipedia/commons/4/48/Recycle.jpg
  24. 24. What it boils down to
  25. 25. My (given) title: Coding & Best Practice in Programming: Why it matters so much in the NGS era. Or: why it matters so much in science. Next-generation sequencing specific?
  26. 26. Diagnostic sequencing Wikimedia commons, user Bill Branson
  27. 27. Diagnostic sequencing
  28. 28. Diagnostic sequencing
  29. 29. Solutions
  30. 30. Solutions Flickr: http://farm4.staticflickr.com/3319/3265787219_bfbc654b5e_o.jpg Wikimedia commons
  31. 31. Best practices 10.1371/journal.pbio.1001745
  32. 32. Best practices Automate repetitive tasks Wikimedia commons, user Pzucchel
  33. 33. Best practices: coding styles, variable naming etc. Compare def test_seq(): with def sequence_is_DNA():
  34. 34. Best practices Use version control https://www.atlassian.com/git/workflows
  35. 35. Best practices. From my own work:
        $ cd scripts
        $ ls
        blat_parse4.pl  old_versions  snps_flanks_2_fastq.pl
        $ ls old_versions/
        blat_parse2.pl  blat_parse_attemp1.pl  blat_parse.pl.bak  blat_parse.pl  blat_parse3_backup.pl  blat_parse3.pl
  36. 36. Best practices: test, test, test. def test_zero(): assert run_the_function(0) == 0; assert x > 0, "cannot handle negative numbers"
  37. 37. Best practices Document well
  38. 38. Best practices Collaborate http://howdoitradestocks.com/wp-content/uploads/2011/12/share-ideas1.jpg
  39. 39. khmer, a 'case study'
  40. 40. khmer. Crusoe et al. doi: 10.6084/m9.figshare.979190. Michael Crusoe, Titus Brown
  41. 41. khmer https://github.com/ged-lab/2013-paper-ssspe
  42. 42. khmer: integrated code coverage analysis; the “GitHub Flow” model of code review; semantic versioning; continuous integration; integration and acceptance testing
  43. 43. Beyond best coding practices
  44. 44. Benchmarks http://assemblathon.org/
  45. 45. Benchmarks http://www.genome.org/cgi/doi/10.1101/gr.131383.111
  46. 46. Benchmarks http://www.genomeinabottle.org/ ~8300 vials of 10 µg DNA for NA12878
  47. 47. (Assembly) validation
  48. 48. (Assembly) validation Assembly doi:10.1186/1471-2105-15-126
  49. 49. Reproducibility ‘platforms’: usegalaxy.org, taverna.org.uk/, pythonhosted.org/Sumatra/
  50. 50. Action points
  51. 51. Action points: attend a Software Carpentry boot camp http://software-carpentry.org/
  52. 52. Action points Look for signs of best practice
  53. 53. Action points Look for signs of best practice during peer review nature.com
  54. 54. Action points Benchmarking/validation
  55. 55. Action points Develop (under)graduate curriculum
  56. 56. My goal today Flickr: http://farm4.staticflickr.com/3319/3265787219_bfbc654b5e_o.jpg
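
The naming and testing advice on slides 33 and 36 can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function gc_fraction, the choice of allowed letters (A, C, G, T, N) and the test names are assumptions made for the example. It shows a descriptively named check (sequence_is_dna), an assertion that guards a function's input, and small tests that pin down expected behaviour.

```python
def sequence_is_dna(sequence):
    """Return True if the sequence is non-empty and uses only A, C, G, T or N.

    The allowed alphabet is an assumption made for this example.
    """
    allowed = set("ACGTN")
    return len(sequence) > 0 and set(sequence.upper()) <= allowed


def gc_fraction(sequence):
    """Return the fraction of G and C bases in a DNA sequence (hypothetical helper)."""
    # Defensive assertion: fail early, with a message, on input we cannot handle.
    assert sequence_is_dna(sequence), "input is not a DNA sequence"
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def test_sequence_is_dna():
    assert sequence_is_dna("ACGTacgtN")
    assert not sequence_is_dna("ACGU")  # U is RNA, not DNA
    assert not sequence_is_dna("")


def test_gc_fraction():
    assert gc_fraction("GGCC") == 1.0
    assert gc_fraction("ATAT") == 0.0
    assert gc_fraction("ACGT") == 0.5


if __name__ == "__main__":
    test_sequence_is_dna()
    test_gc_fraction()
    print("all tests passed")
```

A test runner such as pytest would discover the test_ functions automatically; running the file directly works as well.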

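Slide 42 lists semantic versioning among the practices used by khmer. As a rough sketch of the idea (not khmer's own code), the snippet below parses plain MAJOR.MINOR.PATCH strings and flags an upgrade as potentially breaking when the major number increases. The function names are hypothetical, and pre-release or build-metadata suffixes are deliberately not handled.

```python
def parse_version(version):
    """Split a 'MAJOR.MINOR.PATCH' string into a tuple of three ints.

    Assumption: only plain three-part versions are supported here.
    """
    parts = version.split(".")
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    major, minor, patch = (int(part) for part in parts)
    return (major, minor, patch)


def is_breaking_upgrade(old, new):
    """Return True if moving from old to new increases the major version."""
    return parse_version(new)[0] > parse_version(old)[0]


if __name__ == "__main__":
    # Tuples compare numerically, so 1.10.0 sorts after 1.9.3.
    print(parse_version("1.10.0") > parse_version("1.9.3"))
    print(is_breaking_upgrade("1.4.2", "2.0.0"))
    print(is_breaking_upgrade("1.4.2", "1.5.0"))
```

Comparing parsed tuples rather than raw strings avoids the common mistake of sorting "1.10.0" before "1.9.3".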