Coding & Best Practice in Programming in the NGS era

A talk I gave at the SeqAhead Scientific Meeting 2014 "NGS Data after the Gold Rush", May 7th 2014

Published in: Science, Technology, Business
  • pre-merge code reviews, pair programming, issue tracking
  • develop the reference materials, reference data, and reference methods needed to assess performance of human genome sequencing.

    1. 1. Coding & Best Practice in Programming Why it matters so much in the NGS era Lex Nederbragt Norwegian Sequencing Centre and Centre for Evolutionary and Ecological Synthesis lex.nederbragt@ibv.uio.no @lexnederbragt
    2. 2. Who am I @lexnederbragt flxlexblog.wordpress.com
    3. 3. How I became a bioinformatician
    4. 4. 2007: a grant GS FLX from Roche/454 Genome Analyzer from Solexa/Illumina ? Let’s try them out!
    5. 5. Specimen • Planktothrix rubescens NIVA CYA 98 • Cyanobacteria • (blue-green algae)
    6. 6. Planktothrix Half a million reads Average length 260 nt 10 million reads 33 nucleotides each Perl
    7. 7. Planktothrix Newbler SHARCGS Assembly Half a million reads Average length 260 nt 10 million reads 33 nucleotides each
    8. 8. Atlantic cod genome project 850 million bases (Mbp) ‘Wild-caught’ GS FLX from Roche/454
    9. 9. Atlantic cod genome project phase 1
    10. 10. Cod genome project phase 2 From Wikimedia commons, user Sagar Joshi
    11. 11. In summary From flickr, user lesterpubliclibrary
    12. 12. Challenges in the next-generation sequencing era
    13. 13. High-throughput sequencing Phase 1: more is better Phase 2: smaller is better Phase 3: single-molecule Phase 4: nanopores
    14. 14. Democratization of sequencing MinION 512 nanopores 150 Mb/hour Up to 6 hours $900
    15. 15. Sequencing cost Thanks to Matt Clark (TGAC), modified from http://bit.ly/1iiajcS 454 & polony Solexa & SOLiD HiSeq HiSeq X Ten GAII End of the gold rush?
    16. 16. More more more Data Software Mathias Bigge, Ricordisamoa, others (wikimedia commons) TCTCCTAACAACCCCCcACACACACACACTGGTA CTGATGCCATTCTGCTTTACACCTATACACATCA TATACATtATACACACACACACACACACACAACA CTCTCCTAACCCACACACACTGGTACAGATGCCA GTCTGCTTAACACCTACGCACGTATTATACACAC ACACACACAACGCTCTCCTAACCCACACACACAC CAGTCTGCTTTAAACCTACACACATATTATACAA ACGAGTTGGTGACGTAAGGTTGATAAGGGATATT GGTAAGGGTTAAGGGTAGGGTTGGTGTTAGGGGC AAGGGTTAGGGTTAGTGTAAGGGGTAAGGGTTAG TGTAaGGAGTAAGGGTTAGTGTAAGGGGTTAGTG TTATTGTAAGGGGCTAGTGTTAGTGTTAGTGTTC AGGGTTAGTGTTAGGGGTAGGGTTAATgTTTAGG GTAATGTTTAGGGTTAGGGGTATGGGTTAGTGCT AGGGGTCAGGGTTAGTGTTAGGGTTAGACAACCC ACCTGAGAGAACCAGTGCGATGCCGCCGCAGGCG TTGGGCGAGGACATGGAGGTGCCGTTCATCAGCT GGGTCCCCCGGAGGGTCCAGTTGGGGACGGAGGC GATGGCTCCCCCCGGAGCGCTGATGCTGACCCCC AGGGCGCCGTCGATGCTGGGTCCCCGAGACGACC AGGTGTACTGGTTGGCCGGGAGCTTCTCCCTCAG GGAGTACTCCGCCACCATCATGTCGGGGGTCACG TAGGCCCCAACCCCTGGGGACAGACGGAGCGCGT TACACACCTCAACCCCTTACCCTCGGAGCCTACA
    17. 17. Software Constant stream of new software http://wwwdev.ebi.ac.uk/fg/hts_mappers 88 short-read mappers
    18. 18. Software Constant stream of new software http://neidetcher.com/ubuntu_package_dependency.html Installation Judging quality Wikimedia commons, user Thebestofall007
    19. 19. Do we need to be worried?
    20. 20. Do we need to be worried? Self-taught bioinformaticians ACCCCCcACACACACACACTGGTACTGATGCC ACACCTATACACATCATATACATtATACACAC ACACAACACTCTCCTAACCCACACACACTGGT GTCTGCTTAACACCTACGCACGTATTATACAC AACGCTCTCCTAACCCACACACACACCAGTCT TACACACATATTATACAAACGAGTTGGTGACG AAGGGATATTGGTAAGGGTTAAGGGTAGGGTT GCAAGGGTTAGGGTTAGTGTAAGGGGTAAGGG GAGTAAGGGTTAGTGTAAGGGGTTAGTGTTAT TAGTGTTAGTGTTAGTGTTCAGGGTTAGTGTT TTAATgTTTAGGGTAATGTTTAGGGTTAGGGG TGCTAGGGGTCAGGGTTAGTGTTAGGGTTAGA GAGAGAACCAGTGCGATGCCGCCGCAGGCGTT ATGGAGGTGCCGTTCATCAGCTGGGTCCCCCG TTGGGGACGGAGGCGATGGCTCCCCCCGGAGC ACCCCCAGGGCGCCGTCGATGCTGGGTCCCCG GTGTACTGGTTGGCCGGGAGCTTCTCCCTCAG GCCACCATCATGTCGGGGGTCACGTAGGCCCC GACAGACGGAGCGCGTTACACACCTCAACCCC AGCCTACATAACCCAACCCTCTGGAGACGGCA AGTCAGAAATAGaGCTGACCGATTCATCAAAT lots of data, lots of software: recipe for disaster?
    21. 21. Correctness of results http://www.it.bton.ac.uk/staff/je/java/jewl/tutorial/tutorial.html
    22. 22. Reproducibility doi:10.1038/sj.embor.7401143 A reproducibility crisis?
    23. 23. Reproducibility and reusability http://upload.wikimedia.org/wikipedia/commons/4/48/Recycle.jpg
    24. 24. What it boils down to
    25. 25. My (given) title Coding & Best Practice in Programming Why it matters so much in the NGS era Why it matters so much in science Next-generation sequencing specific?
    26. 26. Diagnostic sequencing Wikimedia commons, user Bill Branson
    27. 27. Diagnostic sequencing
    28. 28. Diagnostic sequencing
    29. 29. Solutions
    30. 30. Solutions Flickr: http://farm4.staticflickr.com/3319/3265787219_bfbc654b5e_o.jpg Wikimedia commons
    31. 31. Best practices doi:10.1371/journal.pbio.1001745
    32. 32. Best practices Automate repetitive tasks Wikimedia commons, user Pzucchel
    33. 33. Best practices Coding styles, variable naming, etc. def test_seq(): def sequence_is_DNA():
    34. 34. Best practices Use version control https://www.atlassian.com/git/workflows
    35. 35. Best practices From my own work: $ cd scripts $ ls blat_parse4.pl old_versions snps_flanks_2_fastq.pl $ ls old_versions/ blat_parse2.pl blat_parse_attemp1.pl blat_parse.pl.bak blat_parse.pl blat_parse3_backup.pl blat_parse3.pl
    36. 36. Best practices test, test, test def test_zero(): assert run_the_function(0) == 0 assert x > 0, "cannot handle negative numbers"
    37. 37. Best practices Document well
    38. 38. Best practices Collaborate http://howdoitradestocks.com/wp-content/uploads/2011/12/share-ideas1.jpg
    39. 39. khmer, a 'case study'
    40. 40. khmer Crusoe et al. doi:10.6084/m9.figshare.979190 Michael Crusoe Titus Brown
    41. 41. khmer https://github.com/ged-lab/2013-paper-ssspe
    42. 42. khmer Integrated code coverage analysis The “GitHub Flow” model of code review Semantic versioning Continuous integration Integration and acceptance testing
    43. 43. Beyond best coding practices
    44. 44. Benchmarks http://assemblathon.org/
    45. 45. Benchmarks http://www.genome.org/cgi/doi/10.1101/gr.131383.111
    46. 46. Benchmarks http://www.genomeinabottle.org/ ~8300 10 µg vials of DNA for NA12878
    47. 47. (Assembly) validation
    48. 48. (Assembly) validation Assembly doi:10.1186/1471-2105-15-126
    49. 49. Reproducibility ‘platforms’ usegalaxy.org taverna.org.uk/ pythonhosted.org/Sumatra/
    50. 50. Action points
    51. 51. Action points Attend a Software Carpentry Boot Camp http://software-carpentry.org/
    52. 52. Action points Look for signs of best practice
    53. 53. Action points Look for signs of best practice during peer review nature.com
    54. 54. Action points Benchmarking/validation
    55. 55. Action points Develop (under)graduate curriculum
    56. 56. My goal today Flickr: http://farm4.staticflickr.com/3319/3265787219_bfbc654b5e_o.jpg
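
Slides 33 and 36 give only fragments of code (a descriptive function name, a unit test, and a defensive assertion). The sketch below is one way those three ideas might fit together. It is not taken from the talk: the body of sequence_is_dna, the helper gc_fraction, and the choice of allowed bases (A, C, G, T, N) are all assumptions made for illustration.

```python
def sequence_is_dna(sequence):
    """Return True if the sequence is non-empty and uses only A, C, G, T or N.

    The name states what is checked, unlike a vague name such as test_seq.
    The allowed alphabet is an assumption made for this example.
    """
    allowed = set("ACGTN")
    return len(sequence) > 0 and set(sequence.upper()) <= allowed


def gc_fraction(sequence):
    """Return the fraction of G and C bases (hypothetical helper)."""
    # Defensive assertion: stop with a clear message rather than
    # silently computing a number from invalid input.
    assert sequence_is_dna(sequence), "cannot handle non-DNA input"
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


# Small unit tests, one behaviour per test, each with a descriptive name.
def test_sequence_is_dna_accepts_lowercase():
    assert sequence_is_dna("acgtn")


def test_sequence_is_dna_rejects_protein():
    assert not sequence_is_dna("MKVLA")


def test_gc_fraction_of_balanced_sequence():
    assert gc_fraction("ACGT") == 0.5
```

Functions named test_* in this style can be collected and run by a test runner such as pytest, so the checks are repeated automatically every time the code changes.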
