Coding & Best Practice in Programming
Why it matters so much in the NGS era
Lex Nederbragt
Norwegian Sequencing Centre and
Centre for Evolutionary and Ecological Synthesis
lex.nederbragt@ibv.uio.no
@lexnederbragt
OK
Who am I
@lexnederbragt flxlexblog.wordpress.com
How I became a bioinformatician
2007: a grant
GS FLX from Roche/454
Genome Analyzer from Solexa/Illumina
?
Let’s try them out!
Specimen
• Planktothrix rubescens NIVA CYA 98
• Cyanobacteria
• (blue-green algae)
Planktothrix
Half a million reads
Average length 260 nt
10 million reads
33 nucleotides each
Perl
Planktothrix
Newbler SHARCGS
Assembly
Half a million reads
Average length 260 nt
10 million reads
33 nucleotides each
Atlantic cod genome project
850 million bases (Mbp)
‘Wild-caught’
GS FLX from Roche/454
Atlantic cod genome project phase 1
Cod genome project phase 2
From Wikimedia commons, user Sagar Joshi
In summary
From flickr, user lesterpubliclibrary
Challenges in the next-generation sequencing era
High-throughput sequencing
Phase 1: more is better
Phase 2: smaller is better
Phase 3: single-molecule
Phase 4: nanopores
Democratization of sequencing
MinION
512 nanopores
150 Mb/hour
Up to 6 hours
$900
Sequencing cost
Thanks to Matt Clark (TGAC), modified from http://bit.ly/1iiajcS
[Figure: sequencing cost plot; platform labels: 454 & polony, Solexa & SOLiD, GAII, HiSeq, HiSeq X Ten]
End of the gold rush?
More more more
Data Software
Mathias Bigge, Ricordisamoa, others (wikimedia commons)
TCTCCTAACAACCCCCcACACACACACACTGGTA
CTGATGCCATTCTGCTTTACACCTATACACATCA
TATACATtATACACACACACACACACACACAACA
CTCTCCTAACCCACACACACTGGTACAGATGCCA
GTCTGCTTAACACCTACGCACGTATTATACACAC
ACACACACAACGCTCTCCTAACCCACACACACAC
CAGTCTGCTTTAAACCTACACACATATTATACAA
ACGAGTTGGTGACGTAAGGTTGATAAGGGATATT
GGTAAGGGTTAAGGGTAGGGTTGGTGTTAGGGGC
AAGGGTTAGGGTTAGTGTAAGGGGTAAGGGTTAG
TGTAaGGAGTAAGGGTTAGTGTAAGGGGTTAGTG
TTATTGTAAGGGGCTAGTGTTAGTGTTAGTGTTC
AGGGTTAGTGTTAGGGGTAGGGTTAATgTTTAGG
GTAATGTTTAGGGTTAGGGGTATGGGTTAGTGCT
AGGGGTCAGGGTTAGTGTTAGGGTTAGACAACCC
ACCTGAGAGAACCAGTGCGATGCCGCCGCAGGCG
TTGGGCGAGGACATGGAGGTGCCGTTCATCAGCT
GGGTCCCCCGGAGGGTCCAGTTGGGGACGGAGGC
GATGGCTCCCCCCGGAGCGCTGATGCTGACCCCC
AGGGCGCCGTCGATGCTGGGTCCCCGAGACGACC
AGGTGTACTGGTTGGCCGGGAGCTTCTCCCTCAG
GGAGTACTCCGCCACCATCATGTCGGGGGTCACG
TAGGCCCCAACCCCTGGGGACAGACGGAGCGCGT
TACACACCTCAACCCCTTACCCTCGGAGCCTACA
Software
Constant stream of new software
http://wwwdev.ebi.ac.uk/fg/hts_mappers
88 short-read mappers
Software
Constant stream of new software
http://neidetcher.com/ubuntu_package_dependency.html
Installation
Judging quality
Wikimedia commons, user Thebestofall007
Do we need to be worried?
Self-taught bioinformaticians
ACCCCCcACACACACACACTGGTACTGATGCC
ACACCTATACACATCATATACATtATACACAC
ACACAACACTCTCCTAACCCACACACACTGGT
GTCTGCTTAACACCTACGCACGTATTATACAC
AACGCTCTCCTAACCCACACACACACCAGTCT
TACACACATATTATACAAACGAGTTGGTGACG
AAGGGATATTGGTAAGGGTTAAGGGTAGGGTT
GCAAGGGTTAGGGTTAGTGTAAGGGGTAAGGG
GAGTAAGGGTTAGTGTAAGGGGTTAGTGTTAT
TAGTGTTAGTGTTAGTGTTCAGGGTTAGTGTT
TTAATgTTTAGGGTAATGTTTAGGGTTAGGGG
TGCTAGGGGTCAGGGTTAGTGTTAGGGTTAGA
GAGAGAACCAGTGCGATGCCGCCGCAGGCGTT
ATGGAGGTGCCGTTCATCAGCTGGGTCCCCCG
TTGGGGACGGAGGCGATGGCTCCCCCCGGAGC
ACCCCCAGGGCGCCGTCGATGCTGGGTCCCCG
GTGTACTGGTTGGCCGGGAGCTTCTCCCTCAG
GCCACCATCATGTCGGGGGTCACGTAGGCCCC
GACAGACGGAGCGCGTTACACACCTCAACCCC
AGCCTACATAACCCAACCCTCTGGAGACGGCA
AGTCAGAAATAGaGCTGACCGATTCATCAAAT
lots of data
lots of software
recipe for disaster?
Correctness of results
http://www.it.bton.ac.uk/staff/je/java/jewl/tutorial/tutorial.html
Reproducibility
doi:10.1038/sj.embor.7401143
A reproducibility crisis?
Reproducibility and reusability
http://upload.wikimedia.org/wikipedia/commons/4/48/Recycle.jpg
What it boils down to
My (given) title
Coding & Best Practice in Programming
Why it matters so much in the NGS era
Why it matters so much in science
Next-generation sequencing specific?
Diagnostic sequencing
Wikimedia commons, user Bill Branson
Diagnostic sequencing
Solutions
Flickr: http://farm4.staticflickr.com/3319/3265787219_bfbc654b5e_o.jpg Wikimedia commons
Best practices
doi:10.1371/journal.pbio.1001745
Best practices
Automate repetitive tasks
Wikimedia commons, user Pzucchel
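As an illustration of automating a repetitive task, here is a minimal Python sketch that computes the same summary for every FASTA file in a directory instead of inspecting each file by hand. The function names (fasta_stats, summarize_directory) and the "*.fasta" file pattern are assumptions made for this example; they do not come from the talk.

```python
from pathlib import Path


def fasta_stats(path):
    """Return (number of sequences, total bases) for one FASTA file."""
    n_seqs = 0
    n_bases = 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # Header line: marks the start of a new sequence.
                n_seqs += 1
            else:
                # Sequence line (blank lines add zero bases).
                n_bases += len(line)
    return n_seqs, n_bases


def summarize_directory(directory, pattern="*.fasta"):
    """Apply fasta_stats to every matching file, in sorted order."""
    files = sorted(Path(directory).glob(pattern))
    return {p.name: fasta_stats(p) for p in files}
```

Calling summarize_directory("reads") would return a dictionary mapping each file name to its sequence count and base count, so the same check can be rerun unchanged whenever new data arrive.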
Best practices
Coding style, variable naming, etc.
def test_seq(seq): ...          # vague: tests what?
def sequence_is_DNA(seq): ...   # descriptive: says what is checked
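A possible body for the descriptively named function could look like the sketch below. The parameter name, the accepted alphabet (A, C, G, T and N, case-insensitive) and the treatment of the empty string are assumptions for illustration, not details from the slide.

```python
def sequence_is_DNA(sequence):
    """Return True if sequence is non-empty and contains only A, C, G, T or N."""
    allowed_bases = set("ACGTN")  # assumed alphabet; N stands for an unknown base
    return len(sequence) > 0 and set(sequence.upper()) <= allowed_bases
```

With a name like this, a call such as `if sequence_is_DNA(read):` reads almost like a sentence, which a name like test_seq does not.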
Best practices
Use version control
https://www.atlassian.com/git/workflows
Best practices
From my own work:
$ cd scripts
$ ls
blat_parse4.pl old_versions snps_flanks_2_fastq.pl
$ ls old_versions/
blat_parse2.pl blat_parse_attemp1.pl
blat_parse.pl.bak blat_parse.pl
blat_parse3_backup.pl
blat_parse3.pl
Best practices
test, test, test
def test_zero():
    assert run_the_function(0) == 0
assert x > 0, "cannot handle negative numbers"
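The two ideas on this slide, small unit tests and defensive assertions inside the code, can be combined as in the following sketch. The function gc_fraction and its tests are hypothetical examples chosen for this illustration; they stand in for the slide's run_the_function.

```python
def gc_fraction(sequence):
    """Return the fraction of G and C bases in a DNA sequence."""
    # Defensive assertion: fail early, with a message, on input we cannot handle.
    assert len(sequence) > 0, "cannot handle empty sequences"
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def test_no_gc():
    assert gc_fraction("ATAT") == 0


def test_all_gc():
    assert gc_fraction("GGCC") == 1


def test_half_gc():
    assert gc_fraction("ACGT") == 0.5


def test_empty_sequence_is_rejected():
    # The defensive assertion should fire for an empty sequence.
    try:
        gc_fraction("")
    except AssertionError:
        return
    raise RuntimeError("expected an AssertionError for an empty sequence")
```

Functions named test_* in a file like this can be collected and run by a test runner such as pytest, so the whole set is rechecked after every change.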
Best practices
Document well
Best practices
Collaborate
http://howdoitradestocks.com/wp-content/uploads/2011/12/share-ideas1.jpg
khmer, a 'case study'
khmer
Crusoe et al. doi: 10.6084/m9.figshare.979190
Michael Crusoe
Titus Brown
khmer
https://github.com/ged-lab/2013-paper-ssspe
khmer
Integrated code coverage analysis
The “GitHub Flow” model of code review
Semantic versioning
Continuous integration
Integration and acceptance testing
Beyond best coding practices
Benchmarks
http://assemblathon.org/
Benchmarks
http://www.genome.org/cgi/doi/10.1101/gr.131383.111
Benchmarks
http://www.genomeinabottle.org/
~8300 vials (10 µg each) of DNA for NA12878
(Assembly) validation
Assembly
doi:10.1186/1471-2105-15-126
Reproducibility ‘platforms’
usegalaxy.org
taverna.org.uk/
pythonhosted.org/Sumatra/
Action points
Attend a Software Carpentry Boot Camp
http://software-carpentry.org/
Action points
Look for signs of best practice
Action points
Look for signs of best practice
during peer review
nature.com
Action points
Benchmarking/validation
Action points
Develop (under)graduate curriculum
My goal today
Flickr: http://farm4.staticflickr.com/3319/3265787219_bfbc654b5e_o.jpg

Editor's Notes

  • #40 Pre-merge code reviews, pair programming, issue tracking
  • #48 develop the reference materials, reference data, and reference methods needed to assess performance of human genome sequencing.