GenoThreat / GenoGUARD -- open source biosecurity solution for the gene synthesis industry and the synthetic biology community.
DNA sequence screening software that implements the best match method recommended by the federal government.

Publication: Adam L et al, Strengths and limitations of the federal guidance on synthetic DNA, Nature Biotechnology (2011) 29, 208–210 doi:10.1038/nbt.1802
US Department of Health and Human Services voluntary guidelines “Screening Framework Guidance for Synthetic Double-Stranded DNA Providers” November 2009.

Software: http://sourceforge.net/projects/genothreat/

Published in: Software, Technology


Transcript

  • 1. GenoTHREAT A biosecurity software to screen DNA synthesis orders against pathogens GBCB seminar Laura Adam 10/07/2014
  • 2. 7/10/2014 GenoTHREAT 2
  • 3. 7/10/2014 GenoTHREAT 3 (2005) Science, 310(5745), 77. AAAS.
  • 4. 7/10/2014 GenoTHREAT 4
  • 5. 7/10/2014 GenoTHREAT 5 http://www.washingtonpost.com/wp-srv/nation/daily/graphics/wmdbio_123004.html
  • 6. CURRENT REGULATIONS 7/10/2014 GenoTHREAT 6
  • 7. The Gene Synthesis Industry 7/10/2014 7GenoTHREAT
  • 8. Industry Response to Dual Use • 5 members (all based in Germany) • Undersigned by: ► 6 German or German/American ► 2 Chinese • “Code of Conduct for Best Practices in Gene Synthesis” • 5 companies (American) • 80% of worldwide synthesis capacity • “Harmonized Screening Protocol” 7/10/2014 8GenoTHREAT
  • 9. 7/10/2014 GenoTHREAT 9 Major Sections: Customer screening Sequence screening Record retention Government contact
  • 10. Our Primary Objectives 1. Interpret the (draft) guidance as an algorithm 2. Implement as a software: GenoTHREAT 3. Characterize screening efficacy 7/10/2014 10GenoTHREAT
  • 11. Road Map I. Current regulations II. Sequence screening algorithm: interpreting the guidance III. GenoTHREAT: implementation and characterization IV. Conclusions 7/10/2014 11GenoTHREAT
  • 12. [Guidance] : Purpose “[…] to minimize the risk that unauthorized individuals or individuals with malicious intent will obtain “toxins and agents of concern” through the use of nucleic acid synthesis technologies, and to simultaneously minimize any negative impacts on the conduct of research and business operations.” 7/10/2014 12GenoTHREAT
  • 13. [Guidance] : Goals of sequence screening • Agent of concern? • Select Agents and Toxins • Sequences of concern? • “dsDNA sequences derived from or encoding Select Agents and Toxins” • Sequence unique to select agent • No house-keeping genes • Both DNA strands and the six-frame translation • Detect any “sequence of concern” • Embedded: as small as 200 bp • Use Best match approach (at least) 7/10/2014 13GenoTHREAT
  • 14. 7/10/2014 GenoTHREAT 14
  • 15. 7/10/2014 GenoTHREAT 15
  • 16. [Guidance] : Major Points 1. Perform Six Frame Translation 2. Divide the query sequences into subsequences of 200bp or 66aa 3. For each subsequence i. BLAST ii. Best Matches iii. Flag if SAT 4. Automatic decision 7/10/2014 16GenoTHREAT
  • 17. Road Map I. Current regulations II. Sequence screening algorithm: interpreting the guidance 1. Perform Six Frame Translation 2. Divide the query sequences into subsequences of 200bp or 66aa 3. For each subsequence i. BLAST ii. Best Matches iii. Flag if SAT 4. Automatic decision III. GenoTHREAT: implementation and characterization IV. Conclusions 7/10/2014 17GenoTHREAT
  • 18. [Algorithm] : Input a query DNA sequence to screen 7/10/2014 GenoTHREAT 18
  • 19. [Algorithm] : Six Frame translation 7/10/2014 19GenoTHREAT
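The six-frame translation step can be sketched as follows. This is a minimal illustration, not the GenoTHREAT implementation: it assumes the standard genetic code, and the names `CODON_TABLE`, `reverse_complement`, `translate`, and `six_frame_translation` are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: NCBI table 1).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate three reading frames on each strand, keyed '+1'..'+3', '-1'..'-3'."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

Each of the six protein strings is then screened alongside the nucleotide query itself.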
  • 21. [Algorithm] : Division 7/10/2014 21GenoTHREAT
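The division step could look like the sketch below, which cuts a sequence into consecutive windows of 200 bp (nucleotides) or 66 aa (translated frames). Keeping a shorter trailing window is an assumption made here for illustration; the slides do not say how GenoTHREAT treats the remainder, and the function name `divide` is hypothetical.

```python
def divide(sequence, window):
    """Split a sequence into consecutive, non-overlapping windows.

    Returns (start, subsequence) pairs so that hits can be mapped back to
    positions in the original query. The last window may be shorter than
    `window` (an assumption of this sketch).
    """
    return [
        (start, sequence[start:start + window])
        for start in range(0, len(sequence), window)
    ]


# Nucleotide queries use 200 bp windows, translated frames use 66 aa windows.
NUCLEOTIDE_WINDOW = 200
PROTEIN_WINDOW = 66
```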
  • 23. [Algorithm] : What should we do with subsequences? 7/10/2014 GenoTHREAT 23
  • 25. [Algorithm] : BLAST subsequences against entire Genbank database 7/10/2014 25GenoTHREAT
  • 26. Basic Local Alignment Search Tool (BLAST) • Developed at the U.S. National Center for Biotechnology Information • One of the most widely used bioinformatics tools • Aligns query sequences against sequences in the GenBank sequence database • Algorithm emphasizes speed over sensitivity 7/10/2014 26GenoTHREAT
  • 27. BLAST Query Sequence Database of sequences Local alignment 7/10/2014 27GenoTHREAT
  • 28. BLAST Output Percent Identity ► The percentage of identical nucleotides (or amino acid) in the sequence aligned Query Coverage ► The length of sequence aligned 7/10/2014 28GenoTHREAT
  • 29. [Algorithm] : What should we do with all those results of BLAST? 7/10/2014 29GenoTHREAT
  • 31. [Guidance] : The Best match approach • Use local sequence alignment tool • suggest Blast • Best matches = greatest percent identity over the entire fragment • 66AA or 200bps fragments 7/10/2014 31GenoTHREAT
  • 32. [Algorithm] : Identify Best Matches 7/10/2014 32GenoTHREAT
  • 33. BLAST [Example] 7/10/2014 GenoTHREAT 33
      BLAST results      PI (%)   QC (%)
      Mus musculus       100      100
      Mus musculus       100      100
      Danio rerio        97       100
      Danio rerio        43       80
      Best matches: Mus musculus, Mus musculus
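One way to read "greatest percent identity over the entire fragment" is sketched below: keep only hits whose query coverage is 100%, then keep every hit tied for the highest percent identity. The `Hit` type and `best_matches` function are hypothetical names, and the 100% coverage threshold is this sketch's interpretation of "entire fragment".

```python
from typing import NamedTuple


class Hit(NamedTuple):
    organism: str
    percent_identity: float  # PI, in percent
    query_coverage: float    # QC, in percent


def best_matches(hits, min_coverage=100.0):
    """Return the hits with the top percent identity among full-coverage hits."""
    full = [h for h in hits if h.query_coverage >= min_coverage]
    if not full:
        return []  # corresponds to the "No Best Matches" case
    top = max(h.percent_identity for h in full)
    return [h for h in full if h.percent_identity == top]


example = [
    Hit("Mus musculus", 100, 100),
    Hit("Mus musculus", 100, 100),
    Hit("Danio rerio", 97, 100),
    Hit("Danio rerio", 43, 80),
]
print([h.organism for h in best_matches(example)])
```

On the example table this selects the two Mus musculus entries, as on the slide.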
  • 35. [Algorithm]: Determine nature of Best Matches 7/10/2014 35GenoTHREAT
  • 36. [Algorithm] : How can we know if a Best Match is to a Select Agent or Toxin? Problem: no suggestion in guidance Solution: keyword and anti-keyword list 7/10/2014 36GenoTHREAT
  • 37. BLAST [Example] : Is this subsequence a hit? 7/10/2014 GenoTHREAT 37
      BLAST results                    PI (%)   QC (%)
      Bacillus anthracis               100      100
      Bacillus anthracis str. Sterne   100      100
      Danio rerio                      97       100
      Danio rerio                      43       80
      Best matches: Bacillus anthracis, Bacillus anthracis str. Sterne
  • 38. [Example] : Keyword vs. Anti-keyword If a GenBank entry contains a keyword, then the sequence is flagged SA 7/10/2014 38GenoTHREAT
  • 39. [Example] : Keyword vs. Anti-keyword If a GenBank entry contains both a keyword and an anti-keyword, the order is not flagged (NSA) 7/10/2014 39GenoTHREAT
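The keyword and anti-keyword rule from the two slides above can be sketched as a simple text match on the GenBank entry description. The lists used here are placeholders chosen for illustration only; they are not the lists shipped with GenoTHREAT, and `classify_entry` is a hypothetical name.

```python
def classify_entry(description, keywords, anti_keywords):
    """Label a GenBank entry 'SA' (select agent) or 'NSA' (not select agent).

    An entry is SA when it contains a keyword and no anti-keyword;
    a keyword together with an anti-keyword yields NSA.
    Matching is case-insensitive substring search (an assumption).
    """
    text = description.lower()
    has_keyword = any(k.lower() in text for k in keywords)
    has_anti_keyword = any(a.lower() in text for a in anti_keywords)
    return "SA" if has_keyword and not has_anti_keyword else "NSA"


# Placeholder lists, hypothetical and for illustration only.
KEYWORDS = ["bacillus anthracis"]
ANTI_KEYWORDS = ["sterne"]

print(classify_entry("Bacillus anthracis, complete genome", KEYWORDS, ANTI_KEYWORDS))
print(classify_entry("Bacillus anthracis str. Sterne", KEYWORDS, ANTI_KEYWORDS))
```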
  • 40. [Algorithm] : When to flag the subsequence? 7/10/2014 40GenoTHREAT
  • 41. BLAST [Example] : Is this subsequence a hit? 7/10/2014 GenoTHREAT 41
      BLAST results      Score   QC (%)
      Mus musculus       100     100
      Mus musculus       100     100
      Danio rerio        97      100
      Danio rerio        43      80
      Best matches: Mus musculus, Mus musculus
  • 42. BLAST [Example] : Is this subsequence a hit? 7/10/2014 GenoTHREAT 42
      BLAST results              Score   QC (%)
      Lumpy skin disease virus   100     100
      Sheeppox virus             100     100
      Goatpox virus              98      100
      Deerpox virus              44      80
      Best matches: Lumpy skin disease virus, Sheeppox virus
  • 43. BLAST [Example] : Is this subsequence a hit? 7/10/2014 GenoTHREAT 43
      BLAST results             Score   QC (%)
      Bacillus anthracis        100     100
      Bacillus cereus           100     100
      Plasmodium falciparum     63      100
      Clostridium ljungdahlii   44      80
      Best matches: Bacillus anthracis, Bacillus cereus
      [Guidance] : « unique to Select Agent » !!!
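A possible reading of the "unique to Select Agent" requirement is that a subsequence is flagged only when every one of its best matches is labelled SA. The sketch below encodes that reading; it is an interpretation, not a rule confirmed by the slides, and `flag_subsequence` is a hypothetical name.

```python
def flag_subsequence(best_match_labels):
    """Flag a subsequence only if it has best matches and all of them are 'SA'.

    `best_match_labels` holds one 'SA' or 'NSA' label per best match.
    A mix of SA and NSA best matches means the sequence is not unique
    to a select agent, so it is not flagged under this interpretation.
    """
    return bool(best_match_labels) and all(
        label == "SA" for label in best_match_labels
    )


print(flag_subsequence(["SA", "NSA"]))  # shared with a non-select organism
print(flag_subsequence(["SA", "SA"]))   # every best match is a select agent
```

Under this reading, a subsequence matching Bacillus anthracis and Bacillus cereus equally well would not be flagged.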
  • 44. [Algorithm] : No Best Matches… 7/10/2014 44GenoTHREAT
  • 45. [Algorithm] : Points of the Guidance left to interpretation How do you identify sequences of concern of 200bp or greater which partially span two adjacent subsequences? Problem: no suggestion in guidance Solution: extension method 7/10/2014 45GenoTHREAT
  • 46. [Algorithm] : Extension Method 7/10/2014 GenoTHREAT 46
  • 47. [Algorithm] : Extension Method 7/10/2014 47GenoTHREAT
  • 48. [Algorithm] : Extension Method 7/10/2014 48GenoTHREAT
  • 49. [Algorithm] : Extension Method 7/10/2014 49GenoTHREAT
  • 50. [Algorithm] : Extension Method 7/10/2014 50GenoTHREAT
  • 51. [Algorithm] : Extension Method 7/10/2014 51GenoTHREAT
  • 52. Extend to meet possible alignments 120bp 80bp 120bp80bp New subsequence [Algorithm] : Extension Method 7/10/2014 52GenoTHREAT
  • 53. [Algorithm] : Extension Method 7/10/2014 53GenoTHREAT
  • 54. [Algorithm] : Extension Method 7/10/2014 54GenoTHREAT
  • 55. [Algorithm] : Extension Method 7/10/2014 55GenoTHREAT
  • 56. [Algorithm] : Extension Method 7/10/2014 56GenoTHREAT
  • 57. [Algorithm] : Extension Method 7/10/2014 57GenoTHREAT
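The extension slides do not spell out the procedure in text, so the following is only a guess at its core idea, based on the "120bp / 80bp / New subsequence" labels: when a partial alignment stops at a window boundary, build a new 200 bp window that starts (or ends) where the alignment does and spills into the neighbouring subsequence. All names and the boundary test are assumptions.

```python
def extension_windows(query_len, partial_alignments, window=200):
    """Propose new windows for partial alignments that touch a window boundary.

    `partial_alignments` holds (start, end) half-open positions on the full
    query. Returns (start, end) positions of the new subsequences to screen.
    """
    new_windows = []
    for start, end in partial_alignments:
        if end - start >= window:
            continue  # already covers a full window
        touches_right = end % window == 0 and end < query_len
        touches_left = start % window == 0 and start > 0
        if touches_right:
            # Alignment runs to the end of its window: extend into the next one.
            new_windows.append((start, min(start + window, query_len)))
        elif touches_left:
            # Alignment begins at the start of its window: extend backwards.
            new_windows.append((max(end - window, 0), end))
    return new_windows


# An 80 bp alignment at the end of the first window yields a new window that
# takes 80 bp from the first subsequence and 120 bp from the second.
print(extension_windows(600, [(120, 200)]))
```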
  • 59. [Algorithm] : Recap 7/10/2014 59GenoTHREAT
  • 60. Road Map I. Current regulations II. Sequence screening algorithm: interpreting the guidance III. GenoTHREAT: implementation and characterization 1. Software implementation 2. Software Characterization IV. Conclusions 7/10/2014 60GenoTHREAT
  • 61. Using BLAST 7/10/2014 GenoTHREAT 61
      Online BLAST: performs BLAST via the NCBI website interface
        ► Faster per BLAST
        ► Computationally less expensive
        ► Only sequential, due to NCBI restrictions
        ► Lack of privacy
      Local BLAST: performs BLAST in parallel on a local machine
        ► User privacy
        ► Faster per sequence due to parallelization
        ► Computationally expensive (memory and CPU intensive)
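A local, parallel BLAST run might be driven as in the sketch below. It assumes the NCBI BLAST+ `blastn` executable is installed and a local nucleotide database is available; the file names, database name, and worker count are hypothetical, and this is not how GenoTHREAT itself invokes BLAST. Protein frames would use `blastp` in the same way.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_blastn_command(query_fasta, database, output_tsv):
    """Build a blastn command writing tabular output with identity and coverage."""
    return [
        "blastn",
        "-query", query_fasta,
        "-db", database,
        "-outfmt", "6 qseqid sseqid pident qcovs stitle",
        "-out", output_tsv,
    ]


def run_local_blast(jobs, database, max_workers=4):
    """Run one blastn process per (query_fasta, output_tsv) job in parallel.

    Threads are sufficient here because the work happens in the child
    processes. Raises CalledProcessError if any blastn run fails.
    """
    commands = [build_blastn_command(q, database, o) for q, o in jobs]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(
            pool.map(lambda cmd: subprocess.run(cmd, check=True), commands)
        )
    return [r.returncode for r in results]
```

Running several processes at once is what makes local screening faster per sequence, at the memory and CPU cost noted on the slide.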
  • 62. Screening time & hardware 7/10/2014 GenoTHREAT 62
      Online / Desktop / Business Class Server
      Sequence length (bp)   Screening time (min)*
      2,000                  2
      10,000                 12.5
      *Screening performed using business class server
  • 63. Road Map I. Current regulations II. Sequence screening algorithm: interpreting the guidance III. GenoTHREAT: implementation and characterization 1. Software implementation 2. Software Characterization i. Database of test sequences ii. Keyword list variation iii. Detection of Potentially dangerous sequences iv. BLAST parameters v. Real world gene orders simulation IV. Conclusions 7/10/2014 63GenoTHREAT
  • 64. Database of Test Sequences • Implementations must be compared to assess quality • Standardized set of test sequences is needed • Test Set contains 184 sequences: • Select Agents o Genes associated with toxins or pathogenicity o Genes associated with normal function • Model Organisms 64 7/10/2014 64GenoTHREAT
  • 65. Database of Test Sequences Contribute to the development of a standard test set of sequences 65 7/10/2014 65GenoTHREAT
  • 67. Keyword and Anti-Keyword list • Test with the unmodified sequences (184 sequences) • Two lists of keywords • Limited • extensive • Plus • anti-keyword list • or not 7/10/2014 67GenoTHREAT
  • 68. Keyword List Content Variation 7/10/2014 GenoTHREAT 68
      [Bar chart: Correct SAT and Correct NSAT for the limited and extensive keyword lists, axis 0 to 120]
      Keyword list method not mentioned in guidance
      Limited keyword list: composed only of words in the SAT List
      Extensive keyword list: extension of the limited keyword list containing words uniquely related to SAT
  • 69. Anti-Keywords 7/10/2014 69GenoTHREAT Anti
  • 71. Modified Test Sequences Modification performed on the initial unmodified sequences ► Intervening sequences ► Degenerate sequences ► Mutated sequences (BLAST parameters) 7/10/2014 71GenoTHREAT
  • 72. Degenerate Sequences 7/10/2014 GenoTHREAT 72
      Potential Danger: codon optimized nucleotide sequences
      Unmodified nucleotide: GATTTGGACACTCATTTCACC
      Degenerate nucleotide: GATACGTCAACCTTTTAA GC
      Amino acid sequence: DLDTHFT
      Result: all codon optimized sequences detected due to screening of amino acid sequences
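The reason amino acid screening catches codon-varied sequences can be shown on the slide's short example. The sketch assumes the standard genetic code; the recoded sequence below is a synonymous variant written for this illustration and is not taken from the slide.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate complete codons starting at position 0."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


unmodified = "GATTTGGACACTCATTTCACC"  # from the slide
recoded = "GACCTCGATACACACTTTACG"     # hypothetical synonymous variant

# The nucleotide strings differ, but the encoded peptide is identical,
# so a protein-level comparison still sees the same sequence.
print(unmodified != recoded, translate(unmodified) == translate(recoded))
```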
  • 73. Intervening sequences Potential Danger: SAT sequences hidden within larger, benign sequences 300bps NSAT 200bps SAT 300bps NSAT 300bps NSAT 300bps NSAT 250bps SAT 7/10/2014 73GenoTHREAT Result: All hidden sequences were detected
  • 75. Mutated sequences Potential Danger: mutated, but still active, SAT sequences which do not align to GenBank entries 7/10/2014 75GenoTHREAT
  • 76. Nucleotides subsequences 7/10/2014 76GenoTHREAT Result: BLAST parameter settings affect screening capability
  • 77. Amino-Acid subsequences 7/10/2014 77GenoTHREAT Result: BLAST parameters do not clearly change the efficiency of the screening
  • 78. Nucleotides subsequences 7/10/2014 78GenoTHREAT Result: Direct relationship between screening time and ability to identify mutated sequences
  • 79. Amino-Acid subsequences 7/10/2014 79GenoTHREAT
  • 81. Real world gene orders simulation Gene Synthesis company: low number of false hits needed 1. iGEM registry • Registry completed by iGEM teams each year • Contains 10,000 sequences 2. GenoCAD database • 1,258 sequences longer than 200 bp 7/10/2014 81GenoTHREAT
  • 82. iGEM Registry First step: screen registry sequences 1-->1724 Hit rate: 6.5% Major causes of hits: • 100% query coverage for Best Match too restrictive • Some results have 100% query coverage but very low Percent Identity • Keyword list issues 7/10/2014 82GenoTHREAT 95% 60% solved 2.9%
  • 83. iGEM Registry 7/10/2014 GenoTHREAT 83
  • 84. GenoCAD database • 1,258 sequences • 32 hits: 2.54% • Manual review: • YopH: protein from Y.pestis (gi|14488772) 7/10/2014 GenoTHREAT 84
  • 85. Real world gene orders simulation Hits left are due to: • Very often: 1 subsequence of 1 protein frame leads to a correct hit. Is it worth flagging the entire sequence? • Sometimes: many subsequences lead to correct hits. Probably worth flagging 7/10/2014 85GenoTHREAT
  • 87. GenoTHREAT • “Best Match” • Hardware and software parameters • Keyword list • BLAST parameters • Certain types of sequence modifications • High-resolution screen 7/10/2014 87GenoTHREAT
  • 88. Guidance conclusion Government Guidance potentially usable by companies: • Reasonable time • Good detection of sequences of concern • Number of false hits potentially low (manual review) 7/10/2014 88GenoTHREAT
  • 89. 7/10/2014 GenoTHREAT 89 http://www.dagorret.net/2009/12/18/new-technology-developed-by-microsoft-for-photography-dna-image/ http://www.wadsworth.org/testing/biodefense/education.shtml
  • 90. 7/10/2014 GenoTHREAT 90 © iGEM and Justin Knight.
  • 91. 7/10/2014 GenoTHREAT 91
  • 92. 7/10/2014 GenoTHREAT 92
  • 93. 7/10/2014 GenoTHREAT 93
  • 94. 7/10/2014 GenoTHREAT 94 [Background image: DNA sequence] http://sourceforge.net/projects/genothreat/
  • 95. Acknowledgements Dr. Jean Peccoud Mandy L. Wilson The VT-ENSIMAG iGEM team (2010): Michael Kozar Gaelle Letort Olivier Mirat Arunima Srivastava Tyler Stewart My PhD committee: Dr. Bevan Dr. Garner Dr. Peccoud Dr. Ramakrishnan Dr. Setubal 7/10/2014 GenoTHREAT 95
  • 96. 7/10/2014 GenoTHREAT 96