
Variant Sequence Search & Analysis

Variant Sequence Searching
Boston Biotech Conference 2018

Published in: Science

  1. User Meeting | Company Confidential Do Not Distribute. Variant Sequence Search and Analysis. Ellen Sherin, Sr. Product Manager, GQ Life Sciences. February 21, 2018. Ellen.Sherin@aptean.com, 302-239-1506
  2. Topics:
       • Examples of variant sequences
       • Determining search scope
       • Algorithms, regular expressions, and logic
       • Search example
       • Text searches for variants
  3. SNP: Simple Variant Description. Often there are few enough variations that the sequences can be written out as explicits if desired. It is also possible to use coordinates to filter for hits crossing the SNP region.
  4. More Complicated: Multiple Variations. US 20150087572: “In one aspect, an automatic dishwashing detergent composition comprising a variant protease of a parent protease, said parent protease amino acid sequence being identical to the amino acid sequence of SEQ ID NO:1, said variant protease of said parent protease mutations consisting of one of the following sets of mutations versus said parent protease: (i) N76D + S87R + G118R + S128L + P129Q + S130A”
  5. And even harder! “19. The isolated subtilisin variant of claim 1, wherein said isolated subtilisin variant comprises a combination of substitutions selected from:
       S87N/Q109Q/G118D/S128L/P129Q/S130A/S188S/T213E/N248R,
       S87N/Q109R/G118V/S128L/P129Q/S130A/S188D/T213R/N248D,
       S87N/Q109Q/G118V/S128L/P129Q/S130A/S188S/T213T/N248N,
       S87N/Q109R/G118V/S128L/P129Q/S130A/S188D/T213E/N248N,
       S87N/Q109R/G118V/S128L/P129Q/S130A/S188D/T213E/N248R,
       S87N/Q109R/G118V/S128L/P129Q/S130A/S188D/T213T/N248R,
       S87N/Q109R/G118V/S128L/P129Q/S130A/S188S/T213T/N248R,
       S87R/Q109R/G118V/S128L/P129Q/S130A/S188D/T213E/N248R,
       S87R/Q109Q/G118R/S128L/P129Q/S130A/S188D/T213T/N248R,
       S87R/Q109R/G118R/S128L/P129Q/S130A/S188D/T213T/N248R,
       S87N/Q109Q/G118R/S128L/P129Q/S130A/S188D/T213E/N248R,
       S87N/Q109R/G118V/S128L/P129Q/S130A/S188D/T213R/N248R,
       S87R/Q109Q/G118R/S128L/P129Q/S130A/S188S/T213E/N248R,
       S87N/Q109R/G118V/S128L/P129Q/S130A/S188D/T213T/N248N,
       S87R/Q109Q/G118R/S128L/P129Q/S130A/S188D/T213E/N248R,
       S87N/Q109R/G118V/S128L/P129Q/S130A/S188S/T213E/N248R,
       S87D/Q109D/G118D/S128L/P129Q/S130A/S188D/T213E/N248R,
       S87N/Q109R/G118V/S128L/P129Q/S130A/S188S/T213R/N248R,
       S87N/Q109R/G118V/S128L/P129Q/S130A/S188R/T213T/N248R,
       S87R/Q109R/G118R/S128L/P129Q/S130A/S188D/T213E/N248R,
       S87R/Q109Q/G118R/S128L/P129Q/S130A/S188D/T213E/N248N,
  7. Search Scope Possibilities (OMG, where do I begin??) for N76D + S87R + G118R + S128L + P129Q + S130A:
       1. Report all results containing at least one position with my variation.
       2. Report all combinations of positions containing my variation and/or wt.
       3. Find all hits containing ANY variation (not necessarily the specified ones) in at least one of my positions.
       4. Find all hits containing ANY variation (not necessarily the specified ones) in at least one of my positions, BUT at least one position must be my specified variation.
       5. Is X in scope for “my variation”? (Answer: it should be!)
  8. Strategizing:
       • One approach: write out each sequence as an explicit and search each one as an individual query sequence.
       • That can work for one or two positions with a limited number of substitutions; however, this search request has six positions with 2 possibilities per position, which gives 2^6 = 64 possible sequences.
       • Increase the request to three variations per position by adding X as an option and we now have 3^6 = 729 query sequences! Four variations? 4^6 = 4096!
       • Neither of the above examples addresses the request for ANY variation at specific positions.
       Question to ponder: if it’s impractical to write out all the explicits, what percentage of variants from any patent are present in sequence listings OR in ANY sequence database?
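The counts on this slide can be checked with a short sketch. It assumes each mutation is written as wild-type letter, position, variant letter (as in N76D); the function names are hypothetical and only the variable residues are enumerated, not the full-length sequences.

```python
from itertools import product

# Mutation set from the slide: wild-type residue, position, claimed variant.
MUTATIONS = ["N76D", "S87R", "G118R", "S128L", "P129Q", "S130A"]

def options_per_position(mutations, extra=""):
    """Allowed residues per position: wild type, variant, plus any extras (e.g. X)."""
    return [m[0] + m[-1] + extra for m in mutations]

def count_explicits(mutations, extra=""):
    """Number of explicit query sequences = product of the option counts."""
    total = 1
    for options in options_per_position(mutations, extra):
        total *= len(options)
    return total

def enumerate_explicits(mutations, extra=""):
    """Every combination of residues at the variable positions, as short strings."""
    return ["".join(combo) for combo in product(*options_per_position(mutations, extra))]

print(count_explicits(MUTATIONS))       # 64  (wt or variant at each of 6 positions)
print(count_explicits(MUTATIONS, "X"))  # 729 (wt, variant, or X)
```

Even this toy enumeration shows why explicits stop scaling: each extra option per position multiplies the number of queries by itself across all six positions.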
  10. The MOTIF Algorithm is a Wonderful Thing:
       • It gives you the flexibility to lock down some positions and vary others, very specifically.
       • It retrieves all combinations of specified variants. For example, N76D + S87R retrieves all four combinations at positions 76 and 87: N and S, N and R, D and S, D and R.
       • You can use X or any of the IUPAC degeneracy codes; MOTIF understands what they mean and auto-expands them.
       • It will NOT find an explicit X in a subject sequence unless you specify it (and you can’t use X to search for X alone!)
  11. MOTIF auto-expands degeneracy characters
  12. Motif Algorithm for Searching Variants: The MOTIF algorithm recognizes both IUPAC degeneracy characters and explicit character combinations, written as “regular expressions”.
       • Y15T = [YT]
       • Anything except T = [^T]
       • Can handle indels, multiple combinations, wildcarded regions
       • See GQ documentation for more specifics
       • Only reports results with 100% identity to the query sequence (including specified variations)
       • Must be combined with a text search for variations, because many variants are not indexed.
       Notation   Description
       .*         any amino acid, any length string
       [^K]       any single amino acid other than K
       X          any single amino acid
       [KG]       either K or G but nothing else
       Comments: GenomeQuest help documentation provides extensive documentation of variable notation (aka Perl “regular expressions”) in the MOTIF section.
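The notation in this table maps closely onto ordinary regular expressions, so it can be tried out with Python's `re` module. The sketch below is not GenomeQuest's MOTIF engine; `motif_to_regex` and `motif_matches` are hypothetical helpers. It assumes that X expands to the 20 standard amino acids (so it does not match an explicit X in a subject, in line with the previous slide) and that B and Z expand as the usual IUPAC ambiguity codes. Python's own `.*` and `[^K]` would still match a literal X, which may differ from MOTIF.

```python
import re

# Assumption: "X" means any one of the 20 standard amino acids.
STANDARD_AA = "ACDEFGHIKLMNPQRSTVWY"

# Degeneracy codes expanded to regex character classes.
EXPANSION = {
    "B": "[DN]",               # Asp or Asn
    "Z": "[EQ]",               # Glu or Gln
    "X": f"[{STANDARD_AA}]",   # any standard residue
}

def motif_to_regex(motif: str) -> str:
    """Translate MOTIF-style notation into a Python regex string."""
    out = []
    in_class = False  # True while inside [...], where letters stay literal
    for ch in motif:
        if ch == "[":
            in_class = True
        elif ch == "]":
            in_class = False
        out.append(ch if in_class else EXPANSION.get(ch, ch))
    return "".join(out)

def motif_matches(motif: str, subject: str) -> bool:
    """True if the motif occurs anywhere in the subject sequence."""
    return re.search(motif_to_regex(motif), subject) is not None

print(motif_matches("A[KG]C", "TTAKCTT"))  # True: K is allowed at that position
print(motif_matches("A[^K]C", "TTAKCTT"))  # False: K is excluded
```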
  14. Search Example. US 20150087572: “In one aspect, an automatic dishwashing detergent composition comprising a variant protease of a parent protease, said parent protease amino acid sequence being identical to the amino acid sequence of SEQ ID NO:1, said variant protease of said parent protease mutations consisting of one of the following sets of mutations versus said parent protease: (i) N76D + S87R + G118R + S128L + P129Q + S130A”
  15. Step 1 – Build Full-Length Variants (N76D + S87R + G118R + S128L + P129Q + S130A)
       > GG36-wt
       AQSVPWGISRVQAPAAHNRGLTGSGVKVAVLDTGISTHPDLNIRGGASFVPGEPSTQDGNGHGTHVAGTIAALNNSIGVLGVAPSAELYAVKVLGASGSGSVSSIAQGLEWAGNNGMHVAN
       LSLGSPSPSATLEQAVNSATSRGVLVVAASGNSGAGSISYPARYANAMAVGATDQNNNRASFSQYGAGLDIVAPGVNVQSTYPGSTYASLNGTSMATPHVAGAAALVKQKNPSWSNVQIRNH
       LKNTATSLGSTNLYGSGLVNAEAATR
       > GG36-wtVar (each position can be either wt or my variation only)
       AQSVPWGISRVQAPAAHNRGLTGSGVKVAVLDTGISTHPDLNIRGGASFVPGEPSTQDGNGHGTHVAGTIAAL[ND]NSIGVLGVAP[SR]AELYAVKVLGASGSGSVSSIAQGLEWAGNN[GR]
       MHVANLSLGSP[SL][PQ][SA]ATLEQAVNSATSRGVLVVAASGNSGAGSISYPARYANAMAVGATDQNNNRASFSQYGAGLDIVAPGVNVQSTYPGSTYASLNGTSMATPHVAGAAALVKQKNPSWSNV
       QIRNHLKNTATSLGSTNLYGSGLVNAEAATR
       > GG36-X (each position can be anything)
       AQSVPWGISRVQAPAAHNRGLTGSGVKVAVLDTGISTHPDLNIRGGASFVPGEPSTQDGNGHGTHVAGTIAALXNSIGVLGVAPXAELYAVKVLGASGSGSVSSIAQGLEWAGNNXMHVANL
       SLGSPXXXATLEQAVNSATSRGVLVVAASGNSGAGSISYPARYANAMAVGATDQNNNRASFSQYGAGLDIVAPGVNVQSTYPGSTYASLNGTSMATPHVAGAAALVKQKNPSWSNVQIRNHL
       KNTATSLGSTNLYGSGLVNAEAATR
       > GG36-notVar (each position is NOT my variation)
       AQSVPWGISRVQAPAAHNRGLTGSGVKVAVLDTGISTHPDLNIRGGASFVPGEPSTQDGNGHGTHVAGTIAAL[^D]NSIGVLGVAP[^R]AELYAVKVLGASGSGSVSSIAQGLEWAGNN[^R]
       MHVANLSLGSP[^L][^Q][^A]ATLEQAVNSATSRGVLVVAASGNSGAGSISYPARYANAMAVGATDQNNNRASFSQYGAGLDIVAPGVNVQSTYPGSTYASLNGTSMATPHVAGAAALVKQK
       NPSWSNVQIRNHLKNTATSLGSTNLYGSGLVNAEAATR
       Note: If you count amino acids, the positions are slightly different than anticipated in wt; they appear to be based on BPN’ numbering, which is still another confusing element in these searches. I’ll neglect this complexity for now!
  16. Step 2 – Build Subsequences (N76D + S87R + G118R + S128L + P129Q + S130A)
       MOTIF only reports results with 100% identity to the query sequence (including specified variations), so what if there are variations at additional positions?
       > GG36-wt
       AQSVPWGISRVQAPAAHNRGLTGSGVKVAVLDTGISTHPDLNIRGGASFVPGEPSTQDGNGHGTHVAGTIAALNNSIGVLGVAPSAELYAVKVLGASGSGSVSSI
       AQGLEWAGNNGMHVANLSLGSPSPSATLEQAVNSATSRGVLVVAASGNSGAGSISYPARYANAMAVGATDQNNNRASFSQYGAGLDIVAPGVNVQSTYPGST
       YASLNGTSMATPHVAGAAALVKQKNPSWSNVQIRNHLKNTATSLGSTNLYGSGLVNAEAATR
       > GG36-wtVar (each position can be either wt or my variation only)
       VAGTIAAL[ND]NSIGVLGVAP[SR]AELYAV.*WAGNN[GR]MH.*LGSP[SL][PQ][SA]ATLEQAV
       > GG36-wtVarX (each position can be wt, my variation, or explicit X only)
       VAGTIAAL[^acefghiklmopqrstuvwy]NSIGVLGVAP[^acdefghiklmnopqtuvwy]AELYAV.*WAGNN[^acdefhijklmnopqstuvwy]MH.*LGSP[^acdefghijkmnopqrtuvwy][^acdefghiklmnorstuvwy][^cdefghijklmnopqrtuvwy]ATLEQAV
       > GG36-X (each position can be anything)
       VAGTIAALXNSIGVLGVAPXAELYAV.*WAGNNXMH.*LGSPXXXATLEQAV
       > GG36-notVar (each position is NOT my variation)
       VAGTIAAL[^D]NSIGVLGVAP[^R]AELYAV.*WAGNN[^R]MH.*LGSP[^L][^Q][^A]ATLEQAV
       Note: .* means “any residue, any length”
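The subsequence queries on this slide can be generated from a single template and tried against toy subjects using Python's `re` module. This is a sketch, not the MOTIF engine: X is approximated by the regex `.`, the `wt` subsequence pattern is an assumption (the slide searches the full-length wild type), and the two mutant subjects are made up for illustration by editing one residue of the wild-type fragment.

```python
import re

# Subsequence template from the slide; {0}..{5} mark the six variant positions.
TEMPLATE = "VAGTIAAL{0}NSIGVLGVAP{1}AELYAV.*WAGNN{2}MH.*LGSP{3}{4}{5}ATLEQAV"
WT = "NSGSPS"   # wild-type residues at the six positions
VAR = "DRRLQA"  # claimed substitutions: N76D S87R G118R S128L P129Q S130A

PATTERNS = {
    "wt": TEMPLATE.format(*WT),
    "wtVar": TEMPLATE.format(*(f"[{w}{v}]" for w, v in zip(WT, VAR))),
    "X": TEMPLATE.format(*("." for _ in WT)),
    "notVar": TEMPLATE.format(*(f"[^{v}]" for v in VAR)),
}

# Region of GG36-wt covered by the template (copied from the slide).
WT_FRAGMENT = (
    "VAGTIAALNNSIGVLGVAPSAELYAVKVLGASGSGSVSSIAQGLE"
    "WAGNNGMHVANLSLGSPSPSATLEQAV"
)

# Hypothetical subjects: wild type, a claimed variant, an unclaimed variant.
SUBJECTS = {
    "wild_type": WT_FRAGMENT,
    "S87R": WT_FRAGMENT.replace("GVAPSAELYAV", "GVAPRAELYAV"),
    "S87K": WT_FRAGMENT.replace("GVAPSAELYAV", "GVAPKAELYAV"),
}

def hits(pattern_name):
    """Names of the subjects matched by the given query pattern."""
    pattern = PATTERNS[pattern_name]
    return {name for name, seq in SUBJECTS.items() if re.search(pattern, seq)}

for name in PATTERNS:
    print(name, sorted(hits(name)))
```

In this toy set the unclaimed S87K subject is found only by the X and notVar queries, which is why the narrower wtVar query alone cannot answer the "ANY variation" scopes.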
  17. Step 3 – Apply Boolean Logic:
       • GG36-wtVar (25 hits) NOT GG36-wt (20 hits) = 5 hits containing at least one variant position.
       • GG36-X (125 hits) NOT GG36-notVar (100 hits) = 25 hits containing at least one variant position.
       Note: You can replace GG36-wtVar with GG36-wtVarX if you consider X part of the narrower answer set (I would!)
       Diagram: 25 wtVar hits = 20 wt + 5 var; 125 X hits = 100 notVar + 25 var.
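The Boolean NOT on this slide is a set difference over hit identifiers. A minimal sketch with made-up identifiers, sized to reproduce the slide's illustrative counts:

```python
def make_ids(prefix, start, stop):
    """Hypothetical hit identifiers such as 'wt-000'; not real database IDs."""
    return {f"{prefix}-{i:03d}" for i in range(start, stop)}

# Narrower search: every wt hit is also a wtVar hit.
wt_hits = make_ids("wt", 0, 20)
wtvar_hits = wt_hits | make_ids("var", 0, 5)

# Broader search: every notVar hit is also an X hit.
notvar_hits = make_ids("notvar", 0, 100)
x_hits = notvar_hits | make_ids("var", 0, 25)

narrow = wtvar_hits - wt_hits   # wtVar NOT wt
broad = x_hits - notvar_hits    # X NOT notVar

print(len(narrow), len(broad))  # 5 25
```

Working with identifier sets rather than raw counts also makes the later "combine and remove duplicates" step a simple union.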
  18. Actual Search Results
       Narrower results:
       • GG36-wtVarX (4103) minus GG36-wt (3932) = 171 hits having at least one variant OR one X position.
       • GG36-wtVar (4095) minus GG36-wt (3932) = 163 hits having at least one variant position.
       • Eight hits have at least one X in a stated variant position.
       Broader results:
       • GG36-X (4473) minus GG36-notVar (4308) = 165 hits having anything in at least one variant position.
       • Why 165 rather than 171? Because X in a variant position would be considered “notVar”, so if all positions were not the variant (including X), the hit is removed.
       • Combine these results with the 171 and remove duplicates for the best answer set.
  19. Narrowing
       Broader results:
       • GG36-X (4473) minus GG36-notVar (4308) = 165 hits having anything (other than wt) in at least one variant position.
       Narrower results:
       • GG36-wtVarX (4103) minus GG36-wt (3932) = 171 hits having at least one variant OR one X position.
       • GG36-wtVar (4095) minus GG36-wt (3932) = 163 hits having at least one variant position.
       • Eight hits have at least one X in a stated variant position.
       Diagram labels: “NotVar identifiers removed” (leaves results with at least one variant position); “Wild type identifiers removed”.
       ✓ GG36-X (165) minus GG36-wtVar (163) = 2 results with any variations in at least one specified position (broader).
       ✓ GG36-wtVarX (171) contains all results with either X or the requested variation in at least one specified position.
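The arithmetic behind the reported answer sets can be re-derived from the raw hit counts. This sketch assumes, as the slides do, that each subtracted set is fully contained in the set it is subtracted from, so subtracting counts gives the size of the Boolean NOT.

```python
# Raw hit counts reported on the slides for each query.
COUNTS = {"wt": 3932, "wtVar": 4095, "wtVarX": 4103, "X": 4473, "notVar": 4308}

variant_or_x = COUNTS["wtVarX"] - COUNTS["wt"]  # at least one variant OR X position
variant_only = COUNTS["wtVar"] - COUNTS["wt"]   # at least one variant position
x_in_position = variant_or_x - variant_only     # hits that qualify only through X
broader = COUNTS["X"] - COUNTS["notVar"]        # X NOT notVar
beyond_narrow = broader - variant_only          # broader hits not in the wtVar answer set

print(variant_or_x, variant_only, x_in_position, broader, beyond_narrow)
# 171 163 8 165 2
```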
  20. Step 4 – Search Complete (not quite!). US 20100192985 A1 (https://www.google.ch/patents/US20100192985): “In some still further embodiments, the isolated subtilisin variants comprise a combination of substitutions selected from:
       S87N/Q109Q/G118V/S128L/P129Q/S130A/S188S/T213T/N248D,
       S87N/Q109Q/G118V/S128L/P129Q/S130A/S188S/T213R/N248N,
       S87N/Q109Q/G118V/S128L/P129Q/S130A/S188R/T213R/N248D,
       S87N/Q109Q/G118V/S128L/P129Q/S130A/S188S/T213E/N248D,
       S87R/Q109D/G118R/S128L/P129Q/S130A/S188S/T213E/N248R,
       S87R/Q109D/G118R/S128L/P129Q/S130A/S188D/T213E/N248R,
       S87N/Q109D/G118V/S128L/P129Q/S130A/S188D/T213E/N248D, and
       S87N/Q109Q/G118V/S128L/P129Q/S130A/S188R/T213E/N248D,
       wherein the positions are numbered by correspondence with the amino acid sequence of B. amyloliquefaciens subtilisin BPNʹ set forth as SEQ ID NO:1.”
       Per-position summary: S87NR, Q109QD, G118VR, S128L, P129Q, S130A, S188SRD, T213TRE, N248DNR (and that’s just a small portion of one claim!)
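The per-position summary on this slide (S87NR, Q109QD, ...) can be derived mechanically from the listed combinations. A minimal sketch, assuming every token has the form parent letter, position, residue letter; `summarize` is a hypothetical helper, and the character-class output is only a convenience for building a MOTIF-style query.

```python
import re

# The eight combinations quoted on the slide from US 20100192985 A1.
COMBOS = [
    "S87N/Q109Q/G118V/S128L/P129Q/S130A/S188S/T213T/N248D",
    "S87N/Q109Q/G118V/S128L/P129Q/S130A/S188S/T213R/N248N",
    "S87N/Q109Q/G118V/S128L/P129Q/S130A/S188R/T213R/N248D",
    "S87N/Q109Q/G118V/S128L/P129Q/S130A/S188S/T213E/N248D",
    "S87R/Q109D/G118R/S128L/P129Q/S130A/S188S/T213E/N248R",
    "S87R/Q109D/G118R/S128L/P129Q/S130A/S188D/T213E/N248R",
    "S87N/Q109D/G118V/S128L/P129Q/S130A/S188D/T213E/N248D",
    "S87N/Q109Q/G118V/S128L/P129Q/S130A/S188R/T213E/N248D",
]

TOKEN = re.compile(r"([A-Z])(\d+)([A-Z])")

def summarize(combos):
    """Collect the residues seen at each position, in order of first appearance."""
    seen = {}  # (parent residue, position) -> list of residues
    for combo in combos:
        for token in combo.split("/"):
            match = TOKEN.fullmatch(token)
            if match is None:
                raise ValueError(f"unrecognized substitution: {token!r}")
            parent, position, residue = match.groups()
            residues = seen.setdefault((parent, int(position)), [])
            if residue not in residues:
                residues.append(residue)
    return seen

summary = summarize(COMBOS)
print(", ".join(f"{p}{pos}{''.join(r)}" for (p, pos), r in summary.items()))
print({pos: f"[{''.join(r)}]" for (_, pos), r in summary.items()})
```

The first print reproduces the slide's summary line; the second gives one character class per position.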
  22. Step 5 – Text Searches:
       • There are many, many un-indexed variant sequences in patents that can’t be found by sequence searching. Text searching for the variations (with scrupulous attention to backbone numbering) is required to supplement MOTIF searching.
       • Text search the remaining PNs from the Xvar results, after removing all PNs already reported from the sequence search, using the variant notation and “AND in” the backbone name:
           (N76D or S87R or G118R or S128L or P129Q or S130A) and GG36
       • Refer back to the sequence search results to check any claimed % identities vs. wt or variants (for example, “anything having at least 90% identity to SEQ ID NO 2”); filter GQ results for that PN/SID to check.
       • Finally, do a separate “text only” search and NOT out all the PNs you’ve already screened during the previous steps.
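The text query shown on this slide can be assembled from the mutation list, and the "NOT out" step is again a set difference over patent numbers. A minimal sketch: `build_text_query` is a hypothetical helper, the `or`/`and` operator syntax simply mirrors the slide and may differ in your text system, and the patent numbers are placeholders.

```python
def build_text_query(mutations, backbone):
    """Join variant notations with 'or' and AND in the backbone name."""
    if not mutations:
        raise ValueError("at least one mutation is required")
    return "(" + " or ".join(mutations) + ") and " + backbone

def remaining_to_screen(candidate_pns, already_screened_pns):
    """PNs still needing text screening after removing those already reported."""
    return sorted(set(candidate_pns) - set(already_screened_pns))

query = build_text_query(
    ["N76D", "S87R", "G118R", "S128L", "P129Q", "S130A"], "GG36"
)
print(query)
# (N76D or S87R or G118R or S128L or P129Q or S130A) and GG36

# Placeholder patent numbers, for illustration only.
print(remaining_to_screen(["PN-A", "PN-B", "PN-C"], ["PN-B"]))
# ['PN-A', 'PN-C']
```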
  23. Narrowing Sequence Results by Text:
       • Move Xvar results to your text system of choice.
       Tip: You may also want to (separately) move your sequence result PNs to the same system, and then NOT them out from the Xvar results before screening.
  24. Text Narrowed Result (still needs to be visually screened)
  25. Sample Text Query
  26. Don’t Neglect 3-Letter Codes!
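Patents often spell substitutions with three-letter amino acid codes, so a text query built only from one-letter notation can miss them. A minimal sketch of the conversion, using the standard three-letter codes; the output style "Asn76Asp" is an assumption about how a given patent writes it, and `to_three_letter` is a hypothetical helper.

```python
import re

# Standard one-letter to three-letter amino acid codes.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}

def to_three_letter(mutation):
    """Convert e.g. 'N76D' to 'Asn76Asp'."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"unrecognized substitution: {mutation!r}")
    parent, position, variant = match.groups()
    return f"{THREE_LETTER[parent]}{position}{THREE_LETTER[variant]}"

def both_notations(mutations):
    """One-letter and three-letter forms, ready to OR together in a text query."""
    return [form for m in mutations for form in (m, to_three_letter(m))]

print(both_notations(["N76D", "S87R"]))
# ['N76D', 'Asn76Asp', 'S87R', 'Ser87Arg']
```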
  27. Summary:
       • Searching variants is a special skill, which requires a different methodology than standard sequence searching.
       • A proper variant search combines text screening of rejected sequence hits and a full text search with sequence results.
       • The majority of variant sequences are NOT indexed in any database (at least not associated with their patent).
       • If you think it’s hard for searchers, just think: you could have to write or interpret documents like https://www.google.ch/patents/US20100192985!
  28. Questions? Thank You for Attending. Ellen.Sherin@aptean.com