
GenomeQuest Master Class

GQLS User Meeting
February 22, 2018

Published in: Science

  1. User Meeting | Company Confidential Do Not Distribute. GenomeQuest Master Class. Ellen Sherin, Sr. Product Manager.
  2. Topics
     • Planning your search
     • % Identities and Algorithm Choices
     • Blast & mHSPs
     • Scoring degenerate sequences/correction factors
     • Interlude: some good tricks & tips
     • MOTIF and variable sequences
     • Questions for YOU
  3. Planning Your Search
  4. Anticipatory Searching. In text searching we try to allow for all possibilities: alternate spellings (flavor/flavour), misspellings (mispellings), synonyms, other languages, and other notations such as N76D, N-76-D, N 76 D, N/76/D, ASP76ASN, ASP-76-ASN, or “position 76 may be asp (aspartic acid) or asn (asparagine) or it may be deleted”. We do the same for sequence searching, but consider the ways a sequence can be represented, both in designing our query and in analyzing the results.
  5. First the Basics: Where Do Sequences Come From? First and foremost, how did the inventor describe the sequence?
     • As a cross-reference? “Genbank accession number ABC123”
     • In generalities? “Any protease from a micro-organism”
     • Was a listing filed? Is the sequence written out in a table?
     • Is it shown in an alignment in an image?
     • Or is the whole thing very difficult to decipher?
     • Are there multiple Markush positions, represented by Xaa and described in words?
  6. Subsequence, coordinates, or full length?
     • CDRs: are they represented as the individual CDR, or is the full LC or HC listed, with text references to the CDR positions? Or both?
     • It’s also possible that a subsequence is claimed by coordinates.
     • It boils down to anticipating whether your sequence of interest is represented in isolation or as a subsequence.
     • You won’t go wrong searching the shorter sequences and then grouping by subject with a group size of at least x (usually 3). That is the only way to find both.
  7. Markush Sequences. SNP: is the wt or the variant listed? Or both? Or just coordinates for the variation in text? Use GQ’s query/subject position filter to detect hits covering your region of interest.
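The coordinate check behind a position filter can be sketched in a few lines of Python. This only illustrates the idea and is not GenomeQuest's implementation; the function name and the example hits are invented for the sketch.

```python
def covers_position(start: int, stop: int, position: int) -> bool:
    """Return True if an alignment spanning start..stop includes `position`.

    Coordinates are treated as 1-based and inclusive. Reverse-strand hits
    may report start > stop, so the two ends are sorted before comparing.
    """
    low, high = sorted((start, stop))
    return low <= position <= high


# Hypothetical hits: name -> (query start, query stop).
hits = {"hit_a": (1, 269), "hit_b": (100, 269), "hit_c": (120, 40)}

# Keep only hits whose alignment covers position 76 of the query.
covering = [name for name, (start, stop) in hits.items()
            if covers_position(start, stop, 76)]
```

Here `covering` keeps `hit_a` and `hit_c`; `hit_b` starts after position 76 and is dropped.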
  8. Markush Sequences. Example from US 20150087572: “In one aspect, an automatic dishwashing detergent composition comprising a variant protease of a parent protease, said parent protease amino acid sequence being identical to the amino acid sequence of SEQ ID NO:1, said variant protease of said parent protease mutations consisting of one of the following sets of mutations versus said parent protease: (i) N76D + S87R + G118R + S128L + P129Q + S130A”. Sequence search algorithms can only find sequences that are in their sequence database, represented as sequences. If they aren’t in the database, they won’t be found in a search. Seems intuitive, right?
  9. ...and how do they get into sequence databases?
     • Sequences come from either sequence listings or manual curation (or both).
     • If it’s not in the listing (for filings with listings), chances are it’s not in a database associated with that patent.
     • Key areas: variants, SNPs, modified/unusual residues, chemical modifications to sequence, cyclic polypeptides.
     • There’s no way to represent many of these in a sequence database in a way that is compatible with all four of our sequence search algorithms.
  10. Why aren’t they in databases?
     • For a variant sequence, write out each variant as an explicit sequence and search them as individual query sequences.
     • That can work for one or two positions with a limited number of substitutions; however, this search request has six positions with 2 possibilities per position, which gives 2^6 = 64 possible sequences.
     • Increase the request to three variations per position by adding X as an option and we now have 3^6 = 729 query sequences! Four variations? 4^6 = 4096!
     • Neither of the above examples addresses the request for ANY variation at specific positions.
     Question to ponder: if it’s impractical to write out all the explicits, what percentage of variants from any patent are present in sequence listings OR in ANY sequence database?
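To see how quickly the explicit sequences multiply, here is a small Python sketch that enumerates every combination. The six-residue backbone is a toy stand-in, not the real SEQ ID NO:1, and the function name is made up for illustration.

```python
from itertools import product


def enumerate_variants(backbone: str, options: dict[int, str]):
    """Yield every explicit sequence for the given per-position options.

    `options` maps a 1-based position to a string of allowed residues.
    """
    positions = sorted(options)
    for combo in product(*(options[p] for p in positions)):
        residues = list(backbone)
        for position, residue in zip(positions, combo):
            residues[position - 1] = residue
        yield "".join(residues)


# Toy backbone with six variable positions (wild type or one substitution each).
backbone = "NSGSPS"
two_way = {1: "ND", 2: "SR", 3: "GR", 4: "SL", 5: "PQ", 6: "SA"}
# Adding X as a third option at every position.
three_way = {pos: residues + "X" for pos, residues in two_way.items()}

n_two = len(list(enumerate_variants(backbone, two_way)))      # 2**6
n_three = len(list(enumerate_variants(backbone, three_way)))  # 3**6
```

The counts grow as (options per position) raised to the number of positions, which is why writing out explicits stops being practical after a handful of positions.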
  11. % Identities and Algorithm Choices
  12. NCBI vs GQ % Identity. [Screenshots: GenomeQuest, NCBI]
  13. % Identity Definitions. Query % ID and subject % ID are the alignment % identity, corrected for the ratio of the alignment length to either the query or the subject length. This example assumes 100% alignment identity; the longer lines are 100 residues, the shorter lines are 50 residues.
     Alignment (diagram not reproduced) | Subject % ID | Query % ID
     1                                  | 100%         | 100%
     2                                  | 100%         | 50%
     3                                  | 50%          | 100%
     4                                  | 50%          | 50%
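The length correction described on this slide can be written as a one-line formula. The Python sketch below uses an invented function name and assumes the correction is simply alignment % ID multiplied by alignment length and divided by sequence length; the second example uses the figures from the slide 17 alignment to show that they are consistent with that assumption.

```python
def length_corrected_identity(align_pct_id: float, align_len: int, seq_len: int) -> float:
    """Alignment % identity scaled by the fraction of the sequence that is aligned."""
    return align_pct_id * align_len / seq_len


# A 50-residue subject aligned end to end, at 100% identity, inside a 100-residue query.
query_pct = length_corrected_identity(100.0, 50, 100)
subject_pct = length_corrected_identity(100.0, 50, 50)

# Slide 17 figures: 629 nt alignment at 99.36% identity, query 2003 nt, subject 735 nt.
claim_query_pct = round(length_corrected_identity(99.36, 629, 2003), 2)
claim_subject_pct = round(length_corrected_identity(99.36, 629, 735), 2)
```

In the first case the query % ID is 50 and the subject % ID is 100, so the same alignment passes or fails a filter depending on which side the filter is applied to.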
  14. High Query/Subject % ID. Blast or GenePast work equally well here. If the query sequence is short, then GenePast is preferred.
  15. High Query/Low Subject % ID. Either Blast or GenePast will work for this type of alignment, depending on how you set your GenePast filters. With the standard settings, GenePast will hit because it has a high query % ID.
  16. High Subject/Low Query % ID. This type of hit will be MISSED by GenePast with the displayed settings, because query % ID = 17% even though subject % ID = 100%.
  17. High Subject % ID in Claims. Claim 31: “A regulatory polynucleotide molecule, or any complement thereof, or any fragment thereof, or any cis element thereof, comprising a nucleic acid sequence wherein the nucleic acid sequence exhibits an 80% or greater identity to a sequence selected from the group consisting of SEQ ID NO: 1 through SEQ ID NO: 500.”
     Align len= 629 nt, Score= 612, Eval= 0.00e+0, Identity= 99.36%, Similarity= 99.36%
     Query len= 2003 nt, pos= 1376-2000 nt (fw), Identity query= 31.2%, Nb gaps query= 4, Alignment coverage query= 31.2%, HSP coverage query= 31.20%
     Subject len= 735 nt, pos= 107-735 nt (fw), Identity subject= 85.03%, Nb gaps subject= 0, Alignment coverage subject= 85.58%
  18. High Subsequence Alignment, Lower Query/Subject % ID. This is an example of a hit that would be missed by GenePast with standard filters, but found by Blast. Alignment % ID 95.6%, Query % ID 41.2%, Subject % ID 76%.
  19. GenePast Gap Filters
     • Huge improvement! Converted me from Blast.
     • GenePast Query % ID ignores gaps, so hits with multiple gaps show up as 100% Query ID and pass the % ID filters.
     • Low signal to noise ratio.
  20. Blast & mHSPs
  21. Multiple High Scoring Pairs (mHSPs). Search with a tobacco gene, filtered for Brassica family hits (alignment from SciFinder).
  22. mHSP Hit
  23. mHSP Analysis
     Query % HSP Cov. | Query % Id | Align % Id | Align. length | Subj. start | Subj. stop | Query start | Query stop
     96.63            | 66.69      | 99.08      | 980           | 533         | 1512       | 994         | 15
     96.63            | 29.95      | 99.54      | 438           | 71          | 508        | 1456        | 1019
     What is the % identity of this result? Is the % identity used for screening & claims analysis:
     • 96.63% (the sum of the two HSPs)?
     • 99% (alignment % ID)?
     • Separately 66.69% or 29.95% (the two individual query % IDs)?
     This example was taken from an actual Office Action. The Examiner found this hit sequence through NCBI Blast, which treated it as a single alignment rather than two, so he took the overall alignment % identity, which was 96.6%.
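The "sum of the two HSPs" reading can be checked with a short Python sketch. It assumes that per-HSP query % IDs may simply be added as long as the HSPs do not overlap on the query; the class and function names are invented, and this is not how GenomeQuest computes its Query % HSP coverage column.

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    query_start: int
    query_stop: int
    query_pct_id: float

    @property
    def query_span(self) -> tuple[int, int]:
        # Reverse-strand HSPs report start > stop, so normalise the order.
        low, high = sorted((self.query_start, self.query_stop))
        return low, high


def summed_query_pct_id(hsps: list[Hsp]) -> float:
    """Add per-HSP query % IDs, refusing to double count overlapping HSPs."""
    for i, first in enumerate(hsps):
        for second in hsps[i + 1:]:
            a_low, a_high = first.query_span
            b_low, b_high = second.query_span
            if a_low <= b_high and b_low <= a_high:
                raise ValueError("HSPs overlap on the query; a plain sum would double count")
    return sum(h.query_pct_id for h in hsps)


# The two HSPs from the table above (query coordinates and query % ID).
hsps = [Hsp(994, 15, 66.69), Hsp(1456, 1019, 29.95)]
total = summed_query_pct_id(hsps)
```

The sum of the two rounded values is 96.64, which differs from the reported 96.63 in the last digit, presumably because the report rounds after summing.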
  24. Including mHSPs in Results. Be sure to check the “GQ HSP handling” box when setting up a Blast search. [Screenshot: sample mHSP view]
  25. Identifying mHSPs in Search Results
     • Group by Blast HSPs for identification.
     • Filter by Query % HSP coverage in place of Query % ID.
     • Bonus points: why not just group by Subject Identifier?
  26. Identifying mHSPs in EXCEL Reports. Requirements:
     • Query % HSP coverage > query % ID
     • Query sequence has multiple hits to the subject sequence, with roughly sequential coordinates in either the query or the subject (preferably both)
     For review purposes, consider the Query % HSP Coverage value roughly equivalent to Query % ID.
     Identifier   | Subj. % Id | Query % Id | Align % Id | Query % HSP Cov.
     mHSP-subject | 11.77      | 52.70      | 99.58      | 99.78
     mHSP-subject | 2.41       | 10.78      | 100.00     | 99.78
     mHSP-subject | 2.85       | 12.76      | 100.00     | 99.78
     mHSP-subject | 1.74       | 7.81       | 100.00     | 99.78
     mHSP-subject | 1.92       | 8.58       | 100.00     | 99.78
     mHSP-subject | 1.89       | 8.47       | 100.00     | 99.78
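Outside Excel, the first requirement and the "multiple hits" part of the second can be checked with a small Python sketch. The row layout and function name are assumptions for illustration, the `single-subject` row is invented, and the check for roughly sequential coordinates is left out because the table above carries no coordinates.

```python
from collections import defaultdict


def find_mhsp_subjects(rows: list[dict], min_hits: int = 2) -> list[str]:
    """Return identifiers with several hits whose HSP coverage exceeds query % ID."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["identifier"]].append(row)
    flagged = []
    for identifier, group in groups.items():
        has_multiple_hits = len(group) >= min_hits
        coverage_exceeds_id = all(r["query_hsp_cov"] > r["query_pct_id"] for r in group)
        if has_multiple_hits and coverage_exceeds_id:
            flagged.append(identifier)
    return flagged


# Query % ID values from the table above, plus one invented single-HSP hit.
rows = [{"identifier": "mHSP-subject", "query_pct_id": q, "query_hsp_cov": 99.78}
        for q in (52.70, 10.78, 12.76, 7.81, 8.58, 8.47)]
rows.append({"identifier": "single-subject", "query_pct_id": 98.0, "query_hsp_cov": 98.0})

flagged = find_mhsp_subjects(rows)
```

Only `mHSP-subject` is flagged; the single-HSP hit has equal coverage and query % ID, so it is left alone.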
  27. Great mHSP Example. Query: vector. GenePast found alignment 1, but not 2.
  28. Degeneracy Mismatch Corrections
  29. Scoring Degenerate Sequences
     • Blast and GenePast score some types of degenerate matches as mismatches.
     • Unless you are aware of this and correct for it, you may think the % identity is lower than it really is.
  30. Variable (Degenerate) Residue Scoring. These sequences are 100% identical; however, the query and alignment % ID are reported as 97.01%.
  31. Degeneracy Match Analysis. Column headings as extracted from the slide: Genome Quest, NCBI, STN; BLAST, BLAST, GenePast, BLAST Through GUI. Each row below lists the four result columns in slide order.
     Degeneracy character             | (1) | (2) | (3) | (4)
     "Any character" variables
       X vs. X                        | No  | No  | Yes | Yes
       N vs. N                        | No  | No  | Yes | Yes
       X vs. B                        | No  | No  | No  | No
       N Substitutions                | No  | No  | No  | No
     IUPAC characters vs themselves
       Y vs. Y                        | Yes | Yes | Yes | Yes
       W vs. W                        | Yes | Yes | Yes | Yes
     IUPAC degeneracy substitutions
       Y vs. C                        | No  | No  | No  | No
       A vs. N                        | No  | No  | No  | No
       AG vs. NN                      | No  | No  | No  | No
       G vs. B                        | No  | No  | No  | No
       T vs. W                        | No  | No  | No  | No
  32. What is the Solution?
     • If the query contains a relatively high number of variable characters, and/or the % identity is close to the claimed value, a correction factor can be used.
     • Keep in mind, this assumes ALL DEGENERACY CHARACTERS match the hit and the alignment covers all areas with degenerate characters.
     • The true % identity may be lower.
     • If the hit is of interest, the alignment will need to be reviewed and the correction factor adjusted for mismatches.
     • Excel auto-correct is your friend!
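One way to picture the correction factor is the Python sketch below. It assumes the reported % identity counts every degenerate position as a mismatch and that all of those positions really do match the hit, which is the optimistic case the slide warns about. The function name is invented, and the alignment length of 67 with 2 degenerate positions is an assumed example chosen because 65/67 rounds to 97.01%; the slides do not state those numbers.

```python
def corrected_pct_identity(reported_pct: float, align_len: int, degenerate_positions: int) -> float:
    """Upper-bound % identity if every degenerate position is really a match."""
    # Recover the number of positions the algorithm scored as matches.
    scored_matches = round(reported_pct / 100 * align_len)
    # Add the degenerate positions back, never exceeding the alignment length.
    corrected_matches = min(scored_matches + degenerate_positions, align_len)
    return 100 * corrected_matches / align_len


# Assumed example: 67-residue alignment, 2 degenerate positions scored as mismatches.
corrected = corrected_pct_identity(97.01, 67, 2)
```

Because the result is an upper bound, a hit that only passes a claimed threshold after correction still needs its alignment reviewed by eye.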
  33. Interlude for Tricks & Tips
  34. Some Quick Tricks
     • Excel auto-correct function for embedding formulae
     • Adding sequences to EXCEL reports
     • Using gap filters to find indels
     • DDBJ Seq ID NO awareness
  35. Excel Auto-Correct
     • OPTIONS in Windows Excel
     • PREFERENCES in Mac
     Make up a name for your formula; I start mine with a # for simplicity. Add it to auto-correct. Thanks, Anne Bulow-Find!
  36. Adding Sequence to EXCEL Reports
     Identifier         | Patent Id       | Patent family ID | Patent sequence location    | Sequence
     US20150203574-0006 | US20150203574A1 | 41431517         | claim: 1                    | EIFHSGSTNYNPSLKS
     US8846037-0006     | US8846037B2     | 41431517         | claim: 1; 5; 22; 23; 28; 29 | EIFHSGSTNYNPSLKS
     US8436158-0006     | US8436158B2     | 41431517         | claim: 1; 11; 20            | EIFHSGSTNYNPSLKS
     US20130280266-0006 | US20130280266A1 | 41431517         | claim: 1; 4                 | EIFHSGSTNYNPSLKS
  37. Process
     • Export normally to Excel. Be certain to include a column for IDENTIFIER.
     • Export a FASTA file with the corresponding sequences.
  38. Data Transformation & Integration
     • Paste the FASTA file into Word and transform it so you have the equivalent of one column with the identifier and a second with the corresponding sequence. (This involves replacing ^p with a delimiter like %, then %> with ^p>.)
     • Paste this into Excel and use text to columns, delimited by %.
     • Use the Excel VLOOKUP function, matching on IDENTIFIER.
     Identifier          | Sequence
     >US20130280266-0006 | EIFHSGSTNYNPSLKS
     >US20150203574-0006 | EIFHSGSTNYNPSLKS
     >US8436158-0006     | EIFHSGSTNYNPSLKS
     >US8846037-0006     | EIFHSGSTNYNPSLKS
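The same join can be done without Word and VLOOKUP. The Python sketch below is an alternative to the slide's Excel workflow, not part of it; the function names and the in-memory row layout are assumptions, and a real export would be read from the FASTA and spreadsheet files instead.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Map each FASTA identifier (without the leading '>') to its sequence."""
    sequences: dict[str, str] = {}
    identifier = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            identifier = line[1:].split()[0]
            sequences[identifier] = ""
        elif identifier is not None:
            # Sequence lines may wrap, so append to the current record.
            sequences[identifier] += line
    return sequences


def add_sequences(rows: list[dict], sequences: dict[str, str]) -> list[dict]:
    """Attach a Sequence column to each report row, matching on Identifier."""
    return [{**row, "Sequence": sequences.get(row["Identifier"], "")} for row in rows]


fasta_text = """>US20130280266-0006
EIFHSGSTNYNPSLKS
>US8436158-0006
EIFHSGSTNYNPSLKS
"""
report_rows = [
    {"Identifier": "US8436158-0006", "Patent Id": "US8436158B2"},
    {"Identifier": "US20130280266-0006", "Patent Id": "US20130280266A1"},
]
merged = add_sequences(report_rows, parse_fasta(fasta_text))
```

Rows whose identifier is missing from the FASTA file get an empty Sequence value, which makes gaps easy to spot.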
  39. Final Product
     Identifier         | Patent Id       | Patent family ID | Patent sequence location    | Sequence
     US20150203574-0006 | US20150203574A1 | 41431517         | claim: 1                    | EIFHSGSTNYNPSLKS
     US8846037-0006     | US8846037B2     | 41431517         | claim: 1; 5; 22; 23; 28; 29 | EIFHSGSTNYNPSLKS
     US8436158-0006     | US8436158B2     | 41431517         | claim: 1; 11; 20            | EIFHSGSTNYNPSLKS
     US20130280266-0006 | US20130280266A1 | 41431517         | claim: 1; 4                 | EIFHSGSTNYNPSLKS
  40. INDEL Detection
  41. InDel Detection with Gap Filters. Additional use for the gap filters: INDEL detection! Thanks, Bjarne Due Larsen!
     INDEL Type       | Query Gaps | Subject Gaps | Alignment (diagram not reproduced)
     Insertion mutant | 1          | 0            |
     Deletion mutant  | 0          | 1            |
     One of each      | 1          | 1            |
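The gap-count table translates directly into a small classifier. This Python sketch uses an invented function name and generalises the slide's "1 gap" rows to "one or more gaps", which is an assumption.

```python
def classify_indel(query_gaps: int, subject_gaps: int) -> str:
    """Classify a hit from its gap counts, following the table above."""
    if query_gaps > 0 and subject_gaps > 0:
        return "one of each"
    if query_gaps > 0:
        # Gaps in the query mean the subject carries extra residues.
        return "insertion mutant"
    if subject_gaps > 0:
        # Gaps in the subject mean the subject is missing residues.
        return "deletion mutant"
    return "no indel"


labels = [classify_indel(q, s) for q, s in [(1, 0), (0, 1), (1, 1), (0, 0)]]
```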
  42. Can Also Display for InDel Analysis
  43. DDBJ Seq ID NO Issues
  44. Inaccurate DDBJ SEQ ID Nos.
     • JPO has misassigned SIDs in listing data sent to DDBJ.
     • ~50% of the records originating in the JPO were involved.
     • The error was propagated to EMBL and GenBank via datafeeds.
     • GeneSeq and CAS Registry index manually, so they were not affected.
     • The error was initially discovered in GQ-Pat.
     • GenomeQuest put compensatory measures in place.
     • Most affected records are flagged in the COMMENTS field with the statement “The direct source for this document is not JPO”.
     • We’ve used our knowledge of this issue to correct many of the affected records.
     • Older records (pre-2012 or so) should be viewed with some suspicion.
  45. JPO Sequence Re-Ordering
     1 DNA | 2 PRT | 3 DNA | 4 PRT
     1 DNA | 2 DNA | 3 PRT | 4 PRT
  46. Corrected DDBJ Record
     LOCUS       HW293343   1063 bp   DNA   linear   PAT 29-OCT-2013
     DEFINITION  JP 2013522287-A/3: ADJUVANTED VACCINES FOR SEROGROUP B MENINGOCOCCUS.
     ACCESSION   HW293343
     VERSION     HW293343.1
     KEYWORDS    JP 2013522287-A/3.
     SOURCE      Neisseria meningitidis
       ORGANISM  Neisseria meningitidis
                 Bacteria; Proteobacteria; Betaproteobacteria; Neisseriales; Neisseriaceae; Neisseria.
     REFERENCE   1 (bases 1 to 1063)
       AUTHORS   Rappuoli,R., O'hagan,D. and Pallaoro,M.
       TITLE     ADJUVANTED VACCINES FOR SEROGROUP B MENINGOCOCCUS
       JOURNAL   Patent: JP 2013522287-A 3 13-JUN-2013;
                 NOVARTIS AG
     COMMENT     OS   Neisseria meningitidis
                 PN   JP 2013522287-A/3
                 PD   13-JUN-2013
                 PF   18-MAR-2011 JP 2012557654
                 PR   18-MAR-2010 US 61/315336 ,25-MAR-2010 US 61/317572
                 PA   NOVARTIS AG
                 PI   rino rappuoli,derek o'hagan,michele pallaoro
                 PT   "ADJUVANTED VACCINES FOR SEROGROUP B MENINGOCOCCUS"
                 PS   N28
  47. How to Recognize Affected Records
     • GQ-Pat: Search the COMMENTS field for records containing the string JPO. That’s the flag that these records were not prepared from a sequence listing file and need manual verification. Example: “The direct source for this document is not JPO. Copyright (c) GenomeQuest, Inc. 2017”. The SEQUENCE NOTES field contains the DDBJ PS field content.
     • EMBL: Look for the PN field and a JP publication number with a /n (number) on the end, for example: CC PN JP 2009534032-A/2
     • GENBANK: Look for the same format PN field as EMBL, but without the CC prefix, for example: PN JP 2009534032-A/2
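If you have record text outside the GQ interface, the same two checks can be approximated in Python. The regular expression and function names are assumptions based only on the example strings shown in these slides, and real records may format the PN field differently.

```python
import re

# Matches a PN field holding a JP publication number that ends in /n,
# e.g. "PN JP 2009534032-A/2", with or without a leading "CC".
PN_FIELD = re.compile(r"\bPN\s+(JP\s+\d+-\s?[A-Z]\d?/\d+)")


def has_jpo_flag(comments: str) -> bool:
    """GQ-Pat style check: does the COMMENTS text contain the string JPO?"""
    return "JPO" in comments


def jp_publication_with_seq_number(record_text: str):
    """Return the JP publication number with its /n suffix, or None if absent."""
    match = PN_FIELD.search(record_text)
    return match.group(1) if match else None


embl_style = jp_publication_with_seq_number("CC PN JP 2009534032-A/2")
genbank_style = jp_publication_with_seq_number("COMMENT OS Neisseria meningitidis PN JP 2013522287-A/3")
```

Both calls return the publication number with its trailing sequence number; a record without a PN field returns `None`.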
  48. Variant Sequences in More Depth: MOTIF
  49. When is a residue letter just a letter?
     • MOTIF is Unix GREP in biological clothing. Behind-the-scenes code reinterprets letters according to their IUPAC definition.
     • MOTIF understands the letters, but only as letter + allowable substitutions. It is not “bio-savvy” the way Blast is.
     • Once you understand the notation, you can design your queries appropriately.
  50. MOTIF auto-expands degeneracy characters. Expansion uses the IUPAC degeneracy table.
  51. Understanding MOTIF matching
     • Degeneracy characters in the query sequence match themselves and all their substitutions; e.g. Y matches Y, C, or T.
     • The converse is NOT true: C or T will NOT match Y in a subject.
     • In proteins, X in the query matches anything; however, to find X in the subject you have to “back your way in” by excluding anything except X and any other desired substitutions.
     MOTIF requires 100% identity.
  52. Degeneracy Characters are Difficult!
     • [KX] is equivalent to anything: it will retrieve K or anything else, including X, in that position.
     • Degeneracy characters in the subject are not found automatically; they have to be searched explicitly.
        - [KV] will find either K or V, but not X.
        - [GA] will find either G or A, but not R.
     • Degeneracy characters in the query are interpreted as what they represent: [NACGTURYKMSWBDHV][RGA][YTUC][SGC][WATU]
     • Always consider how an inventor might represent a sequence in the listing, and consider either using degeneracy characters (nucleotide) or including an explicit X in protein queries.
  53. Substitution Example
     • For nucleic acids, K = G/T (or U).
     • An AGCTAKA query is interpreted as AGCTAGA, AGCTATA, AGCTAKA. It will match any of those subjects, but it will not match AGCTANA.
     • AGCTAKA as a subject will only match an identical query. It will NOT match AGCTAGA or AGCTATA.
     https://www.bioinformatics.org/sms/iupac.html
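The one-directional matching can be imitated with ordinary regular expressions. The Python sketch below is not MOTIF's code; it assumes each nucleotide degeneracy character in the query expands to itself plus its IUPAC bases (with U added wherever T is allowed), that N expands to the full class shown on the previous slide, and that plain letters match only themselves. The function names are invented.

```python
import re

# Standard IUPAC nucleotide degeneracy codes (N is handled separately below).
IUPAC_BASES = {
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def expand(code: str) -> str:
    """Letters a query character is allowed to match in a subject."""
    if code == "N":
        return "NACGTURYKMSWBDHV"
    bases = IUPAC_BASES.get(code)
    if bases is None:
        return code  # a plain letter matches only itself
    return code + bases + ("U" if "T" in bases else "")


def motif_to_regex(query: str) -> str:
    parts = []
    for code in query.upper():
        allowed = expand(code)
        parts.append(f"[{allowed}]" if len(allowed) > 1 else re.escape(allowed))
    return "".join(parts)


def motif_match(query: str, subject: str) -> bool:
    """True if the expanded query occurs anywhere in the subject."""
    return re.search(motif_to_regex(query), subject.upper()) is not None


results = {s: motif_match("AGCTAKA", s) for s in ("AGCTAGA", "AGCTATA", "AGCTAKA", "AGCTANA")}
converse = motif_match("AGCTAGA", "AGCTAKA")
```

The query AGCTAKA finds the G, T and K subjects but not the N subject, and the explicit query AGCTAGA does not find a subject containing K, which is the asymmetry described above.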
  54. Finding X in Subjects is NOT Obvious. The only way to find subject sequences containing X is to “not out” other letters. [Screenshot callout: “The same as X”]
  55. Complex Variant Query File
     > GG36-wt
     VAGTIAALnNSIGVLGVAPsAELYAV.*WAGNNgMH.*LGSPspsATLEQAV
     > GG36-wtVar (won’t find X)
     VAGTIAAL[ND]NSIGVLGVAP[SR]AELYAV.*WAGNN[GR]MH.*LGSP[SL][PQ][SA]ATLEQAV
     > GG36-wtVarX
     VAGTIAAL[^acefghiklmopqrstuvwy]NSIGVLGVAP[^acdefghiklmnopqtuvwy]AELYAV.*WAGNN[^acdefhijklmnopqstuvwy]MH.*LGSP[^acdefghijkmnopqrtuvwy][^acdefghiklmnorstuvwy][^cdefghijklmnopqrtuvwy]ATLEQAV
     > GG36-X
     VAGTIAALXNSIGVLGVAPXAELYAV.*WAGNNXMH.*LGSPXXXATLEQAV
     > GG36-notVar
     VAGTIAAL[^D]NSIGVLGVAP[^R]AELYAV.*WAGNN[^R]MH.*LGSP[^L][^Q][^A]ATLEQAV
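The "not out" classes in the wtVarX query are tedious to type by hand, so here is a hedged Python sketch that builds one. The helper name is invented, it emits upper-case classes and tests them with Python's `re` module rather than MOTIF, and unlike the hand-written classes above it excludes every letter of the alphabet that is not explicitly allowed.

```python
import re
import string


def allow_only(allowed: str) -> str:
    """Build a negated class that rejects every letter except `allowed` and X."""
    keep = set(allowed.upper()) | {"X"}
    excluded = "".join(c for c in string.ascii_uppercase if c not in keep)
    return f"[^{excluded}]"


# First variant position of the example: wild-type N, variant D, or X.
position_76 = allow_only("ND")
pattern = re.compile("VAGTIAAL" + position_76 + "NSIGV")

found = {residue: bool(pattern.search(f"VAGTIAAL{residue}NSIGV")) for residue in "NDXK"}
```

N, D and X at the variable position are found, while K is rejected, which is the behaviour the wtVarX query is after.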
  56. Boolean Logic
     • GG36-wtVar (25 hits) NOT GG36-wt (20 hits) = 5 hits containing at least one variant position
     • GG36-X (125 hits) NOT GG36-notVar (100 hits) = 25 hits containing at least one variant position
     Note: You can replace GG36-wtVar with GG36-wtVarX if you consider X part of the narrower answer set (I would!)
     [Diagram: 20 wt + 5 var; 100 notvar + 25 var]
  57. Narrowing
     Broader results (NotVar identifiers removed; leaves results with at least one variant position)
     • GG36-X (4473) minus GG36-notVar (4308) = 165 hits having anything (other than wt) in at least one variant position
     Narrower results (wild type identifiers removed)
     • GG36-wtVarX (4103) minus GG36-wt (3932) = 171 hits having at least one variant OR one X position.
     • GG36-wtVar (4095) minus GG36-wt (3932) = 163 hits having at least one variant position.
     • Eight hits have at least one X in a stated variant position.
     ✓ GG36-X (165) minus GG36-wtVar (163) = 2 results with any variations in at least one specified position (broader)
     ✓ GG36-wtVarX (171) contains all results with either X or a requested variation in at least one specified position.
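The NOT logic is plain set subtraction on result identifiers. The Python sketch below uses a handful of invented identifiers (P1 to P4) in place of real answer sets, just to show how the narrower results fall out.

```python
# Hypothetical identifiers returned by three of the MOTIF queries.
wt_hits = {"P1", "P2"}                    # GG36-wt: wild type at every position
wt_var_hits = wt_hits | {"P3"}            # GG36-wtVar: wild type or a listed variant
wt_var_x_hits = wt_var_hits | {"P4"}      # GG36-wtVarX: also allows X

# Removing the wild-type identifiers leaves hits with at least one variant.
variant_hits = wt_var_hits - wt_hits

# The same subtraction on the X-aware query also keeps X-containing hits.
variant_or_x_hits = wt_var_x_hits - wt_hits

# Hits that are present only because of an X in a variant position.
x_only_hits = variant_or_x_hits - variant_hits
```

With the real counts the same arithmetic gives 4095 - 3932 = 163, 4103 - 3932 = 171, and 171 - 163 = 8 hits that contain an X in a stated variant position.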
  58. Text Searches
     • There are many, many un-indexed variant sequences in patents that can’t be found by sequence searching. Text searching for the variations (with scrupulous attention to backbone numbering) is required to supplement MOTIF searching.
     • Text search the remaining PNs from the Xvar, after removing all PNs already reported from the sequence search, using the variant notation and “AND in” the backbone name:
        - (N76D or S87R or G118R or S128L or P129Q or S130A) and GG36
     • Refer back to the sequence search results to check any claimed % identities vs wt or variants (“anything having at least 90% identity to SEQ ID NO 2”, for example); filter the GQ results for that PN/SID to check.
     • Finally, do a separate “text only” search and not out all the PNs you’ve already screened during the previous steps.
  59. Acknowledgements: Steve Allen, Steven Altman, Anne Bulow-Find, Heidi Madsen, Bjarne Due Larsen, Bob March, Joan Odell, Mary Jane Reeve, Man Wu, Denis Bayada, Danyu Wu, Henk Heus
  60. Audience Questions to Ponder
     • What topics would you like to see covered in quarterly webinars?
     • What new products/product features would you like to see in GQ and LQ?
     • Are you interested in participating in a potential product advisory board?
     • Anything else you want to share? Talk to us, we’re listening!
     NETWORKING SESSION NEXT: STAY AND CHAT!
     Ellen.Sherin@aptean.com, Stephen.Allen@aptean.com, sales@gqlifesciences.com
  61. Questions? (camera off) Thank You for Attending. Ellen.Sherin@aptean.com