
You can do WHAT with GenomeQuest? (Almost) 101 Things You May Not Know

Published in: Science

  1. Company Confidential Do Not Distribute
     You can do WHAT with GenomeQuest? 101 things (almost) you may not know
     Steve Allen, Solutions Consultant, GQ Life Sciences, Stephen.Allen@aptean.com
     Ellen Sherin, Sr Product Manager, Ellen.Sherin@aptean.com
  2. Overview
  3. Search: Query Design
  4. Anticipatory Searching
     In text searching we try to allow for all possibilities: alternate spellings (flavor/flavour), misspellings, synonyms, and other representations:
     N76D, N-76-D, N 76 D, N/76/D, asn76asp, ASN-76-ASP, “position 76 may be asp (aspartic acid) or asn (asparagine) or it may be deleted”
     We do the same for sequence searching, but consider the ways a sequence can be represented, in both query design and result analysis.
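As a rough illustration of anticipatory text searching, the sketch below generates several spellings of one point mutation for use as keyword terms. The function name `mutation_spellings` and the separator list are assumptions made for this example; they are not part of GenomeQuest.

```python
# Standard one-letter to three-letter amino acid abbreviations.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def mutation_spellings(wild_type, position, variant, separators=("", "-", " ", "/")):
    """Return text spellings of one point mutation (hypothetical helper)."""
    spellings = []
    # First the one-letter form, then the three-letter form.
    for wt, var in ((wild_type, variant),
                    (THREE_LETTER[wild_type], THREE_LETTER[variant])):
        for sep in separators:
            spellings.append(f"{wt}{sep}{position}{sep}{var}")
    return spellings


terms = mutation_spellings("N", 76, "D")
print(terms)
```

Free-text descriptions such as “position 76 may be asp or asn” cannot be generated this way and still need separate keyword strategies.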
  5. First the Basics - Where do Sequences Come From?
     How was a sequence described by the inventor?
     • In generalities? “Any protease from a micro-organism”
     • As a cross-reference? “Genbank accession number ABC12345”
     • Was a listing filed? Part of a table? Shown in an alignment in an image?
     • Or is the whole thing very difficult to decipher?
     • Are there multiple Markush positions, represented by Xaa and described in words?
     • Are subsequences called out?
     If it’s not in the listing, or at least written out as a sequence somewhere, it’s probably not in any sequence database!
  6. Markush Sequences
     US 20150087572: In one aspect, an automatic dishwashing detergent composition comprising a variant protease of a parent protease, said parent protease amino acid sequence being identical to the amino acid sequence of SEQ ID NO:1, said variant protease of said parent protease mutations consisting of one of the following sets of mutations versus said parent protease: (i) N76D + S87R + G118R + S128L + P129Q + S130A
     The GQ Motif algorithm can find sequences IF they are present in the sequence listing…most variants ARE NOT!
  7. Variant Sequence Representation
     • Index the sequences by writing out each sequence as an explicit sequence in the database.
     • That can work for one or two positions with a limited number of substitutions; however, a sequence with six positions x 2 possibilities/position = 64 (2^6) possible sequences.
     • Increase the variability to three variations per position by adding X as an option and we now have 729 (3^6) possible sequences! Four variations? 4096 (4^6)!
     Question to ponder – if it’s impractical to write out all the explicits, what percentage of variants from any patent are present in sequence listings OR in ANY sequence database?
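The arithmetic on this slide is options-per-position raised to the number of variable positions. A minimal sketch follows; the parent sequence and substitution table are toy values invented for the example, and the function names are hypothetical.

```python
from itertools import product


def count_explicit_sequences(options_per_position, variable_positions):
    """Number of explicit sequences needed to cover every combination."""
    return options_per_position ** variable_positions


def explicit_sequences(parent, substitutions):
    """Write out every explicit variant of a parent sequence.

    substitutions maps a 1-based position to a string of allowed residues
    (include the wild-type residue if it is allowed).
    """
    choices = [substitutions.get(i, residue)
               for i, residue in enumerate(parent, start=1)]
    return ["".join(combo) for combo in product(*choices)]


print(count_explicit_sequences(2, 6))  # six positions, two options each
print(count_explicit_sequences(3, 6))  # add X as a third option
print(count_explicit_sequences(4, 6))  # four options

# Toy parent with two variable positions (invented, not SEQ ID NO:1).
toy_variants = explicit_sequences("MNSG", {2: "ND", 3: "SR"})
print(toy_variants)
```

Even this small enumeration shows why listings rarely contain every explicit variant: the count grows exponentially with each added position.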
  8. Patent Number Searches (GenomeQuest Keyword Interface)
     Retrieves records and/or sequences by patent number, so you can:
     • Create a saved, searchable virtual database;
     • Obtain sequence(s) for download and ultimate use in sequence listing, IDS prep, other molecular biology programs;
     • Review through GUI or download or both;
     • Link out to public patent sources;
     • With Platinum subscription, download full text PDF
  9. (slide has no text)
  10. Search Setup
  11. Save Your Preferred Search Settings
      Be aware of maximum results setting
  12. Make Your Own Database Three Different Ways!
      1. Upload your own sequences into GQ
      2. Through Keyword Search>browse protein (or DNA), filter as desired, and make into database
      3. From your search results via Analyses>Extract sequences>Subject sequences
      These Virtual Databases (vDBs) can be selected to search normally as if they were any regular database like GQ-Pat.
  13. 1. Upload Your Own Sequences into GQ
      • Any standard format – GENBANK, EMBL, FASTA file containing one or many sequences
      • Must be just protein or just nucleotide.
      • Will show up under MY DATA
  14. 2. Browse Database, Save as vDB
  15. Keyword Result View
  16. 3. Make vDB from Search Results
      Results must be just DNA or just protein; don’t use mixed results.
  17. CDR Query Setup
      • You can design a query to look for CDRs in isolation, or all three in a single subject, or both!
      • If you want the exact CDR sequence (or sequence with specified variations) MOTIF is the best option, and is the only algorithm that works for a single query, linking all three CDRs.
      • If you want non-specific variability, then GenePast should be used, limiting number of differences.
      • Later on we’ll see how to GROUP results to detect one subject comprising multiple queries.
      • This method is not limited to CDRs; it’s applicable for any group of subsequences contained in a single, longer sequence.
  18. MOTIF on full length – Direct Strike
      The long sequence gives hits comprising all three CDRs in the specific order provided.
      “.*” represents “any number of unspecified residues, including zero”.
      If there is even a single mismatch, or the order is incorrect, it will not be found.
      If a CDR in the database uses Xaa, it will not be found unless Xaa is specified in the query sequence as an alternative to the wt amino acid.
      >37-motif
      DLSIH.*GFDPQDGETIYAQKFQG.*GSSSSWFDP
      >9-motif
      RASQGISSWLA.*GASNLES.*QQANSFPWT
      Note the relationship between the two long sequences. There were 27 patents hit, and both sequences are present in all 27. LC and HC perhaps?
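The motif syntax on this slide happens to read like a regular expression, so its stated behavior can be mimicked with Python's `re` module. This is only an analogy under that assumption, not GenomeQuest's MOTIF implementation; the framework residues around the CDRs in the toy subjects are invented.

```python
import re

# Query copied from the slide; ".*" means any number of residues, including zero.
HEAVY_MOTIF = "DLSIH.*GFDPQDGETIYAQKFQG.*GSSSSWFDP"
CDR1, CDR2, CDR3 = "DLSIH", "GFDPQDGETIYAQKFQG", "GSSSSWFDP"


def motif_hit(motif, subject):
    """True if the subject contains the motif (regex stand-in for MOTIF)."""
    return re.search(motif, subject) is not None


# Toy subjects: the flanking residues are made up for illustration.
in_order = "QVQ" + CDR1 + "WVRQ" + CDR2 + "RVTM" + CDR3 + "WGQ"
one_mismatch = in_order.replace("GSSSSWFDP", "GSSSAWFDP")
wrong_order = "QVQ" + CDR2 + "WVRQ" + CDR1 + "RVTM" + CDR3 + "WGQ"
subject_with_x = in_order.replace("DLSIH", "DLXIH")

print(motif_hit(HEAVY_MOTIF, in_order))        # all three CDRs, correct order
print(motif_hit(HEAVY_MOTIF, one_mismatch))    # a single mismatch
print(motif_hit(HEAVY_MOTIF, wrong_order))     # CDRs out of order
print(motif_hit(HEAVY_MOTIF, subject_with_x))  # X in subject, not in query

# Allowing X explicitly as an alternative to the wild-type residue.
MOTIF_ALLOWING_X = "DL[SX]IH.*GFDPQDGETIYAQKFQG.*GSSSSWFDP"
print(motif_hit(MOTIF_ALLOWING_X, subject_with_x))
```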
  19. Methodology – Searching CDRs: All 3 CDRs in subject or patent
      Here we are searching the three CDRs in isolation. This can be done with either MOTIF or GenePast.
      Click on the intersection to see all 27 patents that contain all three queries.
      The 27 patents will contain all three CDRs; however, are they present in isolation, in a specific subject, or both?
      Extra credit – how many results will you get on the results view? 81 minimum (3 x 27)
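The patent-level intersection and the “81 minimum” figure can be reproduced with plain sets. The hit table below is invented toy data (27 hypothetical patents, each hit once by each of three queries); real result sets usually have more rows per patent.

```python
# Toy hit table of (query, patent, subject) rows. All identifiers are invented.
patents = [f"PAT{n:02d}" for n in range(1, 28)]  # 27 hypothetical patents
queries = ["CDR1", "CDR2", "CDR3"]
rows = [(q, p, f"{p}-SEQ{i}")
        for p in patents
        for i, q in enumerate(queries, start=1)]


def patents_per_query(hit_rows):
    """Map each query to the set of patents it hits."""
    by_query = {}
    for query, patent, _subject in hit_rows:
        by_query.setdefault(query, set()).add(patent)
    return by_query


by_query = patents_per_query(rows)
# The centre of the Venn diagram: patents hit by every query.
common = set.intersection(*by_query.values())

print(len(common))  # patents containing all three queries
print(len(rows))    # result rows: at least one per query per patent
```

Because the intersection is computed on patents, it says nothing about whether the three CDRs sit in the same subject sequence; in this toy table each CDR is in a different subject.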
  20. CDR Query Comments
      • We recommend searching all three (or six) CDRs as individual sequences.
      • The concatenated query is very useful for a direct strike, but shouldn’t be used exclusively, as you will miss hits to individual CDRs.
      • The concatenated query accomplishes the same thing as grouping by subject, but it’s more specific and you get a smaller number of results (1/3 as many in this case).
      Query layout: CDR1 anything CDR2 anything CDR3
  21. Degeneracy Characters are Difficult!
      • [KX] is equivalent to anything: it will retrieve K or anything else, including X, in that position.
      • Degeneracy characters in the subject are not found automatically; they have to be searched explicitly.
        – [KV] will find either K or V, but not X.
        – [GA] will find either G or A, but not R.
      • Degeneracy characters in the query are interpreted as what they represent: N = [ACGTURYKMSWBDHV], R = [GA], Y = [TUC], S = [GC], W = [ATU]
      • Always consider how an inventor might represent a sequence in the listing, and consider either using degeneracy characters (nucleotide) or including an explicit X in protein queries.
      • There’s a special way to search for that explicit X.
      • Tip: look at the query sequence in MOTIF results; it will be written out with the degeneracy characters expanded (e.g. N will be written as AGCT)
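The asymmetry described here (degeneracy codes are expanded in the query but matched literally in the subject) can be sketched with regular-expression character classes. The table below uses the standard IUPAC nucleotide codes with T only; the slide's expansions also include U and, for N, the other degeneracy codes, so treat this as a simplified assumption rather than GenomeQuest's exact behavior.

```python
import re

# Standard IUPAC nucleotide degeneracy codes (DNA alphabet, U omitted).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def query_to_regex(query):
    """Expand each degenerate base in the query into a character class."""
    parts = []
    for base in query.upper():
        expansion = IUPAC[base]
        parts.append(expansion if len(expansion) == 1 else f"[{expansion}]")
    return "".join(parts)


pattern = query_to_regex("GRT")
print(pattern)

# Subject letters are compared literally, so an explicit R in the subject
# is not matched by the expanded query class [AG].
print(bool(re.fullmatch(pattern, "GAT")))
print(bool(re.fullmatch(pattern, "GGT")))
print(bool(re.fullmatch(pattern, "GRT")))
```

To also retrieve subjects that contain the degenerate letter itself, the class would have to list it explicitly, for example `[AGR]`.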
  22. Motif Search Methods ***
  23. SNP Queries
      • Use the MOTIF algorithm to search for 100% identity to either allele.
      • Reminder: with a single mismatch anywhere, MOTIF will not find the hit!
      • GenePast/Blast are also good choices; use coordinate filters to select only those results crossing the SNP region(s).
  24. Results Overview: Intermediate Page
  25. Lots of Good Information Here
      • Correct query sequence count
      • Count of sequences that didn’t have hits
      • Is your total hit count < your max?
      • Be sure to understand the Venn – it is on the PATENT level, not subject
      • For >3 queries, you can use the Statistics Report functionality
  26. Intermediate Page
  27. Intermediate Page Analysis
      Validate results, look for fundamental issues:
      • Do I have at least one hit for each of my query sequences?
      • Repeat this overview after applying each filter set
      • Did I “max out” my results?
        – Set my max for 500 (default) and at least one query has 500 results
  28. Analysis
  29. Are All My Queries Present?
  30. You Can Have Many Analysis Views
      • Multiple views can be saved and switching between them is as simple as a mouse click
  31. Customized Views for Analysis: Numeric, Bibliographic
  32. Views are Created by DEFINE COLUMNS
  33. GenomeQuest Filters
      • The heart of GQ’s power
      • Full Boolean with nesting capabilities
      • Just like views, you can save multiple filters
      • Very flexible combinations
      • GUI allows on-the-fly changes – try it, and if you don’t like it then try something else!
        – I often use this capability to narrow down large result sets by finding a cutoff that affects the majority of results
      • The AUDIT TRAIL page (found in exported Excel files) includes the applied filters.
        – I add a screenshot of my filters and paste it into the report for readability.
  34. My Starting Point: Nested Boolean Filter
      In order to get the most out of GQ, you need to really understand the different percent identities: query, subject and alignment!
  35. Wildcarding Works! Multiple Values are “OR”
  36. Text ANDing
  37. GenePast Gap Filters
      • Huge improvement! Converted me from Blast.
      • Prior issue was that Query % ID ignored gaps, so hits with multiple gaps could show up as 100% Query ID.
  38. InDel Detection with Gap Filters
      Additional use – INDEL detection! (thanks Bjarne!)
      INDEL Type          Query Gaps   Subject Gaps   Alignment
      Insertion mutant    1            0              (image)
      Deletion mutant     0            1              (image)
      One of each         1            1              (image)
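The table above maps directly onto a small classifier. The slide lists only gap counts of 0 and 1; extending the rule to “greater than zero” is an assumption made here, and `classify_indel` is a hypothetical name.

```python
def classify_indel(query_gaps, subject_gaps):
    """Label an alignment by its gap counts, following the slide's table."""
    if query_gaps > 0 and subject_gaps > 0:
        return "One of each"
    if query_gaps > 0:
        return "Insertion mutant"  # gaps in the query only
    if subject_gaps > 0:
        return "Deletion mutant"   # gaps in the subject only
    return "No indel"


for gaps in [(1, 0), (0, 1), (1, 1), (0, 0)]:
    print(gaps, classify_indel(*gaps))
```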
  39. Can Also Display for InDel Analysis
  40. SNP Detection – Coordinate Filter
      • SNP analysis often focuses on specific position(s), therefore the overall % identities are frequently irrelevant.
      • Only those alignments that cover the region of interest will pass screening
      • Use coordinates to narrow to these regions
  41. SNP Detection – Coordinate Filters
      Example: SNP is at position 1501
      Filter for Query Start <=1500 and Query Stop >=1502
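The coordinate filter in this example amounts to a simple interval test. The sketch below assumes 1-based, inclusive query coordinates and a one-residue flank on each side, matching the slide's numbers; the function name and the toy alignments are invented.

```python
def covers_snp(query_start, query_stop, snp_position, flank=1):
    """True if the aligned query range spans the SNP plus a flank on each side."""
    return (query_start <= snp_position - flank
            and query_stop >= snp_position + flank)


# Toy alignments as (query_start, query_stop) pairs; values are invented.
alignments = [(1, 3000), (1400, 1502), (1501, 2000), (1, 1501)]
passing = [a for a in alignments if covers_snp(*a, snp_position=1501)]
print(passing)
```

Alignments that start or stop exactly at the SNP position are rejected, since they do not show the residues on both sides of it.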
  42. Viewing Alignments
  43. Grouping
      • Results can be grouped for immediate feedback
      How many families/patents/sequences pass these filters?
      Are there any hits (including SIDs) that contain multiple query sequences?
      Which subjects contain:
        – My three CDRs?
        – My unique promoter and gene?
        – Variation 1 but not Variation 2?
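The questions on this slide are set operations over hits grouped by subject. A minimal sketch on invented toy hits follows; the function names and identifiers are hypothetical, and this is not how GenomeQuest implements grouping.

```python
from collections import defaultdict


def queries_by_subject(hits):
    """Group (query, subject) hit pairs into subject -> set of queries."""
    groups = defaultdict(set)
    for query, subject in hits:
        groups[subject].add(query)
    return dict(groups)


def subjects_matching(groups, required, excluded=()):
    """Subjects hit by every required query and by none of the excluded ones."""
    required, excluded = set(required), set(excluded)
    return sorted(subject for subject, found in groups.items()
                  if required <= found and not (excluded & found))


# Toy hits; all identifiers are invented.
hits = [
    ("CDR1", "SUBJ_A"), ("CDR2", "SUBJ_A"), ("CDR3", "SUBJ_A"),
    ("CDR1", "SUBJ_B"), ("CDR3", "SUBJ_B"),
    ("VAR1", "SUBJ_C"),
    ("VAR1", "SUBJ_D"), ("VAR2", "SUBJ_D"),
]
groups = queries_by_subject(hits)

all_three_cdrs = subjects_matching(groups, ["CDR1", "CDR2", "CDR3"])
var1_not_var2 = subjects_matching(groups, ["VAR1"], excluded=["VAR2"])
print(all_three_cdrs)
print(var1_not_var2)
```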
  44. Grouping
      Find queries (or patents or families) with a disproportionate hit count
  45. A New Way to Analyze Data – GQ’s New Result Browser
      • Simplified Interface
      • Very different viewing and analysis paradigm
      • YOU CAN EXPORT ALIGNMENTS (coming right up!)
  46. Single Step Analysis of Patent, Family, UFS Distribution
  47. Unique Family Sequence: A Special Beast
      • Purpose is to show distribution of the identical sequence within a given family
      • The identical sequence may have many different UFS values. Any given UFS value is only unique for a single family.
      • THERE IS NO GUARANTEE THAT THE SEQ ID NO AND/OR PSL IS IDENTICAL FOR A GIVEN UFS throughout the family.
      • It is extremely useful for studying the distribution of a sequence hit of interest throughout a family
  48. UFS PSL, SID variability
  49. Reporting Tips & Tricks
  50. Reporting Tricks & Tips
      • Share results with other GQ users
      • Visualize subjects, patents or both containing multiple query sequences
      • View alignments adjacent to full text of claims through LifeQuest
      • IDS and ST25 Preparation
      • Export alignments
      • Family portrait, result analysis
      • Excel tips:
        – Freeze top row
        – Link back from Excel to each alignment and make that link available to other licensed GQ users
        – Prepare Excel pivot tables summarizing search results which can easily be changed to summarize results by many different parameters
  51. Share Your Results
      1. Create a folder for results
      2. Make it a shared folder
      3. Set Permissions
      4. Move Results to Folder
  52. Visualize Multiple Queries Aligned to a Single Subject
  53. Multiple Sequence Alignments
  54. View Claims Adjacent to Alignments
  55. GQ Can Help Prepare Sequence Listings
      • Export as FASTA and import into Excel (requires a little manipulation);
      • May want second tabular export for organism and molecule type.
      • Be sure to have sequences properly ordered;
      • Use Excel formulae to clean up and error check, then convert into PatentIn import format:
        <my-seq-name;moltype;organism> sequence
      • This does take some Excel skill to do right!
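For readers who would rather script this step than do it in Excel, here is a rough sketch that reads FASTA text and emits the `<name;moltype;organism>` header followed by the sequence. Placing the sequence on its own line, the metadata lookup table, and the function names are all assumptions for illustration; check the output against what PatentIn actually accepts before relying on it.

```python
def parse_fasta(text):
    """Return (name, sequence) pairs from FASTA text, in file order."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Use the first word of the header as the sequence name.
            name = (line[1:].split() or ["unnamed"])[0]
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def to_import_format(records, metadata):
    """Emit '<name;moltype;organism>' headers, each followed by its sequence.

    metadata maps name -> (moltype, organism), e.g. taken from a second
    tabular export. The exact layout is an assumption.
    """
    lines = []
    for name, sequence in records:
        moltype, organism = metadata[name]
        lines.append(f"<{name};{moltype};{organism}>")
        lines.append(sequence)
    return "\n".join(lines)


# Toy input; names, sequences and organisms are invented.
fasta_text = ">seq1 toy protein\nMKT\nAAA\n>seq2\nACGT\n"
metadata = {"seq1": ("PRT", "Homo sapiens"),
            "seq2": ("DNA", "Artificial Sequence")}
output = to_import_format(parse_fasta(fasta_text), metadata)
print(output)
```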
  56. Generating Sequence Documents for IDS Prep
      • IDSs (information disclosure statements) may be filed during prosecution, either on the initiative of the patent practitioner, or in response to an Office Action.
      • These are essentially citations, and may be journal references, patent filings, sequence documents, or any combination.
      • The sequence documents are really easy to prepare from GQ and, with minimal training, may be done by clerical workers or other assistants. No knowledge of sequences is needed.
  57. Sample Genbank-Formatted Export for IDS
      • Uses standard sequence export interface.
      • Sequences can be obtained from regular search results or by keyword search.
      • Can export multiple sequences but they will need to be broken out into individual files.
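Breaking a multi-sequence GenBank-formatted export into individual records can be scripted, because each GenBank flat-file record ends with a line containing only `//`. The sketch below assumes the export follows that convention and uses an invented two-record example; writing each record to its own file is left to the reader.

```python
def split_genbank(text):
    """Split GenBank flat-file text into one string per record."""
    records, current = [], []
    for line in text.splitlines():
        current.append(line)
        if line.strip() == "//":  # record terminator
            records.append("\n".join(current) + "\n")
            current = []
    # Keep any trailing, unterminated record rather than dropping it silently.
    if any(line.strip() for line in current):
        records.append("\n".join(current) + "\n")
    return records


def locus_name(record):
    """Return the name on the LOCUS line, or None if there is none."""
    for line in record.splitlines():
        if line.startswith("LOCUS"):
            parts = line.split()
            if len(parts) > 1:
                return parts[1]
    return None


# Toy export with two minimal records; contents are invented.
export_text = (
    "LOCUS       SEQ_A   12 bp\n"
    "ORIGIN\n"
    "        1 acgtacgtac gt\n"
    "//\n"
    "LOCUS       SEQ_B   8 bp\n"
    "ORIGIN\n"
    "        1 ttgacctg\n"
    "//\n"
)
records = split_genbank(export_text)
names = [locus_name(r) for r in records]
print(names)
```

The LOCUS name is a convenient basis for individual file names, for example `SEQ_A.gb`.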
  58. NRB - Excel Export of Alignments
      nrb export.xls
  59. Family Portrait Report
      Click on a family to see the list of patents matching your sequence
  60. Analysis Report for > 3 Queries
  61. To 400 Million and Beyond Contest!
      https://www.surveymonkey.com/r/NZ3SS5T
  62. Other Information
      Upcoming – QUARTERLY LIVE WEBINARS ON SPECIAL TOPICS – Stay tuned! You are also invited to submit a topic for consideration. Email Ellen.Sherin@aptean.com or Stephen.Allen@aptean.com with your suggestions.
      New offering – Consulting and Custom Training
      Stop by our booth at PIUG for further information!
  63. Questions? Thank You for Attending
