
II-SDV 2015, 20 - 21 April, in Nice


  1. Integrated Keyword and Biological Sequence Searching in the Life Sciences. Richard Resnick, CEO. II-SDV 2015, Nice, France.
  2. KEYWORD SEARCHING IN THE LIFE SCIENCES IS CHALLENGING: How do you spell “somatostatin”? Ala-Gly-Cys-Lys-Asn-Phe-Phe-Trp-Lys-Thr-Phe-Thr-Ser-Cys somato* AND (Mus musculus OR mouse) TGAACCTCACAGC ATGGAGCCCCTCT CTTTGGCTTCCAC ACCTAGCTGGAAT GCCTCAGCTGCT 100%/4.2%/100% is not aa
  3. GOAL: HIGHLY RELEVANT RESULTS FROM BROAD PATENT AUTHORITY COVERAGE. [Bubble chart. Axes: relevance of results to life sciences; completeness of patent authority coverage. Size of bubble corresponds to the number of hits returned.]
  4. 4. SEQUENCE SEARCHING PRESENTS CHALLENGES CCCTCCATCATTTCACCATCCACACTCATAATAATCATATATATTCATCAATCATCTATATAAGTAGTGGCAGGAGCAATGAGAGGGAGG GTTTCTCCACTGATGCTGTTGCTAGGGATCCTTGTCCTGGCTTCAGTTTCTGCAACGCATGCCAAGTCATCACCTTACCAGAAGAAAACA GAGAACCCCTGCGCCCAGAGGTGCCTCCAGAGTTGTCAACAGGAACCGGATGACTTGAAGCAAAAGGCATGCGAGTCTCGCTGCACCAAG CTCGAGTATGATCCTCGTTGTGTCTATGATCCTCGAGGACACACTGGCACCACCAACCAACGTTCCCCTCCAGGGGAGCGGACACGTGGC CGCCAACCCGGAGACTACGATGATGACCGCCGTCAACCCCGAAGAGAGGAAGGAGGCCGATGGGGACCAGCTGGACCGAGGGAGCGTGAA AGAGAAGAAGACTGGAGACAACCAAGAGAAGATTGGAGGCGACCAAGTCATCAGCAGCCACGGAAAATAAGGCCCGAAGGAAGAGAAGGA GAACAAGAGTGGGGAACACCAGGTAGCCATGTGAGGGAAGAAACATCTCGGAACAACCCTTTCTACTTCCCGTCAAGGCGGTTTAGCACC CGCTACGGGAACCAAAACGGTAGGATCCGGGTCCTGCAGAGGTTTGACCAAAGGTCAAGGCAGTTTCAGAATCTCCAGAATCACCGTATT GTGCAGATCGAGGCCAAACCTAACACTCTTGTTCTTCCCAAGCACGCTGATGCTGATAACATCCTTGTTATCCAGCAAGGTATCAAATCT AATTCTATTCTAAACTACATATATTTTGTTGCTTGATACATATGATTCATTGGATTGCAGGGCAAGCCACCGTGACCGTAGCAAATGGCA ATAACAGAAGAGCTTTAATCTTGACGAGGGCCATGCACTCAGAATCCCATCCGTTTCATTTCCTACATCTTGACGACATGACACCAGAAC TCAGAGTAGCTAAATCTCATGCCGTTAACACACCCGGCCAGTTTGAGGTAGGTACCTCTTTCTTCTCACATATATATTCAATTCTCAATT ATCATCTTACATGTTGTGGGTGTTGCTTCACAGGATTTCTTCCCGGCGAGCAGCCGAGACCAATCATCCTACTTGCAGGGATTCAGCAGG AATACTTTGGAGGCCGCCTTCAATGTAAGCAAATGTGTCATAATTATGGAATTAAAAGAACGATCATGTTATAAACTTATAATATATATA TACATAGGCGGAATTCAATGAGATACGGAGGGTGCTGTTAGAAGAGAATGCAGGAGGTGAGCAAGAGGAGAGAGGGCAGAGGCGATGGAG TACTCGGAGTAGTGAGAACAATGAAGGAGTGATAGTCGAAGTGTCAAAGGAGCACGTTGAAGAACTTACTAAGCACGCTAAATCCGTCTC AAAGAAAGGCTCCGAAGAAGAGGGAGATATCACCAACCCAATCAACTTGAGAGAAGGCGAGCCCGATCTTTCTGACAACTTTGGGAGGTT ATTTGAGGTGAAGCCAGACAAGAAGAACCCCCAGCTTCAGGACCTGGACATGATGCTCACCTGTGTAGAGATCAAAGAAGGAGCTTTGAT GCTCCCACACTTCAACTCAAAGGCCATGGTCATCGTCGTCATCAACAAAGGAACTGGAAACCTTGAACTCGTAGCTGTAAGAAAAGAGCA ACAACAGAGGGGACGGCGGGAACAAGAGTGGGAAGAAGAGGAGGAAGATGAAGAAGAGGAGGGAAGTAACAGAGAGGTGCGTAGGTACAC AGCGAGGTTGAAGGAAGGCGATGTGTTCATCATGCCAGCAGCTCATCCAGTAGCCATCAACGCTTCCTCCGAACTCCATCTGCTTGGCTT 
CGGTATCAACGCTGAAAACAACCACAGAATCTTCCTTGCAGGTGATAAGGACAATGTGGTAGACCAGATAGAGAAGCAAGCGAAGGATTT AGCATTCCCTGGTTCGGGTGAACAAGTTGAGAAGCTCATCAAAAACCAGAGGGAGTCTCACTTTGTGAGTGCTCGTCCTCAATCTCAATC TCCGTCGTCTCCTGAAAAAGAGGACCAAGAGGAGGAAAACCAGGGAGGGAAGGGTCCACTCCTTTCAATTTTGAAGGCTTTTAACTGAGA ATGGAGGAAACTTGTTATGTATCCATAATAAGATCACGCTTTTGTAATCTACTATCCAAAAACTTATCAATAAATAAAAACGTTTGTGCG TTGTTTCTCCAAGAAATACGGGTGGCGCTTATGGTTGTTTATTTATACGAAACTAATTAAATACATCATAACGGCAACGACCTCTTATTT TGTAATTTTCTT   BLAST? 90% ID? Do I want total query coverage or total subject coverage? Global alignment? What word size? How do my sequence hits relate to my text search results? Fragment? Motif?
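The questions on slide 4 ("90% ID? Do I want total query coverage or total subject coverage?") largely come down to which length the match count is divided by. Below is a minimal Python sketch, assuming an alignment has already been computed and is supplied as two equal-length gapped strings. The function name `identity_metrics` and its parameters are illustrative and are not taken from any tool named in the deck.

```python
def identity_metrics(aligned_query, aligned_subject, query_len, subject_len):
    """Percent identity of one pairwise alignment, normalized three ways.

    aligned_query / aligned_subject: equal-length strings, "-" marks a gap.
    query_len / subject_len: full (unaligned) sequence lengths.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned strings must have the same length")
    if not aligned_query or query_len <= 0 or subject_len <= 0:
        raise ValueError("lengths must be positive")
    # A column counts as a match only when both residues are equal and not gaps.
    matches = sum(
        1 for q, s in zip(aligned_query, aligned_subject)
        if q == s and q != "-"
    )
    return {
        "matches": matches,
        "pct_of_alignment": 100.0 * matches / len(aligned_query),
        "pct_of_query": 100.0 * matches / query_len,
        "pct_of_subject": 100.0 * matches / subject_len,
    }


# A 13-nt query matching perfectly inside a hypothetical 130-nt subject.
m = identity_metrics("TGAACCTCACAGC", "TGAACCTCACAGC", query_len=13, subject_len=130)
print(m["pct_of_query"], m["pct_of_subject"])  # 100.0 10.0
```

The same hit scores 100% over the query and 10% over the subject, which is why a threshold such as "90% ID" needs to state its denominator.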
  5. KEYWORD SEARCHING IN THE LIFE SCIENCES PRESENTS CHALLENGES: pn:EP* AND somato*^5 AND [mus musculus] AND clm:[transgenic animal ~3] AND pd:[19950101 TO 20140215] How do my text search results relate to my sequence hits? How do I figure out this system’s query syntax? What if a keyword is misspelled in a patent claim? How can I exclude patents unrelated to my domain easily? How do I build and maintain reliable synonym lists? Can I be sure that all of the documents I need to review exist in the underlying database?
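To make the slide 5 query easier to read, here is a small Python sketch that splits it into clauses and labels the parts. The interpretation is an assumption based on common Lucene-style conventions, not on documentation for the system shown: `field:` prefixes (`pn`, `clm`, `pd` are presumably publication number, claims, and publication date), `^N` as a term boost, `~N` as a proximity distance, and `[A TO B]` as a range. The `Clause` class and the parser functions are hypothetical helpers.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Clause:
    field: Optional[str]                       # e.g. "pn", "clm", "pd"; None if no prefix
    term: str                                  # the search term with modifiers stripped
    boost: Optional[int] = None                # from a trailing ^N
    proximity: Optional[int] = None            # from a trailing ~N inside brackets
    date_range: Optional[Tuple[str, str]] = None  # from [YYYYMMDD TO YYYYMMDD]


CLAUSE_RE = re.compile(r"^(?:(?P<field>[a-z]+):\s*)?(?P<body>.+)$")


def parse_clause(text):
    m = CLAUSE_RE.match(text.strip())
    if m is None:
        raise ValueError("empty clause")
    body = m.group("body").strip()
    clause = Clause(field=m.group("field"), term=body)
    boost = re.search(r"\^(\d+)$", body)
    if boost:
        clause.boost = int(boost.group(1))
        clause.term = body[:boost.start()]
    bracket = re.fullmatch(r"\[(.*)\]", clause.term)
    if bracket:
        inner = bracket.group(1).strip()
        rng = re.fullmatch(r"(\d{8}) TO (\d{8})", inner)
        prox = re.search(r"\s*~(\d+)$", inner)
        if rng:
            clause.date_range = (rng.group(1), rng.group(2))
        elif prox:
            clause.proximity = int(prox.group(1))
            inner = inner[:prox.start()]
        clause.term = inner
    return clause


def parse_query(query):
    # Only top-level AND is handled; OR and parentheses are out of scope here.
    return [parse_clause(part) for part in query.split(" AND ")]


QUERY = ("pn:EP* AND somato*^5 AND [mus musculus] AND "
         "clm:[transgenic animal ~3] AND pd:[19950101 TO 20140215]")
for clause in parse_query(QUERY):
    print(clause)
```

Even this short query mixes five separate conventions, which illustrates the slide's question about figuring out a system's query syntax.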
  6. BUILDING A REPORT FROM DIFFERENT PLATFORMS IS CHALLENGING: Lack of life science specificity in search platforms creates multiple false-positive hits that require additional user review. Varying underlying algorithms can create an apples-to-oranges comparison. Different output formats make it difficult to analyze and compare results. Little cross-platform integration necessitates downloading multiple files for manual collation.
  7. CASE STUDY: Identify prior art surrounding gene modification in peanut for gene families implicated in food allergies. “Ara h 1” is a seed storage protein from Arachis hypogaea. It is notable because sensitization to it was found in 95% of peanut-allergic patients from North America. We’re seeking prior art that describes vaccines related to these allergies or sequences that hit the Ara h 1 gene.
  8. SOLUTION: INTEGRATED LIFE SCIENCE SEARCH PLATFORMS. Text Search: Identify relevant documents related to peanuts and claiming transgenic modification of plants that decrease allergy risks, limited to documents published after January 1st 2010. Sequence Search: Run a sequence search against the prior art for the peanut “ara h 1” gene sequence: Arachis hypogaea cultivar LUHUA 8 Ara h 1 allergen (ara h 1) gene (cds) CCCTCCATCATTTCACCATCCACACTCATAATAATCATATATA TTCATCAATCATCTATATAAGTAGTGGCAGGAGCAATGAGA GGGAGGGTTTCTCCACTGATGCTGTTGCT… Union: Combine into a single, unique workfile.
  9. A COMPLETE REPORT FOR ANALYSIS: Claims containing vaccin* in green; bioinformatics-related patents in red; sequence search results in blue. A single, unified report for analyzing results.
  10. STANDARD KEYWORDS AND BOOLEAN SYNTAX AREN’T ENOUGH: Life science applications are more than collections of discrete, specific keywords. They include field-specific ontological terms that can have synonyms, alternate spellings, and varying word order. Building a single query that addresses all of these issues, plus allows the flexibility of Boolean, proximity, wildcard, field grouping, range searches, and term boosting, can be difficult.
  11. USE EXISTING ONTOLOGY TERMS OR DEFINE YOUR OWN: As you type, suggested matching terms appear, based on the ontologies you choose. Simply typing “transgenic” with the NCBI ontology list offers “Transgenic Plants” as one option. At any time, type the ? symbol for a complete list of field choices. Specify words in claims, date ranges, and many more options to further refine your query. Define your own ontologies and synonyms that are relevant for your specific search area. The list shown includes synonyms and alternate spellings for the genus and species of peanut. Hit “Search” or <return> to run the search.
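The idea of user-defined synonym lists from slides 10 and 11 can be sketched as a lookup that expands one term into an OR group before the query is run. This is a simplified illustration, not the platform's implementation: the `SYNONYMS` table, its entries, and the `expand` and `build_query` helpers are all hypothetical.

```python
# Hypothetical user-defined synonym lists; the entries are illustrative only.
SYNONYMS = {
    "peanut": ["Arachis hypogaea", "peanut", "groundnut"],
    "bioinformatics": ["bioinformatics", "bio-informatics"],
}


def expand(term, synonyms=SYNONYMS):
    """Return an OR group for a term that has a synonym list, else the term itself."""
    variants = synonyms.get(term.lower())
    if not variants:
        return term
    return "(" + " OR ".join(variants) + ")"


def build_query(terms, synonyms=SYNONYMS):
    """Join the (possibly expanded) terms with AND."""
    return " AND ".join(expand(t, synonyms) for t in terms)


print(build_query(["peanut", "transgenic"]))
# (Arachis hypogaea OR peanut OR groundnut) AND transgenic
```

Keeping the lists in one table means a new spelling is added once and every later query picks it up.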
  12. INSTANT RESULTS: A result preview is shown, and we save it as a workfile called “TEXT SEARCH”.
  13. THE “TEXT SEARCH” WORKFILE: Sort by any column. Rank for priority. Color code to categorize. Quickly assign colors/ranks using keyboard shortcuts: 3 (for 3 stars), O (for orange).
  14. NAVIGATE A WORKFILE: All the results seem relevant, but we want to annotate the documents talking about vaccines in the claims with a green color. Easily apply bulk annotations for future workfile manipulation. Keyboard shortcuts allow fast workfile evaluation: (next record), (close preview), (previous record).
  15. FILTER A WORKFILE: Type in free text, use wildcards, or type in “?” to filter by terms in a specific field.
  16. FILTER A WORKFILE: Apply the filter to pull out the subset of documents that match your query. 12 documents contain the word “vaccine”, or related terms, in the claims.
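The filter-then-annotate step on slides 15 and 16 can be approximated in a few lines of Python. This is a sketch under stated assumptions: a workfile is modeled as a list of dictionaries, `*` is treated as "any run of word characters", and the record identifiers and claim texts are made-up placeholders. None of the names below come from the product.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a simple wildcard term ("vaccin*") into a whole-word regex."""
    escaped = re.escape(pattern).replace(r"\*", r"\w*")
    return re.compile(r"\b" + escaped + r"\b", re.IGNORECASE)


def filter_workfile(records, field, pattern):
    """Return the records whose given field matches the wildcard pattern."""
    rx = wildcard_to_regex(pattern)
    return [rec for rec in records if rx.search(rec.get(field, ""))]


def annotate(records, color):
    """Bulk-assign a color code to every record in the subset."""
    for rec in records:
        rec["color"] = color


# Placeholder records, not real patents.
workfile = [
    {"id": "DOC-1", "claims": "A vaccine composition comprising a modified allergen.", "color": None},
    {"id": "DOC-2", "claims": "A transgenic peanut plant with reduced allergen content.", "color": None},
    {"id": "DOC-3", "claims": "A method of vaccinating a subject against peanut allergy.", "color": None},
]
subset = filter_workfile(workfile, "claims", "vaccin*")
annotate(subset, "green")
print([(rec["id"], rec["color"]) for rec in workfile])
# [('DOC-1', 'green'), ('DOC-2', None), ('DOC-3', 'green')]
```

Because the subset holds references to the same records, coloring the subset also colors those records in the full workfile, which mirrors resetting the filter afterwards.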
  17. MAKING DOCUMENTS WITH VACCINES IN THE CLAIMS GREEN: Let’s annotate these in green.
  18. MAKING DOCUMENTS WITH VACCINES IN THE CLAIMS GREEN: Here is what our subset (vaccine in claims) looks like. You can reset the filter to see other documents that are in the workfile.
  19. MAKING BIOINFORMATICS DOCUMENTS RED: Let’s annotate in red the documents that are probably not relevant. Notice that “Bio-informatics” is a synonym list and includes multiple spellings. 40 documents relate to bioinformatics methods.
  20. HERE IS OUR TEXT SEARCH WORKFILE: vaccin* in claims; bioinformatics related.
  21. PREPARE YOUR SEQUENCE SEARCH RESULTS: Now it’s time to complete the analysis with sequence search results. ara h 1 CDS sequence; GenePast, 90% ID over the length of the query or the subject (1000 results).
  22. FILTER YOUR SEQUENCE SEARCH RESULTS & EXPORT: We export these results to a LifeQuest workfile. Apply a filter to keep the patents where the patent sequence location of the hits is in the claims; that leads to 81 results in 25 patents.
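Slide 22 reports two different counts after filtering (results versus patents), because one patent can contain several matching sequences. A minimal Python sketch of that bookkeeping follows. The `SequenceHit` record, its fields, and the sample data are hypothetical and far smaller than the 81 results in 25 patents mentioned on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SequenceHit:
    patent: str      # placeholder document identifier
    seq_id_no: int   # sequence number within the patent
    location: str    # where the sequence appears, e.g. "claims" or "description"


def keep_claimed(hits):
    """Keep only hits whose sequence is located in the claims."""
    return [h for h in hits if h.location == "claims"]


def summarize(hits):
    """Return (number of hits, number of distinct patents)."""
    return len(hits), len({h.patent for h in hits})


hits = [
    SequenceHit("DOC-A", 1, "claims"),
    SequenceHit("DOC-A", 2, "claims"),
    SequenceHit("DOC-A", 3, "description"),
    SequenceHit("DOC-B", 7, "claims"),
    SequenceHit("DOC-C", 4, "description"),
]
claimed = keep_claimed(hits)
print(summarize(claimed))  # (3, 2)
```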
  23. EXPORT YOUR SEARCH RESULTS TO A WORKFILE: Save it as a new “SEQ search” workfile, and open it to analyze.
  24. MARK ALL OF THE SEQUENCE SEARCH DOCUMENTS BLUE: In the “SEQ search” workfile, color code all as blue.
  25. SOLUTION: INTEGRATED LIFE SCIENCE SEARCH PLATFORMS. Text Search: Identify relevant documents related to peanuts and claiming transgenic modification of plants that decrease allergy risks, limited to documents published after January 1st 2010. Sequence Search: Run a sequence search against the prior art for the peanut “ara h 1” gene sequence: Arachis hypogaea cultivar LUHUA 8 Ara h 1 allergen (ara h 1) gene (cds) CCCTCCATCATTTCACCATCCACACTCATAATAATCATATATA TTCATCAATCATCTATATAAGTAGTGGCAGGAGCAATGAGA GGGAGGGTTTCTCCACTGATGCTGTTGCT… Union: Combine into a single, unique workfile.
  26. CONSOLIDATE TEXT SEARCH AND SEQUENCE SEARCH RESULTS: Merge the two workfiles together (union) to get a complete set for final analysis.
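The union on slide 26 can be pictured as a merge keyed on the document identifier, so a patent found by both searches appears once. The sketch below is an assumption about how such a merge could behave, in particular the choice to keep every color tag when a document appears in both workfiles; the deck does not say how the product resolves that case. All names and records are placeholders.

```python
def union_workfiles(*workfiles):
    """Merge workfiles into one record per document, keeping every color tag."""
    merged = {}
    for workfile in workfiles:
        for rec in workfile:
            entry = merged.setdefault(rec["id"], {"id": rec["id"], "colors": set()})
            if rec.get("color"):
                entry["colors"].add(rec["color"])
    # Sort documents and colors so the result is stable and readable.
    return [
        {"id": doc_id, "colors": sorted(entry["colors"])}
        for doc_id, entry in sorted(merged.items())
    ]


# Placeholder workfiles: green = vaccin* in claims, red = bioinformatics, blue = sequence hit.
text_search = [
    {"id": "DOC-1", "color": "green"},
    {"id": "DOC-2", "color": "red"},
]
seq_search = [
    {"id": "DOC-2", "color": "blue"},
    {"id": "DOC-3", "color": "blue"},
]
merged = union_workfiles(text_search, seq_search)
for row in merged:
    print(row)
```

Here DOC-2 ends up with both "blue" and "red", which flags a document that the text search marked as probably irrelevant but that still carries a claimed sequence hit.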
  27. EVALUATE THE MERGED DATA SETS: Sort, filter, analyze, and export! vaccin* in claims; bioinformatics related; sequence hit in claims.
  28. GENERATE A COMPLETE REPORT
  29. GENERATE A COMPLETE REPORT FOR ANALYSIS: Includes results from both sequence & text searches. Create color codes for your specific categories. Merge with other outputs or export to any format. Sort or filter by any field. Rank hits (1, 2, 3 stars) to easily identify priority. Claims contain vaccin*; bioinformatics related; found using the “ara h 1” DNA sequence. A single, unified report for analyzing results.
  30. PLEASE COME BY OUR BOOTH FOR MORE INFORMATION.
