
Scripps bioinformatics seminar_day_2

Part 2 of introduction to knowledge representation and applications for knowledge discovery in bioinformatics

Published in: Science

  1. Day 2 of "Computing on the shoulders of giants: how existing knowledge is represented and applied in bioinformatics"
     • Benjamin Good, bgood@scripps.edu
     • Assistant Professor, Department of Molecular and Experimental Medicine
  2. Recap from Day 1
     • Make things (articles, genes, antibodies, etc.) easier to find
     • Answer questions
     • Generate hypotheses
     • Controlled vocabularies (MeSH)
     • Ontologies (Gene Ontology)
     • Knowledge graphs on the Web: the SPARQL query language
     • Knowledge plus computation = inference: the ABC model
  3. Computing with knowledge
     • Challenges with knowledge graphs
       • Too much data -> query, sort, visualize, interact
       • Not enough data -> mine for more
     • Goal for practical day: Go beyond PubMed!
       • Gain hands-on experience using a knowledge graph
       • Either with tools built for the purpose or with your own code
  4. Assignment: knowledge graph to hypothesis
     • Option 1: Coding
       • Implement and apply an ABC-model-style hypothesis-generating program (you can adapt the example provided)
       • Explain its logic, explain how you used it to generate a hypothesis, and explain the hypothesis (provide a visual)
     • Option 2: Non-coding
       • Use one or more knowledge discovery applications (list provided) to define a new hypothesis
       • If you can't think of where to start, try to explain why metformin may contribute to cancer survival
     • Assignment deliverables: a document containing
       • the inputs you gave to your program or the online tool(s) you used
       • what was generated in response and the underlying logic
       • an image and text describing the results, especially any hypothesis you could derive
       • (for Option 1, also submit any code written or files generated as a tar or zip archive)
  5. Online tools for knowledge discovery
     • http://knowledge.bio (* we make this one…)
     • http://www.biograph.be (a good tool, but it often breaks down)
     • http://epiphanet.uth.tmc.edu (also on the flaky side, but can be good)
     • https://skr3.nlm.nih.gov/SemMed/ (works okay; requires a free account)
     • http://arrowsmith.psych.uic.edu (ugly interface, but a good tool)
  6. Demos
     • http://knowledge.bio
     • http://www.biograph.be
     • http://arrowsmith.psych.uic.edu/cgi-bin/arrowsmith_uic/start.cgi
  7. Example question: repurposing all drugs
     • http://tinyurl.com/hwm9388
     • [Diagram: ?drug "interacts with" protein; protein "encoded by" gene; gene has "genetic association" with ?disease; does ?drug "treat" ?disease?]
  8. Example program (feel free to follow or adapt to your interest)
     • Example
       • Input = a disease (A)
       • Output = a ranked list of drugs (C) that might be used for treatment
       • Render the results of your workflow as a Cytoscape network that illustrates the reasoning behind the predictions
     • Implementation
       • Python
       • Use a SPARQL endpoint such as http://query.wikidata.org
       • + identify and use another endpoint (e.g. EBI, UniProt)
       • ++ access PubMed articles and MeSH indexing
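A minimal sketch of the disease-in, ranked-drugs-out workflow described on this slide, using only the Python standard library. It is not the notebook's code. The Wikidata property IDs (P2293 genetic association, P688 encodes, P129 physically interacts with, P2175 medical condition treated) and the function names are assumptions for illustration; check them on wikidata.org before relying on them.

```python
import json
import urllib.parse
import urllib.request
from collections import defaultdict

# Assumed public endpoint of the Wikidata query service.
ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(disease_qid):
    """Build a SPARQL query: disease (A) -> gene -> protein (B) <- drug (C).

    The property IDs below are assumptions; verify them on wikidata.org.
    """
    return f"""
SELECT ?drug ?drugLabel ?gene ?geneLabel WHERE {{
  wd:{disease_qid} wdt:P2293 ?gene .   # disease has genetic association with gene
  ?gene wdt:P688 ?protein .            # gene encodes protein
  ?drug wdt:P129 ?protein .            # drug physically interacts with protein
  FILTER NOT EXISTS {{ ?drug wdt:P2175 wd:{disease_qid} . }}  # skip known treatments
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
"""


def run_query(query, endpoint=ENDPOINT):
    """Send the query over HTTP and return the parsed SPARQL JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/sparql-results+json",
            "User-Agent": "abc-model-sketch/0.1 (seminar exercise)",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def rank_drugs(results):
    """Rank drugs (C) by the number of distinct intervening genes (B)."""
    genes_by_drug = defaultdict(set)
    for row in results["results"]["bindings"]:
        genes_by_drug[row["drugLabel"]["value"]].add(row["geneLabel"]["value"])
    ranked = [(drug, len(genes)) for drug, genes in genes_by_drug.items()]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))
```

To try it, pass the Wikidata QID of a disease of interest: `rank_drugs(run_query(build_query(qid)))`. The ranking step is kept separate from the network call so it can be adapted or tested without querying the endpoint.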
  9. Python setup
     • pip install rdflib SPARQLWrapper pandas …
     • Jupyter is hopefully already installed; if not, install it: http://jupyter.readthedocs.io/en/latest/install.html
     • Get the notebook from https://github.com/SuLab/sparql_to_pandas/blob/master/SPARQL_pandas.ipynb
     • Go to the directory where you put the notebook
     • Run it with: jupyter notebook
     • It should be ready to run
  10. The notebook
     • will run a basic search for disease-gene-drug connections in Wikidata
     • will sort the results by the number of intervening genes
     • will export the data to a tab-delimited file you can view in Excel or a text editor, or load into Cytoscape
     • Your job: run it and extend it by one or more of:
       • adapting the query
       • changing the way the results are sorted
       • working with the output in Cytoscape to produce an informative visualization
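The sort-and-export steps can be sketched without pandas, as below. This is an illustrative stand-in, not the notebook's implementation: the function names, column names, and the "source / interaction / target" edge layout are assumptions chosen so that the file can be read as a tab-delimited table.

```python
import csv
from collections import defaultdict


def summarize(triples):
    """Group (disease, gene, drug) rows by disease-drug pair.

    Returns rows sorted by the number of distinct intervening genes,
    largest first, with ties broken alphabetically.
    """
    genes = defaultdict(set)
    for disease, gene, drug in triples:
        genes[(disease, drug)].add(gene)
    rows = [
        (disease, drug, len(found), ";".join(sorted(found)))
        for (disease, drug), found in genes.items()
    ]
    rows.sort(key=lambda row: (-row[2], row[0], row[1]))
    return rows


def edge_rows(triples):
    """Turn each triple into two network edges: disease-gene and gene-drug."""
    edges = set()
    for disease, gene, drug in triples:
        edges.add((disease, "genetic association", gene))
        edges.add((gene, "target of", drug))
    return sorted(edges)


def write_tsv(header, rows, handle):
    """Write rows to an open text handle as a tab-delimited table."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
```

For example, `write_tsv(["source", "interaction", "target"], edge_rows(triples), open("edges.tsv", "w", newline=""))` produces an edge table, while `summarize` gives the ranked disease-drug list. Changing the sort key in `summarize` is one simple way to experiment with a different ranking.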
  11. Example output rendered in Cytoscape [figure]
  12. Other queries from Day 1 (slides 48-54)
     • Drugs that target a cancer and impact a specific biological process
       • http://tinyurl.com/j222k6g
     • Drugs that might target a new disease, linked via a biological pathway with shared genes to the disease the drug is now used to treat
       • http://tinyurl.com/gpfr9kj
  13. Possible inputs for adaptations
     • Browse and examine wikidata.org to see what you might make use of, e.g.
       • Type of physical interaction between gene and drug
       • Gene Ontology annotation (what evidence codes?)
       • Disease Ontology hierarchy
       • Drug characteristics
  14. Other possible knowledge sources
     • SPARQL
       • UniProt: http://sparql.uniprot.org
       • EBI SPARQL: https://www.ebi.ac.uk/rdf/documentation/sparql-endpoints
       • Look for unique identifiers on genes and proteins that you can use to link Wikidata content to their content
     • Text
       • Use the NCBI E-utils API to programmatically access PubMed articles and MeSH indexing: http://www.ncbi.nlm.nih.gov/books/NBK25501/
       • Can be used to build co-occurrence networks of, e.g., MeSH terms
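A rough sketch of the co-occurrence idea: given the MeSH terms attached to each article, count how often each pair of terms appears together. The helper names are hypothetical, the example terms are made up for illustration, and fetching and parsing the article records is left out. The `esearch_url` helper only builds an E-utils search URL and assumes the standard `esearch.fcgi` parameters (`db`, `term`, `retmax`, `retmode`).

```python
import urllib.parse
from collections import Counter
from itertools import combinations

# Base URL of the NCBI E-utilities.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, retmax=20):
    """Build an E-utils ESearch URL that returns PubMed IDs for a search term."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?" + urllib.parse.urlencode(params)


def cooccurrence(articles):
    """Count co-occurring term pairs.

    `articles` is an iterable of term lists, one list per article.
    Each unordered pair is counted at most once per article.
    """
    counts = Counter()
    for terms in articles:
        for pair in combinations(sorted(set(terms)), 2):
            counts[pair] += 1
    return counts


# Made-up MeSH term lists standing in for three indexed articles.
example_articles = [
    ["Metformin", "Neoplasms", "Humans"],
    ["Metformin", "Neoplasms"],
    ["Humans", "Neoplasms"],
]
example_counts = cooccurrence(example_articles)
```

Each counted pair can serve as a weighted edge in a co-occurrence network, with the count as the edge weight.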
  15. Good luck! Ask questions!
  16. ABC ranking algorithms
     • Out of all C, which are most strongly related to A?
     • Rank by N shared B concepts
       • c2: 4
       • c4: 3
       • c1: 1
       • c3: 1
       • c5: 1
       • c6: 1
     • Next level: adjust to down-weight highly connected nodes
     • [Figure: network with columns A, B, and C; C nodes c1 to c6]
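The two ranking ideas on this slide can be sketched as follows. The toy edges are invented so that the shared-B counts match the slide (c2: 4, c4: 3, all others: 1); the actual figure's edges are not recoverable from the transcript. The down-weighting scheme, where each B contributes 1 divided by the number of C nodes it links to, is only one possible choice and is an assumption here, not the method from the slides.

```python
from collections import Counter, defaultdict

# Toy network (made up): A links to four B concepts; each B links to some C nodes.
A_LINKS = ["b1", "b2", "b3", "b4"]
B_LINKS = {
    "b1": ["c1", "c2"],
    "b2": ["c2", "c3", "c4"],
    "b3": ["c2", "c4", "c5"],
    "b4": ["c2", "c4", "c6"],
}


def rank_by_shared_b(a_links, b_links):
    """Rank each C by the number of B concepts it shares with A."""
    counts = Counter()
    for b in a_links:
        for c in b_links.get(b, ()):
            counts[c] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def rank_down_weighted(a_links, b_links):
    """Rank each C, giving less credit to B concepts that link to many C nodes.

    Assumed scheme: a B connected to k C nodes contributes 1/k to each of them.
    """
    scores = defaultdict(float)
    for b in a_links:
        targets = b_links.get(b, ())
        for c in targets:
            scores[c] += 1.0 / len(targets)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


simple_ranking = rank_by_shared_b(A_LINKS, B_LINKS)
weighted_ranking = rank_down_weighted(A_LINKS, B_LINKS)
```

In this toy network c2 and c4 stay on top under both schemes, but c1 moves ahead of c3, c5, and c6 once weights are applied, because its single linking concept b1 is less widely connected.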
  17. ABC ranking algorithms – advanced (require large networks to be useful)
     • Average Minimum Weight (AMW) (Wren)
       • http://bioinformatics.oxfordjournals.org/content/20/3/389.full.pdf
     • Linking Term Count with Average Minimum Weight (LTC-AMW) (Yetisgen-Yildiz and Pratt)
       • https://www.researchgate.net/publication/23759128_A_new_evaluation_methodology_for_literature-based_discovery_systems
     • Predicate inter-dependence (Rastegar-Mojarad)
       • https://s3.amazonaws.com/uploads.hipchat.com/25885/154162/UaGvvQqbrhPBAWN/A%20new%20method.pdf
