Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs to our Scientists

  2. Building a knowledge graph with Spark and NLP: how we recommend novel hypotheses to our scientists. Eliseo Papa, MBBS PhD, AstraZeneca. #UnifiedDataAnalytics #SparkAISummit
  3. Drug discovery is hard. The cost of a new drug is ~2.6 billion; the probability of selecting the right target is 9-12% at best; the false discovery rate is estimated at 96%; over ⅔ of clinical trials fail for lack of efficacy.
  4. Despite an increase in R&D spending, the number of new medicines remained constant.
  5. AstraZeneca introduced the “5R” framework.
  6. 5R has had a significant impact in improving our efficiency (https://www.nature.com/articles/nrd.2017.244).
  7. Difficulties remain: target decisions take years to be validated, and there is too much data for scientists to consider when generating hypotheses.
  8. We are investing in new sources of data and faster validation.
  9. We need tools to make sense of data and make better and faster decisions: 1) partnerships; 2) internal knowledge graph build; 3) developing a RecSys for target identification.
  10. Finding a drug target can be formulated as a hybrid recommendation problem: scientists need to parse large amounts of information and make a ranking prediction; the data comes in different formats, data models and locations; estimates of the probability of success need to be constantly updated.
  11. Multiple objective optimization.
  12. Traditional recsys approaches: collaborative filtering (“what is everyone else choosing as a drug target”); content-based filtering (“what are the characteristics of the target”); knowledge-based filtering (“what do we know about the target's role in human disease”).
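
One common way to combine the three approaches on slide 12 into a hybrid recommender is a weighted blend of their scores. The sketch below is only an illustration of that idea, not the method used in the talk: the target names, the per-approach scores and the weights are all hypothetical.

```python
def hybrid_rank(target_scores, weights):
    """Rank targets by a weighted sum of per-approach scores (highest first).

    target_scores: {target: {"collaborative": x, "content": y, "knowledge": z}}
    weights:       {approach: weight}; all values here are hypothetical.
    """
    def blended(scores):
        return sum(weights[name] * scores[name] for name in weights)

    return sorted(target_scores, key=lambda t: blended(target_scores[t]), reverse=True)


# Hypothetical scores in [0, 1] from the three recommenders.
example_scores = {
    "target_a": {"collaborative": 0.9, "content": 0.2, "knowledge": 0.1},
    "target_b": {"collaborative": 0.3, "content": 0.6, "knowledge": 0.9},
}
equal_weights = {"collaborative": 1 / 3, "content": 1 / 3, "knowledge": 1 / 3}
print(hybrid_rank(example_scores, equal_weights))
```

Changing the weights changes the ranking, which is one simple way to express that different objectives matter to different audiences.
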
  13. We assemble a large-scale knowledge graph from public and AZ internal data. Stage 1: data sources (public, unstructured, AZ knowledge, omics, chemistry, literature); deduplication, entity linking, normalization, NLP; KG; regular data release with multiple access options. Stage 2: KG feature extraction (embeddings, gCNN, ...); machine learning model training; recommendations; insights validated in collaboration with scientists; pipeline decision.
  14. Stage 1 of the diagram: data sources (public, unstructured, AZ knowledge, omics, chemistry, literature); deduplication, entity linking, normalization, NLP; KG; regular data release with multiple access options.
  15. Inputs: files, DB & API queries, schemas. Outputs: reports, graph(s), dashboard & search, files (nodes & edges). Delivering a scalable, modular, cloud-based graph creation pipeline, with automated publishing, analysis and reporting. The platform democratizes BIKG by facilitating easy knowledge addition, graph build, interrogation and evaluation. (David Geleta)
  16. KG pipeline on Databricks: a series of prototype notebooks was ported to Databricks and chained to form the BIKG creation pipeline; fast, reproducible KG production with a speed improvement of one order of magnitude; input (source files) and output (source parsers, node & edge deduplication) files are stored on DBFS. KG quality control visualization: Databricks Dashboards provide overview and in-depth views.
  17. Pipeline: a series of notebooks.
  18. Pipeline stages, with a dashboard to visualize QC metrics: (0) source acquisition: sources are updated; (1) parsing: each specified source is parsed into a set of nodes and edges (inputs differ: multiline JSON, JSON, RDF, APIs etc.); (2) matching & deduplication: nodes are matched on labels and IDs, and for edges the source and destination nodes are identified using the deduplicated nodes; (3) evaluation: the resulting KG is analyzed for completeness, correctness, etc.; (4) projections: the KG is transformed into several forms (nodes & edges CSVs, GraphX graph frames, RDF ontologies etc.).
  19. Node dictionary: nodes with all known labels, classification, default identifier, and any other contextual information, excluding provenance.
  20. Mappings table: contains all mappings with types; easy to filter by type and provenance; facilitates different strengths of folding (strict 1:1 equivalence, narrow/broad etc.); directionality is implied by the source/target ID order and the mapping relation type.
  21. Edge assertions: contains all edge assertions; easy to filter by type and provenance; directionality is implied by the source/target ID order and the relation type. Edge types: structural (such edges provide ontological classification and can be used for clustering, folding etc., e.g. rdfs:subClassOf, skos:broader); mapping; "real" edge.
  22. Keep evidence & context for each assertion.
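
Stage (2) on slide 18 can be illustrated with a small sketch: nodes parsed from different sources are matched on shared IDs or labels, and edges are then re-pointed at the surviving nodes. This is a simplified single-pass illustration in plain Python, not the BIKG implementation (which runs on Spark); the record layout, the `srcA:`/`srcB:` identifiers and the relation name are hypothetical.

```python
def deduplicate_nodes(parsed_nodes):
    """Match nodes on IDs and case-insensitive labels, merging matches.

    Single pass: a node joins the first existing cluster it matches. A later
    node that bridges two existing clusters does not merge them (sketch only).
    """
    key_to_canonical = {}  # ("id", x) or ("label", x.lower()) -> canonical ID
    canonical = {}         # canonical ID -> merged record
    for node in parsed_nodes:
        keys = {("id", i) for i in node["ids"]}
        keys |= {("label", label.lower()) for label in node["labels"]}
        hit = next((key_to_canonical[k] for k in sorted(keys) if k in key_to_canonical), None)
        target = hit if hit is not None else node["ids"][0]
        merged = canonical.setdefault(target, {"ids": set(), "labels": set()})
        merged["ids"].update(node["ids"])
        merged["labels"].update(node["labels"])
        for key in keys:
            key_to_canonical.setdefault(key, target)
    return canonical, key_to_canonical


def deduplicate_edges(parsed_edges, key_to_canonical):
    """Resolve edge endpoints to deduplicated nodes and drop duplicate edges."""
    resolved = set()
    for source_id, relation, target_id in parsed_edges:
        source = key_to_canonical.get(("id", source_id))
        target = key_to_canonical.get(("id", target_id))
        if source is not None and target is not None:
            resolved.add((source, relation, target))
    return sorted(resolved)


# Hypothetical parser output from two sources describing the same gene.
nodes = [
    {"ids": ["srcA:1"], "labels": ["Gene X"]},
    {"ids": ["srcB:9"], "labels": ["gene x", "GX"]},
    {"ids": ["srcA:2"], "labels": ["Disease Y"]},
]
edges = [
    ("srcA:1", "associated_with", "srcA:2"),
    ("srcB:9", "associated_with", "srcA:2"),
]
canonical_nodes, lookup = deduplicate_nodes(nodes)
print(deduplicate_edges(edges, lookup))
```

The two edges collapse into one because both source IDs resolve to the same deduplicated node.
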
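
The "different strengths of folding" on slide 20 can be sketched as a filter on mapping type followed by a merge of the mapped identifiers. The example below uses a union-find structure and assumes, purely for illustration, that mapping types are recorded with the SKOS properties `skos:exactMatch`, `skos:broadMatch` and `skos:narrowMatch`; the row layout and identifiers are hypothetical.

```python
def fold(mappings, allowed_relations):
    """Group identifiers connected by mappings of the allowed types.

    mappings: rows of (source_id, relation, target_id, provenance).
    Returns {representative_id: set_of_folded_ids}.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for source_id, relation, target_id, _provenance in mappings:
        root_source, root_target = find(source_id), find(target_id)
        if relation in allowed_relations and root_source != root_target:
            # Keep the lexicographically smaller ID as the representative.
            parent[max(root_source, root_target)] = min(root_source, root_target)

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return groups


# Hypothetical mapping rows.
mapping_rows = [
    ("a:1", "skos:exactMatch", "b:1", "source A"),
    ("b:1", "skos:broadMatch", "c:1", "source B"),
]
strict = fold(mapping_rows, {"skos:exactMatch"})
loose = fold(mapping_rows, {"skos:exactMatch", "skos:broadMatch", "skos:narrowMatch"})
print(strict)
print(loose)
```

Strict folding keeps `c:1` separate, while the looser setting folds all three identifiers together. Folding discards the direction of the mapping, so the original rows should be kept if directionality matters downstream.
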
  23. Stage 1 of the diagram: data sources (public, unstructured, AZ knowledge, omics, chemistry, literature); deduplication, entity linking, normalization, NLP; BIKG; regular data release with multiple access options.
  24. Focus on NLP (literature).
  25. Literature: a large amount of the knowledge relating to drug discovery is unstructured and continuously updated.
  26. Use natural language processing to extract precise information at scale: named entity recognition, entity linking, relationship extraction. MEDLINE: > 60m abstracts, weekly updates, 300GB, billion entities & relationships (https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005017).
  27. NER with Termite; relationship extraction using syntax tree rules; > 500 million relationships. (Tim Scrivener)
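
Relationship extraction with syntax tree rules, as named on slide 27, can be illustrated with one toy rule over a dependency parse: a verb whose subject and direct object are both recognized entities yields a (subject, verb, object) relation. The parse below is written by hand and the token fields (`pos`, `dep`, `head`, `entity`) are hypothetical; in practice they would come from a parser and an NER tool, and the actual rules used in the talk are not shown in the slides.

```python
def extract_relations(tokens):
    """Apply one rule: VERB with an entity nsubj and an entity dobj."""
    relations = []
    for index, token in enumerate(tokens):
        if token["pos"] != "VERB":
            continue
        children = [t for t in tokens if t["head"] == index and t is not token]
        subjects = [t for t in children if t["dep"] == "nsubj" and t["entity"]]
        objects = [t for t in children if t["dep"] == "dobj" and t["entity"]]
        for subject in subjects:
            for obj in objects:
                relations.append((subject["text"], token["text"], obj["text"]))
    return relations


# Hand-built parse of the made-up sentence "Drug_A inhibits Gene_B".
parsed_sentence = [
    {"text": "Drug_A", "pos": "NOUN", "dep": "nsubj", "head": 1, "entity": "drug"},
    {"text": "inhibits", "pos": "VERB", "dep": "ROOT", "head": 1, "entity": None},
    {"text": "Gene_B", "pos": "NOUN", "dep": "dobj", "head": 1, "entity": "gene"},
]
print(extract_relations(parsed_sentence))
```

Requiring both arguments to be recognized entities is what trades recall for precision: a sentence whose object is not a known entity produces no relation.
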
  28. NLP: Termite on a Spark cluster. Termite is a commercial named entity recognition tool that AZ uses to derive value from unstructured data. Deploying this technology onto a Spark cluster allowed us to simplify our processing architecture and achieve massive scalability. (Richard Jackson)
  29. Syntax parsing increases the precision of entity recognition.
  30. Relationships from the literature reduce the sparsity of the biological KG (https://blog.opentargets.org/link/).
  31. Language models lead to improvements in recall and precision. Scaling is still an issue: 1) distribute on Spark/DB using Horovod; 2) “distill” the model down in size.
  32. Learned sentence representations can be used for downstream tasks. An input sentence is encoded by BERT into a representation that can be used for: distance and clustering; linking to the correct biological entity; ranking the probability of a target; classifying the type of relation; estimating how probable it is.
  33. Stage 2 of the diagram: BIKG feature extraction (embeddings, gCNN, ...); machine learning model training; recommendations; insights validated in collaboration with scientists; pipeline decision.
  34. Graph embedding pipeline: ingest knowledge graph data from a blob store; random-split and convert the data; train a model with PyTorch-BigGraph; evaluate the model; generate embeddings; nearest-neighbour search with Faiss (Facebook); track artifacts; write the model and search results to a blob store. (Anna Gogleva)
  35. Approximate nearest neighbor search (example input node: kidney disease): generate embeddings; input a query node; retrieve the N nearest neighbor nodes of a required type. Use case 1: input a disease node and return the N nearest gene nodes. Use case 2: input a gene list and re-order it based on distance to a disease node.
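
The "random-split and convert" step on slide 34 might look like the sketch below: edges are shuffled with a fixed seed, cut into train/validation/test partitions, and written as tab-separated lines. The 80/10/10 fractions, the seed and the three-column layout are assumptions for illustration; the slides do not state the split used, and PyTorch-BigGraph itself is not called here.

```python
import random


def random_split(edges, train_fraction=0.8, valid_fraction=0.1, seed=0):
    """Shuffle edges reproducibly and split them into three partitions."""
    shuffled = list(edges)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    n_valid = int(len(shuffled) * valid_fraction)
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],
    }


def to_tsv_lines(edges):
    """Convert (source, relation, target) tuples to tab-separated lines."""
    return ["\t".join(edge) for edge in edges]


# Hypothetical edge list.
toy_edges = [(f"n{i}", "rel", f"n{i + 1}") for i in range(10)]
partitions = random_split(toy_edges)
for name, part in partitions.items():
    print(name, len(part))
```

Fixing the seed keeps the split reproducible across pipeline runs, which matters when evaluation metrics are compared between data releases.
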
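
The two use cases on slide 35 can be sketched with an exact, brute-force search over a handful of vectors. This stands in for the approximate search that Faiss performs at scale and does not use the Faiss API; the embeddings, node names (other than the slide's "kidney disease" example) and node types are made up.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def nearest_of_type(query, embeddings, node_types, wanted_type, n):
    """Use case 1: the n nodes of wanted_type closest to the query node."""
    candidates = [
        node for node, node_type in node_types.items()
        if node_type == wanted_type and node != query
    ]
    candidates.sort(key=lambda node: cosine_distance(embeddings[query], embeddings[node]))
    return candidates[:n]


def rerank(genes, disease, embeddings):
    """Use case 2: re-order a gene list by distance to a disease node."""
    return sorted(genes, key=lambda g: cosine_distance(embeddings[disease], embeddings[g]))


# Made-up 2-dimensional embeddings.
toy_embeddings = {
    "kidney disease": (1.0, 0.0),
    "other disease": (1.0, 0.05),
    "gene_a": (0.9, 0.1),
    "gene_b": (0.0, 1.0),
    "gene_c": (0.5, 0.5),
}
toy_types = {
    "kidney disease": "disease",
    "other disease": "disease",
    "gene_a": "gene",
    "gene_b": "gene",
    "gene_c": "gene",
}
print(nearest_of_type("kidney disease", toy_embeddings, toy_types, "gene", 2))
print(rerank(["gene_b", "gene_c", "gene_a"], "kidney disease", toy_embeddings))
```

Filtering by type before ranking is the simplest way to honour the "required type" constraint; with an approximate index, one alternative is to build a separate index per node type.
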
  36. Lessons learned: Spark lets us scale to millions of data points across disparate sources. Being able to add new data quickly helps the feedback loop and improves trust. Backend engineers and biologists don’t speak the same language, but working together can be magical. We shouldn’t strike a balance between intuitive and comprehensive, but instead build products for different audiences.
  37. Acknowledgements: Richard Jackson (NLP pipeline); David Geleta (graph build pipeline); Anna Gogleva (embeddings & RecSys pipeline); Tim Scrivener (NLP pipeline); Georgios Gerogiokas (network analysis & embeddings pipeline); Daniel Goude (Data Ops); Marina Pettersson (project manager); Erik Jansson (NLP pipeline); Nick Brown (Science IT); Matthew Woodwark; Jonathan Dry (Oncology); Claus Bendtsen (Discovery Sci); Ian Barrett (Discovery Sci); Ian Dix. Want to work with data that matters? Go to job-search.astrazeneca.com and search for “knowledge graph”. @elipapa, elipapa.github.io, https://www.linkedin.com/in/eliseopapa/. Thank you!