From Text to Data to the World: The Future of Knowledge Graphs

Keynote Integrative Bioinformatics 2018
https://docs.google.com/document/d/1E7D4_CS0vlldEcEuknXjEnSBZSZCJvbI5w1FdFh-gG4/edit

Can we improve research productivity through providing answers stemming from knowledge graphs? In this presentation, I discuss different ways of building and combining knowledge graphs.

From Text to Data to the World: The Future of Knowledge Graphs

  1. 1. From Text to Data to the World: The Future of Knowledge Graphs
     Paul Groth | pgroth.com | @pgroth
     Thanks to: Matthew Clark, Frederik van den Broek, Anton Yuryev, Maria Shkrob, Sherri Matis-Mitchell, Timothy Hoctor, Brad Allen, Corey Harper, Ron Daniel, Helena Deus, Olaf Lodbrok
  2. 2. June 15, 2018
     • Research productivity
     • Moving to answers – knowledge graphs
     • Building knowledge graphs – from text
     • Building knowledge graphs – from data
     • Combining knowledge graphs
  3. 3. Bloom, N., Jones, C. I., Van Reenen, J., & Webb, M. (2017). Are ideas getting harder to find? (No. w23782). National Bureau of Economic Research. Slides: https://web.stanford.edu/~chadj/slides-ideas.pdf
  4. 4. [Slide content from Bloom et al. (2017), cited on slide 3]
  5. 5. [Slide content from Bloom et al. (2017), cited on slide 3]
  6. 6. [Slide content from Bloom et al. (2017), cited on slide 3]
  7. 7. [Slide content from Bloom et al. (2017), cited on slide 3]
  8. 8. [Slide content from Bloom et al. (2017), cited on slide 3]
  9. 9. IN PRACTICE
     Some observations from @gregory_km's survey & interviews:
     • The needs and behaviors of specific user groups (e.g. early career researchers, policy makers, students) are not well documented.
     • Participants require details about data collection and handling.
     • Reconstructing data tables from journal articles, using general search engines, and making direct data requests are common.
     Gregory, K., Groth, P., Cousijn, H., Scharnhorst, A., & Wyatt, S. (2017). Searching Data: A Review of Observational Data Retrieval Practices. arXiv preprint arXiv:1707.06937.
     Gregory, K., Cousijn, H., Groth, P., Scharnhorst, A., & Wyatt, S. (2018). Understanding Data Retrieval Practices: A Social Informatics Perspective. arXiv preprint arXiv:1801.04971.
  10. 10. ELSEVIER’S BUSINESS: PROVIDING ANSWERS FOR RESEARCHERS, DOCTORS AND NURSES
      My work is moving towards a new field; what should I know?
      • Journal articles, reference works, profiles of researchers, funders & institutions
      • Recommendations of people to connect with, reading lists, topic pages
      How should I treat my patient given her condition & history?
      • Journal articles, reference works, medical guidelines, electronic health records
      • Treatment plan with alternatives personalized for the patient
      How can I master the subject matter of the course I am taking?
      • Course syllabus, reference works, course objectives, student history
      • Quiz plan based on the student’s history and course objectives
  11. 11. THE ROLE OF METADATA IN THE SECOND MACHINE AGE – DC-2016 / KØBENHAVN / 13 OCTOBER
      ANSWERS ARE ABOUT THINGS, NOT JUST WORKS
      Why shouldn’t a search on an author return information about the author, including the author’s works? Where was the author born, when did she live, what is she known for? … All of this is possible, but only if we can make some fundamental changes in our approach to bibliographic description. ... The challenge for us lies in transforming what we can of our data into interrelated “things” without overindulging that metaphor.
      Coyle, K. (2016). FRBR, before and after: a look at our bibliographical models. Chicago: ALA Editions.
  12. 12. KNOWLEDGE GRAPHS DEFINED
      • Knowledge graphs are "graph structured knowledge bases (KBs) which store factual information in form of relationships between entities" (Nickel, M., Murphy, K., Tresp, V. and Gabrilovich, E. (2015). A review of relational machine learning for knowledge graphs. arXiv:1503.00759v3)
      • Knowledge graphs are metadata evolved beyond the focus on the work, linking people, concepts, things and events
      • Knowledge graphs are focused on things to provide answers
  13. 13. The Success of Knowledge Graphs
  14. 14. Knowledge Graphs at Elsevier
  15. 15. BUILDING KNOWLEDGE GRAPHS FROM TEXT
  16. 16. ONTOLOGY MAINTENANCE
      • Total concepts = 540,632
      • 100+ person years of clinical expert knowledge
  17. 17. One Weird Trick from Natural Language Processing (NLP)
      • Knowledge bases are populated by scanning text and doing Information Extraction
      • Most information extraction systems are looking for very specific things, like drug-drug interactions
      • Best accuracy for that one kind of data, but misses out on all the other concepts and relations in the text
      • For a broad knowledge base, use Open Information Extraction, which only uses some knowledge of grammar
      • The weird trick for open information extraction … a simple algorithm, known as ReVerb*:
        1. Find “relation phrases” starting with a verb and ending with a verb or preposition
        2. Find noun phrases before and after the relation phrase
        3. Discard relation phrases not used with multiple combinations of arguments.
      Example sentence: In addition, brain scans were performed to exclude other causes of dementia.
      * Fader et al. Identifying Relations for Open Information Extraction
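The three ReVerb steps on this slide can be sketched in a few lines. This is a simplified illustration, not the published ReVerb implementation: it assumes the sentence is already tokenized and tagged with coarse part-of-speech tags, and the tag sets, function names and the `min_pairs` threshold are choices made for this example only.

```python
from collections import defaultdict

# Assumed coarse tag sets; real ReVerb uses a richer syntactic pattern.
NP_TAGS = {"DET", "ADJ", "NOUN", "PROPN"}
REL_TAGS = {"VERB", "ADV", "PART", "ADP"}
REL_END_TAGS = {"VERB", "ADP"}


def extract_candidates(tagged):
    """Yield (arg1, relation, arg2) candidates from one POS-tagged sentence."""
    words = [word for word, _ in tagged]
    tags = [tag for _, tag in tagged]
    i = 0
    while i < len(tags):
        if tags[i] != "VERB":
            i += 1
            continue
        # Step 1: grow a relation phrase that starts with a verb ...
        j = i
        while j < len(tags) and tags[j] in REL_TAGS:
            j += 1
        # ... and trim it so that it ends with a verb or preposition.
        while tags[j - 1] not in REL_END_TAGS:
            j -= 1
        # Step 2: take the noun phrases directly before and after it.
        left = i
        while left > 0 and tags[left - 1] in NP_TAGS:
            left -= 1
        right = j
        while right < len(tags) and tags[right] in NP_TAGS:
            right += 1
        if left < i and right > j:
            yield (" ".join(words[left:i]),
                   " ".join(words[i:j]),
                   " ".join(words[j:right]))
        i = j


def filter_relations(triples, min_pairs=2):
    """Step 3: keep relation phrases seen with several distinct argument pairs."""
    pairs = defaultdict(set)
    for arg1, relation, arg2 in triples:
        pairs[relation].add((arg1, arg2))
    return [t for t in triples if len(pairs[t[1]]) >= min_pairs]
```

Running `extract_candidates` over every sentence in a corpus and passing the combined list to `filter_relations` removes relation phrases that only ever occur with a single pair of arguments, which is the intent of step 3.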
  18. 18. ReVerb output
      • # SD documents scanned: 14,000,000
      • Extracted ReVerb triples: 473,350,566
  19. 19. ONTOLOGY MAINTENANCE
      [Pipeline diagram. Labels: Content; Universal schema; Surface form relations; Structured relations; Factorization model; Matrix Construction; Open Information Extraction; Entity Resolution; Matrix Factorization; Knowledge graph; Curation; Predicted relations; Matrix Completion; Taxonomy; Triple Extraction; Concept Resolution]
      [Figures shown: 14M SD articles; 475M triples; 3.3 million relations; 49M relations; ~15k -> 1M entries]
      Paul Groth, Sujit Pal, Darin McBeath, Brad Allen, Ron Daniel. “Applying Universal Schemas for Domain Specific Ontology Expansion”. 5th Workshop on Automated Knowledge Base Construction (AKBC) 2016.
      Michael Lauruhn and Paul Groth. "Sources of Change for Modern Knowledge Organization Systems." Knowledge Organization 43, no. 8 (2016).
  20. 20. CHALLENGES
      698 unique relations; 400 sentences
      Paul Groth, Michael Lauruhn, Antony Scerri, Ron Daniel: “Open Information Extraction on Scientific Text: An Evaluation”, 2018; arXiv:1802.05574 (http://arxiv.org/abs/1802.05574). To appear in COLING 2018.
  21. 21. BUILDING KNOWLEDGE GRAPHS FROM DATA
  22. 22. Medical Graph – Statistical correlations at scale
      [Graph diagram. Nodes: I65 Occlusion and stenosis of precerebral arteries; G40 Epilepsy; I61 Intracerebral hemorrhage; C71 Malignant neoplasm of brain. Edges labeled has_successor; example edge annotation: odds ratio 1.12]
      Criteria (1):
      • Correlation selected by predictive modeling algorithm
      • No. of relations is higher than in the mirrored relation
      • p-value < 0.05
      • Odds ratios balanced over all covariates
      [Diagram labels: Other covariates; Primary care; Secondary care; Drug prescriptions; 5m patients each; 6 years longitudinality]
      (1) Criteria based on: Jensen et al.: Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients. Nature Communications, 2014 Jun 24; 5:4022. doi: 10.1038/ncomms5022.
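To make the edge criteria concrete, here is a minimal sketch of how a directed has_successor edge might be tested. It is an assumption-laden simplification: it uses an unadjusted odds ratio from a 2x2 table with a Wald test, whereas the slide describes odds ratios balanced over all covariates by a predictive model. The function names and count arguments are hypothetical.

```python
import math


def odds_ratio(a, b, c, d):
    """Unadjusted odds ratio from a 2x2 table.

    a: first diagnosis present, later diagnosis present
    b: first diagnosis present, later diagnosis absent
    c: first diagnosis absent, later diagnosis present
    d: first diagnosis absent, later diagnosis absent
    """
    return (a * d) / (b * c)


def wald_p_value(a, b, c, d):
    """Two-sided p-value for log(odds ratio) = 0 (normal approximation)."""
    standard_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = math.log(odds_ratio(a, b, c, d)) / standard_error
    return math.erfc(abs(z) / math.sqrt(2))


def keep_edge(a, b, c, d, n_forward, n_reverse, alpha=0.05):
    """Keep A -> B only if it outnumbers B -> A and is significant."""
    return n_forward > n_reverse and wald_p_value(a, b, c, d) < alpha
```

Here `n_forward` counts patients in whom the first diagnosis precedes the second and `n_reverse` counts the mirrored order, which corresponds to the "higher than in the mirrored relation" criterion.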
  23. 23. Medical Graph in practice, patient 35: risk of depression
      • 49-year-old man
      • Dx: overweight, diabetes, hypertension, anxiety disorder
      → has an absolute risk of 36% of developing depression within the next 4 years
  24. 24. … and the rationale for why the model thinks this
  25. 25. Example: Model to predict "I50 – Heart Failure"
      Predict 4 year long-term effects, balanced for all co-variables
      • Targets for prediction: ICD-coded diagnoses
      • Only incident patients per diagnosis are considered, i.e. patients who were diagnosis-free in 2009 – 2010
      • If these patients remain diagnosis-free in 2011 – 2014 (observation period), the target is coded 0, otherwise 1
      • Covariates: all ICD/ATC codes, age and sex, measured in 2010
      [Timeline diagram: I50-free patients in 2009 – 2010; covariates measured in 2010; in 2011 – 2014, remaining I50-free patients (I50 −, coded as 0) and newly I50-diagnosed patients (I50 +, coded as 1)]
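The labeling rule on this slide can be written down directly. The sketch below assumes a hypothetical input shape (a mapping from patient ID to the set of years in which the target code, e.g. I50, was recorded); the source does not describe the actual data layout.

```python
def label_incident_patients(diagnosis_years,
                            filter_years=range(2009, 2011),
                            observation_years=range(2011, 2015)):
    """Return {patient: 0 or 1} for patients who were diagnosis-free in the filter period.

    diagnosis_years maps a patient ID to the set of years in which the
    target diagnosis was coded (an assumed, illustrative data layout).
    """
    labels = {}
    for patient, years in diagnosis_years.items():
        if any(year in years for year in filter_years):
            continue  # already diagnosed in 2009-2010, so not an incident patient
        # 1 if newly diagnosed during 2011-2014, otherwise 0
        labels[patient] = int(any(year in years for year in observation_years))
    return labels
```

Patients with a diagnosis in the filter period are dropped entirely rather than labeled, which matches the "only incident patients" restriction.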
  26. 26. Technology stack: calculate statistics & build prediction models for ~1600 targets
      Feature extraction
      For 3.8m patients:
      • age, gender
      • all diagnoses: ICD10-coded, 3 digits, i.e. 2054 codes
      • all medications: ATC-coded, 5 digits, i.e. 906 codes
      • death, hospitalization
      Results in 6277 features:
      • 1623 targets, 2011-2014
      • 2320 covariates, 2010
      • 2334 filter-columns, 2009-2010
      Data mining
      • Calculate prevalence, incidence, mean age for all covariates (i.e. diseases and medications)
      Machine learning
      Predictive modelling for ~1600 targets:
      • Linear classification model, resulting in odds ratios
      • Calculation of p-values
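The slide says only that a linear classification model yields odds ratios. One common reading, assumed here rather than confirmed by the source, is a logistic regression per target, in which each covariate coefficient maps directly to an odds ratio:

```latex
\log\frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} = \beta_0 + \sum_{j} \beta_j x_j,
\qquad
\mathrm{OR}_j = e^{\beta_j}
```

Under that assumption, a binary covariate with coefficient $\beta_j$ multiplies the odds of the target diagnosis by $e^{\beta_j}$ with all other covariates held fixed.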
  27. 27. COMBINING KNOWLEDGE GRAPHS
  28. 28. Example: Treatments for Congenital Hyperinsulinism
      • A rare genetic disease
      • Permanently excessive level of insulin in the blood
      • Develops within the first few days of life
      • Symptoms include floppiness, shakiness, poor feeding, seizures, fits and convulsions.
      • If not caught quickly, it can lead to brain injury or even death.
      • In the most severe cases the only viable treatment is the removal of the pancreas, consigning the patient to a lifetime of diabetes.
      [Organization logo] is a UK charity that is building the rare disease community to raise awareness, drive research and develop treatments.
      [Organization logo] is partnering with Findacure scientists to help identify and evaluate treatments for this devastating disease.
  29. 29. Normalizing vocabularies required: proteins, diseases, drugs, chemicals
      • Biological pathways extracted via semantic text mining: A upregulates B, B upregulates C, C increases Disease (A → B → C → disease)
      • Bioactivities through text analysis: IC50 6.3nM, kinase binding assay, 10mM concentration
      • Chemical structures and properties: InChi, Name
      [Vocabulary labels: NCBI, Uniprot; EMTREE; ReaxysTree, Structures]
  30. 30. From pathways to treatments: Biovia PipelinePilot implementation combines data sources
      Automated analysis combines bioassay data with pathway data
      • Step 1: Find all targets that could be used to affect the disease state
      • Step 2: Query for each target to find the activities for each compound that are >6 log units
      • Step 3: Collate data by compound to summarize the targets/activities related to disease that the compound hits
        • Compute geometric mean of activities for ranking
        • Rank by number of targets and geometric mean of activities against targets
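Steps 2 and 3 amount to a filter, a group-by and a sort. The sketch below is plain Python, not the Pipeline Pilot protocol itself. It assumes activities arrive as hypothetical `(compound, target, value)` rows with values already in log units, that the geometric mean is taken over those log-unit values, and that only the best value per compound and target is kept; none of these details are stated on the slide.

```python
from collections import defaultdict
from statistics import geometric_mean


def rank_compounds(activities, disease_targets, threshold=6.0):
    """Rank compounds by number of disease targets hit, then by geometric mean activity.

    activities: iterable of (compound, target, activity_in_log_units) rows.
    disease_targets: set of targets found in step 1.
    """
    hits = defaultdict(dict)
    for compound, target, value in activities:
        # Step 2: keep only activities above the threshold on disease targets.
        if target in disease_targets and value > threshold:
            best_so_far = hits[compound].get(target, value)
            hits[compound][target] = max(value, best_so_far)
    # Step 3: collate per compound and summarize.
    summary = [
        (compound, len(by_target), geometric_mean(list(by_target.values())))
        for compound, by_target in hits.items()
    ]
    return sorted(summary, key=lambda row: (row[1], row[2]), reverse=True)
```

Each returned row is `(compound, number_of_targets, geometric_mean_activity)`, sorted so that compounds hitting more targets come first and ties are broken by the geometric mean.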
  31. 31. From pathways to treatments: automated analysis combines bioassay data with pathway data
      Step 1: Find all targets that could be used to affect the disease state
      • 88 targets related to hyperinsulinism with ≥3 literature references
      • Full PathwayStudio relationship information
      • PathwayStudio also has all compounds suggested as treatments
  32. 32. Who is collaborating?
      The collaboration analysis shows clinical centers specializing in CHI
      • Filtered for institutions with > 4 publications that collaborated with another institution.
      • Size of circle proportional to total number of publications
      • Line width proportional to the number of co-authored publications
      • Lines labeled with DOIs
  33. 33. Who are the researchers in congenital hyperinsulinism?
      • Filtered for authors with > 3 publications who collaborated with another person.
      • Size of circle proportional to total number of publications
      • Line width proportional to the number of co-authored publications
      • Lines labeled with DOIs
      • Numbers for authors are Scopus IDs
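The two collaboration maps follow the same recipe, which can be sketched as below. The input shape (a list of `(doi, [names])` pairs) and the reading of "collaborated with another" as "shares a publication with another retained node" are assumptions for illustration; the same function works for authors or institutions.

```python
from collections import Counter, defaultdict
from itertools import combinations


def collaboration_graph(publications, min_publications=3):
    """Build node sizes and DOI-labeled edges for a co-authorship map.

    publications: iterable of (doi, [author_or_institution, ...]).
    Returns (nodes, edges): nodes maps a name to its publication count
    (circle size); edges maps a name pair to its list of shared DOIs
    (line width is the length of that list).
    """
    publication_count = Counter()
    shared_dois = defaultdict(list)
    for doi, names in publications:
        unique_names = sorted(set(names))
        publication_count.update(unique_names)
        for pair in combinations(unique_names, 2):
            shared_dois[pair].append(doi)
    # Keep names with more than min_publications publications ...
    productive = {n for n, count in publication_count.items() if count > min_publications}
    edges = {
        pair: dois for pair, dois in shared_dois.items()
        if pair[0] in productive and pair[1] in productive
    }
    # ... that also collaborate with at least one other retained name.
    nodes = {name: publication_count[name] for pair in edges for name in pair}
    return nodes, edges
```

For the institution map on slide 32 the call would use `min_publications=4`; for the author map on slide 33 the default of 3 applies.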
  34. 34. Embeddings & Link Prediction
      Pierre-Yves Vandenbussche (@pyvandenbussche), Translating Embeddings (TransE): http://pyvandenbussche.info/2017/translating-embeddings-transe/
  35. 35. Pierre-Yves Vandenbussche (@pyvandenbussche), Translating Embeddings (TransE): http://pyvandenbussche.info/2017/translating-embeddings-transe/
  36. 36. Pierre-Yves Vandenbussche (@pyvandenbussche), Translating Embeddings (TransE): http://pyvandenbussche.info/2017/translating-embeddings-transe/
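As a reminder of what TransE does: it models a relation as a translation in embedding space, so a triple (head, relation, tail) is plausible when head + relation lies close to tail. The sketch below shows the distance, the margin-based ranking loss used in training, and link prediction by ranking candidate tails. The vectors in any usage are toy values, and the function names are chosen for this example.

```python
import math


def transe_distance(head, relation, tail):
    """Euclidean distance between (head + relation) and tail; lower is more plausible."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def margin_loss(positive, negative, margin=1.0):
    """Margin ranking loss for one true triple and one corrupted triple."""
    return max(0.0, margin + transe_distance(*positive) - transe_distance(*negative))


def rank_tails(head, relation, candidates):
    """Link prediction: order candidate tail names from most to least plausible."""
    return sorted(
        candidates,
        key=lambda name: transe_distance(head, relation, candidates[name]),
    )
```

Training drives the loss to zero by pulling true triples together and pushing corrupted ones at least `margin` further apart; prediction then reduces to a nearest-neighbour lookup around head + relation.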
  37. 37. Burger and Beans – weakly supervised/joint embeddings
      [Diagram: hypersphere of joint embeddings, showing an image vector, the correct text vector and an incorrect text vector]
      Engilberge, Martin, Louis Chevallier, Patrick Pérez and Matthieu Cord. “Finding beans in burgers: Deep semantic-visual embedding with localization.” CoRR abs/1804.01720 (2018)
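The diagram on this slide (an image vector close to its correct text vector and far from an incorrect one, all on a hypersphere) is the typical picture of a triplet ranking objective. The sketch below shows that general idea with cosine similarity on unit-normalized vectors; whether it matches the exact loss and negative sampling of the cited paper is an assumption, and the margin value is arbitrary.

```python
import math


def normalize(vector):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine(u, v):
    """Cosine similarity, i.e. the dot product of the normalized vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def triplet_loss(image, correct_text, incorrect_text, margin=0.2):
    """Zero once the correct text is closer to the image than the incorrect text by the margin."""
    return max(0.0, margin - cosine(image, correct_text) + cosine(image, incorrect_text))
```

Only matching image and text pairs need to be labeled; any other caption in the batch can serve as the incorrect text, which is what makes this style of training weakly supervised.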
  38. 38. Burger and Beans Architecture
  39. 39. Learning Knowledge Graph relations from images
      Ruobing Xie, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2017. Image-embodied knowledge representation learning. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI'17), Carles Sierra (Ed.). AAAI Press, 3140-3146.
  40. 40. Combining Knowledge
      Both, Fabian, Steffen Thoma, and Achim Rettinger. "Cross-modal Knowledge Transfer: Improving the Word Embedding of Apple by Looking at Oranges." Proceedings of the Knowledge Capture Conference. ACM, 2017.
  41. 41. Conclusion
      • We should help researchers do more
      • A move towards answers
      • Answers come from many sources (text, data, images…)
      • Embeddings as mechanism for integration
      • Knowledge graphs help integration
  42. 42. Thank you
      Paul Groth | @pgroth | p.groth@elsevier.com
  43. 43. Combining Knowledge Graphs with Embeddings
      Gupta, N., Singh, S., & Roth, D. (2017). Entity linking via joint encoding of types, descriptions, and context. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (pp. 2681-2690).

Editor's Notes

  • Work with DANS
    Reviewed 400 papers; deep dive into 114
  • We need to rely more on unsupervised than on supervised techniques. Burger and Beans is a weakly supervised approach, which lets us infer negatives from knowing the positives.

    Through word embeddings we can also learn synonyms and the like.
  • Concept similarity
    Concatenation, SVD and PCA are combinations
  • Predict entity types
