LinkML Intro July 2022.pptx, by Chris Mungall. PLEASE VIEW THIS ON ZENODO
NOTE THAT I HAVE MOVED AWAY FROM SLIDESHARE TO ZENODO
The identical presentation is now here:
https://doi.org/10.5281/zenodo.7778641
General introduction to LinkML, The Linked Data Modeling Language.
Adapted from a presentation given to NIH in May 2022
https://linkml.io/linkml
Collaboratively Creating the Knowledge Graph of Life, by Chris Mungall
The document discusses collaboratively building a knowledge graph of life by connecting existing biological ontologies. It describes how ontologies can standardize and organize biological data by representing entities and their relationships in a graph. The challenges of integrating different ontology projects are addressed through initiatives like the Open Biological and Biomedical Ontologies (OBO) Foundry. The document outlines how ontologies can be formalized using OWL and connected using tools like the Ontology Development Kit to enable discovery across domains. Current efforts like the Gene Ontology, Biolink Model, and National Microbiome Data Collaborative are leveraging these techniques to create unified, semantically queryable knowledge graphs.
LinkML is a modeling language for building semantic models that can be used to represent biomedical and other scientific knowledge. It allows generating various schemas and representations like OWL, JSON Schema, GraphQL from a single semantic model specification. The key advantages of LinkML include simplicity through YAML files, ability to represent models in multiple forms like JSON, RDF, and property graphs, and "stealth semantics" where semantic representations like RDF are generated behind the scenes.
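To illustrate the YAML simplicity described above, a minimal LinkML schema might look like the following sketch (the schema ID, class, and slot names are invented for this example):

```yaml
# Hypothetical LinkML schema: one class with three slots.
id: https://example.org/person-schema
name: person-schema
prefixes:
  linkml: https://w3id.org/linkml/
imports:
  - linkml:types
default_range: string

classes:
  Person:
    attributes:
      id:
        identifier: true
      name:
        required: true
      age:
        range: integer
```

From a single file like this, the LinkML generators (e.g. gen-json-schema, gen-owl) can emit the corresponding JSON Schema or OWL representations mentioned in the summary.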
"SPARQL Cheat Sheet" is a short collection of slides intended to act as a guide to SPARQL developers. It includes the syntax and structure of SPARQL queries, common SPARQL prefixes and functions, and help with RDF datasets.
The "SPARQL Cheat Sheet" is intended to accompany the SPARQL By Example slides available at http://www.cambridgesemantics.com/2008/09/sparql-by-example/ .
Big Pharma Problems. Big Graphs: Creating the Merck Manufacturing Mesh, by Neo4j
Towards an Open Research Knowledge Graph, by Sören Auer
The document-oriented workflows in science have reached (or already exceeded) the limits of adequacy as highlighted for example by recent discussions on the increasing proliferation of scientific literature and the reproducibility crisis. Now it is possible to rethink this dominant paradigm of document-centered knowledge exchange and transform it into knowledge-based information flows by representing and expressing knowledge through semantically rich, interlinked knowledge graphs. The core of the establishment of knowledge-based information flows is the creation and evolution of information models for the establishment of a common understanding of data and information between the various stakeholders as well as the integration of these technologies into the infrastructure and processes of search and knowledge exchange in the research library of the future. By integrating these information models into existing and new research infrastructure services, the information structures that are currently still implicit and deeply hidden in documents can be made explicit and directly usable. This has the potential to revolutionize scientific work because information and research results can be seamlessly interlinked with each other and better mapped to complex information needs. Also research results become directly comparable and easier to reuse.
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf, by Po-Chuan Chen
The document describes the RAG (Retrieval-Augmented Generation) model for knowledge-intensive NLP tasks. RAG combines a pre-trained language generator (BART) with a dense passage retriever (DPR) to retrieve and incorporate relevant knowledge from Wikipedia. RAG achieves state-of-the-art results on open-domain question answering, abstractive question answering, and fact verification by leveraging both parametric knowledge from the generator and non-parametric knowledge retrieved from Wikipedia. The retrieved knowledge can also be updated without retraining the model.
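The retrieve-then-generate loop can be sketched in a few lines of plain Python. This toy stands in for the real DPR retriever and BART generator: the passages are invented, and token overlap replaces dense-vector retrieval:

```python
# Toy RAG sketch: retrieve the most relevant passage, then condition
# "generation" on it. A real system would use DPR + BART over Wikipedia.
passages = {
    "p1": "the eiffel tower is in paris",
    "p2": "the great wall is in china",
}

def retrieve(question: str) -> str:
    # Score passages by token overlap with the question (stand-in for DPR).
    q = set(question.lower().split())
    return max(passages.values(), key=lambda p: len(q & set(p.split())))

def generate(question: str, passage: str) -> str:
    # Stand-in for a conditioned generator such as BART.
    return f"Answer grounded in: '{passage}'"

question = "where is the eiffel tower"
answer = generate(question, retrieve(question))
print(answer)
```

Because the knowledge lives in the passage store rather than the model weights, updating `passages` changes the answers without any retraining, which is the point the summary makes.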
Continuous representations of words and documents, now commonly referred to as word embeddings, have recently driven large advances in many natural language processing tasks.
In this presentation we provide an introduction to the most common methods of learning these representations, as well as earlier methods used before the recent advances in deep learning, such as dimensionality reduction on the word co-occurrence matrix.
Moreover, we present the continuous bag-of-words (CBOW) model, one of the most successful models for word embeddings and one of the core models in word2vec, and briefly survey other models for building representations for other tasks, such as knowledge base embeddings.
Finally, we motivate the potential of using such embeddings for many tasks that could be of importance for the group, such as semantic similarity, document clustering, and retrieval.
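The word co-occurrence matrix mentioned above can be sketched with the standard library. The corpus and window size here are illustrative; real embeddings would then come from factorizing these counts (e.g. with SVD):

```python
from collections import Counter

# Tiny illustrative corpus.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]

# Symmetric co-occurrence counts within a window of 2 words.
cooc = Counter()
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        for c in words[max(0, i - 2):i] + words[i + 1:i + 3]:
            cooc[(w, c)] += 1

print(cooc[("sat", "on")])  # "sat" and "on" co-occur in both sentences -> 2
```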
Slides from the Ontology Access Kit (OAK) workshop, https://incatools.github.io/ontology-access-kit/
OAK is a pluralistic Python library for accessing a variety of ontologies, via either its command-line interface or its Python API.
Training Week: Create a Knowledge Graph: A Simple ML Approach, by Neo4j
This document provides an overview of creating a knowledge graph using machine learning approaches. It discusses using natural language processing to extract entities, relationships, and triples from text to build a knowledge graph. It then describes using graph embedding techniques like word2vec and node2vec to vectorize the knowledge graph and perform machine learning tasks like node similarity. The document demonstrates these approaches using Python libraries for NLP, graph databases, and machine learning.
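The pipeline the summary describes (triples in, similarity out) can be illustrated with plain Python. The triples and the Jaccard neighbourhood similarity here are invented stand-ins for the NLP extraction and node2vec steps:

```python
# Toy knowledge-graph pipeline: triples -> neighbourhoods -> node similarity.
triples = [
    ("aspirin", "treats", "pain"),
    ("ibuprofen", "treats", "pain"),
    ("ibuprofen", "treats", "inflammation"),
]

# Collect each subject's set of objects.
neighbours: dict[str, set[str]] = {}
for s, _, o in triples:
    neighbours.setdefault(s, set()).add(o)

def jaccard(a: str, b: str) -> float:
    # Similarity of two nodes by overlap of their neighbourhoods.
    na, nb = neighbours.get(a, set()), neighbours.get(b, set())
    return len(na & nb) / len(na | nb) if na | nb else 0.0

print(jaccard("aspirin", "ibuprofen"))  # 0.5: one shared object out of two
```

A real system would replace the hand-written triples with NLP extraction and the Jaccard score with learned graph embeddings such as node2vec.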
The document introduces ontologies and discusses their role in the Semantic Web. It defines an ontology as an explicit specification of a conceptualization that is shared between people or software agents. Ontologies allow concepts and relationships between concepts to be formally defined so that software applications can interpret data in the same way. The document outlines different types of ontologies including upper ontologies that define common concepts across domains, and domain ontologies that define the terms and relationships within a specific knowledge domain. Formal ontology languages are also discussed as a way to represent ontologies in a machine-readable format.
The document discusses the RDF data model. The key points are:
1. RDF represents data as a graph of triples consisting of a subject, predicate, and object. Triples can be combined to form an RDF graph.
2. The RDF data model has three types of nodes - URIs to identify resources, blank nodes to represent anonymous resources, and literals for values like text strings.
3. RDF graphs can be merged to integrate data from multiple sources in an automatic way due to RDF's compositional nature.
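The compositional merging in point 3 follows from RDF's set-of-triples model and can be shown directly (the CURIE-style names here are illustrative):

```python
# Each source is a set of (subject, predicate, object) triples,
# so merging two RDF graphs is just set union.
source_a = {
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:name", '"Alice"'),
}
source_b = {
    ("ex:bob", "foaf:name", '"Bob"'),
    ("ex:alice", "foaf:knows", "ex:bob"),  # shared triple, merged away
}

merged = source_a | source_b
print(len(merged))  # 3: the duplicate triple appears only once
```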
This document provides an introduction and examples for SHACL (Shapes Constraint Language), a W3C recommendation for validating RDF graphs. It defines key SHACL concepts like shapes, targets, and constraint components. An example shape validates nodes with a schema:name and schema:email property. Constraints like minCount, maxCount, datatype, nodeKind, and logical operators like and/or are demonstrated. The document is an informative tutorial for learning SHACL through examples.
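A shape along the lines of the example described might look like this in Turtle (the shape name is illustrative; the constraint components are standard SHACL):

```turtle
@prefix sh:     <http://www.w3.org/ns/shacl#> .
@prefix schema: <http://schema.org/> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

schema:PersonShape
    a sh:NodeShape ;
    sh:targetClass schema:Person ;
    sh:property [
        sh:path schema:name ;
        sh:datatype xsd:string ;
        sh:minCount 1 ;
    ] ;
    sh:property [
        sh:path schema:email ;
        sh:datatype xsd:string ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
    ] .
```

A validator would report any schema:Person node missing a name or carrying more than one email.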
This document presents an introduction to SPARQL, the query language for RDF data. It explains that SPARQL is used to query RDF databases in much the same way that SQL is used to query relational databases. It also describes key SPARQL features such as filters, functions, and query operators.
GSK: How Knowledge Graphs Improve Clinical Reporting Workflows, by Neo4j
This document discusses GSK's efforts to use knowledge graphs to improve clinical reporting workflows. It describes GSK's current multi-step clinical data flow process and the resources required. The document envisions a future where a clinical knowledge graph could provide a single connected data model, parallel processing, and accelerated decision making. GSK plans to test building a minimum viable product knowledge graph to ingest and analyze clinical trial data and derive metrics. The goal is to demonstrate feasibility and inform further development through a phased agile approach.
Meaning Representations for Natural Languages: Design, Models and Applications, by Yunyao Li
EMNLP'2022 Tutorial "Meaning Representations for Natural Languages: Design, Models and Applications"
Instructors: Jeffrey Flanigan, Ishan Jindal, Yunyao Li, Tim O’Gorman, Martha Palmer
Abstract:
We propose a cutting-edge tutorial that reviews the design of common meaning representations, SoTA models for predicting meaning representations, and the applications of meaning representations in a wide range of downstream NLP tasks and real-world applications. Presented by a diverse team of NLP researchers from academia and industry with extensive experience in designing, building and using meaning representations, our tutorial has three components: (1) an introduction to common meaning representations, including basic concepts and design challenges; (2) a review of SoTA methods on building models for meaning representations; and (3) an overview of applications of meaning representations in downstream NLP tasks and real-world applications. We will also present qualitative comparisons of common meaning representations and a quantitative study on how their differences impact model performance. Finally, we will share best practices in choosing the right meaning representation for downstream tasks.
Knowledge Graphs and Generative AI
Dr. Katie Roberts, Data Science Solutions Architect, Neo4j
It’s no secret that Large Language Models (LLMs) are popular right now, especially in the age of Generative AI. LLMs are powerful models that enable access to data and insights for any user, regardless of their technical background. However, they are not without challenges: hallucinations, generic responses, bias, and a lack of traceability can give organizations pause when thinking about how to take advantage of this technology. Graphs are well suited to ground LLMs as they allow you to take advantage of relationships within your data that are often overlooked with traditional data storage and data science approaches. Combining Knowledge Graphs and LLMs enables contextual and semantic information retrieval from both structured and unstructured data sources. In this session, you’ll learn how graphs and graph data science can be incorporated into your analytics practice, and how a connected data platform can improve explainability, accuracy, and specificity of applications backed by foundation models.
Knowledge Graphs (Ilaria Maresi, The Hyve, 23 April 2020), by Pistoia Alliance
Data for drug discovery and healthcare is often trapped in silos which hampers effective interpretation and reuse. To remedy this, such data needs to be linked both internally and to external sources to make a FAIR data landscape which can power semantic models and knowledge graphs.
And then there were ... Large Language Models, by Leon Dohmen
It is not often, even in the ICT world, that one witnesses a revolution. The rise of the Personal Computer, the rise of mobile telephony and, of course, the rise of the Internet are some of those revolutions. So what is ChatGPT really? Is ChatGPT also such a revolution? And like any revolution, does ChatGPT have its winners and losers? And who are they? How do we ensure that ChatGPT contributes to a positive impulse for "Smart Humanity"?
During keynotes on April 3 and 13, 2023, Piek Vossen explained the impact of Large Language Models like ChatGPT.
Prof. Dr. Piek Th.J.M. Vossen is full professor of Computational Lexicology at the Faculty of Humanities, Department of Language, Literature and Communication (LCC) at VU Amsterdam:
What is ChatGPT? What technology and thought processes underlie it? What are its consequences? What choices are being made? In the presentation, Piek will elaborate on the basic principles behind Large Language Models and how they are used as a basis for Deep Learning in which they are fine-tuned for specific tasks. He will also discuss a specific variant GPT that underlies ChatGPT. It covers what ChatGPT can and cannot do, what it is good for and what the risks are.
Building an Enterprise Knowledge Graph @Uber: Lessons from Reality, by Joshua Shinavier
This document summarizes Uber's experience building an enterprise knowledge graph. It notes that Uber has over 200,000 managed datasets and billions of trips served, making it an ideal testbed for a knowledge graph. However, it also outlines several lessons learned, including that real-world data is messy, an RDF-based approach is difficult, and property graphs alone are insufficient. The document advocates standardizing on shared vocabularies, fitting tools and data models to existing infrastructure, and collaborating across teams.
The document describes the Jena framework, which is a Java API for building semantic web and linked data applications. It allows for parsing, creating, querying and inferencing over RDF data. The key classes and interfaces in Jena include the Model interface for representing RDF graphs, classes for creating resources, properties and literals, interfaces for representing statements and querying models. Jena supports reading/writing RDF files, working with ontologies and rules, and includes a SPARQL query engine.
This document discusses the treatment of fictitious and non-human personages in RDA and the Library Reference Model (LRM). It explains that in the LRM, only humans or collectives of humans can be agents, excluding animals, fictional characters, and other non-humans. However, many works are attributed to such non-human creators. The document then outlines proposals from the RDA Fictitious Entities Working Group to address this issue in RDA, such as recording non-humans used as pseudonyms normally and relating actual animal performers or fictional characters to works they are involved with. The goal is to support user tasks while maintaining principles of authority control.
This presentation goes into the details of word embeddings: applications, learning word embeddings through a shallow neural network, and the continuous bag-of-words (CBOW) model.
What Wikidata teaches us about knowledge engineering, by Elena Simperl
This document summarizes an expert talk about Wikidata and knowledge engineering. The key points are:
1) Wikidata is a collaborative knowledge graph with over 28,000 active users that contains over 97 million items and 1.6 billion edits. It allows both human and bot editors.
2) Studies of Wikidata show that a balanced mix of bot and human editors, as well as diversity in editor tenure and interests, leads to higher quality knowledge graph items and ontology.
3) Provenance or references are important for trust in Wikidata statements, but the quality of these references is not well understood and varies across languages. Further research is exploring how to better evaluate reference quality.
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s..., by Mihai Criveti
Mihai is the Principal Architect for Platform Engineering and Technology Solutions at IBM, responsible for Cloud Native and AI Solutions. He is a Red Hat Certified Architect, CKA/CKS, a leader in the IBM Open Innovation community, and advocate for open source development. Mihai is driving the development of Retrieval Augmented Generation platforms, and solutions for Generative AI at IBM that leverage WatsonX, Vector databases, LangChain, HuggingFace and open source AI models.
Mihai will share lessons learned building Retrieval Augmented Generation, or “Chat with Documents” platforms and APIs that scale, and deploy on Kubernetes. His talk will cover use cases for Generative AI, limitations of Large Language Models, use of RAG, Vector Databases and Fine Tuning to overcome model limitations and build solutions that connect to your data and provide content grounding, limit hallucinations and form the basis of explainable AI. In terms of technology, he will cover LLAMA2, HuggingFace TGIS, SentenceTransformers embedding models using Python, LangChain, and Weaviate and ChromaDB vector databases. He’ll also share tips on writing code using LLMs, including building an agent for Ansible and containers.
Scaling factors for Large Language Model architectures:
• Vector database: consider sharding and high availability
• Fine-tuning: collecting data to be used for fine-tuning
• Governance and model benchmarking: how are you testing your model performance over time, with different prompts, one-shot, and various parameters
• Chain of reasoning and agents
• Caching embeddings and responses
• Personalization and conversational memory database
• Streaming responses and optimizing performance. A fine-tuned 13B model may perform better than a poor 70B one!
• Calling 3rd-party functions or APIs for reasoning or other types of data (e.g. LLMs are terrible at reasoning and prediction; consider calling other models)
• Fallback techniques: fall back to a different model, or to default answers
• API scaling techniques, rate limiting, etc.
• Async, streaming and parallelization, multiprocessing, GPU acceleration (including embeddings), generating your API using OpenAPI, etc.
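As one concrete instance of the embedding-caching point above, here is a sketch using Python's functools.lru_cache. The embed function is a stand-in for a real embedding-model call; in practice the cache would be a shared store (e.g. a vector database), not in-process memory:

```python
from functools import lru_cache

@lru_cache(maxsize=10_000)
def embed(text: str) -> tuple:
    # Placeholder: a deterministic toy "vector" derived from the text.
    # A real implementation would call an embedding model here.
    return tuple(ord(c) % 7 for c in text[:8])

v1 = embed("knowledge graph")
v2 = embed("knowledge graph")  # served from the cache, no recomputation
print(v1 == v2)
print(embed.cache_info().hits)  # one cache hit so far
```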
Generative AI represents a pivotal moment in computing history, opening up new opportunities for scientific discoveries. By harnessing extensive and diverse datasets, we can construct new general-purpose Foundation Models that can be fine-tuned for specific prediction and exploration tasks. This talk introduces our research program, which focuses on leveraging the power of Generative AI for materials discovery. Generative AI facilitates rapid exploration of vast materials design spaces, enabling the identification of new compounds and combinations. However, this field also presents significant challenges, such as effectively representing crystals in a compact manner and striking the right balance between utilizing known structural regions and venturing into unexplored territories. Our research delves into the development of a new kind of generative models specifically designed to search for diverse molecular/crystal regions that yield high returns, as defined by domain experts. In addition, our toolset includes Large Language Models that have been fine-tuned using materials literature and scientific knowledge. These models possess the ability to comprehend extensive volumes of materials literature, encompassing molecular string representations, mathematical equations in LaTeX, and codebases. We explore the open challenges, including effectively representing deep domain knowledge and implementing efficient querying techniques to address materials discovery problems.
Demonstration of the applicability of the Linked Data Modeling Language and CHEMROF ( https://chemkg.github.io/chemrof/) for semantic chemical sciences. Presented at MADICES 2022. https://github.com/MADICES/MADICES-2022
Scaling up semantics; lessons learned across the life sciences, by Chris Mungall
Semantic modeling is key to understanding the biological processes underpinning the health of humans and the health of ecosystems on this planet. There are a number of different approaches to semantic modeling, varying from modeling of *things* in the form of knowledge graphs, modeling of *data structures* in the form of semantic schemas, and modeling of *words* in the form of ultra-large language models. Taking the metaphor of modeling paradigms as planets in a semantic solar system, I will take us on a tour through the solar system, exploring the strengths of each approach, and looking through a historic lens at how we keep iterating over similar solutions with each rotation around the sun. As an alternative to the dichotomy of either resisting change or starting afresh, I urge an approach where we embrace change and adapt with each revolution. I will look specifically at how the OBO community has built powerful knowledge graphs of biological concepts, how the LinkML modeling language incorporates aspects of both frame languages and shape languages, and how language models can be integrated with semantic ontological approaches through the OntoGPT framework.
Continuous representations of words and documents, which is recently referred to as Word Embeddings, have recently demonstrated large advancements in many of the Natural language processing tasks.
In this presentation we will provide an introduction to the most common methods of learning these representations. As well as previous methods in building these representations before the recent advances in deep learning, such as dimensionality reduction on the word co-occurrence matrix.
Moreover, we will present the continuous bag of word model (CBOW), one of the most successful models for word embeddings and one of the core models in word2vec, and in brief a glance of many other models of building representations for other tasks such as knowledge base embeddings.
Finally, we will motivate the potential of using such embeddings for many tasks that could be of importance for the group, such as semantic similarity, document clustering and retrieval.
Slides from the Ontology Access Kit (OAK) workshop, https://incatools.github.io/ontology-access-kit/
OAK is a pluralistic Python library for accessing a variety of ontologies, using either the command line or the Python library
Training Week: Create a Knowledge Graph: A Simple ML Approach Neo4j
This document provides an overview of creating a knowledge graph using machine learning approaches. It discusses using natural language processing to extract entities, relationships, and triples from text to build a knowledge graph. It then describes using graph embedding techniques like word2vec and node2vec to vectorize the knowledge graph and perform machine learning tasks like node similarity. The document demonstrates these approaches using Python libraries for NLP, graph databases, and machine learning.
The document introduces ontologies and discusses their role in the Semantic Web. It defines an ontology as an explicit specification of a conceptualization that is shared between people or software agents. Ontologies allow concepts and relationships between concepts to be formally defined so that software applications can interpret data in the same way. The document outlines different types of ontologies including upper ontologies that define common concepts across domains, and domain ontologies that define the terms and relationships within a specific knowledge domain. Formal ontology languages are also discussed as a way to represent ontologies in a machine-readable format.
The document discusses the RDF data model. The key points are:
1. RDF represents data as a graph of triples consisting of a subject, predicate, and object. Triples can be combined to form an RDF graph.
2. The RDF data model has three types of nodes - URIs to identify resources, blank nodes to represent anonymous resources, and literals for values like text strings.
3. RDF graphs can be merged to integrate data from multiple sources in an automatic way due to RDF's compositional nature.
This document provides an introduction and examples for SHACL (Shapes Constraint Language), a W3C recommendation for validating RDF graphs. It defines key SHACL concepts like shapes, targets, and constraint components. An example shape validates nodes with a schema:name and schema:email property. Constraints like minCount, maxCount, datatype, nodeKind, and logical operators like and/or are demonstrated. The document is an informative tutorial for learning SHACL through examples.
Este documento presenta una introducción a SPARQL, el lenguaje de consulta para datos RDF. Explica que SPARQL se utiliza para realizar consultas en bases de datos RDF de forma similar a como SQL se utiliza para bases de datos relacionales. También describe características clave de SPARQL como filtros, funciones y operadores de consulta.
GSK: How Knowledge Graphs Improve Clinical Reporting WorkflowsNeo4j
This document discusses GSK's efforts to use knowledge graphs to improve clinical reporting workflows. It describes GSK's current multi-step clinical data flow process and the resources required. The document envisions a future where a clinical knowledge graph could provide a single connected data model, parallel processing, and accelerated decision making. GSK plans to test building a minimum viable product knowledge graph to ingest and analyze clinical trial data and derive metrics. The goal is to demonstrate feasibility and inform further development through a phased agile approach.
Meaning Representations for Natural Languages: Design, Models and ApplicationsYunyao Li
EMNLP'2022 Tutorial "Meaning Representations for Natural Languages: Design, Models and Applications"
Instructors: Jeffrey Flanigan, Ishan Jindal, Yunyao Li, Tim O’Gorman, Martha Palmer
Abstract:
We propose a cutting-edge tutorial that reviews the design of common meaning representations, SoTA models for predicting meaning representations, and the applications of meaning representations in a wide range of downstream NLP tasks and real-world applications. Reporting by a diverse team of NLP researchers from academia and industry with extensive experience in designing, building and using meaning representations, our tutorial has three components: (1) an introduction to common meaning representations, including basic concepts and design challenges; (2) a review of SoTA methods on building models for meaning representations; and (3) an overview of applications of meaning representations in downstream NLP tasks and real-world applications. We will also present qualitative comparisons of common meaning representations and a quantitative study on how their differences impact model performance. Finally, we will share best practices in choosing the right meaning representation for downstream tasks.
Knowledge Graphs and Generative AI
Dr. Katie Roberts, Data Science Solutions Architect, Neo4j
It’s no secret that Large Language Models (LLMs) are popular right now, especially in the age of Generative AI. LLMs are powerful models that enable access to data and insights for any user, regardless of their technical background, however, they are not without challenges. Hallucinations, generic responses, bias, and a lack of traceability can give organizations pause when thinking about how to take advantage of this technology. Graphs are well suited to ground LLMs as they allow you to take advantage of relationships within your data that are often overlooked with traditional data storage and data science approaches. Combining Knowledge Graphs and LLMs enables contextual and semantic information retrieval from both structured and unstructured data sources. In this session, you’ll learn how graphs and graph data science can be incorporated into your analytics practice, and how a connected data platform can improve explainability, accuracy, and specificity of applications backed by foundation models.
Knowledge Graphs (Ilaria Maresi, The Hyve, 23 April 2020; Pistoia Alliance)
Data for drug discovery and healthcare is often trapped in silos which hampers effective interpretation and reuse. To remedy this, such data needs to be linked both internally and to external sources to make a FAIR data landscape which can power semantic models and knowledge graphs.
And then there were ... Large Language Models (Leon Dohmen)
It is not often, even in the ICT world, that one witnesses a revolution. The rise of the Personal Computer, the rise of mobile telephony and, of course, the rise of the Internet are some of those revolutions. So what is ChatGPT really? Is ChatGPT also such a revolution? And like any revolution, does ChatGPT have its winners and losers? And who are they? How do we ensure that ChatGPT contributes to a positive impulse for "Smart Humanity"?
During keynotes on April 3 and 13, 2023, Piek Vossen explained the impact of Large Language Models like ChatGPT.
Prof. Piek Th.J.M. Vossen is Full Professor of Computational Lexicology at the Faculty of Humanities, Department of Language, Literature and Communication (LCC), VU Amsterdam:
What is ChatGPT? What technology and thought processes underlie it? What are its consequences? What choices are being made? In the presentation, Piek will elaborate on the basic principles behind Large Language Models and how they are used as a basis for deep learning, in which they are fine-tuned for specific tasks. He will also discuss GPT, the specific variant that underlies ChatGPT, covering what ChatGPT can and cannot do, what it is good for, and what the risks are.
Building an Enterprise Knowledge Graph @Uber: Lessons from Reality (Joshua Shinavier)
This document summarizes Uber's experience building an enterprise knowledge graph. It notes that Uber has over 200,000 managed datasets and billions of trips served, making it an ideal testbed for a knowledge graph. However, it also outlines several lessons learned, including that real-world data is messy, an RDF-based approach is difficult, and property graphs alone are insufficient. The document advocates standardizing on shared vocabularies, fitting tools and data models to existing infrastructure, and collaborating across teams.
The document describes the Jena framework, which is a Java API for building semantic web and linked data applications. It allows for parsing, creating, querying and inferencing over RDF data. The key classes and interfaces in Jena include the Model interface for representing RDF graphs, classes for creating resources, properties and literals, interfaces for representing statements and querying models. Jena supports reading/writing RDF files, working with ontologies and rules, and includes a SPARQL query engine.
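Jena itself is a Java API, but the core idea it implements, a Model as a set of statements with wildcard pattern matching, can be sketched in a few lines of Python. This is illustrative only; the class and method names below mirror Jena's `Model` and `listStatements(s, p, o)` but are not Jena's actual API.

```python
# Minimal sketch of the RDF "Model" idea that Jena implements in Java.
# A model is a set of (subject, predicate, object) statements; None acts
# as a wildcard, like a null argument to Jena's listStatements(s, p, o).

class Model:
    def __init__(self):
        self.statements = set()

    def add(self, subject, predicate, obj):
        self.statements.add((subject, predicate, obj))

    def list_statements(self, subject=None, predicate=None, obj=None):
        return [t for t in self.statements
                if (subject is None or t[0] == subject)
                and (predicate is None or t[1] == predicate)
                and (obj is None or t[2] == obj)]

m = Model()
m.add("ex:alice", "foaf:knows", "ex:bob")
m.add("ex:alice", "foaf:name", '"Alice"')
print(m.list_statements("ex:alice", "foaf:knows", None))  # -> [('ex:alice', 'foaf:knows', 'ex:bob')]
```

Real Jena adds much more on top of this core: typed resources and literals, ontology and rule support, serialization, and a SPARQL engine.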
This document discusses the treatment of fictitious and non-human personages in RDA and the Library Reference Model (LRM). It explains that in the LRM, only humans or collectives of humans can be agents, excluding animals, fictional characters, and other non-humans. However, many works are attributed to such non-human creators. The document then outlines proposals from the RDA Fictitious Entities Working Group to address this issue in RDA, such as recording non-humans used as pseudonyms normally and relating actual animal performers or fictional characters to works they are involved with. The goal is to support user tasks while maintaining principles of authority control.
This presentation goes into the details of word embeddings, applications, learning word embeddings through shallow neural network , Continuous Bag of Words Model.
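The Continuous Bag of Words model mentioned above predicts a target word from the average of its context word vectors. A minimal forward-pass sketch, with toy random vectors standing in for trained embeddings (real word2vec learns these vectors with techniques such as negative sampling):

```python
import math
import random

# Toy CBOW forward pass: average the input vectors of the context words,
# score every vocabulary word with a dot product, then softmax.
# The vectors here are random placeholders, not trained embeddings.
random.seed(0)
vocab = ["the", "cat", "sat", "on", "mat"]
dim = 4
embed_in  = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
embed_out = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}

def cbow_probs(context):
    # hidden layer = mean of the context words' input vectors
    h = [sum(embed_in[w][i] for w in context) / len(context) for i in range(dim)]
    scores = {w: sum(h[i] * embed_out[w][i] for i in range(dim)) for w in vocab}
    z = sum(math.exp(s) for s in scores.values())
    return {w: math.exp(s) / z for w, s in scores.items()}

probs = cbow_probs(["the", "cat", "on", "the"])
assert abs(sum(probs.values()) - 1.0) < 1e-9  # a valid probability distribution
```

Training would then adjust both embedding tables so that the true target word gets high probability; the learned input vectors are the word embeddings used downstream.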
What Wikidata teaches us about knowledge engineering (Elena Simperl)
This document summarizes an expert talk about Wikidata and knowledge engineering. The key points are:
1) Wikidata is a collaborative knowledge graph with over 28,000 active users; it contains over 97 million items and has accumulated over 1.6 billion edits. It allows both human and bot editors.
2) Studies of Wikidata show that a balanced mix of bot and human editors, as well as diversity in editor tenure and interests, leads to higher quality knowledge graph items and ontology.
3) Provenance or references are important for trust in Wikidata statements, but the quality of these references is not well understood and varies across languages. Further research is exploring how to better evaluate reference quality.
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s... (Mihai Criveti)
Mihai is the Principal Architect for Platform Engineering and Technology Solutions at IBM, responsible for Cloud Native and AI Solutions. He is a Red Hat Certified Architect, CKA/CKS, a leader in the IBM Open Innovation community, and advocate for open source development. Mihai is driving the development of Retrieval Augmentation Generation platforms, and solutions for Generative AI at IBM that leverage WatsonX, Vector databases, LangChain, HuggingFace and open source AI models.
Mihai will share lessons learned building Retrieval Augmented Generation, or “Chat with Documents”, platforms and APIs that scale and deploy on Kubernetes. His talk will cover use cases for Generative AI, limitations of Large Language Models, and the use of RAG, vector databases and fine-tuning to overcome model limitations and build solutions that connect to your data, provide content grounding, limit hallucinations and form the basis of explainable AI. In terms of technology, he will cover LLAMA2, HuggingFace TGIS, SentenceTransformers embedding models using Python, LangChain, and the Weaviate and ChromaDB vector databases. He’ll also share tips on writing code using LLMs, including building an agent for Ansible and containers.
Scaling factors for Large Language Model architectures:
• Vector database: consider sharding and high availability
• Fine-tuning: collecting data to be used for fine-tuning
• Governance and model benchmarking: how are you testing your model performance over time, with different prompts, one-shot, and various parameters?
• Chain of reasoning and agents
• Caching embeddings and responses
• Personalization and conversational memory database
• Streaming responses and optimizing performance: a fine-tuned 13B model may perform better than a poor 70B one!
• Calling 3rd-party functions or APIs for reasoning or other types of data (e.g. LLMs are terrible at reasoning and prediction; consider calling other models)
• Fallback techniques: fall back to a different model, or to default answers
• API scaling techniques, rate limiting, etc.
• Async, streaming and parallelization, multiprocessing, GPU acceleration (including embeddings), generating your API using OpenAPI, etc.
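The retrieval step behind such “Chat with Documents” platforms can be sketched minimally. Here a toy bag-of-words vector and cosine similarity stand in for the learned embeddings and vector database (e.g. SentenceTransformers with Weaviate or ChromaDB) mentioned above; the documents and query are invented examples.

```python
import math
from collections import Counter

# Minimal sketch of RAG retrieval: embed the query and documents, rank by
# cosine similarity, and ground the prompt in the best-matching document.
# Bag-of-words counts are a stand-in for real learned embeddings.

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

documents = [
    "Kubernetes deploys containers across a cluster",
    "Vector databases store embeddings for similarity search",
    "LLMs can hallucinate facts without grounding",
]

def retrieve(query, k=1):
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

context = retrieve("how do vector databases search embeddings")
# The retrieved text is prepended to the prompt to ground the model's answer.
prompt = f"Answer using only this context:\n{context[0]}\nQuestion: ..."
```

Swapping `embed` for a real embedding model and `documents` for a vector-database query changes nothing structurally; the rank-then-ground flow is the same.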
Generative AI represents a pivotal moment in computing history, opening up new opportunities for scientific discoveries. By harnessing extensive and diverse datasets, we can construct new general-purpose Foundation Models that can be fine-tuned for specific prediction and exploration tasks. This talk introduces our research program, which focuses on leveraging the power of Generative AI for materials discovery. Generative AI facilitates rapid exploration of vast materials design spaces, enabling the identification of new compounds and combinations. However, this field also presents significant challenges, such as effectively representing crystals in a compact manner and striking the right balance between utilizing known structural regions and venturing into unexplored territories. Our research delves into the development of a new kind of generative model specifically designed to search for diverse molecular/crystal regions that yield high returns, as defined by domain experts. In addition, our toolset includes Large Language Models that have been fine-tuned using materials literature and scientific knowledge. These models possess the ability to comprehend extensive volumes of materials literature, encompassing molecular string representations, mathematical equations in LaTeX, and codebases. We explore the open challenges, including effectively representing deep domain knowledge and implementing efficient querying techniques to address materials discovery problems.
Demonstration of the applicability of the Linked Data Modeling Language and CHEMROF ( https://chemkg.github.io/chemrof/) for semantic chemical sciences. Presented at MADICES 2022. https://github.com/MADICES/MADICES-2022
Scaling up semantics; lessons learned across the life sciences (Chris Mungall)
Semantic modeling is key to understanding the biological processes underpinning the health of humans and the health of ecosystems on this planet. There are a number of different approaches to semantic modeling, varying from modeling of *things* in the form of knowledge graphs, modeling of *data structures* in the form of semantic schemas, and modeling of *words* in the form of ultra-large language models. Taking the metaphor of modeling paradigms as planets in a semantic solar system, I will take us on a tour through the solar system, exploring the strengths of each approach, and looking through a historic lens at how we keep iterating over similar solutions with each rotation around the sun. As an alternative to the dichotomy of either resisting change or starting afresh, I urge an approach where we embrace change and adapt with each revolution. I will look specifically at how the OBO community have built powerful knowledge graphs of biological concepts, how the LinkML modeling language incorporates aspects of both frame languages and shape languages, and how language models can be integrated with semantic ontological approaches through the OntoGPT framework.
This document describes ChemData, an open-source tool for interactive analysis and visualization of large chemical datasets. ChemData stores chemical data in a MongoDB database and uses VTK for visualization. It allows users to perform queries, similarity searching, and multidimensional analysis through features like scatter plots, histograms, parallel coordinates, and k-means clustering. ChemData also integrates with other open chemistry tools like Avogadro and represents molecular structures using the ChemicalJSON format. Future directions include integration with additional databases and computational job results.
Reasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven Recipes (Ontotext)
This presentation will provide a brief introduction to logical reasoning and overview of the most popular semantic schema and ontology languages: RDFS and the profiles of OWL 2.
While automatic reasoning has always inspired the imagination, numerous projects have failed to deliver on its promises. The typical pitfalls related to ontologies and symbolic reasoning fall into three categories:
- Over-engineered ontologies. The selected ontology language and modeling patterns can be too expressive. This can make the results of inference hard to understand and verify, which in turn makes the KG hard to evolve and maintain. It can also impose performance penalties far greater than the benefits.
- Inappropriate reasoning support. There are many inference algorithms and implementation approaches that work well with taxonomies and conceptual models of a few thousand concepts, but cannot cope with KGs of millions of entities.
- Inappropriate data layer architecture. One such example is reasoning over a virtual KG, which is often infeasible.
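The scale problem described above is easy to see in a naive forward-chaining reasoner. This sketch materializes RDFS-style subclass and type inferences to a fixpoint; it is illustrative only, not how production rule engines implement reasoning, and the class names are invented.

```python
# Naive forward-chaining of two RDFS rules until a fixpoint:
#   rdfs9:  (x type C) + (C subClassOf D)        => (x type D)
#   rdfs11: (C subClassOf D) + (D subClassOf E)  => (C subClassOf E)
# Every pass rescans the whole KB, which is why this approach cannot
# cope with graphs of millions of entities.

triples = {
    ("Protein", "subClassOf", "Molecule"),
    ("Molecule", "subClassOf", "Entity"),
    ("p53", "type", "Protein"),
}

def materialize(kb):
    kb = set(kb)
    changed = True
    while changed:
        changed = False
        inferred = set()
        for s, p, o in kb:
            if p == "type":
                inferred |= {(s, "type", d) for c, q, d in kb
                             if q == "subClassOf" and c == o}
            elif p == "subClassOf":
                inferred |= {(s, "subClassOf", d) for c, q, d in kb
                             if q == "subClassOf" and c == o}
        if not inferred <= kb:
            kb |= inferred
            changed = True
    return kb

closed = materialize(triples)
assert ("p53", "type", "Entity") in closed
```

The expressivity pitfall shows up the same way: each additional rule multiplies the work done per pass, so restricting the ontology language (e.g. to an OWL 2 profile) directly bounds inference cost.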
Open Chemistry, JupyterLab and data: Reproducible quantum chemistryMarcus Hanwell
The Open Chemistry project is developing an ambitious platform to facilitate reproducible quantum chemistry workflows by integrating the best-of-breed open source projects currently available into a cohesive platform with extensions specific to the needs of quantum chemistry. The core of the project is a Python-based data server capable of storing metadata, executing quantum chemistry calculations, and processing the output. The platform exposes programming-language-agnostic RESTful web endpoints, and uses Linux container technology to package quantum codes that are often difficult to build.
The Jupyter project has been leveraged as a web-based frontend offering reproducibility as a core principle. This has been coupled with the data server to initiate quantum chemistry calculations, cache results, make them searchable, and even visualize the results within a modern browser environment. The Avogadro libraries have been reused for visualization workflows, coupled with Open Babel for file translation, and examples of the use of NWChem and Psi4 will be demonstrated.
The core of the platform is developed upon JSON data standards, encouraging the wider adoption of JSON/HDF5 as the principal storage media. A single-page web application using React at its core will be shown for sharing simple views of data output and linking to the Jupyter notebooks that document how they were made. Command-line tools and links to the Avogadro graphical interface will be shown, demonstrating capabilities from web through to desktop.
The document discusses Chado and OBD, two database schemas for storing biological data and annotations. Chado is a relational database schema developed for model organism databases to store various types of genomics data and track provenance. It uses ontologies and supports modules for different data types. OBD is designed for biomedical annotations using semantic web technologies like RDF triples and SPARQL querying. It aims to index data from various sources and link to external databases. The document compares the two approaches and discusses wrapping existing databases as SPARQL endpoints.
All together now: piecing together the knowledge graph of life (Chris Mungall)
The document summarizes challenges in organizing biological knowledge and progress made through collaborative ontology development. It discusses how early efforts focused on individual ontologies but challenges emerged in maintenance and linking data. New approaches focus on shared principles, standardized mappings between ontologies, and modeling knowledge as graphs. Tools like Boomer and LinkML help reconcile mappings and model data, while community efforts like OBO Foundry and Biolink Model advance integration through open collaboration. Overall progress has been made but more work is needed to operationalize ontologies and build interconnected knowledge graphs.
Presented at the first Avogadro User Meeting, and presents an overview of the history of Avogadro development. It discusses changes in the rewrite, and the broader Open Chemistry project.
Experiences with logic programming in bioinformatics (Chris Mungall)
This document discusses experiences applying logic programming techniques in bioinformatics. It describes Obol, a system that used definite clause grammars to parse biological terms, and Blipkit, a reusable bioinformatics toolkit built for SWI-Prolog. Blipkit includes domain models, I/O modules, and tools for integrating with relational databases and web services. The document discusses applications of logic programming for tasks like genome inference, phenotype matching, and consistency checking biological data. It evaluates different logic programming approaches for representing genomic data and rules.
dipLODocus[RDF]: Short and Long-Tail RDF Analytics for Massive Webs of Data (eXascale Infolab)
dipLODocus[RDF] is a new system for RDF data processing supporting both simple transactional queries and complex analytics efficiently. dipLODocus[RDF] is based on a novel hybrid storage model considering RDF data both from a graph perspective (by storing RDF subgraphs or RDF molecules) and from a "vertical" analytics perspective (by storing compact lists of literal values for a given attribute).
http://diuf.unifr.ch/main/xi/diplodocus/
The document provides a general introduction to artificial intelligence (AI), machine learning (ML), deep learning (DL), and data science (DS). It defines each term and describes their relationships. Key points include:
- AI is the ability of computers to mimic human cognition and intelligence.
- ML is an approach to achieve AI by having computers learn from data without being explicitly programmed.
- DL uses neural networks for ML, especially with unstructured data like images and text.
- DS involves extracting insights from data through scientific methods. It is a multidisciplinary field that uses techniques from ML, DL, and statistics.
Graph databases in computational biology: the case of Neo4j and TitanDB (Andrei Kucharavy)
This document discusses graph databases and their use in computational biology. It introduces Neo4j and TitanDB as graph database options and describes how biological interaction networks and pathways can be modeled as graphs. Key advantages of graph databases over relational databases are also summarized, such as increased speed for graph queries and simpler programming. The document provides an overview of Neo4j and TitanDB, including their core abstractions, interfaces, and advantages/limitations for storing large biological network data. Examples are given of loading Reactome pathway data into Neo4j and performing graph queries.
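A property-graph query of the kind the document describes can be sketched without a database. The node and relationship names here are hypothetical, and the Cypher shown in the comment is only what an equivalent Neo4j query might look like.

```python
# Toy property graph: nodes carry labels, relationships are typed edges.
# In Neo4j the traversal below would be expressed in Cypher, roughly:
#   MATCH (p:Protein {name:'TP53'})-[:INTERACTS_WITH]->(q) RETURN q
nodes = {
    "TP53": {"label": "Protein"},
    "MDM2": {"label": "Protein"},
    "Apoptosis": {"label": "Pathway"},
}
rels = [
    ("TP53", "INTERACTS_WITH", "MDM2"),
    ("TP53", "PART_OF", "Apoptosis"),
]

def neighbors(node, rel_type):
    """One-hop traversal along edges of a given type."""
    return [dst for src, r, dst in rels if src == node and r == rel_type]

print(neighbors("TP53", "INTERACTS_WITH"))  # -> ['MDM2']
```

The speed advantage cited for graph databases comes from making exactly this operation, following edges from a node, a constant-time lookup rather than a relational join.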
DNA sequencing is producing a wave of data which will change the way that drugs are developed, patients diagnosed, and our understanding of human biology. To fulfill this promise, however, the tools for interpretation and analysis must scale to match the quantity and diversity of "big data genomics."
ADAM is an open-source genomics processing engine, built using Spark, Apache Avro, and Parquet. This talk will discuss some of the advantages that the Spark platform brings to genomics, the benefits of using technologies like Parquet in conjunction with Spark, and the challenges of adapting new technologies for existing tools in bioinformatics.
These are slides for a talk given at the Apache Spark Meetup in Boston on October 20, 2014.
The document discusses the BioSamples Database (BioSD) and its conversion to linked data. BioSD aims to provide information about biological samples used in experiments in a centralized reference system. It was converted to linked data to allow for integration with other datasets, exploitation of ontologies, and improved searching. The conversion included changes to the data model and several improvements to the software. SPARQL queries are demonstrated to retrieve sample data and attributes. Potential new areas discussed include integrating geo-located samples with Google Maps and search by feature similarity.
Integrating Pathway Databases with Gene Ontology Causal Activity Models (Benjamin Good)
The Gene Ontology (GO) Consortium (GOC) is developing a new knowledge representation approach called ‘causal activity models’ (GO-CAM). A GO-CAM describes how one or several gene products contribute to the execution of a biological process. In these models (implemented as OWL instance graphs anchored in Open Biological Ontology (OBO) classes and relations), gene products are linked to molecular activities via semantic relationships like ‘enables’, molecular activities are linked to each other via causal relationships such as ‘positively regulates’, and sets of molecular activities are defined as ‘parts’ of larger biological processes. This approach provides the GOC with a more complete and extensible structure for capturing knowledge of gene function. It also allows for the representation of knowledge typically seen in pathway databases.
Here, we present details and results of a rule-based transformation of pathways represented using the BioPAX exchange format into GO-CAMs. We have automatically converted all Reactome pathways into GO-CAMs and are currently working on the conversion of additional resources available through Pathway Commons. By converting pathways into GO-CAMs, we can leverage OWL description logic reasoning over OBO ontologies to infer new biological relationships and detect logical inconsistencies. Further, the conversion helps to increase standardization for the representation of biological entities and processes. The products of this work can be used to improve source databases, for example by inferring new GO annotations for pathways and reactions and can help with the formation of meta-knowledge bases that integrate content from multiple sources.
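The GO-CAM structure described above, molecular activities enabled by gene products, causally linked to each other and to enclosing processes, can be sketched as a small edge list. The identifiers below are illustrative placeholders, not real GO or Reactome IDs.

```python
# Sketch of a GO-CAM-style instance graph: activities are enabled_by gene
# products, linked by causal relations, and part_of a biological process.
model = [
    ("activity1", "enabled_by", "KinaseX"),
    ("activity2", "enabled_by", "PhosphataseY"),
    ("activity1", "positively_regulates", "activity2"),
    ("activity1", "part_of", "signal_transduction"),
    ("activity2", "part_of", "signal_transduction"),
]

def gene_products_in_process(process):
    """Which gene products enable an activity within the given process?"""
    activities = {s for s, p, o in model if p == "part_of" and o == process}
    return sorted(o for s, p, o in model
                  if p == "enabled_by" and s in activities)

assert gene_products_in_process("signal_transduction") == ["KinaseX", "PhosphataseY"]
```

A BioPAX-to-GO-CAM conversion essentially maps pathway reactions and controllers onto such activity nodes and relations, after which OWL reasoning can operate over the resulting graph.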
Some considerations on using the two systems to manage molecular biology knowledge networks. This comes from: https://github.com/marco-brandizi/odx_neo4j_converter_test
This document provides an overview of LinkML, a lightweight modeling language for building data schemas and knowledge graphs. It discusses how LinkML allows users to model data in a simple yet expressive way and generate outputs like JSON Schema, OWL, and RDF. LinkML aims to be developer-friendly and integrates with popular tools and standards. Several key projects currently use LinkML for tasks like building knowledge graphs and modeling genomics and clinical data.
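A minimal sketch of the LinkML idea, with a schema (normally authored as YAML) written here as a Python dict and a hand-rolled required-slot check. Real validation uses the LinkML toolchain; this check and the `Person` class are illustrative assumptions only.

```python
# Sketch of a LinkML-style schema: classes declare attributes (slots) with
# constraints such as required and range. The toy validator below checks
# only required slots and unknown slots; the LinkML toolchain does far more
# (ranges, enums, inheritance) and also generates JSON Schema, OWL, etc.
schema = {
    "classes": {
        "Person": {
            "attributes": {
                "id":   {"required": True,  "range": "string"},
                "name": {"required": True,  "range": "string"},
                "age":  {"required": False, "range": "integer"},
            }
        }
    }
}

def validate(instance, class_name):
    attrs = schema["classes"][class_name]["attributes"]
    errors = [f"missing required slot: {slot}"
              for slot, spec in attrs.items()
              if spec["required"] and slot not in instance]
    errors += [f"unknown slot: {k}" for k in instance if k not in attrs]
    return errors

assert validate({"id": "P1", "name": "Ada"}, "Person") == []
assert validate({"id": "P2"}, "Person") == ["missing required slot: name"]
```

The appeal of the single-source approach is that this one declaration drives validation, documentation, and the generated JSON Schema, OWL, and RDF artifacts simultaneously.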
Experiences in the biosciences with the open biological ontologies foundry an... (Chris Mungall)
The document discusses the need for ontologies in biology to integrate data from the large number of biological databases and standards. It outlines tools for building and using ontologies, including those for end users to search and analyze data, and those for ontology engineers to develop ontologies through automated reasoning and integration. The Gene Ontology is provided as an example of an ontology that has been widely adopted for analyzing gene sets. The document advocates developing ontologies through a collaborative framework like the Open Biological and Biomedical Ontologies to promote reuse and integration across domains.
Representation of kidney structures in Uberon (Chris Mungall)
The document discusses representation of kidney structures in the Uberon anatomy ontology. It provides examples of kidney classes like glomerular capsule and S-shaped body represented in Uberon along with their relationships. It also discusses how Uberon integrates representations of kidney structures from other species and anatomy ontologies through equivalence axioms and cross-links.
Uberon: opening up to community contributions (Chris Mungall)
The document discusses Uberon, an integrative multi-species anatomy ontology. It describes Uberon's taxonomic scope covering metazoans with a focus on vertebrates. It outlines how Uberon is edited on GitHub and maintained with cross-references to other species-specific anatomy ontologies. It also discusses how phenotypes from the Phenotype and Human Phenotype Ontology are directly mapped to Uberon and species-specific anatomies, as well as considerations for which anatomy ontology a phenotype ontology should use.
Modeling exposure events and adverse outcome pathways using ontologies (Chris Mungall)
This document discusses using ontologies to model exposure events, adverse outcome pathways, and phenotypes in order to support predictive toxicology. It describes existing ontologies like the Environment, Conditions, and Treatments Ontology (ECTO) and Gene Ontology Causal Activity Models (GO-CAMs) that can be used to represent exposure mechanisms and adverse outcomes. The document also presents challenges for developing an open predictive toxicology framework that leverages ontologies and linked data to make toxicology data more findable, accessible, interoperable, and reusable.
Causal reasoning using the Relation Ontology (Chris Mungall)
The document discusses the need for standardized relationship types in biological data and ontologies. It provides an overview of the Relation Ontology (RO), which defines over 450 standardized relationship types organized hierarchically. RO provides a foundation for integrating multiple knowledge graphs and represents relationships in ontologies, linked data, and knowledge bases. It enables logical reasoning and inference across graphs through properties like transitivity.
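Causal reasoning of the kind RO enables can be sketched by composing regulation signs along a chain. The relation names and the sign-multiplication rule below are a simplification of RO's actual property definitions, and the gene names are invented.

```python
# Sketch of causal chain reasoning with RO-style relations: the net effect
# of a regulation path is the product of the signs of its edges
# (e.g. positive-of-negative is negative). A simplification of how RO
# defines property chains for causal relations.

edges = [
    ("geneA", "positively_regulates", "geneB"),
    ("geneB", "negatively_regulates", "geneC"),
    ("geneC", "positively_regulates", "geneD"),
]
SIGN = {"positively_regulates": +1, "negatively_regulates": -1}

def net_effect(path):
    """Overall causal sign of a chain of regulation edges."""
    sign = 1
    for _, rel, _ in path:
        sign *= SIGN[rel]
    return "positively_regulates" if sign > 0 else "negatively_regulates"

assert net_effect(edges) == "negatively_regulates"  # (+1) * (-1) * (+1) = -1
```

Transitive relations such as part-of work the same way with a trivial composition rule, which is what lets reasoners propagate relationships across merged knowledge graphs.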
This document discusses lessons learned from developing and using the Gene Ontology (GO) over the past 20 years. It covers how GO aims to systematically annotate gene function across species using an ontology. It describes how GO uses OWL constructs like subclasses, equivalence and reasoning to leverage relationships with other ontologies. It also discusses moving beyond simple annotation to represent biology accurately using causal models and graphs. Finally, it covers the Open Biology Ontology Foundry principles of collaboration, shared standards and interconnected ontologies that GO adheres to.
1. The document discusses using phenotypes across species to aid in interpreting genomic data from patients and improving diagnosis and treatment.
2. Building comprehensive phenotype databases from multiple sources is challenging due to disparate data on human genes/variants and model organisms.
3. The Monarch Initiative aims to link human diseases to phenotypes in model systems through an ontology-based knowledge base and portal.
4. Incorporating rich phenotypic data can improve variant filtering and interpretation by providing more context for sequencing results.
The document discusses the Environment Ontology (ENVO), which aims to represent environmental entities and their relationships in a structured format. It describes the main hierarchies in ENVO, including biome, environmental feature, and environmental material. ENVO represents different levels of environmental granularity from broad biomes down to specific materials. Any material entity can act as a feature determining an environmental system. The objectives for further developing ENVO are also outlined, such as representing various environmental qualities like temperature, nutrients, and toxins.
Chris Mungall discussed his path in biocuration which led him to focus on ontologies. Ontologies can amplify the impact of data by providing a structured knowledge framework. Early ontologies like GO became too monolithic so the Open Biological Ontologies (OBO) Foundry was created to develop interoperable, modular ontologies through collaboration. Mungall described work developing ontologies like Uberon, developing tools like ROBOT for quality control, and a vision for more sophisticated ontology annotation to encode biological knowledge.
Chris Mungall presented a Bayesian approach called k-BOOM (Bayesian OWL Ontology Merging) to combine existing disease ontologies and lists into a unified framework. k-BOOM generates hypothetical logical mappings between ontologies, estimates weights for each mapping, and uses a greedy algorithm to find the set of mappings that maximizes the probability of the merged ontology. It has been applied to merge several disease ontologies into MonDO (Monarch Disease Ontology). Evaluation found high agreement with held-out data and detection of errors in source ontologies. Next steps include improving weight estimation and evaluation.
Presentation on BioMake, a GNU-Make-like utility for managing builds and complex workflows using declarative specifications. From GMOD/PAG meeting 2017
Mapping Phenotype Ontologies for Obesity and Diabetes (Chris Mungall)
This document discusses approaches to mapping phenotype ontologies across species and categories. It describes using OWL axioms to define phenotypes in a machine-interpretable way and create bridges between ontologies. This enables cross-ontology queries and integrated views of data. Challenges include modeling complex phenomena accurately in OWL and a lack of tools integrated into the ontology development process. The Monarch Initiative aims to address these issues by developing tools like TermGenie and providing integrated views of data from multiple ontologies.
Uberon is an integrative multi-species anatomy ontology that contains over 11,000 classes describing anatomical structures across multiple animal species, with a focus on chordates and mammals. It uses multiple relationship types like subclass, part-of, and develops-from to connect these classes in a structured ontology. Uberon aims to bridge between existing species-specific anatomy ontologies like the Mouse Anatomy ontology and the Foundational Model of Anatomy for human. It allows cross-referencing between these ontologies and helps integrate anatomical knowledge across models and humans.
Increased Expressivity of Gene Ontology Annotations - Biocuration 2013 (Chris Mungall)
Presentation from the Biocuration conference describing an extension to the GO annotation formalism allowing curators to capture more detailed biological context and specificity at the time of annotation. Features Portuguese Man-o-War assaults.
Uberon is a multi-species anatomy ontology covering animal anatomy. It contains over 8,000 classes describing anatomical structures across metazoans in a species-neutral way. Uberon bridges species-specific anatomy ontologies and allows cross-species analysis of high-throughput genomics and phenomics data. It is extensively connected to other biomedical ontologies and has been applied in projects involving phenomics, transcriptomics, systematics and finding disease models.
The debris of the ‘last major merger’ is dynamically young (Sérgio Sacani)
The Milky Way’s (MW) inner stellar halo contains an [Fe/H]-rich component with highly eccentric orbits, often referred to as the ‘last major merger.’ Hypotheses for the origin of this component include Gaia-Sausage/Enceladus (GSE), where the progenitor collided with the MW proto-disc 8–11 Gyr ago, and the Virgo Radial Merger (VRM), where the progenitor collided with the MW disc within the last 3 Gyr. These two scenarios make different predictions about observable structure in local phase space, because the morphology of debris depends on how long it has had to phase mix. The recently identified phase-space folds in Gaia DR3 have positive caustic velocities, making them fundamentally different than the phase-mixed chevrons found in simulations at late times. Roughly 20 per cent of the stars in the prograde local stellar halo are associated with the observed caustics. Based on a simple phase-mixing model, the observed number of caustics is consistent with a merger that occurred 1–2 Gyr ago. We also compare the observed phase-space distribution to FIRE-2 Latte simulations of GSE-like mergers, using a quantitative measurement of phase mixing (2D causticality). The observed local phase-space distribution best matches the simulated data 1–2 Gyr after collision, and certainly not later than 3 Gyr. This is further evidence that the progenitor of the ‘last major merger’ did not collide with the MW proto-disc at early times, as is thought for the GSE, but instead collided with the MW disc within the last few Gyr, consistent with the body of work surrounding the VRM.
The binding of cosmological structures by massless topological defects (Sérgio Sacani)
Assuming spherical symmetry and weak field, it is shown that if one solves the Poisson equation or the Einstein field equations sourced by a topological defect, i.e. a singularity of a very specific form, the result is a localized gravitational field capable of driving flat rotation (i.e. Keplerian circular orbits at a constant speed for all radii) of test masses on a thin spherical shell without any underlying mass. Moreover, a large-scale structure which exploits this solution by assembling concentrically a number of such topological defects can establish a flat stellar or galactic rotation curve, and can also deflect light in the same manner as an equipotential (isothermal) sphere. Thus, the need for dark matter or modified gravity theory is mitigated, at least in part.
The technology uses reclaimed CO₂ as the dyeing medium in a closed loop process. When pressurized, CO₂ becomes supercritical (SC-CO₂). In this state CO₂ has a very high solvent power, allowing the dye to dissolve easily.
When I was asked to give a companion lecture in support of ‘The Philosophy of Science’ (https://shorturl.at/4pUXz) I decided not to walk through the detail of the many methodologies in order of use. Instead, I chose to employ a long standing, and ongoing, scientific development as an exemplar. And so, I chose the ever evolving story of Thermodynamics as a scientific investigation at its best.
Conducted over a period of >200 years, Thermodynamics R&D, and application, benefitted from the highest levels of professionalism, collaboration, and technical thoroughness. New layers of application, methodology, and practice were made possible by the progressive advance of technology. In turn, this has seen measurement and modelling accuracy continually improved at a micro and macro level.
Perhaps most importantly, Thermodynamics rapidly became a primary tool in the advance of applied science/engineering/technology, spanning micro-tech, to aerospace and cosmology. I can think of no better a story to illustrate the breadth of scientific methodologies and applications at their best.
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...Travis Hills MN
By harnessing the power of High Flux Vacuum Membrane Distillation, Travis Hills from MN envisions a future where clean and safe drinking water is accessible to all, regardless of geographical location or economic status.
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...Sérgio Sacani
Context. With a mass exceeding several 104 M⊙ and a rich and dense population of massive stars, supermassive young star clusters
represent the most massive star-forming environment that is dominated by the feedback from massive stars and gravitational interactions
among stars.
Aims. In this paper we present the Extended Westerlund 1 and 2 Open Clusters Survey (EWOCS) project, which aims to investigate
the influence of the starburst environment on the formation of stars and planets, and on the evolution of both low and high mass stars.
The primary targets of this project are Westerlund 1 and 2, the closest supermassive star clusters to the Sun.
Methods. The project is based primarily on recent observations conducted with the Chandra and JWST observatories. Specifically,
the Chandra survey of Westerlund 1 consists of 36 new ACIS-I observations, nearly co-pointed, for a total exposure time of 1 Msec.
Additionally, we included 8 archival Chandra/ACIS-S observations. This paper presents the resulting catalog of X-ray sources within
and around Westerlund 1. Sources were detected by combining various existing methods, and photon extraction and source validation
were carried out using the ACIS-Extract software.
Results. The EWOCS X-ray catalog comprises 5963 validated sources out of the 9420 initially provided to ACIS-Extract, reaching a
photon flux threshold of approximately 2 × 10−8 photons cm−2
s
−1
. The X-ray sources exhibit a highly concentrated spatial distribution,
with 1075 sources located within the central 1 arcmin. We have successfully detected X-ray emissions from 126 out of the 166 known
massive stars of the cluster, and we have collected over 71 000 photons from the magnetar CXO J164710.20-455217.
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdfSelcen Ozturkcan
Ozturkcan, S., Berndt, A., & Angelakis, A. (2024). Mending clothing to support sustainable fashion. Presented at the 31st Annual Conference by the Consortium for International Marketing Research (CIMaR), 10-13 Jun 2024, University of Gävle, Sweden.
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...Scintica Instrumentation
Targeting Hsp90 and its pathogen Orthologs with Tethered Inhibitors as a Diagnostic and Therapeutic Strategy for cancer and infectious diseases with Dr. Timothy Haystead.
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...PsychoTech Services
A proprietary approach developed by bringing together the best of learning theories from Psychology, design principles from the world of visualization, and pedagogical methods from over a decade of training experience, that enables you to: Learn better, faster!
PPT on Direct Seeded Rice presented at the three-day 'Training and Validation Workshop on Modules of Climate Smart Agriculture (CSA) Technologies in South Asia' workshop on April 22, 2024.
Immersive Learning That Works: Research Grounding and Paths ForwardLeonel Morgado
We will metaverse into the essence of immersive learning, into its three dimensions and conceptual models. This approach encompasses elements from teaching methodologies to social involvement, through organizational concerns and technologies. Challenging the perception of learning as knowledge transfer, we introduce a 'Uses, Practices & Strategies' model operationalized by the 'Immersive Learning Brain' and ‘Immersion Cube’ frameworks. This approach offers a comprehensive guide through the intricacies of immersive educational experiences and spotlighting research frontiers, along the immersion dimensions of system, narrative, and agency. Our discourse extends to stakeholders beyond the academic sphere, addressing the interests of technologists, instructional designers, and policymakers. We span various contexts, from formal education to organizational transformation to the new horizon of an AI-pervasive society. This keynote aims to unite the iLRN community in a collaborative journey towards a future where immersive learning research and practice coalesce, paving the way for innovative educational research and practice landscapes.
2. Challenge: making representations of biological knowledge interoperable
[Diagram: a densely interconnected cloud of resources: OMIM, MGI, HGNC, FlyBase, ClinVar, CTD, DrugBank, UniProt, BGeeDb, GO, SGD, RGD, PomBase, Monarch, WormBase, PharmGKB, Reactome, GWAS catalog, CHEMBL, ENSEMBL, BioGrid, KEGG, Panther, ZFIN, XenBase, Animal QTLdb]
3. What do we mean by knowledge here?
● Data, sensu lato: a collection of values in some organized form
○ Data, sensu stricto: Output of a data collection process
■ Instrumentation or observation; raw or processed; not altered by curation
■ Serves role as evidence
■ E.g. read count in RNAseq experiment OR examination of KO mouse
○ Metadata: data about data (or, more typically, about datasets)
■ May be curated at source, post-hoc, manually or automatically
■ E.g. details about an RNAseq experiment (factors, instrumentation, sample prep)
○ Knowledge: Propositional assertions inferred from data
■ Something you need evidence for
■ E.g.
● gene G is expressed in tissue T under condition C
● Knocking out G gives rise to phenotype P with high penetrance
● Many bio-”databases” are actually “knowledge bases” (by this definition)
● Usual caveats:
○ Other definitions available, divisions can be murky, this is a guide rather than dogma, etc
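By this definition, a knowledge-base record is a proposition plus the evidence that supports it, where the evidence links back to data sensu stricto. A minimal sketch of that shape (all names and values illustrative, not a real record format):

```python
# A knowledge assertion: a proposition that requires supporting evidence.
# Contrast with the raw data (the read counts themselves) and the metadata
# (details of the experiment that produced them).
assertion = {
    "subject": "gene:G",
    "predicate": "expressed_in",
    "object": "tissue:T",
    "qualifier": {"condition": "C"},
    # the evidence links the proposition back to data sensu stricto
    "evidence": [
        {"type": "RNAseq", "supporting_data": "read counts, experiment E1"},
    ],
}

def is_supported(assertion):
    """A proposition without evidence is a claim, not knowledge."""
    return len(assertion.get("evidence", [])) > 0

print(is_supported(assertion))  # True
```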
5. Haven’t we been here before?
http://www.mged.org/Meetings/presentations/OMG/sld019.htm
6. Haven’t we been here before?
http://www.mged.org/Meetings/presentations/OMG/sld019.htm
7. Complexity and fluidity of biological knowledge vs schema rigidity
// hypothetical strawman schema
class Gene {
  String name;
  String function;
  String phenotype;
  Protein product;
  Int start;
  Int end;
  String chromosome;
}
Bad assumptions (function):
- Genes actually have multiple functions
- String representation rather than a controlled vocabulary
Bad assumptions (start, end, chromosome):
- What about different genome builds?
- Should be inherited from a generic sequence feature
Bad assumptions (product):
- Genes can have multiple products
- Products are not necessarily proteins
- What about transcript, exon, ...?
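For contrast, the same entity in a LinkML-style YAML schema (the approach this deck builds toward) can drop those assumptions: multivalued slots for functions and products, ontology-term ranges instead of free strings, and inheritance from a generic sequence feature. A hedged sketch only; class and slot names here are illustrative, not the actual Biolink model:

```yaml
classes:
  SequenceFeature:
    attributes:
      start:
        range: integer
      end:
        range: integer
      chromosome:
        range: Chromosome      # an entity reference, not a bare string
      genome_build:            # different builds made explicit
        range: string
  Gene:
    is_a: SequenceFeature      # positional slots inherited, not repeated
    attributes:
      name:
        range: string
      function:
        range: OntologyTerm    # controlled vocabulary, not free text
        multivalued: true      # genes have multiple functions
      product:
        range: GeneProduct     # protein or transcript, ...
        multivalued: true      # genes have multiple products
```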
8. The backwards evolution of schema languages
● 80s: ER, SQL DDL
○ Basis in FOL, formal algebra/calculus
● 90s: OO, UML, Description Logics
○ Rich polymorphism
● 00s: XML, SOAP
○ Can’t even...
● 10s: JSON and JSON-Schema
○ No polymorphism
○ Limited typing
○ Tree-based
○ Geared towards web-apps, not rich modeling
9. What works: Open-ended knowledge representation using RDF Graphs plus OWL
● RDF: minimal representation model for representing simple facts as edges
● OWL: encodes semantics about RDF graphs
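The "facts as edges" idea fits in a few lines: an RDF graph is just a set of (subject, predicate, object) triples, and OWL-style semantics can be layered on top as inference rules. A minimal sketch with illustrative identifiers, using transitivity of subclass_of as the example rule:

```python
# An RDF graph is a set of (subject, predicate, object) edges.
graph = {
    ("gene:Shh", "expressed_in", "uberon:limb_bud"),
    ("uberon:limb_bud", "subclass_of", "uberon:embryonic_structure"),
    ("uberon:embryonic_structure", "subclass_of", "uberon:anatomical_entity"),
}

def entailed_subclasses(graph):
    """OWL-style semantics as an inference rule: subclass_of is
    transitive, so saturate the graph with the entailed edges."""
    closed = set(graph)
    changed = True
    while changed:
        changed = False
        sub = {(s, o) for s, p, o in closed if p == "subclass_of"}
        for s1, o1 in sub:
            for s2, o2 in sub:
                if o1 == s2 and (s1, "subclass_of", o2) not in closed:
                    closed.add((s1, "subclass_of", o2))
                    changed = True
    return closed

closure = entailed_subclasses(graph)
# The entailed edge is now queryable alongside the asserted ones:
print(("uberon:limb_bud", "subclass_of", "uberon:anatomical_entity") in closure)  # True
```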
10. Success of OWL: Bio-Ontologies
● One datamodel (OWL) covers a rich variety of interconnected biology
● APIs, SPARQL, ...
http://obofoundry.org/ontology/uberon.html
11. Analogous approach in biological databases
● GMOD Chado
● Graph-like database layered over an RDBMS
● Allowed flexibility and extensibility
● Large uptake by small MODs
Mungall, C. J., Emmert, D. B., et al. (2007). Bioinformatics, 23(13), i337–i346. http://doi.org/10.1093/bioinformatics/btm189
https://github.com/GMOD/Chado
12. Knowledge Graphs, the most pluripotent representation of data, are no longer as exotic or
experimental as they were 10 years ago. Goofaceamazonlink (Google, Facebook, Amazon, LinkedIn) etc. are all using them to some degree.
13. Challenge: too much flexibility
● With flexible schema-free graph-based
representations, multiple ways of modeling
things
● OWL provides semantic open-world
biological constraints
○ All genes are located_on exactly 1 chromosome
● Software often needs more rigid closed-world information model constraints
○ Information System A: gene can be located on
multiple contigs/scaffolds
○ Information System B: locational info not relevant
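The difference matters in practice: under the open-world assumption, the OWL axiom "all genes are located_on exactly 1 chromosome" does not reject a record with zero or two locations; it just concludes the knowledge is incomplete or that two names denote one thing. Software usually wants the closed-world reading, where what is in the record is all there is. A minimal sketch of a closed-world cardinality check (names illustrative):

```python
def check_cardinality(record, slot, minimum=1, maximum=1):
    """Closed-world check: missing values are violations, not unknowns."""
    values = record.get(slot, [])
    return minimum <= len(values) <= maximum

gene_a = {"id": "gene:A", "located_on": ["chr1"]}
gene_b = {"id": "gene:B", "located_on": ["contig1", "contig2"]}  # Information System A
gene_c = {"id": "gene:C"}                                        # Information System B

# One rigid constraint cannot serve both information systems:
print(check_cardinality(gene_a, "located_on"))             # True
print(check_cardinality(gene_b, "located_on"))             # False: needs a higher maximum
print(check_cardinality(gene_c, "located_on", minimum=0))  # True: slot made optional
```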
14. BioLink Model Approach
● Define a powerful underlying metamodel
○ Mix aspects of closed-world UML and open-world OWL
○ Build for extensibility
○ Define exports: UML, SQL DDL, GraphQL, Json-Schema, Java, ...
● Define core biological types (E)
○ Gene, disease, anatomical entity, ...
○ Cede detailed typology to ontologies
● Define core properties (R)
○ Id, name, synonym
○ Part-of, interacts-with, gives-rise-to
● Define taxonomy of relationships (extension of R)
○ Gene-gene-interaction, gene-tissue-expression
● Extensibility through use-case specific profiles
https://biolink.github.io/biolink-model
15. Browsing the model
● YAML source
● Autogenerated website docs: https://biolink.github.io/biolink-model
● OWL export
○ Protege
○ Bioportal
● JSON-Schema (lossy unless working in JSON-LD)
● GraphQL (lossy)
● UML Diagrams (lossy)
https://biolink.github.io/biolink-model
22. Profiles
● Different projects require different views of the data
○ E.g. omission/inclusion of different fields
○ Denormalizations
○ Inlining vs referencing
● Metamodel supports remixing and mixins
● One core conceptual model
● Different serializations for different profiles
● Well-defined transforms
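The profile idea — one core conceptual model, different per-project views — can be pictured as a projection over records: each profile says which slots to keep and whether references are inlined as objects or left as identifiers. A small illustration; the profile structure here is hypothetical, not the actual metamodel:

```python
def apply_profile(record, keep, inline_refs=False, objects=None):
    """Project a record onto a profile: keep only the listed slots, and
    either inline referenced objects or leave bare identifiers."""
    objects = objects or {}
    view = {k: v for k, v in record.items() if k in keep}
    if inline_refs:
        # replace any value that is a known identifier with its object
        view = {k: objects.get(v, v) for k, v in view.items()}
    return view

objects = {"tissue:T1": {"id": "tissue:T1", "name": "limb bud"}}
assoc = {"subject": "gene:G1", "object": "tissue:T1", "publications": ["PMID:1"]}

# Two profiles over the same core record:
lite = apply_profile(assoc, keep={"subject", "object"})
full = apply_profile(assoc, keep={"subject", "object"},
                     inline_refs=True, objects=objects)

print(lite)                    # {'subject': 'gene:G1', 'object': 'tissue:T1'}
print(full["object"]["name"])  # limb bud
```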
● Caveat: this part is not well documented yet
23. How do I use it? How do I get data?
● Data model is serialization neutral
○ Plus: Flexible
○ Negative: Additional layer of abstraction
● RDF/Turtle serialization
○ http://data.monarchinitiative.org/ttl/
○ Turtle conforms to association patterns
● Property graphs
○ http://neo4j.monarchinitiative.org/
● JSON
○ Challenge: lack of polymorphism
○ Available via generic model or specific models
○ API http://api.monarchinitiative.org/api/
○ Preview: https://data.monarchinitiative.org/json/
○ BDBags of JSON coming soon
24. What NOT to use the biolink-model for
● Raw data
● Metadata about a dataset
● ..
● However..
○ Underlying metamodel may be useful in providing flexible representations of these
○ Currently aligning with FHIR metamodel
25. How does this relate to KC7?
● One view: DC is about data sensu stricto, and metadata
○ Search = lightweight ontology (syns + subsumption) + metadata datamodels
○ “Knowledge bases” have their own specialized search interfaces developed by specialists
○ No role for a standard KM in DC
● Counterview
○ We’re not trying to compete with bio-KBs
○ We want to leverage knowledge to enhance data search
■ Analogous to how google KG enhances google search
○ Example:
■ Find TopMed studies relevant to my disease
● Exploit KG linkages between disease-phenotype, phenotype-variable, phenotype-gene