Tony Burdett's slides from his talk at Connected Data London. Tony is a Senior Software Engineer at the European Bioinformatics Institute. He presented on the complexity of data at EMBL-EBI and the institute's approach to making sense of it all.
Facilitating semantic alignment of EMBL-EBI services using ontologies and semantic web technology. Presentation at the BioHackathon Symposium 2016, Japan.
Data integration is intrinsic to how modern research is undertaken in areas such as genomics, drug development and personalised medicine. To better enable this integration a large number of biomedical ontologies have been developed to provide standard semantics for describing metadata. There are now several hundred biomedical ontologies in widespread use that describe concepts such as genes, molecules, drugs and diseases. This amounts to millions of terms that are interconnected via relationships that naturally form a graph of biomedical terminology.
The Ontology Lookup Service (OLS) (http://www.ebi.ac.uk/ols) integrates over 160 ontologies and provides a central point for the biomedical community to query and visualise ontologies. OLS also provides a RESTful API over the ontologies that is used in high-throughput data annotation pipelines. OLS is built on top of a Neo4j database that provides efficient indexes for extracting ontological relationships. We have developed generic tools for loading RDF/OWL ontologies into Neo4j where the indexes are optimised for serving common ontology queries. We are now moving to adopt graph databases more widely in applications relating to ontology mapping prediction and recommendation systems for data annotation.
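To give a flavour of the kind of programmatic access OLS offers, here is a minimal Python sketch that searches the service over HTTP. The endpoint path and response fields are assumptions recalled from the public OLS REST documentation, so verify them against the current API before relying on this.

```python
import requests

# Search the Ontology Lookup Service (OLS) for a term.
# Endpoint and Solr-style response structure are assumptions based on the
# public OLS REST API (https://www.ebi.ac.uk/ols/api); verify before use.
OLS_SEARCH = "https://www.ebi.ac.uk/ols/api/search"

def search_ols(query, ontology=None, rows=5):
    """Return (label, IRI) pairs for terms matching `query`."""
    params = {"q": query, "rows": rows}
    if ontology:
        params["ontology"] = ontology  # e.g. "efo", "go"
    resp = requests.get(OLS_SEARCH, params=params, timeout=30)
    resp.raise_for_status()
    docs = resp.json()["response"]["docs"]
    return [(d.get("label"), d.get("iri")) for d in docs]

if __name__ == "__main__":
    for label, iri in search_ols("diabetes mellitus", ontology="efo"):
        print(label, iri)
```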
These slides were presented at the "graph databases in life sciences workshop". There is an accompanying Neo4j guide that will walk you through importing data into Neo4j using web services from a number of databases at EMBL-EBI.
https://github.com/simonjupp/importing-lifesci-data-into-neo4j
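In the spirit of that guide, the sketch below pulls a list of ontologies from the OLS web service and loads them into Neo4j with the official Python driver. The endpoint, response field names, and graph model here are illustrative assumptions; the guide itself works through several EMBL-EBI services in more depth.

```python
import requests
from neo4j import GraphDatabase

# Fetch ontologies from the OLS web service and load each as a node in Neo4j.
# Endpoint and HAL response field names are assumed from the public OLS API;
# the driver calls below target the neo4j Python driver v5.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def load_ontologies(tx, ontologies):
    for onto in ontologies:
        tx.run(
            "MERGE (o:Ontology {id: $id}) SET o.title = $title",
            id=onto["ontologyId"],
            title=onto["config"]["title"],
        )

resp = requests.get("https://www.ebi.ac.uk/ols/api/ontologies",
                    params={"size": 50}, timeout=30)
resp.raise_for_status()
ontologies = resp.json()["_embedded"]["ontologies"]

with driver.session() as session:
    session.execute_write(load_ontologies, ontologies)
driver.close()
```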
Keynote: SemSci 2017: Enabling Open Semantic Science
1st International Workshop co-located with ISWC 2017, October 2017, Vienna, Austria.
https://semsci.github.io/semSci2017/
Abstract
We have all grown up with the research article and article collections (let’s call them libraries) as the prime means of scientific discourse. But research output is more than just the rhetorical narrative. The experimental methods, computational codes, data, algorithms, workflows, Standard Operating Procedures, samples and so on are the objects of research that enable reuse and reproduction of scientific experiments, and they too need to be examined and exchanged as research knowledge.
We can think of “Research Objects” of different types as packages of all the components of an investigation. If we stop thinking of publishing papers and start thinking of releasing Research Objects (software), then scholarly exchange is a new game: ROs and their content evolve; they are multi-authored and their authorship evolves; they are a mix of virtual and embedded, and so on.
But first, some baby steps before we get carried away with a new vision of scholarly communication. Many journals (e.g. eLife, F1000, Elsevier) are just figuring out how to package together the supplementary materials of a paper. Data catalogues are figuring out how to virtually package multiple datasets scattered across many repositories to keep the integrated experimental context.
Research Objects [1] (http://researchobject.org/) is a framework by which the many, nested and contributed components of research can be packaged together in a systematic way, and their context, provenance and relationships richly described. The brave new world of containerisation provides the containers, and Linked Data provides the metadata framework for the construction of container manifests and profiles. It’s not just theory but also practice, with examples in Systems Biology modelling, Bioinformatics computational workflows, and Health Informatics data exchange. I’ll talk about why and how we got here, the framework and examples, and what we need to do.
[1] Sean Bechhofer, Iain Buchan, David De Roure, Paolo Missier, John Ainsworth, Jiten Bhagat, Philip Couch, Don Cruickshank, Mark Delderfield, Ian Dunlop, Matthew Gamble, Danius Michaelides, Stuart Owen, David Newman, Shoaib Sufi, Carole Goble, Why linked data is not enough for scientists, In Future Generation Computer Systems, Volume 29, Issue 2, 2013, Pages 599-611, ISSN 0167-739X, https://doi.org/10.1016/j.future.2011.08.004
The metadata about scientific experiments are crucial for finding, reproducing, and reusing the data that the metadata describe. We present a study of the quality of the metadata stored in BioSample—a repository of metadata about samples used in biomedical experiments managed by the U.S. National Center for Biotechnology Information (NCBI). We tested whether 6.6 million BioSample metadata records are populated with values that fulfill the stated requirements for such values. Our study revealed multiple anomalies in the analyzed metadata. The BioSample metadata field names and their values are not standardized or controlled—15% of the metadata fields use field names not specified in the BioSample data dictionary. Only 9 out of 452 BioSample-specified fields ordinarily require ontology terms as values, and the quality of these controlled fields is better than that of uncontrolled ones, as even simple binary or numeric fields are often populated with inadequate values of different data types (e.g., only 27% of Boolean values are valid). Overall, the metadata in BioSample reveal that there is a lack of principled mechanisms to enforce and validate metadata requirements. The aberrancies in the metadata are likely to impede search and secondary use of the associated datasets.
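To make the kind of check the study performs concrete, a validation pass over a Boolean-typed field might look like the following sketch. The records, field name, and the set of accepted Boolean spellings are invented for illustration; the actual study validated values against the requirements stated in the BioSample data dictionary.

```python
# Sketch of validating a Boolean-typed metadata field, in the spirit of the
# BioSample study. Records, field name, and accepted spellings are invented.
VALID_BOOLEANS = {"true", "false", "yes", "no", "1", "0"}

records = [
    {"sample_id": "S1", "is_tumor": "yes"},
    {"sample_id": "S2", "is_tumor": "not applicable"},  # invalid for a Boolean field
    {"sample_id": "S3", "is_tumor": "TRUE"},
    {"sample_id": "S4", "is_tumor": "tumour tissue"},   # free text where a Boolean is required
]

def is_valid_boolean(value):
    return value.strip().lower() in VALID_BOOLEANS

valid = sum(is_valid_boolean(r["is_tumor"]) for r in records)
print(f"{valid}/{len(records)} Boolean values valid "
      f"({100 * valid / len(records):.0f}%)")
```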
The availability of high-quality metadata is key to facilitating discovery in the large variety of scientific datasets that are increasingly becoming publicly available. However, despite the recent focus on metadata, the diversity of metadata representation formats and the poor support for semantic markup typically result in metadata that are of poor quality. There is a pressing need for a metadata representation format that provides strong interoperation capabilities together with robust semantic underpinnings. In this talk, we describe such a format, together with open-source Web-based tools that support the acquisition, search, and management of metadata. We outline an initial evaluation using metadata from a variety of biomedical repositories.
The Center for Expanded Data Annotation and Retrieval (CEDAR) has developed a suite of tools and services that allow scientists to create and publish metadata describing scientific experiments. Using these tools and services—referred to collectively as the CEDAR Workbench—scientists can collaboratively author metadata and submit them to public repositories. A key focus of our software is semantically enriching metadata with ontology terms. The system combines emerging technologies, such as JSON-LD and graph databases, with modern software development technologies, such as microservices and container platforms. The result is a suite of user-friendly, Web-based tools and REST APIs that provide a versatile end-to-end solution to the problems of metadata authoring and management. This talk presents the architecture of the CEDAR Workbench and focuses on the technology choices made to construct an easily usable, open system that allows users to create and publish semantically enriched metadata in standard Web formats.
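For a flavour of what semantically enriched metadata looks like on the wire, here is a minimal, hypothetical JSON-LD record with an ontology-term value, built in Python. The context and the EFO term IRI are illustrative; CEDAR's actual template model is considerably richer.

```python
import json

# A minimal, hypothetical JSON-LD metadata record with an ontology-term value.
# The @context and the EFO term are illustrative, not CEDAR's template schema.
metadata = {
    "@context": {
        "schema": "http://schema.org/",
        "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
        "disease": "schema:about",
    },
    "@id": "https://example.org/metadata/experiment-42",
    "schema:name": "RNA-seq of pancreatic tissue",
    "disease": {
        "@id": "http://www.ebi.ac.uk/efo/EFO_0000400",  # an EFO disease term
        "rdfs:label": "diabetes mellitus",
    },
}

print(json.dumps(metadata, indent=2))
```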
Information retrieval (IR) is the retrieval of items (objects, Web pages, documents, and so forth) that fulfil explicit conditions set in a query, such as a regular expression. While IR aims to satisfy a user's information need, usually expressed in natural language, data retrieval aims to determine which records contain the exact terms of the user's query.
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe... (Carole Goble)
Over the past 5 years we have seen a change in expectations for the management of all the outcomes of research – that is, the “assets” of data, models, codes, SOPs and so forth. Don’t stop reading. Data management isn’t likely to win anyone a Nobel prize. But publications should be supported and accompanied by data, methods, procedures, etc. to assure reproducibility of results. Funding agencies expect data (and increasingly software) management, retention and access plans as part of the proposal process for projects to be funded. Journals are raising their expectations of the availability of data and codes for pre- and post-publication. The multi-component, multi-disciplinary nature of Systems Biology demands the interlinking and exchange of assets and the systematic recording of metadata for their interpretation.
The FAIR Guiding Principles for scientific data management and stewardship (http://www.nature.com/articles/sdata201618) have been an effective rallying cry for EU and USA Research Infrastructures. The FAIRDOM (Findable, Accessible, Interoperable, Reusable Data, Operations and Models) Initiative has 8 years of experience of asset sharing and data infrastructure, ranging across European programmes (SysMO and EraSysAPP ERANets), national initiatives (de.NBI, the German Virtual Liver Network, UK SynBio centres) and PIs' labs. It aims to support Systems and Synthetic Biology researchers with data and model management, with an emphasis on standards smuggled in by stealth and sensitivity to asset sharing and credit anxiety.
This talk will use the FAIRDOM Initiative to discuss the FAIR management of data, SOPs, and models for Sys Bio, highlighting the challenges of and approaches to sharing, credit, citation and asset infrastructures in practice. I'll also highlight recent experiments in affecting sharing using behavioural interventions.
http://www.fair-dom.org
http://www.fairdomhub.org
http://www.seek4science.org
Presented at COMBINE 2016, Newcastle, 19 September.
http://co.mbine.org/events/COMBINE_2016
EUGM 2014 - Mark Davies (EMBL-EBI): SureChEMBL – Open Patent Data (ChemAxon)
Historically, the cost of access to structured chemical data extracted from patents has been prohibitively high for many researchers working in the field of Drug Discovery. The benefit of delivering this dataset to the scientific community in a free and open manner cannot be overstated. Aware of the demand for such a service, the European Bioinformatics Institute (EMBL-EBI) acquired the SureChem patent system from Digital Science Ltd in December 2013. The service has been rebranded SureChEMBL and is run by the ChEMBL group alongside existing Open Drug Discovery and research resources such as the ChEMBL database and UniChem. This talk will provide an overview of the existing system architecture, including the ChemAxon software, describing how we go from patent literature to structured chemical data accessible via a Web interface and API. The challenges of migrating such a complex system will be discussed, as well as the opportunities to enhance the data processing pipeline based on prior knowledge from running large chemical resources. In addition to providing an overview of the system, our future plans for the SureChEMBL system will be described. To date these plans include extending the functionality of the entity extractor to identify additional entities important in the Drug Discovery process, such as protein targets, diseases and cell lines. Other plans focus on integration with existing EMBL-EBI resources, such as the ChEMBL database and Europe PubMed Central. Finally, we look towards new and exciting ways to share the data, such as integration with Semantic Web technologies and distribution via private Virtual Machine instances.
Paul Rissen's slides from his talk at Connected Data London. Paul Rissen, who is the Senior Data Architect for BBC News, and the Product Manager for the Research and Education Space (http://res.space) presented how the BBC implemented a User-focused Semantic architecture.
The Three Lines of Defense Model & Continuous Controls Monitoring (CaseWare IDEA)
Presented at ACFE conference.
Long gone are the days when organizations could afford to treat each risk, fraud and compliance issue as an individual problem and allow business processes, employees and systems to operate in silos. In order for businesses to activate robust fraud detection, diverse teams of fraud investigators, internal auditors, enterprise risk management specialists, business executives and compliance officers must work in unison; each brings a unique perspective and skill set that can be invaluable to the organization. One approach we’ll examine is the Three Lines of Defense Model where management control is the first line of defense in risk management. The various risk control and compliance functions are the second line of defense, and independent assurance is the third. Each team or “line” plays a distinct role to achieve organizational objectives.
You Will Learn How To:
1. Make a business case for collaboration while remaining true to the principles of your profession
2. Derive business benefits from risk management and internal audit working collaboratively to fulfill their second and third line of defense mandates
3. Tailor the Three Lines of Defense Model to fit your organization
SLIDESHARE: www.slideshare.net/CaseWare_Analytics
WEBSITE: www.casewareanalytics.com
BLOG: www.casewareanalytics.com/blog
TWITTER: www.twitter.com/CW_Analytic
Ontologies and Semantic Web technologies play an important role in the life sciences to help make data more interoperable and reusable. There are now many publicly available ontologies that enable biologists to describe everything from gene function through to animal physiology and disease.
Various efforts such as the Open Biomedical Ontologies (OBO) foundry provide central registries for biomedical ontologies and ensure they remain interoperable through a set of common shared development principles.
At EMBL-EBI we contribute to the development of biomedical ontologies and make extensive use of them in the annotation of public datasets. Biological data typically comes with rich and often complex metadata, so the ontologies provide a standard way to capture “what the data is about” and give us hooks to connect to more data about similar things.
These ontology annotations have been put to good use in a number of large-scale data integration efforts and there’s an increasing recognition of the need for ontologies in making data FAIR (Findable, Accessible, Interoperable and Reusable).
EMBL-EBI builds a number of integrative data platforms where ontologies are at the core of our domain models. One example is the Open Targets platform, where data about disease from 18 different databases can be aggregated and grouped based on therapeutic areas in the ontology and used to identify potential drug targets.
The ontologies team at EMBL-EBI provides a suite of services aimed at making ontologies more accessible for both humans and machines. We work with scientific data curators and software developers to integrate ontologies and semantics into both the data generation and data presentation workflows. We provide:
– An ontology lookup service (OLS) that provides search and visualisation services for more than 200 ontologies
– Services for automating the annotation of metadata and learning from previous annotations (Zooma)
– An ontology mapping and alignment service (OXO)
– Tools for working with metadata and ontologies in spreadsheets (Webulous)
– Software for enriching documents in search engines to support “semantic” query expansion
I’ll present how we are using these services at EMBL-EBI to scale up the semantic annotation of metadata. I’ll talk about our open source technology stack and describe how we utilise a polyglot persistence approach (graph databases, triples stores, document stores etc) to optimize how we deliver ontologies and semantics to our users.
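As one concrete example of automating annotation with these services, the sketch below asks Zooma to suggest ontology terms for a free-text metadata value. The endpoint path and response fields are assumptions recalled from Zooma's public REST documentation and should be verified against the live service.

```python
import requests

# Ask Zooma for ontology-term predictions for a free-text property value.
# Endpoint and response field names are assumed from Zooma's public REST docs.
ZOOMA_ANNOTATE = "https://www.ebi.ac.uk/spot/zooma/v2/api/services/annotate"

def annotate(value):
    resp = requests.get(ZOOMA_ANNOTATE, params={"propertyValue": value}, timeout=30)
    resp.raise_for_status()
    # Each prediction carries a confidence level and the predicted term IRIs.
    return [(a.get("confidence"), a.get("semanticTags")) for a in resp.json()]

if __name__ == "__main__":
    for confidence, tags in annotate("mus musculus"):
        print(confidence, tags)
```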
Ontologies for life sciences: examples from the gene ontology (Melanie Courtot)
A half day course presented during the Earlham Institute summer school on bioinformatics 2016, in Norwich, UK, http://www.earlham.ac.uk/earlham-institute-summer-school-bioinformatics
Lecture delivered by T. Ashok Kumar, Head, Department of Bioinformatics, Noorul Islam College of Arts and Science, Kumaracoil, Thuckalay, INDIA. UGC Sponsored National Workshop on BIOINFORMATICS AND GENOME ANALYSIS for College Teachers on August 11 & 12, 2014. Organized by Centre for Bioinformatics, Department of Zoology, NMCC.
Apollo and i5K: Collaborative Curation and Interactive Analysis of Genomes (Monica Munoz-Torres)
Precise elucidation of the many different biological features encoded in a genome requires a careful curation process that involves reviewing all available evidence to allow researchers to resolve discrepancies and validate automated gene models, protein alignments, and other biological elements. Genome annotation is an inherently collaborative task; researchers only rarely work in isolation, turning to colleagues for second opinions and insights from those with expertise in particular domains and gene families.
The i5k initiative seeks to sequence the genomes of 5,000 insect and related arthropod species. The selected species are known to be important to worldwide agriculture, food safety, medicine, and energy production, as well as species used as models in biology, those most abundant in world ecosystems, and representatives of every branch of the insect phylogeny, in an effort to better understand arthropod evolution and phylogeny. Because computational genome analysis remains an imperfect art, each of these newly sequenced genomes will require visualization and curation.
Apollo is an instantaneous, collaborative genome annotation editor, and the new JavaScript-based version allows researchers real-time interactivity, breaking down large amounts of data into manageable portions to mobilize groups of researchers with shared interests. The i5k is a broad and inclusive effort that seeks to involve scientists from around the world in its genome curation process, and Apollo is serving as the platform to empower this community. Here we offer details about this collaboration.
A Semantic Web based Framework for Linking Healthcare Information with Comput... (Koray Atalag)
Presented at Health Informatics New Zealand (HINZ 2017) Conference, 1-3 Nov 2017, Rotorua, New Zealand. Authorship: Koray Atalag, Reza Kalbasi, David Nickerson
The University of Auckland
Precise elucidation of the many different biological features encoded in any genome requires careful examination and review by researchers, who gather and evaluate the available evidence to corroborate and modify gene predictions and other biological elements. This curation process allows them to resolve discrepancies and validate automated gene model hypotheses and alignments. This approach is the well-established practice for well-known genomes such as human, mouse, zebrafish, Drosophila, et cetera. Desktop Apollo was originally developed to meet these needs.
The cost of sequencing a genome has been dramatically reduced by several orders of magnitude in the last decade, and the natural consequence is that more and more researchers are sequencing more and more new genomes, both within populations and across species. Because individual researchers can now readily sequence many genomes of interest, the need for a universally accessible genomic curation tool logically follows. Each new exome or genome sequenced requires visualization and curation to obtain biologically accurate genomic feature sets, even for a limited set of genes, because computational genome analysis remains an imperfect art. Additionally, unlike earlier genome projects, which had the advantage of more highly polished genomes, recent projects usually have lower coverage. Therefore researchers now face additional work correcting for more frequent assembly errors and annotating genes split across multiple contigs.
Genome annotation is an inherently collaborative task; researchers only very rarely work in isolation, turning to colleagues for second opinions and insights from those with expertise in particular domains and gene families. The new JavaScript-based Apollo allows researchers real-time interactivity, breaking down large amounts of data into manageable portions to mobilize groups of researchers with shared interests. We are also focused on training the next generation of researchers by reaching out to educators to make these tools available as part of curricula via workshops and webinars, and through widely applied systems such as iPlant and DNA Subway. Here we offer details of our progress.
Presentation at Genome Informatics, Session (3) on Databases, Data Mining, Visualization, Ontologies and Curation.
Authors: Monica C Munoz-Torres, Suzanna E. Lewis, Ian Holmes, Colin Diesh, Deepak Unni, Christine Elsik.
Collaboratively Creating the Knowledge Graph of Life (Chris Mungall)
Overview of collaborative projects in the life sciences building out the necessary ontologies, schemas, and knowledge graphs for describing biological knowledge
After the amazing breakthroughs of machine learning (deep learning or otherwise) in the past decade, the shortcomings of machine learning are also becoming increasingly clear: unexplainable results, data hunger and limited generalisability are all becoming bottlenecks.
In this talk we will look at how the combination with symbolic AI (in the form of very large knowledge graphs) can give us a way forward, towards machine learning systems that can explain their results, that need less data, and that generalise better outside their training set.
--
Frank van Harmelen leads the Knowledge Representation & Reasoning group in the CS Department of the VU University Amsterdam. He is also Principal Investigator of the Hybrid Intelligence Centre, a €20M, 10-year collaboration between researchers at 6 Dutch universities into AI that collaborates with people instead of replacing them.
--
While mathematicians have used graph theory since the 18th century to solve problems, the software patterns for graph data are new to most developers. To enable "mass adoption" of graph technology, we need to establish the right abstractions, access APIs, and data models.
RDF triples, while of paramount importance in establishing RDF graph semantics, are a low-level abstraction, much like using assembly language. For practical and productive “graph programming” we need something different.
Similarly, existing declarative graph query languages (such as SPARQL and Cypher) are not always the best way to access graph data, and sometimes you need a simpler interface (e.g., GraphQL), or even a different approach altogether (e.g., imperative traversals such as with Gremlin).
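To make the abstraction gap concrete, the following sketch contrasts asserting raw triples one statement at a time with asking a question declaratively in SPARQL, using the rdflib Python library; it is a minimal illustration of the two levels, not a recommendation of either.

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")
g = Graph()

# Low level: asserting individual triples by hand, one statement at a time.
g.add((EX.alice, EX.knows, EX.bob))
g.add((EX.bob, EX.name, Literal("Bob")))

# Higher level: a declarative SPARQL query over the same data.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?name WHERE {
        ex:alice ex:knows ?person .
        ?person ex:name ?name .
    }
""")
for row in results:
    print(row.name)  # -> "Bob"
```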
Ora Lassila is a Principal Graph Technologist in the Amazon Neptune graph database group. He has a long experience with graphs, graph databases, ontologies, and knowledge representation. He was a co-author of the original RDF specification as well as a co-author of the seminal article on the Semantic Web.
Knowledge Architecture: Combining Strategy, Data Science and Information Arch... (Connected Data World)
"The most important contribution management needs to make in the 21st Century is to increase the productivity of knowledge work and the knowledge worker", said Peter F. Drucker in 1999, and time has proven him right.
Even NASA is no exception, as it faces a number of challenges. NASA has hundreds of millions of documents, reports, project data, lessons learned, scientific research, medical analysis, geospatial data, IT logs, and all kinds of other data stored nation-wide.
The data is growing in terms of variety, velocity, volume, value and veracity. NASA needs to provide accessibility to engineering data sources, whose visibility is currently limited. To convert data to knowledge a convergence of Knowledge Management, Information Architecture and Data Science is necessary.
This is what David Meza, Acting Branch Chief - People Analytics, Sr. Data Scientist at NASA, calls "Knowledge Architecture": the people, processes, and technology of designing, implementing, and applying the intellectual infrastructure of organizations.
A talk by Aleksa Gordic | Software - Deep Learning engineer, Microsoft | The AI Epiphany
What can you learn about Graph Machine Learning in 2 months?
Aleksa Gordic, Machine Learning engineer @ Microsoft and Founder @ The AI Epiphany, shares his journey in the world of Graph Machine Learning. Aleksa started by exploring the basics of Graph Machine Learning, and ended up implementing and open-sourcing his own Graph Attention Network in PyTorch.
In this talk, Aleksa will share the fundamentals of Graph Machine Learning, provide real-world examples, resources, and everything his younger self would be grateful for. Aleksa will also be available to answer questions.
What is Graph Machine Learning? Simply put, Graph Machine Learning is a branch of machine learning that deals with graph data.
Graphs consist of nodes, that may have feature vectors associated with them, and edges, which again may or may not have feature vectors attached. The applications are endless. Massive-scale recommender systems, particle physics, computational pharmacology / chemistry / biology, traffic prediction, fake news detection, and the list goes on and on.
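For a concrete sense of what "nodes with feature vectors" means in practice, here is a minimal NumPy sketch of a single attention step in the style of a Graph Attention Network. Shapes and weights are arbitrary random values, and real implementations (including the PyTorch one mentioned above) add multi-head attention, dropout, learned parameters and much more.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny graph: 4 nodes with 3-dimensional feature vectors,
# edges given as an adjacency matrix (self-loops included).
X = rng.normal(size=(4, 3))
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]])

W = rng.normal(size=(3, 2))   # shared linear projection
a = rng.normal(size=(4,))     # attention vector over concatenated node pairs

H = X @ W                     # project features: shape (4, 2)

# Raw attention logits e_ij = LeakyReLU(a^T [h_i || h_j]) for every pair.
pairs = np.concatenate(
    [np.repeat(H, 4, axis=0), np.tile(H, (4, 1))], axis=1
).reshape(4, 4, 4)            # pairs[i, j] = [h_i || h_j]
logits = pairs @ a
E = np.where(logits > 0, logits, 0.2 * logits)  # LeakyReLU

# Mask non-edges, then softmax over each node's neighbourhood.
E = np.where(A > 0, E, -np.inf)
alpha = np.exp(E - E.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)

H_next = alpha @ H            # attention-weighted aggregation
print(H_next.shape)           # (4, 2): updated node embeddings
```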
In recent years graphs have been increasingly adopted in financial services for everything from fraud detection to Know Your Customer (KYC) to regulatory requirements. At the same time Environmental Social Governance (ESG) investing has become the fastest growing segment of financial services. In this session James discusses how many of these historical graph techniques are now being enhanced for the era of sustainable investing. Going beyond definitions, let's identify use cases, discuss news and trends, and wrap up with an ask me anything session.
What is graph all about, and why should you care? Graphs come in many shapes and forms, and can be used for different applications: Graph Analytics, Graph AI, Knowledge Graphs, and Graph Databases.
Talk by George Anadiotis. Connected Data London Meetup June 29th 2020.
Up until the beginning of the 2010s, the world was mostly running on spreadsheets and relational databases. To a large extent, it still does. But the NoSQL wave of databases has largely succeeded in instilling the “best tool for the job” mindset.
After relational, key-value, document, and columnar, the latest link in this evolutionary proliferation of data structures is graph. Graph analytics, Graph AI, Knowledge Graphs and Graph Databases have been making waves, included in hype cycles for the last couple of years.
The Year of the Graph marked the beginning of it all before the Gartners of the world got in the game. The Year of the Graph is a term coined to convey the fact that the time has come for this technology to flourish.
The eponymous article that set the tone was published in January 2018 on ZDNet by domain expert George Anadiotis. George has been working with, and keeping an eye on, all things Graph since the early 2000s. He was one of the first to note the continuing rise of Graph Databases, and to bring this technology in front of a mainstream audience.
The Year of the Graph has been going strong since 2018. In August 2018, Gartner started including Graph in its hype cycles. Ever since, Graph has been riding the upward slope of the Hype Cycle.
The need for knowledge on these technologies is constantly growing. To respond to that need, the Year of the Graph newsletter was released in April 2018. In addition, a constant flow of graph-related news and resources is being shared on social media.
To help people make educated choices, the Year of the Graph Database Report was released. The report has been hailed as the most comprehensive of its kind in the market, consistently helping people choose the most appropriate solution for their use case since 2018.
The report, articles, news stream, and the newsletter have been reaching thousands of people, helping them understand and navigate this landscape. We’ll talk about the Year of the Graph, the different shapes, forms, and applications for graphs, the latest news and trends, and wrap up with an ask me anything session.
From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2 (Connected Data World)
Parts 1 & 2
Do you have experience in data modeling, or using taxonomies to classify things, and want to upgrade to modeling knowledge graphs? This hands-on workshop with one of the leading knowledge graph practitioners will help you get started.
Part 3
For as long as people have been thinking about thinking, we have imagined that somewhere in the inner reaches of our minds there are ghostly, intangible things called ideas which can be linked together to create representations of the world around us — a world that has a certain structure, conforms to certain rules, and to a certain extent, can be predicted and manipulated on the basis of our ideas.
Rationalist philosophers have struggled for centuries to make a solid case for this intuitive, almost inborn view of human experience, but it is only with the advent of modern computing that we have the opportunity to build machines which truly think the way we think we think.
For the first time, we can give concrete form to our mental representations as graphs or hypergraphs, explicitly specify our mental schemas as ontologies, and formally define the rules by which we reason and act on new information. If we so choose, we can even use these human-like building blocks to construct systems that carry far more information than any single human brain, and that connect and serve millions of people in real time.
As enterprise knowledge graphs become increasingly mainstream, we appear to be headed in that direction, although there is no guarantee that the momentum will continue unless actively sustained. Where knowledge graphs are likely to be the most essential, in the long run, is at the interface between human and machine; mental representation versus formal knowledge representation.
In this talk, we will take a step back from the many practical and social challenges of building large-scale knowledge graphs, which at this point are well-known. Instead, we will take up the quest for an ideal data model for knowledge representation and data integration, seeking common ground among the most popular data models used in industry and open source software, surveying what we suspect to be true of our own inner models, and previewing structure and process in Apache TinkerPop, version 4. We will also take a tentative step forward into the world of augmented perception via graph stream processing.
Graph in Apache Cassandra. The World’s Most Scalable Graph Database (Connected Data World)
Graph databases are everywhere right now. The explosive growth in the graph market coupled with the hype of solving graph problems is causing both excitement and confusion. From labeled property graphs to RDF to pure graph analytics to multi-model databases, the breadth of graph offerings is staggering.
The good news? DataStax has been listening—and building.
In this session, we’ll show you how DataStax Graph is architected into Apache Cassandra to deliver the world’s most scalable graph database. You’ll learn how to integrate Cassandra data into mixed workloads, design scalable property graphs, and even turn your existing tables into graphs.
With your high throughput time series data distributed next to its relationships, what will you build next?
Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d... (Connected Data World)
As one of the largest financial institutions worldwide, JP Morgan is reliant on data to drive its day-to-day operations, against an ever-evolving regulatory regime. Our global data landscape poses particular challenges for effectively maintaining data governance and metadata management.
The Data strategy at JP Morgan intends to:
a) generate business value
b) adhere to regulatory & compliance requirements
c) reduce barriers to access
d) democratize access to data
In this talk, we show how JP Morgan leverages semantic technologies to drive the implementation of our data strategy. We demonstrate how we exploit knowledge graph capabilities to answer:
1) What Data do I need?
2) What Data do we have?
3) Where does my Data come from?
4) Where should my Data come from?
5) What Data should be shared most?
Graph applications were once considered “exotic” and expensive. Until recently, few software engineers had much experience putting graphs to work. However, the use cases are now becoming more commonplace.
This talk explores a practical use case, one which addresses key issues of data governance and reproducible research, and depends on sophisticated use of graph technology.
Consider: some academic disciplines such as astronomy enjoy a wealth of data — mostly open data. Popular machine learning algorithms, open source Python libraries, and distributed systems all owe much to those disciplines and their history of big data.
Other disciplines require strong guarantees for privacy and security. Datasets used in social science research involve confidential details about human subjects: medical histories, wages, home addresses for family members, police records, etc.
Those cannot be shared openly, which impedes researchers from learning about related work by others. Reproducibility of research and the pace of science in general are limited. Nonetheless, social science research is vital for civil governance, especially for evidence-based policymaking (US federal law since 2018).
Even when data may be too sensitive to share openly, often the metadata can be shared. Constructing knowledge graphs of metadata about datasets, along with metadata about authors, their published research, methods used, data providers, data stewards, and so on, provides effective means to tackle hard problems in data governance.
Knowledge graph work supports use cases such as entity linking, discovery and recommendations, axioms to infer about compliance, etc. This talk reviews the Rich Context AI competition and the related ADRF framework used now by more than 15 federal agencies in the US.
We’ll explore knowledge graph use cases, use of open standards and open source, and how this enhances reproducible research. Social science research for the public sector has much in common with data use in industry.
Issues of privacy, security, and compliance overlap, pointing toward what will be required of banks, media channels, etc., and what technologies apply. We’ll look at comparable work emerging in other parts of industry: open source projects, open standards emerging, and in particular a new set of features in Project Jupyter that support knowledge graphs about data governance.
Powering Question-Driven Problem Solving to Improve the Chances of Finding Ne... (Connected Data World)
Making true “molecule”-“mechanism”-“observation” relationship connections is a time-consuming, iterative and laborious process. In addition, it is very easy to miss critical information that affects key decisions or helps make plausible scientific connections.
The current practice for deciphering such relationships frequently involves subject matter experts (SMEs) requesting resource from resource-constrained data science departments to refine and redo highly similar ad hoc searches. The result of this is impairment of both the pace and quality of scientific reviews.
In this presentation, I show how semantic integration can be made to ultimately become part of an integrated learning framework for more informed scientific decision making. I will take the audience through our pilot journey and highlight practical learnings that should inform subsequent endeavours.
Semantic similarity for faster Knowledge Graph delivery at scale (Connected Data World)
Knowledge graphs promise a novel platform for better holistic decision making and analytics. Many projects fail to reach their full potential because of the prohibitively high cost of integrating new knowledge from the required information sources.
The talk explains the concept of semantic similarity as a tool for efficient entity clustering and matching based on graph and text embeddings. It will demonstrate the underlying scalable and easy to understand algorithm of Random Indexing.
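Random Indexing itself is simple enough to sketch in a few lines: every entity gets a fixed sparse random "index vector", and an entity's "context vector" accumulates the index vectors of whatever co-occurs with it, so entities appearing in similar contexts end up with similar vectors. The following is a minimal Python illustration of those standard definitions, not Ontotext's implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
DIM, NNZ = 256, 8   # vector dimensionality and number of non-zero entries

def index_vector():
    """Sparse ternary random vector: a few +1/-1 entries, zeros elsewhere."""
    v = np.zeros(DIM)
    pos = rng.choice(DIM, size=NNZ, replace=False)
    v[pos] = rng.choice([-1, 1], size=NNZ)
    return v

# Toy corpus of co-occurrence contexts (e.g. neighbouring entities or words).
contexts = [
    ["aspirin", "pain", "drug"],
    ["ibuprofen", "pain", "drug"],
    ["neo4j", "graph", "database"],
]

index = {}     # entity -> fixed random index vector
context = {}   # entity -> accumulated context vector

for ctx in contexts:
    for term in ctx:
        index.setdefault(term, index_vector())
        context.setdefault(term, np.zeros(DIM))
    for term in ctx:
        for other in ctx:
            if other != term:
                context[term] += index[other]

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Entities sharing contexts end up with similar vectors.
print(cosine(context["aspirin"], context["ibuprofen"]))  # relatively high
print(cosine(context["aspirin"], context["neo4j"]))      # near zero
```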
This work is part of the Ontotext Platform, which increases productivity in developing and maintaining large scale knowledge graphs. The platform enables enterprises to develop and operate on top of such mission-critical systems for decision support, information discovery and metadata management.
Knowledge Graphs and AI to Hyper-Personalise the Fashion Retail Experience at... (Connected Data World)
What is the key to the holistic success of the fastest growing and most successful companies of our time globally? Well, often the key is the rapid increase in collected and analysed data. Graph databases provide a way to organise data semantically by classes, not tables; they are web-aware, and they are superior to traditional relational or NoSQL data stores for handling deep, complex relationships.
It is these deep, complex relationships that can provide the rich context for hyper-personalising your product offering, inspiring consumers to purchase. In this talk, we describe how we are using artificial intelligence at Farfetch to not only help build a knowledge graph but also to evolve our insights with state-of-the-art graph-based AI.
A world of structured data promises us an incredible future. But most websites struggle to even implement basic schema.org markup. Fewer still represent and connect their pages and content in sophisticated, structured graphs. We can’t reach that incredible future without increasing and improving adoption.
To move forward, we need to make constructing rich structured data as easy as writing a recipe. This isn’t a pipe dream: at Yoast, we think we’ve solved schema for everybody, everywhere. We’d love to share our story.
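"As easy as writing a recipe" can be taken almost literally: a basic schema.org annotation is a small JSON-LD object embedded in the page. The sketch below builds one in Python; the property names come from the public schema.org vocabulary, while the recipe itself is invented.

```python
import json

# A minimal schema.org Recipe annotation; properties come from the public
# schema.org vocabulary, the recipe content is invented for illustration.
recipe = {
    "@context": "https://schema.org",
    "@type": "Recipe",
    "name": "Five-minute tomato soup",
    "author": {"@type": "Person", "name": "A. Cook"},
    "recipeIngredient": ["400g chopped tomatoes", "1 clove garlic", "salt"],
    "recipeInstructions": [
        {"@type": "HowToStep", "text": "Simmer tomatoes and garlic for five minutes."},
        {"@type": "HowToStep", "text": "Season and blend until smooth."},
    ],
    "totalTime": "PT5M",  # ISO 8601 duration
}

# Embed in a page as: <script type="application/ld+json"> ... </script>
print(json.dumps(recipe, indent=2))
```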
The relationships between data sets matter. Discovering, analyzing, and learning those relationships is a central part of expanding our understanding, and is a critical step to being able to predict and act upon the data. Unfortunately, these are not always simple or quick tasks.
To help the analyst we introduce RAPIDS, a collection of open-source libraries, incubated by NVIDIA and focused on accelerating the complete end-to-end data science ecosystem. Graph analytics is a critical piece of the data science ecosystem for processing linked data, and RAPIDS is pleased to offer cuGraph as our accelerated graph library.
Simply accelerating algorithms only addressed a portion of the problem. To address the full problem space, RAPIDS cuGraph strives to be feature-rich, easy to use, and intuitive. Rather than limiting the solution to a single graph technology, cuGraph supports Property Graphs, Knowledge Graphs, Hyper-Graphs, Bipartite graphs, and the basic directed and undirected graph.
A Python API allows the data to be manipulated as a DataFrame, similar and compatible with Pandas, with inputs and outputs being shared across the full RAPIDS suite, for example with the RAPIDS machine learning package, cuML.
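A typical cuGraph workflow stays in the DataFrame idiom throughout, along the lines of the sketch below (which assumes a CUDA-capable GPU with the RAPIDS libraries installed; exact API details can vary between RAPIDS releases).

```python
import cudf
import cugraph

# Build a graph from an edge list held in a GPU DataFrame.
edges = cudf.DataFrame({
    "src": [0, 1, 2, 2, 3],
    "dst": [1, 2, 0, 3, 0],
})
G = cugraph.Graph(directed=True)
G.from_cudf_edgelist(edges, source="src", destination="dst")

# Run an accelerated algorithm; the result comes back as a DataFrame,
# ready to hand to other RAPIDS libraries such as cuML.
ranks = cugraph.pagerank(G)
print(ranks.sort_values("pagerank", ascending=False).head())
```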
This talk will present an overview of RAPIDS and cuGraph, discuss and show examples of how to manipulate and analyze bipartite and property graphs, and show how data can be shared with machine learning algorithms. The talk will include some performance and scalability metrics, and conclude with a preview of upcoming features, like graph query language support, and the general RAPIDS roadmap.
Elegant and Scalable Code Querying with Code Property Graphs (Connected Data World)
Programming is an unforgiving art form in which even minor flaws can cause rockets to explode, data to be stolen, and systems to be compromised. Today, a system tasked to automatically identify these flaws not only faces the intrinsic difficulties and theoretical limits of the task itself, it must also account for the many different forms in which programs can be formulated and account for the awe-inspiring speed at which developers push new code into CI/CD pipelines. So much code, so little time.
The code property graph – a multi-layered graph representation of code that captures properties of code across different abstractions (application code, libraries and frameworks) – has been developed over the last six years to provide a foundation for the challenging problem of identifying flaws in program code at scale, whether it is high-level dynamically-typed Javascript, statically-typed Scala in its bytecode form, the syntax trees generated by the Roslyn C# compiler, or the bitcode that flows through LLVM.
Based on this graph, we define a common query language, based on a formal code property graph specification, to elegantly analyze code regardless of the source language. Paired with the formulation of a state-of-the-art data flow tracker based on code property graphs, we arrive at a powerful, distributed, cloud-native code analysis platform. This talk provides an introduction to the technology.
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle... (Connected Data World)
Do you want to learn how to use the low-hanging fruit of knowledge graphs — schema.org and JSON-LD — to annotate content and improve your SEO with semantics and entities? This hands-on workshop with one of the leading Semantic SEO practitioners will help you get started.
Can graph technology improve the deployment of humanitarian projects? The goal of what we call “Graphs for good at Action Against Hunger” is to be more efficient and transparent, and this can have a crucial impact on people’s lives.
Are there common behavioural factors between different projects? Can elements of different resources or projects be related? For example, security incidents in a city could influence the way other projects run there.
The explained use case data comes from a project called Kit For Autonomous Cash Transfer in Humanitarian Emergencies (KACHE) whose goal is to deploy electronic cash transfers in emergency situations when no suitable infrastructure is available.
It also offers the opportunity to track transactions in order to better recognize crisis-affected population behaviours, understand the goods distribution network to improve recommendations, identify the role of culture in transactional patterns, and determine the most required items for every place.
Transcript: Selling digital books in 2024: Insights from industry leaders - T... (BookNet Canada)
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Neuro-symbolic is not enough, we need neuro-*semantic* (Frank van Harmelen)
Neuro-symbolic (NeSy) AI is on the rise. However, simply doing machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy; these will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
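Link prediction over a knowledge graph is often illustrated with translation-style embedding scores; the sketch below uses the classic TransE scoring function, f(h, r, t) = -||h + r - t||, as a stand-in example. This is a standard textbook model chosen for illustration, not the method from the talk.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 16

# Toy embeddings for entities and relations (randomly initialised here;
# a real model would learn them from known triples).
entities = {name: rng.normal(size=DIM)
            for name in ["paris", "france", "berlin", "germany"]}
relations = {"capital_of": rng.normal(size=DIM)}

def transe_score(head, relation, tail):
    """TransE plausibility score: higher (less negative) means more plausible."""
    return -np.linalg.norm(entities[head] + relations[relation] - entities[tail])

# Rank candidate tails for the query (paris, capital_of, ?).
candidates = ["france", "germany", "berlin"]
ranked = sorted(candidates,
                key=lambda t: transe_score("paris", "capital_of", t),
                reverse=True)
print(ranked)  # with trained embeddings, "france" should rank first
```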
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... (Ramesh Iyer)
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. However, fostering a culture of innovation takes much work: it takes vision, leadership and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Elevating Tactical DDD Patterns Through Object Calisthenics (Dorra BARTAGUIZ)
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
Connecting life sciences data at the European Bioinformatics Institute
1. 12th July, 2016
Connecting life sciences data at the European Bioinformatics Institute
Tony Burdett, Technical Co-ordinator – Samples, Phenotypes and Ontologies Team
www.ebi.ac.uk
3. What is EMBL-EBI?
• Europe's home for biological data services, research and training
• A trusted data provider for the life sciences
• Part of the European Molecular Biology Laboratory, an intergovernmental research organisation
• International: 570 members of staff from 57 nations
• Home of the ELIXIR Technical Hub
4. OUR MISSION
To provide freely available data and bioinformatics services to all facets of the scientific community in ways that promote scientific progress
5. Big data, big demand
• ~18.5 million requests to EMBL-EBI websites every day
• 60 petabytes of EMBL-EBI storage capacity
• EMBL-EBI handles 9.2 million jobs on average per month
• Scientists at over 5 million unique sites use EMBL-EBI websites
6. From molecules to medicine
[Slide graphic: Atlas, "what happens where"]
Biology is changing:
• Lower-cost sequencing
• More data produced
• New types of data
• Emphasis on systems biology
Bioinformatics enables new applications:
• molecular medicine
• agriculture
• food
• environmental sciences
7. Data resources at EMBL-EBI
• Genes, genomes & variation: European Nucleotide Archive, European Variation Archive, European Genome-phenome Archive, Ensembl, Ensembl Genomes, GWAS Catalog, Metagenomics portal, RNAcentral
• Gene, protein & metabolite expression: ArrayExpress, Expression Atlas, MetaboLights, PRIDE
• Protein sequences, families & motifs: InterPro, Pfam, UniProt
• Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
• Chemical biology: ChEMBL, SureChEMBL, ChEBI
• Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
• Systems: BioModels, Enzyme Portal, BioSamples
• Literature & ontologies: Europe PubMed Central, BioStudies, Gene Ontology, Experimental Factor Ontology
8. Database interactions
• Collaborative community facilitates social, scientific and technical interactions
• Right: internal interactions between data resources, as determined by the exchange of data
• Width of each internal arc is weighted according to the number of different data types exchanged
9. Biology 101 – Central Dogma
Image: Dhorspool at en.wikipedia [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0) or GFDL (http://www.gnu.org/copyleft/fdl.html)], via Wikimedia Commons
10. Sadly, it's not *quite* that simple…
Image: User:Dhorspool [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0) or GFDL (http://www.gnu.org/copyleft/fdl.html)], via Wikimedia Commons
16. How do we turn data into Linked Data? (Example from the Gene Expression Atlas)
Relational data to RDF graph conversion:
• Give "things" URIs
• Type "things" with ontologies
• Link "things" to other related "things"
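As a minimal illustration of these three steps, here is how one database row can become RDF using Python's rdflib. This is a sketch, not the Atlas production pipeline: the namespaces and the label are assumptions chosen to mirror the slides' ensembl: and so: notation.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

# Hypothetical prefixes; the URIs actually minted by the EBI RDF
# platform may differ.
ENSEMBL = Namespace("http://rdf.ebi.ac.uk/resource/ensembl/")
SO = Namespace("http://purl.obolibrary.org/obo/so#")

g = Graph()
g.bind("ensembl", ENSEMBL)
g.bind("so", SO)

# 1. Give "things" URIs: one resource per database entry
gene = ENSEMBL["ENSMUSG00000001467"]

# 2. Type "things" with ontologies: the class comes from the Sequence Ontology
g.add((gene, RDF.type, SO["protein_coding_gene"]))
g.add((gene, RDFS.label, Literal("example gene")))  # placeholder label

# 3. Link "things" to other related "things"
transcript = ENSEMBL["ENSMUST00000001507"]
g.add((transcript, RDF.type, SO["transcript"]))
g.add((transcript, SO["transcribed_from"], gene))

print(g.serialize(format="turtle"))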
17. Modeling data vs biology
• Typing and semantics are the main strengths of RDF, so we focused on this aspect
• There are a lot of ontologies for the life sciences
• However, most of them model biology
• What does an Ensembl entry represent? Is an Ensembl identifier really an instance of the Sequence Ontology 'gene' class?
ensembl:ENSMUSG00000001467 rdf:type so:'protein coding gene'
18. Database Entry or Real World Entity?
• Practically, it makes sense to treat database entries as proxies for the real-world entities they represent
• The alternative introduces a layer of indirection that would only make linking resources harder
• It means we can use biologically meaningful relationships
• But this may or may not work for all use cases
ensembl:ENSMUSG00000001467 rdf:type so:'protein coding gene'
ensembl:ENSMUST00000001507 rdf:type so:'transcript'
ensembl:ENSMUST00000001507 so:'transcribed from' ensembl:ENSMUSG00000001467
19. Knowledge representation challenges
• The semantics of our data is complex
• The provenance models are even more complex
• The relationships are hard to define
• Balancing use-cases with representation is a major challenge
• The harder you try to get the representation correct, the harder it is for users to query
• Performance drops off for simple queries
21. EBI RDF Platform
Successes:
• Novel queries possible over EBI datasets
• Production-quality RDF releases
• Community of users
• Highly available public SPARQL endpoints
• 500+ users (10–50 million hits per month)
• Lots of interest from industry
• Catalyst for new RDF efforts
Lessons:
• Public SPARQL endpoints problematic
• Query federation not performant
• Inference support limited
• Not scalable for all EBI data, e.g. Variation, ENA
• Lack of expertise in service teams
• Too much overhead to get started quickly in this space
22. Ontologies for life sciences
[Slide figure: a landscape of biomedical ontologies arranged by domain: genotype, phenotype, sequence, proteins, gene products, transcripts, pathways, cell type, development and anatomy. Named examples include sequence types and features and genetic context; molecule role, molecular function, biological process and cellular component; protein covalent bond, protein domain and protein-protein interaction; UniProt taxonomy; Pathway Ontology, Event (INOH pathway ontology) and Systems Biology; BRENDA tissue/enzyme source; Plasmodium life cycle; developmental and gross anatomy ontologies for human, mouse, mosquito, C. elegans, Drosophila (FBdv), zebrafish, medaka fish, Dictyostelium discoideum, fungi (FAO) and plants (Arabidopsis, cereal plants, maize, plant structure, plant growth and developmental stage); phenotype and disease resources (PATO attribute and value, Mammalian phenotype, Human phenotype, Mouse pathology, Human disease, NCI Thesaurus, cereal plant trait); behaviour and life-history ontologies (Habronattus courtship, Loggerhead nesting, animal natural history and life history); and eVOC (Expressed Sequence Annotation for Humans).]
23. Ontologies as Graphs
• OWL ontologies aren't graphs, but…
  … they can be represented as an RDF graph
  … people want to use them as graphs
• Plenty of RDF databases around
• But incomplete w.r.t. OWL semantics
• SPARQL is an acquired taste
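To see why the RDF rendering of OWL is awkward to use as a plain graph, consider how a single existential axiom serializes. A minimal rdflib sketch follows (the anatomy term URIs are made up for illustration); the anonymous restriction node in the output is exactly what naive graph traversals trip over:

from rdflib import BNode, Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

EX = Namespace("http://example.org/anatomy/")  # hypothetical term URIs

g = Graph()
restriction = BNode()
# One OWL axiom, heart SubClassOf (part_of some cardiovascular_system),
# becomes four RDF triples anchored on an anonymous restriction node:
g.add((EX.heart, RDFS.subClassOf, restriction))
g.add((restriction, RDF.type, OWL.Restriction))
g.add((restriction, OWL.onProperty, EX.part_of))
g.add((restriction, OWL.someValuesFrom, EX.cardiovascular_system))

print(g.serialize(format="turtle"))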
24. Ontology repository use-cases
• Search for ontology terms: labels, synonyms, descriptions
• Querying the structure: get parent/child terms
• Querying transitive closure: get ancestor/descendant terms
• Querying across relations: partonomy or development stages
• We can satisfy these requirements with Neo4j
25. OWL to Neo4j schema
• Label every node by type (e.g. class, property or individual) and ontology id
• Label every relation by name
• Include an additional index for "special relations" like partonomy and subsets
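A sketch of what this scheme looks like when loading terms, using the official Neo4j Python driver. The connection details, the 'uberon' ontology id and the UBERON IRIs are placeholder assumptions; this is not the actual OLS loader:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

def load_example(tx):
    # Nodes are labelled by type (Class) and carry their ontology id;
    # relations are labelled by name, with "special relations" such as
    # partonomy loaded under a dedicated relationship type.
    tx.run(
        """
        MERGE (heart:Class {iri: $heart, ontology_name: $ontology})
        MERGE (cvs:Class {iri: $cvs, ontology_name: $ontology})
        MERGE (heart)-[:RelatedTree {label: 'part of'}]->(cvs)
        """,
        heart="http://purl.obolibrary.org/obo/UBERON_0000948",
        cvs="http://purl.obolibrary.org/obo/UBERON_0004535",
        ontology="uberon",
    )

with driver.session() as session:
    # Index the property that the common queries filter on
    session.run("CREATE INDEX class_iri IF NOT EXISTS "
                "FOR (c:Class) ON (c.iri)")
    session.execute_write(load_example)
driver.close()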
26. Powerful yet simple queries
• Get the transitive closure for "heart", following parent and partonomy relations, from the UBERON anatomy ontology:

MATCH path = (n:Class)-[r:SUBCLASSOF|RelatedTree*]->(parent)
             <-[r2:SUBCLASSOF|RelatedTree]-(sibling:Class)
WHERE n.ontology_name = {0}
  AND n.iri = {1}
RETURN path
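Note that the {0}/{1} placeholders are the positional parameter syntax of the Neo4j versions current at the time; recent servers use named $parameters instead. Below is a sketch of running the ancestor-closure part of this query from Python; the connection details are assumptions, and UBERON:0000948 is the "heart" class:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

query = """
MATCH path = (n:Class)-[:SUBCLASSOF|RelatedTree*]->(ancestor)
WHERE n.ontology_name = $ontology AND n.iri = $iri
RETURN path
"""

with driver.session() as session:
    result = session.run(query, ontology="uberon",
                         iri="http://purl.obolibrary.org/obo/UBERON_0000948")
    for record in result:
        print(record["path"])
driver.close()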
27. Final thoughts – Neo4j and JSON-LD?
• A lot of frameworks now make it trivial to produce good APIs
• What's currently missing is how to integrate data from two or more independent APIs
• It is hard to crawl independent datasets for connections without a human to interpret the semantics
• There is still a need to express a schema alongside the data
• W3C standards like RDF/RDFS/SKOS/OWL provide the basic vocabularies and semantics for expressing data schemas
• JSON-LD is bridging the gap from JSON to RDF
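A minimal sketch of that bridge: an ordinary-looking JSON payload plus an @context becomes parseable RDF. The identifiers are assumptions chosen to echo the earlier examples (SO_0001217 is the Sequence Ontology 'protein coding gene' class); rdflib 6+ ships a JSON-LD parser:

from rdflib import Graph

doc = """
{
  "@context": {
    "label": "http://www.w3.org/2000/01/rdf-schema#label",
    "type": "@type"
  },
  "@id": "http://rdf.ebi.ac.uk/resource/ensembl/ENSMUSG00000001467",
  "type": "http://purl.obolibrary.org/obo/SO_0001217",
  "label": "example gene"
}
"""

g = Graph()
g.parse(data=doc, format="json-ld")
for triple in g:
    print(triple)  # the same triples an RDF API would expose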
28. Acknowledgements
• Samples, Phenotypes and Ontologies team: Simon Jupp, Olga Vrousgou, Thomas Liener, Dani Welter, Catherine Leroy, Sira Sarntivijai, Ilinca Tudose, Helen Parkinson
• Funding: European Molecular Biology Laboratory (EMBL); European Union projects DIACHRON, BioMedBridges, CORBEL and Excelerate