Information is often spread across
several data sources, such as hospital databases, lab databases,
spreadsheets, etc. Moreover, the complexity of each of these data sources
can make it difficult for end users to access them, and even
more so to query all of them at the same time.
A solution that has been proposed to this problem is
ontology-based data access (OBDA).
OBDA is a popular paradigm, developed since the mid-2000s, for querying
various types of data sources
through a common vocabulary familiar to end users. In a nutshell,
OBDA separates the user
from the data sources (relational databases, CSV files, etc.) by means
of an ontology: a common terminology that provides the user with a
convenient query vocabulary, hides the structure of the data sources,
and can enrich incomplete data with background knowledge. About a
dozen OBDA systems have been implemented in both academia and
industry.
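To make the idea concrete, here is a minimal Python sketch of how a relational row can be exposed as virtual facts over an ontology vocabulary. The patient table, URI template, and ex: terms are invented for illustration; real OBDA systems express such mappings declaratively (e.g., in R2RML) rather than in code.

```python
# Toy illustration of the OBDA mapping idea: rows in a relational
# table are exposed as facts over the ontology vocabulary.
# The table, columns, and vocabulary below are invented.

rows = [
    {"pid": 1, "name": "Alice", "diagnosis": "C34"},
    {"pid": 2, "name": "Bob", "diagnosis": "E11"},
]

def map_row(row):
    """Apply a URI template plus class/property assignments to one row."""
    subject = f"http://example.org/patient/{row['pid']}"
    yield (subject, "rdf:type", "ex:Patient")
    yield (subject, "ex:name", row["name"])
    yield (subject, "ex:diagnosisCode", row["diagnosis"])

# "Virtual" triples: generated on demand, never materialized.
for row in rows:
    for triple in map_row(row):
        print(triple)
```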
In this tutorial we will give an overview of OBDA and of our system, -ontop-,
which is currently being used in the context of the European project
Optique. We will discuss how to use -ontop- for data integration,
concentrating in particular on:
– How to create an ontology (common vocabulary) for a life science domain.
– How to map available data sources to this ontology.
– How to query the database using the terms in the ontology (a query sketch follows this list).
– How to check consistency of the data sources w.r.t. the ontology.
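As a sketch of the querying step, assuming a running -ontop- SPARQL endpoint: the endpoint URL and the ex: vocabulary below are placeholders, and the example uses the SPARQLWrapper Python package (pip install SPARQLWrapper).

```python
# Query a running ontop SPARQL endpoint using the ontology vocabulary.
# Endpoint URL and the ex: terms are assumptions; adjust to your own
# deployment and ontology.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:8080/sparql")  # assumed endpoint
sparql.setQuery("""
    PREFIX ex: <http://example.org/ontology#>
    SELECT ?patient ?name WHERE {
        ?patient a ex:Patient ;
                 ex:name ?name .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)

for binding in sparql.query().convert()["results"]["bindings"]:
    print(binding["patient"]["value"], binding["name"]["value"])
```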
A tutorial on how to create mappings using ontop, how inference (OWL 2 QL and RDFS) plays a role in answering SPARQL queries in ontop, and how ontop's support for on-the-fly SQL query translation enables scenarios of semantic data access and data integration.
Ontology-based data access: why it is so cool! (Josef Hardi)
A brief introduction to ontology-based data access (OBDA for short) and its core implementation. I also present a recent simple benchmark comparing -ontop- and Semantika---the two most readily available OBDA frameworks---in terms of query performance (with details in the appendix section). The slides were presented at the Friday Research Meeting of the Stanford Center for Biomedical Informatics Research (BMIR).
License: Creative Commons Attribution 3.0
Ontop: Answering SPARQL Queries over Relational Databases (Guohui Xiao)
We present Ontop, an open-source Ontology-Based Data Access (OBDA) system that allows for querying relational data sources through a conceptual representation of the domain of interest, provided in terms of an ontology, to which the data sources are mapped. Key features of Ontop are its solid theoretical foundations; a virtual approach to OBDA that avoids materializing triples and is implemented through the query rewriting technique; extensive optimizations exploiting all elements of the OBDA architecture; its compliance with all relevant W3C recommendations (including SPARQL queries, R2RML mappings, and OWL 2 QL and RDFS ontologies); and its support for all major relational databases.
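A toy sketch of the virtual, rewriting-based approach described above: a request for all instances of an ontology class is first expanded with its subclasses (query rewriting) and then unfolded into a SQL union via the mappings. The classes, tables, and mapping entries are invented; Ontop itself handles full SPARQL and R2RML with many further optimizations.

```python
# Each ontology class maps to one or more SQL queries producing its members.
mappings = {
    "ex:Patient": ["SELECT pid AS id FROM patients"],
    "ex:Inpatient": ["SELECT pid AS id FROM admissions"],
}
# An OWL 2 QL-style subclass axiom: every Inpatient is a Patient.
subclasses = {"ex:Patient": ["ex:Inpatient"]}

def rewrite(cls):
    """Rewrite 'all instances of cls' into one SQL UNION: expand
    subclasses first (rewriting), then apply the mappings (unfolding)."""
    classes = [cls] + subclasses.get(cls, [])
    parts = [q for c in classes for q in mappings.get(c, [])]
    return "\nUNION\n".join(parts)

print(rewrite("ex:Patient"))  # union over both mapped tables
```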
This document discusses bringing OpenClinica clinical trial data into SAS. It describes developing a Java utility to convert OpenClinica export files into an XML format that can be read by SAS. The utility standardizes names, adds metadata like labels and formats, and structures the data into a tall, thin format suitable for SAS. It allows OpenClinica data to be easily imported into SAS for analysis while preserving metadata.
The document describes a project on named entity extraction from online news articles using two machine learning models: 1) a Maximum Entropy Markov Model and 2) a Deep Neural Network with LSTM. It provides an overview of named entity extraction and the challenges of the given problem/dataset. It then describes the two models in detail, including feature engineering for the MaxEnt model and architecture of the DNN model. Results show both models achieved similar accuracy of around 93.5-93.8%. The document concludes with limitations and comparisons of the two approaches.
- The document discusses OpenFlyData, a project that integrates biological data from multiple sources using Semantic Web technologies like RDF and SPARQL. It describes applications that allow searching gene expression data across databases.
- Key challenges addressed are that biological data is scattered across sites and integration requires mapping heterogeneous identifiers. The architecture uses a SPARQL endpoint and mappings to expose data from sources like FlyBase, BDGP and FlyAtlas.
- Performance testing showed good query times for real-time user interaction, though some queries took seconds and text matching had issues without custom solutions. Future work aims to add sources and develop more applications.
Improving Top-K Retrieval Algorithms Using Dynamic Programming and Longer Ski... (Sease)
Modern search engines have to keep up with the enormous growth in the number of documents and queries submitted by users. One of the problems to deal with is finding the best k relevant documents for a given query. This operation has to be fast, which is possible only by using specialised technologies.
Block Max WAND is one of the best-known algorithms for solving this problem without any degradation in ranking effectiveness.
After a brief introduction, in this talk I’m going to show a strategy introduced in “Faster BlockMax WAND with Variable-sized Blocks” (SIGIR 2017) that, applied to BlockMaxWand, has made it possible to speed up the algorithm’s execution by almost 2x.
Then another optimisation of the BlockMaxWand algorithm (“Faster BlockMax WAND with Longer Skipping”, ECIR 2019) will be presented, which reduces the execution time of short queries.
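For context, a minimal Python sketch of the baseline top-k operation that WAND-family algorithms accelerate; the scores are invented, and this naive version scores every document, whereas BlockMax WAND skips whole blocks whose precomputed maximum score cannot beat the current threshold.

```python
import heapq

def top_k(scores, k):
    """scores: iterable of (doc_id, score); returns the k best by score."""
    heap = []  # min-heap of (score, doc_id), kept at size <= k
    for doc_id, score in scores:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:  # beats the current entry threshold
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)

print(top_k([(1, 0.3), (2, 0.9), (3, 0.1), (4, 0.7)], k=2))
```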
This document summarizes the "KDD Cup - 1999" data sets used for computer network intrusion detection. The data sets contain over 5 million connection records with 41 features each, representing different types of attacks including denial of service, user to root, remote to local, and probing attacks. The document discusses the data redundancy, imbalance, and partitioning of the data sets. It also provides sample code for analyzing the data sets in Python, Java, and R and concludes that while the data sets were useful for early intrusion detection research, they have limitations such as redundancy and imbalance that prompted the introduction of new data sets like NSL-KDD.
The document introduces several free and inexpensive software alternatives for analyzing X-ray diffraction data, as commercial software packages can be very expensive. It recommends the Collaborative Computational Project #14 (CCP14) website as the best source for free software information. It then describes several free data conversion tools, including the PowDLL Converter and ConvX, which can convert between different file formats. Finally, it summarizes the capabilities of the free PowderX software for general data processing tasks like smoothing, background subtraction, and pattern indexing.
In computer science, a stream is a sequence of data elements made available over time. A stream can be thought of as items on a conveyor belt being processed one at a time rather than in large batches.
Streams are processed differently from batch data:
* Normal functions cannot operate on streams as a whole, as they hold potentially unlimited data, and formally,
* Streams are codata (potentially unlimited), not data (which is finite).
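A short Python illustration of the point: a generator behaves as codata, so consumers must process it incrementally rather than as a whole.

```python
import itertools

def sensor_stream():
    """A potentially unbounded stream of readings (codata)."""
    for i in itertools.count():
        yield i * 0.1  # stand-in for a reading arriving over time

# You cannot sum the whole stream, but you can fold it incrementally:
running_total = 0.0
for reading in itertools.islice(sensor_stream(), 5):  # take only 5 items
    running_total += reading
print(running_total)
```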
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science (University of Washington)
The document summarizes a system called SQLShare that aims to make SQL-based data analysis more accessible to scientists by lowering initial setup costs and providing automated tools. It has been used by 50 unique users at 4 UW campus labs on 16GB of uploaded data from various science domains like environmental science and metagenomics. The system provides data uploading, query sharing, automatic English-to-SQL translation, and personalized query recommendations to lower barriers to working with relational databases for analysis.
Model-based Analysis of Large Scale Software Repositories (Markus Scheidgen)
1) The document discusses a model-based framework for analyzing large scale software repositories. It involves reverse engineering software from version control systems to create abstract syntax tree models, applying transformations and queries to derive metrics and insights, and using Scala for flexible queries and transformations.
2) Two example analyses are described: calculating design structure matrices and propagation costs, and detecting cross-cutting concerns by analyzing co-changed methods within commits.
3) The goal is to enable scalable, language-independent analysis of ultra-large repositories through model-based techniques instead of analyzing raw code directly. This allows abstracting different languages and repositories with common models and analyses.
Using publicly available resources to build a comprehensive knowledgebase of ... (Valery Tkachenko)
There is a variety of public resources on the Internet which contain information about various aspects of the chemical, biological and pharmaceutical domains. The quality, maturity, hosting organizations, and team sizes behind these data resources vary wildly, and as a consequence the content cannot always be trusted, and the effort of extracting information and preparing it for reuse is repeated again and again at various levels. This problem is especially serious in applications for QSAR, QSPR and QNAR modeling. On the other hand, the authors of this poster believe, based on their own extensive experience building various types of chemical, analytical and biological databases over decades, that the process of building such a knowledgebase can be systematically described and automated. This poster will outline the work performed on text- and data-mining various public resources on the Web, the data curation process, and making this information publicly available through a portal and a RESTful API. We will also demonstrate how such a knowledgebase can be used for real-time QSAR and QSPR predictions.
Reference Representation in Large Metamodel-based Datasets (Markus Scheidgen)
This document discusses different model representations for large meta-model based datasets. It compares object-by-object representation to fragmentation strategies. Fragmentation breaks models into multiple fragments stored separately. The document evaluates different fragmentation strategies through theoretical analysis and implementation tests. It also compares part-of-source and relational representations and discusses applications of model fragmentation including software engineering and scientific data analysis.
Building Search & Recommendation Engines (Trey Grainger)
In this talk, you'll learn how to build your own search and recommendation engine based on the open source Apache Lucene/Solr project. We'll dive into some of the data science behind how search engines work, covering multi-lingual text analysis, natural language processing, relevancy ranking algorithms, knowledge graphs, reflected intelligence, collaborative filtering, and other machine learning techniques used to drive relevant results for free-text queries. We'll also demonstrate how to build a recommendation engine leveraging the same platform and techniques that power search for most of the world's top companies. You'll walk away from this presentation with the toolbox you need to go and implement your very own search-based product using your own data.
The document discusses enabling live linked data by synchronizing semantic data stores with commutative replicated data types (CRDTs). CRDTs allow for massive optimistic replication while preserving convergence and intentions. The approach aims to complement the linked open data cloud by making linked data writable through a social network of data participants that follow each other's update streams. This would enable a "read/write" semantic web and transition linked data from version 1.0 to 2.0.
This document outlines the agenda for a two-day workshop on learning R and analytics. Day 1 will introduce R and cover data input, quality, and exploration. Day 2 will focus on data manipulation, visualization, regression models, and advanced topics. Sessions include lectures and demos in R. The goal is to help attendees learn R in 12 hours and gain an introduction to analytics skills for career opportunities.
Bioinformaticians constantly face challenges with data: from the large volumes of data to the need to integrate diverse data types. Relational databases have a long and successful history of managing data but have been unable to meet the emerging needs of big data and highly integrated data stores. This talk discusses the limitations we face when using relational data models for bioinformatics applications. It describes the features, limitations and use cases of four alternative database models: key-value databases, document databases, wide-column data stores and graph databases. Their use in bioinformatics applications is demonstrated with text mining and atherosclerosis research projects. The talk concludes with guidance on choosing an appropriate database model for varying bioinformatics requirements.
Building a semantic search system - one that can correctly parse and interpret end-user intent and return the ideal results for users’ queries - is not an easy task. It requires semantically parsing the terms, phrases, and structure within queries, disambiguating polysemous terms, correcting misspellings, expanding to conceptually synonymous or related concepts, and rewriting queries in a way that maps the correct interpretation of each end user’s query into the ideal representation of features and weights that will return the best results for that user. Not only that, but the above must often be done within the confines of a very specific domain - ripe with its own jargon and linguistic and conceptual nuances.
This talk will walk through the anatomy of a semantic search system and how each of the pieces described above fit together to deliver a final solution. We'll leverage several recently-released capabilities in Apache Solr (the Semantic Knowledge Graph, Solr Text Tagger, Statistical Phrase Identifier) and Lucidworks Fusion (query log mining, misspelling job, word2vec job, query pipelines, relevancy experiment backtesting) to show you an end-to-end working Semantic Search system that can automatically learn the nuances of any domain and deliver a substantially more relevant search experience.
Deep Learning on nVidia GPUs for QSAR, QSPR and QNAR predictions (Valery Tkachenko)
While we have seen tremendous growth in machine learning methods over the last two decades, there is still no one-size-fits-all solution. The next era of cheminformatics and pharmaceutical research in general is focused on mining heterogeneous big data, which is accumulating at an ever-growing pace, and this will likely use more sophisticated algorithms such as Deep Learning (DL). There has been increasing use of DL recently, which has shown powerful advantages in learning from images and languages as well as many other areas. However, the accessibility of this technique for cheminformatics is hindered as it is not readily available to non-experts. It was therefore our goal to develop a DL framework embedded into a general research data management platform (Open Science Data Repository) which can be used as an API, as a standalone tool, or integrated into new software as an autonomous module. In this poster we will present results comparing the performance of classic machine learning methods (Naïve Bayes, logistic regression, Support Vector Machines, etc.) with Deep Learning and will discuss challenges associated with Deep Neural Networks (DNN). DNN learning models of different complexity (up to 6 hidden layers) were built and tuned (different numbers of hidden units per layer, multiple activation functions, optimizers, dropout fraction, regularization parameters, and learning rate) using Keras (https://keras.io/) and TensorFlow (www.tensorflow.org) and applied to various use cases connected to the prediction of physicochemical properties, ADME, toxicity and the calculation of materials properties. It was also shown that using nVidia GPUs significantly accelerates calculations, although memory consumption puts some limits on the performance and applicability of standard toolkits 'as is'.
The document discusses OntoMaven repositories and the OMG API4KP standard. It describes two scenarios using distributed knowledge platforms: a connected patient system and semantic annotation of biodiversity data. The API4KP metamodel defines classes for knowledge sources, environments, operations and events. Distributed architecture styles for API4KP include direct access, remote invocation, and decoupled event-based systems. OntoMaven supports knowledge repositories with plug-ins for distributed scenarios.
Duke is an open source tool for deduplicating and linking records across different data sources without common identifiers. It indexes data using Lucene and performs searches to find potential matches. Duke was used in a real-world project linking data from Mondial and DBpedia, where it correctly linked 94.9% of records while avoiding wrong links. Duke is flexible, scalable, and incremental, making it suitable for ongoing use at Hafslund to integrate customer records from multiple systems and remove duplicates. Future work may include improving comparators, adding a web service interface, and exploring parallelism.
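A toy Python version of the record-linkage idea behind such tools: normalize fields, score candidate pairs with a string similarity measure, and link pairs above a threshold. The records, field weights, and 0.8 threshold are invented; Duke itself uses Lucene indexing and configurable comparators.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Normalized string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

source_a = [{"id": "a1", "name": "Acme Corp.", "city": "Oslo"}]
source_b = [{"id": "b7", "name": "ACME Corporation", "city": "Oslo"}]

for ra in source_a:
    for rb in source_b:
        score = (0.6 * similarity(ra["name"], rb["name"])
                 + 0.4 * similarity(ra["city"], rb["city"]))
        if score >= 0.8:  # invented linkage threshold
            print(f"link {ra['id']} <-> {rb['id']} (score {score:.2f})")
```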
The document discusses different file organization methods for storing data on external storage devices. It describes sequential, direct access, and indexed sequential file organization. Sequential files store records in the order they are entered, requiring searching through all preceding records to access a non-sequential record. Direct access files allow direct retrieval of any record through its logical address. Indexed sequential files store records sequentially but have an index file to allow direct access by key. The document compares advantages and disadvantages of each method.
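A small Python sketch of the indexed sequential idea from the paragraph above: records sit in key order, and a sparse index lets a lookup jump close to the target before a short sequential scan; the record layout is invented.

```python
import bisect

records = [(key, f"record-{key}") for key in range(0, 1000, 3)]  # key order
index_keys = [records[i][0] for i in range(0, len(records), 50)]  # sparse index
index_pos = list(range(0, len(records), 50))

def lookup(key):
    i = bisect.bisect_right(index_keys, key) - 1  # nearest indexed block
    pos = index_pos[max(i, 0)]
    while pos < len(records) and records[pos][0] <= key:  # short scan
        if records[pos][0] == key:
            return records[pos][1]
        pos += 1
    return None

print(lookup(300))  # found: keys are multiples of 3
print(lookup(301))  # None: no such key
```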
1. The document discusses implementing archetype-based systems and some challenges, such as a lack of standardization for templates and queries.
2. It describes some open-source tools for working with archetypes, such as the LinkEHR Archetype Editor, and experiences using archetypes at hospitals in Spain.
3. The challenges of mapping between different standards like openEHR, MML, and HL7 CDA are discussed, as well as representing concepts from one standard as archetypes.
This document summarizes the "KDD Cup - 1999" data sets used for computer network intrusion detection. The data sets contain over 5 million connection records with 41 features each, representing different types of attacks including denial of service, user to root, remote to local, and probing attacks. The document discusses the data redundancy, imbalance, and partitioning of the data sets. It also provides sample code for analyzing the data sets in Python, Java, and R and concludes that while the data sets were useful for early intrusion detection research, they have limitations such as redundancy and imbalance that prompted the introduction of new data sets like NSL-KDD.
The document introduces several free and inexpensive software alternatives for analyzing X-ray diffraction data, as commercial software packages can be very expensive. It recommends the Collaborative Computational Project #14 (CCP14) website as the best source for free software information. It then describes several free data conversion tools, including the PowDLL Converter and ConvX, which can convert between different file formats. Finally, it summarizes the capabilities of the free PowderX software for general data processing tasks like smoothing, background subtraction, and pattern indexing.
In computer science, a stream is a sequence of data elements made available over time. A stream can be thought of as items on a conveyor belt being processed one at a time rather than in large batches.
Streams are processed differently from batch data :
*Normal functions cannot operate on streams as a whole, as they have potentially unlimited data, and formally
*Streams are codata (potentially unlimited), not data (which is finite).
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail ScienceUniversity of Washington
The document summarizes a system called SQLShare that aims to make SQL-based data analysis more accessible to scientists by lowering initial setup costs and providing automated tools. It has been used by 50 unique users at 4 UW campus labs on 16GB of uploaded data from various science domains like environmental science and metagenomics. The system provides data uploading, query sharing, automatic English-to-SQL translation, and personalized query recommendations to lower barriers to working with relational databases for analysis.
Model-based Analysis of Large Scale Software RepositoriesMarkus Scheidgen
1) The document discusses a model-based framework for analyzing large scale software repositories. It involves reverse engineering software from version control systems to create abstract syntax tree models, applying transformations and queries to derive metrics and insights, and using Scala for flexible queries and transformations.
2) Two example analyses are described: calculating design structure matrices and propagation costs, and detecting cross-cutting concerns by analyzing co-changed methods within commits.
3) The goal is to enable scalable, language-independent analysis of ultra-large repositories through model-based techniques instead of analyzing raw code directly. This allows abstracting different languages and repositories with common models and analyses.
Using publicly available resources to build a comprehensive knowledgebase of ...Valery Tkachenko
There is a variety of public resources on the Internet which contain information about various aspects of chemical, biological and pharmaceutical domains. The quality, maturity, hosting organizations, team sizes behind these data resources vary wildly and as a consequence content cannot be always trusted and the effort of extracting information and preparing it for reuse is repeated again and again at various levels. This problem is especially serious in applications for QSAR, QSPR and QNAR modeling. On the other hand authors of this poster believe, based on their own extensive experience building various types of chemical, analytical and biological databases for decades, that the process of building such knowledgebase can be systematically described and automated. This poster will outline the work performed on text and data-mining various public resources on the Web, data curation process and making this information publicly available through a portal and a RESTful API. We will also demonstrate how such knowledgebase can be used for real-time QSAR and QSPR predictions.
Reference Representation in Large Metamodel-based DatasetsMarkus Scheidgen
This document discusses different model representations for large meta-model based datasets. It compares object-by-object representation to fragmentation strategies. Fragmentation breaks models into multiple fragments stored separately. The document evaluates different fragmentation strategies through theoretical analysis and implementation tests. It also compares part-of-source and relational representations and discusses applications of model fragmentation including software engineering and scientific data analysis.
Building Search & Recommendation EnginesTrey Grainger
In this talk, you'll learn how to build your own search and recommendation engine based on the open source Apache Lucene/Solr project. We'll dive into some of the data science behind how search engines work, covering multi-lingual text analysis, natural language processing, relevancy ranking algorithms, knowledge graphs, reflected intelligence, collaborative filtering, and other machine learning techniques used to drive relevant results for free-text queries. We'll also demonstrate how to build a recommendation engine leveraging the same platform and techniques that power search for most of the world's top companies. You'll walk away from this presentation with the toolbox you need to go and implement your very own search-based product using your own data.
The document discusses enabling live linked data by synchronizing semantic data stores with commutative replicated data types (CRDTs). CRDTs allow for massive optimistic replication while preserving convergence and intentions. The approach aims to complement the linked open data cloud by making linked data writable through a social network of data participants that follow each other's update streams. This would enable a "read/write" semantic web and transition linked data from version 1.0 to 2.0.
This document outlines the agenda for a two-day workshop on learning R and analytics. Day 1 will introduce R and cover data input, quality, and exploration. Day 2 will focus on data manipulation, visualization, regression models, and advanced topics. Sessions include lectures and demos in R. The goal is to help attendees learn R in 12 hours and gain an introduction to analytics skills for career opportunities.
Bioinformaticians constantly face challenges with data: from the large volumes of data to the need to integrate diverse data types. Relational databases have a long and successful history of managing data but have been unable to meet emerging needs of big data and highly integrated data stores. This talk discusses the limitations we face when using relational data models for bioinformatics applications. It describes the features, limitations and use cases of four alternative database models: key value databases, document databases, wide column data stores and graph databases. Use in bioinformatics applications is demonstrate with text mining and atherosclerosis research projects. The talk concludes with guidance on choosing an appropriate database model for varying bioinformatics requirements.
Building a semantic search system - one that can correctly parse and interpret end-user intent and return the ideal results for users’ queries - is not an easy task. It requires semantically parsing the terms, phrases, and structure within queries, disambiguating polysemous terms, correcting misspellings, expanding to conceptually synonymous or related concepts, and rewriting queries in a way that maps the correct interpretation of each end user’s query into the ideal representation of features and weights that will return the best results for that user. Not only that, but the above must often be done within the confines of a very specific domain - ripe with its own jargon and linguistic and conceptual nuances.
This talk will walk through the anatomy of a semantic search system and how each of the pieces described above fit together to deliver a final solution. We'll leverage several recently-released capabilities in Apache Solr (the Semantic Knowledge Graph, Solr Text Tagger, Statistical Phrase Identifier) and Lucidworks Fusion (query log mining, misspelling job, word2vec job, query pipelines, relevancy experiment backtesting) to show you an end-to-end working Semantic Search system that can automatically learn the nuances of any domain and deliver a substantially more relevant search experience.
Deep Learning on nVidia GPUs for QSAR, QSPR and QNAR predictionsValery Tkachenko
While we have seen a tremendous growth in machine learning methods over the last two decades there is still no one fits all solution. The next era of cheminformatics and pharmaceutical research in general is focused on mining the heterogeneous big data, which is accumulating at ever growing pace, and this will likely use more sophisticated algorithms such as Deep Learning (DL). There has been increasing use of DL recently which has shown powerful advantages in learning from images and languages as well as many other areas. However the accessibly of this technique for cheminformatics is hindered as it is not available readily to non-experts. It was therefore our goal to develop a DL framework embedded into a general research data management platform (Open Science Data Repository) which can be used as an API, standalone tool or integrated in new software as an autonomous module. In this poster we will present results of comparing performance of classic machine learning methods (Naïve Bayes, logistic regression, Support Vector Machines etc.) with Deep Learning and will discuss challenges associated with Ddeep Learning Neural Networks (DNN). The DNN learning models of different complexity (up to 6 hidden layers) were built and tuned (different number of hidden units per layer, multiple activation functions, optimizers, drop out fraction, regularization parameters, and learning rate) using Keras (https://keras.io/) and Tensorflow (www.tensorflow.org) and applied to various use cases connected to prediction of physicochemical properties, ADME, toxicity and calculating properties of materials. It was also shown that using nVidia GPUs significantly accelerates calculations, although memory consumption puts some limits on performance and applicability of standard toolkits 'as is'.
The document discusses OntoMaven repositories and the OMG API4KP standard. It describes two scenarios using distributed knowledge platforms: a connected patient system and semantic annotation of biodiversity data. The API4KP metamodel defines classes for knowledge sources, environments, operations and events. Distributed architecture styles for API4KP include direct access, remote invocation, and decoupled event-based systems. OntoMaven supports knowledge repositories with plug-ins for distributed scenarios.
Duke is an open source tool for deduplicating and linking records across different data sources without common identifiers. It indexes data using Lucene and performs searches to find potential matches. Duke was used in a real-world project linking data from Mondial and DBpedia, where it correctly linked 94.9% of records while avoiding wrong links. Duke is flexible, scalable, and incremental, making it suitable for ongoing use at Hafslund to integrate customer records from multiple systems and remove duplicates. Future work may include improving comparators, adding a web service interface, and exploring parallelism.
The document discusses different file organization methods for storing data on external storage devices. It describes sequential, direct access, and indexed sequential file organization. Sequential files store records in the order they are entered, requiring searching through all preceding records to access a non-sequential record. Direct access files allow direct retrieval of any record through its logical address. Indexed sequential files store records sequentially but have an index file to allow direct access by key. The document compares advantages and disadvantages of each method.
1. The document discusses implementing archetype-based systems and some challenges, such as a lack of standardization for templates and queries.
2. It describes some open-source tools for working with archetypes, such as the LinkEHR Archetype Editor, and experiences using archetypes at hospitals in Spain.
3. The challenges of mapping between different standards like openEHR, MML, and HL7 CDA are discussed, as well as representing concepts from one standard as archetypes.
Top Indian Mobile App to pay Bar and restaurant bills.rupleeapp
The document discusses the benefits of meditation for reducing stress and anxiety. Regular meditation practice can help calm the mind and body by lowering heart rate and blood pressure. Studies have shown that meditating for just 10-20 minutes per day can have significant positive impacts on both mental and physical health over time.
The document discusses how computation can accelerate the generation of new knowledge by enabling large-scale collaborative research and extracting insights from vast amounts of data. It provides examples from astronomy, physics simulations, and biomedical research where computation has allowed more data and researchers to be incorporated, advancing various fields more quickly over time. Computation allows for data sharing, analysis, and hypothesis generation at scales not previously possible.
BDE SC1 Workshop 3 - Open PHACTS Pilot (Kiera McNeice), BigData_Europe
Overview of Open PHACTS, the BDE Pilot project in SC1, presented at BDE SC1 Workshop 3, 13 December, 2017.
https://www.big-data-europe.eu/the-final-big-data-europe-workshop/
This document discusses using the T-BioInfo platform to provide practical education in bioinformatics. It describes how the platform can integrate different types of omics data and analysis into intuitive, visual pipelines. This allows non-experts to analyze and interpret complex datasets. Example projects are provided, such as using RNA-seq data to identify genes involved in a disease. The goal is to teach bioinformatics through collaborative, project-based learning without requiring programming skills. Learners would reconstruct simulated biological processes and contribute to ongoing analysis of real scientific datasets.
This document compares the performance of classification algorithms like decision tree, naive Bayes, and k-nearest neighbors using five data mining tools: RapidMiner, WEKA, Tanagra, Orange, and Knime. The algorithms are tested on an Indian liver patient dataset containing 416 liver patient and 167 non-liver patient records. Accuracy scores are reported from the confusion matrices generated by each tool. Overall, Knime achieved the highest accuracy for all three algorithms, with decision tree and k-nearest neighbor performing better than naive Bayes. WEKA had the lowest naive Bayes accuracy.
This document compares two solutions for filtering hierarchical data sets: Solution A uses MySQL and Python, while Solution B uses MongoDB and C++. Both solutions were tested on a 2011 MeSH data set using various filtering methods and thresholds. Solution A generally had faster execution times at lower thresholds, while Solution B scaled better to higher thresholds. However, the document concludes that neither solution is clearly superior, and further study is needed to evaluate their performance for real-world human users.
The suite of free software tools created within the OpenCB (Open Computational Biology – https://github.com/opencb) initiative makes it possible to efficiently manage large genomic databases.
These tools are not widely used, since there is quite a steep learning curve for their adoption owing to the complexity of the software stack, but they can be really cost-effective for hospitals, research institutions, and so on.
The objective of the talk is to show the potential of the OpenCB suite, the information needed to start using it, and the advantages for end users. BioDec is currently deploying a large OpenCGA installation for the Genetic Unit of one of the main Italian hospitals, where data on the order of hundreds of TBs will be managed and analyzed by bioinformaticians.
This document summarizes bioinformatics tools that can be used for analysis of high-throughput sequencing data for molecular diagnostics. It discusses databases for virulence factors and antimicrobial resistance as well as tools for assembly, annotation, pan-genome analysis, visualization, and commercial solutions. The presentation emphasizes that there is no single best tool and different approaches are needed for different questions. Collaboration with other researchers is recommended.
A consistent and efficient graphical User Interface Design and Querying Organ... (CSCJournals)
We propose a software layer called GUEDOS-DB on top of an Object-Relational Database Management System (ORDBMS). In this work we apply it to Molecular Biology, more precisely organelle complete genomes. We aim to offer biologists the possibility of accessing, in a unified way, information spread among heterogeneous genome databanks. In this paper, the goal is, first, to provide a visual schema graph through a number of illustrative examples. The human-computer interaction technique adopted in this visual designing and querying makes it very easy for biologists to formulate database queries compared with a linear textual query representation.
Presentation in the "Whole genome sequencing for clinical microbiology:Translation into routine applications" Symposium , Basel , Switzerland, 2 Sep 2017
This document discusses the challenges and opportunities biology faces with increasing data generation. It outlines four key points:
1) Research approaches for analyzing infinite genomic data streams, such as digital normalization which compresses data while retaining information.
2) The need for usable software and decentralized infrastructure to perform real-time, streaming data analysis.
3) The importance of open science and reproducibility given most researchers cannot replicate their own computational analyses.
4) The lack of data analysis training in biology and efforts at UC Davis to address this through workshops and community building.
A scalable ontology reasoner via incremental materialization (Rokan Uddin Faruqui)
This document describes an incremental materialization approach for scalable ontology reasoning. It summarizes an existing system called OwlOntDB that uses materialization to infer and store implicit knowledge from an ontology in a database. The goal is to develop an incremental materialization approach to address the limitations of OwlOntDB, namely high initial processing times and the need to re-materialize the whole ontology for updates. It describes how incremental materialization can efficiently handle insertions and deletions by only updating the affected parts of the inferred knowledge. Experimental results show that the incremental approach significantly improves performance over a non-incremental baseline, especially for ontologies with many instances.
Biology, medicine, physics, astrophysics, chemistry: all these scientific domains need to process large amounts of data with more and more complex software systems. For achieving reproducible science, there are several challenges ahead involving multidisciplinary collaboration and socio-technical innovation, with software at the center of the problem. Despite the availability of data and code, several studies report that the same data analyzed with different software can lead to different results. I see this problem as a manifestation of deep software variability: many factors (operating system, third-party libraries, versions, workloads, compile-time options and flags, etc.), themselves subject to variability, can alter the results, to the point of dramatically changing the conclusions of some scientific studies. In this keynote, I argue that deep software variability is both a threat and an opportunity for reproducible science. I first outline some work on (deep) software variability, reporting preliminary evidence of complex interactions between variability layers. I then link the ongoing work on variability modelling and deep software variability in the quest for reproducible science.
Driving Deep Semantics in Middleware and Networks: What, why and how? (Amit Sheth)
Amit Sheth, "Driving Deep Semantics in Middleware and Networks: What, why and how?," Keynote talk at Semantic Sensor Networks Workshop at the 5th International Semantic Web Conference (ISWC-2006), November 6, 2006, Athens, Georgia, USA.
The document introduces Tag.bio as a low-code analytics application platform built from interconnected data products in a data mesh architecture. It consists of data, algorithms, and analysis apps contributed by different groups - data engineers, data scientists, and domain experts. The platform can integrate various data sources and enable collaboration between groups. It then provides demos of the Tag.bio developer studio and data portal. Key capabilities discussed include integration with AWS services like AI/ML and HealthLake, as well as security features like confidential computing. Example use cases presented are for clinical trials, healthcare, life sciences, and universities.
LO-PHI: Low-Observable Physical Host Instrumentation for Malware Analysis (Pietro De Nicolao)
Presentation of the paper "LO-PHI: Low-Observable Physical Host Instrumentation for Malware Analysis" for the Advanced Topics in Computer Security course of Prof. Stefano Zanero.
Source and further information: https://github.com/pietrodn/lo-phi
For a Bioinformatics Discussion for Students and Post-Docs (BioDSP) meeting: Expands on Sandve's "Ten Simple Rules for Reproducible Computational Research"
Facilitating Data Curation: a Solution Developed in the Toxicology Domain (Christophe Debruyne)
Christophe Debruyne, Jonathan Riggio, Emma Gustafson, Declan O'Sullivan, Mathieu Vinken, Tamara Vanhaecke, Olga De Troyer.
Presented at the 2020 IEEE 14th International Conference on Semantic Computing, San Diego, California, 3-5 February 2020
Toxicology aims to understand the adverse effects of chemical compounds or physical agents on living organisms. For chemicals, much information regarding safety testing of cosmetic ingredients is now scattered in a plethora of safety evaluation reports. Toxicologists in our university intend to collect this information into a single repository. Their current approach uses spreadsheets, does not scale well, and makes data curation and querying cumbersome. Semantic technologies (e.g., RDF, OWL, and Linked Data principles) would be more appropriate for this purpose. However, this technology is not very accessible to toxicologists without extensive training. In this paper, we report on a tool that supports subject matter experts in the construction of an RDF-based knowledge base for the toxicology domain. The tool uses the jigsaw metaphor for guiding the subject matter experts. We demonstrate that the jigsaw metaphor is a viable option for generating RDF. Future work includes investigating appropriate methods and tools for knowledge evolution and data analysis.
2015-02-10 The Open PHACTS Discovery Platform: Semantic Data Integration for ... (open_phacts)
The Open PHACTS Discovery Platform integrates multiple biomedical data resources into a single open access point using semantic web technology. It is guided by business questions from pharmaceutical companies to integrate data from sources like ChEMBL, DrugBank, UniProt, and more. The platform is run as a public-private partnership through 2021 to support drug discovery.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Things to Consider When Choosing a Website Developer for your Website | FODUUFODUU
Choosing the right website developer is crucial for your business. This article covers essential factors to consider, including experience, portfolio, technical skills, communication, pricing, reputation & reviews, cost and budget considerations and post-launch support. Make an informed decision to ensure your website meets your business goals.
Infrastructure Challenges in Scaling RAG with Custom AI modelsZilliz
Building Retrieval-Augmented Generation (RAG) systems with open-source and custom AI models is a complex task. This talk explores the challenges in productionizing RAG systems, including retrieval performance, response synthesis, and evaluation. We’ll discuss how to leverage open-source models like text embeddings, language models, and custom fine-tuned models to enhance RAG performance. Additionally, we’ll cover how BentoML can help orchestrate and scale these AI components efficiently, ensuring seamless deployment and management of RAG systems in the cloud.
Driving Business Innovation: Latest Generative AI Advancements & Success StorySafe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Taking AI to the Next Level in Manufacturing.pdfssuserfac0301
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
6. Ideas and approaches to help build your organization's AI strategy.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
Van Emden shows how Nx can simplify the developer's life and facilitate a rapid transition from concept to production-ready applications. He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
AI-Powered Food Delivery Transforming App Development in Saudi Arabia - Techgropse Pvt. Ltd.
In this blog post, we'll delve into the intersection of AI and app development in Saudi Arabia, focusing on the food delivery sector. We'll explore how AI is revolutionizing the way Saudi consumers order food, how restaurants manage their operations, and how delivery partners navigate the bustling streets of cities like Riyadh, Jeddah, and Dammam. Through real-world case studies, we'll showcase how leading Saudi food delivery apps are leveraging AI to redefine convenience, personalization, and efficiency.
Best 20 SEO Techniques To Improve Website Visibility In SERP - Pixlogix Infotech
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
Essentials of Automations: The Art of Triggers and Actions in FME - Safe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Climate Impact of Software Testing at Nordic Testing Days - Kari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Sustainability can be added to the quality characteristics and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Programming Foundation Models with DSPy - Meetup Slides - Zilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
OpenID AuthZEN Interop Read Out - Authorization - David Brossard
During Identiverse 2024 and EIC 2024, members of the OpenID AuthZEN WG got together and demoed their authorization endpoints conforming to the AuthZEN API
How to Get CNIC Information System with Paksim Ga - danishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Monitoring and Managing Anomaly Detection on OpenShift - Tosin Akinosho
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
Ontologies Ontop Databases
1. Ontologies Ontop Databases
Martin Rezk (PhD)
Faculty of Computer Science
Free University of Bozen-Bolzano, Italy
Semantic Web Applications and Tools for Life Sciences
9/12/MMXIV
2. Who am I?
Postdoc at: Free University of Bozen-Bolzano, Bolzano, Italy.
Research topics: Ontology-based Data Access (OBDA),
Efficient Query Answering, Query rewriting, Data integration.
Leader of the -ontop- project.
3. What are we going to learn today?
How to organize and access your data using ontologies.
How to do it with our system: –ontop–.
How to use this approach for data integration and consistency
checking.
4. Disclaimer
Part of the material here has been taken from a tutorial by
Mariano Rodriguez (2011)
http://www.slideshare.net/marianomx/ontop-a-tutorial
5. Disclaimer: License
License
This work is supported by the Optique Project and is licensed
under the Creative Commons Attribution-Share Alike 3.0 License
http://creativecommons.org/licenses/by-sa/3.0/
6. Overview
Introduction: Optique and Ontop
Ontology Based Data Access
The Database:
Ontologies
Mappings
Virtual Graph
Querying
Data Integration
Checking Consistency
Conclusions
7. Outline
Introduction: Optique and Ontop
Ontology Based Data Access
The Database:
Ontologies
Mappings
Virtual Graph
Querying
Data Integration
Checking Consistency
Conclusions
9. When does this Go Wrong?
I’m sure the information is there but there are so many
concepts involved that I can’t find it in the application.
I can’t say what I want to the application.
The application doesn’t know about this data.
They Get Lost!!
10. How they Tackle This Problem
(Diagram: the engineer hands an information need to an IT-expert, who queries a specialised application and returns the answers.)
11. How they Tackle This Problem
It may take weeks to respond.
It takes several years to master the data stores and user
needs.
Siemens loses up to 50,000,000 euros per year because of this!
13. Optique
Provides a semantic end-to-end connection between users and
data sources;
Enables users to rapidly formulate intuitive queries using
familiar vocabularies and conceptualisations.
15. How Optique Tackles the Problem
(Diagram: Big Data dimensions: Velocity, Volume, Variety, Complexity.)
(i) Providing scalable big/fast data infrastructures.
(ii) Providing the ability to cope with variety and complexity in the
data. (OBDA)
(iii) Providing a good understanding of the data. (OBDA)
16. -ontop-: Our OBDA System
-ontop- is a platform to query databases as virtual RDF graphs
using SPARQL, exploiting OWL 2 QL ontologies and
mappings.
It is currently being developed in the context of the Optique
project.
Development of -ontop- started 4.5 years ago.
-ontop- is already well established:
the website has ~500 visits per month
it got 500 new registrations in the past year
-ontop- is open-source and released under Apache license.
We support all major relational DBs (Oracle, DB2, Postgres,
MySQL, etc.)
17. Outline
Introduction: Optique and Ontop
Ontology Based Data Access
The Database:
Ontologies
Mappings
Virtual Graph
Querying
Data Integration
Checking Consistency
Conclusions
18. Ontology-based data access (OBDA)
(Diagram: user queries are posed over the ontology, which is connected by mappings to multiple data sources.)
Data source(s): are external and independent (possibly multiple
and heterogeneous).
Ontology: provides a unified common vocabulary, and a
conceptual view of the data.
Mappings: relate each term in the ontology to a set of (SQL)
views over (possibly federated) data.
19. Simple Life Science Running Example
Data source(s): hospital databases with cancer patients (first
one, then two).
Ontology: A common domain vocabulary defining Patient,
Cancer, LungCancer, etc.
Mappings: Relating the vocabulary and the databases.
20. Pre-requisites
Before we start you need:
Java 1.7 (check by typing: java -version)
The material online:
https://www.dropbox.com/sh/316cavgoavjtnu3/AAADoau5NuNGq3zXO4JJQ6rya?dl=0
21. Outline
Introduction: Optique and Ontop
Ontology Based Data Access
The Database:
Ontologies
Mappings
Virtual Graph
Querying
Data Integration
Checking Consistency
Conclusions
22. DB Engines and SQL
Standard way to store LARGE volumes of data: Mature,
Robust and FAST.
Domain is structured as tables, data becomes rows in these
tables.
Powerful query language (SQL) to retrieve this data.
Major companies have been developing SQL DBs for the last
30 years (IBM, Microsoft, Oracle).
24. Data Source
Cancer Patient Database 1
Table: tbl_patient
PatientId Name Type Stage
1 Mary false 4
2 John true 7
Type is:
false for Non-Small Cell Lung Cancer (NSCLC)
true for Small Cell Lung Cancer (SCLC)
Stage is:
1-6 for NSCLC stages: I,II,III,IIIa,IIIb,IV
7-8 for SCLC stages: Limited,Extensive
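For example, Mary's row (false, 4) encodes NSCLC at stage IIIa, while John's row (true, 7) encodes SCLC at the limited stage.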
25. Creating the DB in H2
H2 is a pure Java SQL database.
Just unzip the downloaded package
Easy to run, just run the scripts:
Open a terminal (on Mac: Terminal.app; on Windows: cmd.exe)
Move to the H2 folder (e.g., cd h2)
Start H2 using the H2 scripts:
sh h2.sh (on Mac/Linux; you might need chmod u+x h2.sh first)
h2w.bat (on Windows)
26. How it looks:
jdbc:h2:tcp: = protocol information
localhost = server location
helloworld = database name
28. Creating the table
You can use the files create.sql and insert.sql
CREATE TABLE "tbl_patient" (
patientid INT NOT NULL PRIMARY KEY,
name VARCHAR(40),
type BOOLEAN,
stage TINYINT
)
Adding Data:
INSERT INTO "tbl_patient"
(patientid,name,type,stage)
VALUES
(1,'Mary',false,4),
(2,'John',true,7);
29. Example SQL Query
Patients with type false and stage IIIa or above (select.sql)
SELECT patientid
FROM "tbl_patient"
WHERE type = false AND stage >= 4
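On the sample data this returns only patient 1 (Mary): John's type is true, so he does not match.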
30. The More Meaningful Question
Give me the id and the name of the patients with a tumor at stage
IIIa
31. Outline
Introduction: Optique and Ontop
Ontology Based Data Access
The Database:
Ontologies
Mappings
Virtual Graph
Querying
Data Integration
Checking Consistency
Conclusions
32. Ontologies: Conceptual view
Definition
An artifact that contains a vocabulary, relations between
the terms in the vocabulary, and that is expressed in a
language whose syntax and semantics (meaning of the
syntax) are shared and agreed upon.
Mike type Patient
NSCLC subClassOf LungCancer
LungCancer subClassOf Cancer
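In Turtle syntax (a minimal sketch; prefix declarations omitted), these statements read:
:Mike rdf:type :Patient .
:NSCLC rdfs:subClassOf :LungCancer .
:LungCancer rdfs:subClassOf :Cancer .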
33. Ontologies: Protege
Go to the protégé-ontop folder from your material. This is a
Protégé 4.3 package that includes the ontop plugin
Run Protégé from the console using the run.bat or run.sh
scripts. That is, execute:
cd Protege_4.3/; sh run.sh
34. The ontology: Creating Concepts and
Properties
Add the concept: Patient.
(See PatientOnto.owl)
35. The ontology: Creating Concepts and
Properties
Add these object properties:
(See PatientOnto.owl)
36. The ontology: Creating Concepts and
Properties
Add these data properties:
(See PatientOnto.owl)
37. Outline
Introduction: Optique and Ontop
Ontology Based Data Access
The Database:
Ontologies
Mappings
Virtual Graph
Querying
Data Integration
Checking Consistency
Conclusions
38. Mappings
We have the vocabulary, the database, now we need to link
those two.
Mappings define triples (subject, property, object) out of SQL
queries.
These triples are accessible at query time (the on-the-fly
approach) or can be imported into the OWL ontology (the ETL
approach).
Definition (Intuition)
A mapping has the following form:
TripleTemplate <- SQL query to build the triples
and represents the triples constructed from each result row
returned by the SQL query in the mapping.
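For example, with the first mapping on the next slide, the row (1, 'Mary', false, 4) of tbl_patient yields the triple (:db1/1, rdf:type, :Patient).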
39. The Mappings
Sample row: p.Id = 1, Name = Mary, Type = false, Stage = 4
(:db1/{p.id}, rdf:type, :Patient) <- SELECT p.id FROM tbl_patient
(:db1/{p.id}, :hasName, {name}) <- SELECT p.id, name FROM tbl_patient
(:db1/{p.id}, :hasNeoplasm, :db1/neoplasm/{p.id}) <- SELECT p.id FROM tbl_patient
(:db1/neoplasm/{p.id}, :hasStage, :stage-IIIa) <- SELECT p.id FROM tbl_patient WHERE stage = 4
41. The Mappings
Using the Ontop Mapping tab, we now need to define the
connection parameters to our lung cancer database
Steps:
1. Switch to the Ontop Mapping tab
2. Add a new data source (give it a name, e.g., PatientDB)
3. Define the connection parameters as follows:
Connection URL: jdbc:h2:tcp://localhost/helloworld
Username: sa
Password: (leave empty)
Driver class: org.h2.Driver (choose it from the drop down menu)
4. Test the connection using the “Test Connection” button
43. The Mappings
Switch to the “Mapping Manager” tab in the ontop mappings
tab.
Select your datasource
click Create:
target: :db1/{patientid} a :Patient .
source: SELECT patientid FROM "tbl_patient"
target: :db1/{patientid} :hasName {name} .
source: SELECT patientid, name FROM "tbl_patient"
target: :db1/{patientid} :hasNeoplasm :db1/neoplasm/{patientid}.
source: SELECT patientid FROM "tbl_patient"
target: :db1/neoplasm/{patientid} :hasStage :stage-IIIa .
source: SELECT patientid FROM "tbl_patient" where stage=4
44. The Mappings
Now we classify the neoplasm individual using our knowledge of
the database.
We know that "false" in the type column of tbl_patient indicates
a "Non-Small Cell Lung Cancer", so we classify the corresponding
neoplasm as a :NSCLC.
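A mapping along these lines would do it (a sketch; the exact mapping in the tutorial material may differ):
target: :db1/neoplasm/{patientid} a :NSCLC .
source: SELECT patientid FROM "tbl_patient" WHERE type = false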
45. Outline
Introduction: Optique and Ontop
Ontology Based Data Access
The Database:
Ontologies
Mappings
Virtual Graph
Querying
Data Integration
Checking Consistency
Conclusions
46. Virtual Graph
Data:
The vocabulary is more domain oriented and independent from
the DB.
No more values to encode types or stages.
Later, this will allow us to easily integrate new data or domain
information (e.g., an ontology).
Our data sources are now documented!
47. Virtual Graph
Data and Inference:
There is a new individual :db1/neoplasm/1 that stands for the
cancer (tumor) of Mary. This allows the user to query specific
properties of the tumor independently of the patient.
We get extra information as shown above.
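Intuitively, applying the four mappings above to Mary's row yields the virtual triples:
:db1/1 rdf:type :Patient .
:db1/1 :hasName "Mary" .
:db1/1 :hasNeoplasm :db1/neoplasm/1 .
:db1/neoplasm/1 :hasStage :stage-IIIa .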
48. Outline
Introduction: Optique and Ontop
Ontology Based Data Access
The Database:
Ontologies
Mappings
Virtual Graph
Querying
Data Integration
Checking Consistency
Conclusions
49. On-the-fly access to the DB
Recall our information need: Give me the id and the name of
the patients with a tumor at stage IIIa.
Enable Ontop in the “Reasoner” menu
50. On-the-fly access to the DB
In the ontop SPARQL tab add all the prefixes
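For example (the base IRI below is an assumption; use the one declared in PatientOnto.owl):
PREFIX : <http://www.example.org/patient#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>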
51. On-the-fly access to the DB
Write the SPARQL Query
SELECT ?p ?name WHERE
{ ?p rdf:type :Patient .
?p :hasName ?name .
?p :hasNeoplasm ?tumor .
?tumor :hasStage :stage-IIIa .}
Click execute
This is the main way to access data in ontop: you query it
with SPARQL.
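On the sample data, the query returns ?p = :db1/1 with ?name = "Mary".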
52. How we do Inference...
We embed inference into the query
We do not need to reason with the (Big) Data
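For example, given the ontology axiom :NSCLC rdfs:subClassOf :Neoplasm, a query for the instances of :Neoplasm is rewritten to also retrieve the instances of :NSCLC before being translated to SQL (see the extra slides at the end).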
54. Outline
Introduction: Optique and Ontop
Ontology Based Data Access
The Database:
Ontologies
Mappings
Virtual Graph
Querying
Data Integration
Checking Consistency
Conclusions
55. What about Data Integration?
Cancer Patient Database 2
Table: T_Name
PId Nombre
1 Anna
2 Mike
Table: T_NSCLC
Id hosp Stge
1 X two
2 Y one
Table: T_SCLC
key hosp St
1 XXX
2 YYY
DB information is distributed in multiple tables.
The IDs of the two DBs overlap.
Information is encoded differently, e.g., the stage of the
cancer is text (one, two, ...).
58. New Mappings
The URI’s for the new individuals differentiate the data sources
(db2 vs. db1)
Being an instance of NSCLC or SCLC now depends on the
table, not on a column value.
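For instance (a sketch; column names taken from the tables above):
target: :db2/{PId} a :Patient .
source: SELECT "PId" FROM "T_Name"
target: :db2/neoplasm/{Id} a :NSCLC .
source: SELECT "Id" FROM "T_NSCLC"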
59. Outline
Introduction: Optique and Ontop
Ontology Based Data Access
The Database:
Ontologies
Mappings
Virtual Graph
Querying
Data Integration
Checking Consistency
Conclusions
60. Consistency
A logic-based ontology language, such as OWL, allows
ontologies to be specified as logical theories; this makes it
possible to constrain the relationships between concepts,
properties, and data.
In OBDA, inconsistencies arise when your mappings violate the
constraints imposed by the ontology.
In OBDA we have two types of constraints:
Disjointness: The intersection between classes Patient and
Employee should be empty. There can also be disjoint
properties.
Functional Properties: Every patient has at most one name.
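In OWL, a minimal sketch of these two constraints:
:Patient owl:disjointWith :Employee .
:hasName rdf:type owl:FunctionalProperty .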
65. Outline
Introduction: Optique and Ontop
Ontology Based Data Access
The Database:
Ontologies
Mappings
Virtual Graph
Querying
Data Integration
Checking Consistency
Conclusions
66. Conclusions
Ontologies give you a common vocabulary to formulate the
queries, and mappings to find the answers.
Ontologies and semantic technology can help to handle the
problem of accessing Big Data.
Diversity:
Using ontologies that describe particular domains makes it
possible to hide the storage complexity.
Agreement on data identifiers allows for integration of datasets.
Understanding: agreement on vocabulary allows you to better
define your data and eases information exchange.
There is no need for computationally expensive ETL processes.
Reasoning is scalable because we reason at the query level.
You do not need to have everything ready to use it!
67. What I left outside this talk...
Semantic Query Optimisation
SWRL and Recursion
Performance Evaluation
Aggregates and bag semantics
Details about database engines
Tons of theory
etc. etc. etc...
69. Extra: Where Reasoning takes Place
If we pose the query asking for all the instances of the class
Neoplasm:
SELECT ?x WHERE { ?x rdf:type :Neoplasm . }
70. Extra: Where Reasoning takes Place
If we pose the query asking for all the instances of the class
Neoplasm:
SELECT ?x WHERE { ?x rdf:type :Neoplasm . }
(Intuitively) -ontop- will translate it into:
SELECT ?x WHERE { { ?x rdf:type :Neoplasm . }
UNION
{ ?x rdf:type :BenignNeoplasm . }
UNION
{ ?x rdf:type :MalignantNeoplasm . }
UNION
...
{ ?x rdf:type :NSCLC . }
UNION
{ ?x rdf:type :SCLC . } }
72. Extra: Where Reasoning takes Place
If we pose the query asking for all the instances of the class
Neoplasm:
SELECT ?x WHERE { ?x rdf:type :Neoplasm . }
(Intuitively) -ontop- will translate it into:
SELECT CONCAT(':db1/neoplasm/', patientid) AS x
FROM "tbl_patient"