We propose HARE, a SPARQL query engine that combines human and machine query processing to augment the completeness of query answers. We empirically assessed the effectiveness of HARE on 50 SPARQL queries over DBpedia. Experimental results clearly show that our solution accurately enhances answer completeness.
This work was presented at The Web Conference 2018, Journal Track. ACM Open ToC Service: https://dl.acm.org/authorize?N655127
Reference:
Maribel Acosta, Elena Simperl, Fabian Flöck, and Maria-Esther Vidal. 2017. Enhancing answer completeness of SPARQL queries via crowdsourcing. Web Semantics: Science, Services and Agents on the World Wide Web (2017). https://doi.org/10.1016/j.websem.2017.07.001
Information access over linked data requires determining the subgraph(s) in linked data's underlying graph that correspond to the required information need. Usually, an information access framework can retrieve richer information by checking a large number of possible subgraphs. However, checking a large number of possible subgraphs increases information access complexity, which makes information access frameworks less effective. Many contemporary linked data information access frameworks reduce this complexity by introducing different heuristics, but they then suffer at retrieving richer information; other frameworks ignore the complexity altogether. A practically usable framework, however, should retrieve richer information with lower complexity. We hypothesize that pre-processed statistics of linked data can be used to efficiently check a large number of possible subgraphs, helping to retrieve comparatively richer information with lower data access complexity. Preliminary evaluation of our proposed hypothesis shows promising performance.
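The hypothesis can be illustrated with a minimal sketch: if per-predicate triple counts are precomputed, any candidate subgraph containing a predicate with zero recorded triples can be pruned without touching the graph. All predicate names and counts below are invented for illustration; this is not the authors' actual system.

```python
# Hypothetical sketch: prune candidate subgraphs using precomputed
# triple-pattern statistics, so only subgraphs with a non-zero estimated
# match count are actually checked against the underlying graph.

# Precomputed statistics: estimated number of triples per predicate
# (illustrative values, not real DBpedia statistics).
PREDICATE_CARDINALITY = {
    "dbo:birthPlace": 1_200_000,
    "dbo:spouse": 90_000,
    "dbo:shoeSize": 0,  # predicate absent from the dataset
}

def estimated_matches(subgraph_predicates):
    """Lower-bound estimate: a subgraph cannot match if any of its
    triple patterns uses a predicate with zero recorded triples."""
    return min(PREDICATE_CARDINALITY.get(p, 0) for p in subgraph_predicates)

def prune(candidate_subgraphs):
    """Keep only candidate subgraphs worth checking against the data."""
    return [sg for sg in candidate_subgraphs if estimated_matches(sg) > 0]

candidates = [
    ["dbo:birthPlace", "dbo:spouse"],    # plausible, keep
    ["dbo:birthPlace", "dbo:shoeSize"],  # provably empty, prune
]
print(prune(candidates))
```

Cheap pruning of provably empty subgraphs is what would let such a framework afford to consider many more candidates than heuristics-only approaches.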
Propagation of Policies in Rich Data Flows - Enrico Daga
Enrico Daga† Mathieu d’Aquin† Aldo Gangemi‡ Enrico Motta†
† Knowledge Media Institute, The Open University (UK)
‡ Université Paris13 (France) and ISTC-CNR (Italy)
The 8th International Conference on Knowledge Capture (K-CAP 2015)
October 10th, 2015 - Palisades, NY (USA)
http://www.k-cap2015.org/
The NPOESS program uses the Unified Modeling Language (UML) to describe the format of the HDF5 files it produces. For each unique type of data product, the HDF5 storage organization and the means to retrieve the data are the same. This provides a consistent data retrieval interface for manual and automated users of the data, without which custom development and cumbersome maintenance would be required. The data formats are described using UML to provide a profile of HDF5 files.
An introduction to frequent pattern mining algorithms and their usage in mining log data. Presented by Krishna Sridhar (Dato) at Seattle DAML meetup, Feb 2016.
Analysing streams of text data to extract topics is an important task for getting useful insights to be leveraged in subsequent workflows. For example, extracting topics from text that is continuously ingested into a search engine can be useful for tagging documents with important keywords or concepts to be used at search time. Another use case is analysing support tickets to gain insights into the most common customer problems.
In this talk we illustrate how to use Flink's dynamic processing capabilities to continuously train topic models from unlabelled text and to use such models to extract topics from the data itself. Such topic models are built by leveraging distributed representations of words and documents.
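As a rough, Flink-free illustration of the stream-driven idea (this is not the talk's actual pipeline or model), one can tag each incoming document with its most distinctive terms using corpus statistics accumulated from the stream itself:

```python
# Illustrative sketch only: tag each incoming document with its most
# distinctive terms, using corpus term frequencies accumulated from the
# stream as a crude stand-in for a continuously trained topic model.
from collections import Counter

corpus_tf = Counter()  # updated continuously as documents stream in

def tag_document(text, top_k=2):
    tokens = text.lower().split()
    corpus_tf.update(tokens)
    local = Counter(tokens)
    # Terms frequent in this document but rare corpus-wide score highest.
    scored = sorted(local, key=lambda t: (local[t] / corpus_tf[t], t), reverse=True)
    return scored[:top_k]
```

A real deployment would replace the frequency ratio with a learned model (e.g. one built on word/document embeddings, as the talk describes), but the update-then-tag loop per stream element is the same shape.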
HARE: A Hybrid SPARQL Engine to Enhance Query Answers via Crowdsourcing - Maribel Acosta Deibe
Best Student Paper Award at the 8th International Conference on Knowledge Capture (K-CAP 2015).
http://tinyurl.com/hare-paper
Abstract:
Due to the semi-structured nature of RDF data, missing values affect answer completeness of queries that are posed against RDF. To overcome this limitation, we present HARE, a novel hybrid query processing engine that brings together machine and human computation to execute SPARQL queries. We propose a model that exploits the characteristics of RDF in order to estimate the completeness of portions of a data set. The completeness model, complemented by crowd knowledge, is used by the HARE query engine to decide on the fly which parts of a query should be executed against the data set or via crowd computing. To evaluate HARE, we created and executed a collection of 50 SPARQL queries against the DBpedia data set. Experimental results clearly show that our solution accurately enhances answer completeness.
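The on-the-fly routing decision described in the abstract can be sketched minimally as follows. The completeness values, predicate names, and threshold are invented for illustration; the actual HARE completeness model is considerably more elaborate than a per-predicate lookup.

```python
# Hedged sketch of completeness-based routing: estimate how complete a
# portion of the dataset is and send a triple pattern to the crowd when
# the estimate falls below a threshold. Numbers are illustrative only.

# Estimated completeness of dataset portions, keyed by predicate, in [0, 1].
COMPLETENESS = {
    "dbo:birthPlace": 0.95,
    "dbo:deathPlace": 0.40,
}

def route(predicate, threshold=0.7):
    """Return 'dataset' or 'crowd' for a triple pattern's predicate."""
    completeness = COMPLETENESS.get(predicate, 0.0)
    return "dataset" if completeness >= threshold else "crowd"
```

Under this scheme, well-populated portions are answered by the engine, while sparse portions (or unknown predicates) fall back to crowd computing.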
(The HARE logo is based on artwork by icons8: https://icons8.com/.)
Crowdsourcing the Quality of Knowledge Graphs: A DBpedia Study - Maribel Acosta Deibe
Summary of crowdsourcing studies to assess the quality of knowledge graphs and complete missing values. Results focus on findings over the DBpedia knowledge graph (https://wiki.dbpedia.org/).
Related publications:
Acosta, M., Zaveri, A., Simperl, E., Kontokostas, D., Auer, S., & Lehmann, J. Crowdsourcing Linked Data Quality Assessment. In International Semantic Web Conference (pp. 260-276), 2013.
Acosta, M., Zaveri, A., Simperl, E., Kontokostas, D., Flöck, F., & Lehmann, J. Detecting Linked Data Quality issues via Crowdsourcing: A DBpedia Study. Semantic Web Journal, 9(3), 303-335, 2018.
Acosta, M., Simperl, E., Flöck, F., & Vidal, M. E. HARE: A hybrid SPARQL engine to enhance query answers via crowdsourcing. In Proceedings of the 8th International Conference on Knowledge Capture (p. 11). 2015. Best Student Paper Award.
Acosta, M., Simperl, E., Flöck, F., & Vidal, M. E. Enhancing answer completeness of SPARQL queries via crowdsourcing. Journal of Web Semantics, 45, 41-62, 2017.
Acosta, M., Simperl, E., Flöck, F., & Vidal, M. E. HARE: An engine for enhancing answer completeness of SPARQL queries via crowdsourcing. Companion Volume of the Web Conference (pp. 501-505). 2018.
Deep learning methods applied to physicochemical and toxicological endpoints - Valery Tkachenko
Chemical and pharmaceutical companies, and government agencies regulating both chemical and biological compounds, all strive to develop new methods to provide efficient prioritization, evaluation and safety assessments for the hundreds of new chemicals that enter the market annually. While there is a lot of historical data available within the various agencies, organizations and companies, significant gaps remain in both the quantity and quality of the available data, coupled with a lack of optimal predictive methods. Traditional QSAR methods are based on sets of features (fingerprints) which represent the functional characteristics of chemicals. Unfortunately, due to both data gaps and limitations in the development of QSAR models, read-across approaches have become a popular area of research. Successes in the application of Artificial Neural Networks, and specifically of Deep Learning Neural Networks, have delivered new optimism that the lack of data and limited feature sets can be overcome by using Deep Learning methods. In this poster we present a comparison of various machine learning methods applied to several toxicological and physicochemical parameter endpoints. This abstract does not reflect U.S. EPA policy.
Semantics and optimisation of the SPARQL 1.1 federation extension - Oscar Corcho
Presentation given at ESWC 2011 for the paper "Semantics and optimisation of the SPARQL 1.1 federation extension". Buil-Aranda C., Arenas M., Corcho O. ESWC 2011, May 2011, Hersonissos, Greece.
Building Learning to Rank (LTR) search reranking models using Large Language ... - Sujit Pal
Search engineers have many tools to address relevance. Older tools are typically unsupervised (statistical, rule based) and require large investments in manual tuning effort. Newer ones involve training or fine-tuning machine learning models and vector search, which require large investments in labeling documents with their relevance to queries.
Learning to Rank (LTR) models are in the latter category. However, their popularity has traditionally been limited to domains where user data can be harnessed to generate labels that are cheap and plentiful, such as e-commerce sites. In domains where this is not true, labeling often involves human experts, and results in labels that are neither cheap nor plentiful. This effectively becomes a roadblock to adoption of LTR models in these domains, in spite of their effectiveness in general.
Generative Large Language Models (LLMs) with parameters in the 70B+ range have been found to perform well at tasks that require mimicking human preferences. Labeling query-document pairs with relevance judgements for training LTR models is one such task. Using LLMs for this task opens up the possibility of obtaining a potentially unlimited number of query judgment labels, and makes LTR models a viable approach to improving the site’s search relevancy.
In this presentation, we describe work that was done to train and evaluate four LTR-based re-rankers against lexical, vector, and heuristic search baselines. The models were a mix of pointwise, pairwise and listwise, and required different strategies to generate labels for them. All four models outperformed the lexical baseline, and one of the four models outperformed the vector search baseline as well. None of the models beat the heuristics baseline, although two came close; however, it is important to note that the heuristics were built up over months of trial and error and required familiarity with the search domain, whereas the LTR models were built in days and required much less familiarity.
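To make the label-strategy point concrete: a pairwise model, for instance, trains on document preference pairs rather than on absolute grades. A hedged sketch (document names and grades are invented) of deriving such pairs from the graded judgments an LLM judge might produce:

```python
# Sketch: turn graded relevance judgments for one query into the
# preference pairs a pairwise LTR model trains on. Inputs are invented.
from itertools import combinations

def to_pairs(judgments):
    """judgments: {doc_id: relevance_grade}. Returns (preferred, other)
    pairs for every pair of documents with different grades."""
    pairs = []
    for a, b in combinations(sorted(judgments), 2):
        if judgments[a] > judgments[b]:
            pairs.append((a, b))
        elif judgments[b] > judgments[a]:
            pairs.append((b, a))
        # equal grades carry no signal for a pairwise objective
    return pairs

print(to_pairs({"d1": 3, "d2": 1, "d3": 3}))
```

A pointwise model would consume the grades directly, and a listwise model the whole ranked list, which is why each model type needed its own labeling strategy.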
The construction of QSAR models is critically dependent on the quality of the available data. As part of our efforts to develop public platforms that provide access to predictive models, we have attempted to discriminate the influence of the quality versus the quantity of data available to develop and validate QSAR models. We have focused our efforts on the widely used EPISuite software, initially developed over two decades ago. Specific examples of quality issues in the EPISuite data include multiple records for the same chemical structure with different measured property values, inconsistency between the structure, chemical name and CAS registry number within single records, the inability to convert SMILES strings into chemical structures, hypervalency in the chemical structures, and the absence of stereochemistry for thousands of data records. Relative to the era of EPISuite development, modern cheminformatics tools allow for more advanced capabilities in terms of chemical structure representation and storage, as well as enabling automated data validation and standardization approaches to examine data quality. This presentation reviews both our manual and automated approaches to examining key datasets related to the EPISuite training and test data. This includes approaches to validate chemical structure representations (e.g. molfile and SMILES) against identifiers (chemical names and registry numbers), as well as approaches to standardize the data into QSAR-consumable formats for modeling. We have quantified and segregated the data into various quality categories to allow us to thoroughly investigate the resulting models that can be developed from these data slices, and to examine to what extent efforts invested in the development of large high-quality datasets have the expected pay-off in terms of prediction performance. This abstract does not reflect U.S. EPA policy.
The ionization state of a chemical, reflected in its pKa values, affects lipophilicity, solubility, protein binding and the ability of a chemical to cross the plasma membrane. These properties govern pharmacokinetic parameters such as absorption, distribution, metabolism, excretion and toxicity; pKa is thus a fundamental chemical property used in many models of chemical toxicity.
Experimentally determining pKa is not feasible for high-throughput assays. Predicting pKa is challenging: existing models have been developed only for restricted chemical spaces (e.g., anilines, phenols, benzoic acids, primary amines), and the lack of a generalized model impedes ADME modeling.
No free and open-source models exist for heterogeneous chemical classes; however, several proprietary programs do. In this work, open pKa data bundled with DataWarrior (http://www.openmolecules.org/) were used to develop predictive models for pKa. After data cleaning, there were ~3100 and ~3900 monoprotic chemicals with an acidic or basic pKa, respectively. 1D and 2D chemical descriptors (AlogP, topological polar surface area, etc.), in addition to 12 fingerprints (presence or absence of a chemical group), were generated using the PaDEL software. Three datasets were used: acidic, basic, and acidic and basic combined.
Thirteen feature sets were examined: the 1D/2D descriptors and the 12 fingerprints. Using the Extreme Gradient Boosting algorithm showed that the MACCS and Substructure Count fingerprints yielded the best results, with models showing an R-squared of ~0.78 and an RMSE of ~1.42.
Recently, Deep Learning models have shown remarkable progress in image recognition and natural language processing. To determine whether Deep Learning algorithms would increase model performance, we examined the datasets and found that the Deep Learning models were somewhat superior to Extreme Gradient Boosting, with an R-squared of ~0.80 and an RMSE of ~1.38.
This work does not reflect U.S. EPA policy.
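For reference, the two reported metrics (R-squared and RMSE) can be computed as below; the pKa values in the example are invented solely to exercise the formulas, not taken from the study.

```python
# Regression metrics as used to compare the pKa models above.
import math

def rmse(y_true, y_pred):
    """Root mean squared error between experimental and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Invented example values, purely to show usage.
y_true = [4.2, 7.1, 9.8, 3.3]
y_pred = [4.0, 7.5, 9.1, 3.9]
print(rmse(y_true, y_pred), r_squared(y_true, y_pred))
```

An RMSE of ~1.4 pKa units, as reported, means predictions are typically off by over an order of magnitude in ionization constant, which is why even the modest Deep Learning improvement matters.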
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data... - Allen Day, PhD
First draft of the upcoming Hadoop World presentation "Renaissance in Medicine", which gives an overview of the upcoming changes in medical practice enabled by Big Data technologies. Specific algorithmic techniques that enable this use case are detailed.
Current advances to bridge the usability-expressivity gap in biomedical seman... - Maulik Kamdar
I presented a talk at the Protege research meeting on the 'Current advances to bridge the usability-expressivity gap in biomedical semantic search (and visualizing linked data)' https://sites.google.com/site/protegeresearchmeeting/meeting-materials/current-advances-to-bridge-the-usability-expressivity-gap-in-semantic-search
A talk I gave at the MMDS workshop in June 2014 on the Myria system, as well as some of Seung-Hee Bae's work on scalable graph clustering.
https://mmds-data.org/
The influence of data curation on QSAR Modeling – Presented at American Chemi... - Kamel Mansouri
This presentation examined the impact of data quality on the construction of QSAR models being developed within the EPA‘s National Center for Computational Toxicology. We have developed a public-facing platform to provide access to predictive models. As part of the work we have attempted to disentangle the influence of the quality versus quantity of data available to develop and validate QSAR models. This abstract does not reflect U.S. EPA policy.
VOLT: A Provenance-Producing, Transparent SPARQL Proxy for the On-Demand Computation of Linked Data & its Applications to Spatiotemporally Dependent Data
Diefficiency Metrics: Measuring the Continuous Efficiency of Query Processing... - Maribel Acosta Deibe
During empirical evaluations of query processing techniques, metrics like execution time, time for the first answer, and throughput are usually reported. Albeit informative, these metrics are unable to quantify and evaluate the efficiency of a query engine over a certain time period – or diefficiency – thus hampering the distinction of cutting-edge engines able to exhibit high performance gradually. We tackle this issue and devise two experimental metrics named dief@t and dief@k, which allow for measuring the diefficiency during an elapsed time period t or while k answers are produced, respectively. The dief@t and dief@k measurement methods rely on the computation of the area under the curve of answer traces, and thus capture the answer concentration over a time interval. We report experimental results of evaluating the behavior of a generic SPARQL query engine using both metrics. Observed results suggest that dief@t and dief@k are able to measure the performance of SPARQL query engines based on both the amount of answers produced by an engine and the time required to generate these answers.
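The area-under-the-curve intuition can be sketched as follows. This follows the description in the abstract; the paper's formal definition of dief@t may differ in detail, so treat this as an illustration rather than a reference implementation.

```python
# Sketch of the AUC idea behind dief@t: given an answer trace (the time
# at which each answer was produced), integrate the cumulative answer
# count over [0, t]. Engines that concentrate answers earlier get a
# larger area for the same t.

def dief_at_t(answer_timestamps, t):
    """answer_timestamps: sorted arrival times of answers."""
    points = [(0.0, 0)]
    count = 0
    for ts in answer_timestamps:
        if ts > t:
            break
        count += 1
        points.append((ts, count))
    points.append((t, count))
    # The cumulative-answer curve is a step function; integrate it exactly
    # by summing count-before-jump times interval length.
    area = 0.0
    for (t0, c0), (t1, _) in zip(points, points[1:]):
        area += c0 * (t1 - t0)
    return area
```

For example, two engines producing the same three answers within t = 2 seconds get different dief@t scores if one delivers them earlier, which is precisely the gradual high-performance behavior the metric is meant to reward.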
Adaptive Semantic Data Management Techniques for Federations of Endpoints - Maribel Acosta Deibe
Emerging technologies that support networks of sensors or mobile smartphones are making available an extremely large volume of data or Big Data; additionally, in the context of the Cloud of Linked Data, a large number of huge RDF linked datasets have become available, and this number keeps growing. Simultaneously, although scalable and efficient RDF engines that follow the traditional optimize-then-execute paradigm have been developed to locally access RDF data, SPARQL endpoints have been implemented for remote query processing. Given the size of existing datasets, the lack of statistics to describe available sources, and the unpredictable conditions of remote queries, existing solutions are still insufficient. First, the most efficient RDF engines base their query processing algorithms on physical access and storage structures that are locally stored; however, because of the size of existing linked datasets, loading the data and their links is not always feasible. Second, remote linked data query processing can be extremely costly because of the lack of query planning; also, current techniques are not adaptable to unpredictable data transfers or data availability, and thus executions can be unsuccessful. To overcome these limitations, query physical operators and execution engines need to be able to access remote data and adapt query execution schedulers to data availability. In this tutorial we present the basis of adaptive query processing frameworks defined in the database area, and their applicability in the Linked and Big Data context where data can be accessed through SPARQL endpoints. This tutorial explains the limitations of existing RDF engines, adaptive query processing techniques, and how traditional RDF data management approaches can be made well suited to runtime conditions and extended to access a large volume of data distributed in federations of SPARQL endpoints.
Semantic Data Management in Graph Databases: ESWC 2014 Tutorial - Maribel Acosta Deibe
In this tutorial we present the basis of graph database frameworks and their applicability in semantic data management. The tutorial targets any conference attendee interested in learning about the current graph-based limited capabilities of existing RDF engines, existing graph database techniques, and extensions to RDF data management approaches in order to provide an efficient graph-based access to linked data.
The tutorial describes existing approaches to model graph databases and different techniques implemented in RDF and Database engines including their main drawbacks when a large volume of interconnected data needs to be traversed.
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills MN
Travis Hills of Minnesota developed a method to convert waste into high-value dry fertilizer, significantly enriching soil quality. By providing farmers with a valuable resource derived from waste, Travis Hills helps enhance farm profitability while promoting environmental stewardship. Travis Hills' sustainable practices lead to cost savings and increased revenue for farmers by improving resource efficiency and reducing waste.
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...Wasswaderrick3
In this book, we use conservation of energy techniques on a fluid element to derive the Modified Bernoulli equation of flow with viscous or friction effects. We derive the general equation of flow/ velocity and then from this we derive the Pouiselle flow equation, the transition flow equation and the turbulent flow equation. In the situations where there are no viscous effects , the equation reduces to the Bernoulli equation. From experimental results, we are able to include other terms in the Bernoulli equation. We also look at cases where pressure gradients exist. We use the Modified Bernoulli equation to derive equations of flow rate for pipes of different cross sectional areas connected together. We also extend our techniques of energy conservation to a sphere falling in a viscous medium under the effect of gravity. We demonstrate Stokes equation of terminal velocity and turbulent flow equation. We look at a way of calculating the time taken for a body to fall in a viscous medium. We also look at the general equation of terminal velocity.
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptxRASHMI M G
Abnormal or anomalous secondary growth in plants. It defines secondary growth as an increase in plant girth due to vascular cambium or cork cambium. Anomalous secondary growth does not follow the normal pattern of a single vascular cambium producing xylem internally and phloem externally.
Toxic effects of heavy metals : Lead and Arsenicsanjana502982
Heavy metals are naturally occuring metallic chemical elements that have relatively high density, and are toxic at even low concentrations. All toxic metals are termed as heavy metals irrespective of their atomic mass and density, eg. arsenic, lead, mercury, cadmium, thallium, chromium, etc.
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxMAGOTI ERNEST
Although Artemia has been known to man for centuries, its use as a food for the culture of larval organisms apparently began only in the 1930s, when several investigators found that it made an excellent food for newly hatched fish larvae (Litvinenko et al., 2023). As aquaculture developed in the 1960s and ‘70s, the use of Artemia also became more widespread, due both to its convenience and to its nutritional value for larval organisms (Arenas-Pardo et al., 2024). The fact that Artemia dormant cysts can be stored for long periods in cans, and then used as an off-the-shelf food requiring only 24 h of incubation makes them the most convenient, least labor-intensive, live food available for aquaculture (Sorgeloos & Roubach, 2021). The nutritional value of Artemia, especially for marine organisms, is not constant, but varies both geographically and temporally. During the last decade, however, both the causes of Artemia nutritional variability and methods to improve poorquality Artemia have been identified (Loufi et al., 2024).
Brine shrimp (Artemia spp.) are used in marine aquaculture worldwide. Annually, more than 2,000 metric tons of dry cysts are used for cultivation of fish, crustacean, and shellfish larva. Brine shrimp are important to aquaculture because newly hatched brine shrimp nauplii (larvae) provide a food source for many fish fry (Mozanzadeh et al., 2021). Culture and harvesting of brine shrimp eggs represents another aspect of the aquaculture industry. Nauplii and metanauplii of Artemia, commonly known as brine shrimp, play a crucial role in aquaculture due to their nutritional value and suitability as live feed for many aquatic species, particularly in larval stages (Sorgeloos & Roubach, 2021).
Nutraceutical market, scope and growth: Herbal drug technologyLokesh Patil
As consumer awareness of health and wellness rises, the nutraceutical market—which includes goods like functional meals, drinks, and dietary supplements that provide health advantages beyond basic nutrition—is growing significantly. As healthcare expenses rise, the population ages, and people want natural and preventative health solutions more and more, this industry is increasing quickly. Further driving market expansion are product formulation innovations and the use of cutting-edge technology for customized nutrition. With its worldwide reach, the nutraceutical industry is expected to keep growing and provide significant chances for research and investment in a number of categories, including vitamins, minerals, probiotics, and herbal supplements.
What is greenhouse gasses and how many gasses are there to affect the Earth.moosaasad1975
What are greenhouse gasses how they affect the earth and its environment what is the future of the environment and earth how the weather and the climate effects.
Seminar of U.V. Spectroscopy by SAMIR PANDASAMIR PANDA
Spectroscopy is a branch of science dealing the study of interaction of electromagnetic radiation with matter.
Ultraviolet-visible spectroscopy refers to absorption spectroscopy or reflect spectroscopy in the UV-VIS spectral region.
Ultraviolet-visible spectroscopy is an analytical method that can measure the amount of light received by the analyte.
Professional air quality monitoring systems provide immediate, on-site data for analysis, compliance, and decision-making.
Monitor common gases, weather parameters, particulates.
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Ana Luísa Pinho
Functional Magnetic Resonance Imaging (fMRI) provides means to characterize brain activations in response to behavior. However, cognitive neuroscience has been limited to group-level effects referring to the performance of specific tasks. To obtain the functional profile of elementary cognitive mechanisms, the combination of brain responses to many tasks is required. Yet, to date, both structural atlases and parcellation-based activations do not fully account for cognitive function and still present several limitations. Further, they do not adapt overall to individual characteristics. In this talk, I will give an account of deep-behavioral phenotyping strategies, namely data-driven methods in large task-fMRI datasets, to optimize functional brain-data collection and improve inference of effects-of-interest related to mental processes. Key to this approach is the employment of fast multi-functional paradigms rich on features that can be well parametrized and, consequently, facilitate the creation of psycho-physiological constructs to be modelled with imaging data. Particular emphasis will be given to music stimuli when studying high-order cognitive mechanisms, due to their ecological nature and quality to enable complex behavior compounded by discrete entities. I will also discuss how deep-behavioral phenotyping and individualized models applied to neuroimaging data can better account for the subject-specific organization of domain-general cognitive systems in the human brain. Finally, the accumulation of functional brain signatures brings the possibility to clarify relationships among tasks and create a univocal link between brain systems and mental functions through: (1) the development of ontologies proposing an organization of cognitive processes; and (2) brain-network taxonomies describing functional specialization. 
To this end, tools to improve commensurability in cognitive science are necessary, such as public repositories, ontology-based platforms and automated meta-analysis tools. I will thus discuss some brain-atlasing resources currently under development, and their applicability in cognitive as well as clinical neuroscience.
HARE: An Engine for Enhancing Answer Completeness of SPARQL Queries via Crowdsourcing
1. HARE:
An Engine for Enhancing Answer Completeness
of SPARQL Queries via Crowdsourcing
Maribel Acosta, Elena Simperl, Fabian Flöck, Maria-Esther Vidal
3. Motivation (1)
Due to the semi-structured nature of RDF,
incomplete values cannot be easily detected.
4. Motivation (2)
SELECT DISTINCT ?drug WHERE {
  ?drug rdf:type dbo:Drug .
  ?drug dbo:atcPrefix "C01" .
  ?drug dbp:routesOfAdministration ?route .
}
Retrieve drugs that are annotated with the prefix “C01” (Cardiac Therapy) in the Anatomical
Therapeutic Chemical (ATC) classification system and which have known routes of administration.
47 drugs (v. 2016)
5. Motivation (2)
SELECT DISTINCT ?drug WHERE {
  ?drug rdf:type dbo:Drug .
  ?drug dbo:atcPrefix "C01" .
  ?drug dbp:routesOfAdministration ?route .
}
Retrieve drugs that are annotated with the prefix “C01” (Cardiac Therapy) in the Anatomical
Therapeutic Chemical (ATC) classification system and which have known routes of administration.
98 drugs (v. 2016). (There are 48 drugs without routes of administration.)
6. Motivation (3)
Examples of drugs (with ATC prefix "C01") with no routes of administration in DBpedia (v. 2016):
• dbr:Acadesine: intravenous administration, for treating leukemia (source: PubChem); also used in doping in sports (source: PubMed).
• dbr:Acetyldigitoxin: oral administration (source: DrugBank).
• dbr:Dimetofrine: no route found.
• dbr:Flecainide: no route found.
7. Problem Definition
Given an RDF dataset D and a SPARQL query Q against D, let D* be the virtual dataset that contains all the data that should be in D.
P1) Identifying portions P of Q that yield missing values: [[P]]D ⊂ [[P]]D*
P2) Resolving missing values: finding mappings µ such that µ ∉ [[P]]D (µ does not belong to the solution of P over D) but µ ∈ [[P]]D* (µ should belong to the solution of P over D*).
9. HARE Overview
SELECT DISTINCT ?drug WHERE {
  ?drug rdf:type dbo:Drug .
  ?drug dbo:atcPrefix "C01" .
  ?drug dbp:routesOfAdministration ?route .
}
[Architecture diagram: the Query Engine evaluates the query over the dataset D, producing mappings such as {?drug → dbr:Ibuprofen} and {?drug → dbr:Flecainide}. Guided by the RDF Completeness Model and the threshold τ, potentially incomplete triple patterns are passed to the Microtask Manager, which resolves missing mappings such as {?drug → dbr:Acadesine} using the crowd knowledge bases CKB+, CKB-, and CKB~.]
10. HARE
• A hybrid machine/human SPARQL query engine that is able to enhance
the size of query answers.
• Based on a novel RDF completeness model, HARE implements query
optimization and execution techniques:
P1) Identifying portions of queries that yield missing values.
• HARE resorts to microtask crowdsourcing:
P2) Resolving missing values.
11. RDF Completeness Model (1)
• Relies on the Local Closed World Assumption (LCWA).
• Estimates the local completeness of resources with respect to other
resources in an RDF graph that belong to the same classes.
[Example graph: dbr:Procainamide, dbr:Flecainide, and dbr:Bretylium all have rdf:type dbo:Drug; their dbp:routesOfAdministration edges are compared to estimate the local completeness of each drug.]
12. RDF Completeness Model (2)
① Multiplicity of an RDF Resource
Number of objects that a resource has for a certain predicate.
MOD(dbr:Procainamide, dbp:routesOfAdministration) = 3
(dbr:Procainamide has three distinct objects for dbp:routesOfAdministration: dbr:Intravenous, dbr:Intramuscular_injection, and dbr:Oral_administration.)
13. RDF Completeness Model (3)
② Aggregated Multiplicity of a Class
Given a predicate, the median number of distinct objects that the resources belonging to a class have for that predicate.
AMOD(dbo:Drug, dbp:routesOfAdministration) = 3
MOD(dbr:Procainamide, dbp:routesOfAdministration) = 3
MOD(dbr:Bretylium, dbp:routesOfAdministration) = 2
(The AMOD value is the median over all dbo:Drug resources in the dataset; only two MOD values are shown here.)
14. RDF Completeness Model (4)
③ Local Completeness of an RDF Resource
Given a predicate, the completeness of an RDF resource is the ratio between its multiplicity (computed in ①) and the aggregated multiplicity of the classes it belongs to (computed in ②).
CompD(dbr:Procainamide | dbp:routesOfAdministration) = 3/3
CompD(dbr:Bretylium | dbp:routesOfAdministration) = 2/3
CompD(dbr:Flecainide | dbp:routesOfAdministration) = 0/3
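The three notions above (①, ②, ③) can be sketched in a few lines of Python. This is our own minimal illustration over a toy triple set, not HARE's implementation; note that in this tiny three-drug graph the median (AMOD) works out to 2, whereas the slides report 3 over the full DBpedia class.

```python
from statistics import median

# Toy RDF graph as (subject, predicate, object) triples, following the
# slides' DBpedia example (only three drugs, so numbers differ from DBpedia).
TRIPLES = [
    ("dbr:Procainamide", "rdf:type", "dbo:Drug"),
    ("dbr:Bretylium", "rdf:type", "dbo:Drug"),
    ("dbr:Flecainide", "rdf:type", "dbo:Drug"),
    ("dbr:Procainamide", "dbp:routesOfAdministration", "dbr:Intravenous"),
    ("dbr:Procainamide", "dbp:routesOfAdministration", "dbr:Intramuscular_injection"),
    ("dbr:Procainamide", "dbp:routesOfAdministration", "dbr:Oral_administration"),
    ("dbr:Bretylium", "dbp:routesOfAdministration", "dbr:Intravenous"),
    ("dbr:Bretylium", "dbp:routesOfAdministration", "dbr:Intramuscular_injection"),
]

def mod(subject, predicate):
    # ① Multiplicity: number of distinct objects of `subject` for `predicate`.
    return len({o for s, p, o in TRIPLES if s == subject and p == predicate})

def amod(cls, predicate):
    # ② Aggregated multiplicity: median MOD over the resources of class `cls`.
    members = {s for s, p, o in TRIPLES if p == "rdf:type" and o == cls}
    return median(mod(m, predicate) for m in members)

def comp(subject, predicate, cls):
    # ③ Local completeness: MOD relative to the class AMOD (capped at 1.0;
    # the cap is our sketch's choice for resources above the median).
    return min(1.0, mod(subject, predicate) / amod(cls, predicate))
```

With this toy graph, `mod("dbr:Procainamide", …)` is 3, `amod("dbo:Drug", …)` is 2 (median of 3, 2, 0), and `comp("dbr:Flecainide", …)` is 0.0, flagging Flecainide as a candidate for crowdsourcing.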
15. Crowd Knowledge Bases (1)
• The knowledge collected from the crowd is captured in three KBs: CKB+, CKB–, and CKB~.
• Each is a fuzzy RDF dataset composed of 4-tuples (subject, predicate, object, membership_degree), where the first three components form an RDF triple.
16. Crowd Knowledge Bases (2)
Types of Crowd Knowledge Bases:
• CKB+ ("Flecainide is administered orally."):
  (dbr:Flecainide, dbp:routesOfAdministration, dbr:Oral_administration, 0.9)
• CKB– ("Flecainide does not have a (known) route of administration."):
  (dbr:Flecainide, dbp:routesOfAdministration, _:o1, 0.05)
• CKB~ ("I am not sure if Acadesine has a route of administration."):
  (dbr:Acadesine, dbp:routesOfAdministration, _:o2, 0.78)
From these knowledge bases, HARE derives the measures Contradiction (C) and Unknownness (U).
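As a rough sketch, the three knowledge bases can be modelled as plain lists of 4-tuples; the `max_degree` helper (our name and aggregation choice, not HARE's API) reads off the membership degrees m+ and m– that the query engine uses later.

```python
# The three crowd knowledge bases as fuzzy RDF 4-tuples
# (subject, predicate, object, membership_degree), using the slide's examples.
CKB_POS = [("dbr:Flecainide", "dbp:routesOfAdministration",
            "dbr:Oral_administration", 0.9)]
CKB_NEG = [("dbr:Flecainide", "dbp:routesOfAdministration", "_:o1", 0.05)]
CKB_UNK = [("dbr:Acadesine", "dbp:routesOfAdministration", "_:o2", 0.78)]

def max_degree(ckb, subject, predicate):
    # Highest membership degree recorded for (subject, predicate) in a CKB;
    # 0.0 when the crowd has said nothing about the pair yet.
    return max((d for s, p, o, d in ckb if s == subject and p == predicate),
               default=0.0)

m_pos = max_degree(CKB_POS, "dbr:Flecainide", "dbp:routesOfAdministration")  # 0.9
m_neg = max_degree(CKB_NEG, "dbr:Flecainide", "dbp:routesOfAdministration")  # 0.05
```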
17. Query Engine (1)
• The engine computes the probability of crowdsourcing a triple pattern t in
query Q, denoted PCROWD(t).
• If PCROWD(t) is greater than a user threshold τ, then the query engine
crowdsources the triple pattern t.
• α is a score weight between 0.0 and 1.0.
PCROWD(t) = α (1 − Comp(t)) + (1 − α) max{ max{m+, m–}, min{C(t), 1 − U(t)} }
The first term captures the estimated incompleteness of t; the second term combines the crowd's unreliability and confidence, derived from the crowd knowledge bases.
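The scoring rule transcribes directly into code. The parameter names below are ours for illustration; HARE's actual implementation may differ.

```python
def p_crowd(comp_t, m_pos, m_neg, contradiction, unknownness, alpha=0.5):
    # PCROWD(t): probability of crowdsourcing triple pattern t.
    # alpha weighs the estimated incompleteness (1 - Comp(t)) against the
    # crowd-derived term max{max{m+, m-}, min{C(t), 1 - U(t)}}.
    crowd_term = max(max(m_pos, m_neg), min(contradiction, 1.0 - unknownness))
    return alpha * (1.0 - comp_t) + (1.0 - alpha) * crowd_term

# A pattern whose subject looks fully incomplete (Comp = 0) and about which
# the crowd has said nothing yet:
score = p_crowd(comp_t=0.0, m_pos=0.0, m_neg=0.0,
                contradiction=0.0, unknownness=0.0, alpha=0.6)
tau = 0.5
should_crowdsource = score > tau  # 0.6 > 0.5, so the pattern is crowdsourced
```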
18. Query Engine (2)
• The engine combines mappings obtained from the dataset D and fuzzy
mappings from the crowd stored in CKB+.
• We define a fuzzy set semantics for SPARQL.
From the dataset D: {?drug → dbr:Isoprenaline, ?route → dbr:Inhalation}
From CKB+: ({?drug → dbr:Isoprenaline, ?route → dbr:Intravenous}, 0.94)
Theorem: The complexity of computing the mapping set of a SPARQL query under fuzzy set semantics is the same as under set semantics.
Corollary: The HARE query engine does not increase the time complexity of computing the mapping set of a SPARQL query.
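One way to picture the fuzzy set semantics: dataset mappings enter with degree 1.0, crowd mappings with their membership degree, and duplicates keep the highest degree. This is our simplified sketch of the combination step, not HARE's actual operator implementation.

```python
def combine(dataset_mappings, fuzzy_mappings):
    # Fuzzy union: mappings from D are certain (degree 1.0); crowd mappings
    # carry their membership degree; duplicates keep the maximum degree.
    result = {}
    for m in dataset_mappings:
        result[frozenset(m.items())] = 1.0
    for m, degree in fuzzy_mappings:
        key = frozenset(m.items())
        result[key] = max(result.get(key, 0.0), degree)
    return result

from_d = [{"?drug": "dbr:Isoprenaline", "?route": "dbr:Inhalation"}]
from_ckb = [({"?drug": "dbr:Isoprenaline", "?route": "dbr:Intravenous"}, 0.94)]
combined = combine(from_d, from_ckb)  # two fuzzy mappings, degrees 1.0 and 0.94
```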
19. Microtask Manager (1)
• Receives triple patterns to crowdsource.
• Creates human tasks.
• Submits tasks to the crowdsourcing platform.
(dbr:Flecainide, dbp:routesOfAdministration, ?route)
20. Microtask Manager (2)
The task interface is generated from the RDF graph describing the resource, e.g., for dbr:Flecainide:
• rdfs:label: "Flecainide"@en
• rdfs:comment: "Flecainide acetate (/flɛˈkeɪnaɪd/ US dict: fle·kā′·nīd) is a class Ic antiarrhythmic agent (...)"
• foaf:depiction: wiki-commons:Special:FilePath/Flecainide_structure.svg
• foaf:isPrimaryTopicOf: http://en.wikipedia.org/wiki/Flecainide
• dbp:routesOfAdministration (rdfs:label "routes of administration"@en)
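A toy illustration of how a triple pattern with an unbound object could be verbalized into a human task from rdfs:label values. The `to_question` helper is hypothetical; HARE's real interfaces also draw on comments, depictions, and links.

```python
# rdfs:label values for the resources in the triple pattern
# (dbr:Flecainide, dbp:routesOfAdministration, ?route).
LABELS = {
    "dbr:Flecainide": "Flecainide",
    "dbp:routesOfAdministration": "routes of administration",
}

def to_question(subject, predicate):
    # Hypothetical helper: render a triple pattern as a crowd question.
    return f"What are the {LABELS[predicate]} of {LABELS[subject]}?"

question = to_question("dbr:Flecainide", "dbp:routesOfAdministration")
# 'What are the routes of administration of Flecainide?'
```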
22. Experimental Settings
• Benchmark: 50 queries against DBpedia (English version, 2014).
• Ten queries in five different knowledge domains:
History, Life Sciences, Movies, Music, and Sports.
• Implementation details:
• Dataset (queries executed directly against the dataset).
• HARE (our proposed approach).
• HARE BL (generates microtask interfaces replacing URIs by labels).
• Crowdsourcing configuration:
• The crowd is reached via CrowdFlower.
• Four different triple patterns per task, 0.07 US$ per task (Sep. 2015).
• At least 3 answers were collected per task.
23. Overview of the Results
• Total triple patterns crowdsourced: 1,004
• Total answers collected from the crowd: 3,163
• 75%-98% of the crowd answers were produced in 12 minutes.
24. Effectiveness of the RDF Completeness Model
[Plot: number of crowdsourced triple patterns (0 to 1,500) vs. threshold τ (0.00 to 1.00), per domain: Sports, History, Life Sciences, Music, Movies.]
The RDF completeness model considerably reduces the number of triple patterns to crowdsource (τ >= 0.5).
25. Completeness of Query Answers
[Plot: recall w.r.t. D* (0.00 to 1.00) of Dataset, HARE-BL, and HARE for queries Q1-Q10 in each domain: Sports, Music, Life Sciences, Movies, History.]
Recall varies across queries and knowledge domains.
Completing answers in certain domains is more challenging.
26. Completeness of Query Answers (cont.)
[Same plot; checkmarks indicate the queries for which HARE achieves the highest recall.]
HARE outperforms the other approaches across all knowledge domains.
Our RDF completeness model captures the skewed distributions of values.
27. Quality of Crowd Answers: Precision
The crowd exhibits heterogeneous performance within domains.
This supports the importance of HARE's triple-based approach.
28. Quality of Crowd Answers: Precision
The precision of the crowd answers is in general higher when
crowdsourcing semantically enriched tasks.
30. Conclusions
• HARE: hybrid query engine against RDF datasets.
• Supports microtasks to enhance query answers on-the-fly.
• Experimental results confirmed that:
  • HARE increases the size of query answers by 3.13 to 12 times.
  • The precision of the crowd answers ranges from 0.62 to 0.97.
  • Crowd quality is higher with semantically enriched tasks.
Future work
• Study further approaches to capture crowd reliability.
• Consider other quality dimensions of the knowledge collected from the crowd.
31. HARE: An Engine for Enhancing Answer Completeness of SPARQL Queries via Crowdsourcing
Maribel Acosta, Elena Simperl, Fabian Flöck, Maria-Esther Vidal