A brief introduction to ontology-based data access (OBDA for short) and its core implementation. I also present a recent, simple benchmark of -ontop- against Semantika---two readily available OBDA implementations---in terms of query performance (details are in the appendix section). The slides were presented at the Friday Research Meeting of the Stanford Center for Biomedical Informatics Research (BMIR).
License: Creative Commons Attribution 3.0
Ontology-based data access: why it is so cool!
1. Ontology-Based Data Access: Why It Is So Cool!
Josef Hardi
josef.hardi@stanford.edu
September 4, 2015
Ontology-Based Data Access is a concept developed by Diego Calvanese and Mariano Rodriguez-Muro at the KRDB Research Centre of the Free University of Bozen-Bolzano.
2. Outline
● What is Ontology-based Data Access, or OBDA?
○ Motivation
○ System Black Box
○ Process Illustration
● Project -ontop- and Quest
● Experiment
○ Query Answering Performance
○ -ontop- vs Semantika
● Conclusion
● Q&A
3. Acknowledgement
Parts of the slides in this presentation are taken from
tutorial or lecture slides by:
Diego Calvanese,
Mariano Rodriguez-Muro, and
Martin Rezk
5. Consider a scenario
Diagram: data layer, data service, conceptual view.
Image source: (various sources)
What is Ontology-based Data Access?
6. Data Access Bottleneck
Image source: Rezk, Martin. Ontologies Ontop Databases http://www.slideshare.net/MartnRezk/slides-swat4-ls
What is Ontology-based Data Access?
7. Query Answering
tbl_patient+2015
PatientId Name Cell_type cStage
1 Mary true 7
2 John false 6
3 Bill false 4
Cancer type is:
● NSCLC is when Cell_type is
false,
● SCLC is when Cell_type is
true.
Cancer stage is:
● I, II, III, IIIa, IIIb, IV for
NSCLC, corr. cStage: 1 - 6,
● Limited and Extensive for
SCLC, corr. cStage: 7 and 8.
There is “hidden logic” inside the table that is specific to the application and is not meant for querying the data!
8. Query Answering
tbl_patient+2015
PatientId Name Cell_type cStage
1 Mary true 7
2 John false 6
3 Bill false 4
Name cStage
John 6
Bill 4
RESULT
select Name, cStage
from tbl_patient+2015
where Cell_type = false
and cStage >= 4;
9. Can we do it better?
Show me the name and stage status of all patients who have a large tumor at stage IIIa or higher.
Query Answering
10. Bridge the semantics
tbl_patient+2015
PatientId Name Cell_type cStage
1 Mary true 7
2 John false 6
3 Bill false 4
Cancer type is:
● NSCLC is when Cell_type is
false,
● SCLC is when Cell_type is
true.
Cancer stage is:
● I, II, III, IIIa, IIIb, IV for
NSCLC,
● Limited and Extensive for
SCLC.
Ontology diagram: an ISA hierarchy with the properties name, hasStage, and hasNeoplasm, linked to SNOMED-CT.
*SCLC = Small Cell Lung Cancer, NSCLC = Non-Small Cell Lung Cancer
Query Answering
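To make this bridge concrete, here is a minimal sketch of the decoding a mapping has to capture, written as a plain SQL view over the slide's example table; the view name is illustrative, and in a virtual OBDA setting this logic would live in the mapping rather than as a real view in the database:

-- Illustrative only: decode the application-specific Cell_type/cStage encoding
-- into the terms used by the ontology (NSCLC/SCLC and the stage labels).
CREATE VIEW v_patient_semantics AS
SELECT PatientId,
       Name,
       CASE WHEN Cell_type = false THEN 'NSCLC' ELSE 'SCLC' END AS cancer_type,
       CASE cStage
            WHEN 1 THEN 'I'    WHEN 2 THEN 'II'   WHEN 3 THEN 'III'
            WHEN 4 THEN 'IIIa' WHEN 5 THEN 'IIIb' WHEN 6 THEN 'IV'
            WHEN 7 THEN 'Limited' WHEN 8 THEN 'Extensive'
       END AS cancer_stage
FROM `tbl_patient+2015`;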
11. OBDA Answering
● (Data) Sources: the external and independent resources; existing organizational assets.
● Ontology: provides a unified common vocabulary; the conceptual view of the underlying data.
● Mappings: relate the terms in the ontology to a set of SQL views.
Image source: Rezk, Martin. Ontologies Ontop Databases http://www.slideshare.net/MartnRezk/slides-swat4-ls
Query Answering
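As a rough sketch (assumed view names; in -ontop- these views are virtual and are declared in the mapping file rather than created in the database), each ontology class could be backed by an SQL view over the source tables used in the unfolding illustration below:

-- Illustrative only: one SQL view per ontology term.
CREATE VIEW v_nurse          AS SELECT NurseId   AS id FROM tbl_nurse;            -- instances of Nurse
CREATE VIEW v_doctor         AS SELECT doc_id    AS id FROM tbl_doctor;           -- instances of Doctor
CREATE VIEW v_patient        AS SELECT pid       AS id FROM tbl_patient;          -- instances of Patient
CREATE VIEW v_cancer_patient AS SELECT PatientId AS id FROM `tbl_patient+2015`;   -- anyone who has a Neoplasm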
12. OBDA Answering Black Box
● Rewriting: Create a new query which is the expanded
version of the original query, using all the defined
inclusion assertions in the ontology.
● Unfolding: Substitute each part in the expanded query
with corresponding SQL views from the given mappings.
● Evaluation: Execute the complete SQL query on the target RDBMS.
Image source: Kontchakov, Roman, et.al. Ontology-based Data Access: Ontop of Databases. http://www.dcs.bbk.ac.uk/~roman/papers/ISWC13.pdf
Query Answering
13. OBDA Answering Illustration
Q: Show me all the Person in the hospital?
Q’: Show me
all the Person UNION
all the Nurse UNION
all the Doctor UNION
all the Patient UNION
anyone who has
Neoplasm in the hospital?
Rewritten
14. Look where the source(s) are
(No source)
Q’: Show me
all the Person UNION
all the Nurse UNION
all the Doctor UNION
all the Patient UNION
anyone who has
Neoplasm
in the hospital?
Get the list from table Nurse
Get the list from table Doctor
Get the list from table Patient
Get the list from table Cancer Patient 2015
OBDA Answering Illustration
15. Substitute with SQL views
Q’: Show me
all the Person UNION
select NurseId from tbl_nurse UNION
select doc_id from tbl_doctor UNION
select pid from tbl_patient UNION
select PatientId from tbl_patient+2015
in the hospital?
OBDA Answering Illustration
Unfolded
16. Execute the SQL
select NurseId from tbl_nurse
UNION
select doc_id from tbl_doctor
UNION
select pid from tbl_patient
UNION
select PatientId from tbl_patient+2015
OBDA Answering Illustration
Evaluated
17. 42!
(Computational) Price to Pay
Query answering in the OBDA setting is:
● PTIME in the size of the ontology (efficiently tractable)
● AC⁰ in the size of the data (very efficiently tractable)
● NP-complete in the size of the query (exponential)
*Tractable problem: there exists an algorithm that will eventually terminate in a reasonable amount of time and return the result.
OBDA Answering Illustration
18.
19. -ontop- Project
● A platform to query relational databases using the SPARQL language,
● The implementation started in 2010,
● Supports several database systems, such as MySQL, PostgreSQL, H2, SQL Server, Oracle, and IBM DB2.
● Distributed under an open-source license.
● It is currently being developed within the context of the EU Optique project.
● Fantastic add-ons: efficient rewriting, query optimization, transitive queries, rule entailment, cross-linked datasets.
-ontop-
23. Berlin SPARQL Benchmark (BSBM)
● A benchmark suite built around an e-commerce domain.
○ A set of products is offered by different vendors and
customers are posting product reviews.
● Consists of 12 different queries, emulating
the search and navigation pattern of a
consumer looking for a product.
● A Query-Mix consists of 25 querying actions
that simulate a product search scenario.
● No inference.
Experiment
24. BSBM-100
● Dataset of 100 million triples,
● Transformed into relational db schema:
offer > 5.7 million rows
person > 147 thousand rows
producer > 5 thousand rows
product > 288 thousand rows
productfeature > 47 thousand rows
productfeatureproduct > 5.5 million rows
producttype > 2 thousand rows
producttypeproduct > 1.4 million rows
review > 2.8 million rows
vendor > 2 thousand rows
Experiment
25. Test Databases
● MySQL - v5.6
○ Vanilla
○ Optimized
■ CREATE INDEX
■ OPTIMIZE TABLE - ANALYZE
● PostgreSQL - v9.4.4
○ Vanilla
○ Optimized
■ CREATE INDEX
■ VACUUM TABLE - ANALYZE
Experiment
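For reference, a minimal sketch of what the "Optimized" variants amount to; the indexed column is illustrative, since the slides do not list which columns were indexed:

-- MySQL 5.6 (illustrative index; the actual indexed columns are not listed in the slides)
CREATE INDEX idx_review_product ON review (product);
OPTIMIZE TABLE review;      -- rebuilds the table storage
ANALYZE TABLE review;       -- refreshes key-distribution statistics for the optimizer

-- PostgreSQL 9.4
CREATE INDEX idx_review_product ON review (product);
VACUUM ANALYZE review;      -- reclaims dead space and updates planner statistics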
26. Test Machine
● MacBook Pro
○ OS X Yosemite 64-bit
○ Java 8 (build 1.8.0_51-b16)
○ Intel Core i7 3 GHz
○ Memory 16 GB
○ Flash storage
○ Direct connection - no network cost
Experiment
27. Benchmark Flow
for each obda-endpoint do:
for each dbms do:
for each dbms-variant do:
start endpoint;
start dbms;
repeat 2 times:
run ‘benchmark -runs 100 -w 10’;
stop dbms;
stop endpoint;
Experiment
29. Conclusion
● OBDA offers a non-invasive solution for existing (legacy) database systems, enabling a better data access service.
● A lot of interesting topics can be harvested from OBDA use case scenarios.
○ The health and clinical domain, perhaps?
● OBDA performance relies heavily on the
efficiency of the underlying data
infrastructure (both HW and SW).
32. Query Answering over Database
Image source: Calvanese, Diego. Ontology-Based Data Access and Integration. https://www.essi.upc.edu/docs/slides-obda-2010-02-08
34. Query Answering over Ontology
Image source: Calvanese, Diego. Ontology-Based Data Access and Integration. https://www.essi.upc.edu/docs/slides-obda-2010-02-08
36. Query Answering via Rewriting
Image source: Calvanese, Diego. Ontology-Based Data Access and Integration. https://www.essi.upc.edu/docs/slides-obda-2010-02-08
52. Ontop SQL Creation
SELECT
3 AS `titleQuestType`, NULL AS `titleLang`, QVIEW1.`title` AS `title`,
10 AS `publishDateQuestType`, NULL AS `publishDateLang`, CAST
(QVIEW1.`publishDate` AS CHAR(8000) CHARACTER SET utf8) AS
`publishDate`
FROM review QVIEW1
WHERE
(QVIEW1.`product` = '62033') AND
(QVIEW1.`producer` = '1245') AND
QVIEW1.`publisher` IS NOT NULL AND
QVIEW1.`nr` IS NOT NULL AND
QVIEW1.`title` IS NOT NULL AND
QVIEW1.`publishDate` IS NOT NULL
53. Semantika SQL Creation
SELECT `OBDA_VIEW1`.`title` AS `title`,
`OBDA_VIEW1`.`publishDate` AS `publishDate`
FROM `bsbm100`.`review` AS `OBDA_VIEW1`
WHERE `OBDA_VIEW1`.`publisher` IS NOT NULL AND
`OBDA_VIEW1`.`product` = 62033 AND
`OBDA_VIEW1`.`publishDate` IS NOT NULL AND
`OBDA_VIEW1`.`nr` IS NOT NULL AND
`OBDA_VIEW1`.`title` IS NOT NULL AND
`OBDA_VIEW1`.`producer` = 1245
55. Ontop SQL Creation
SELECT
1 AS `reviewQuestType`, NULL AS `reviewLang`, CONCAT('http://www4.wiwiss.fu-berlin.
de/bizer/bsbm/v01/instances/dataFromRatingSite', REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE
(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(CAST(QVIEW1.`publisher` AS CHAR
(8000) CHARACTER SET utf8),' ', '%20'),'!', '%21'),'@', '%40'),'#', '%23'),'$', '%24'),'&', '%26'),'*', '%42'), '(', '%28'), ')', '%29'), '[', '%5B'), ']', '%5D'),
',', '%2C'), ';', '%3B'), ':', '%3A'), '?', '%3F'), '=', '%3D'), '+', '%2B'), '''', '%22'), '/', '%2F'), '/Review', REPLACE(REPLACE(REPLACE(REPLACE(REPLACE
(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE
(CAST(QVIEW1.`nr` AS CHAR(8000) CHARACTER SET utf8),' ', '%20'),'!', '%21'),'@', '%40'),'#', '%23'),'$', '%24'),'&', '%26'),'*', '%42'), '(', '%28'),
')', '%29'), '[', '%5B'), ']', '%5D'), ',', '%2C'), ';', '%3B'), ':', '%3A'), '?', '%3F'), '=', '%3D'), '+', '%2B'), '''', '%22'), '/', '%2F')) AS `review`,
3 AS `titleQuestType`, NULL AS `titleLang`, QVIEW1.`title` AS `title`,
10 AS `publishDateQuestType`, NULL AS `publishDateLang`, CAST(QVIEW1.`publishDate` AS CHAR(8000) CHARACTER SET utf8) AS
`publishDate`,
4 AS `rating1QuestType`, NULL AS `rating1Lang`, CAST(QVIEW1.`rating1` AS CHAR(8000) CHARACTER SET utf8) AS `rating1`,
4 AS `rating2QuestType`, NULL AS `rating2Lang`, CAST(QVIEW2.`rating2` AS CHAR(8000) CHARACTER SET utf8) AS `rating2`
FROM (
review QVIEW1
LEFT OUTER JOIN review QVIEW2
ON (QVIEW1.`nr` = QVIEW2.`nr`) AND
(QVIEW1.`publisher` = QVIEW2.`publisher`) AND
QVIEW2.`rating2` IS NOT NULL AND
QVIEW1.`publisher` IS NOT NULL AND
QVIEW1.`nr` IS NOT NULL
)
WHERE
QVIEW1.`title` IS NOT NULL AND
QVIEW1.`nr` IS NOT NULL AND
QVIEW1.`publishDate` IS NOT NULL AND
(QVIEW1.`product` = '62033') AND
QVIEW1.`publisher` IS NOT NULL AND
QVIEW1.`rating1` IS NOT NULL AND
(QVIEW1.`producer` = '1245')
56. Semantika SQL Creation
SELECT CONCAT('http://www4.wiwiss.fu-berlin.
de/bizer/bsbm/v01/instances/dataFromRatingSite{1}/Review{2}',' : ','"',
`OBDA_VIEW1`.`publisher`,'" "',`OBDA_VIEW1`.`nr`,'"') AS `review`,
`OBDA_VIEW1`.`title` AS `title`,
`OBDA_VIEW1`.`publishDate` AS `publishDate`,
`OBDA_VIEW1`.`rating1` AS `rating1`,
`OBDA_VIEW1`.`rating2` AS `rating2`
FROM `bsbm100_optimized`.`review` AS `OBDA_VIEW1`
WHERE `OBDA_VIEW1`.`publisher` IS NOT NULL AND
`OBDA_VIEW1`.`product` = 62033 AND
`OBDA_VIEW1`.`publishDate` IS NOT NULL AND
`OBDA_VIEW1`.`nr` IS NOT NULL AND
`OBDA_VIEW1`.`title` IS NOT NULL AND
`OBDA_VIEW1`.`rating1` IS NOT NULL AND
`OBDA_VIEW1`.`producer` = 1245