Ontologies are seeing a resurgence of interest and usage as big data proliferates, machine learning advances, and integration of data becomes more paramount. The previous models of sometimes labor-intensive, centralized ontology construction and maintenance do not mesh well in today’s interdisciplinary world that is in the midst of a big data, information extraction, and machine learning explosion. In this talk, we will provide some historical perspective on ontologies and their usage, and discuss a model of building and maintaining large collaborative, interdisciplinary ontologies along with the data repositories and data services that they empower. We will give a few examples of heterogeneous semantic data resources made more interconnected and more powerful by ontology-supported infrastructures, discuss a vision for ontology-enabled future research and provide some examples in a large health empowerment joint effort between RPI and IBM Watson Health.
The document discusses recent advances in generative adversarial networks (GANs) for image generation. It summarizes two influential GAN models: ProgressiveGAN (Karras et al., 2018) and BigGAN (Brock et al., 2019). ProgressiveGAN introduced progressive growing of GANs to produce high-resolution images. BigGAN scaled up GAN training through techniques like large batch sizes and regularization methods to generate high-fidelity natural images. The document also discusses using GANs to generate full-body, high-resolution anime characters and adding motion through structure-conditional GANs.
Choosing technologies for a big data solution in the cloud (James Serra)
Has your company been building data warehouses for years using SQL Server? And are you now tasked with creating or moving your data warehouse to the cloud and modernizing it to support “Big Data”? What technologies and tools should you use? That is what this presentation will help you answer. First we will cover what questions to ask concerning data (type, size, frequency), reporting, performance needs, on-prem vs cloud, staff technology skills, OSS requirements, cost, and MDM needs. Then we will show you common big data architecture solutions and help you to answer questions such as: Where do I store the data? Should I use a data lake? Do I still need a cube? What about Hadoop/NoSQL? Do I need the power of MPP? Should I build a "logical data warehouse"? What is this lambda architecture? Can I use Hadoop for my DW? Finally, we’ll show some architectures of real-world customer big data solutions. Come to this session to get started down the path to making the proper technology choices in moving to the cloud.
What is business intelligence? Where have we been, where are we now, and where are we going? These slides provide a brief history of business intelligence, enjoy.
Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018 (Amazon Web Services)
Realizing the value of social media analytics can bolster your business goals. This type of analysis has grown in recent years due to the large amount of available information and the speed at which it can be collected and analyzed. In this workshop, we build a serverless data processing and machine learning (ML) pipeline that provides a multi-lingual social media dashboard of tweets within Amazon QuickSight. We leverage API-driven ML services, AWS Glue, Amazon Athena and Amazon QuickSight. These building blocks are put together with very little code by leveraging serverless offerings within AWS.
Phar Data Platform: From the Lakehouse Paradigm to the Reality (Databricks)
Despite the increased availability of ready-to-use generic tools, more and more enterprises are deciding to build in-house data platforms. This practice, common for some time in research labs and digital-native companies, is now making waves across large enterprises that traditionally used proprietary solutions and outsourced most of their IT. The availability of large volumes of data, coupled with increasingly complex analytical use cases driven by innovations in data science, has caused these traditional, on-premise architectures to become obsolete in favor of cloud architectures powered by open source technologies.
The idea of building an in-house platform at a larger enterprise comes with many challenges of its own: building an architecture that combines the best elements of data lakes and data warehouses to accommodate all kinds of use cases, from BI to ML; the need to interoperate with all the company’s data and technology, including legacy systems; and cultural transformation, including a commitment to adopt agile processes and data-driven approaches.
This presentation describes a success story on building a Lakehouse in an enterprise such as LIDL, a successful chain of grocery stores operating in 32 countries worldwide. We will dive into the cloud-based architecture for batch and streaming workloads based on many different source systems of the enterprise and how we applied security on architecture and data. We will detail the creation of a curated Data Lake comprising several layers from a raw ingesting layer up to a layer that presents cleansed and enriched data to the business units as a kind of Data Marketplace.
A lot of focus and effort went into building a semantic Data Lake as a sustainable and easy-to-use basis for the Lakehouse, as opposed to just dumping source data into it. The first use case applied to the Lakehouse is the Lidl Plus Loyalty Program. It is already deployed to production in 26 countries, with more than 30 million customers’ data being analyzed daily. In parallel to productionizing the Lakehouse, a cultural and organizational change process was undertaken to get all involved units to buy into the new data-driven approach.
Modern Data Challenges require Modern Graph Technology (Neo4j)
This session focuses on key data trends and challenges impacting enterprises. And, how graph technology is evolving to future-proof data strategy and architectures.
This document provides an introduction and overview of implementing Data Vault 2.0 on Snowflake. It begins with an agenda and the presenter's background. It then discusses why customers are asking for Data Vault and provides an overview of the Data Vault methodology including its core components of hubs, links, and satellites. The document applies Snowflake features like separation of workloads and agile warehouse scaling to support Data Vault implementations. It also addresses modeling semi-structured data and building virtual information marts using views.
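The hub/link/satellite decomposition mentioned above can be sketched as plain data structures (a schematic illustration of the Data Vault 2.0 modeling idea only, not Snowflake-specific code; all table and field names here are hypothetical):

```python
from dataclasses import dataclass, field
from datetime import datetime

# Data Vault 2.0 in miniature:
# - a Hub stores just a business key (plus load metadata),
# - a Link relates hub keys to each other,
# - a Satellite hangs descriptive, time-varying attributes off a hub or link.

@dataclass
class Hub:
    business_key: str                 # e.g. a customer number
    load_ts: datetime = field(default_factory=datetime.now)
    source: str = "crm"               # record source (hypothetical)

@dataclass
class Link:
    hub_keys: tuple                   # keys of the hubs being related
    load_ts: datetime = field(default_factory=datetime.now)

@dataclass
class Satellite:
    parent_key: str                   # the hub or link this describes
    attributes: dict                  # descriptive payload
    load_ts: datetime = field(default_factory=datetime.now)

customer = Hub("CUST-001")
order = Hub("ORD-9")
placed = Link((customer.business_key, order.business_key))
details = Satellite(customer.business_key, {"name": "Ada", "segment": "retail"})
print(placed.hub_keys, details.attributes["name"])
```

Separating immutable keys (hubs/links) from changing attributes (satellites) is what lets new sources and history be added without reworking existing tables.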
This talk is an introduction to the vector search engine Weaviate. You will learn how storing data as vectors enables semantic search and automatic data classification. Topics such as the underlying vector storage mechanism, and how the pre-trained language vectorization model enables it, are touched on. In addition, the presentation includes live demos that show the power of Weaviate and how you can get started with your own datasets. No prior technical knowledge is required; all concepts are illustrated with real use-case examples and live demos.

Most of the world's data is unstructured. Additionally, data is often stored without context, meaning, or relation to concepts in the real world, which makes it difficult to index, classify, and search through. While this is traditionally solved by manual effort or expensive machine learning models, Weaviate takes another approach: it stores data as vectors and automatically adds context and meaning to new data. This makes it possible to search through the data without exact-match keywords, and data can be automatically classified. Weaviate is completely open source, has a built-in machine learning model, has a graph-like data model, is completely API-based, and is cloud-native. It offers a GraphQL API next to RESTful endpoints for interacting with the data in an intuitive manner, and Python, Go, and JavaScript clients are available to facilitate interaction between Weaviate and your applications. GraphQL and client examples will be shown in the presentation.
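The core idea (storing items as vectors and retrieving by similarity rather than exact keyword match) can be sketched in a few lines. This is a toy illustration, not Weaviate's actual implementation; the vectors below are hypothetical stand-ins for what a pre-trained language model would produce:

```python
from math import sqrt

def cosine(a, b):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "database": each object is stored with an embedding vector.
# In a vector search engine these vectors come from a language model.
store = {
    "wine":   [0.9, 0.1, 0.0],
    "grape":  [0.8, 0.2, 0.1],
    "laptop": [0.0, 0.1, 0.9],
}

def semantic_search(query_vec, k=2):
    # rank stored objects by similarity to the query vector
    ranked = sorted(store, key=lambda name: cosine(store[name], query_vec),
                    reverse=True)
    return ranked[:k]

# A query vector "near" wine-related concepts retrieves them,
# even though no keyword was matched.
print(semantic_search([0.85, 0.15, 0.05]))
```

Real engines replace the linear scan with an approximate nearest-neighbor index so the search stays fast at millions of objects.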
Sharded Redis With Sentinel Vs Redis Cluster: What We Learned: Patrick King (Redis Labs)
Redis was initially used for caching and locking at New Relic. As usage grew, manual sharding was required, using application-level hashing to distribute keys across multiple Redis instances. This led to configuration and deployment challenges. New Relic then upgraded to Redis Cluster, which automatically shards and distributes keys. This removed the need for manual sharding code and provided an easier-to-manage clustered Redis deployment. Going forward, New Relic plans to deploy Redis Cluster using Kubernetes for automated operations and to improve monitoring of the clustered Redis infrastructure.
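The application-level hashing described above can be sketched as follows (a minimal illustration of the pre-cluster approach; the instance labels and key format are hypothetical):

```python
import hashlib

# Hypothetical pool of standalone Redis instances (host:port labels).
INSTANCES = ["redis-0:6379", "redis-1:6379", "redis-2:6379"]

def shard_for(key: str) -> str:
    # Hash the key and map it onto one instance by modulo.
    # This is what "application-level hashing" means: the client,
    # not Redis, decides where each key lives. Adding or removing an
    # instance remaps most keys, which is one reason Redis Cluster's
    # server-side hash-slot assignment is easier to operate.
    digest = hashlib.md5(key.encode()).hexdigest()
    return INSTANCES[int(digest, 16) % len(INSTANCES)]

# Every client must agree on this function for reads to find writes.
print(shard_for("user:42:session"))
```

Redis Cluster replaces this client-side convention with 16384 fixed hash slots owned by the servers themselves, so resharding moves slots rather than rewriting application code.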
KDD2021 Paper Reading Session: Markdowns in e commerce fresh retail a counterfactual predict... (Haruka Matsuzaki)
This document summarizes a reading session of a paper on markdown price prediction and optimization in e-commerce.
The paper is from Alibaba and proposes a two-stage approach: 1) Using machine learning to predict demand based on product and category features, and 2) Formulating the pricing problem as a Markov decision process to find the optimal multi-period pricing policy via the Bellman equation.
The approach jointly optimizes prices across stores with the goal of increasing gross merchandise volume by over 20% according to the authors.
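The second stage, multi-period markdown pricing as a Markov decision process solved via the Bellman equation, can be sketched with finite-horizon backward induction over inventory states. This is a toy model with made-up price and demand numbers, not the paper's actual formulation:

```python
# Toy finite-horizon markdown MDP: state = remaining inventory,
# action = price, reward = revenue; leftover stock is worthless.
PRICES = [10.0, 8.0, 5.0]           # allowed markdown levels (hypothetical)
DEMAND = {10.0: 1, 8.0: 2, 5.0: 4}  # units sold at each price (toy, deterministic)
HORIZON = 3                          # number of selling periods

def solve(inventory: int):
    # V[t][s] = best achievable revenue from period t with s units left
    V = [[0.0] * (inventory + 1) for _ in range(HORIZON + 1)]
    policy = [[None] * (inventory + 1) for _ in range(HORIZON)]
    for t in range(HORIZON - 1, -1, -1):          # backward induction
        for s in range(inventory + 1):
            best, best_p = 0.0, None
            for p in PRICES:
                sold = min(DEMAND[p], s)
                value = p * sold + V[t + 1][s - sold]   # Bellman equation
                if value > best:
                    best, best_p = value, p
            V[t][s] = best
            policy[t][s] = best_p
    return V[0][inventory], policy

revenue, policy = solve(6)
print(revenue, policy[0][6])  # selling 2 units/period at 8.0 beats deep markdowns
```

In the paper's setting the demand numbers would come from the first-stage machine learning model rather than a fixed table, and demand would be stochastic, but the backward recursion over periods and inventory is the same shape.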
Business intelligence in the real time economy (Johan Blomme)
1. Business intelligence is evolving from reactive, historical reporting to real-time decision making embedded in business processes. This allows for more proactive responses to changing market conditions.
2. There is a shift towards self-service business intelligence where all employees can access, analyze, and share real-time data to improve decision making. Technologies like in-memory analytics enable faster, interactive analysis.
3. Collaboration and sharing of insights is facilitated by new interactive dashboard and visualization tools with Web 2.0 features. Business intelligence is becoming more user-centric and accessible for all employees.
BigQuery is Google Cloud Platform's interactive big data service that allows users to analyze massive datasets in seconds using SQL-like queries. It offers a scalable and fast way to query terabytes of data without the expense of maintaining servers or databases. BigQuery organizes data into a project-dataset-table hierarchy and uses a distributed architecture to efficiently process queries across servers.
There are patterns for things such as domain-driven design, enterprise architectures, continuous delivery, microservices, and many others.
But where are the data science and data engineering patterns?
Sometimes, data engineering reminds me of cowboy coding - many workarounds, immature technologies and lack of market best practices.
Introduction to our data warehouse solution, BigQuery.
The Google Cloud Platform products are based on our internal systems which are powering Google AdWords, Search, YouTube and our leading research in the field of real-time data analysis.
You can get access ($300 for 60 days) to our free trial through google.com/cloud
This document discusses using ARIMA models with BigQuery ML to analyze time series data. It provides an overview of time series data and ARIMA models, including how ARIMA models incorporate AR and MA components as well as differencing. It also demonstrates how to create an ARIMA prediction model and visualize results using BigQuery ML and Google Data Studio. The document concludes that ARIMA models in BigQuery ML can automatically select the optimal order for time series forecasting and that multi-variable time series are not yet supported.
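The ingredients named above (an AR component, differencing, and forecasting) can be illustrated with a hand-rolled ARIMA(1,1,0) on a toy series. This is a sketch of the mechanics only; BigQuery ML's ARIMA handles all of it, including order selection, automatically:

```python
# ARIMA(1,1,0) by hand on a toy series:
# 1) difference once (the "I" part), 2) fit AR(1) to the differences
#    by least squares (the "AR" part), 3) forecast and un-difference.
series = [10.0, 12.0, 13.0, 15.0, 16.0, 18.0, 19.0]

# Step 1: first differences remove the trend
diff = [b - a for a, b in zip(series, series[1:])]

# Step 2: AR(1) coefficient via closed-form least squares (no intercept)
num = sum(x * y for x, y in zip(diff, diff[1:]))
den = sum(x * x for x in diff[:-1])
phi = num / den

# Step 3: forecast the next few differences, then cumulate back to levels
forecasts, last_level, last_diff = [], series[-1], diff[-1]
for _ in range(3):
    last_diff = phi * last_diff          # AR(1) one-step-ahead forecast
    last_level = last_level + last_diff  # undo the differencing
    forecasts.append(last_level)

print(phi, forecasts)
```

Picking the right (p, d, q) order normally requires inspecting autocorrelations or comparing information criteria; the automatic order selection noted in the document does that search for you.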
Cyberenvironments integrate shared and custom cyberinfrastructure resources into a process-oriented framework to support scientific communities and allow researchers to focus on their work rather than managing infrastructure. They enable more complex multi-disciplinary challenges to be tackled through enhanced knowledge production and application. Key challenges include coordinating distributed resources and users without centralization and evolving systems rapidly to keep pace with advancing science.
Amit Sheth with TK Prasad, "Semantic Technologies for Big Science and Astrophysics", Invited Plenary Presentation, at Earthcube Solar-Terrestrial End-User Workshop, NJIT, Newark, NJ, August 13, 2014.
Like many other fields of Big Science, Astrophysics and Solar Physics deal with the challenges of Big Data, including Volume, Variety, Velocity, and Veracity. There is already significant work on handling volume-related challenges, including the use of high performance computing. In this talk, we will mainly focus on other challenges from the perspective of collaborative sharing and reuse of a broad variety of data created by multiple stakeholders, large and small, along with tools that offer semantic variants of search, browsing, integration, and discovery capabilities. We will borrow examples of tools and capabilities from state-of-the-art work in supporting physicists (including astrophysicists) [1], life sciences [2], and material sciences [3], and describe the role of semantics and semantic technologies that make these capabilities possible or easier to realize. This applied and practice-oriented talk will complement more vision-oriented counterparts [4].
[1] Science Web-based Interactive Semantic Environment: http://sciencewise.info/
[2] NCBO Bioportal: http://bioportal.bioontology.org/ , Kno.e.sis’s work on Semantic Web for Healthcare and Life Sciences: http://knoesis.org/amit/hcls
[3] MaterialWays (a Materials Genome Initiative related project): http://wiki.knoesis.org/index.php/MaterialWays
[4] From Big Data to Smart Data: http://wiki.knoesis.org/index.php/Smart_Data
A Reuse-based Lightweight Method for Developing Linked Data Ontologies and Vo... (María Poveda Villalón)
The document proposes a lightweight methodology called LOT (Linked Open Terms) for developing Linked Data ontologies and vocabularies in a reusable way. The methodology is data-driven and focuses on ontology search, selection, integration, completion and evaluation activities. It provides guidelines for reusing existing terms and linking them according to Linked Data principles while keeping the processes lightweight. The methodology is intended to help domain experts create ontologies and vocabularies for publishing data on the semantic web in an interoperable way without requiring extensive knowledge engineering expertise. Future work involves providing more detailed guidelines, examples, and connecting existing tools to support each step of the methodology.
A keynote at the Web Science Conference, 2018, held at the VU Amsterdam [1]. This mainly describes the output of the Semantic Technology Institute International (STI2) Summit (for senior researchers in the Semantic Web field) held in Crete in September 2017 [2].
1. https://websci18.webscience.org/
2. https://www.sti2.org/events/2017-sti2-semantic-summit
Australia's Environmental Predictive Capability (TERN Australia)
Federating world-leading research, data and technical capabilities to create Australia’s National Environmental Prediction System (NEPS).
Community consultation presentation.
3-12 February 2020
Dr Michelle Barker (Facilitator)
(Presentation v5)
The document discusses how data-centric science is driving the need for new tools and technologies to support large-scale data sharing and collaboration. It provides examples of projects like the Sloan Digital Sky Survey that have pioneered new models for open data publishing and public engagement with science. Microsoft research is working on technologies to support the entire scientific research lifecycle from data acquisition and modeling to analysis, visualization, and open dissemination of research outputs.
Sources of Change in Modern Knowledge Organization Systems (Paul Groth)
Talk covering how knowledge graphs are making us rethink how change occurs in Knowledge Organization Systems. Based on https://arxiv.org/abs/1611.00217
Being FAIR: FAIR data and model management, SSBSS 2017 Summer School (Carole Goble)
Lecture 1:
Being FAIR: FAIR data and model management
In recent years we have seen a change in expectations for the management of all the outcomes of research – that is, the “assets” of data, models, codes, SOPs, and workflows. The “FAIR” (Findable, Accessible, Interoperable, Reusable) Guiding Principles for scientific data management and stewardship [1] have proved to be an effective rallying cry. Funding agencies expect data (and increasingly software) management, retention, and access plans. Journals are raising their expectations of the availability of data and codes for pre- and post-publication. The multi-component, multi-disciplinary nature of Systems and Synthetic Biology demands the interlinking and exchange of assets and the systematic recording of metadata for their interpretation.
Our FAIRDOM project (http://www.fair-dom.org) supports Systems Biology research projects with their research data, methods and model management, with an emphasis on standards smuggled in by stealth and sensitivity to asset sharing and credit anxiety. The FAIRDOM Platform has been installed by over 30 labs or projects. Our public, centrally hosted Asset Commons, the FAIRDOMHub.org, supports the outcomes of 50+ projects.
Now established as a grassroots association, FAIRDOM has over 8 years of experience of practical asset sharing and data infrastructure at the researcher coal-face ranging across European programmes (SysMO and ERASysAPP ERANets), national initiatives (Germany's de.NBI and Systems Medicine of the Liver; Norway's Digital Life) and European Research Infrastructures (ISBE) as well as in PI's labs and Centres such as the SynBioChem Centre at Manchester.
In this talk I will explore how FAIRDOM has been designed to support Systems Biology projects and show examples of its configuration and use. I will also discuss the technical and social challenges we face.
I will also refer to European efforts to support public archives for the life sciences. ELIXIR (http://www.elixir-europe.org/) is the European Research Infrastructure of 21 national nodes and a hub, funded by national agreements to coordinate and sustain key data repositories and archives for the life science community, improve access to them and related tools, support training, and create a platform for dataset interoperability. As Head of the ELIXIR-UK Node and co-lead of the ELIXIR Interoperability Platform, I will show how this work relates to your projects.
[1] Wilkinson et al, The FAIR Guiding Principles for scientific data management and stewardship Scientific Data 3, doi:10.1038/sdata.2016.18
The Largest General Translational Informatics Public Private Partnership to Date (Laura Berry)
Presented at the Global Pharma R&D Informatics Congress. To find out more, visit:
www.global-engage.com
In this presentation, Jay Bergeron from Pfizer discusses eTriks: a 5 year IMI project to provide translational informatics products and services to other programs and EU Public Private Partnerships.
Keynote presentation delivered at ELAG 2013 in Gent, Belgium, on May 29 2013. Discusses Research Objects and the relationship to work my team has been involved in during the past couple of years: OAI-ORE, Open Annotation, Memento.
A Recommender Story: Improving Backend Data Quality While Reducing Costs (Databricks)
Information overload is one of the biggest challenges academics face on a daily basis while finding the right knowledge to advance science. With around 7,000 research articles being published every day, how do you find the right ones?
Elsevier is a global information analytics business that helps institutions and professionals advance healthcare, open science and improve performance. With many data sources and signals being available, data science and big data engineering provide the perfect opportunity to deliver more value to researchers.
Here we will focus on Mendeley, an open (free of charge) academic content platform that helps researchers discover new information via functionalities such as a crowd-sourced collection of academic documents (the Catalogue) and various personalized recommender systems. Mendeley Suggest, the recommender system, helps millions of researchers worldwide find documents and people relevant to their research field that they did not yet know existed. The personalized recommenders are powered by the Mendeley Catalogue, which clusters 2 billion records into canonical records, state-of-the-art algorithms, and big data solutions (e.g. Spark).
In the past few years, we noticed that with our content growth, the quality of the canonical records started drifting due to scalability issues. As a result, we faced clustering accuracy problems which, in turn, also impacted the recommenders. In this talk we will highlight how we rearchitected the fabrication of the Mendeley Catalogue to improve its scalability and accuracy. In addition, we will show how the migration from Hadoop MapReduce to Spark has helped us reduce costs as well as improve maintainability.
Networked Science, And Integrating with Dataverse - Anita de Waard
This document discusses the growing interconnectedness of research data and tools in a networked science environment. It summarizes Elsevier's current and potential future connections to the Dataverse platform, including exporting data from the Hivebench ELN to Dataverse, linking articles to datasets in Dataverse through frameworks like Scholix, indexing Dataverse through Elsevier's data search tools, and tracking metrics on Dataverse datasets through analytics platforms like PlumX. The author expresses interest in further strengthening integration between these systems to advance open sharing of research data.
The pulse of cloud computing with bioinformatics as an example - Enis Afgan
The document discusses how cloud computing can enable large-scale genomic analysis by providing on-demand access to computational resources and petabytes of reference data. It describes how tools like Galaxy and CloudMan allow researchers to perform genomic analysis in the cloud through a web browser by automating the provisioning and configuration of cloud resources. This approach makes genomic research more accessible and enables the elastic scaling of analysis as needed.
Talk at the World Science Festival at Columbia, June 2, 2017: session on Big Data and Physics: http://www.worldsciencefestival.com/programs/big-data-future-physics/
Presentation made in the context of the FAO AIMS Webinar titled “Knowledge Organization Systems (KOS): Management of Classification Systems in the case of Organic.Edunet” (http://aims.fao.org/community/blogs/new-webinaraims-knowledge-organization-systems-kos-management-classification-systems)
21/2/2014
The document proposes an Oz Mammals Bioinformatics and Data Resource to store, share, and analyze genomic and other data from Australian mammal studies. It would:
1) Capture existing Oz mammal data and resources, provide long-term storage, and integrate new genomic data from the OMG Project.
2) Enable data sharing within the OMG project and provide access to Oz mammal data worldwide.
3) Give access to data processing, analysis, and visualization tools, and integrate with external resources like the Atlas of Living Australia.
Leveraging Open Source Technologies to Enable Scientific Archiving and Discovery; Steve Hughes, NASA; Data Publication Repositories
The 2nd Research Data Access and Preservation (RDAP) Summit
An ASIS&T Summit
March 31-April 1, 2011 Denver, CO
In cooperation with the Coalition for Networked Information
http://asist.org/Conferences/RDAP11/index.html
Similar to Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017 (20)
Keynote presentation for the International Semantic Web Conference in Athens, Greece, on November 9, 2023. The talk addresses the generative AI explosion and its potential impacts on the Semantic Web and Knowledge Graph communities, which may, in fact, spark a research renaissance.
Abstract:
We are living in an age of rapidly advancing technology. History may view this period as one in which generative artificial intelligence reshaped the landscape and narrative of many technology-based fields of research and application. Times of disruption often present both opportunities and challenges. We will discuss some areas that may be ripe for consideration in the field of Semantic Web research and semantically-enabled applications. Semantic Web research has historically focused on representation and reasoning and on enabling interoperability of data and vocabularies. At the core are ontologies, along with ontology-enabled (or ontology-compatible) knowledge stores such as knowledge graphs. Ontologies are often manually constructed using a process that (1) identifies existing best-practice ontologies (and vocabularies) and (2) generates a plan for how to leverage these ontologies by aligning and augmenting them as needed to address requirements. While semi-automated techniques may help, a significant portion of the work is typically best done by humans with domain and ontology expertise. This is an opportune time to rethink how the field generates, evolves, maintains, and evaluates ontologies. We consider how hybrid approaches, i.e., those that leverage generative AI components along with more traditional knowledge representation and reasoning approaches, can create improved processes. The effort to build a robust ontology that meets a use case can be large. Ontologies are not static, however; they need to evolve along with knowledge evolution and expanded usage. There is potential for hybrid approaches to help identify gaps in ontologies and/or refine content. Further, ontologies need to be documented with term definitions and their provenance. Opportunities exist to consider semi-automated techniques for some types of documentation, provenance, and decision rationale capture for annotating ontologies.
The area of human-AI collaboration for population and verification presents a wide range of opportunities for research collaboration and impact. Ontologies need to be populated with class and relationship content. Knowledge graphs and other knowledge stores need to be populated with instance data in order to be used for question answering and reasoning. Population of large knowledge graphs can be time-consuming. Generative AI holds the promise of creating candidate knowledge graphs that are compatible with the ontology schema. The knowledge graph should contain provenance information identifying how the content was populated and its source, and its correctness and currency should be checked. A human-AI assistant approach is presented.
Keynote presentation for Mobilizing Computable Biomedical Knowledge Conference 2021. Looking in particular at emerging trends of cognitive assistants, personal health knowledge graphs, and meta descriptions for knowledge resources. Examples taken from RPI-IBM project on Health Empowerment by Analysis, Learning, and Semantics and NIEHS project with RPI-MSSM-Columbia on Human Health Exposure Analysis Repository Data Center.
Towards an Environmental Health Sciences Ontology: CHEAR to HHEAR and Beyond - Deborah McGuinness
The National Institute of Environmental Health Sciences (NIEHS) supported a Children's Health Exposure Analysis Repository (CHEAR) program that needed to integrate data across exposure science and health. We led the data science effort of this program and designed the CHEAR ontology to support data integration and to leverage a wide range of existing ontologies and vocabularies. We are refactoring the ontology to support human health (instead of just aiming to support child health) and broadening it to support a broad range of environmental health sciences applications.
The document discusses the use of ontologies and taxonomies to enhance findability, accessibility, interoperability, and reuse of data and resources. It provides definitions for taxonomy, ontology, knowledge engineering, and artificial intelligence. It describes how ontologies can specify terminology, concepts, and relationships in a domain to provide a rich description. The document also discusses ontology development processes and gives examples of how ontologies can enable semantic search, data integration, and interpretation across different studies and data sources.
Automating Semantic Metadata Collection in the Field with Mobile Application - Deborah McGuinness
Presentation at Mobile Deployment of Semantic Technologies Workshop at the International Semantic Web Conference. Abstract: In the past few decades, the field of ecology has grown from a collection of disparate researchers who collected data on their local phenomenon by hand, to large ecosystems-oriented projects partially fueled by automated sensor networks and a diversity of models and experiments. These modern projects rely on sharing and integrating data to answer questions of increasing scale and complexity. Interpreting and sharing the big data sets generated by these projects relies on information about how the data was collected and what the data is about, typically stored as metadata. Metadata ensures that the data can be interpreted and shared accurately and efficiently. Traditional paper-based metadata collection methods are slow, error-prone, and non-standardized, making data sharing difficult and inefficient. Semantic technologies offer opportunities for better data management in ecology, but also may pose a challenging learning curve to already busy researchers. This paper presents a mobile application for recording semantic metadata about sensor network deployments and experimental settings in real time, in the field, and without expecting prior knowledge of semantics from the users. This application enables more efficient and less error-prone in-situ metadata collection, and generates structured and shareable metadata.
This document discusses using linked open data and semantic technologies to support next generation science. It provides background on the increasing availability of open data and opportunities for citizen science contributions. Semantic technologies can help integrate and link diverse scientific data sources. Linked data principles allow disparate datasets to be connected through shared identifiers and relationships. Examples are provided of existing projects that use semantic approaches to enable scientific data discovery, analysis and collaboration across domains like population health, water quality monitoring and climate change. Overall, the document argues that semantic technologies are mature and can help scientists address large, distributed problems by facilitating data integration and knowledge sharing.
This talk introduces Linked Data and Semantic Web by using two examples - population sciences grid and semantAqua - a semantically enabled environmental monitoring. It shows a few tools and the semantic methodology and opens discussion for LOD and team science
The Semantic Travel Concierge - a vision of the potential of semantic technologies for the travel industry. Deborah L. McGuinness Keynote at the Opentravel Alliance Advisory Forum - Miami, Fla, April 11, 2012.
The document discusses the evolving landscape of semantic technologies and their applications to scientific domains like eScience. It introduces the Tetherless World Constellation, a research group applying semantic web techniques. Examples are given of projects applying semantics to areas like virtual observatories and provenance capture. The value of semantic technologies is discussed for integration, discovery, and validation of scientific data and models. Modular ontologies and semantically-enabled frameworks are presented as important directions for reuse and collaboration.
My keynote at the Ontologies Come of Age workshop at the International Semantic Web Conference in Bonn Germany. This workshop was named after a paper I wrote about a decade ago.
Ontologies for the Real World by Deborah L. McGuinness. Invited talk for the 2011 Future Worlds Microsoft Faculty Summit in the Semantic Knowledge for Commodity Computing.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Programming Foundation Models with DSPy - Meetup Slides - Zilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
Essentials of Automations: The Art of Triggers and Actions in FME - Safe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Building Production Ready Search Pipelines with Spark and Milvus - Zilliz
Spark is the widely used ETL tool for processing, indexing and ingesting data to serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to Milvus vector database for search serving.
Communications Mining Series - Zero to Hero - Session 1 - DianaGray10
This session provides an introduction to UiPath Communication Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Climate Impact of Software Testing at Nordic Testing Days - Kari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The talk discusses the climate impact / sustainability of software testing. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Sustainability can be added to the quality characteristics and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
GraphRAG for Life Science to increase LLM accuracy - Tomaz Bratanic
GraphRAG for the life science domain, where you retrieve information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... - SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 - Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
HCL Notes and Domino License Cost Reduction in the World of DLAU - panagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
Unlock the Future of Search with MongoDB Atlas: Vector Search Unleashed - Malak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017
1. Ontologies for the Modern Age
Deborah L. McGuinness
Tetherless World Senior Constellation Chair
Professor of Computer, Cognitive, and Web Science
Director RPI Web Science Research Center
RPI Institute for Data Exploration and Application Health Informatics Lead
dlm@cs.rpi.edu, @dlmcguinness
2. We have come a long way since 2001
Tracks:
• Ontology and Ontology Maintenance
• Interoperability, Integration, & Composition
• Web Services & Applications
• Needed to add a tutorial / demo / BOF track to handle large preregistration numbers
Sponsors: VerticalNet, Nokia, Spiritsoft, Enigmatic, Empolis, Connotate, Mondeca, L&C, SC4, Network Inference, Ontoprise, Inria, KSL, NSF, DARPA
From http://swsa.semanticweb.org: 245 Attendees | 35/58 Papers Accepted | 3 Tutorials | 0 Workshops, plus 2 co-located events and a BOF/DEMO track
• Kicked off the Semantic Web Science Association (SWSA) and the ISWC conference series (2002)
• Background for Web Science / Web Science Trust (2006)
McGuinness ISWC 10/23/2017
3. Themes continue and expand
Co-located and track themes valid then and expanding now:
• Tutorials – 7
• Workshop explosion – 18, some of which are vibrant communities that have been running for many years or have evolved (e.g., Linked Science -> Enabling Open Semantic Science)
• Some continued themes – Ontologies Come of Age (again), 2011
• Some newer themes (e.g., benchmarking linked data, semantic web for X: IoT, biodiversity, etc.)
4. Ontologies
An ontology specifies a rich description of the
• Terminology, concepts, nomenclature
• Relationships among concepts and individuals
• Sentences distinguishing concepts, refining definitions & relationships
relevant to a particular domain or area of interest.
* Based on the AAAI ’99 Ontologies Panel: McGuinness, Welty, Uschold, Gruninger, Lehmann
• "Pull" for Ontologies. Invited talk, Semantics for the Web, Dagstuhl, Germany, 2000.
• Ontologies Come of Age. In Fensel, Hendler, Lieberman, Wahlster, eds. Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential. MIT Press, 2003.
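The three ingredients above (terminology, relationships, refining sentences) can be sketched with a toy triple store. This is a minimal illustration, not any real tooling, and all names (ex:Person, ex:Mother, ex:hasChild) are invented for the example:

```python
# Toy illustration of an ontology's three ingredients, using plain triples.
# All class/property names are invented for illustration only.
ontology = {
    ("ex:Person", "rdf:type", "owl:Class"),             # terminology / concepts
    ("ex:Mother", "rdf:type", "owl:Class"),
    ("ex:hasChild", "rdf:type", "owl:ObjectProperty"),  # relationship among concepts
    ("ex:hasChild", "rdfs:domain", "ex:Person"),
    ("ex:Mother", "rdfs:subClassOf", "ex:Person"),      # sentence refining a definition
}

def superclasses(cls, triples):
    """All (transitive) superclasses of cls via rdfs:subClassOf."""
    result, frontier = set(), {cls}
    while frontier:
        nxt = {o for (s, p, o) in triples
               if p == "rdfs:subClassOf" and s in frontier}
        frontier = nxt - result
        result |= nxt
    return result

print(superclasses("ex:Mother", ontology))  # {'ex:Person'}
```

Even this toy version shows why a reasoner is useful: the subclass sentence lets an application infer that anything stated about a Person also applies to a Mother, without duplicating the statements.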
5. Ontology-Enabled Application: Configurator Example
McGuinness, Resnick, Isbell. Description Logic in Practice: A CLASSIC Application. IJCAI, 1995.
Web-based configurator, with a KR-literate designer and maintainer.
Tools like CLASSIC, Protégé, Ontolingua, Chimaera, PROMPT, … all benefit from having a knowledge representation expert as project owner / maintainer with domain expert access.
Applications of the day lived reasonably* well with limited expressivity.
6. Building and Evolving Ontologies
Aspect | Past | Present
Design | Knowledge Representation (KR) expert with domain expert access | KR expert(s) paired with domain experts AND community
Population | KR expert learns domain and builds ontology with some external reuse | KR and domain experts determine seed vocabularies and HEAVILY leverage them
Evolution | KR expert heavily involved | KR expert involved in building / customizing tools that domain experts use; input may include the output of automatic techniques (e.g., extraction)
Tool Users | Trained in Computer Science | Trained in Domain Sciences
Application Users | Targeted, well-understood user base | Diverse and evolving user base
Reuse | Well thought out | Expect the unexpected
7. Ontology "Pull": from Browsing / Configuration to Interoperability / Transparency
● Limited data integration without controlled vocabulary
● Limited reproducibility without shared definitions
● Difficulty in reuse without provenance
Ontologies can enhance integration, communication, reuse, and research impact.
8. Data Life Cycle
Ontologies support the whole data life cycle: consistent terminology and meaning; ontology-enhanced search and organization; ontology-enabled interpretation & integration; ontology-enabled integrity checking; and provenance annotations for trust and reuse.
Computer-understandable specifications of meaning (semantics) support enhanced lifespan & impact of data.
Data management image: J. Crabtree, with permission, NIEHS 50 yr FEST
9. Child Health Exposure Analysis Repository
Stingone, Mervish, Kovatch, McGuinness, Gennings, and Teitelbaum. Big and Disparate Data: Considerations for Pediatric Consortia. Current Opinion in Pediatrics, 29(2):231-239, April 2017. Funding: NIH/NIEHS 0255-0236-4609 / 1U2CES026555-01.
10. Ontology Development Process
[Process diagram] Inputs: use cases; existing ontologies & vocabularies; expert interviews; expert guidance sources; data reporting templates; data dictionaries / codebooks; foundational ontologies/vocabularies.
Ontology fragments are managed in Labkey, with ongoing ontology curation by reviewers & curators: the ontology development team, domain collaborators, invited experts, and "consumers" (data analysts). A Semantic Extract, Transform, Load tool (SETLr) produces the generated ontology: domain concepts, authoritative vocabularies, vetted definitions, and supporting citations.
The generated ontology feeds: knowledge graph integration (linking data and metadata content to domain terms; linking workflows based on semantic descriptions); repository integration (source datasets, analytics source code, results, publications); knowledge-enhanced search (finding what is there that might be of use); the Human Aware Data Acquisition Framework; and an ontology browser.
Exemplified by: Erickson, McGuinness, McCusker, Chastain, Pinheiro, Rashid, Liang, Liu, Stingone, …
11. Support Browsing, Searching, Pooling, Deriving Values, Verification, …
• Ontology support for mapping and integration (e.g., education level)
• Ontology informs decisions about variables that may be combined, serve as proxy, or be used to derive desired info (e.g., birth outcomes)
• Ontology integrity constraints may help flag errors (e.g., APGAR > 10)
• Ontology helps expose implicit information and find links
Example derived value: the Fenton Z-Score, computed from sex, birth weight, and gestational age.
Example mapping between two education codings:
Mother's Highest Education Level | Val
Did not attend school | 0
Elementary school | 1
Technical post-primary | 2
Middle school | 3
Technical post-middle school | 4
High school or junior college | 5
Technical post-junior college | 6
College | 7
Graduate | 8
Doesn't know | 9
Mother Education | Val
Less than High School | 0
High School Graduate or More | 1
McGuinness, McCusker, Pinheiro, Stingone, et al.
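The mapping and integrity-checking bullets above can be sketched in code. The harmonization rule below (detailed codes 5-8 counting as "High School Graduate or More", with "Doesn't know" treated as missing) is an illustrative assumption for this sketch, not CHEAR's actual mapping:

```python
# Illustrative harmonization of the two education codings shown above.
# The cut point (codes 5-8 -> "High School Graduate or More") and the
# treatment of code 9 ("Doesn't know") as missing are assumptions.

def harmonize_education(detailed_code: int):
    """Map the detailed 0-9 coding onto the binary 0/1 coding; None if unknown."""
    if detailed_code == 9:                 # "Doesn't know" -> missing
        return None
    return 1 if detailed_code >= 5 else 0  # 5 = high school or junior college

def check_apgar(score: int) -> bool:
    """Ontology-style integrity constraint: APGAR scores lie in [0, 10]."""
    return 0 <= score <= 10

assert harmonize_education(7) == 1       # College -> High School Graduate or More
assert harmonize_education(3) == 0       # Middle school -> Less than High School
assert harmonize_education(9) is None    # Doesn't know -> missing
assert not check_apgar(11)               # flagged: APGAR > 10 is invalid
```

The point of putting the cut point in the ontology rather than in each analysis script is that every study pooled into the repository then shares one vetted, documented rule.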
12. CHEAR Human Aware Data Acquisition Framework
Laboratory Information Management System (LIMS)-based backend integrated with the ontology. Includes automatic ingest, access control, data governance, download, …
Supports search over study, sample, subject, …
Enables statisticians to ask for content to support their studies, e.g., find:
Child: birth weight, gender, gestational age at birth
Mother: age, BMI "early in pregnancy based on inclusion criterion for the particular study", parity, education
Metals: As, Cd, Mn, Mo, Pb
Pinheiro, Liang, Rashid, Liu, Chastain, Santos, McCusker, McGuinness
13. Example Ontology and Infrastructure (CHEAR)
CHEAR ontology infrastructure:
• Thousands of instances of CHEAR & foundational ontologies (e.g., subjects, samples, lab capabilities)
• Thousands of concepts and relationships from foundational ontologies
• Hundreds of concepts from the CHEAR ontology: 258 analytes (incl. 36 metals, from lab spec); 176 epidemiological attributes (from pilots); 28 sample types (incl. 3 pregnancy, from lab spec); 42 assay types (from lab capabilities); 122 instrument types (from lab capabilities)
Foundational ontologies and vocabularies: Disease Ontology; UBERON (anatomy); Units Ontology; CHEBI (chemicals); RefMet* (metabolites); ENVO* (environment); UniProt* (proteins); HAScO (instruments/methods); SIO (Semanticscience Integrated Ontology); PROV (provenance)
We use:
● Labkey to create, curate, and maintain CHEAR concepts (ontology)
● Labkey to create and maintain CHEAR instances (knowledge graph)
● SETLr to build and publish the CHEAR ontology from CHEAR concepts
● HADatAc to connect CHEAR/foundational concepts and instances to CHEAR data
● HADatAc to browse/select/retrieve CHEAR data via the CHEAR vocabulary
14. Content keeps expanding… Metabolomics: Targeted Analytes and RefMet
RefMet main classes and subclasses are mapped to CHEBI classes where available. CHEAR targeted analyte classes and superclasses are also aligned to CHEBI. When the CHEBI hierarchy is included as well, the following main classes and subclasses in RefMet have targeted analyte subclasses (count in parentheses).
15. Mapping Data to Meaning: Semantic Data Dictionaries
Rashid, Chastain, Stingone, McGuinness, McCusker. The Semantic Data Dictionary Approach to Data Annotation and Integration. Enabling Open Semantic Science, Oct 21, 2017.
16. Semantic Data Dictionaries Describe a Bigger Picture
Raw columns: id, race, age, edu, bmi, weight, height, smoker, pb_1, pb_2, ga, birthwt
The semantic data dictionary maps each column to an ontology term: id (sio:Identifier), race (sio:Race), age (sio:Age), edu (chear:EducationLevel), bmi (chear:BMI), weight (sio:Mass), height (sio:Height), smoker (chear:SmokingStatus), pb_1 (sio:Concentration), pb_2 (sio:Concentration), ga (chear:GestationalAge), birthwt (chear:Weight).
It also introduces the implicit entities the flat table leaves unstated: ??mother (sio:Human, hasRole chear:Mother), ??child (sio:Human), ??birth (chear:Birth), ??pregnancy (chear:Pregnancy), ??visit1 and ??visit2 (chear:Visit), ??sample1 and ??sample2 (Serum), and ??pb_1 / ??pb_2 (Pb), connected by relationships such as hasAttribute, hasPart, inRelationTo, existsAt, and wasDerivedFrom. Plus units of measure (not shown).
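A semantic data dictionary can be approximated as a column-to-term mapping that turns each cell of a flat row into a typed statement. The column/term pairs below are taken from the slide; the `annotate()` helper and its output shape are a hypothetical sketch, not the actual SDD tooling:

```python
# Sketch of a semantic data dictionary: column names -> ontology terms
# (pairs from the slide above). annotate() and its output shape are
# hypothetical, for illustration only.
SDD = {
    "id": "sio:Identifier", "race": "sio:Race", "age": "sio:Age",
    "edu": "chear:EducationLevel", "bmi": "chear:BMI",
    "weight": "sio:Mass", "height": "sio:Height",
    "smoker": "chear:SmokingStatus",
    "pb_1": "sio:Concentration", "pb_2": "sio:Concentration",
    "ga": "chear:GestationalAge", "birthwt": "chear:Weight",
}

def annotate(row: dict):
    """Turn a flat data row into (subject, attribute-type, value) statements."""
    subject = f"subject-{row['id']}"
    return [(subject, SDD[col], val)
            for col, val in row.items() if col != "id" and col in SDD]

stmts = annotate({"id": 42, "age": 31, "bmi": 24.5})
# e.g. [('subject-42', 'sio:Age', 31), ('subject-42', 'chear:BMI', 24.5)]
```

The full approach goes further, as the slide shows: beyond typing each column, it materializes the implicit entities (mother, child, visit, sample) and the relationships among them.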
17. Epidemiological Measurements: Ontology and Knowledge Graph (Behind the Scenes)
The measurement views combine concepts and relationships from foundational ontologies, examples of terms and concepts from the CHEAR ontology, and instances of foundational ontology terms.
18. CHEAR Study-based Evolution Strategy
Data Center workflow for each incoming study: identify terms that can be mapped to the existing ontology; identify terms to be added to the ontology; describe new terms with definitions and their location within the existing ontology; incorporate mappings (e.g., variable names) into the knowledge graph; and load data into the knowledge graph after the embargo period.
The Data Structures & Standards Working Group compiles new terms across multiple studies (e.g., quarterly), reviews and revises updates with stakeholders, and incorporates the new terms into the existing ontology, producing a new ontology version.
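The first two steps of the evolution strategy above (identify mappable terms vs. terms to add) amount to partitioning an incoming study's variables against the current ontology. The mapping table and variable names below are invented for illustration:

```python
# Illustrative triage step from the evolution strategy above: split an
# incoming study's variables into those mappable to the existing ontology
# and new terms to propose. The mapping table is invented for this sketch.
EXISTING_TERMS = {
    "birth weight": "chear:BirthWeight",
    "gestational age": "chear:GestationalAge",
    "education level": "chear:EducationLevel",
}

def triage(variables):
    """Partition variable names into (mapped, new_terms)."""
    mapped = {v: EXISTING_TERMS[v] for v in variables if v in EXISTING_TERMS}
    new_terms = [v for v in variables if v not in EXISTING_TERMS]
    return mapped, new_terms

mapped, new = triage(["birth weight", "maternal stress index"])
# mapped -> {'birth weight': 'chear:BirthWeight'}
# new    -> ['maternal stress index']
```

In practice the "mapped" side would use lexical and semantic matching rather than exact string lookup, and the "new" side feeds the working group's definition and review process.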
19. Ontology-Enabled Study
Search
[Screenshot of search results listing three studies: Blood Biomarkers for Children’s Health (Study 1), Urine Biomarkers for Children’s Health (Study 2), and Metabolomic Biomarkers for Children’s Health (Study 3). Each result shows Institution, Principal Investigator(s), Number of Subjects, Number of Samples, Study Description, and Keywords.]
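A sketch of how an ontology can broaden study search: a query term is expanded to all of its ontology descendants before matching study keywords. The subclass hierarchy and keyword sets below are toy assumptions, not the real CHEAR hierarchy:

```python
# Assumed toy subclass hierarchy: child term -> parent term.
subclass_of = {
    "blood biomarker": "biomarker",
    "urine biomarker": "biomarker",
    "metabolomic biomarker": "biomarker",
    "biomarker": "measurement",
}

def expand(term):
    """Return the term plus all narrower terms, via the subclass hierarchy."""
    terms = {term}
    changed = True
    while changed:  # fixed-point closure over the child->parent table
        changed = False
        for child, parent in subclass_of.items():
            if parent in terms and child not in terms:
                terms.add(child)
                changed = True
    return terms

# Assumed study keyword sets, mirroring the three studies on the slide.
studies = {
    "Study 1": {"blood biomarker"},
    "Study 2": {"urine biomarker"},
    "Study 3": {"metabolomic biomarker"},
}

def search(term):
    wanted = expand(term)
    return sorted(s for s, kw in studies.items() if kw & wanted)

print(search("biomarker"))  # ['Study 1', 'Study 2', 'Study 3']
```

A plain keyword match on "biomarker" would return nothing here; the ontology expansion is what lets all three studies surface.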
20. Ontology-Enabled Data Search (expanded view)
21. Evolving Reflections
• Domain science is asking for ontologies to address Findable, Accessible, Interoperable, Reusable (FAIR) data issues
• With tooling and processes, domain scientists can help build and maintain ontologies and ontology-enabled applications (e.g., epidemiologists are doing this)
• While classical ontology considerations remain important (e.g., being expressive enough for the use case), ecosystem considerations such as maintainability and longevity now dominate
• The data center content in CHEAR, with the human-aware data collection framework front end, provides some of the infrastructure I envision in an open* knowledge network
McGuinness, McCusker, Pinheiro, Stingone, et al.
23. Health Empowerment by Analytics,
Learning, and Semantics
• How can we enhance population and
individual health using information found
inside and outside the traditional (E)HR?
• How can we develop precision medicine
across the many levels of research from
Genome to Phenotype to Population
Health?
• How can we use IBM’s Watson Technology, augmented with Rensselaer’s semantics, learning, and analytics expertise, to achieve these goals?
McGuinness ISWC 10/23/2017 Partially supported through IBM Cognitive Network funding
24. Health Empowerment
• Knowledge as Medicine
– Cognitive agent technology will enhance the ability to:
• Explain relationships found in the data and link them to the appropriate scientific literature
• Put meaningful labels on clusters and connections derived from the analytic process
• Provide inputs to cognitive systems based on data found in databases and the medical literature
• Generate and/or test hypotheses about health and medicine down to the level of the individual (precision medicine and health)
25. Semantics-enabled Framework
Ontologies are an important piece, but are part of a larger integrated framework.
Semanalytics / SemNExt RPI team: McGuinness and Bennett (PIs), with McCusker, Erickson, Seneviratne, and the extended research groups including Rashid, along with input from IBM HEALS collaborators and motivation from many projects, particularly NIEHS/Mount Sinai.
26. Probability-Aware Knowledge Exploration
• Knowledge imported from drug, protein, and disease interaction databases
• Each interaction is given an evidence-driven probability
• Find drugs that could affect melanoma, filtered by interaction probability
• The best hypotheses were generated using the highest probabilities
McCusker, Dumontier, Yan, He, Dordick, McGuinness. Finding Melanoma Drugs through a Probabilistic Knowledge Graph. PeerJ, 2016.
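The filtering idea can be sketched as follows. The edges and probabilities below are toy values for illustration, not the paper's actual graph or method:

```python
# Each edge carries an evidence-driven probability; keep only confident
# drug -> protein -> disease paths and rank them by joint probability.
edges = [
    ("drugA", "targets", "proteinX", 0.9),
    ("proteinX", "associatedWith", "melanoma", 0.8),
    ("drugB", "targets", "proteinY", 0.3),
    ("proteinY", "associatedWith", "melanoma", 0.95),
]

def drug_hypotheses(disease, min_prob):
    """Rank drugs by joint path probability, dropping paths below min_prob."""
    hits = []
    for d, _, prot, p1 in edges:
        if not d.startswith("drug"):  # toy convention for drug nodes
            continue
        for s, _, dis, p2 in edges:
            if s == prot and dis == disease:
                joint = round(p1 * p2, 3)  # independence assumed for the sketch
                if joint >= min_prob:
                    hits.append((d, joint))
    return sorted(hits, key=lambda t: -t[1])

print(drug_hypotheses("melanoma", 0.5))  # [('drugA', 0.72)]
```

Here drugB is pruned because its path probability (0.3 × 0.95 ≈ 0.29) falls below the threshold, even though one of its edges is highly confident; filtering on the joint probability is what keeps hypotheses evidence-driven end to end.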
27. Semantic Extract, Transform, Load for Knowledge Graphs
[Pipeline diagram: heterogeneous inputs (XML, CSV, JSON, HTML, Entrez) flow through a JSON-LD templating script (the E-T-L step, labeled “Satoru”) to produce RDF graphs.]
McCusker, Rashid, Liang, Liu, Chastain, Pinheiro, Stingone, McGuinness. Broad, Interdisciplinary Science In Tela: An Exposure and Child Health Ontology. Web Science, 2017.
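A minimal sketch of the semantic ETL step: tabular rows are turned into triples by a small template. The column names, URI patterns, and property names below are hypothetical, not the project's actual mappings:

```python
# Transform CSV rows into subject-predicate-object triples via a template.
import csv
import io

raw = "subject_id,birthwt_g\nS1,3200\nS2,2900\n"  # toy input data

def to_triples(csv_text):
    """Apply a per-row template: one typed subject plus one data property."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subj = f"ex:child/{row['subject_id']}"          # assumed URI pattern
        triples.append((subj, "rdf:type", "sio:Human"))
        triples.append((subj, "chear:birthWeight", row["birthwt_g"]))
    return triples

for t in to_triples(raw):
    print(t)
```

The same template-per-row idea extends to XML and JSON inputs; only the reader and the field accessors change, which is why a single templating script can cover heterogeneous sources.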
28. Cancer Data: Comprehensive omics, epidemiology & patient care
Initial data analysis is aimed at the following public datasets:
• TCGA: RNA expression, tumor mutation, protein expression, and clinical attributes (including staging, treatment, risk, and survival) for 32 cancer types in > 14,000 patients
• NHANES: cross-sectional biannual survey of the health and nutrition of the US population, including illness, environmental exposures, and risk exposures
• Multiparameter Intelligent Monitoring in Intensive Care (MIMIC): longitudinal patient records from patients who stayed in the intensive care units at Beth Israel Deaconess Medical Center
Additional analyses will include deidentified data on cancer topics.
29. Building the Knowledge graph:
Reusing Knowledge Sources to Bridge
Abstractions
• Already done: COSMIC Gene Census, OMIM, DrugBank,
iRefIndex
• Pathway data: KEGG, Reactome (small molecule
interactions, curated interactions)
• Gene Ontology: protein localization in cell types and
tissues, protein functions, biological process involvement
• UniProt: Protein families, including common binding sites
• CAP Protocols: Current cancer staging standards,
NCCN… many of these evolve, e.g., breast cancer
staging guidelines
• Vocabularies: SNOMED, NCI Thesaurus, NCI
Metathesaurus, etc.
30. Discussion topics
• Old-style ontologies, along with their considerations, are still important… expressiveness is still an issue, and may be a growing one
• But old styles, old processes, and old ecosystems will not make the impact we want without buy-in from a diverse community of developers and users
• Ecosystems matter! With respect to process, infrastructure, community, …
• Goble’s point from Semantic Science: need community, driver, tools
• Modern-age ontologies are just one piece, an important piece, but only one part of the puzzle; without the other puzzle pieces we will not change the world
31. Value Propositions Matter to
Get and Keep Collaborators
What will we be able to do that is hard or impossible today? One set of topics from an applied mathematician collaborator (Bennett):
• How to merge data from heterogeneous data sources for analysis
• What types of data are available for analysis
• What interesting analysis questions we are capable of asking
• Whether a potential analysis question is too broad or imprecise for the data
• Which adjustment covariates should be used for a given analysis question
• Which statistical and machine learning methods and workflows are appropriate
• What background information might be relevant for an analysis question
• Whether measurements are plausible and can be trusted
• Whether derived results/hypotheses are explained in the literature
• Whether results are similar to those of prior analyses
• What are appropriate ways to visualize and present results to the user
• Whether changes in data should trigger a reanalysis or new analysis of questions of interest
32. Preliminary Study
Bennett, Erickson, McCusker, McGuinness et al
Hypothesis: does a factor increase the odds of disease?
User specifies:
• Data (NHANES cohorts)
• Disease definition
• Confounders (age, BMI)
• Factors (pesticides)
An agent dynamically applies a standard risk-analysis workflow based on log-odds; the approach is applicable to any risk problem and data set.
[Workflow diagram: User → Specify Data → Specify Model → Ingest and Clean Data → Conduct Modeling → Analyze and Visualize Results, supported by a Knowledge Graph, Semantic Browser, and Analytics Agent.]
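The log-odds computation at the core of such a risk workflow can be illustrated with a toy 2×2 exposure/disease table (counts invented for illustration, not NHANES data):

```python
# Does exposure to a factor increase the odds of disease?
import math

# 2x2 table of counts: rows are exposed/unexposed, columns diseased/healthy.
exposed_diseased, exposed_healthy = 30, 70
unexposed_diseased, unexposed_healthy = 10, 90

odds_exposed = exposed_diseased / exposed_healthy        # 30/70
odds_unexposed = unexposed_diseased / unexposed_healthy  # 10/90
odds_ratio = odds_exposed / odds_unexposed
log_odds_ratio = math.log(odds_ratio)  # > 0 means exposure raises the odds

print(round(odds_ratio, 2))      # 3.86
print(round(log_odds_ratio, 2))  # 1.35
```

In the actual workflow a logistic regression would estimate this log-odds coefficient while adjusting for the specified confounders (age, BMI); the 2×2 version shown here is the unadjusted special case.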
33. Semantic-analytics framework
to support precision health
Obtain goals from all stakeholders. One analyst’s goals:
• Integrate analytics with knowledge graphs to select
germane data, discover relevant patterns, predict
outcomes, and provide interpretations in response to
queries from users or cognitive agents.
• Design and demonstrate semantic analytics workflows
across the knowledge graph to support precision health
inquiries
• Discover new patterns and predict outcomes to create new
knowledge and insights from the knowledge graph with the
assistance of a cognitive computing agent.
Bennett, Erickson, McCusker, McGuinness, et al
34. Some Observations
• Ontologies are coming of age again… but in some different ways and as part of much larger ecosystems
• Champions are emerging from a number of fields (e.g., bio, environmental health, biostatistics, earth science, nanomaterials science)
• Ontologies can support question formation, validation, and answer generation in new ways
• Ontologies can support movement across abstraction levels
• Ontologies should not be built alone; community-requested, community-developed, and community-maintained resources are the future
• Ontology engineering is evolving to be more community-centric
• Building for longevity is now also an early consideration (the wine ontology taught me early lessons)
• Ontologies can help change the world when viewed as part of ecosystems… let’s change the world together!