A talk looking at the human factors of DSL usage, tool support for DSL users, and the benefits of using DSLs. In essence, an experience report on the use of DSLs in the development of a metadata management toolkit, with observations on DSL development in general.
This document discusses and provides information on four different concordancing tools that can be used for educational purposes: AntConc, AdTAT, Saffron, and TextSTAT. It provides the websites for each tool and briefly describes their functions, such as generating word frequency lists and concordances, analyzing texts in different languages and encodings, and performing textual searches using regular expressions. The document concludes by thanking the reader.
Speculating on the Future of the Metadata Standards Landscape - Jenn Riley
Riley, Jenn. "Speculating on the Future of the Metadata Standards Landscape." American Library Association Annual Meeting, June 25, 2011, New Orleans, LA.
MongoDB and the Connectivity Map: Making Connections Between Genetics and Dis... - MongoDB
The Broad Institute has developed a novel high-throughput gene-expression profiling technology and has used it to build an open-source catalog of over a million profiles that captures the functional states of cells when treated with drugs and other types of perturbations. Referred to as the Connectivity Map (or CMap), these data when paired with pattern matching algorithms, facilitate the discovery of connections between drugs, genes and diseases. We wished to expose this resource to scientists around the world via an API that is easily accessible to programmers and biologists alike. We required a database solution that could handle a variety of data types and handle frequent changes to the schema. We realized that a relational database did not fit our needs, and gravitated towards MongoDB for its ease of use, support for dynamic schema, complex data structures and expressive query syntax. In this talk, we’ll walk through how we built the CMap library. We’ll discuss why we chose MongoDB, the various schema design iterations and tradeoffs we’ve made, how people are using the API, and what we’re planning for the next generation of biomedical data.
This document discusses the use of concordancers in corpus linguistics and language teaching. A concordancer is a tool that allows users to search electronic texts and analyze word combinations and frequencies. The document provides examples of concordancer programs and discusses how they can be used by students, language teachers, and researchers. It then summarizes two articles that used concordancers - one to analyze metaphoric expressions used by doctors and patients, and another to teach medical students how to write academic research descriptions.
Entity Linking, Link Prediction, and Knowledge Graph Completion - Jennifer D'Souza
A survey presented at the International Winter School on Knowledge Graphs and Semantic Web 2020 http://www.kgswc.org/winter-school/; November 2020; DOI: 10.13140/RG.2.2.12523.77603
Introduction to OPEN DATA and other hypes (2017/18) - Julià Minguillón
This document introduces open data and related concepts. It defines open data as data that is freely available, accessible, and reusable. Key aspects of open data include it being in an open format, having open-source software to access and use it, and having open licenses without legal restrictions. The document discusses how open data can be used at various stages of the data life cycle from collection to analysis and visualization. Examples of tools for working with open data at different stages are also provided.
Svante Schubert presented on metadata and the new metadata model for OpenDocument Format 1.2. The new model addresses limitations of the current ODF metadata by making it more extensible and descriptive. It uses RDF and OWL to annotate content in a common way, aligning with semantic web standards. Metadata is stored in RDF files and linked to content elements via IDs. This allows software to more easily find, combine and share information. OpenOffice.org 3 will provide APIs to access and extend the new metadata capabilities.
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ... - Stuart Chalk
An electronic laboratory notebook (ELN) can be characterized as a system that allows scientists to capture the data and resources used in performing scientific experiments. This allows users to easily organize and find their data; however, little information about the scientific process is recorded.
In this paper we highlight the current status of progress toward semantic representation of science in ELNs.
Profiling systems have achieved notable adoption by research institutions [1]. Multi-site search of research profiling systems has substantially evolved since the first deployment of systems such as DIRECT2Experts [2]. CTSAsearch is a federated search engine using VIVO-compliant Linked Open Data (LOD) published by members of the NIH-funded Clinical and Translational Science (CTSA) consortium and other interested parties. Sixty-four institutions are currently included, spanning six distinct platforms and three continents (North America, Europe and Australia). In aggregate, CTSAsearch has data on 150-300 thousand unique researchers and their 10 million publications. The public interface is available at http://research.icts.uiowa.edu/polyglot.
Scientific Units in the Electronic Age - Stuart Chalk
Scientists have standardized on the SI unit system since the late 1700s. While much work has been done over the years to refine and redefine the system, little has formally been done to standardize the representation of SI units in electronic systems.
This paper will present a summary of current efforts toward electronic representation of scientific units in text, XML, and RDF, an analysis of needs for current computer/network systems, and an outline of future work.
Assignment 5 presentation (smaller w audio) - blewter8
Metadata interoperability allows metadata standards to communicate across system boundaries by preventing inaccuracies when sharing metadata. It is important to use controlled vocabularies from standard term sets for consistency. Seungmin Lee and Elin Jacob addressed interoperability between the MARC and FRBR metadata standards by categorizing their elements based on attributes and relationships, resulting in seven core categories that can be mapped between the two standards and allowing representation of both single and hierarchical structures. Achieving metadata interoperability can save time and should be a high priority when developing new standards.
The document discusses general trees, which are a type of tree data structure where each node can have zero or more children. It defines a general tree, lists some key properties like the number of nodes, height, root, leaves, and ancestors. The document also provides examples of different types of trees including binary trees, balanced trees, unbalanced trees, red-black trees, and B-trees. It briefly mentions simulating a general tree and implementing tree data structures in programming.
Open Research Knowledge Graph (ORKG) - an overview - Jennifer D'Souza
The ORKG makes scientific knowledge human- and machine-actionable and thus enables completely new ways of machine assistance. This will help researchers find relevant contributions to their field and create state-of-the-art comparisons and reviews. With the ORKG, scientists can explore knowledge in entirely new ways and share results even across different disciplines. This presentation offered an overview about the ORKG. The presentation was made on 15.7.2021 for the meeting of Lower Saxony librarian trainees.
A Generic Scientific Data Model and Ontology for Representation of Chemical Data - Stuart Chalk
The current movement toward openness and sharing of data is likely to have a profound effect on the speed of scientific research and the complexity of questions we can answer. However, a fundamental problem with currently available datasets (and their metadata) is heterogeneity in terms of implementation, organization, and representation.
To address this issue we have developed a generic scientific data model (SDM) to organize and annotate raw and processed data, and the associated metadata. This paper will present the current status of the SDM, the implementation of the SDM in JSON-LD, and the associated scientific data model ontology (SDMO). Example usage of the SDM to store data from a variety of sources will be discussed along with future plans for the work.
The document discusses making data FAIR (Findable, Accessible, Interoperable, and Reusable) through a novel combination of web technologies. It describes the core FAIR principles for each component - findable, accessible, interoperable, and reusable. It then discusses how applying these principles through an "internet-inspired" approach using existing standards and protocols could help make large, heterogeneous and complex data more actionable for various applications and users. The presentation provides examples of how this could work through a layered architecture similar to the internet, with shared standards and specifications at each layer.
How to build systems that find, access, exchange and reuse information from linked datasources? My keynote at the Platform Linked Data Netherlands Congress.
http://www.pilod.nl/wiki/Congres_Linked_Data_is_FAIR_voor_Iedereen_%E2%80%93_7_november_2018
Eureka Research Workbench: A Semantic Approach to an Open Source Electroni... - Stuart Chalk
Scientists are looking for ways to leverage web 2.0 technologies in the research laboratory and as a consequence a number of approaches to web-based electronic notebooks are being evaluated. In this presentation I discuss the Eureka Research Workbench, an electronic laboratory notebook built on semantic technology and XML. Using this approach the context of the information recorded in the laboratory can be captured and searched along with the data itself. A discussion of the current system is presented along with the next planned development of the framework and long-term plans relative to linked open data. Presented at the 246th American Chemical Society Meeting in Indianapolis, IN, USA on September 12th, 2013.
Rule-based Capture/Storage of Scientific Data from PDF Files and Export using... - Stuart Chalk
Recently, the US government has mandated that publicly funded scientific research data be made freely available in a usable form, allowing integration of data into other systems. While this mandate has been articulated, existing publications and new papers (PDF) still do not provide accessible data, meaning that their usefulness is limited without human intervention.
This presentation outlines our efforts to extract scientific data from PDF files, using the PDFToText software and regular expressions (regex), and process it into a form that structures the data and its context (metadata). Extracted data is processed (cleaned, normalized), organized, and inserted into a contextually developed MySQL database. The data and metadata can then be output using a generic JSON-LD based scientific data model (SDM) under development in our laboratory.
Data Wrangling in SQL & Other Tools :: Data Wranglers DC :: June 4, 2014 - Ryan B Harvey, CSDP, CSM
I gave a talk on the basics of SQL and its utility for data preprocessing and analysis tasks to the Data Wranglers DC meetup group, a member meetup of the Data Community DC (http://datacommunitydc.org).
The talk covered an introduction to relational data, database tools, and the SQL standard, as well as the basics of SQL select statements, common table expressions, and creating views from select statements. In addition, the use of relevant libraries in R and Python to connect to data in relational databases was explained using examples with PostgreSQL, IPython notebooks, and RMarkdown.
Talk information: http://www.meetup.com/Data-Wranglers-DC/events/171768162/
Talk materials: https://github.com/nihonjinrxs/dwdc-june2014
This document outlines several use cases for data transformation and processing on data.gov.uk. It describes how XML data is transformed to RDF using XSLT with parameters. It also discusses on-the-fly data transformations and complex nested data processing pipelines that include multiple steps like data enrichment. The challenges of representing provenance for non-digital and heterogeneous data from different systems are also summarized.
An increasing number of researchers rely on computational methods to generate the results described in their publications. Research software created to this end is heterogeneous (e.g., scripts, libraries, packages, notebooks, etc.) and usually difficult to find, reuse, compare and understand due to its disconnected documentation (dispersed in manuals, readme files, web sites, and code comments) and a lack of structured metadata to describe it. In this talk I will describe the main challenges for finding, comparing and reusing research software, how structured metadata can help to address some of them, the best practices being proposed by the community, and current initiatives to aid their adoption by researchers within EOSC.
Impact: The talk addresses an important aspect of the EOSC infrastructure for quality research software by ensuring that software contributed to the EOSC ecosystem can be found, compared and reused by researchers. The talk also aims to address metadata quality of current research products, which is critical for successful adoption.
Presented at the EOSC symposium
OpenRefine is a tool for working with messy data that allows for data profiling, cleaning, extension, and ETL prototyping. The document outlines two use cases - cleaning inconsistencies in a city thesaurus XML file with over 5,000 concepts, and preparing city data for a contest by cleaning duplicates, adding new data from another project, and using the tool's project history. OpenRefine is a cross-platform web application that can be used as an intermediate step between systems for data manipulation and cleaning.
This document discusses converting metadata to linked open data. It provides an overview of the process of mapping metadata fields and their values to URIs and standardized vocabularies. This involves selecting existing terms where possible, cleaning up field values, and manually mapping values that don't match existing terms. It also discusses tools for working with linked data and principles for publishing open data online.
OBA: An Ontology-Based Framework for Creating REST APIs for Knowledge Graphs - dgarijo
In this presentation we describe the Ontology-Based APIs framework (OBA), our approach to automatically create REST APIs from ontologies while following RESTful API best practices. Given an ontology (or ontology network), OBA uses standard technologies familiar to web developers (OpenAPI Specification, JSON) and combines them with W3C standards (OWL, JSON-LD frames and SPARQL) to create maintainable APIs with documentation, unit tests, automated validation of resources, and clients (in Python, JavaScript, etc.) for non-Semantic Web experts to access the contents of a target knowledge graph. We showcase OBA with three examples that illustrate the capabilities of the framework for different ontologies.
This article advocates that information storage requirements should not be expressed in the form of data models or conceptual schemas; rather, database structures should allow for any expression in a general-purpose language, while implementation constraints should be expressed as constraints on the use of that general-purpose language.
IRJET - An Efficient Way to Querying XML Database using Natural Language - IRJET Journal
This document discusses an efficient way to query XML databases using natural language. It proposes a framework that can accept English language queries and translate them into XQuery or SQL expressions to retrieve data from an XML database. The system performs linguistic processing to map tokens in the natural language query to XQuery fragments, then executes the translated query against the database. Existing approaches are discussed that typically use semantic and syntactic analysis to represent the query logically before translation, but have limitations in handling ambiguity. The proposed system aims to improve query translation accuracy by leveraging token relationships and classifications determined from natural language parsing.
Semantic Interoperability - grafi della conoscenza [knowledge graphs] - Giorgia Lodi
This document summarizes Giorgia Lodi's presentation on meaningful data and semantic interoperability in the Italian public sector. Lodi discusses issues with data quality such as missing values, semantic mismatches, and the use of strings instead of codes. She argues that adopting semantic web standards like RDF, OWL and SPARQL can help address these issues by linking data together and representing it semantically. Ontologies and knowledge graphs can be used to represent domain knowledge and infer new facts. Tools like FRED can generate knowledge graphs from unstructured text. Overall, Lodi argues that semantic web technologies have the potential to improve data interoperability and quality in the public sector, though challenges remain.
USING ONTOLOGIES TO OVERCOMING DRAWBACKS OF DATABASES AND VICE VERSA: A SURVEY - cseij
This document summarizes research on using ontologies to overcome drawbacks of databases and vice versa. It discusses how ontologies can be used to store and manage large numbers of database instances to improve performance. It also explains how databases can help address issues with ontologies, such as a lack of semantics, by providing structured storage. The document reviews drawbacks of both databases and ontologies and how each can help address limitations of the other through integration. This mutual benefit is an active area of research at the intersection of databases and ontologies.
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval - Mauro Dragoni
The presentation provides an overview of what an ontology is and how it can be used for representing information and for retrieving data, with a particular focus on the linguistic resources available for supporting this kind of task. It gives an overview of semantic-based retrieval approaches, highlighting the pros and cons of using semantic approaches with respect to classic ones. Use cases are presented and discussed.
The document discusses the Semantic Web, which aims to develop the current web so that machines can understand the meaning of information and not just display it. It outlines some key technologies being used like XML, RDF, and ontologies to add structure and meaning to web content. This will allow software agents to perform more sophisticated tasks by processing structured, machine-readable information based on defined ontologies. The Semantic Web represents an evolution from today's web designed primarily for humans to one where machines can also comprehend and utilize web content.
The technology of object-oriented databases was introduced to system developers in the late 1980s. Object DBMSs add database functionality to object programming languages. A major benefit of this approach is the unification of application and database development into a seamless data model and language environment. As a result, applications require less code, use more natural data modeling, and code bases are easier to maintain.
Gellish A Standard Data And Knowledge Representation Language And Ontology - Andries_vanRenssen
Database structures should allow for the expression of any fact that can be expressed in natural languages. This means that they should allow for any expression of facts in a universal formal language. Constraints should be specified in a separate layer. This article describes the basic concepts of such a universal semantic database structure and associated formal subset of a natural language.
The need of Interoperability in Office and GIS formats - Markus Neteler
Free GIS and Interoperability: The need of Interoperability in Office and GIS formats
GIS Open Source, interoperabilità e cultura del dato nei SIAT della Pubblica Amministrazione
[GIS Open Source, interoperability and the 'culture of data' in the spatial data warehouses of the Public Administration]
Introduction of Data Science and Data Analytics - VrushaliSolanke
Data science involves extracting meaningful insights from raw and structured data using scientific methods, technologies, and algorithms. It is a multidisciplinary field that uses tools to manipulate and analyze large amounts of data to find new and useful information. Data science uses powerful hardware, programming, and efficient algorithms to solve data problems and is the future of artificial intelligence. It involves collecting, preparing, analyzing, visualizing, managing, and preserving large data sets. Examples of data science applications include smart watches and Tesla's use of deep learning for self-driving cars.
A NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIR - cscpconf
Recent and continuing advances in online information systems are creating many opportunities and also new problems in information retrieval. Gathering information in different natural languages is the most difficult task, which often requires huge resources. Cross-language information retrieval (CLIR) is the retrieval of information for a query written in the user's native language. This paper deals with various classification techniques that can be used for solving the problems encountered in CLIR.
Reference Domain Ontologies and Large Medical Language Models.pptx - Chimezie Ogbuji
Large Language Models (LLMs) have exploded into the modern research and development consciousness and triggered an artificial intelligence revolution. They are well-positioned to have a major impact on Medical Informatics. However, much of the data used to train these revolutionary models are general-purpose and, in some cases, synthetically generated from LLMs. Ontologies are a shared and agreed-upon conceptualization of a domain and facilitate computational reasoning. They have become important tools in biomedicine, supporting critical aspects of healthcare and biomedical research, and are integral to science. In this talk, we will delve into ontologies, their representational and reasoning power, and how terminology systems such as SNOMED-CT, an international master terminology providing comprehensive coverage of the entire domain of medicine, can be used with Controlled Natural Languages (CNL) to advance how LLMs are used and trained.
Semantic Web: Technologies and Applications for Real-World - Amit Sheth
Amit Sheth and Susie Stephens, "Semantic Web: Technologies and Applications for Real-World," Tutorial at 2007 World Wide Web Conference, Banff, Canada.
Tutorial discusses technologies and deployed real-world applications through 2007.
Tutorial description at: http://www2007.org/tutorial-T11.php
The document discusses the concepts of semantic technology and the semantic web. It defines key concepts like tabula rasa, the network effect, and intelligence embedded in data through relationships. It also outlines technologies used in the semantic web like RDF, OWL, SPARQL, FOAF, and DBpedia and how search engines and companies are using these technologies for applications like sentiment analysis, natural language processing, and information extraction.
Terminologies play an important role in openEHR by enabling semantic interoperability between electronic health records. OpenEHR uses archetypes to define structured data sets and templates to specify use cases. Terminology bindings associate coded concepts and value sets from reference terminologies to openEHR data points. Different kinds of bindings are used including direct bindings and using local value sets. Future work includes developing context-dependent and compositional bindings.
These slides were presented as part of a W3C tutorial at the CSHALS 2010 conference (http://www.iscb.org/cshals2010). The slides are adapted from a longer introduction to the Semantic Web available at http://www.slideshare.net/LeeFeigenbaum/semantic-web-landscape-2009 .
A PDF version of the slides is available at http://thefigtrees.net/lee/sw/cshals/cshals-w3c-semantic-web-tutorial.pdf .
Concept and example of a semantic solution implemented with SQL views to cooperate with users on queries over structured data with independence from database schema knowledge and technology.
Pattern based approach for Natural Language Interface to Database - IJERA Editor
Natural Language Interface to Database (NLIDB) is an interesting and widely applicable research field. As the name suggests, an NLIDB allows a naive user to query a database in natural language. This paper presents an NLIDB, namely Pattern based Natural Language Interface to Database (PBNLIDB), in which patterns for simple queries, aggregate functions, relational operators, short-circuit logical operators and joins are defined. The patterns are categorized into valid and invalid. Valid patterns are directly used to translate a natural language query into a Structured Query Language (SQL) query, whereas an invalid pattern assists the query authoring service in generating options for the user so that the query can be framed correctly. The system takes an English language query as input, recognizes the pattern in the query, selects one of the aforementioned features of SQL based on the pattern, prepares an SQL statement, fires it on the database and displays the result.
The document defines key concepts related to database management systems (DBMS) including what a DBMS is, the different levels of database architecture (external, conceptual, internal), data definition language (DDL), normalization, entity relationship (ER) modeling, and database normalization forms. It provides examples to illustrate database concepts and discusses the advantages of using a DBMS compared to traditional file management systems.
Similar to Healthcare Data Management using Domain Specific Languages for Metadata Management - Splash2018 DSLDI Workshop
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W... - Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... - sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
End-to-end pipeline agility - Berlin Buzzwords 2024 - Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
Open Source Contributions to Postgres: The Basics POSETTE 2024 - ElizabethGarrettChri
Postgres is the most advanced open-source database in the world and it's supported by a community, not a single company. So how does this work? How does code actually get into Postgres? I recently had a patch submitted and committed and I want to share what I learned in that process. I’ll give you an overview of Postgres versions and how the underlying project codebase functions. I’ll also show you the process for submitting a patch and getting that tested and committed.
Build applications with generative AI on Google Cloud - Márton Kodok
We will explore Vertex AI - Model Garden powered experiences, we are going to learn more about the integration of these generative AI APIs. We are going to see in action what the Gemini family of generative models are for developers to build and deploy AI-driven applications. Vertex AI includes a suite of foundation models, these are referred to as the PaLM and Gemini family of generative ai models, and they come in different versions. We are going to cover how to use via API to: - execute prompts in text and chat - cover multimodal use cases with image prompts. - finetune and distill to improve knowledge domains - run function calls with foundation models to optimize them for specific tasks. At the end of the session, developers will understand how to innovate with generative AI and develop apps using the generative ai industry trends.
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai... - Kaxil Naik
Navigating today's data landscape isn't just about managing workflows; it's about strategically propelling your business forward. Apache Airflow has stood out as the benchmark in this arena, driving data orchestration forward since its early days. As we dive into the complexities of our current data-rich environment, where the sheer volume of information and its timely, accurate processing are crucial for AI and ML applications, the role of Airflow has never been more critical.
In my journey as the Senior Engineering Director and a pivotal member of Apache Airflow's Project Management Committee (PMC), I've witnessed Airflow transform data handling, making agility and insight the norm in an ever-evolving digital space. At Astronomer, our collaboration with leading AI & ML teams worldwide has not only tested but also proven Airflow's mettle in delivering data reliably and efficiently—data that now powers not just insights but core business functions.
This session is a deep dive into the essence of Airflow's success. We'll trace its evolution from a budding project to the backbone of data orchestration it is today, constantly adapting to meet the next wave of data challenges, including those brought on by Generative AI. It's this forward-thinking adaptability that keeps Airflow at the forefront of innovation, ready for whatever comes next.
The ever-growing demands of AI and ML applications have ushered in an era where sophisticated data management isn't a luxury—it's a necessity. Airflow's innate flexibility and scalability are what makes it indispensable in managing the intricate workflows of today, especially those involving Large Language Models (LLMs).
This talk isn't just a rundown of Airflow's features; it's about harnessing these capabilities to turn your data workflows into a strategic asset. Together, we'll explore how Airflow remains at the cutting edge of data orchestration, ensuring your organization is not just keeping pace but setting the pace in a data-driven future.
Session in https://budapestdata.hu/2024/04/kaxil-naik-astronomer-io/ | https://dataml24.sessionize.com/session/667627
Healthcare Data Management using Domain Specific Languages for Metadata Management. - Splash2018 DSLDI Workshop
1. Healthcare Data Management using Domain Specific Languages for Metadata Management
David Milward
This talk looks at:
human factors of DSLs
tool support for DSL users
studies of usability and other benefits of DSLs
experience reports of DSLs deployed in practice
5. • A data dictionary is not enough
• A catalogue is not enough
• The tools must be usable by domain experts.
• There will be more models than you think
6.
7.
8.
9.
10.
11. Within our language the prime artefacts are DataItems and Relationships, so it is essential to ensure that all relationships are valid; we therefore need a notion of referential integrity.
12. Referential integrity is provided out of the box with Xtext, which in turn leverages the Eclipse Modelling Framework (EMF), which in turn has an API for validation of Ecore models. This can be accessed directly for other validation, so one can avoid complicated grammar rules, since DSLs in Xtext are defined by grammar rules (similar to ANTLR).
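As a purely illustrative sketch of the notion described on slides 11 and 12 (not the toolkit's actual implementation, which obtains this via Xtext/EMF cross-references and the Ecore validation API), the referential-integrity requirement amounts to: every Relationship must point at DataItems that actually exist in the model. In plain Groovy, with hypothetical DataItem and Relationship classes:

```groovy
// Illustrative only: hypothetical classes standing in for the real metamodel.
class DataItem {
    String id
    String name
}

class Relationship {
    String sourceId
    String targetId
    String type          // e.g. "refines", "contains"
}

// A relationship is valid only if both ends resolve to a known DataItem.
List<Relationship> findDanglingRelationships(Collection<DataItem> items,
                                             Collection<Relationship> rels) {
    def knownIds = items*.id as Set
    rels.findAll { !(it.sourceId in knownIds) || !(it.targetId in knownIds) }
}

def items = [new DataItem(id: 'DI-1', name: 'Phenotypic Sex'),
             new DataItem(id: 'DI-2', name: 'Person Stated Gender')]
def rels  = [new Relationship(sourceId: 'DI-1', targetId: 'DI-2', type: 'refines'),
             new Relationship(sourceId: 'DI-1', targetId: 'DI-99', type: 'refines')]

// The second relationship dangles: DI-99 is not defined in the model.
assert findDanglingRelationships(items, rels)*.targetId == ['DI-99']
```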
13.
14. Refines – the idea of: can I use data item A instead of data item B?
Example: Sex is classified in a number of ways in NHS datasets; for Genomics England the key ones are:
1. Phenotypic Sex - 2 Female, 1 Male, 9 Indeterminate
2. Person Stated Gender - 1 Male, 2 Female, 9 Indeterminate (Unable to be classified as either male or female), X Not Known (PERSON STATED GENDER CODE not recorded)
3. Person Karyotypic Sex - XY : XX : XO : XXY : XYY : XXX : XXYY : XXXY : XXXX : other : unknown
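Purely as an illustration of the refinement idea on this slide (invented structures, not the toolkit's API), a refinement relationship can be thought of as carrying a value mapping that says whether one coding of sex can stand in for another:

```groovy
// Illustrative only: can "Person Stated Gender" be used in place of "Phenotypic Sex"?
// Source codes: 1 Male, 2 Female, 9 Indeterminate, X Not Known
// Target codes: 1 Male, 2 Female, 9 Indeterminate
def statedGenderToPhenotypicSex = [
        '1': '1',       // Male          -> Male
        '2': '2',       // Female        -> Female
        '9': '9',       // Indeterminate -> Indeterminate
        'X': null       // Not Known has no phenotypic equivalent
]

// One possible rule: the refinement holds only if every source code maps to a target code.
boolean refines(Map<String, String> valueMapping) {
    valueMapping.values().every { it != null }
}

assert !refines(statedGenderToPhenotypicSex)    // 'X' blocks this particular refinement
```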
15.
16.
17.
18.
19.
20.
21.
22. A very common Groovy 'builder' pattern is used for building the request object. This code is used behind the scenes to fetch details of data items, in particular the regular expressions used in DataElement specifications; the regular expressions are then used to generate "fake data".
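The actual builder code from the slide is not reproduced in these notes; the following is a minimal Groovy sketch of the general fluent-builder idiom it refers to, with invented class, property and endpoint names:

```groovy
// Illustrative builder idiom only; class, property and endpoint names are invented.
class DataElementRequest {
    String catalogueUrl
    String dataModel
    String dataElement
    String version

    String toUrl() {
        "${catalogueUrl}/dataModels/${dataModel}/dataElements/${dataElement}?version=${version}"
    }
}

class DataElementRequestBuilder {
    private final DataElementRequest request = new DataElementRequest()

    DataElementRequestBuilder catalogue(String url) { request.catalogueUrl = url; this }
    DataElementRequestBuilder model(String name)    { request.dataModel = name;   this }
    DataElementRequestBuilder element(String name)  { request.dataElement = name; this }
    DataElementRequestBuilder version(String v)     { request.version = v;        this }
    DataElementRequest build()                      { request }
}

def request = new DataElementRequestBuilder()
        .catalogue('https://example.org/catalogue')   // hypothetical endpoint
        .model('GelRareDiseaseModel')
        .element('PersonStatedGender')
        .version('1.0.0')
        .build()

println request.toUrl()
```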
My name is David Milward. I'm a PhD (DPhil) student at Oxford University's Department of Computer Science - a rather ancient student - and I previously worked in data interoperability for about 10 years, primarily for NATO. NATO is an organization with a large number of members who all want to share data; that is to say, they want to see everyone else's data, but they don't want anyone looking at theirs... I've since worked for the last 5 years in the healthcare sector exploring new ways of managing and integrating datasets, some of which involve Domain Specific Languages, which is why I am here talking.
I’ve partitioned the talk into the following sections for clarity and reference.
In fact I am going to be telling a story over the next 25 minutes. It starts in late 2013/2014, when I was a full-time DPhil (PhD) student at Oxford and was asked to provide a DSL to represent the ISO11179 standard for metadata registries, specifically in connection with a project my supervisor was conducting at the Oxford Biomedical Research Centre. We submitted a paper to this same conference in 2015 called "Domain-Specific Modelling for Clinical Research", which my colleague (at the time) Dr Sayeed Shah presented. The Oxford BRC continued working with Genomics England until early 2016, at which point they asked myself and some colleagues to take over the work, since it was no longer deemed 'research'; so we formed a small company to carry on this work with Genomics England and a number of other healthcare organizations. The core work has gone into a toolkit which I will give you a quick demonstration of at the end of this talk.
The story is not JUST about domain-specific languages; it is about a project that has used DSLs, and it is the documentation of that experience. We ended up using 3 DSLs: 1 based on the Xtext workbench, and 2 as Groovy (internal) DSLs. I'm going to tell the story from the beginning to give context, and then move on to cover the DSLs in turn in more depth.
Finally I will give a quick toolkit demonstration, showing how DSLs can be used in dataset validation.
The original project was to integrate datasets from 5 different healthcare trusts.
The approach was to write a small mini-language to try and capture the essence of ISO11179 – the standard for metadata registries. This approach had already been used in an exploratory project at Oxford, sponsored by Cancer Research UK, and had been used in the US for caCORE (G. A. Komatsoulis, D. B. Warzel, F. W. Hartel, K. Shanbhag, et al. caCORE version 3: Implementation of a model-driven, service-oriented architecture for semantic interoperability. Journal of Biomedical Informatics, 41(1):106–123, 2008).
Thus we were following up these lines of work.
The initial problem was that the core meta-model defined in ISO11179 was internally inconsistent and impossible to implement fully.
The first prototype resulted in many data items having to be entered twice or 3 times, hence we trimmed it back to Iteration 1.
At this point the team at Genomics England looked at our results and decided they wanted to sponsor the project going forward, this time to integrate 10-12 different datasets.
However, the terminology was still difficult, so we started from scratch.
The problem:
The left column showed the main artefacts used to express datasets using ISO11179
The middle column was our first iteration
The RHS column was our second (successful) iteration.
Names: Model Catalogue, Metadata Exchange, Metadata Registry, MML (Metadata Modelling Language), MDML (Metadata Management Language)
By this point in time the metadata registry had morphed into the Model Catalogue.
The results in the 2015 paper are listed as:
A data dictionary is not enough. A simple, flat list of data definitions does not support re-use at scale: it requires the user to place all of the contextual information into the definition of each data item, and mitigates against the automatic generation and application of definitions. Instead, a compositional approach is required, in which data elements are defined in explicit context.
A catalogue is not enough. The models in the catalogue must be linked to implementations, and to each other, with a considerable degree of automatic support. If the models are out of sync with the implementations, and with the data, then their value is sharply diminished. If you are going to manage data at scale, you need a data model-driven approach.
The tools must be usable by domain experts. To have the processes of model creation and maintenance mediated by software engineers is problematic: there may be misunderstandings regarding interpretation, but—more importantly—there are not enough software engineers to go around. An appropriate user interface, one that closely matches the intuition and expectations of domain experts, is essential.
There will be more models than you think. Different models will be required for different types of implementation, and—in any research domain, at least—data models will be constantly evolving, with data being collected against different versions.
Intelligent, automatic support is essential. The information content of precise data models is considerable, and there may be complex dependencies between data concepts and constraints. A considerable degree of automation is required if users are to cope with this complexity.
=================================================
The data standards being used in healthcare originate from different specialist areas, use different formats and have many overlaps, resulting in a number of different viewpoints over which standards are more or less useful.
This list gives an overview of the main standards currently used in the UK NHS.
Each standard has a different history, a different set of demands and a different 'language'.
A quick example would be the idea of pre-coordinated and post-coordinated terms in SNOMED CT.
This is the kind of existing Dataset that needs to be managed.
It is presented in spreadsheet form; the headers provide the definition of the metamodel, which is then transformed into the MDML format.
An example will be shown later on.
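As a purely illustrative aside (invented column names and output format, not the actual MDML syntax), the idea of treating the spreadsheet's header row as the metamodel and each subsequent row as one data element can be sketched like this:

```groovy
// Illustrative only: a tiny, hand-rolled version of the spreadsheet-to-model step.
// The header row defines the metamodel; each following row describes one data element.
def csv = '''
Element Name,Description,Data Type,Format
Person Stated Gender,Gender as stated by the person,CodedValue,[129X]
Year of Birth,Year part of the date of birth,Integer,\\d{4}
'''

def lines  = csv.readLines().findAll { it.trim() }
def header = lines.first().split(',')*.trim()
def dataElements = lines.drop(1).collect { line ->
    [header, line.split(',')*.trim()].transpose().collectEntries { it }
}

// The rendering below is invented; the real pipeline emits MDML.
dataElements.each { element ->
    println "dataElement \"${element['Element Name']}\" type ${element['Data Type']} format ${element['Format']}"
}
```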
Continuing the idea that Xtext builds on EMF, we are able, out of the box, to enforce referential integrity. After the first two iterations we made a detailed examination of what kind of language was needed, and we wrote this specification up using a formal language called Z.
The Z specification allowed us to work out what was required in the initial language.
Referential Integrity was required, as expressed in the snippet here.
This
Key Features
Unique ID for a DataModel – built in with version and status
Status relates to lifecycle – draft - finalized - superseded (deprecated)
GUID – unique identifier for a data item – e.g. status=draft 123123@0.01, or status=finalized 123123@1.00
Refines relates to the idea of: can data item A be used in place of data item B? e.g. sex.
Phenotypic Sex - 2 Female, 1 Male, 9 Indeterminate
Person Stated Gender - 1 Male, 2 Female, 9 Indeterminate (Unable to be classified as either male or female), X Not Known (PERSON STATED GENDER CODE not recorded)
Person Karyotypic Sex - XY : XX : XO : XXY : XYY : XXX : XXYY : XXXY : XXXX : other : unknown
So when we are mapping and transforming datasets, refinement relationships can be defined.
This idea has been removed from the latest versions, but may be re-introduced.
Relationships can be varied, and can map between different elements or data items
ExtensionItems are in effect pure "metadata" and enable transformations to work effectively – anything not expressible in the target dataset is stored in an ExtensionItem.
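A minimal Groovy sketch of that idea (invented field names, not the toolkit's API): when a source record is transformed into a target dataset, anything the target cannot express is kept as extension metadata instead of being discarded:

```groovy
// Illustrative only: transform a source record into a target dataset shape,
// keeping anything the target cannot express as "extension" metadata.
def targetFields = ['nhsNumber', 'yearOfBirth', 'statedGender'] as Set

def sourceRecord = [
    nhsNumber      : '9434765919',
    yearOfBirth    : '1972',
    statedGender   : '2',
    localWardCode  : 'W03',       // not representable in the target dataset
    recruitingSite : 'RTH'        // not representable in the target dataset
]

def transformed = [
    data      : sourceRecord.findAll { k, v -> k in targetFields },
    extensions: sourceRecord.findAll { k, v -> !(k in targetFields) }
]

assert transformed.extensions.keySet() == ['localWardCode', 'recruitingSite'] as Set
```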
Status – refer to previous explanation
Constraint – allows (a) constraints upon element types – i.e. a DataType can be constrained by a regular expression;
(b) a number of DataElements can be constrained as a group within a DataModel.
You can also write this from a Groovy script – see the web page documentation at the end.
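Since the notes say constraints can also be written from a Groovy script, here is a hedged sketch (illustrative logic only, not the toolkit's actual scripting interface) of the two constraint styles just described: a regular expression constraining a DataType, and a group constraint over several DataElements in a DataModel:

```groovy
// Illustrative only: the two constraint styles described in the notes.

// (a) A DataType constrained by a regular expression, e.g. "an NHS number is ten digits".
def nhsNumberRegex = /\d{10}/
assert '9434765919' ==~ nhsNumberRegex
assert !('943476591' ==~ nhsNumberRegex)

// (b) A group constraint across several DataElements within a DataModel:
//     here, "at least one of the sex/gender elements must be populated".
Closure<Boolean> atLeastOneOf = { List<String> elements, Map record ->
    elements.any { record[it] != null && record[it] != '' }
}

def record = [phenotypicSex: null, personStatedGender: '2', karyotypicSex: '']
assert atLeastOneOf(['phenotypicSex', 'personStatedGender', 'karyotypicSex'], record)
```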
Future work – to put a direct call into the metadata registry to get the item – i.e. within the DSL
What is happening is that the named item is searched for, an ID is found, and then a request is made to the server and the specifications are obtained.
This is the fake data being generated from the previous DSL.
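A hedged sketch of that last step (illustrative only; the real toolkit drives this from the regular expressions fetched for each DataElement, whereas this hand-rolled generator only understands two very simple pattern shapes):

```groovy
// Illustrative only: generate fake values for two very simple pattern shapes.
// In the toolkit the patterns come from the regular expressions held against each DataElement.
def random = new Random()

String fakeValueFor(String pattern, Random random) {
    def digitRun = (pattern =~ /^\\d\{(\d+)\}$/)          // shape like "\d{10}"
    if (digitRun.matches()) {
        int n = digitRun.group(1) as int
        return (1..n).collect { random.nextInt(10) }.join()
    }
    def codeList = (pattern =~ /^\[(.+)\]$/)              // shape like "[129X]"
    if (codeList.matches()) {
        def codes = codeList.group(1).toList()            // ['1', '2', '9', 'X']
        return codes[random.nextInt(codes.size())]
    }
    return 'UNSUPPORTED-PATTERN'
}

println fakeValueFor('\\d{10}', random)    // e.g. "4017385529"
println fakeValueFor('[129X]', random)     // e.g. "2"
```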