Many areas of scientific discovery rely on combining data from multiple data sources. However, there are many challenges in linking data. This presentation highlights these challenges in the context of using Linked Data for environmental and social science databases.
Dr. Katherine Skinner is the Executive Director of the Educopia Institute, a not-for-profit educational organization that builds networks and collaborative communities to help cultural, scientific, and scholarly institutions achieve greater impact.
This presentation is about challenging the roles we need to play in order to “move the needle” and carry academic publishing into the next generation of scholarly communications.
A discussion of the research paper 'An Efficient Approximate Protocol for Privacy-Preserving Association Rule Mining' by Murat Kantarcioglu, Robert Nix, and Jaideep Vaidya.
Privacy Preserved Distributed Data Sharing with Load Balancing Scheme (Editor IJMTER)
Data sharing services are provided in a Peer-to-Peer (P2P) environment. Federated database technology is used to manage locally stored data with a federated DBMS and to provide unified data access. Information brokering systems (IBSs) connect large-scale, loosely federated data sources via a brokering overlay. Information brokers redirect client queries to the requested data servers. Privacy-preserving methods are used to protect the data location and the data consumer. Brokers are trusted to apply server-side access control for data confidentiality. Query and access control rules are maintained, together with shared data details, as metadata. A semantic-aware index mechanism routes queries based on their content and allows users to submit queries without data or server information.
Distributed data sharing is managed with the Privacy Preserved Information Brokering (PPIB) scheme. Attribute-correlation and inference attacks are handled by PPIB. The PPIB overlay infrastructure consists of two types of brokering components: brokers and coordinators. The brokers act as mix anonymizers and are responsible for user authentication and query forwarding. The coordinators, concatenated in a tree structure, enforce access control and query routing based on automata. Automaton segmentation and query segment encryption schemes are used in the privacy-preserving query broker (QBroker). The automaton segmentation scheme logically divides the global automaton into multiple independent segments, as the sketch below illustrates. The query segment encryption scheme consists of pre-encryption and post-encryption modules.
The PPIB scheme is enhanced to support dynamic site distribution and a load balancing mechanism. Peer workloads and the trust level of each peer are integrated into the site distribution process. PPIB is improved to adopt a self-reconfiguration mechanism, and an automated decision support system for administrators is included.
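To make the segmentation idea concrete, here is a minimal toy sketch in Python, assuming a simple linear query automaton over path tokens; the data structures and names are invented for illustration and are not the scheme's actual design.

```python
# Toy automaton segmentation: a global query automaton (here accepting a
# single path such as /hospital/patient/record) is split into independent
# per-transition segments, each held by a different coordinator in the
# tree. Segment layout and names are hypothetical.

GLOBAL_AUTOMATON = ["hospital", "patient", "record"]

def segment(automaton):
    """Logically divide the global automaton into independent segments."""
    return [{"coordinator": i, "expects": token}
            for i, token in enumerate(automaton)]

def route(query_path, segments):
    """Each coordinator validates only its own segment, then forwards the
    rest of the query down the tree; no single node sees the whole rule."""
    tokens = query_path.strip("/").split("/")
    if len(tokens) != len(segments):
        return "rejected: wrong path length"
    for seg, token in zip(segments, tokens):
        if seg["expects"] != token:
            return f"rejected at coordinator {seg['coordinator']}"
    return "accepted: query forwarded to the data server"

segments = segment(GLOBAL_AUTOMATON)
print(route("/hospital/patient/record", segments))   # accepted
print(route("/hospital/billing/record", segments))   # rejected at coordinator 1
```

The point of splitting the automaton this way is that no single coordinator holds the complete access-control rule, which limits what a compromised node can leak.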
Brisbane Health-y Data: Queensland Data Linkage Framework (ARDC)
Presentation given by Trisha Johnston and Catherine Taylor at the 'Sharing Health-y Data Workshop: Challenges and Solutions' event co-hosted by ANDS and HISA. Held on Wednesday 16th March 2016 at the Translational Research Institute, Brisbane, Australia.
Everyone involved is concerned about the leakage of private data, i.e. the privacy of an individual's data. Today, data privacy is one of the most serious concerns that people face at both the individual and organisational level, and it has to be dealt with effectively using privacy-preserving data mining.
A Review Study on the Privacy Preserving Data Mining Techniques and Approaches
In this paper we review various privacy-preserving data mining techniques, such as data modification and secure multiparty computation, from different aspects.
Index Terms – Privacy and Security, Data Mining, Privacy Preserving, Secure Multiparty Computation (SMC) and Data Modification
Data (record) linkage brings together information from two different records that are believed to belong to the same person, based on matching variables.
If two records agree on all matching variables, it is unlikely that they agreed by chance, so the level of assurance that the link is correct (that the pair belongs to the same person) will be high.
If all of the matching variables disagree, the pair will not be linked, as it is unlikely to belong to the same person.
In intermediate situations, where some matching variables agree and some disagree, we need to predict whether the pair is a true match or a non-match; a simplified scoring sketch follows below.
Clerical intervention is often needed to determine matching status.
Data linkage is difficult in the presence of errors in collecting data and where no unique, high-quality identifier is available.
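A minimal Fellegi-Sunter-style sketch of the decision rule described in these bullets; the fields, weights and thresholds below are illustrative assumptions, not values from the presentation.

```python
# Simplified match scoring: agreement on a matching variable adds a
# positive weight, disagreement subtracts it; two thresholds separate
# matches, non-matches, and the clerical-review zone in between.
# All weights and thresholds are illustrative assumptions.

WEIGHTS = {"surname": 4.0, "forename": 2.5, "birth_year": 3.0, "sex": 1.0}
UPPER, LOWER = 6.0, 0.0   # >= UPPER: match; <= LOWER: non-match

def classify(rec_a, rec_b):
    score = sum(w if rec_a.get(field) == rec_b.get(field) else -w
                for field, w in WEIGHTS.items())
    if score >= UPPER:
        return "match"
    if score <= LOWER:
        return "non-match"
    return "clerical review"   # intermediate agreement pattern

a = {"surname": "Grant", "forename": "John", "birth_year": 1861, "sex": "M"}
b = {"surname": "Grant", "forename": "Iain", "birth_year": 1861, "sex": "M"}
print(classify(a, b))   # partial agreement lands in the clerical-review zone
```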
Several anonymization techniques, such as generalization and bucketization, have been designed for privacy-preserving microdata publishing. Recent work has shown that generalization loses a considerable amount of information, especially for high-dimensional data. Bucketization, on the other hand, does not prevent membership disclosure and cannot be applied to data that do not have a clear separation between quasi-identifying attributes and sensitive attributes. In this paper, we present a novel technique called slicing, which partitions the data both horizontally and vertically. We show that slicing preserves better data utility than generalization and can be used for membership disclosure protection. Another important advantage of slicing is that it can handle high-dimensional data. We show how slicing can be used for attribute disclosure protection and develop an efficient algorithm for computing sliced data that obey the ℓ-diversity requirement. Our workload experiments confirm that slicing preserves better utility than generalization and is more effective than bucketization in workloads involving the sensitive attribute. Our experiments also demonstrate that slicing can be used to prevent membership disclosure.
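A minimal sketch of the slicing idea from this abstract: attributes are grouped into columns (the vertical partition), tuples into buckets (the horizontal partition), and each column group is independently permuted within every bucket to break the join between quasi-identifiers and sensitive values. The tiny table and groupings are invented for illustration.

```python
import random

# Toy slicing: vertical partition into column groups, horizontal
# partition into buckets, then an independent random permutation of
# each column group within every bucket.

rows = [
    {"age": 23, "zip": "11000", "disease": "flu"},
    {"age": 27, "zip": "11002", "disease": "cancer"},
    {"age": 52, "zip": "47901", "disease": "flu"},
    {"age": 58, "zip": "47905", "disease": "asthma"},
]
columns = [("age", "zip"), ("disease",)]   # quasi-identifiers | sensitive
buckets = [rows[:2], rows[2:]]             # horizontal partition

def slice_table(buckets, columns):
    sliced = []
    for bucket in buckets:
        groups = []
        for group in columns:
            values = [tuple(r[attr] for attr in group) for r in bucket]
            random.shuffle(values)         # permute within the bucket
            groups.append(values)
        # within a bucket, column groups are now paired arbitrarily
        sliced.extend(zip(*groups))
    return sliced

for sliced_row in slice_table(buckets, columns):
    print(sliced_row)
```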
What are the research and technical challenges of linked data that are relevant to data science?
This presentation introduces the ideas of linked data using the BBC sport web site as an example. It then identifies several research challenges that remain to be addressed.
WESCML: A Data Standard for Exchanging Water and Energy Supply and Consumption... (Jonathan Yu)
Slides from a talk given at the International Hydroinformatics Conference, Incheon, South Korea, 22 Aug '16, on the Water and Energy Supply and Consumption markup language data standard (WESCML) and its supporting tools.
http://wescml.org
Paper here: http://dx.doi.org/10.1016/j.proeng.2016.07.451
Presentation by Stuart Macdonald of the Edinburgh University Data Library at the Graduate School of Social and Political Science Induction, 15 and 16 September 2011, University of Edinburgh
CeRDI Research RUN Vietnam Agriculture GroupHelen Thompson
Federation University's Centre for eResearch and Digital Innovation (CeRDI) is participating in the Regional University Network (RUN) Vietnam Agriculture Group. This presentation provides some background on CeRDI initiatives in eResearch.
Areas of focus include capacity building and engagement, research collaborations around soil management, water resources, land use, crop productivity, climate change and adaptation, biodiversity, participatory GIS and citizen science.
Major technology and research trends link to ubiquitous high-speed broadband, the petabyte age, open data policies and the opportunities for universities, particularly regional universities, to play a significant role in generating insight from data.
Mobile technologies… App development and responsive design – for student and staff recruitment, engagement, knowledge transfer
3D and visualisation technologies… Massive innovation and research opportunities
Centre for eResearch and Digital Innovation - Research OverviewHelen Thompson
The Centre for eResearch and Digital Innovation (CeRDI) is a Federation University Australia (FedUni) Centre focused on:
• The application of information communications technology (ICT) and the development of innovative, world class knowledge management systems;
• Significantly advancing the digital literacy and knowledge management capabilities and capacity of partner organisations;
• Fostering the development and implementation of eResearch within academia and industry; and
• Measuring the impact of eResearch and digital innovation through longitudinal research.
CeRDI is also gaining national and international recognition in innovative spatial information systems.
This presentation showcases some of the diverse range of projects that are being supported by the team at CeRDI.
Projects are at various stages of their evolution with many sharing common goals to inform ‘big picture’ understanding and enhance decision making, create greater efficiencies in communication, increase the quality of information and support policy formulation and evaluation.
A roundtable with Peter McKeague (RCAHMS) and Stefano Campana (McDonald Research Institute, University of Cambridge and the University of Siena) at the Computer Applications and Quantitative Methods in Archaeology conference at the University of Siena on 1st April 2015.
This round table session seeks to build a case for developing a thematic SDI. But is a thematic SDI even necessary, with existing digital infrastructure initiatives – Archaeolandscapes (Arcland), ARIADNE and Europeana – in place? Where are the current initiatives and exemplar projects for harmonising spatial data, particularly for data created through fieldwork and scientific analysis?
Week 13 (Apr. 8) – Assemblages, Genealogies and Dynamic Nominalism
Course description:
The emphasis is on learning to envision data genealogically, as social and technical assemblages, as infrastructure, and to reframe them beyond technological conceptions. During the term we will explore data, facts and truth; the power of data both big and small; governmentality and biopolitics; risk, probability and the taming of chance; algorithmic culture, dynamic nominalism, categorization and ontologies; the translation of people, space and social phenomena into and by data and software; and the role of data in the production of knowledge.
The class is run as a graduate MA seminar and a collaborative workshop. We will work with Ottawa Police Services and critically examine the socio-technical data assemblage of that institution. This includes a field trip to the Elgin Street station and a tour of the 911 Communication Centre, and we will meet with data experts.
Big data and the dark arts - Jisc Digital Media 2015 (Jisc)
A certain misunderstanding still remains about the very definition of "big data" and the perceived hype around the term. This workshop clarified the concepts and gave examples of relevant big data projects.
Tracey P. Lauriault (Programmable City team)
A genealogy of open data assemblages
Abstract: Evidence-informed decision making, participatory public policy, government transparency and accountability, sustainable development, and data-driven journalism were the initial drivers of making public data accessible. The access work of geomaticians, researchers, librarians, community developers and journalists has recently been recast as open data, which includes a different set of actors. As open data matures as a practice, its principles, definitions and guidelines have been transformed into national performance indicators such as indexes, barometers, ratings and score cards; private sector firms such as Gartner, McKinsey, and Deloitte are touting open data's innovation and business opportunities; and smart city initiatives offer tools and expertise to help governments sense, monitor, measure and evaluate their cities. Open data today seems to have evolved far from its original ideals, even with civil society players such as Markets for Good, Sunlight Foundation, Open Knowledge Foundation, Code for America, and many others advocating for more social approaches. This talk proposes an assemblage approach to understanding open data and provides a genealogy of its development in different contexts and places.
Bio: Tracey P. Lauriault is a Programmable City Project postdoctoral researcher focusing on how digital data are generated and processed about cities and their citizens. She arrives from Canada, where she was a researcher with the Geomatics and Cartographic Research Centre at Carleton University, investigating data, infrastructures and geographical imaginations; spatial data infrastructures; open data and the preservation of and access to research and geomatics data; legal and policy issues associated with geospatial, administrative and civil society data; and cybercartography. She is a member of the international Research Data Alliance (RDA) Legal Interoperability Working Group and the Natural Resources Canada Roundtable on Geomatics Legal and Policy Interest Group. She is also actively engaged in public policy research as it pertains to open data and their related infrastructures.
This presentation was given by Kirsty Lingstadt and Peter McKeague of RCAHMS at a one-day seminar, Towards a Collaborative Strategy for sector information management (TACOS) in York on 14 May 2014.
http://www.archaeologists.net/groups/imsig/tacos
Using a Jupyter Notebook to perform a reproducible scientific analysis over s... (Alasdair Gray)
In recent years there has been a reproducibility crisis in science. Computational notebooks, such as Jupyter, have been touted as one solution to this problem. However, when executing analyses over live SPARQL endpoints, we get different answers depending upon when the analysis in the notebook was executed. In this paper, we identify some of the issues discovered in trying to develop a reproducible analysis over a collection of biomedical data sources and suggest some best practice to overcome these issues.
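One practical safeguard suggested by this problem is to record, alongside every result set, when and against which endpoint a query was executed, so later re-runs can be compared with the original. A minimal sketch using the SPARQLWrapper library; the endpoint URL and query are placeholders, and this is one illustration rather than the paper's prescribed best practice.

```python
from datetime import datetime, timezone
from SPARQLWrapper import SPARQLWrapper, JSON

# Capture basic provenance with each query result so a notebook re-run
# can be compared against the original execution.
ENDPOINT = "https://sparql.example.org/sparql"   # placeholder endpoint
QUERY = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"

def run_with_provenance(endpoint, query):
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return {
        "endpoint": endpoint,
        "query": query,
        "executed_at": datetime.now(timezone.utc).isoformat(),
        "results": results,
    }

record = run_with_provenance(ENDPOINT, QUERY)
print(record["executed_at"])
print(record["results"]["results"]["bindings"])
```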
Bioschemas Community: Developing profiles over Schema.org to make life scienc... (Alasdair Gray)
The Bioschemas community (http://bioschemas.org) is a loose collaboration formed by a wide range of life science resource providers and informaticians. The community is developing profiles over Schema.org to enable life science resources, such as data about a specific protein, sample, or training event, to be more discoverable on the web. While the content of well-known resources such as Uniprot (for protein data) is easily discoverable, there is a long tail of specialist resources that would benefit from embedding Schema.org markup in a standardised approach.
The community have developed twelve profiles for specific types of life science resources (http://bioschemas.org/specifications/), with another six at an early draft stage. For each profile, a set of use cases have been identified. These typically focus on search, but several facilitate lightweight data exchange to support data aggregators such as Identifiers.org, FAIRsharing.org, and BioSamples. The next stage of the development of a profile consists of mapping the terms used in the use cases to existing properties in Schema.org and domain ontologies. The properties are then prioritised in order to support the use cases, with a minimal set of about six properties identified, along with a larger set of recommended and optional properties. For each property, an expected cardinality is defined and where appropriate, object values are specified from controlled vocabularies. Before a profile is finalised, it must first be demonstrated that resources can deploy the markup.
In this talk, we will outline the progress that has been made by the Bioschemas Community in a single year through three hackathon events. We will discuss the processes followed by the Bioschemas Community to foster collaboration, and highlight the benefits and drawbacks of using open Google documents and spreadsheets to support the community develop the profiles. We will conclude by summarising future opportunities and directions for the community.
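To make the profile-development process described above concrete, here is a hypothetical fragment of Bioschemas-style JSON-LD markup for a protein page, generated from Python. The property selection only sketches the "minimal set" idea; consult the published Protein profile for the authoritative properties and cardinalities.

```python
import json

# Hypothetical Schema.org/Bioschemas-style JSON-LD for a protein record.
# Types, properties and identifiers here are illustrative, not the
# profile's normative definition.
markup = {
    "@context": "https://schema.org",
    "@type": "Protein",
    "@id": "https://example.org/protein/P12345",
    "name": "Example protein",
    "identifier": "P12345",
    "url": "https://example.org/protein/P12345",
    "isPartOf": {"@type": "Dataset", "name": "Example protein resource"},
}

# The JSON-LD would be embedded in the resource's HTML page inside a
# <script type="application/ld+json"> element for crawlers to harvest.
print(json.dumps(markup, indent=2))
```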
An Identifier Scheme for the Digitising Scotland Project (Alasdair Gray)
The Digitising Scotland project is having the vital records of Scotland transcribed from images of the original handwritten civil registers. Linking the resulting dataset of 24 million vital records covering the lives of 18 million people is a major challenge requiring improved record linkage techniques. Discussions within the multidisciplinary, widely distributed Digitising Scotland project team have been hampered by the teams in each of the institutions using their own identification scheme. To enable fruitful discussions within the Digitising Scotland team, we required a mechanism for uniquely identifying each individual represented on the certificates. From the identifier it should be possible to determine the type of certificate and the role each person played. We have devised a protocol to generate, for any individual on a certificate, a unique identifier, without using a computer, by exploiting the National Records of Scotland's registration districts. Importantly, the approach does not rely on the handwritten content of the certificates, which reduces the risk of the content being misread resulting in an incorrect identifier. The resulting identifier scheme has improved the internal discussions within the project. This paper discusses the rationale behind the chosen identifier scheme, and presents the format of the different identifiers. The work reported in the paper was supported by the British ESRC under grants ES/K00574X/1 (Digitising Scotland) and ES/L007487/1 (Administrative Data Research Centre - Scotland).
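A hypothetical sketch of composing such an identifier from printed register metadata (certificate type, registration district, year, entry number) plus the person's role; the actual format used by the Digitising Scotland project differs, and the district code below is invented.

```python
# Compose a person identifier from printed register metadata and the
# person's role on the certificate, avoiding any handwritten content.
# The field order and separators are hypothetical, not the project's
# published format.

def person_id(cert_type, district, year, entry, role):
    return f"{cert_type}/{district}/{year}/{entry}/{role}"

# e.g. the bride on entry 12 of an 1884 marriage register
print(person_id("M", "644-01", 1884, 12, "bride"))   # M/644-01/1884/12/bride
```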
Supporting Dataset Descriptions in the Life Sciences (Alasdair Gray)
Machine processable descriptions of datasets can help make data more FAIR; that is Findable, Accessible, Interoperable, and Reusable. However, there are a variety of metadata profiles for describing datasets, some specific to the life sciences and others more generic in their focus. Each profile has its own set of properties and requirements as to which must be provided and which are more optional. Developing a dataset description for a given dataset to conform to a specific metadata profile is a challenging process.
In this talk, I will give an overview of some of the dataset description specifications that are available. I will discuss the difficulties in writing a dataset description that conforms to a profile, and the tooling that I've developed to support dataset publishers in creating metadata descriptions and validating them against a chosen specification.
Seminar talk given at the EBI on 5 April 2017
Tutorial: Describing Datasets with the Health Care and Life Sciences Communit... (Alasdair Gray)
Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting HCLS community profile covers elements of description, identification, attribution, versioning, provenance, and content summarization. The HCLS community profile reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine readable descriptions of versioned datasets. The goal of this tutorial is to explain elements of the HCLS community profile and to enable users to craft and validate descriptions for datasets of interest.
Validata: A tool for testing profile conformance (Alasdair Gray)
Validata (http://hw-swel.github.io/Validata/) is an online web application for validating a dataset description expressed in RDF against a community profile expressed as a Shape Expression (ShEx). Additionally it provides an API for programmatic access to the validator. Validata is capable of being used for multiple community agreed standards, e.g. DCAT, the HCLS community profile, or the Open PHACTS guidelines, and there are currently deployments to support each of these. Validata can be easily repurposed for different deployments by providing it with a new ShEx schema. The Validata code is available from GitHub (https://github.com/HW-SWeL/Validata).
Presentation given at SDSVoc https://www.w3.org/2016/11/sdsvoc
The HCLS Community Profile: Describing Datasets, Versions, and Distributions (Alasdair Gray)
Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting HCLS community profile covers elements of description, identification, attribution, versioning, provenance, and content summarization. The HCLS community profile reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine readable descriptions of versioned datasets.
The goal of this presentation is to give an overview of the HCLS Community Profile and explain how it extends and builds upon other approaches.
Presentation given at SDSVoc (https://www.w3.org/2016/11/sdsvoc/)
Presentation given at the Open PHACTS project symposium.
The slides give an overview of the data in the 2.0 Open PHACTS drug discovery platform and the challenges that have been faced in the Open PHACTS project to reach this stage.
This presentation was prepared for my faculty Christmas conference.
Abstract: For the last 11 months I have been working on a top secret project with a world renowned Scandinavian industry partner. We are now moving into the exciting operational phase of this project. I have been granted an early lifting of the embargo that has stopped me talking about this work up until now. I will talk about the data science behind this big data project and how semantic web technology has enabled the delivery of Project X.
Data Integration in a Big Data Context: An Open PHACTS Case Study (Alasdair Gray)
Keynote presentation at the EU Ambient Assisted Living Forum workshop The Crusade for Big Data in the AAL Domain.
The presentation explores the Open PHACTS project and how it overcame various Big Data challenges.
Data is being generated all around us – from our smart phones tracking our movement through a city to the city itself sensing various properties and reacting to various conditions. However, to maximise the potential from all this data, it needs to be combined and coerced into models that enable analysis and interpretation. In this talk I will give an overview of the techniques that I have developed for data integration: integrating streams of sensor data with background contextual data and supporting multiple interpretations of linking data together. At the end of the talk I will overview the work I will be conducting in the Administrative Data Research Centre for Scotland.
Scientific lenses to support multiple views over linked chemistry data (Alasdair Gray)
When are two entries about a small molecule in different datasets the same? If they have the same drug name, chemical structure, or some other criteria? The choice depends upon the application to which the data will be put. However, existing Linked Data approaches provide a single global view over the data with no way of varying the notion of equivalence to be applied.
In this paper, we present an approach that enables applications to choose the equivalence criteria to apply between datasets, thus supporting multiple dynamic views over the Linked Data. For chemical data, we show that multiple sets of links can be automatically generated according to different equivalence criteria and published with semantic descriptions capturing their context and interpretation. This approach has been applied within a large scale public-private data integration platform for drug discovery. To cater for different use cases, the platform allows the application of different lenses which vary the equivalence rules to be applied based on the context and interpretation of the links.
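A minimal sketch of the lens idea: every published link carries the equivalence criterion under which it holds, and an application activates only the lenses appropriate to its task. The link data and lens names below are invented for illustration.

```python
# Links between dataset entries, each annotated with the equivalence
# criterion (lens) under which it was generated. Identifiers and lens
# names are invented.
LINKS = [
    ("chembl:25",   "drugbank:DB00945", "same_structure"),
    ("chembl:25",   "kegg:D00109",      "same_drug_name"),
    ("chembl:1642", "drugbank:DB00945", "same_parent_compound"),
]

def equivalences(entity, active_lenses):
    """Entities co-referent with `entity` under the chosen lenses."""
    return {target for source, target, lens in LINKS
            if source == entity and lens in active_lenses}

# A structure-focused application sees fewer equivalences than a
# name-tolerant one, over exactly the same published link sets.
print(equivalences("chembl:25", {"same_structure"}))
print(equivalences("chembl:25", {"same_structure", "same_drug_name"}))
```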
Scientific Lenses over Linked Data: An approach to support multiple integrate... (Alasdair Gray)
When are two entries about a concept in different datasets the same? If they have the same name, properties, or some other criteria? The choice depends upon the application to which the data will be put. However, existing Linked Data approaches provide a single global view over the data with no way of varying the notion of equivalence to be applied.
In this presentation, I will introduce Scientific lenses, an approach that enables applications to vary the equivalence conditions between linked datasets. They have been deployed in the Open PHACTS Discovery Platform – a large scale data integration platform for drug discovery. To cater for different use cases, the platform allows the application of different lenses which vary the equivalence rules to be applied based on the context and interpretation of the links.
Describing Scientific Datasets: The HCLS Community Profile (Alasdair Gray)
Big Data presents an exciting opportunity to pursue large-scale analyses over collections of data in order to uncover valuable insights across a myriad of fields and disciplines. Yet, as more and more data is made available, researchers are finding it increasingly difficult to discover and reuse these data. One problem is that data are insufficiently described to understand what they are or how they were produced. A second issue is that no single vocabulary provides all key metadata fields required to support basic scientific use cases. A third issue is that data catalogs and data repositories all use different metadata standards, if they use any standard at all, and this prevents easy search and aggregation of data. Therefore, we need a community profile to indicate what are the essential metadata, and the manner in which we can express it.
The W3C Health Care and Life Sciences Interest Group have developed such a community profile that defines the required properties to provide high-quality dataset descriptions that support finding, understanding, and reusing scientific data, i.e. making the data FAIR (Findable, Accessible, Interoperable and Re-usable – http://datafairport.org). The specification reuses many notions and vocabulary terms from Dublin Core, DCAT and VoID, with provenance and versioning information being provided by PROV-O and PAV. The community profile is based around a three-tier model: the summary description captures catalogue-style metadata about the dataset, each version of the dataset is described separately, as are the various distribution formats of these versions. The resulting community profile is generic and applicable to a wide variety of scientific data.
Tools are being developed to help with the creation and validation of these descriptions. Several datasets including those from Bio2RDF, EBI and IntegBio are already moving to release descriptions conforming to the community profile.
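A minimal rdflib sketch of the three-tier model just described; the IRIs are placeholders and the properties shown are a small, assumed subset of the profile rather than its full requirements.

```python
from rdflib import Graph, Literal, Namespace, URIRef

DCT = Namespace("http://purl.org/dc/terms/")
DCAT = Namespace("http://www.w3.org/ns/dcat#")
PAV = Namespace("http://purl.org/pav/")

g = Graph()
summary = URIRef("http://example.org/dataset/chem")
version = URIRef("http://example.org/dataset/chem/2.1")
dist = URIRef("http://example.org/dataset/chem/2.1/ttl")

# Tier 1: summary-level (catalogue-style) description
g.add((summary, DCT.title, Literal("Example chemistry dataset")))
g.add((summary, PAV.hasCurrentVersion, version))

# Tier 2: one specific version of the dataset
g.add((version, PAV.version, Literal("2.1")))
g.add((version, DCT.isVersionOf, summary))
g.add((version, DCAT.distribution, dist))

# Tier 3: one distribution (serialization format) of that version
g.add((dist, DCAT.downloadURL,
       URIRef("http://example.org/downloads/chem-2.1.ttl")))

print(g.serialize(format="turtle"))
```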
SensorBench is a benchmark suite for wireless sensor networks. The design of wireless sensor network systems sits within a multi-dimensional design space, where it can be difficult to understand the implications of specific decisions and to identify optimal solutions. SensorBench enables the systematic analysis and comparison of different techniques and platforms, enabling both development and user communities to make well informed choices. The benchmark identifies key variables and performance metrics, and specifies experiments that explore how different types of task perform under different metrics for the controlled variables. The benchmark is demonstrated by its application on representative platforms.
Full details of the benchmark are available from http://dl.acm.org/citation.cfm?id=2618252 (DOI: 10.1145/2618243.2618252)
Dataset Descriptions in Open PHACTS and HCLS (Alasdair Gray)
This presentation gives an overview of the dataset description specification developed in the Open PHACTS project (http://www.openphacts.org/). The creation of the specification was driven by a real need within the project to track the datasets used.
Details of the dataset metadata captured and the vocabularies used to model this metadata are given together with the tools developed to enable the specification's uptake.
Over the course of the last 12 months, the W3C Healthcare and Life Science Interest Group have been developing a community profile for dataset descriptions. This has drawn on the ideas developed in the Open PHACTS specification. A brief overview of the forthcoming community profile is given in the presentation.
This presentation was given to the Network Data Exchange project http://www.ndexbio.org/ on 2 April 2014.
Computing Identity Co-Reference Across Drug Discovery Datasets (Alasdair Gray)
This paper presents the rules used within the Open PHACTS (http://www.openphacts.org) Identity Management Service to compute co-reference chains across multiple datasets. The web of (linked) data has encouraged a proliferation of identifiers for the concepts captured in datasets; with each dataset using their own identifier. A key data integration challenge is linking the co-referent identifiers, i.e. identifying and linking the equivalent concept in every dataset. Exacerbating this challenge, the datasets model the data differently, so when is one representation truly the same as another? Finally, different users have their own task and domain specific notions of equivalence that are driven by their operational knowledge. Consumers of the data need to be able to choose the notion of operational equivalence to be applied for the context of their application. We highlight the challenges of automatically computing co-reference and the need for capturing the context of the equivalence. This context is then used to control the co-reference computation. Ultimately, the context will enable data consumers to decide which co-references to include in their applications.
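Once the pairwise links accepted under a chosen notion of equivalence are fixed, the co-reference chains are their transitive closure. A standard union-find sketch of that closure step follows; the identifiers and accepted links are invented, and the real service's rules are far richer.

```python
# Compute co-reference chains as the transitive closure of accepted
# pairwise links, using union-find with path halving.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

accepted_links = [   # pairs accepted under the chosen equivalence rules
    ("chembl:1642", "chebi:41423"),
    ("chebi:41423", "drugbank:DB00945"),
]
for a, b in accepted_links:
    union(a, b)

chains = {}
for x in list(parent):
    chains.setdefault(find(x), set()).add(x)
print(list(chains.values()))   # one chain containing all three identifiers
```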
Incorporating Commercial and Private Data into an Open Linked Data Platform f... (Alasdair Gray)
The Open PHACTS Discovery Platform aims to provide an integrated information space to advance pharmacological research in the area of drug discovery. Effective drug discovery requires comprehensive data coverage, i.e. integrating all available sources of pharmacology data. While many relevant data sources are available on the linked open data cloud, their content needs to be combined with that of commercial datasets and the licensing of these commercial datasets respected when providing access to the data. Additionally, pharmaceutical companies have built up their own extensive private data collections that they require to be included in their pharmacological dataspace. In this paper we discuss the challenges of incorporating private and commercial data into a linked dataspace: focusing on the modelling of these datasets and their interlinking. We also present the graph-based access control mechanism that ensures commercial and private datasets are only available to authorized users.
http://link.springer.com/chapter/10.1007/978-3-642-41338-4_5
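A minimal sketch of the graph-based access control idea from this abstract: each named graph carries the licence or visibility it requires, and a user's query is answered only over the graphs their authorisations cover. Graph names and licence labels are invented.

```python
# Named graphs labelled with the licence or visibility they require; a
# user's effective dataspace is the subset their authorisations cover.
GRAPH_LICENCES = {
    "graph:open-pharmacology": "open",
    "graph:commercial-db":     "commercial-licence-A",
    "graph:pharma-private":    "private-org-X",
}

def visible_graphs(user_authorisations):
    return [g for g, licence in GRAPH_LICENCES.items()
            if licence == "open" or licence in user_authorisations]

# An unauthenticated academic user sees only open data; a licensed
# partner sees the commercial graph as well. A SPARQL engine would then
# scope the query with FROM / FROM NAMED clauses for these graphs only.
print(visible_graphs(set()))
print(visible_graphs({"commercial-licence-A"}))
```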
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Opendatabay - Open Data Marketplace.pptx (Opendatabay)
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
The first ever open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. It leverages cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex: Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay also breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
2. Estuarine Flooding
Financial implications
Damage
Loss of business
Personal factors
Emotional impact
Flood prediction
Locations
Severity
Requires correlating
Sea-state data
Weather forecasts
Details of sea defences
Response Planning
Evacuation routes
Personnel deployment
…
Requires more data
Traffic reports
Shipping
…
Image: http://www.metro.co.uk/
3. Flood Prediction
Solent Use Case
Busy shipping channel
Two major ports
Complex tidal and wave patterns
6. Data Linkage and Querying
Web of Data
7. Linked Data Approach
1. Global ID – URI
2. Resolvable ID
3. Useful content
HTML for humans
RDF for machines
4. Link to other resources
Like the Web, but for data!
“RDF and OWL do not solve the interoperability problem, they just lay it bare on the table!”
10. Querying Approach
Use ontologies as common model
Requires:
Representation of data: sensors and databases
Establishing mappings between ontology models and data source schemas
Accessing data sources through queries over ontology model
Expressing continuous queries over sensors
(see the mapping sketch below)
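A minimal sketch of the mapping idea on this slide: an ontology term is bound to per-source access expressions, and a request phrased against the ontology is rewritten into one query per source. All names, streams and SQL here are invented for illustration.

```python
# Ontology-mediated access: one ontology term, several source-specific
# bindings (a sensor stream and a relational archive). All names are
# invented.
MAPPINGS = {
    "ont:WaveHeight": [
        {"source": "sensor_network", "stream": "wave_buoy_3",
         "field": "height_m"},
        {"source": "archive_db",
         "sql": "SELECT height_m FROM wave_obs WHERE site = 'solent'"},
    ],
}

def rewrite(ontology_term):
    """Rewrite an ontology-level request into per-source queries."""
    for binding in MAPPINGS.get(ontology_term, []):
        if binding["source"] == "sensor_network":
            # a continuous query over the live sensor stream
            yield f"SUBSCRIBE {binding['stream']}.{binding['field']}"
        else:
            # a one-shot query over the stored data
            yield binding["sql"]

for query in rewrite("ont:WaveHeight"):
    print(query)
```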
11. WSN Resource Concerns
Energy
Running off battery
Computation Capabilities
Limited CPU
Limited memory
Limited storage
Radio Transmission
Limited range
Energy impact
Lost transmissions
12. Data Matching
Administrative Data Research Centre - Scotland
Messy data
Probabilistic matches
Schema matching
[Example record fragments to be matched: (1) John Grant, fisherman; Fiona Sinclair; Ian Grant, smithy; born 1861. (2) Stuart Adam, wheelwright; Morag Scott; Flora Adam, seamstress; born 1866; married 1884. (3) John Grant, farmer; Fiona Grant; Iain Grant; born 1860.]
13. Administrative Data Research Network
Administrative Data Research Centre - Scotland
Administrative Data Service
14. ADRC-Scotland
Administrative Data Research Centre - Scotland
Co-located with Farr Institute, Scottish Government and NHS.
Universities of Aberdeen, Dundee, Edinburgh, Glasgow, Heriot-Watt, St Andrews and Stirling.
Expertise in administrative data and public engagement, linkage, law and relevant computer science techniques.
Provide research support, facilities, training
15. Research Focus
Administrative Data Research Centre - Scotland
http://www.gov.scot/Resource/0044/00442276-390.jpg
Schools, colleges and universities
The criminal and justice system
Social work services
Social welfare
Housing system
Transport system
Health system
Historical administrative data
16. Multiple Identities
Andy Law's Third Law:
“The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study”
http://bioinformatics.roslin.ac.uk/lawslaws/
P12047
X31045
GB:29384
http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL1642
https://www.ebi.ac.uk/chembl/compound/inspect/CHEMBL1642
17. Query Performance
Response time
Data freshness
Reliability
Volume of requests
Hosting resources
[Diagram: data warehouse approach, with data sources copied into a central warehouse that answers queries, versus mediator approach, with queries decomposed by a mediator across live data sources]
18. How FAIR is your Data?
19. Summary
Web of Data
Global identifiers
Interoperable data
Domain ontologies
Challenges
Data matching
Multiple identifiers
Query performance
FAIR data
www.alasdairjggray.co.uk
A.J.G.Gray@hw.ac.uk
@gray_alasdair
Editor's Notes
Environmental decision support systems
Flood emergency response:
real-time data mash-ups
real-time data linkage
Strait of water separating Isle of Wight from English mainland
Two high tides -> increased opportunities for getting ships in and out -> better for business
Complex tidal pattern
Non-standard models
Overtopping: a wave or tide exceeds the height of the sea defence: simplified as threshold in graph
Sensor data provides current sea-state conditions
National Flood and Coastal Defences Database (NFCDD) provides height of sea walls, etc
Lots of forms of heterogeneity in the system
Contextual Data
Weather feed provides predicted wind speed and direction,
contextual streaming data
Maps -> contextual visual data
Report data in a form understandable to the user, ontology
Data from heterogeneous sources: discover relevant sources; different temporal modalities; different data models and representations
Interlink data: common representation, align data models/schemas, identify common entities
Query decomposition across distributed sources
Efficient in-network processing: Save energy, increase network lifetime
Enable new insights through novel user interfaces
Linked data offers a platform on which to do data science
Linked Data hugely successful since inception in 2006, revision 2009
About 300 datasets published
Wide range of topics
Coverage of 10,000+ athletes, 200+ countries, 400-500 disciplines and 30 venues
Page for every athlete and country drawing on open data
Internally
DBPedia and Geonames
Previous streaming extensions to SPARQL have problems
Bird habitat monitoring, Coastal monitoring, Glacier movement, Farms, Volcanoes…
Cost effective monitoring, high spatial/temporal resolution
What is the underlying technology/software?
Trade-off of capabilities vs QoS vs Lifetime
Every system performed its own bespoke evaluation; how do you compare?
Social science example from ADRC Scotland
Same problem in environmental science: bore holes in the North Sea
Four Administrative Data Research Centres (ADRCs), one in each UK country
England – led by University of Southampton
Northern Ireland – led by Queens Uni Belfast
Scotland – led by University of Edinburgh
Wales – led by Swansea University
Coordinating Administrative Data Service (ADS) – led by University of Essex
Each captures a subtly different view of the world
Are they the same? … depends on your point of view
Different URIs for different representations (content negotiation)