These slides are from my seminar to the University of Reading Department of Meteorology, November 2013. They contain a (hopefully not very technical) introduction to the concepts of Linked Data and how we are applying them in the CHARMe project (http://www.charme.org.uk). In CHARMe we are using Open Annotation to connect users of climate data with community-generated "commentary information" that helps them to understand a dataset's strengths and weaknesses.
The slide notes contain some helpful context, so you might like to download the PPT file!
The slides are licensed as "Creative Commons Attribution 3.0", meaning that you can do what you like with these slides provided that you credit the University of Reading for their creation. See http://creativecommons.org/licenses/by/3.0/.
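To make the core idea concrete, here is a minimal sketch (in Python with the rdflib library, not CHARMe's actual code) of what an Open Annotation looks like as Linked Data: an annotation node links a piece of commentary (the body) to the dataset it describes (the target). The oa: vocabulary terms are real; all the example URIs are invented for illustration.

```python
from rdflib import Graph, Namespace, URIRef, RDF

OA = Namespace("http://www.w3.org/ns/oa#")  # the Open Annotation vocabulary

g = Graph()
annotation = URIRef("http://example.org/annotations/1")         # invented URI
commentary = URIRef("http://example.org/notes/low-cloud-bias")  # invented URI
dataset = URIRef("http://example.org/datasets/isccp-d2")        # invented URI

g.add((annotation, RDF.type, OA.Annotation))
g.add((annotation, OA.hasBody, commentary))   # the commentary document
g.add((annotation, OA.hasTarget, dataset))    # the dataset it comments on

print(g.serialize(format="turtle"))
```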
The document discusses the concept of "Broad Data" which refers to the large amount of freely available but widely varied open data on the World Wide Web, including structured and semi-structured data. It provides examples such as the growing linked open data cloud and over 710,000 datasets available from governments around the world. Broad data poses new challenges for data search, modeling, integration and visualization of partially modeled datasets. International open government data search and linking government data to additional contexts are also discussed.
Slides from a guest presentation at Aron Lindberg's Computational-Qualitative Field Research seminar: http://aronlindberg.github.io/computational_field_research/ (required readings at https://www.dropbox.com/sh/1gx9s2zlnxvumbz/AAAV9uSAJHsiPeJhSsNnnM9Pa?dl=0).
Presented to a webinar hosted by Nuance Inc, under the title "The Semantic Web: What it is and Why you should care" on 2/29/2012.
This talk presents a fast overview of the Semantic Web and recent application deployment in the space.
Data science remains a high-touch activity, especially in life, physical, and social sciences. Data management and manipulation tasks consume too much bandwidth: Specialized tools and technologies are difficult to use together, issues of scale persist despite the Cambrian explosion of big data systems, and public data sources (including the scientific literature itself) suffer curation and quality problems.
Together, these problems motivate a research agenda around “human-data interaction:” understanding and optimizing how people use and share quantitative information.
I’ll describe some of our ongoing work in this area at the University of Washington eScience Institute.
In the context of the Myria project, we're building a big data "polystore" system that can hide the idiosyncrasies of specialized systems behind a common interface without sacrificing performance. In scientific data curation, we are automatically correcting metadata errors in public data repositories with cooperative machine learning approaches. In the Viziometrics project, we are mining patterns of visual information in the scientific literature using machine vision, machine learning, and graph analytics. In the VizDeck and Voyager projects, we are developing automatic visualization recommendation techniques. In graph analytics, we are working on parallelizing best-of-breed graph clustering algorithms to handle multi-billion-edge graphs.
The common thread in these projects is the goal of democratizing data science techniques, especially in the sciences.
The Web of Data: do we actually understand what we built? (Frank van Harmelen)
Despite its obvious success (largest knowledge base ever built, used in practice by companies and governments alike), we actually understand very little of the structure of the Web of Data. Its formal meaning is specified in logic, but with its scale, context dependency and dynamics, the Web of Data has outgrown its traditional model-theoretic semantics.
Is the meaning of a logical statement (an edge in the graph) dependent on the cluster ("context") in which it appears? Does a more densely connected concept (node) contain more information? Is the path length between two nodes related to their semantic distance?
Properties such as clustering, connectivity and path length are not described, much less explained by model-theoretic semantics. Do such properties contribute to the meaning of a knowledge graph?
To properly understand the structure and meaning of knowledge graphs, we should no longer treat knowledge graphs as (only) a set of logical statements, but treat them properly as a graph. But how to do this is far from clear.
In this talk, I report on some of our early results on some of these questions, but I ask many more questions for which we don't have answers yet.
The document discusses solutions to overcoming the tragedy of the data commons through shared metadata. It describes how large scientific projects can share data at low cost by starting from overlapping common metadata terms and having their metadata teams work together. Reusing shared metadata leads to increased reusability of data across projects. The document advocates for developing metadata as evolving, linked resources rather than predefined standards, and provides examples of how this approach has helped scientific collaborations and government data sharing initiatives succeed.
Keystone Summer School 2015: Provenance (Paolo Missier)
Lecture on Provenance modelling, given at the first Keystone Summer School, Malta July 2015.
With thanks to Prof. Luc Moreau for contributing some of the slide material from his own tutorial
This document discusses challenges in managing large amounts of scientific data from various sources like experiments, simulations, literature, and archives. It proposes making all scientific data available online to increase scientific information sharing and productivity. Key steps discussed are data ingest, organization, modeling, integration with literature, documentation, curation and long-term preservation. The cloud is presented as a way to provide scalable access and analysis of large datasets.
Presents the foundational aspects of web analytics and some specifics, such as the hotel problem. Discusses trace data, behaviorism, and other cool web analytics stuff.
An invited talk in the Big Data session of the Industrial Research Institute meeting in Seattle Washington.
Some notes on how to train data science talent and exploit the fact that the membrane between academia and industry has become more permeable.
The document discusses teaching data ethics in data science education. It provides context about the eScience Institute and a data science MOOC. It then presents a vignette on teaching data ethics using the example of an alcohol study conducted in Barrow, Alaska in 1979. The study had methodological and ethical issues in how it presented results to the community. The document concludes by discussing incorporating data ethics into all of the Institute's data science programs and initiatives like automated data curation and analyzing scientific literature visuals.
Presentation at "International knowledge graph workshop" at KDD 2020. The short overview talk shows how we have moved from Semantic Web to Linked Data to Knowledge Graphs. We argue that the same "a little semantics goes a long way" principle from the early days of the Semantic Web still is needed today -- some lessons learned and steps ahead are outlined.
This document discusses the development of the Semantic Web, which aims to make web content machine-readable through the use of ontologies and structured metadata. It describes how the Semantic Web will allow software agents to automatically carry out complex tasks by understanding the meaning of web pages and data. Key aspects covered include using XML, RDF and ontologies to encode semantics and define relationships between terms. The document provides an example of how Semantic Web technologies could enable software agents to automatically schedule medical appointments using information from various online sources.
The document discusses electronic laboratory notebooks and blogs as a way to record scientific experiments and share data. It proposes using blogs to document experiments in a more collaborative way, while also capturing metadata and linking data to provide context. Challenges addressed include capturing the full context around experiments, facilitating collaboration and discussion, and improving access to data over time.
Bill Howe discussed emerging topics in responsible data science for the next decade. He described how the field will focus more on what should be done with data rather than just what can be done. Specifically, he talked about incorporating societal constraints like fairness, transparency and ethics into algorithmic decision making. He provided examples of unfair outcomes from existing algorithms and discussed approaches to measure and achieve fairness. Finally, he discussed the need for reproducibility in science and potential techniques for more automatic scientific claim checking and deep data curation.
Asteroid Observations - Real Time Operational Intelligence Series (StormBourne, LLC)
I recently submitted a response to NASA's RFI for Asteroid Observation and Characterization Ideas. I was invited to present at the Asteroid Initiative Idea Synthesis Workshop where I presented a small portion of this idea. This is the complete presentation which may help fill in some of the blanks from the shorter version of the talk.
Methods for Intrinsic Evaluation of Links in the Web of Data (Cristina Sarasua)
The current Web of Data contains a large amount of interlinked data. However, there is still a limited understanding about the quality of the links connecting entities of different and distributed data sets. Our goal is to provide a collection of indicators that help assess existing interlinking. In this paper, we present a framework for the intrinsic evaluation of RDF links, based on core principles of Web data integration and foundations of Information Retrieval. We measure the extent to which links facilitate the discovery of an extended description of entities, and the discovery of other entities in other data sets. We also measure the use of different vocabularies. We analysed links extracted from a set of data sets from the Linked Data Crawl 2014 using these measures.
A basic explanation of graph mining for social network analysis (SNA). I describe some metrics and benefits of SNA, focusing on the telecommunications field. A basic Spark with GraphX script to analyse the graph is also included in the slides.
Keynote talk presented at the WebScience 2020 conference. Looks at the roots of the Web and Web Science and explores two possible futures, and what web scientists and others can do about them. Even starts with a quote from Charles Dickens.
This document analyzes a dataset of over 37 billion tweets spanning 2006 to 2013 to study the evolution of Twitter users and their behavior. It finds that the percentage of U.S. and Canadian users dropped from over 80% to 32% as Twitter spread globally, and that the percentage of tweets in English fell from 83% to 52%. It also observes increases in tweet deletions, inactive accounts, and suspended accounts, as well as a shift from desktop to mobile usage, with over half of tweets now coming from mobile devices. The results provide insight into how Twitter and user behavior have changed over time.
Perception Determined Constructing Algorithm for Document Clustering (IRJET Journal)
This document discusses an approach to document clustering called "Semantic Lingo" which aims to identify key concepts in documents and automatically generate an ontology based on these concepts to better conceptualize the documents. It provides background on challenges with traditional document clustering techniques and search engines. The proposed approach uses semantic information from domain ontologies to improve web search clustering quality by addressing issues like synonyms, polysemy and high dimensionality. It also discusses using text segments within documents that focus on one or more topics to aid multi-topic document clustering.
This document summarizes presentations from a webinar on data science and research data management. Jennifer Clark, a former librarian turned data scientist, discusses her career transition and skills useful for both roles. Margaret Henderson, director of research data management at VCU Libraries, outlines her experience transitioning from reference to research data and plans for developing data services. Jeroen Rombouts of 3TU.Datacentrum discusses lessons learned from developing a research data facility, including staffing models.
This document discusses democratizing data science in the cloud. It describes how cloud data management involves sharing resources like infrastructure, schema, data, and queries between tenants. This sharing enables new query-as-a-service systems that can provide smart cross-tenant services by learning from metadata, queries, and data across all users. Examples of possible services discussed include automated data curation, query recommendation, data discovery, and semi-automatic data integration. The document also describes some cloud data systems developed at the University of Washington like SQLShare and Myria that aim to realize this vision.
WEB SEARCH ENGINE BASED SEMANTIC SIMILARITY MEASURE BETWEEN WORDS USING PATTE... (cscpconf)
Semantic similarity measures play an important role in information retrieval, natural language processing, and various tasks on the web such as relation extraction, community mining, document clustering, and automatic metadata extraction. In this paper, we propose a Pattern Retrieval Algorithm (PRA) to compute the semantic similarity between words by combining the page count method and the web snippets method. Four association measures are used to find semantic similarity between words in the page count method using web search engines. We use Sequential Minimal Optimization (SMO) support vector machines (SVMs) to find the optimal combination of page-count-based similarity scores and top-ranking patterns from the web snippets method. The SVM is trained to classify synonymous and non-synonymous word pairs. The proposed approach improves correlation, precision, recall, and F-measure compared to existing methods, achieving a correlation value of 89.8%.
Search, Exploration and Analytics of Evolving Data (Nattiya Kanhabua)
The document discusses techniques for extracting temporal information from documents, including determining a document's publication time and any times discussed in its content. It describes challenges in determining a document's publication time due to factors like time gaps between crawling and indexing. It also outlines approaches like using temporal language models to compare a document's words to time-labeled reference corpora or leveraging search statistics to estimate a publication time. The document provides examples of how content-based classification models and techniques like semantic preprocessing can help with temporal information extraction from documents.
Keynote for Theory and Practice of Digital Libraries 2017
The theory and practice of digital libraries provides a long history of thought around how to manage knowledge, ranging from collection development to cataloging and resource description. These tools were all designed to make knowledge findable and accessible to people. Even technical progress in information retrieval and question answering is targeted at helping answer a human’s information need.
However, demand is increasingly for data: data that is needed not for people’s consumption but to drive machines. As an example of this demand, there has been explosive growth in job openings for Data Engineers – professionals who prepare data for machine consumption. In this talk, I overview the information needs of machine intelligence and ask the question: are our knowledge management techniques applicable for serving this new consumer?
Fox-Keynote-Now and Now of Data Publishing-nfdp13 (DataDryad)
The document summarizes Peter Fox's presentation at the Now and Now for Data conference in Oxford, UK on May 22, 2013. Fox discusses different metaphors for making data publicly available, including data publication, ecosystems, and frameworks for conversations about data. He examines pros and cons of different approaches like data centers, publishers, and linked data. The presentation considers how to improve data sharing and what roles different stakeholders like producers and consumers play.
A 2015 update to the 2012 "Data Big and Broad" talk - http://www.slideshare.net/jahendler/data-big-and-broad-oxford-2012 - extending the coverage and bringing it into the context of recent "big data" work.
INSPIRE Hackathon Webinar: Intro to Linked Data and Semantics (plan4all)
This document introduces linked data and the semantic web. It defines linked data as using URIs to identify things on the web and describing them in standard formats like RDF so that related things can be linked. This allows data on the web to be treated like a large database. The semantic web builds on linked data principles to publish structured data on the web that can be processed by machines, helping make information more discoverable and science more reproducible. Challenges include agreeing on definitions, the performance of query languages, and the effort required to publish high-quality linked data.
Big Data (SOCIOMETRIC METHODS FOR RELEVANCY ANALYSIS OF LONG TAIL SCIENCE D... (AKSHAY BHAGAT)
This document discusses the DataBridge project, which aims to enable easier discoverability and use of long tail science data. DataBridge will create a multidimensional network and social network for scientific data by mapping datasets connected by relationships between their metadata, usage, and the methods used to analyze them. This will allow researchers to more easily find relevant datasets by automatically forming communities of similar data. The document outlines DataBridge's vision and progress to date, including the algorithms it is investigating for measuring similarity between datasets in order to facilitate searching for collaborators and discoveries.
Accelerating Data-driven Discovery in Energy Science (Ian Foster)
A talk given at the US Department of Energy, covering our work on research data management and analysis. Three themes:
(1) Eliminate data friction (use of SaaS for research data management)
(2) Liberate scientific data (research on data extraction, organization, publication)
(3) Create discovery engines at DOE facilities (services that organize data + computation)
Michael Mahoney discusses the rise of massive data from various sensors. He notes there are many types of sensors that generate large amounts of data, including physical, consumer, health, financial, internet, and astronomical sensors. While there are similarities between sensor applications, there are also differences in funding, customer demands, questions of interest, time sensitivity, and more. Analyzing massive data presents challenges due to its size, variability, and noise. New algorithms and statistical methods are needed to gain insights from these large and complex data sets. Mahoney advocates cross-disciplinary work to address the opportunities and difficulties presented by modern massive data.
Amit Sheth with TK Prasad, "Semantic Technologies for Big Science and Astrophysics", Invited Plenary Presentation, at Earthcube Solar-Terrestrial End-User Workshop, NJIT, Newark, NJ, August 13, 2014.
Like many other fields of Big Science, Astrophysics and Solar Physics deal with the challenges of Big Data, including Volume, Variety, Velocity, and Veracity. There is already significant work on handling volume-related challenges, including the use of high performance computing. In this talk, we will mainly focus on the other challenges from the perspective of collaborative sharing and reuse of the broad variety of data created by multiple stakeholders, large and small, along with tools that offer semantic variants of search, browsing, integration and discovery capabilities. We will borrow examples of tools and capabilities from state-of-the-art work in supporting physicists (including astrophysicists) [1], life sciences [2], and material sciences [3], and describe the role of semantics and semantic technologies that make these capabilities possible or easier to realize. This applied and practice-oriented talk will complement more vision-oriented counterparts [4].
[1] Science Web-based Interactive Semantic Environment: http://sciencewise.info/
[2] NCBO Bioportal: http://bioportal.bioontology.org/ , Kno.e.sis’s work on Semantic Web for Healthcare and Life Sciences: http://knoesis.org/amit/hcls
[3] MaterialWays (a Materials Genome Initiative related project): http://wiki.knoesis.org/index.php/MaterialWays
[4] From Big Data to Smart Data: http://wiki.knoesis.org/index.php/Smart_Data
Web History 101, or How the Future is Unwritten (BookNet Canada)
In 1989 computer scientist Tim Berners-Lee wrote “Information Management: A Proposal” to persuade CERN management that a global hypertext system was in their interests. That proposal gradually grew into what we now call the World Wide Web. This originating document contains not only the bits that would later become the Web, but also features for a future we’ve yet to realize. In this talk, we’ll take a look at some of those highlights and focus them on the world of publishing, proposing solutions to problems we’re still attempting to solve and fostering ideas for further daydreaming.
Linked Open Data in Libraries, Archives & Museums (Jon Voss)
This document provides an overview of Linked Open Data for libraries, archives, and museums. It discusses the growing movement of LODLAM and how it allows these cultural institutions to represent their data as graphs using triples that describe entities in a machine-readable format. Key concepts covered include the use of URIs, RDF, vocabularies, and different legal tools for publishing open data.
Open Research Data: Licensing | Standards | Future (Ross Mounce)
This document provides an overview of open research data, including definitions, licensing, standards, and history. It defines open data as data that anyone can freely access, use, modify, and share with few restrictions. For data to be truly open, it recommends using a CC0 public domain waiver or an attribution-only license. It discusses issues with non-commercial and no derivatives restrictions. The document also provides guidance on technical aspects like recommended file formats and standards. It briefly summarizes the history of data sharing, from centralized data centers to online supplementary data to emerging data paper journals. The key messages are that data should be FAIR (Findable, Accessible, Interoperable, Reusable) and that open data benefits both
This document discusses the emerging field of social semantic sensor web. It describes how the proliferation of sensors embedded in devices, homes, cars, etc. can be connected to the social web and annotated with semantic technologies. This would allow machines to better understand sensor data, such as using ontologies to infer weather conditions from different sensor readings. The document outlines technologies like the SSN ontology for describing sensors and how sensor data could be attached to social media posts. Finally, it discusses potential applications in areas like disaster management, traffic reporting, and crowdsourcing health data.
eScience: A Transformed Scientific Method (Duncan Hull)
The document discusses the concept of eScience, which involves synthesizing information technology and science. It explains how science is becoming more data-driven and computational, requiring new tools to manage large amounts of data. It recommends that organizations foster the development of tools to help with data capture, analysis, publication, and access across various scientific disciplines.
This document discusses best practices for content delivery platforms to support artificial intelligence projects. It recommends that platforms (1) accept that they do not have all the data needed and should integrate third-party sources, (2) provide consistent tagging of content, (3) offer a lightweight programmatic interface, (4) embrace allowing large amounts of content to be taken offline for analysis, and (5) enable complex filtering and selection of data. The document also suggests platforms could consider offering preprocessed datasets or AI tools as new products.
The document summarizes a presentation by Mark A. Parsons on opportunities and challenges for data sharing and citation. The presentation discusses how all of society's grand challenges require diverse data shared across boundaries, and the vision of the Research Data Alliance (RDA) to openly share data. RDA builds social and technical bridges to enable open data sharing through developing infrastructure, standards, and best practices. The presentation also covers specific RDA activities like developing data citation recommendations and engaging members globally.
Security and Data Ownership in the Cloud
Andrew K. Pace, Executive Director, Networked Library Services, OCLC; Councilor-at-large, American Library Association
This document summarizes a presentation by Mark A. Parsons on infrastructure, relationships, trust, and the Research Data Alliance (RDA). The presentation discusses how research infrastructure now requires electronic infrastructure (e-infrastructure) due to data-intensive science. It also discusses how infrastructure emerges through relationships between people, technologies, and institutions. The RDA is introduced as a community working to build social and technical bridges to enable open data sharing across disciplines. Initial and future products being developed by RDA working groups are also summarized.
This document discusses getting to know data using R. It begins by outlining the typical steps in a data analysis, including defining the question, obtaining and cleaning the data, performing exploratory analysis, modeling, interpreting results, and creating reproducible code. It then describes different types of data science questions from descriptive to mechanistic. The remainder of the document provides more details on descriptive, exploratory, inferential, predictive, causal, and mechanistic analysis. It also discusses R, including its design, packages, data types like vectors, matrices, factors, lists, and data frames.
EgoSystem: Presentation to LITA, American Library Association, Nov 8 2014 (James Powell)
The Internet represents the connections among computers and devices, the World Wide Web is a network of interconnected documents, and the semantic web is the closest thing we have today to a network of interconnected facts. Noticeably absent from these global networks is any sort of open, formal representation of an online global social network. Each user's online presence, and its immediate social network, is isolated and typically only available within the confines of the social networking site that hosts it. Discovery across explicit online social networks and implicit social networks, such as those that can be inferred from co-authorship relationships and affiliations, is, for all practical purposes, impossible. And yet there are practical and non-nefarious reasons why an organization might be interested in exploring portions of such a network. Outreach is one such interest. Los Alamos National Laboratory (LANL) prototyped EgoSystem to harvest and explore the professional social networks of postdoctoral students. The project's goal is to enlist past students and other Lab alumni as ambassadors and advocates for LANL's ongoing mission. During this talk we will discuss the various technologies that support EgoSystem and demonstrate some of its capabilities.
How the Web can change social science research (including yours) (Frank van Harmelen)
A presentation for a group of PhD students from the Leibniz Institutes (section B, social sciences) to discuss how they could use the Web, and even better the Web of Data, as an instrument in their research.
Similar to In search of lost knowledge: joining the dots with Linked Data (20)
Taking AI to the Next Level in Manufacturing.pdfssuserfac0301
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
6. Ideas and approaches to help build your organization's AI strategy.
Monitoring and Managing Anomaly Detection on OpenShift.pdfTosin Akinosho
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
Best 20 SEO Techniques To Improve Website Visibility In SERPPixlogix Infotech
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Speck&Tech
ABSTRACT: A prima vista, un mattoncino Lego e la backdoor XZ potrebbero avere in comune il fatto di essere entrambi blocchi di costruzione, o dipendenze di progetti creativi e software. La realtà è che un mattoncino Lego e il caso della backdoor XZ hanno molto di più di tutto ciò in comune.
Partecipate alla presentazione per immergervi in una storia di interoperabilità, standard e formati aperti, per poi discutere del ruolo importante che i contributori hanno in una comunità open source sostenibile.
BIO: Sostenitrice del software libero e dei formati standard e aperti. È stata un membro attivo dei progetti Fedora e openSUSE e ha co-fondato l'Associazione LibreItalia dove è stata coinvolta in diversi eventi, migrazioni e formazione relativi a LibreOffice. In precedenza ha lavorato a migrazioni e corsi di formazione su LibreOffice per diverse amministrazioni pubbliche e privati. Da gennaio 2020 lavora in SUSE come Software Release Engineer per Uyuni e SUSE Manager e quando non segue la sua passione per i computer e per Geeko coltiva la sua curiosità per l'astronomia (da cui deriva il suo nickname deneb_alpha).
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxSitimaJohn
Ocean Lotus cyber threat actors represent a sophisticated, persistent, and politically motivated group that poses a significant risk to organizations and individuals in the Southeast Asian region. Their continuous evolution and adaptability underscore the need for robust cybersecurity measures and international cooperation to identify and mitigate the threats posed by such advanced persistent threat groups.
Introduction of Cybersecurity with OSS at Code Europe 2024Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
5th LF Energy Power Grid Model Meet-up SlidesDanBrown980551
5th Power Grid Model Meet-up
It is with great pleasure that we extend to you an invitation to the 5th Power Grid Model Meet-up, scheduled for 6th June 2024. This event will adopt a hybrid format, allowing participants to join us either through an online Mircosoft Teams session or in person at TU/e located at Den Dolech 2, Eindhoven, Netherlands. The meet-up will be hosted by Eindhoven University of Technology (TU/e), a research university specializing in engineering science & technology.
Power Grid Model
The global energy transition is placing new and unprecedented demands on Distribution System Operators (DSOs). Alongside upgrades to grid capacity, processes such as digitization, capacity optimization, and congestion management are becoming vital for delivering reliable services.
Power Grid Model is an open source project from Linux Foundation Energy and provides a calculation engine that is increasingly essential for DSOs. It offers a standards-based foundation enabling real-time power systems analysis, simulations of electrical power grids, and sophisticated what-if analysis. In addition, it enables in-depth studies and analysis of the electrical power grid’s behavior and performance. This comprehensive model incorporates essential factors such as power generation capacity, electrical losses, voltage levels, power flows, and system stability.
Power Grid Model is currently being applied in a wide variety of use cases, including grid planning, expansion, reliability, and congestion studies. It can also help in analyzing the impact of renewable energy integration, assessing the effects of disturbances or faults, and developing strategies for grid control and optimization.
What to expect
For the upcoming meetup we are organizing, we have an exciting lineup of activities planned:
-Insightful presentations covering two practical applications of the Power Grid Model.
-An update on the latest advancements in Power Grid -Model technology during the first and second quarters of 2024.
-An interactive brainstorming session to discuss and propose new feature requests.
-An opportunity to connect with fellow Power Grid Model enthusiasts and users.
Main news related to the CCS TSI 2023 (2023/1695)Jakub Marek
An English 🇬🇧 translation of a presentation to the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on Communications and signalling systems on Railways, which was held in Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). Attended by around 500 participants and 200 on-line followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
Fueling AI with Great Data with Airbyte WebinarZilliz
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
Webinar: Designing a schema for a Data WarehouseFederico Razzoli
Are you new to data warehouses (DWH)? Do you need to check whether your data warehouse follows the best practices for a good design? In both cases, this webinar is for you.
A data warehouse is a central relational database that contains all measurements about a business or an organisation. This data comes from a variety of heterogeneous data sources, including databases of any type that back the applications used by the company, data files exported by some applications, and APIs provided by internal or external services.
But designing a data warehouse correctly is a hard task, which requires first gathering information about the business processes that need to be analysed. These processes must then be translated into so-called star schemas: denormalised schemas in which each table represents either a dimension or facts.
We will discuss these topics:
- How to gather information about a business;
- Understanding dictionaries and how to identify business entities;
- Dimensions and facts;
- Setting a table granularity;
- Types of facts;
- Types of dimensions;
- Snowflakes and how to avoid them;
- Expanding existing dimensions and facts.
GraphRAG for Life Science to increase LLM accuracy – Tomaz Bratanic
GraphRAG for the life science domain, where you retrieve information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers.
In search of lost knowledge: joining the dots with Linked Data
1. In search of lost knowledge: joining the dots with Linked Data
Jon Blower
j.d.blower@reading.ac.uk
Department of Meteorology
Reading e-Science Centre
National Centre for Earth Observation
University of Reading
With thanks to all CHARMe project partners!
5. How do you fill in the blanks?
• Hope the documentation contains pointers
• … to the right versions of things
• … and using consistent terminology
• In restricted communities, we can gather all
information into a central location and
harmonize it
• OR we hope that a Google search throws up the
right results
9. Lots of useful negative results, which aren’t published
Nearly 1/3 of positive results are false
[Worked example: of 1000 hypotheses, 100 are true and 900 are false. A test with 95% accuracy gives 95 true positives, 5 false negatives, 45 false positives and 855 true negatives; 45 of the 140 positive results (nearly 1/3) are therefore false.]
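A minimal sketch of that arithmetic in Python, using the numbers straight from the slide:

```python
# Sketch of slide 9's arithmetic: why nearly 1/3 of positive results are false.
n_hypotheses = 1000
n_true = 100                       # hypotheses that are actually true
n_false = n_hypotheses - n_true    # 900 false hypotheses
accuracy = 0.95                    # the test is right 95% of the time

true_positives = n_true * accuracy           # 95
false_negatives = n_true * (1 - accuracy)    # 5
false_positives = n_false * (1 - accuracy)   # 45
true_negatives = n_false * accuracy          # 855

positives = true_positives + false_positives               # 140 positive results
print(f"false share of positives: {false_positives / positives:.0%}")
# -> 32%: 95% test accuracy does not mean a 95% chance the hypothesis is true
```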
12. “There is a spurious decline in low-cloud fraction in
the ISCCP cloud database due to the viewing angle.
It’s in the literature, but you might not find it if
you’re not specifically looking for it. For example, if
you’re using the data for model validation you may
not spot the problem.”
(Claire Barber, paraphrased)
13. Consequences
• We use the data that are most easily
available, not necessarily the best
• Constant re-discovery of the same issues
• Very hard to share information outside
communities
19. Fingers crossed!
• We hope that everything has a document
written about it, or we have nothing to cite
• … and we hope someone writes a new
document when “it” changes
• We hope that we are citing the right
document (i.e. the most authoritative etc)
20. “If … the Web made all the online documents look like one huge book, [the Semantic Web] will make all the data in the world look like one huge database.”
Sir Tim Berners-Lee
Photo by Susan Lesch, from Wikipedia
21. Why is the Semantic Web different?
• Focuses on things, not documents-about-things
• Links things together in a meaningful way
• Information is readable by computers
• Here’s how it works…
22. 1. Give things unique identifiers
Must be globally unique and persistent
http://www.reading.ac.uk/people/jon.blower
might represent me
http://nasa.gov/instruments/MODIS
might represent the MODIS instrument
–(these are illustrative, not accurate!)
23. 2. Express relationships between
things
The key concept is the triple
subject predicate object
E.g. “MODIS is carried by the Terra satellite”
all three things need to be unique identifiers
(we don’t use plain English!)
24. Examples of realistic triples
(identifiers are illustrative, not necessarily correct!)
http://www.reading.ac.uk/people/jon.blower
http://xmlns.com/foaf/0.1/knows
http://www.durham.ac.uk/staff/ed.llewellin
http://dx.doi.org/12345.678910
http://purl.org/spar/cito/citesAsDataSource
http://www.badc.ac.uk/datasets/sst
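As a rough, non-authoritative sketch, the two triples above could be constructed with Python’s rdflib library; the identifiers stay illustrative, as the slide warns:

```python
# Sketch: the slide's two example triples, built with rdflib (illustrative IDs).
from rdflib import Graph, URIRef

g = Graph()

# "Jon Blower knows Ed Llewellin" (FOAF vocabulary)
g.add((
    URIRef("http://www.reading.ac.uk/people/jon.blower"),
    URIRef("http://xmlns.com/foaf/0.1/knows"),
    URIRef("http://www.durham.ac.uk/staff/ed.llewellin"),
))

# "This paper cites that dataset as a data source" (CiTO vocabulary)
g.add((
    URIRef("http://dx.doi.org/12345.678910"),
    URIRef("http://purl.org/spar/cito/citesAsDataSource"),
    URIRef("http://www.badc.ac.uk/datasets/sst"),
))

print(g.serialize(format="turtle"))  # one of the many formats triples can take
```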
25. Aside: how to use the Semantic Web
to ruin jokes
“A hundred kilopascals go into a bar”
“100000”^^unit:Pascal
owl:sameAs
“1”^^unit:Bar
(Using OWL and QUDT ontologies)
26. 3. Publish your triples
• Can be as simple as putting a document in the
right format on your website
– (triples can be expressed in many formats)
• … or as complex as publishing a multi-billion-triple database
– E.g. Ordnance Survey
• Either way, your data become part of the
Web, not just on the web
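To make the “simple” end of that spectrum concrete, here is a minimal sketch (the filename and triple are illustrative): serialize a graph to a Turtle file and serve it as a static document from your website.

```python
# Sketch: the simplest form of publishing triples (illustrative identifiers).
from rdflib import Graph, URIRef

g = Graph()
g.add((
    URIRef("http://www.reading.ac.uk/people/jon.blower"),
    URIRef("http://xmlns.com/foaf/0.1/knows"),
    URIRef("http://www.durham.ac.uk/staff/ed.llewellin"),
))

# Write the graph as a Turtle document; placing this file on a web server
# is enough to make the triples part of the Web of Data.
g.serialize(destination="my-triples.ttl", format="turtle")
```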
27. 4. Profit!
• Everyone can publish their own “view of the world”
– Don’t need a single data structure “to rule them all”
• Use common identifiers and common vocabularies to link between
communities
• Different vocabularies can be mapped to each other
– E.g. map CF standard names to GRIB codes
• Gives a means to record provenance of information
– “Direct line of sight from decisions to data” - NASA
• Reasoning engines can traverse the graph of links and discover new
information
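As a hedged illustration of that traversal, here is how a program might query published triples with SPARQL via rdflib; the URL is hypothetical and the vocabulary reuses the earlier illustrative identifiers:

```python
# Sketch: asking a question of published triples with SPARQL (illustrative IDs).
from rdflib import Graph

g = Graph()
g.parse("http://example.org/my-triples.ttl", format="turtle")  # hypothetical URL

# "Which datasets does this paper cite as a data source?"
results = g.query("""
    SELECT ?dataset WHERE {
        <http://dx.doi.org/12345.678910>
            <http://purl.org/spar/cito/citesAsDataSource> ?dataset .
    }
""")
for row in results:
    print(row.dataset)
```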
28. Example: SemantEco
Wildlife observations
Ecosystem impacts
Administrative areas
Pollution data and policies
“Find me all the sites where chloride pollution
exceeds the local policy limits”
“Show me the species over time in this region”
29. So what’s “Linked Data” then?
• The Web is the “web of documents”
• The Semantic Web is the “web of data”
• “Linked Data” is really a set of principles for
using the Semantic Web
– http://www.w3.org/DesignIssues/LinkedData.html
• (Many people use “Linked Data” and
“Semantic Web” somewhat interchangeably)
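One of those principles is that looking up an identifier over HTTP should return useful data. A minimal sketch of dereferencing a URI with content negotiation (the URI is the deck’s illustrative example, not a live endpoint):

```python
# Sketch: dereferencing a Linked Data URI, asking for RDF rather than HTML.
import requests

uri = "http://www.reading.ac.uk/people/jon.blower"  # illustrative identifier
resp = requests.get(uri, headers={"Accept": "text/turtle"})
print(resp.status_code, resp.headers.get("Content-Type"))
# A Linked Data server would respond with a Turtle representation of the thing.
```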
34. From observations to decisions
[Diagram: the value chain for climate data. Satellites produce observations (sometimes used for model initialization); Climate Modelling produces model results; both feed Climate Record Creation and Curation, yielding Climate Data Records, which support analysis and applications (reports) and, ultimately, decision making (decisions, policies).]
(Adapted from Dowell et al. (2013), “Strategy Towards an Architecture for Climate Monitoring from Space”)
35. Where can users go for help?
• Scientific literature
– Huge, verbose and inaccessible to some communities
– Not well linked to source data
• Technical reports and conference proceedings
– Hard to find, scattered or inaccessible
• Data centres
– Increasingly strong at providing some important metadata, but don’t
usually include community feedback
– Not all countries and communities have data centres!
• Websites and blogs
– From CEOS Handbook to a scientist’s blog
– Increasingly useful, but scattered
36. How can climate data users decide whether a
dataset is fit for their purpose?
(N.B. We consider that “data quality” and “fitness for purpose”
are the same thing)
Not specific to climate data!
38. Examples of commentary metadata
• Post-fact annotations, e.g. citations, ad-hoc comments and notes;
• Results of assessments, e.g. validation campaigns,
intercomparisons with models or other observations, reanalysis;
• Provenance, e.g. dependencies on other datasets, processing
algorithms and chain, data source;
• Properties of data distribution, e.g. data policy and licensing,
timeliness (is the data delivered in real time?), reliability;
• External events that may affect the data, e.g. volcanic eruptions, El Niño index, satellite or instrument failure, operational changes to the orbit calculations.
General rule: information originates from users or external entities,
not original data providers
– However, sometimes information is not available from the data
provider!
39. How will this be done?
• CHARMe will create connected repositories of commentary information
– Stored as triples in “CHARMe nodes”
• Information can be read and entered through websites or Web Services
• Using principles of Open Linked Data and the Semantic Web
[Diagram: data provider websites and 3rd-party systems connected to several CHARMe nodes]
40. Open Annotation
• We are using Open Annotation for recording commentary
– World Wide Web Consortium standard
• Associates a body with a target
• E.g. a publication could be the body, an EO dataset could be the target
• Can record the motivation behind an annotation
– Bookmarking, classifying, commenting, describing, editing, highlighting, questioning, replying… (lots more)
– Covers a lot of CHARMe use cases!
• An annotation can have multiple targets
– A key requirement from users
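A hedged sketch of what one CHARMe-style annotation could look like as triples, using rdflib and the Open Annotation vocabulary; the annotation ID is hypothetical, and the body and target reuse the deck’s illustrative identifiers:

```python
# Sketch: an Open Annotation linking a publication (body) to a dataset (target).
from rdflib import Graph, URIRef, Namespace
from rdflib.namespace import RDF

OA = Namespace("http://www.w3.org/ns/oa#")

g = Graph()
g.bind("oa", OA)

anno = URIRef("http://example.org/charme/annotations/1")  # hypothetical ID
body = URIRef("http://dx.doi.org/12345.678910")           # the publication
target = URIRef("http://www.badc.ac.uk/datasets/sst")     # the EO dataset

g.add((anno, RDF.type, OA.Annotation))
g.add((anno, OA.hasBody, body))
g.add((anno, OA.hasTarget, target))           # an annotation may have several targets
g.add((anno, OA.motivatedBy, OA.commenting))  # one of the standard motivations

print(g.serialize(format="turtle"))
```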
41. Where does CHARMe fit in?
[Diagram: the same value chain as slide 34, now explicitly including Climate Modelling (e.g. CMIP5): satellites and observations, climate record creation and curation, climate data records, model results, analysis and applications, reports, decision making, decisions and policies.]
Supports analysts and scientists in production of information for decision-makers
43. Challenges
• Ensuring community adoption:
– We are “injecting” CHARMe capabilities into existing
websites used in the community
– We will “seed” the CHARMe system with information
to attract users (e.g. links between publications and
datasets)
• Ensuring quality of commentary metadata
– Moderation is a strong user requirement
– We will provide guides to creators of commentary –
what makes a comment helpful?
44. What CHARMe will enable
(some examples)
Users:
- “Find me all the documents that have been written about this dataset”
- “… in both peer-reviewed journals and the grey literature”
- “… and specifically about precipitation in Africa”
- “… in both NEODC’s and Astrium’s archives”
- “What factors might affect the quality of this dataset?”
e.g. upstream datasets, external events
- “What have other users already discovered about this dataset?”
- “I want to find information related to the dataset I’m looking at”
Data providers:
- “Who is using my dataset and what are they saying about it?”
- “Let me subscribe to new user comments and reply to them”
45. What CHARMe will not enable
• “Give me the best dataset on sea surface temperature”
– The “best” dataset depends on the application
• CHARMe will not provide a new “quality stamp” for
datasets
– But will be able to link to such things if other people publish
them, e.g. Bates maturity index, QA4EO certification
• CHARMe will not provide access to actual data
– (but will enable discovery of data)
• Not planning to create (another) “one-stop shop” for
information
– We want the information to appear where users are already
looking
47. How will you use CHARMe?
• Search for datasets on existing data provider website
• Use the “CHARMe button” to view or record
commentary on a particular dataset
48. “Significant events” viewer
• Google Finance (above) matches stock prices with news events
• CHARMe tool will match climate reanalysis data with “significant events”
– Algorithm changes, instrument failures, new data sources
• Will allow user annotations on the data and events
• Will be available on the Web
49. Fine-grained commentary
• Will allow creation and discovery of commentary about specific parts of datasets
• E.g. variables, geographic locations, time ranges
• cf http://maphub.github.io
50. Jon Blower, Debbie Clifford, Tristan Quaife, Sam Williams
• Using Linked Data to combine EO and socioeconomic data in 8 application areas
• 16-partner EU FP7 project, just started!
– Coordinated from Reading
51. Summary
• Linked Data can join up information from all over the Web
• Makes disparate data sources discoverable and processable
• CHARMe is using Linked Data techniques to help users of climate data to connect with all the experience in the community
– Techniques are not specific to climate data
• MELODIES project will combine environmental data and socioeconomic data to develop real-world services using Linked Data
Here’s a simplified representation of the things we are interested in within Earth Observation
And here’s a more complex version. The instrument produces “Level 0” data, which are then processed using a series of algorithms to produce something that the user community finds easier to use.
Documents may of course be written about any of these steps. All these documents will be written by different people at different times, and possibly about different versions of the same thing
Now let’s say you only know about a few of these things – a few publications, the algorithms you know already and the latest versions of a few datasets. Where do you go from here?
This person is trying to make a good decision based on climate data from satellites and from computer simulations. She also needs to know about other factors such as population growth and changes in urbanization. It’s a pretty hard job to bring together information from all these communities and correlate them. Everyone will use different standards, formats and terminology
If you are in a data-rich field that relies a lot on statistics, you may well be in this situation where the false positives outweigh the true ones. (95% test accuracy does not mean that there is a 95% chance that your hypothesis is true!)
Obviously, you can play with the numbers here, but in general this will be significant if you’re in the case where most hypotheses are false.
This may help explain why some results in the peer-reviewed literature can’t be reproduced – not necessarily because of poor experiments or tests.
Sites such as FigShare allow publication of material that would not make it into the peer-reviewed literature, but that is still useful.
Who here uses NASA instead of ESA because the data are easier to get hold of? NCEP instead of Met Office?
Just a reminder of what our simplified situation in EO looks like
This is how the Web represents our case: just a set of documents with simple links. The Web records no information about what the documents represent, or why they are linked.
Lots of links here, which is good! However, you need context (and natural language understanding) to determine what the links mean – are they affiliations, projects Richard works on, links to “institutional” (not personal) stuff, or what? A computer can’t reason based on this information (although link density gives some indication of relevance and popularity).
Here’s part of the citation list from one of my recent papers. Some of these are papers, some are standards, some are simple URLs. There is no information about why I’m citing them, unless you trawl the text and divine my meaning.
These are some of the problems of using documents-about-things to talk about stuff
These identifiers may look like web addresses (Uniform Resource Locators) but in fact are identifiers – just strings. You don’t expect to download me when you put my identifier in your web browser but you might get a representation of me (e.g. my home page)
The bottom one is a paper citing a dataset as a data source. Note that we can say *why* we’re citing the data
Your homework is to apply the same technique to the joke about the Dalai Lama going to a hot-dog stall and saying “make me one with everything”
The point about being part of the web is important, but hard to grasp. Your data can be found by semantically-aware web crawlers, which might be able to do something interesting with it, rather than just finding a document that requires a human to read.
So the Web of Data can more accurately record our situation, and can describe the nature of the links between things.
Lots of data, particularly public-sector data, is being released as Open Linked Data: the Met Office, Ordnance Survey, the Office for National Statistics, Data.gov.uk.
People need to publish not only their data, but the terms that they use and the relationships between them
A close-up of some of the geospatial part. Note the Met Office.
So what are we doing with Linked Data in the climate field? Here is a European project, about half way through.
Simple “value chain” for climate data observed from space.
See next slide for explanation of this
This is basically the same as the previous slide in words instead of pictures! The last paragraph is crucial – we’re looking at stuff the data provider doesn’t already know.
We’re not trying to turn the whole of climate data into Linked Data, but focusing on user-provided annotations
Important to note that decision-makers will probably not use CHARMe directly
So this person might be able to use CHARMe to find interesting information from all these communities, and find out what the experts in those communities think about the data. It will be easier to discover the common pitfalls.
CHARMe is often compared with Amazon, Tripadvisor etc, but the user base will be smaller and more specialist, therefore it is important for the information to be accurate. Can’t rely on the wisdom of crowds!
This is the most basic use case for CHARMe. In the next couple of slides we’ll look at more advanced uses. The fact that the annotations (i.e. the commentary information) will be machine-readable means that it’s possible to build all kinds of applications.
Here’s an “advanced application” for using CHARMe information, which will be correlated with climate reanalyses and external events.
Here’s another project that is just starting, which will use Linked Open Data to create new services and products. Many of the participants are small and medium-sized enterprises who will base business ideas on Open Data.