The presentation has three parts: 1) a short introduction to the Semantic Web and Linked Data; 2) a review of a few projects of interest in Earth science; and 3) details of the workflow and algorithms for computing similarity between entities on the Semantic Web.
Vocabularies as Linked Data - OUDCE March 2014 (Keith May)
Presentation given as part of an OUDCE course in Oxford, 04-03-2014, on "Digital Data and Archaeology: Management, Preservation and Publishing."
Acknowledgements to Ceri Binding @Ceribin for many of the slides.
Visualising the Australian open data and research data landscape (Jonathan Yu)
"Visualising the Australian open data and research data landscape" at C3DIS May 2018 in Melbourne. In this talk, we presented work around the visualisation of an survey of open government and research data in Australia. This features a first attempt at formalising a quantitative based approach to measuring the data ecosystem in Australia.
Web data management has been a topic of interest for many years, during which a number of different modelling approaches have been tried. The latest of these is RDF (Resource Description Framework), which provides a real opportunity for querying at least some web data systematically. RDF has been proposed by the World Wide Web Consortium (W3C) for modeling Web objects as part of developing the "semantic web"; W3C has also proposed SPARQL as the query language for accessing RDF data repositories. The publication of Linked Open Data (LOD) on the Web has gained tremendous momentum in recent years, and this provides a new opportunity to accomplish web data integration. A number of approaches have been proposed for running SPARQL queries over RDF-encoded Web data: data warehousing, SPARQL federation, and live linked query execution. In this talk, I will review these approaches with particular emphasis on some of our research within the context of the gStore project (joint with Prof. Lei Zou of Peking University and Prof. Lei Chen of Hong Kong University of Science and Technology), the chameleondb project (joint with Günes Aluç, Dr. Olaf Hartig, and Prof. Khuzaima Daudjee of the University of Waterloo), and live linked query execution (joint with Dr. Olaf Hartig).
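To make the endpoint-based style of access concrete, here is a minimal Python sketch (assuming the SPARQLWrapper package and that the public DBpedia endpoint is reachable; the dbo:/dbr: names are DBpedia's, used purely for illustration):

    # Query a remote RDF repository via SPARQL instead of warehousing it locally.
    from SPARQLWrapper import SPARQLWrapper, JSON

    endpoint = SPARQLWrapper("https://dbpedia.org/sparql")  # public SPARQL endpoint
    endpoint.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        PREFIX dbr: <http://dbpedia.org/resource/>
        SELECT ?work WHERE { ?work dbo:composer dbr:Wolfgang_Amadeus_Mozart . }
        LIMIT 5
    """)
    endpoint.setReturnFormat(JSON)
    for binding in endpoint.query().convert()["results"]["bindings"]:
        print(binding["work"]["value"])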
Tufts Spatial Data Rescue: Crawling at-risk Government Data (Kyle Monahan)
Much U.S. federal data is perceived as being at risk of becoming inaccessible through lack of maintenance and shifts in funding. Organizations such as the End of Term Project and Data Rescue have emerged to coordinate the backup and rescue of at-risk federal data. Tufts University conducted a curated harvest to back up potentially at-risk federal environmental and social justice geospatial data and associated tabular data; the ongoing harvest has recovered over 40 TB of data. Tufts developed the Crawler to process all of it: unzip files; identify file types, sizes, and local directories; and harvest and process all related metadata using data mining and natural language processing (NLP) techniques. The Crawler produces detailed collection-level and layer-level metadata analytics for assessment, search and discovery.
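To make that pipeline concrete, here is a minimal Python sketch of such an inventory pass (the actual Tufts Crawler is not shown here; the directory layout and record fields are invented for illustration):

    # Walk a harvest directory, unzip archives, and record file type and size
    # for later data-mining / NLP metadata extraction.
    import mimetypes, os, zipfile

    def inventory(root):
        records = []
        for dirpath, _, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                if zipfile.is_zipfile(path):
                    with zipfile.ZipFile(path) as z:
                        z.extractall(dirpath)          # unzip harvested archives
                ftype, _ = mimetypes.guess_type(path)  # crude file-type guess
                records.append({"path": path, "type": ftype,
                                "size": os.path.getsize(path)})
        return records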
WORLDMAP: A SPATIAL INFRASTRUCTURE TO SUPPORT TEACHING AND RESEARCH (BROWN BA... (Micah Altman)
The WorldMap platform http://worldmap.harvard.edu is the largest open source collaborative mapping system in the world, with over 13,000 map layers contributed by thousands of users from Harvard and around the world. Researchers may upload large spatial datasets to the system, create data-driven visualizations, edit data, and control access. Users may keep their data private, share it in groups, or publish to the world.
The user base is interdisciplinary, including scholars from the humanities, social sciences, sciences, public health, design, planning, etc. All are able to access, view, and use one another’s data, either online, via map services, or by downloading.
Current work is underway to create and maintain a global registry of map services and take us a step closer to one-stop access for public geospatial data. Another project is building tools to support the visualization of spatial datasets with over a billion features. Collaborations are underway with groups inside Harvard, such as Dataverse, HarvardX, and various departments, and with groups outside Harvard, such as Cornell University and the University of Pennsylvania. Major additional contributors to the underlying source code include the World Bank, the U.S. State Department, and the United Nations.
The source code for the WorldMap platform is available on GitHub https://github.com/cga-harvard/cga-worldmap.
Location: E25-202
Discussant: Ben Lewis is system architect and project manager for WorldMap, an open source infrastructure that supports collaborative research centered on geospatial information. Before joining Harvard, Ben was a project manager with Advanced Technology Solutions of Pennsylvania, where he led the company in adopting platform independent approaches to GIS system development. Ben studied Chinese at the University of Wisconsin and has a Masters in Planning from the University of Pennsylvania. After Penn, Ben helped start the GIS Lab at U.C. Berkeley, founded the GIS group for transportation engineering firm McCormick Taylor, and coordinated the Land Acquisition Mapping System for South Florida Water Management District. Ben is especially interested in technologies that lower the barrier to spatial technology access.
Information Science Brown Bag talks, hosted by the Program on Information Science, consist of regular discussions and brainstorming sessions on all aspects of information science and on uses of information science and technology to assess and solve institutional, social and research problems. These are informal talks. Discussions are often inspired by real-world problems being faced by the lead discussant.
An introduction to the Joint Information Systems Committee Resource Discovery iKit. Includes a look at controlled vocabularies declared in Resource Description Framework (RDF)/Simple Knowledge Organisation System (SKOS) and at Wikipedia entries. Presented by Tony Ross at the CILIPS Centenary Conference Branch and Group Day, which took place on 5 June 2008.
Slides for my keynote presentation "Linked Data for Digital History", presented at Semantic Web for Scientific History (SW4SH), co-located with ESWC 2015.
Presented at JIST 2015, Yichang, China
Prototype: http://rc.lodac.nii.ac.jp/rdf4u/
Video: https://www.youtube.com/watch?v=z3roA9-Cp8g
Abstract: The Semantic Web and Linked Open Data (LOD) are known to be powerful technologies for knowledge management, and explicit knowledge is expected to be represented in RDF (Resource Description Framework), but most users stay away from RDF because of the technical skills it requires. Since a concept map or node-link diagram can enhance learning from beginner to advanced level, RDF graph visualization can be a suitable tool for familiarizing users with semantic technology. However, an RDF graph generated from a whole query result is not suitable for reading: it is highly connected, like a hairball, and poorly organized. To make a graph that presents knowledge easier to read, this research introduces an approach to sparsifying a graph using a combination of three main functions: graph simplification, triple ranking, and property selection. These functions are largely based on interpreting RDF data as knowledge units, together with statistical analysis, in order to deliver an easily readable graph to users. A prototype is implemented to demonstrate the suitability and feasibility of the approach. It shows that the simple, flexible graph visualization is easy to read and makes a good impression on users. The attractive tool also helps inspire users to recognize the advantageous role of linked data in knowledge management.
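As an illustration of what a triple-ranking pass might look like, here is a short Python sketch using rdflib (the paper's actual ranking functions are not given here, so scoring by predicate rarity is an invented stand-in for the statistical analysis the abstract mentions):

    # Keep the k highest-ranked triples; rare predicates are kept first, so the
    # "hairball" of very common properties is pruned away.
    from collections import Counter
    from rdflib import Graph

    def sparsify(g: Graph, k: int) -> Graph:
        pred_freq = Counter(p for _, p, _ in g)            # predicate statistics
        ranked = sorted(g, key=lambda t: pred_freq[t[1]])  # rarest predicates first
        out = Graph()
        for triple in ranked[:k]:
            out.add(triple)
        return out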
GeoSemantic Technologies for Archaeological Resources (Paul Cripps)
The semantics of heritage data is a growing area of interest with ontologies such as the CIDOC-CRM providing semantic frameworks and exemplary projects such as STAR and STELLAR demonstrating what can be done using semantic technologies applied to archaeological resources. In the world of the Semantic Web, advances regarding geosemantics have emerged to extend research more fully into the spatio-temporal domain, for example extending the SPARQL standard to produce GeoSPARQL. Importantly, the use of semantic technologies, particularly the structure of RDF, aligns with graph and network based approaches, providing a rich fusion of techniques for geospatial analysis of heritage data expressed in such a manner.
This paper gives an overview of the ongoing G-STAR research project (GeoSemantic Technologies for Archaeological Resources) with reference to broader sectoral links, particularly to commercial archaeology. Particular attention is paid to the integration of spatial data into the heritage Global Graph and the relationship between Spatial Data Infrastructure (SDI) and Linked Data, moving beyond notions of 'location' as simple nodes, placenames and coordinates towards fuller support for complex geometries and advanced spatial reasoning. Finally, the potential impacts of such research are discussed with particular reference to the current practice of commercial archaeology, access to and publishing of (legacy, big) data, and leveraging network models to better understand and manage change within archaeological information systems.
Comparing and matching archaeological excavation data for integration in onto... (ariadnenetwork)
Presentation by Anja Masur and Keith May
OAW (Austrian Academy of Sciences);
OREA (Institute for Oriental and European Archaeology);
English Heritage;
University of South Wales
Full-day session on archaeological infrastructures and services at the 18th Cultural Heritage and New Technologies (CHNT) conference
Vienna, Austria
11th-13th November 2013
New Discovery Tools for Digital Humanities and Spatial Data (Summary of the J... (Micah Altman)
My colleague (Merrick) Lex Berman, Web Service Manager & GIS Specialist at the Center for Geographic Analysis at Harvard, presented this as part of the Program on Information Science Brown Bag Series. Lex is an expert in applications related to digital humanities, GIS, and Chinese history, and has developed many interesting tools in this area.
In his talk, Lex notes how the library catalog has evolved from the description of items in physical collections into a wide-reaching net of services and tools for managing both physical collections and networked resources: the line between descriptive metadata and actual content is becoming blurred. Librarians and catalogers are now in the position of being not only docents of collections but innovators in digital research, and this opens up a number of opportunities for retooling library discovery tools. His presentation surveys methods and projects that have extended traditional library and museum catalogs into online collections of digital objects in the humanities, focusing on projects that use historical place names and geographic identifiers for linked open data.
EarthCube's OceanLink - Project Overview and Presentation Updates (March 2014) (EarthCube)
EAGER: Collaborative Research: EarthCube Building Blocks, Leveraging Semantics and Linked Data for Geoscience Data Sharing and Discovery, or "OceanLink", is one of 15 EarthCube-funded components.
This presentation includes an OceanLink Project Overview (slides 1-12), followed by several presentations highlighting separate project efforts and updates to different audiences:
Slide 13: "Ontologies in a data-driven world." Montana State University Computer Science Department, March 3, 2014.
Slide 44: "Towards ontology patterns for ocean science repository integration", Ontology Summit 2014, Ontolog online session January 2014.
Slide 82: OceanLink: Using Patterns for Discovery in EarthCube, GeoVoCampSB2014, Santa Barbara, March, 2014
Slide 118: "Ontologies in a data driven world," IBM T.J. Watson Research Center, January 2014.
Deep Earth Computer: A Platform for Linked Science of the Deep Carbon Obser... (Xiaogang (Marshall) Ma)
Deep Carbon Observatory-Data Science is assembling a Deep Earth Computer for the Deep Carbon Observatory (DCO). The effort will create a fundamental change in the conduct of carbon-related research, resting on a 21st-century data science platform and a series of aggregate data holdings that have never existed before. Data science combines aspects of informatics, data management, library science, computer science and physical science using cyberinfrastructure and information technology. The Deep Earth Computer provides, at minimum: a concept-type repository; the ability to identify and manage all key entities, agents and activities on the platform; a repository for archiving datasets and associated metadata; collaboration tools; and an integrated portal to manage diverse content and applications, with varied access levels and privacy options. The Deep Earth Computer sets up a platform for the Linked Science of the deep carbon community: not only are scientific assets such as data and methods behind scientific settings opened and inter-connected, but the people, organizations, groups, samples, instruments, activities, grants, meetings, etc. are also recorded and inter-connected. Such a platform will promote collaboration among DCO community members, improve the openness and reproducibility of carbon-related research, and facilitate credit to contributors of resources (including publications, datasets, instruments, etc.).
Research Data Infrastructure for Geochemistry (DFG Roundtable) (Kerstin Lehnert)
This presentation provides an overview of different aspects of data management for geochemistry and resources available at the EarthChem@IEDA data facility.
SENESCHAL: Semantic ENrichment Enabling Sustainability of arCHAeological Link... (CIGScotland)
Presented at Linked Open Data: current practice in libraries and archives (Cataloguing & Indexing Group in Scotland 3rd Linked Open Data Conference), Edinburgh, 18 Nov 2013
Slides for a presentation on recent work with Web Archives at the Oxford Internet Institute (http://www.oii.ox.ac.uk/) given at WIRE2014 (http://wp.comminfo.rutgers.edu/nsfia/schedule/)
Dig the new breed: how open approaches can empower archaeologists (DART Project)
Dig the new breed: how open approaches can empower archaeologists
The title contains a dodgy archaeology pun based on an album by The Jam.
A presentation by Anthony Beck at the Open Knowledge Conference (OKCon) 2010 on 24th April 2010.
Supporting the research lifecycle of geo-GSNL initiative through HPC and Rese... (Raul Palma)
Volcanic eruptions are among the most spectacular and dangerous phenomena on Earth, capable of generating disasters at various scales. The Geohazard Supersites and Natural Laboratories (GSNL) initiative is today a network of 11 Supersites, including volcanoes and seismic areas. Complex algorithms are used to analyse these data and extract important information on volcanic activity. In addition to computing power and resources, researchers from the geo-GSNL community, like many other data-intensive science communities, are calling for innovative ways to manage their data, methods and other resources, ways that can enhance the visibility of scientific breakthroughs, encourage reuse, and foster broader research accessibility. In this contribution we present the results of the EVER-EST project (H2020-EINFRA-2015-1), in which, in collaboration with different partners, we created a virtual research environment (VRE) for Earth Science (https://vre.ever-est.eu/) that embraces the research object concept and technologies at its core.
Tectonic Storytelling with Open Source and Digital Object Identifiers - a cas... (Peter Löwe)
The communication of advances in research to the general public, for both education and decision making, is an important aspect of scientific work. An even more crucial task is to gain recognition within the scientific community, which is judged by impact factor and citation counts. Recently, the latter concepts have been extended from textual publications to include data and software publications.
This paper presents a case study for science communication and data citation. For this, tectonic models, Free and Open Source Software (FOSS), best practices for data citation and a multimedia online portal for scientific content are combined. This approach creates mutual benefits for the stakeholders: target audiences receive information on the latest research results, while the use of Digital Object Identifiers (DOI) increases the recognition and citation of the underlying scientific data. This creates favourable conditions for every researcher, as DOI names ensure citeability and long-term availability of scientific research.
In the developed application, the FOSS tool for tectonic modelling, GPlates, is used to visualise and manipulate plate-tectonic reconstructions and associated data through geological time. These capabilities are augmented by the Science on a Halfsphere (SoaH) project with a robust and intuitive visualisation hardware environment. The tectonic models used for science communication are provided by the AGH University of Science and Technology. They focus on the Silurian to Early Carboniferous evolution of Central Europe (Bohemian Massif) and were interpreted for the area of the Geopark Bergstraße Odenwald based on the GPlates/SoaH hardware and software stack.
As scientific storytelling is volatile by nature, recordings are a natural means of preservation for further use, reference and analysis. For this, the upcoming portal for audiovisual media of the German National Library of Science and Technology (TIB) is expected to become a critical service infrastructure. It allows complex search queries, including metadata such as DOIs and media fragment identifiers (MFI), thereby linking data citation and science communication.
Department of Geography and Geoinformation Science Seminar, George Mason University, Falls Church, VA, September 2015.
Increasingly, GIS is part of the collaboration between computer scientists, information scientists, and domain scientists to solve complex scientific questions. Successfully addressing scientific problems, such as informing regional decision- and policy-making for coastal zone management and marine spatial planning, requires integrative and innovative approaches to analyzing, modeling, and developing extensive and diverse data sets. The current chaotic distribution of available data sets, lack of documentation about them, and lack of easy-to-use access tools and computer modeling and analysis codes are still major obstacles for scientists and educators alike. Contributing solutions to these problems is part of an emerging science agenda at Esri for a range of environmental, conservation, climate and ocean sciences that will be discussed. The talk will highlight some recent projects in progress, including a new global map of ecological land units, new tools to support multidimensional scientific data, continued work on an ocean basemap, and more.
Data Facilities Workshop - Panel on Current Concepts in Data Sharing & Intero... (EarthCube)
This series of presentations was given at the EarthCube Data Facilities End-User Workshop held January 15-17, 2014 in Washington, DC. This workshop provided a forum to discuss the unique requirements and challenges associated with developing the communication, collaboration, interoperability, and governance structures that will be required to build EarthCube in conjunction with existing and emerging NSF/GEO facilities.
This panel and discussion, specifically, outlined and explained several current concepts in data sharing and interoperability, featuring presentations by:
Paul Morin (UMN): Polar Cyberinfrastructure
Don Middleton (UCAR): Atmospheric/Climate
Kerstin Lehnert (LDEO): Domain Repositories & Physical Samples
David Schindel (CBOL, GRBio): Biological Perspective & Collections
Hank Loescher (NEON): Observation Networks
Daniel Fuka (Virginia Tech) and Ruth Duerr (NSIDC): Brokering
Ilya Zaslavsky (UCSD): Cross-Domain Interoperability
Maths, chemistry and physics are very well suited to the Semantic Web, but very poorly represented on it. Here I show how valuable it can be and what (relatively little) needs to be done.
Semantically-Enabled Environmental Data Discovery and Integration: Demonstrat... (Tatiana Tarasova)
This presentation is about a framework for semantically enabled data discovery and integration across multiple Earth science disciplines. Data harmonization was based on the principles of Linked Data. Previous work defines Data Cube extensions relevant to particular Earth science disciplines. To provide a generic, domain-independent solution, we propose an upper-level vocabulary (the ENVRI vocabulary) that allows domain-specific information to be expressed at a higher level of abstraction.
For human users, we provide an interactive Web-based interface for data discovery and integration across multiple research infrastructures (http://portal.envri.eu). The system is demonstrated on a use case of the Iceland volcano eruption of April 10, 2010.
From data portal to knowledge portal: Leveraging semantic technologies to sup... (Xiaogang (Marshall) Ma)
Scientific research practices regularly adopt new technologies and platforms in an effort to increase information timeliness, sharing and discoverability. There are many initiatives related to open data, open code, open access and open collections, which together make up the topic of Open Science in academia. Being open has two levels of meaning. The first is to make the data, code, sample collections, publications, etc. freely accessible online. The other is the annotation of and connection between those resources to establish the provenance information needed for reproducible scientific research. In this paper we present our work on a web portal for the Deep Carbon Observatory (DCO) community. The DCO is a 10-year (2009-2019) initiative to intensify global attention and scientific effort in the burgeoning field of deep carbon science. Inspired by guiding questions such as "how much carbon does Earth contain?", "where is it?" and "what can deep carbon tell us about origins?", more than 1000 scientists across the world are actively participating in the DCO community. The DCO web portal is a research collaboration website developed to keep track of all researchers, organizations, instruments, field sites, and research outputs related to the DCO community. We intend the DCO web portal to be a knowledge portal, adopting state-of-the-art semantic technologies to support various stages of the scientific process within and beyond the DCO community.
Adoption of RDA DTR and PID in Deep Carbon Observatory Data Portal (Xiaogang (Marshall) Ma)
The Deep Carbon Observatory (DCO) community is building a cyber-enabled platform for linked science, made available to the community by a multi-institutional data portal. Persistent identifiers and domain specific data types have been identified as key technological issues the portal must address. This presentation focuses on the DCO portal’s planned adoption of RDA DTR and PID methodologies and technologies as a means to address the DCO community's need for persistently identifiable and understandable data type information.
Knowledge Evolution in Distributed Geoscience Datasets and the Role of Semant... (Xiaogang (Marshall) Ma)
Knowledge evolves in geoscience, and the evolution is reflected in datasets. In a context of distributed data sources, the evolution of knowledge can pose considerable challenges to data management and re-use. For example, a short news item published in 2009 (Mascarelli, 2009) revealed the geoscience community's concern that the International Commission on Stratigraphy's change to the definition of the Quaternary might require heavy reworking of geologic maps. Now we are in the era of the World Wide Web, and geoscience knowledge is increasingly modeled and encoded in the form of ontologies and vocabularies using semantic technologies. Accordingly, knowledge evolution leads to a consequence called ontology dynamics. Flouris et al. (2008) summarized 10 topics of general ontology change/dynamics, such as ontology mapping, morphism, evolution, debugging and versioning. Ontology dynamics has impacts at several stages of a data life cycle and causes challenges such as: requests to rework the extant data in a data center; semantic mismatches among data sources; differing understandings of the same dataset between data providers and data users; and error propagation in cross-discipline data discovery and re-use (Ma et al., 2014). This presentation will analyze best practices in the geoscience community so far and summarize a few recommendations for reducing the negative impacts of ontology dynamics in a data life cycle, including: communities of practice and collaboration on ontology and vocabulary building; linking data records to standardized terms; and methods for (semi-)automatic reworking of datasets using semantic technologies.
References:
Flouris, G., Manakanatas, D., Kondylakis, H., Plexousakis, D., Antoniou, G., 2008. Ontology change: classification and survey. The Knowledge Engineering Review 23 (2), 117-152.
Ma, X., Fox, P., Rozell, E., West, P., Zednik, S., 2014. Ontology dynamics in a data life cycle: Challenges and recommendations from a Geoscience Perspective. Journal of Earth Science 25 (2), 407-412.
Mascarelli, A.L., 2009. Quaternary geologists win timescale vote. Nature 459, 624.
Why Data Science Matters - 2014 WDS Data Stewardship Award Lecture (Xiaogang (Marshall) Ma)
A presentation reviewing technical trends in data management, publication and citation, and methodologies for data interoperability, provenance of research, and semantic eScience.
CLOSED - Call for Papers: Semantic eScience special issue in Earth Science In... (Xiaogang (Marshall) Ma)
This special issue invites research papers that demonstrate how semantic methodologies and technologies are currently meeting scientific or engineering goals in Earth and space science domains. Papers should highlight the innovative designs, methods or applications associated with the semantic technologies. Review papers presenting state-of-the-art knowledge about a subject in semantic e-Science and methodology and software papers about a new algorithm or software package are also welcome.
Ontology Development for Provenance Tracing in National Climate Assessment o... (Xiaogang (Marshall) Ma)
The periodic National Climate Assessment (NCA) of the US Global Change Research Program (USGCRP) [1] produces reports on findings about global climate change and the impacts of climate change on the United States. Those findings are of great public and academic concern and are used in policy and management decisions, which makes the provenance of the findings in those reports especially important. The USGCRP is developing a Global Change Information System (GCIS), in which the NCA reports and associated provenance information are the primary records.
We have been modeling and developing Semantic Web applications for the GCIS. By applying a use-case-driven iterative methodology [2], we developed an ontology [3] to represent the content structure of a report and the associated provenance information. We also mapped the classes and properties of our ontology onto the W3C PROV-O ontology [4] to realize a formal representation of provenance. We successfully implemented the ontology in several pilot systems for a recent National Climate Assessment report (the NCA3); these give users the ability to browse and search provenance information by topics of interest. Provenance information for the NCA3 has been made structured and interoperable through the developed ontology. Besides the pilot systems we developed, other tools and services are also able to interact with the data in the context of the "Web of data" and thus create added value.
Our research shows that the use-case-driven iterative method bridges the gap between Semantic Web researchers and earth and environmental scientists and can be deployed rapidly for developing Semantic Web applications. Our work also provides first-hand experience of re-using the W3C PROV-O ontology in the earth and environmental sciences, as PROV-O was only recently (04/30/2013) ratified as a W3C recommendation and relevant applications are still rare. A small illustrative PROV-O snippet is given after the references below.
[1] http://www.globalchange.gov
[2] Fox, P., McGuinness, D.L., 2008. TWC Semantic Web Methodology. Accessible at: http://tw.rpi.edu/web/doc/TWC_SemanticWebMethodology
[3] https://scm.escience.rpi.edu/svn/public/projects/gcis/trunk/rdf/schema/GCISOntology.ttl
[4] http://www.w3.org/TR/prov-o/
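For a flavour of what such a mapping looks like in practice, here is a minimal rdflib sketch; the instance names (figure, dataset, agency) are hypothetical and are not taken from the actual GCIS ontology [3]:

    # Record that a report figure was derived from a dataset attributed to an
    # agency, using W3C PROV-O terms.
    from rdflib import Graph, Namespace, RDF

    PROV = Namespace("http://www.w3.org/ns/prov#")
    EX = Namespace("http://example.org/gcis/")   # hypothetical instance namespace

    g = Graph()
    g.bind("prov", PROV)
    g.add((EX.figure_1_2, RDF.type, PROV.Entity))
    g.add((EX.sea_level_dataset, RDF.type, PROV.Entity))
    g.add((EX.figure_1_2, PROV.wasDerivedFrom, EX.sea_level_dataset))
    g.add((EX.sea_level_dataset, PROV.wasAttributedTo, EX.nasa))
    print(g.serialize(format="turtle"))

Traversing prov:wasDerivedFrom chains in such a graph is then enough to answer questions like "which datasets contributed to this figure?"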
Ontology spectrum for geological data interoperability (PhD defense Nov 2011) (Xiaogang (Marshall) Ma)
Ontology spectrum for geological data interoperability.
A 10-minute layman's presentation for my PhD defense at the University of Twente, 2011/11/30. The full text of the dissertation is accessible at: http://www.itc.nl/library/papers_2011/phd/ma.pdf
– Recent progress on geologic time ontologies and vocabularies;
– Their applications;
– Further work;
– Recommendations for other geoscience ontology and vocabulary work.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, AI, big data, real-time systems, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead Prasad and Procure.FYI's Co-Founder.
Learn SQL from basic queries to advanced queries (manishkhaire30)
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios (a short runnable sketch follows this list).
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
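As a small illustration of the basic-to-advanced progression described above, here is a self-contained Python/sqlite3 sketch (the table and data are invented; the window function needs SQLite 3.25+):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE sales(region TEXT, amount REAL);
        INSERT INTO sales VALUES ('east', 100), ('east', 250), ('west', 80);
    """)
    # Basic: retrieval with filtering and aggregation.
    for row in con.execute("SELECT region, SUM(amount) FROM sales "
                           "WHERE amount > 50 GROUP BY region"):
        print(row)
    # Advanced: a window function ranking rows within each region.
    for row in con.execute("SELECT region, amount, RANK() OVER "
                           "(PARTITION BY region ORDER BY amount DESC) FROM sales"):
        print(row)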
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake (Walaa Eldin Moustafa)
Dynamic policy enforcement is becoming an increasingly important topic in today's world, where data privacy and compliance are a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) they are auto-generated from declarative data annotations; (2) they respect user-level consent and preferences; (3) they are context-aware, encoding a different set of transformations for different use cases; (4) they are portable: while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines. A toy illustration of the idea follows below.
#SQL #Views #Privacy #Compliance #DataLake
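The following is a toy sketch of the general pattern only, not LinkedIn's actual ViewShift implementation: a consent-aware SQL view that a catalog layer could substitute for the raw table (shown with Python's sqlite3; all names are invented):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE members(id INTEGER, email TEXT, consented INTEGER);
        INSERT INTO members VALUES (1, 'a@example.com', 1), (2, 'b@example.com', 0);
        -- Auto-generatable from a declarative annotation: mask email unless consented.
        CREATE VIEW members_compliant AS
            SELECT id,
                   CASE WHEN consented = 1 THEN email ELSE NULL END AS email
            FROM members;
    """)
    # A catalog would resolve the table name 'members' to the view transparently.
    print(con.execute("SELECT * FROM members_compliant").fetchall())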
Adjusting primitives for graphs: SHORT REPORT / NOTES (Subhajit Sahu)
Notes on adjusting primitives for graph algorithms such as PageRank. Compressed Sparse Row (CSR) is an adjacency-list-based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... (sameer shah)
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged can save iteration time. Skipping in-identical vertices (those with the same in-links) helps avoid duplicate computation and thus could also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes are easy to calculate afterwards; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether. A minimal sketch of the convergence-skipping idea is given below.
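A minimal Python sketch of the convergence-skipping heuristic (assuming an unweighted directed graph given as adjacency lists, with no dangling vertices; freezing a vertex once its rank stops moving trades a little accuracy for speed):

    def pagerank_skip_converged(out_edges, d=0.85, tol=1e-10, max_iter=100):
        n = len(out_edges)
        in_edges = [[] for _ in range(n)]
        for u, vs in enumerate(out_edges):
            for v in vs:
                in_edges[v].append(u)
        rank = [1.0 / n] * n
        active = set(range(n))                    # vertices not yet converged
        for _ in range(max_iter):
            if not active:
                break
            nxt = {v: (1 - d) / n +
                      d * sum(rank[u] / len(out_edges[u]) for u in in_edges[v])
                   for v in active}
            for v, r in nxt.items():
                if abs(r - rank[v]) <= tol:
                    active.discard(v)             # skip this vertex from now on
                rank[v] = r
        return rank

    print(pagerank_skip_converged([[1], [2], [0]]))  # 3-cycle: all ranks = 1/3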
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method, where all vertices are processed in every iteration. It comes, however, with the precondition that the input graph contain no dead ends. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph whose vertices were split by component. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large number of small workload submissions and is expected to be a non-issue when the computation is performed on massive graphs.
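A small sketch of the decomposition step the abstract describes, using networkx on a toy graph: condense the graph into its directed acyclic block-graph of strongly connected components, then visit the blocks in topological order (Levelwise PageRank would compute ranks one block at a time in this order, since a vertex's rank depends only on its own and earlier blocks):

    import networkx as nx

    G = nx.DiGraph([(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 3)])
    C = nx.condensation(G)              # DAG whose nodes are the SCCs of G
    for block in nx.topological_sort(C):
        print(block, sorted(C.nodes[block]["members"]))
    # -> block 0: [0, 1, 2], then block 1: [3, 4]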
Exploring the Web of Data for Earth and Environmental Sciences
1. TWC
Exploring the Web of Data for
Earth and Environmental Sciences
Xiaogang (Marshall) Ma
Tetherless World Constellation
Rensselaer Polytechnic Institute
@MarshallXMa | max7@rpi.edu
x.marshall.ma
rpi.edu/~max7
ORCID: 0000-0002-9110-7369 | MarshallXMa
2. TWC Outline
• Web of Data
– Semantic Web, RDF, Ontology, Linked Open Data
• Weaving the Web of Data
– OneGeology-Europe
– Global Change Information System
– Deep Carbon Observatory-Data Science
– Deep Time Data Infrastructure
• Exploring the Web of Data
– Semantic similarity
– Concept mapping
• Summary
– Semantic eScience
3. TWC Semantic Web
“The Semantic Web is an extension of the current web in
which information is given well-defined meaning, better
enabling computers and people to work in cooperation. ”
Berners-Lee et al., 2001. Sci. Amer.
4. TWC Web of Documents vs. Web of Data
Back to the early 1990s
• HTML and URL
• Markup language and ways for
connecting resources
• Below the file level
• Stopped at the text level
The First Website
http://info.cern.ch/hypertext/WWW/TheProject.html
<a href=“…
5. TWC
Since the early 2000s…
• XML, RDF, OWL and URIs
• Markup language and ways for
connecting resources
• Below the file level
• Below the text level
• At the data level
Web of Documents vs. Web of Data
(cont.)
Miller, 2004
6. TWC Resource Description Framework (RDF)
• A standard of W3C
• RDF is made up of triples
– <subject, predicate, object>
• RDFS extends RDF with a standard “ontology vocabulary”
– Class, Property
– subClassOf
– Domain, Range
– Type
– …
http://www.w3.org/RDF/ http://www.w3.org/TR/rdf-schema/
<Mozart, composed, The Magic Flute>
<Mozart, isA, Musician>
<The Magic Flute, isA, Opera>
<Musician, rdf:type, owl:Class>
<Musician, rdfs:subClassOf, Artist>
<composed, rdf:type, owl:ObjectProperty>
<composed, rdfs:domain, Musician>
<composed, rdfs:range, Opera>
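The slide's triples are directly runnable; a minimal sketch with Python's rdflib, using an invented example.org namespace for the instance URIs:

    from rdflib import Graph, Namespace, RDF, RDFS, OWL

    EX = Namespace("http://example.org/")
    g = Graph()
    g.add((EX.Mozart, EX.composed, EX.TheMagicFlute))   # data triples
    g.add((EX.Mozart, RDF.type, EX.Musician))
    g.add((EX.TheMagicFlute, RDF.type, EX.Opera))
    g.add((EX.Musician, RDF.type, OWL.Class))           # RDFS/OWL schema triples
    g.add((EX.Musician, RDFS.subClassOf, EX.Artist))
    g.add((EX.composed, RDF.type, OWL.ObjectProperty))
    g.add((EX.composed, RDFS.domain, EX.Musician))
    g.add((EX.composed, RDFS.range, EX.Opera))
    print(g.serialize(format="turtle"))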
7. TWC Ontology
• The term ontology originated in philosophy
– The study of the nature of existence
• For Semantic Web purposes
– An ontology is the specification of a shared conceptualization of a domain
7
Aristotle
(384 – 322 BCE)
I Ching (Book of Changes)
(c. 450 – 250 BCE)
8. TWC An Ontology Spectrum
[Figure: a spectrum of enriching semantic expression, from catalog (alphanumerical list), glossary, thesaurus, and taxonomy (supergroup-subgroup hierarchy) to conceptual schema (formal superclass-subclass) and formal assertions (disjoint subclasses; transitive properties; symmetric properties, etc.)]
(Ma et al., 2010; adapted from Welty, 2002; McGuinness, 2003; Obrst, 2003; Uschold and Gruninger, 2004; Borgo et al., 2005)
9. TWC Querying RDF Data
• Query Languages such as SPARQL
– Most forms of the query languages contain a set of triple patterns
– Triple patterns are like RDF triples except that each of the subject,
predicate and object may be a variable
Query:
SELECT ?x ?y
WHERE
{
?x composed ?y .
}
Data:
<Mozart, composed, The Magic Flute>
<Mozart, isA, Musician>
<The Magic Flute, isA, Opera>
• ?x, ?y are variables
• ?x composed ?y represents a <subject, predicate, object> triple
Result: <Mozart, The Magic Flute>
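The same triple pattern can be executed with rdflib against the graph g sketched after slide 6; again, the ex: namespace is an assumption, not from the presentation.

results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?x ?y WHERE { ?x ex:composed ?y . }
""")
for x, y in results:
    print(x, y)  # -> http://example.org/Mozart http://example.org/TheMagicFlute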
10. TWC Querying RDF Data
(Query, data, and result repeated from the previous slide)
Images courtesy of OneGeology
Correlation between Mozart and Geology?
11. TWC A Vision of The Semantic Web
Machine-processable, global
Web standards:
• Assigning unambiguous
identifiers (URI)
• Expressing data, including
metadata (RDF)
• Capturing ontologies (OWL)
• Query, rules, transformations,
deployment, application spaces,
logic, proof, trust
Bratt, 2006, Emerging Web Technologies to Watch
14. TWC OneGeology-Europe
• 20 European nations providing national geologic maps at scale ~1:1M
• Harmonized geological terms and map legends
• Multilingual labels in 18 languages
• Central portal for data browsing/query among distributed data sources
A contribution to INSPIRE
http://www.onegeology-europe.org
Recent Works in Geoscience
15. TWC
Federated query
Result of geologic units with age ‘Cenozoic - from 66 million years to today’
http://onegeology-europe.brgm.fr/geoportal/viewer.jsp
16. TWC
Distributed datasets: mismatches of geological units across political boundaries
[Map examples: Italy/France near Cuneo/Colmar (Cambrian vs. Carboniferous); Wyoming/Colorado (felsic and hornblendic gneisses vs. granitic rocks) (Asch et al., 2012; Ma et al., 2014). Base maps courtesy of OneGeology-Europe and USGS]
There is still work to be done…
17. TWC
Ma et al., 2011, nGeo
Data Interoperability:
“Data should be discoverable, accessible, decodable,
understandable and usable, and data sharing should be
legal and ethical for all participants.”
Original image from: http://ehna.org
19. TWC
“Figure 1.2: Sea Level Rise: Past, Present, and Future” in draft NCA3
An example question of provenance tracing:
What are the NASA contributions to Figure 1.2 in the draft NCA3?
20. TWC Current result: GCIS ontology version 1.1
Classes and properties representing a brief structure of the draft NCA3
21. TWC
Classes and properties about the sensors, instruments, platforms, algorithms, etc. from which datasets are generated
GCIS ontology version 1.1
22. TWC
Image from nature.com
Ma et al., 2014, nClimate
Provenance Documentation:
“Linking a range of observations and model outputs, research
activities, people and organizations involved in the production of
scientific findings with the supporting data sets and methods
used to generate them”
23. TWC Deep Carbon Virtual Observatory
Fox, 2014
http://deepcarbon.net
DCVO: A cyber-enabled platform for linked science
Deep Carbon Observatory
(2009-2019)
• Deep Energy
• Deep Life
• Extreme Physics and
Chemistry
• Reservoirs and Fluxes
24. TWC Deep Carbon Virtual Observatory
• A vision of the DCVO:
– A conceptual model of the interplay between data, people,
publication, instruments, models, organizations, etc.
– Identify, annotate and link all key entities, agents and
activities
– A repository for datasets and associated metadata
– Unique and powerful data and metadata visualization for
dissemination of information
– Collaboration tools for scientific efforts
– An integrated portal for diverse content and applications
Fox et al., 2014
http://deepcarbon.net
25. TWC Deep Time Data Infrastructure (2015-2025)
Studying the co-evolution of geosphere and biosphere
http://www.wmkeck.org/grant-programs/research/medical-research-grant-abstracts/science-and-engineering-2014
Deep Time Data Workshop
at AGU 2014
Vast amounts of data related to planetary
evolution through deep time:
• Mineralogy and Petrology
• Paleobiology and Paleontology
• Paleotectonics and Paleomagnetism
• Geochemistry and Geochronology
• Genomics and Proteomics
…
Short-term goal (2015-2017):
Develop, curate, and integrate diverse data
resources to focus on our planet’s changing
near-surface oxidation state and the rise of
oxygen through deep time
26. TWC Deep Time Data Infrastructure (2015-2025)
(Content repeated from the previous slide, with the reference below added)
Hystad, G., Downs, R.T., and Hazen, R.M. (2015) Mineral frequency distribution data conform to a LNRE model: Prediction of Earth’s “missing” minerals. Mathematical Geosciences, in press.
27. TWC Exploring the Web of Data
• Geoscience vocabularies and ontologies are increasingly created and used
– Concept recognition, comparison and interlinking will improve the quality of data integration
28. TWC Exploring the Web of Data (cont.)
• SEM+: a tool for concept mapping in geoscience
– SEM: Similarity-based Entity Matching
– Compute semantic similarity between concepts
– Suggest possible linking
Zheng et al., 2015, ESIn
29. TWC SEM+: Similarity-based Entity Matching
Workflow of SEM+:
Input Entities → Blocking (Prefix-blocking) → Entity Blocks → Computing Similarity (Information Entropy based Weighted Similarity (IEWS) Model: Information Entropy Computation + Triple-wise Similarity Computation) → Similarity Matrix → Selecting Matches → Final Matches
30. TWC Blocking Algorithm
[SEM+ workflow diagram repeated; highlighted step: Blocking (Prefix-blocking), from Input Entities to Entity Blocks]
• Input: two or more large sets of concepts
isc:Archean
    rdf:type gts:GeochronologicEra , skos:Concept ;
    rdfs:comment "younger bound-2500.0"@en , "older bound-4000.0"@en ;
    rdfs:label "Archean Eon"@en ;
    gts:rank <http://resource.geosciml.org/ontology/timescale/rank/Eon> ;
    … …
An example concept
31. TWC Prefix: Rare Keywords
[SEM+ workflow diagram repeated; highlighted step: Blocking (Prefix-blocking)]
• Input: two or more large sets of concepts
• Blocking: group similar concepts
• Efficiency: reduces the number of concept pairs to compare
• Grouping concepts based on keywords in their literal descriptions
• Intuition: concepts that share more rare keywords (prefix) are more likely to be similar
• Prefix-blocking: ‘prefixes’ are keywords that belong to the least number of concepts
• The final similarity computation will only apply to concepts in the same block
32. TWC Concept Blocks
[SEM+ workflow diagram repeated; highlighted step: Blocking (Prefix-blocking)]
• Input: two or more large sets of concepts
• Blocking: group similar concepts
• 4 concepts & their keywords:
w = {A, B, C, E, K, L}
x = {C, D, E, L}
y = {B, K, E, L}
z = {A, B, L}
• Prefix-blocking: lw is the size of a block and lb is the blocking parameter; if lw > lb, remove that block
• Result blocks, with lb = 2:
A: {w, z}
C: {w, x}
D: {x}
K: {w, y}
• Output: concept blocks (a sketch of this step follows below)
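A minimal sketch of prefix-blocking as described on these slides, reproducing the worked example above; the function and variable names are illustrative, not the SEM+ code.

from collections import defaultdict

def prefix_blocking(concepts, lb):
    blocks = defaultdict(set)  # keyword -> set of concepts containing it
    for name, keywords in concepts.items():
        for k in keywords:
            blocks[k].add(name)
    # drop any block larger than the blocking parameter lb
    return {k: v for k, v in blocks.items() if len(v) <= lb}

concepts = {
    "w": {"A", "B", "C", "E", "K", "L"},
    "x": {"C", "D", "E", "L"},
    "y": {"B", "K", "E", "L"},
    "z": {"A", "B", "L"},
}
print(prefix_blocking(concepts, lb=2))
# -> {'A': {'w', 'z'}, 'C': {'w', 'x'}, 'D': {'x'}, 'K': {'w', 'y'}}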
34. TWC Triple-wise Similarity
[SEM+ workflow diagram repeated; highlighted step: Triple-wise Similarity Computation]
• Similarity between two concepts c and c′
– Similarity between triples describing the two concepts
– Triple-wise (pv) similarity: Sim_pv
• A challenge: property mapping
Example of property mapping:
_:Boston rdf:type _:t1
_:t1 rdfs:label ‘City’
is the same as
_:Boston _:category ‘City’
$$Sim_{pv}(pv, pv') = \begin{cases} Sim_l(v, v') & \text{if } v \text{ and } v' \text{ are both literals} \\ Sim_F(v, v') & \text{if } v \text{ and } v' \text{ are both URIs} \\ \text{get the literal value, then use } Sim_l & \text{if } v \text{ is a literal and } v' \text{ is a URI} \end{cases}$$
35. TWC Similarity between Two Triple Values
[SEM+ workflow diagram repeated; highlighted step: Triple-wise Similarity Computation]
$$Sim_{pv}(pv, pv') = \begin{cases} Sim_l(v, v') & \text{if } v \text{ and } v' \text{ are both literals} \\ Sim_F(v, v') & \text{if } v \text{ and } v' \text{ are both URIs} \\ \text{get the literal value, then use } Sim_l & \text{if } v \text{ is a literal and } v' \text{ is a URI} \end{cases}$$
• Similarity between two triple values
• For case 1, compute similarity using Lin’s method (Lin, 1998. ICML)
• For case 2, apply the equation Sim_F(c, c′) recursively
– Here we only traverse URIs to a depth of three
• For case 3, first extract the literal value of v′ and then use Sim_l
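A hedged sketch of the three-way dispatch above: sim_literal stands in for Lin's measure, sim_concept for the recursive Sim_F, and get_label for fetching a URI's literal label; all three are injected assumptions, not the authors' code.

from rdflib import Literal, URIRef

def sim_pv(v1, v2, sim_literal, sim_concept, get_label, depth=3):
    if isinstance(v1, Literal) and isinstance(v2, Literal):
        return sim_literal(str(v1), str(v2))       # case 1: both literals
    if isinstance(v1, URIRef) and isinstance(v2, URIRef):
        if depth == 0:                             # bound the recursion at depth three
            return 0.0
        return sim_concept(v1, v2, depth - 1)      # case 2: both URIs
    # case 3: one literal and one URI -> compare against the URI's label
    lit, uri = (v1, v2) if isinstance(v1, Literal) else (v2, v1)
    return sim_literal(str(lit), get_label(uri))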
36. TWC Similarity between Two Concepts
[SEM+ workflow diagram repeated; highlighted step: Triple-wise Similarity Computation]
$$Sim(c, c') = \frac{Sim_{pv}}{Sim_{pv} + \alpha(|PV_1| - Sim_{pv}) + \beta(|PV_2| - Sim_{pv})}$$
• Similarity between two concepts c and c′
• Apply Jaccard similarity (Jaccard, 1912. New Phytologist)
• |PV1|: number of pvs in concept c
• |PV2|: number of pvs in concept c′
• α and β: coefficients weighting the unique descriptions of c and c′ in the similarity measure
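As a minimal sketch, the formula maps onto a few lines of Python; sim_pv_total denotes the aggregated triple-wise similarity, and the default alpha/beta values are illustrative only.

def concept_similarity(sim_pv_total, n_pv1, n_pv2, alpha=0.5, beta=0.5):
    denom = (sim_pv_total
             + alpha * (n_pv1 - sim_pv_total)
             + beta * (n_pv2 - sim_pv_total))
    return sim_pv_total / denom if denom else 0.0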
37. TWC Information Entropy
[SEM+ workflow diagram repeated; highlighted step: Information Entropy Computation]
• Information entropy: a quantified measure of the uncertainty of information content (Shannon, 1948)
• The amount of information of a property can be quantified as information entropy
Example: properties are not equally important for concept description; a triple giving the Social Security Number is more important than a triple giving the name for identifying a person.
• X: a property
• P(xi): the probability of X obtaining a particular value xi
• Information Entropy of X:
$$H(X) = -\sum_{i=1}^{n} P(x_i)\,\log_b P(x_i)$$
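A small sketch of the entropy of one property, counting how often each distinct value occurs; the sample values are illustrative.

import math
from collections import Counter

def entropy(values, base=2):
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total, base)
                for c in counts.values())

print(entropy(["Eon", "Era", "Era", "Period"]))  # 1.5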
38. TWC Joint Information Entropy
[SEM+ workflow diagram repeated; highlighted step: Information Entropy Computation]
• Joint information entropy: a measure of the uncertainty associated with a set of variables
• X1, …, Xn: a list of properties
• x1, …, xn: particular values of X1, …, Xn, respectively
• P(x1, …, xn): the probability of those values occurring together
• Joint Information Entropy of X1, …, Xn:
$$H(X_1, \ldots, X_n) = -\sum_{x_1} \cdots \sum_{x_n} P(x_1, \ldots, x_n)\,\log_b P(x_1, \ldots, x_n)$$
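The same counting scheme extends to joint entropy by treating each co-occurring combination of property values as a single outcome; again a sketch with illustrative data.

import math
from collections import Counter

def joint_entropy(value_tuples, base=2):
    counts = Counter(value_tuples)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total, base)
                for c in counts.values())

print(joint_entropy([("Eon", "old"), ("Era", "young"), ("Era", "young")]))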
39. TWC Information Entropy
[SEM+ workflow diagram repeated]
• P: the set of properties in Sim_pv
• H(P): information entropy of the common descriptions
$$Sim_F(c, c') = H(P)\,\frac{Sim_{pv}}{Sim_{pv} + \alpha(|PV_1| - Sim_{pv}) + \beta(|PV_2| - Sim_{pv})}$$
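A hedged sketch of the entropy-weighted similarity Sim_F, combining the two previous sketches; h_common stands for H(P), and all names are illustrative rather than the authors' code.

def sim_f(sim_pv_total, n_pv1, n_pv2, h_common, alpha=0.5, beta=0.5):
    denom = (sim_pv_total
             + alpha * (n_pv1 - sim_pv_total)
             + beta * (n_pv2 - sim_pv_total))
    return h_common * (sim_pv_total / denom) if denom else 0.0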
40. TWC Selecting Matches between Concepts
[SEM+ workflow diagram repeated; highlighted step: Selecting Matches, from the Similarity Matrix to the Final Matches]
41. TWC Summary
• eScience: the digital or electronic facilitation of science
• Semantic eScience
– A virtuous circle between science and semantic technologies
– Data driven + Knowledge driven?
• My understanding of Semantic eScience: AIR3
– Anyone can say anything on any topic
– Interoperability, interactivity, intercreativity
– The right information for the right person at the right time
Thanks for listening
@MarshallXMa
max7@rpi.edu
43. TWC Recently finished CGI vocabularies
Earth Resource Form
Environmental Impact Value
Exploration Activity Type
Exploration Result
UNFC Value
Earth Resource Expression
Earth Resource Shape
Enduse Potential
Mineral Occurrence Type
Mining Activity Type
Processing Activity Type
Mining Waste Type Value
Commodity Code
Mineral Deposit Group
Mineral Deposit Type
Product Value
• Construct a collection of vocabularies for populating information interchange documents and enabling interoperability
• Provide labels for concepts, scoped to various communities defined by language, science domain, or application domain
CGI Geoscience Terminology Workgroup
http://cgi-iugs.org/tech_collaboration/geoscience_terminology_working_group.html
44. TWC Prior to 2005, we built systems; now we build frameworks
• Rough definitions
– Systems have very well-defined entry and exit points. A user tends to know when they are using one. Options for extension are limited and usually require engineering.
– Frameworks have many entry and use points. Users often do not know when they are using one. Extension points are part of the design.
– Platforms are built on frameworks
(Fox, 2014)
Geoinformatics
45. TWC Semantic eScience
• Artificial Intelligence accelerates scientific discovery
– Data search, synthesis and hypothesis representation
– Data analysis: reasoning with models of the data
(Gil et al., 2014)
Image from science.com
A state-of-the-art example: Hanalyzer (high-throughput analyzer)
• Uses natural language processing to automatically extract a semantic network from all PubMed papers relevant to a scientist
• Uses Semantic Web technology to integrate assertions from other biomedical sources
• Reasons about the network to find new correlations that suggest new genes to investigate
(Leach et al., 2009)