MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Metadata, Usage Metrics, and User Feedback to Improve Data Discovery and Access
Similar to MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Metadata, Usage Metrics, and User Feedback to Improve Data Discovery and Access
Optimization of Search Results with Duplicate Page Elimination using Usage Data - IDES Editor
The performance and scalability of search engines are greatly affected by the enormous amount of duplicate data on the World Wide Web. Search results flooded with identical or near-identical web pages reduce search efficiency and increase the time users need to find the desired information. When navigating through the results, the only information users leave behind is the trace of the pages they accessed; this data is recorded in query log files and is usually referred to as Web Usage Data. This paper proposes a novel technique for optimizing search efficiency by removing duplicate data from search results, which utilizes the usage data stored in the query logs. Duplicate data detection is performed by the proposed Duplicate Data Detection (D3) algorithm, which works offline on the basis of favored user queries found by pre-mining the logs with query clustering. The proposed result optimization technique is expected to substantially improve search engine efficiency and effectiveness.
BioSHaRE: Opal and Mica: a software suite for data harmonization and federati... - Lisette Giepmans
BioSHaRE conference July 28th, 2015, Milan - Latest tools and services for data sharing
Stream 1: Tools for data sharing analysis and enhancement
Opal is a software application to manage study data, and includes a feature enabling data harmonisation and data integration across studies. As such, Opal supports the development and implementation of processing algorithms required to transform study-specific data into a common harmonised format. Moreover, when connected to a Mica web interface, Opal allows users to seamlessly and securely search distributed datasets across several Opal instances.
Opal is freely available for download at www.obiba.org and is provided under the GPL3 open source licence. All studies or networks of studies using the Opal software for data storage, data management or data harmonisation must mention Opal in manuscripts, presentations, or other works made public and include a web link to the Maelstrom Research website (www.maelstrom-research.org).
Mica is a software application developed to create web portals for individual epidemiological studies or for study consortia. Features supported by Mica include a standardised study catalogue, study-specific and harmonised variable data dictionary browsers, online data access request forms, and communication tools (e.g. forums, events, news).
When used in conjunction with the Opal software, Mica also allows authenticated users (i.e. with username and password) to perform distributed queries on the content of study databases hosted on remote servers, and retrieve summary statistics of that content.
Mica is a Java-based, cross-platform, client-server application and comes with two clients: the administrators' user interface and a content management system (Drupal) used to render the catalogue content on the study or consortium portal.
Mica is freely available for download at www.obiba.org and is provided under the GPL3 open source license.
LOP – Capturing and Linking Open Provenance on LOD Cycle - rogers.rj
Presentation of the paper "LOP – Capturing and Linking Open Provenance on LOD Cycle" at the 5th International Workshop on Semantic Web Information Management (SWIM 2013). New York, USA – June 23, 2013
An On-line Collaborative Data Management System - Cameron Kiddle
A presentation I prepared that was presented by Rob Simmonds at the Gateway Computing Environments 2010 Workshop in New Orleans on November 14, 2010. It provides an overview of a data management system that was developed for GeoChronos - an on-line collaborative platform for Earth observation scientists.
WEB LOG PREPROCESSING BASED ON PARTIAL ANCESTRAL GRAPH TECHNIQUE FOR SESSION ... - cscpconf
Web access log analysis examines patterns of web site usage and features of user behavior. Raw log data is noisy and unstructured, so preprocessing is vital for efficient web usage mining. Preprocessing comprises three phases: data cleaning, user identification, and session construction. Session construction is crucial, and many real-world problems can be modeled as traversals on a graph; mining these traversals supports the preprocessing phase. Existing works, however, only consider traversals on unweighted graphs. This paper generalizes this to the case where the vertices of the graph are given weights to reflect their significance. The proposed method constructs sessions as a Partial Ancestral Graph (PAG) containing pages with calculated weights, which helps site administrators find the pages that interest users and redesign their web pages accordingly. After weighting each page according to browsing time, a PAG structure is constructed for each user session. The existing system has a problem of learning with the latent variables of the data, which the proposed method overcomes.
Federated Search: The Good, The Bad And The Ugly - dorishelfer
Presented at the SLA 2007 Annual Conference in Denver, CO to the Science and Technology Division (Sci-Tech) on a program entitled: "Federated Searching: The Good, The Bad and the Ugly." Based on an article in Searcher and with additional contributions from Kathy Dabbour and Lynn Lampert on user and librarian assessment of Federated Searching.
Talk given by prof. T.K. Prasad at the workshop on Semantics in Geospatial Architectures: Applications and Implementation. The workshop was held from October 28-29, 2013 at Pyle Center (702 Langdon Street, Madison, WI), University of Wisconsin-Madison.
This talk was provided by Ursula Pieper of the National Agricultural Library for the NISO Virtual Conference, Using Open Source in Your Institution, held on Feb 17, 2016
CEDAR is a metadata management tool that lets users define metadata templates using a well-described yet flexible metadata format. CEDAR then presents the forms represented by those templates to other users to fill out. CEDAR offers semantic precision (with support from the BioPortal ontology repository), metadata completion assistance, intelligent recommendations, support for JSON-LD and RDF metadata export, and an easy-to-use user interface.
A survey on Design and Implementation of Clever Crawler Based On DUST Removal - IJSRD
Nowadays, the World Wide Web has become a popular medium for searching information, business, trading, and so on. A well-known problem faced by web crawlers is the existence of a large fraction of distinct URLs that correspond to pages with duplicate or near-duplicate content; an estimated 29% of web pages are duplicates. Such URLs, commonly named DUST, represent an important problem for search engines. To deal with this problem, the first efforts focused on comparing document content to detect and remove duplicate documents without fetching their contents. To accomplish this, the proposed methods learn normalization rules to transform all duplicate URLs into the same canonical form; a challenging aspect of this strategy is deriving a set of general and precise rules. DUSTER is a new approach to detect and eliminate redundant content: when crawling the web, DUSTER takes advantage of a multi-sequence alignment strategy to learn rewriting rules that transform a URL into other URLs likely to have the same content. This alignment strategy can lead to a reduction in the number of duplicate URLs up to 54% larger than previous approaches.
Query Optimization in OODBMS: Identifying Subqueries for Query Management - ijdms
This paper presents a relatively new approach to query optimization in object databases, which uses query decomposition and cached query results to improve query execution. The issues addressed are fast retrieval and high reuse of cached queries, and the decomposition of complex queries into smaller subqueries for faster retrieval of results. We also address another open area of query caching, handling wider queries: using parts of cached results to answer other (wider) queries and combining many cached queries while producing the result. Multiple experiments were performed to demonstrate the productivity of this way of optimizing a query. The limitation of the technique is that it is useful mainly in scenarios where the data manipulation rate is very low compared to the data retrieval rate.
The majority of computer and mobile phone users rely on the web for search. Web search engines are used for the searching, and the results a search engine serves are gathered by a software module known as the web crawler. The size of the web is increasing round the clock, and the principal problem is searching this huge database for specific information; deciding whether a web page is relevant to a search topic is difficult. This paper proposes a crawler called the "PDD crawler", which follows both a link-based and a content-based approach. The crawler follows a new crawling strategy to compute the relevance of a page: it analyses the content of the page based on the information contained in various tags within the HTML source code and then computes the total weight of the page. The page with the highest weight thus has the maximum content and highest relevance.
Using Feedback from Data Consumers to Capture Quality Information on Environm... - Anusuriya Devaraju
Data quality information is essential to facilitate reuse of Earth science data. Recorded quality information must be sufficient for other researchers to select suitable data sets for their analysis and confirm the results and conclusions. In the research data ecosystem, several entities are responsible for data quality. Data producers (researchers and agencies) play a major role in this aspect as they often include validation checks or data cleaning as part of their work. It is possible that the quality information is not supplied with published data sets; if it is available, the descriptions might be incomplete, ambiguous or address specific quality aspects. Data repositories have built infrastructures to share data, but not all of them assess data quality. They normally provide guidelines for documenting quality information. Some suggest that scholarly and data journals should take a role in ensuring data quality by involving reviewers to assess data sets used in articles, and incorporating data quality criteria in the author guidelines. However, this mechanism primarily addresses data sets submitted to journals. We believe that data consumers will complement existing entities to assess and document the quality of published data sets. This has been adopted in crowdsourcing platforms such as Zooniverse, OpenStreetMap, Wikipedia, Mechanical Turk and Tomnod. This paper presents a framework designed based on open source tools to capture and share data users' feedback on the application and assessment of research data. The framework comprises a browser plug-in, a web service and a data model such that feedback can be easily reported, retrieved and searched. The feedback records are also made available as Linked Data to promote integration with other sources on the Web. Vocabularies from Dublin Core and PROV-O are used to clarify the source and attribution of feedback. The application of the framework is illustrated with CSIRO's Data Access Portal.
The Materials Data Facility: A Distributed Model for the Materials Data Commu... - Ben Blaiszik
Presentation given at the UIUC Workshop on Materials Computation: data science and multiscale modeling. Materials Data Facility data publication, discovery, Globus, and associated python and REST interfaces are discussed. Video available soon.
Presentation to IASSIST 2013, in the session Expanding Scholarship: Research Journals and Data Linkages. Describes PREPARDE workshop on repository accreditation for data publication and invites comments on guidelines.
Being FAIR: FAIR data and model management SSBSS 2017 Summer School - Carole Goble
Lecture 1:
Being FAIR: FAIR data and model management
In recent years we have seen a change in expectations for the management of all the outcomes of research – that is the “assets” of data, models, codes, SOPs, workflows. The “FAIR” (Findable, Accessible, Interoperable, Reusable) Guiding Principles for scientific data management and stewardship [1] have proved to be an effective rallying-cry. Funding agencies expect data (and increasingly software) management retention and access plans. Journals are raising their expectations of the availability of data and codes for pre- and post- publication. The multi-component, multi-disciplinary nature of Systems and Synthetic Biology demands the interlinking and exchange of assets and the systematic recording of metadata for their interpretation.
Our FAIRDOM project (http://www.fair-dom.org) supports Systems Biology research projects with their research data, methods and model management, with an emphasis on standards smuggled in by stealth and sensitivity to asset sharing and credit anxiety. The FAIRDOM Platform has been installed by over 30 labs or projects. Our public, centrally hosted Asset Commons, the FAIRDOMHub.org, supports the outcomes of 50+ projects.
Now established as a grassroots association, FAIRDOM has over 8 years of experience of practical asset sharing and data infrastructure at the researcher coal-face ranging across European programmes (SysMO and ERASysAPP ERANets), national initiatives (Germany's de.NBI and Systems Medicine of the Liver; Norway's Digital Life) and European Research Infrastructures (ISBE) as well as in PI's labs and Centres such as the SynBioChem Centre at Manchester.
In this talk I will explore how FAIRDOM has been designed to support Systems Biology projects and show examples of its configuration and use. I will also explore the technical and social challenges we face.
I will also refer to European efforts to support public archives for the life sciences. ELIXIR (http://www.elixir-europe.org/) is the European Research Infrastructure of 21 national nodes and a hub, funded by national agreements to coordinate and sustain key data repositories and archives for the Life Science community, improve access to them and related tools, support training and create a platform for dataset interoperability. As the Head of the ELIXIR-UK Node and co-lead of the ELIXIR Interoperability Platform I will show how this work relates to your projects.
[1] Wilkinson et al, The FAIR Guiding Principles for scientific data management and stewardship Scientific Data 3, doi:10.1038/sdata.2016.18
Slides from our tutorial on Linked Data generation in the energy domain, presented at the Sustainable Places 2014 conference on October 2nd in Nice, France
RDM Roadmap to the Future, or: Lords and Ladies of the Data - Robin Rice
Story of the new 2017-2020 University of Edinburgh RDM Roadmap, with a Tolkienesque theme for IASSIST-CARTO 2018 in Montreal: "Once upon a data point: sustaining our data storytellers".
EarthCube Monthly Community Webinar - Nov. 22, 2013 - EarthCube
This webinar features project overviews of all EarthCube Awards (Building Blocks, Research Coordination Networks, Conceptual Designs, and Test Governance), followed by a call for involvement, and a Q&A session.
Agenda:
EarthCube Awards – Project Overviews
1. EarthCube Web Services (Building Block)
2. EC3: Earth-Centered Community for Cyberinfrastructure (RCN)
3. GeoSoft (Building Block)
4. Specifying and Implementing ODSIP (Building Block)
5. A Broker Framework for Next Generation Geoscience (BCube) (Building Block)
6. Integrating Discrete and Continuous Data (Building Block)
7. EAGER: Collaborative Research (Building Block)
8. A Cognitive Computer Infrastructure for Geoscience (Building Block)
9. Earth System Bridge (Building Block)
10. CINERGI – Community Inventory of EC Resources for Geoscience Interoperability (BB)
11. Building a Sediment Experimentalist Network (RCN)
12. C4P: Collaboration and Cyberinfrastructure for Paleogeosciences (RCN)
13. Developing a Data-Oriented Human-centric Enterprise for Architecture (CD)
14. Enterprise Architecture for Transformative Research and Collaboration (CD)
15. EC Test Enterprise Governance: An Agile Approach (Test Governance)
A Call for Involvement!
5 minute presentation during the SC13 Birds of a Feather Session on the relationship between the Research Data Alliance and High Performance Computing.
Jisc Research Data Shared Service - Spring Update - Jisc RDM
Update 13.05.2016
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Metadata, Usage Metrics, and User Feedback to Improve Data Discovery and Access
1. 1
Mining and Utilizing Dataset Relevancy
from Oceanographic Dataset
(MUDROD) Metadata, Usage Metrics,
and User Feedback to Improve Data
Discovery and Access
Dr. Chaowei (Phil) Yang (PI, George Mason University)
Mr. Edward M Armstrong (Co-I, Jet Propulsion Laboratory, NASA)
AIST-14-82, NNX15AM85G Final Review
June 22, 2017
2. 2
Mining and Utilizing Dataset Relevancy from Oceanographic Data (MUDROD)
PI: Chaowei (Phil) Yang, George Mason University
TRLin = 5
4/09
Objective
• Improve data discovery, selection and access to NASA
Observational Data.
• Intuitive interface to federated data holdings.
• Enable new user communities to discover and access
data for their projects.
• Reduce time for scientists to discover, download and
reformat data.
• Implement extensible ontology framework.
• Improve discovery accuracy of oceanographic data
• Foundation for Managing Big Data.
• Demonstrate MUDROD at PO.DAAC.
Approach
• Setup collaboration, testing environment.
• Design MUDROD knowledge base system.
• Reconstruct sessions from web logs
• Calculate vocabulary similarity from logs and metadata.
• Update semantic search and conduct alpha testing.
• Integrate MUDROD alpha into P.O.DAAC.
• Enhance knowledge base, to include GCMD.
• Demonstrate Prototype.
03/16
Key Milestones
AIST-14-82 NNX15AM85G
CoIs: E. Armstrong, T. Huang, D. Moroni, JPL;
• Start 06/15
• Identify Use cases 01/16
• Design search, query, reasoning engine 03/16
• Ontological System Implementation 07/16
• Complete Beta test at P.O. DAAC 12/16
• Integrated test 02/17
• PO.DAAC metadata discovery Demo (TRL 7) 05/17
MUDROD Engine Architecture
3. 3
Mining and Utilizing Dataset Relevancy from Oceanographic Data (MUDROD)
PI: Chaowei (Phil) Yang, George Mason University
TRLin = 5 TRLout = 7
4/09
Objective
• Improve data discovery, selection and access to NASA
Observational Data.
• Intuitive interface to federated data holdings.
• Enable new user communities to discover and access
data for their projects.
• Reduce time for scientists to discover, download and
reformat data.
• Implement extensible ontology framework.
• Improve discovery accuracy of oceanographic data
• Foundation for Managing Big Data.
• Demonstrate MUDROD at PO.DAAC.
Accomplishments
• Developed framework
• Analyze web logs to discover user knowledge (query and
data relationships)
• Construct knowledge base by combining semantics and
profile analyzer
• Improve data discover
• Resulted in
• better ranking;
• recommendation;
• ontology navigation
03/16 AIST-14-0082
CoIs: T. Huang, D. Moroni, E. Armstrong, JPL;
4. 4
Mr. Edward M Armstrong
(Senior Data Engineer , NASA Jet
Propulsion Laboratory)
Mr. David Moroni
(Data Engineer, NASA Jet
Propulsion Laboratory)
Team Members: Investigators
Dr. Chaowei Phil Yang
(Professor and Director, NSF
Spatiotemporal Innovation Center,
George Mason University)
Mr. Thomas Huang
(Principal Scientist, NASA Jet
Propulsion Laboratory)
5. 5
Team Members: Students and Post-Docs
Ms. Yun Li
(Ph.D. student, George Mason
University; Research Assistant ).
Mr. Yongyao Jiang
(Ph.D. student, George Mason
University; Research Assistant ).
Dr. Lewis J. McGibbney
(Data Scientist, NASA Jet
Propulsion Laboratory).
Mr. Chris Finch
(Software Engineer, NASA Jet
Propulsion Laboratory).
Mr. Frank Greguska
(Software Engineer, NASA Jet Propulsion
Laboratory).
Mr. Gary Chen
6. 6
Presentation Contents
• Background / Objectives / Technical Advancements
• TRL assessment
• Summary of Accomplishments and Plans
• Schedule
• Financials
• Publications
• List of Acronyms
7. 7
Background / Objectives / Tech Advance (1)
• Keyword-based matching (traditional search engines)
– User query: ocean wind
– Final query: ocean AND wind
• Reveal the real intent of user query
– ocean wind = “ocean wind” OR “greco” OR
“surface wind” OR “mackerel breeze” …
• PO.DAAC UWG Recommendation 2014-07
• NASA ESDSWG Search Relevance Recommendations 2016 & 2017
Data Discovery Problems
8. 8
Background / Objectives / Tech Advance (2)
• Analyze web logs to discover user knowledge (query and data
relationships)
• Construct knowledge base by combining semantics and profile
analyzer
• Improve data discovery by 1) better ranking; 2) recommendation; 3)
ontology navigation
Objectives
9. 9
Background / Objectives / Tech Advance (2)
• Web log preprocessing
• Semantic analysis of user queries & Navigation
• Machine learning based search ranking
• Data Recommendation
Tech Advance (four technological modules)
10. 10
Presentation Contents
• Background and Objectives
• TRL assessment
• Summary of Accomplishments and Plans
• Schedule
• Financials
• Publications
• List of Acronyms
11. 11
Accomplishments Since last Review
TRL Assessment (1)
• Performance improvement
• Improved log mining performance using cloud computing
• Hybrid Recommendation algorithm
• Stem MUDROD ontologies from a number of sources
• create an OWL manifestation based on NASA GCMD-DIF data model
• create a PO.DAAC-specific extension of the SWEET concept
• Ingest PO.DAAC extractions into ESIP Ontology Portal
• Improve web UI
• Test MUDROD and ingest latest PO.DAAC logs
• Deploy MUDROD on JPL server
12. 12
Current TRL Rational
TRL Assessment (2)
Component (current TRL / project-end TRL): description
Semantic search engine
• Search Dispatcher (7 / 7): translating a user search query into a set of new semantic queries
• Similarity calculator (7 / 7): calculating the semantic similarity from weblogs, metadata, and ontology
• Recommendation module (7 / 7): recommending similar datasets to the clicked dataset
• Ranking module (7 / 7): re-ranking the search results based on the RankSVM ML algorithm
Knowledge base
• Ontology (7 / 7): extensions from the SWEET ontology for earth science data
• Triple Store (7 / 7): ESIP ontology repository
Vocabulary linkage discovery engine
• Profile analyzer (7 / 7): extracting user browsing patterns from raw web logs
Web services/GUI
• Ranking service/presenter (7 / 7): providing and presenting the ranked results
• Recommendation service/presenter (7 / 7): providing and presenting the related datasets
• Ontology navigation service/presenter (7 / 7): providing and presenting related searches
13. 13
Presentation Contents
• Background and Objectives
• TRL assessment
• Summary of Accomplishments and Plans
• Schedule
• Financials
• Publications
• List of Acronyms
14. 14
Overview
• Web log preprocessing
• Semantic analysis of user queries & Navigation
• Machine learning based search ranking
• Data Recommendation
Functions/Modules
16. 16
• Requests sent from a client (e.g. browser, command-line tool) are recorded by the server
• Log files provided by PO.DAAC (HTTP(S), FTP)
Example entry and its fields:
68.180.228.99 - - [31/Jan/2015:23:59:13 -0800] "GET /datasetlist/... HTTP/1.1" 200 84779 "/ghrsst/" "Mozilla/5.0 ..."
Client IP: 68.180.228.99
Request date/time: [31/Jan/2015:23:59:13 -0800]
Request: "GET /datasetlist/... HTTP/1.1"
HTTP code: 200
Bytes returned: 84779
Referrer/previous page: "/ghrsst/"
User agent/browser: "Mozilla/5.0 ..."
Web logs
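As an illustration of how one such combined-format line can be split into the fields listed above, here is a minimal Python sketch (not taken from the MUDROD codebase; the regex group names are my own):

import re

# Minimal parsing sketch (not from MUDROD): split an Apache-style combined
# log line, like the PO.DAAC example above, into its individual fields.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of request fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

sample = ('68.180.228.99 - - [31/Jan/2015:23:59:13 -0800] '
          '"GET /datasetlist/... HTTP/1.1" 200 84779 "/ghrsst/" "Mozilla/5.0 ..."')
print(parse_log_line(sample))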
17. 17
Goal: reconstruct user browsing patterns (search history & clickstream) from a set of raw logs.
Processing steps: web logs → user identification → crawler detection → structure reconstruction → session identification → search history and clickstream.
Additional steps include word normalization, stop word removal, and stemming.
Data preprocess
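To make the session identification step concrete, here is a small illustrative sketch (my own simplification, not MUDROD's implementation): requests from the same client are grouped into a new session whenever the gap between consecutive requests exceeds a timeout. The (IP, user agent) key and the 30-minute gap are assumptions made for this example.

from datetime import datetime, timedelta

# Illustrative sessionization sketch (not MUDROD's implementation): group a
# user's requests into sessions, starting a new session whenever consecutive
# requests are more than `gap` apart.
def sessionize(requests, gap=timedelta(minutes=30)):
    sessions = {}
    for req in sorted(requests, key=lambda r: r["time"]):
        user = (req["ip"], req["agent"])
        user_sessions = sessions.setdefault(user, [])
        if user_sessions and req["time"] - user_sessions[-1][-1]["time"] <= gap:
            user_sessions[-1].append(req)    # continue the current session
        else:
            user_sessions.append([req])      # start a new session
    return sessions

requests = [
    {"ip": "68.180.228.99", "agent": "Mozilla/5.0",
     "time": datetime(2015, 1, 31, 23, 59, 13), "url": "/datasetlist/"},
    {"ip": "68.180.228.99", "agent": "Mozilla/5.0",
     "time": datetime(2015, 2, 1, 0, 2, 40), "url": "/ghrsst/"},
]
print({user: len(s) for user, s in sessionize(requests).items()})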
19. 19
Data preprocess results
1. User search history 2. Clickstream
Jiang, Y., Y. Li, C. Yang, E. M. Armstrong, T. Huang & D. Moroni (2016) “Reconstructing Sessions from Data Discovery
and Access Logs to Build a Semantic Knowledge Base for Improving Data Discovery” ISPRS International Journal of
Geo-Information, 5, 54.
22. 22
User search history
• Create query – user matrix
• Calculate binary cosine similarity
Conceptual example
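As a toy illustration of the binary cosine similarity over a query-user matrix (the queries and user IDs below are invented, not taken from PO.DAAC logs):

import math

# Illustrative sketch: treat each query as the set of users who issued it and
# compute binary cosine similarity = |A ∩ B| / sqrt(|A| * |B|).
def binary_cosine(users_a, users_b):
    if not users_a or not users_b:
        return 0.0
    return len(users_a & users_b) / math.sqrt(len(users_a) * len(users_b))

search_history = {           # query -> set of user ids that issued it (made up)
    "ocean temperature": {"u1", "u2", "u3"},
    "sea surface temperature": {"u2", "u3", "u4"},
    "ocean wind": {"u3"},
}
q1, q2 = "ocean temperature", "sea surface temperature"
print(q1, "vs", q2, "=", round(binary_cosine(search_history[q1], search_history[q2]), 2))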
23. 23
Clickstream
• Hypothesis: similar queries can result in similar clicking behavior
• If two queries are similar, the data that get clicked after they are
searched would be more likely to be similar
24. 24
Metadata
• Hypothesis: semantically related terms tend to appear in the same
metadata more frequently
• Essentially the same as the clickstream analysis
• Perform Latent Semantic Analyses (LSA) over the term – metadata
matrix
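A small sketch of this step using Latent Semantic Analysis over a toy term-metadata matrix (the terms and counts are invented; MUDROD runs LSA over the real PO.DAAC metadata):

import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative sketch: LSA over a tiny term-metadata matrix
# (rows = terms, columns = metadata records; all values are made up).
terms = ["sst", "ghrsst", "sea surface temperature", "salinity"]
term_metadata = np.array([
    [1, 1, 0, 1],   # sst
    [1, 1, 0, 0],   # ghrsst
    [1, 0, 1, 1],   # sea surface temperature
    [0, 0, 1, 0],   # salinity
], dtype=float)

svd = TruncatedSVD(n_components=2, random_state=0)
term_vectors = svd.fit_transform(term_metadata)   # terms in the latent space
sims = cosine_similarity(term_vectors)

i, j = terms.index("sst"), terms.index("sea surface temperature")
print(f"LSA similarity({terms[i]}, {terms[j]}) = {sims[i, j]:.2f}")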
25. 25
Integration
• All four results can be converted into (concept A, concept B, similarity) triples
• Problem:
– None of them is perfect (uncertainty in the data, hypotheses, and methods)
– Metadata and ontology might contain terms unknown to search engine end users
– Sometimes, similarity values from different methods are inconsistent
26. 26
Integration
• The maximum similarity of all of the components (large similarity
appears to be more reliable)
• The adjustment increment becomes larger when the similarity exists in
more sources
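The slides do not give the exact integration formula, so the following is only one plausible reading of "take the maximum and add a larger increment when more sources agree"; the step size and the cap are my assumptions, not MUDROD's actual values.

# Illustrative sketch of one way to integrate per-source similarities:
# start from the maximum value and add an increment that grows with the
# number of sources reporting a non-zero similarity.
def integrate(similarities, step=0.05):
    """similarities: dict like {"search_history": 0.66, "clickstream": 0.94, ...}"""
    observed = [v for v in similarities.values() if v > 0]
    if not observed:
        return 0.0
    boost = step * (len(observed) - 1)   # more agreeing sources -> larger increment
    return min(1.0, max(observed) + boost)

pair = {"search_history": 0.66, "clickstream": 0.94, "metadata": 0.72, "ontology": 0.0}
print(integrate(pair))   # max(0.66, 0.94, 0.72) + 2 * 0.05, capped at 1.0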
28. 28
Results and evaluation
Example - vocabulary linked to the query "ocean temperature" by each method:
• Search history: sea surface temperature (0.66), sea surface topography (0.56), ocean wind (0.56), aqua (0.49)
• Clickstream: sea surface temperature (0.94), sst (0.94), group high resolution sea surface temperature dataset (0.89), ghrsst (0.87)
• Metadata: sst (0.96), ghrsst (0.77), sea surface temperature (0.72), surface temperature (0.63), reynolds (0.58)
• SWEET: none
• Integrated list: sst (1.0), sea surface temperature (1.0), ghrsst (1.0), group high resolution sea surface temperature dataset (0.99), reynolds sea surface temperature (0.74)
Overall accuracy, evaluated by domain experts:
• Most popular 10 queries: 88%
• Least popular 10 queries: 61%
• Randomly selected 10 queries: 83%
Jiang, Y., Y. Li, C. Yang, K. Liu, E. M. Armstrong, T. Huang & D. Moroni (2017) A Comprehensive Approach to Determining the Linkage Weights among Geospatial Vocabularies - An Example with Oceanographic Data Discovery. International Journal of Geographical Information Science (minor revision)
30. 30
Background
• Ranking is a long-standing problem in geospatial data discovery
• Typically, hundreds, even thousands of matches
• Can get larger as more Earth observation data is being collected
31. 31
Objective and Methods
• Put the most desired data to the top of the result list
• What features can represent users’ search preferences for
geospatial data?
• How can the ranking function reach a balance of all these features?
• Identified eleven features from
– Geospatial metadata attributes
– Query-metadata content overlap
– User behavior from web logs
32. 32
Ranking features – Metadata attributes
• Release date: the date when the data was published
• Processing level (PL): the processing level of image products, ranging from level 0 to level 4
• Version number: the published version of the data
• Spatial resolution: the spatial resolution of the data
• Temporal resolution: the temporal resolution of the data
These five metadata features were verified by domain experts. They are query-independent: static, depending on the data itself, and do not change with the query.
33. 33
Spatial query-metadata overlap
• Spatial similarity between query area and the coverage of a particular
data
• Overlap area normalized by the original area of query and data
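A small sketch of that normalized overlap for two lat/lon bounding boxes (boxes are given as (west, south, east, north); averaging the two normalized ratios is my assumption, since the slide does not state how the two normalizations are combined):

# Illustrative sketch: overlap between the query bounding box and a dataset's
# spatial coverage, normalized by both original areas.
def bbox_area(box):
    west, south, east, north = box
    return max(0.0, east - west) * max(0.0, north - south)

def spatial_overlap(query_box, data_box):
    west = max(query_box[0], data_box[0])
    south = max(query_box[1], data_box[1])
    east = min(query_box[2], data_box[2])
    north = min(query_box[3], data_box[3])
    overlap = bbox_area((west, south, east, north))
    if overlap == 0.0:
        return 0.0
    return 0.5 * (overlap / bbox_area(query_box) + overlap / bbox_area(data_box))

query = (-130.0, 20.0, -110.0, 40.0)       # hypothetical query area
dataset = (-180.0, -90.0, 180.0, 90.0)     # a global dataset
print(round(spatial_overlap(query, dataset), 4))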
34. 34
Ranking features – User behavior
• All-time, monthly, user popularity, and semantic popularity (retrieved
from web logs)
• Semantic popularity: the number of times that the data has been
clicked after searching a particular query and its highly related ones
(query-dependent)
35. 35
RankSVM
• One of the most widely recognized ML ranking algorithms
• Converts a ranking problem into a classification problem that a regular SVM algorithm can solve
• 3 main steps
1) Standardize: mean = 0, std = 1
o SVM is not scale invariant
o Over-optimized
o Longer to train
2) For any pair of training data, calculate the difference
3) A ranking problem becomes a binary classification problem,
where SVM is applied to find the optimal decision boundary
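A compact sketch of those three steps (using scikit-learn for brevity; MUDROD's own implementation is built on Spark MLlib, and the feature values and relevance grades below are invented):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Toy feature vectors for results of one query, and click-derived relevance grades.
X = np.array([[0.9, 0.2, 30.0],
              [0.4, 0.8, 5.0],
              [0.1, 0.1, 2.0],
              [0.7, 0.5, 12.0]])
y = np.array([3, 2, 0, 1])

# Step 1: standardize features (SVMs are not scale invariant).
X = StandardScaler().fit_transform(X)

# Step 2: pairwise differences; the label says which item of the pair ranks higher.
pairs, labels = [], []
for i in range(len(X)):
    for j in range(len(X)):
        if y[i] != y[j]:
            pairs.append(X[i] - X[j])
            labels.append(1 if y[i] > y[j] else -1)

# Step 3: the ranking problem is now a binary classification problem for a linear SVM.
model = LinearSVC(C=1.0, max_iter=10000).fit(np.array(pairs), np.array(labels))

# Score and rank results along the learned weight vector.
scores = X @ model.coef_.ravel()
print("ranked result indices:", np.argsort(-scores))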
36. 36
Architecture
• All of these steps (except for training) can be finished within 2 seconds
• None of the mainstream open-source ML libraries provides a ranking algorithm
• We implemented it ourselves with the aid of Spark MLlib
[Architecture diagram: a user query is expanded into a semantic query against the index; the top-K retrieved results pass through a feature extractor (drawing on web logs, user clicks, and the knowledge base) and are re-ranked by a ranking model produced by the learning algorithm from training data.]
37. 37
NDCG(K) for five different ranking methods at varying K (1-40)
Jiang, Y., Y. Li, C. Yang, K. Liu, E. M. Armstrong, T. Huang, D. Moroni & L. J. McGibbney (2017) Towards intelligent geospatial discovery: a machine learning ranking framework. International Journal of Digital Earth (minor revision)
39. 39
How to recommend geospatial data?
• Use geospatial metadata for content-based recommendation
-Metadata spatiotemporal similarity
-Metadata attribute similarity
-Metadata content similarity
• Leverage user behavior data for collaborative filtering (CF) recommendation
-Session-based co-occurrence of data
40. 40
Attribute type / attribute name: description
Spatiotemporal attributes
• DatasetCoverage-EastLon: the east longitude of the bounding rectangle
• DatasetCoverage-WestLon: the west longitude of the bounding rectangle
• DatasetCoverage-NorthLat: the north latitude of the bounding rectangle
• DatasetCoverage-SouthLat: the south latitude of the bounding rectangle
• DatasetCoverage-StartTimeLong: the start time of the data
• DatasetCoverage-StopTimeLong: the end time of the data
Categorical geographic attributes
• DatasetRegion-Region: region of the dataset, such as global or Atlantic
• Dataset-ProjectionType: projection type, such as cylindrical lat-lon
• Dataset-ProcessingLevel: data processing level
• DatasetPolicy-DataFormat: data format, e.g. HDF, NetCDF
• DatasetSource-Sensor-ShortName: short name of the sensor
Ordinal geographic attributes
• Dataset-TemporalResolution: temporal resolution of the dataset
• Dataset-TemporalRepeat: temporal repeat of the dataset
• Dataset-SpatialResolution: spatial resolution of the dataset
Descriptive attributes
• Dataset-description: describes the content of the dataset
Geographic metadata
42. 42
• Fixed number of values
• No intrinsic ordering
• Example: sensor-name takes values such as "AMSR-E", "MODIS", "AVHRR-3" and "WindSat"
categorical_var_sim(v_i, v_j) = |v_i ∩ v_j| / |v_i ∪ v_j|
Categorical similarity
43. 43
Ordinal attributes are similar to categorical attributes, but their values have a clear order, e.g. spatial resolution:
• Convert the values into ranks from 1 to R
• Normalize these ranks for the similarity calculation
ordinal_var_sim(v_i, v_j) = 1 - |norm_rank(v_i) - norm_rank(v_j)|
norm_rank(v_i) = (rank(v_i) + 1) / (R + 1)
Ordinal similarity
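A minimal sketch of both attribute similarities above (the attribute values and the resolution rank table are made up for illustration):

# Illustrative sketch (not MUDROD code) of the categorical and ordinal
# similarity formulas on the two slides above.
def categorical_sim(values_i, values_j):
    """Jaccard overlap between two sets of categorical attribute values."""
    values_i, values_j = set(values_i), set(values_j)
    if not values_i or not values_j:
        return 0.0
    return len(values_i & values_j) / len(values_i | values_j)

def ordinal_sim(value_i, value_j, ranks):
    """1 - |norm_rank(i) - norm_rank(j)| with norm_rank(v) = (rank(v) + 1) / (R + 1)."""
    R = len(ranks)
    norm = lambda v: (ranks[v] + 1) / (R + 1)
    return 1.0 - abs(norm(value_i) - norm(value_j))

print(categorical_sim({"MODIS", "AVHRR-3"}, {"MODIS", "WindSat"}))    # 1/3
resolution_ranks = {"1 km": 1, "4 km": 2, "9 km": 3, "25 km": 4}      # rank 1 = finest
print(round(ordinal_sim("4 km", "9 km", resolution_ranks), 2))        # 0.8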
44. 44
Original text
Aquarius Level 3 sea surface salinity (SSS) standard mapped image
data contains gridded 1 degree spatial resolution SSS averaged over
daily, 7 day, monthly, and seasonal time scales. This particular data
set is the seasonal climatology, Ascending sea surface salinity
product for version 4.0 of the Aquarius data set, which is the official
end of prime mission public data release from the AQUARIUS/SAC-D
mission. Only retrieved values for Ascending passes have been used
to create this product. The Aquarius instrument is onboard the
AQUARIUS/SAC-D satellite, a collaborative effort between NASA and
the Argentinian Space Agency Comision Nacional de Actividades
Espaciales (CONAE). The instrument consists of three radiometers in
push broom alignment at incidence angles of 29, 38, and 46 degrees
incidence angles relative to the shadow side of the orbit. Footprints
for the beams are: 76 km (along-track) x 94 km (cross-track), 84 km x
120 km and 96km x 156 km, yielding a total cross-track swath of 370
km. The radiometers measure brightness temperature at 1.413 GHz
in their respective horizontal and vertical polarizations (TH and TV). A
scatterometer operating at 1.26 GHz measures ocean backscatter in
each footprint that is used for surface roughness corrections in the
estimation of salinity. The scatterometer has an approximate 390km
swath.
Extracted terms
Radiometers Measure Brightness Temperature,AQUARIUS/SAC
Mission,Image Data,Broom Alignment,Resolution
SSS,AQUARIUS/SAC
Satellite,Scatterometer,Scatterometer,Aquarius Data,Argentinian
Space Agency Comision Nacional,Incidence Angles,Time
Scales,Actividades Espaciales,Cross-track Swath,Official
End,Aquarius Instrument,Shadow Side,Ascending Sea Surface
Salinity Product,Level,Level,Surface Roughness
Corrections,Data Release,Salinity,Density, Salinity,Density,
AQUARIUS,L3,SSS,SMIA,SEASONAL-CLIMATOLOGY,V4
Step 1: Phrase extraction
1. Extract term candidates from the metadata description with POS (part-of-speech) tagging
2. Use "occurrence" and "strength" to filter terms from the candidates:
"occurrence": the number of times a term occurs
"strength": the number of words in a term
Descriptive similarity
45. 45
Step 2: Represent each metadata record in the phrase vector space (a lower dimension than the word feature space), e.g.:
           Term 1  Term 2  Term 3  Term 4  Term 5  Term 6  Term 7  …  Term N
dataset 1    1       0       1       0       0       0       0     …    1
dataset 2    0       0       0       1       1       0       0     …    0
…
dataset k    1       0       1       0       0       0       0     …    1
Step 3: Calculate the cosine similarity between the phrase vectors
Metadata abstract semantic similarity
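A toy sketch of steps 2-3 (the phrase vocabulary and the per-dataset phrase sets below are invented; MUDROD derives them from the PO.DAAC abstracts):

import math

# Illustrative sketch: represent each metadata record as a binary vector over
# the extracted phrases and compare records with cosine similarity.
def phrase_vector(phrases, vocabulary):
    return [1 if term in phrases else 0 for term in vocabulary]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocabulary = ["sea surface salinity", "brightness temperature",
              "scatterometer", "sea surface temperature"]
dataset_a = {"sea surface salinity", "brightness temperature", "scatterometer"}
dataset_b = {"sea surface salinity", "scatterometer"}
print(round(cosine(phrase_vector(dataset_a, vocabulary),
                   phrase_vector(dataset_b, vocabulary)), 2))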
46. 46
Calculate metadata similarity based on session-level co-occurrence, e.g.:
         Session 1  Session 2  …  Session N
Data 1       1          0      …      1
Data 2       0          0      …      0
…
Data k       1          0      …      1
similarity(i, j) = N(i ∩ j) / (N(i) * N(j))
N(i): the number of sessions in which dataset i was viewed or downloaded
N(j): the number of sessions in which dataset j was viewed or downloaded
N(i ∩ j): the number of sessions in which both dataset i and dataset j were viewed or downloaded
Session based recommendation
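A short sketch of the session-level co-occurrence similarity above (the sessions are made-up sets of dataset ids viewed or downloaded together):

# Illustrative sketch of the session-based co-occurrence similarity.
def cooccurrence_sim(sessions, i, j):
    n_i = sum(1 for s in sessions if i in s)
    n_j = sum(1 for s in sessions if j in s)
    n_ij = sum(1 for s in sessions if i in s and j in s)
    return n_ij / (n_i * n_j) if n_i and n_j else 0.0

sessions = [
    {"dataset_A", "dataset_B"},
    {"dataset_A"},
    {"dataset_A", "dataset_B", "dataset_C"},
]
print(round(cooccurrence_sim(sessions, "dataset_A", "dataset_B"), 2))   # 2 / (3 * 2)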
47. 47
Descriptive similarity
• Pros: natural language processing methods can be adopted to find latent semantic relationships
• Cons: many datasets have nearly the same abstract with only a few words/values changed; it is hard to extract detailed attributes from the description
• Strategy: used as the basic method of the recommendation algorithm
Attribute similarity (spatiotemporal, ordinal, categorical)
• Pros: as structured data, geographic metadata have many variables
• Cons: variable values may be null or wrong; the quality depends on the weight assigned to each variable
• Strategy: used as a supplement to the semantic similarity
Session co-occurrence
• Pros: reflects users' preferences
• Cons: cold-start problem (newly published data have no usage data)
• Strategy: used to fine-tune the recommendation list
Recommend(i) = W_ss * DescriptiveSimilarity(i) + Σ W_cv * CategoricalSimilarity(i) + Σ W_ov * OrdinalSimilarity(i) + W_stv * SpatioTemporalSimilarity(i) + W_so * SessionSimilarity(i)
Hybrid recommendation
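A toy sketch of the weighted combination above (the weights and the per-component scores for a candidate dataset are invented; MUDROD tunes its own weights against PO.DAAC data):

# Illustrative sketch of the weighted hybrid recommendation score.
def hybrid_score(components, weights):
    return sum(weights[name] * value for name, value in components.items())

weights = {"descriptive": 0.4, "categorical": 0.15, "ordinal": 0.15,
           "spatiotemporal": 0.15, "session": 0.15}
candidate = {"descriptive": 0.82, "categorical": 0.33, "ordinal": 0.80,
             "spatiotemporal": 0.50, "session": 0.33}
print(round(hybrid_score(candidate, weights), 3))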
48. 48
Quantitative Evaluation
The hybrid similarity outperforms the individual similarities since it integrates metadata attributes and user preference.
Y. Li, Jiang, Y., C. Yang, K. Liu, E. M. Armstrong, T. Huang, D. Moroni & L. J. McGibbney (2017) A Geospatial Data Recommender System based on Metadata and User Behaviour (revision)
49. 49
Plan Forward
• Add more features (e.g., temporal similarity)
• Create training data from web logs for RankSVM
• Develop a query understanding module to better interpret user’s search
intent (e.g. “ocean wind level 3” -> “ocean wind” AND “level 3”)
• Support Solr
• Support near real-time data ingestion to dynamically update knowledge
base
• Integration with DOMS and OceanXtremes to support an ocean science
analytics center
• Leverage advanced computing techniques to speed up the process
50. 50
What has been done
• Offline/sequential log analysis
– Time and computation intensive
• Ranking
– Use manually labeled data as training data
– RankSVM
• Offline recommendation
– Cannot respond to what user clicked in the last few minutes
• Offline Similarity calculation
– Cannot be updated in real-time
– Cannot parse multi-concept queries
52. 52
What remains to be done
• Support for Solr
• Add faceted search to the UI and backend
• Support for real-time ranking, recommendation,
similarity calculation
• Support for DOMS and OceanXstreams
55. 55
Proposed timeline
[Timeline chart: columns Sept.-Oct., Nov.-Dec., Jan.-Feb., March-April, May-June, July-Aug.; rows for Support for Solr, Add faceted search, Support for DOMS & OceanXstreams, Real-time ranking, Real-time recommendation, and Real-time similarity.]
56. 56
Research points
• Semantic similarity
– Online matrix analysis
– Query parsing
• Ranking
– Generate training data from user clicks
– Online deep learning algorithm
• Near real-time log ingesting
– Spark Streaming APIs
– Real-time sessionization
• High performance log mining
– Use parallel computing to speed up
• Recommendation
– Session-based real-time recommendation
– Personalized recommendation
57. 57
Summary
• Log mining enables a data portal to integrate implicit user preferences
• Word similarity retrieved by data mining tasks expands any given
query to improve search recall and precision.
• The rich set of ranking features and the ML algorithm provide
substantial advantages over using other ranking methods
• The recommendation algorithm can discover latent data relevancy
• The proposed architecture enables the loosely coupled software
structure of a data portal and avoids the cost of replacing the existing
system
58. 58
Presentation Contents
• Background and Objectives
• TRL assessment
• Summary of Accomplishments and Plans Forward
• Schedule
• Financials
• Publications
• List of Acronyms
64. 64
Publications
Journal / Conference Papers (over 10 peer review publications)
1. Jiang, Y., Y. Li, C. Yang, E. M. Armstrong, T. Huang & D. Moroni (2016) Reconstructing Sessions from Data Discovery and Access
Logs to Build a Semantic Knowledge Base for Improving Data Discovery. ISPRS International Journal of Geo-Information, 5, 54.
2. Y. Li, Jiang, Y., C. Yang, K. Liu, E. M. Armstrong, T. Huang & D. Moroni (2016) Leverage cloud computing to improve data access
log mining. IEEE Oceans 2016.
3. Jiang, Y., Y. Li, C. Yang, K. Liu, E. M. Armstrong, T. Huang & D. Moroni (2017) A Comprehensive Approach to Determining the
Linkage Weights among Geospatial Vocabularies - An Example with Oceanographic Data Discovery. International Journal of
Geographical Information Science (minor revision)
4. Jiang, Y., Y. Li, C. Yang, K. Liu, E. M. Armstrong, T. Huang, D. Moroni & L. J. McGibbney (2017) Towards intelligent geospatial
discovery: a machine learning ranking framework. International Journal of Digital Earth (minor revision)
5. Y. Li, Jiang, Y., C. Yang, K. Liu, E. M. Armstrong, T. Huang, D. Moroni & L. J. McGibbney (2017) A Geospatial Data Recommender
System based on Metadata and User Behaviour (revision)
6. Jiang, Y., Y. Li, C. Yang, K. Liu, E. M. Armstrong, T. Huang, D. Moroni & L. J. McGibbney (2017) A smart web-based data discovery
system for ocean sciences. (ongoing)
7. Yang, C., et al., 2017. Big Data and cloud computing: innovation opportunities and challenges. International Journal of
Digital Earth, 10(1), pp.13-53. (the 2nd most read paper of IJDE in its decadal history)
8. Yang, C.P., Yu, M., Xu, M., Jiang, Y., Qin, H., Li, Y., Bambacus, M., Leung, R.Y., Barbee, B.W., Nuth, J.A. and Seery, B., 2017,
March. An architecture for mitigating near earth object's impact to the earth. In Aerospace Conference, 2017 IEEE (pp. 1-13). IEEE
9. Yu, M. and Yang, C., 2016. Improving the Non-Hydrostatic Numerical Dust Model by Integrating Soil Moisture and Greenness
Vegetation Fraction Data with Different Spatiotemporal Resolutions. PloS one, 11(12), p.e0165616.
10. Vance, T.C., Merati, N., Yang, C. and Yuan, M., 2016, September. Cloud computing for ocean and atmospheric science. In OCEANS
2016 MTS/IEEE Monterey (pp. 1-4). IEEE. (also as a book with the same name)
11. Liu, K., Yang, C., Li, W., Gui, Z., Xu, C. and Xia, J., 2016. Using semantic search and knowledge reasoning to improve the discovery
of earth science records: an example with the ESIP semantic testbed. In Mobile Computing and Wireless Networks: Concepts,
Methodologies, Tools, and Applications (pp. 1375-1389). IGI Global.
12. Xia, J., Yang, C., Liu, K., Gui, Z., Li, Z., Huang, Q. and Li, R., 2015. Adopting cloud computing to optimize spatial web portals for
better performance to support Digital Earth and other global geospatial initiatives. International Journal of Digital Earth, 8(6), pp.451-
475.
65. 65
Students Cultivated
• Ph.D. Dissertations focused on MUDROD: Yongyao Jiang and
Yun Li (GMU), will complete in phase 2 (OceanWorks)
• Graduate students who worked on tasks: Fei Hu, Kai Liu (graduated in
summer 2017), Manzhu Yu (graduated in summer 2017), Han Qin
(graduated), Mengchao Xu, Ishan Shams (GMU)
• Undergraduate students: Joseph Michael/Univ. of Richmond
(HBCU), Alena/Virginia Tech (Female)
• Other (Approximately 20 presentations at AGU, ESIP, AAG,
GeoInformatics, and other national and international conferences)
66. 66
Acronyms
List of Acronyms
• MUDROD Mining and Utilizing Dataset Relevancy from
Oceanographic Dataset
• HTTP Hypertext Transfer Protocol
• FTP File Transfer Protocol
• SWEET Semantic web for earth and environmental
terminology
• PO.DAAC Physical Oceanography Distributed Active
Archive Center
• LSA Latent Semantic Analysis
• SVM Support vector machine
• NDCG Normalized Discounted Cumulative Gain
67. 67
1. NASA AIST Program (NNX15AM85G)
2. PO.DAAC SWEET Ontology Team (Initially funded by
ESTO)
3. Hydrology DAAC Rahul Ramachandran (providing the
earlier version of NOESIS)
4. ESDIS for providing testing logs of CMR
5. All team members at JPL and GMU
Acknowledgements