Anatomy of a Semantic Virus
Peyman Nasirifard
Digital Enterprise Research Institute
National University of Ireland, Galway
IDA Business Park, Lower Dangan, Galway, Ireland
peyman.nasirifard@deri.org
Abstract. In this position paper, I discuss a piece of malicious automated software that an individual or a group of users could use to submit valid but random, noisy RDF data, based on predefined schemas/ontologies, to Semantic Web search engines. The result would undermine the utility of semantic search. I did not implement the whole virus, but I checked its feasibility. The open question is whether nature-inspired reasoning can address such problems, which relate more to information quality.
1 Introduction and Overview
Semantic Web advocates encourage other communities to generate, use and share RDF statements based on predefined schemas and ontologies in order to ease interoperability among applications by making knowledge machine-processable. The emergence of semantic-based applications (e.g. semantic digital libraries such as JeromeDL, http://www.jeromedl.org/; SIOC-enabled shared workspaces such as BSCW, http://www.bscw.de/; semantic URL-shortening tools, cf. http://bit.ly/) as well as APIs (e.g. OpenCalais, http://www.opencalais.com/) is good evidence of cooperation among application developers around the familiar subject-predicate-object notion. However, speaking with the same alphabet but in various dialects brings ambiguity-related problems, which have been addressed by other researchers and are out of the scope of this paper.
Searching, indexing, querying and reasoning over (publicly) available RDF data provide motivating use cases for Semantic Web search engines. Their crawlers traverse the Web and index the RDF statements (triples) they discover for further reasoning and querying. Some of them also reach into the deep Web by allowing users to submit links to their RDF data directly.
Since the birth of computer software, and of operating systems in particular, clever developers and engineers have exploited security flaws to build software viruses, which in some cases have brought disaster to governments, businesses and individuals (see, e.g., http://www.landfield.com/isn/mail-archive/2000/May/0067.html).
In this paper, I describe a potential piece of software that a malicious user, or a group of cooperating malicious users, could use to undermine the utility of Semantic Web search engines. In brief, the virus automatically generates random noisy knowledge that will be indexed by these engines. My main motivation for presenting this idea is to identify research challenges in the trust layer of the well-known Semantic Web tower.
It is worth mentioning that the title of this paper, Anatomy of a Semantic Virus, is perhaps misleading. I am not going to describe the anatomy of a virus that is itself built on the Semantic Web (a somewhat odd notion, since common computer viruses do not communicate with each other and interoperability among viruses is not well defined); rather, I focus on a distributed virus that targets Semantic Web data.
The structure of this position paper is as follows. In the next section, I describe the problem and a scenario that demonstrates how the potential virus may operate. In Section 3, I discuss potential directions for finding solutions. Finally, I conclude this short position paper.
2 Problem
Semantic Web search engines (e.g. SWSE, http://swse.org/, and Swoogle, http://swoogle.umbc.edu/) crawl and index new Semantic Web documents containing RDF statements. Some publicly available services (e.g. Ping The Semantic Web (PTSW), http://pingthesemanticweb.com/) enable end users to submit and announce the availability of their Semantic Web data. These submissions can later be fetched by Semantic Web search engines for indexing and further reasoning.
The main module of the potential virus is a piece of code that receives several triples as input and generates several new triples based on those inputs and on predefined schemas, such that the generated RDF triples are syntactically correct but semantically wrong (fake). Figure 1 shows a simple example. As illustrated in the figure, the input consists of two RDF triples: "Galway is part of Ireland" and "London is part of England". The RDF schema has already defined that Galway and London are instances of the concept City, whereas Ireland and England are Countries. In this example, the virus exchanges the object (or subject) parts of the triples, taking into account the fact that both objects (or subjects) are instances of the same class (Country or City). The generated result is "Galway is part of England" and "London is part of Ireland": both are valid RDF statements, but wrong (fake) knowledge. Note that the whole process is carried out by malicious software; it is not a form of supervised editing and has no human in the loop.
Fig. 1. Main module of the virus
The number of fake clones that can be generated corresponds to the number of possible recombinations of instances of the various concepts within the RDF document.
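To make the mechanism concrete, below is a minimal sketch of such a swapping module in Python using rdflib. It is not the author's implementation: the ex namespace, the partOf predicate and the helper name swap_objects_within_class are hypothetical, and the sketch assumes every swappable object carries an explicit rdf:type.

```python
# Minimal sketch of the swapping module (illustrative only; not the paper's code).
# Assumes rdflib is installed; the "ex" namespace and helper name are hypothetical.
from rdflib import Graph, Namespace, RDF, URIRef

EX = Namespace("http://example.org/")

def swap_objects_within_class(g: Graph, predicate: URIRef, obj_class: URIRef) -> Graph:
    """Return a copy of g in which the objects of `predicate` that are instances
    of `obj_class` are rotated among the subjects, producing triples that remain
    schema-valid but are factually wrong."""
    fake = Graph()
    for triple in g:
        fake.add(triple)                     # start from the original statements
    pairs = [(s, o) for s, _, o in g.triples((None, predicate, None))
             if (o, RDF.type, obj_class) in g]
    if len(pairs) < 2:
        return fake                          # nothing to swap
    objects = [o for _, o in pairs]
    rotated = objects[1:] + objects[:1]      # shift every object by one position
    for (s, old_o), new_o in zip(pairs, rotated):
        fake.remove((s, predicate, old_o))
        fake.add((s, predicate, new_o))
    return fake

# Toy data reproducing the Galway/London example of Figure 1.
g = Graph()
g.bind("ex", EX)
for city, country in [("Galway", "Ireland"), ("London", "England")]:
    g.add((EX[city], RDF.type, EX.City))
    g.add((EX[country], RDF.type, EX.Country))
    g.add((EX[city], EX.partOf, EX[country]))

fake = swap_objects_within_class(g, EX.partOf, EX.Country)
print(fake.serialize(format="turtle"))       # now claims Galway is part of England
```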
Note that the same problem already exists on the ordinary Web: somebody may publish fake knowledge on common Web pages, and there are plenty of tricks for obtaining high rankings in search engines. However, the Semantic Web is not growing as fast as the Web (cf. http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html and http://sw.deri.org/2007/06/ontologymap/) [1], so such malicious activities are feasible and can be carried out using the RDF documents that are already available. Moreover, the Semantic Web's main promise is to make knowledge machine-processable, whereas the unstructured data on the Web is aimed mainly at humans, and current machines clearly do not have the wisdom and common sense of humans.
On the other hand, someone may claim that, given the success of collaborative information-gathering platforms like Wikipedia (http://www.wikipedia.org/), the motivation for producing wrong knowledge in RDF is weak. Yet although we all benefit from platforms like Wikipedia, we rarely cite its articles in scientific papers, presumably because the authors of such articles are unknown and we cannot really trust the content. The same applies to RDF data: if we gather a large amount of Semantic Web data in RDF, can we really trust it? How do we exclude potential fake triples from the knowledge base?
2.1 Scenario
Here I present a simple scenario describing a possible attack in which a virus manipulates RDF data. As noted above, publicly available services like PTSW provide machine-processable feeds listing recently added or updated RDF documents. These feeds can be consumed by malicious software; the virus may even use Semantic Web search engines themselves to find RDF data on the net.
Fig. 2. Possible attack
In our scenario, the malicious software parses the output feed of PTSW to obtain an index of recently published RDF files. It then fetches the RDF statements from the net and changes them in such a way that the generated RDF remains valid. The result is submitted back to PTSW as an updated (or new) RDF document after a random time interval, from a random IP address (at the TCP/IP level) and from a random hosting provider, and is subsequently indexed by Semantic Web search engines. The malicious software may even submit the content directly to semantic search engines, if they offer such functionality. Figure 2 gives an overall view of the possible attack carried out via the PTSW service.
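As a rough illustration of the harvesting half of this pipeline, the sketch below polls a ping feed for recently announced documents and fetches each one. The feed URL and its RSS-like format (plain <link> elements) are assumptions made for the example, PTSW itself is no longer online, and the mutation and resubmission steps are only indicated by comments.

```python
# Sketch of the feed-harvesting step of the scenario (illustrative only).
# The feed URL and its format are assumptions; PTSW no longer operates.
import urllib.request
import xml.etree.ElementTree as ET
from rdflib import Graph

FEED_URL = "http://pingthesemanticweb.com/export/"  # hypothetical export feed

def recently_pinged_documents(feed_url: str) -> list:
    """Return the URLs of RDF documents announced in the ping feed."""
    with urllib.request.urlopen(feed_url) as response:
        tree = ET.parse(response)
    return [element.text for element in tree.iter("link") if element.text]

def harvest(feed_url: str = FEED_URL) -> None:
    for doc_url in recently_pinged_documents(feed_url):
        g = Graph()
        g.parse(doc_url)  # fetch and parse the original RDF document
        # At this point the swapping module from Section 2 would rewrite g
        # into a schema-valid but fake clone, which the attacker would re-host
        # and ping back to PTSW after a random delay from a fresh address.
        print(f"fetched {len(g)} triples from {doc_url}")
```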
The main problem arises when a group of people, or even a single individual operating at scale, runs many instances of the malicious software and generates fake RDF triples that are submitted to, and indexed by, Semantic Web crawlers. If search engines cannot cope with this situation, the result will undermine the utility of semantic search.
3 Discussion
Digital signatures form a vertical layer in the Semantic Web tower, and third parties exist that issue certificates to authorized users. However, while digital signatures, certificates or other such means can address the authentication and authorization aspects of RDF data, they cannot address the quality aspects of the information (accuracy, validity, etc.). Moreover, we cannot realistically restrict use of the Semantic Web to authenticated, authorized and/or certified parties only; doing so would drastically limit its adoption.
It is important to note that the source of a piece of data is a major factor in its validity and accuracy, two important dimensions of information quality. The virus cannot change the origin of an RDF document, but it can edit the document to contain fake statements. Since the infected RDF is still valid with respect to its schema, the manipulation cannot simply be tracked. One research problem that arises here is how to examine a set of cloned graphs in order to identify the original one, and perhaps log the cloned versions as illegitimate graphs. A naive approach would be to use a trusted knowledge body (a repository of universal common-sense facts) to verify the material generated by others, but this has its own limitations: we do not have a truly comprehensive knowledge base of facts about the whole universe. Nature-inspired reasoning, on the other hand, tries to borrow from other domains to address the complex reasoning challenges of the Semantic Web. The open question is whether nature-inspired reasoning can be useful here to validate the quality aspects of data.
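To make the "trusted knowledge body" idea slightly more concrete, the sketch below checks incoming triples against a small whitelist of verified facts and flags contradictions. The example facts, the ex namespace and the assumption that partOf is single-valued are purely illustrative; a real validator would need a far richer fact base and provenance handling.

```python
# Naive sketch of a trusted-knowledge-body check (illustrative assumptions only).
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")

# Hypothetical whitelist of verified facts for predicates assumed single-valued.
TRUSTED_FACTS = {
    (EX.Galway, EX.partOf): EX.Ireland,
    (EX.London, EX.partOf): EX.England,
}

def suspicious_triples(g: Graph) -> list:
    """Return triples whose (subject, predicate) is covered by the trusted
    knowledge body but whose object contradicts the verified value."""
    flagged = []
    for s, p, o in g:
        expected = TRUSTED_FACTS.get((s, p))
        if expected is not None and o != expected:
            flagged.append((s, p, o))
    return flagged

# Example: the fake clone from Section 2 would be flagged here.
clone = Graph()
clone.add((EX.Galway, EX.partOf, EX.England))
for triple in suspicious_triples(clone):
    print("possible fake:", triple)
```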
In my view, this problem and its potential solutions may also be of commercial interest: for example, building a trusted knowledge party that validates RDF-based knowledge generated by the public, or giving designated parties the authority to evaluate, semi-automatically, the RDF-based knowledge generated by others.
4 Conclusion
In this position paper, I briefly presented a method by which a piece of automated software could maliciously target Semantic Web data in order to inject large amounts of noise into the knowledge base. I argued that the results returned by Semantic Web search engines may not be fully trustworthy, since they may contain fake, noisy knowledge; machines cannot really benefit from such results unless we are certain that the knowledge in the underlying repositories is true and reliable.
My purpose in presenting this idea is to point out research problems for which, after reviewing the literature and discussing the issue with senior Semantic Web researchers, I am not aware of solutions. Generating meaningful clones of a given graph based on a schema (the virus) and identifying the original among a set of cloned graphs (the anti-virus) are research directions that could be explored further. I did not implement the whole virus, but I checked its feasibility using PTSW and a set of fake triples.
Acknowledgments. I thank Vassilios Peristeras and Stefan Decker for their support. This work is supported by the Ecospace (Integrated Project on eProfessional Collaboration Space) project: FP6-IST-5-35208.
References
1. Ding, L., Finin, T.: Characterizing the Semantic Web on the Web. Proceedings of
the 5th International Semantic Web Conference, 2006.