Entities, Graphs, and Crowdsourcing for better Web Search – eXascale Infolab
Gianluca Demartini presented work on using entities, graphs, and crowdsourcing for better web search. He discussed ZenCrowd, a system that uses crowdsourcing to perform entity linking and disambiguation on web pages. ZenCrowd combines algorithmic and manual linking by turning the manual part into crowdsourcing tasks and assessing workers with a probabilistic reasoning framework. He also discussed entity factor graphs that model workers, links, clicks, and constraints within this probabilistic framework, as well as disambiguation for scientific literature. The system was experimentally evaluated on news articles from several sources, linked to entities from knowledge bases such as Freebase and DBpedia.
1) Entity-centric data management stores information at the entity level and integrates information by interlinking entities. This provides advantages over keyword-based and relational database approaches.
2) The XI Pipeline extracts mentions from text and performs named entity recognition, entity linking, and entity typing to associate entities with text.
3) Approaches like ZenCrowd and TRank leverage both algorithms and human computation through crowdsourcing to improve entity linking and fine-grained entity typing (a toy sketch of combining the two signals follows below).
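To make the combination of algorithmic and crowd evidence concrete, here is a minimal, self-contained Python sketch. It is not ZenCrowd's actual factor-graph model: the weighting scheme, reliability scores, and function names are invented for illustration.

```python
# Illustrative sketch only: combine an algorithmic linker's confidence with
# crowd votes, weighting each worker by an estimated reliability score.
# ZenCrowd itself uses a probabilistic factor-graph model; this is a much
# simpler stand-in to show the idea of blending the two signals.

def combine_linking_evidence(algo_confidence, crowd_votes, worker_reliability,
                             algo_weight=0.5):
    """Return a score in [0, 1] for linking a mention to one candidate entity.

    algo_confidence    -- confidence of the automatic linker for this candidate
    crowd_votes        -- dict worker_id -> True/False (does the link hold?)
    worker_reliability -- dict worker_id -> reliability in [0, 1]
    """
    if not crowd_votes:
        return algo_confidence
    weighted_yes = sum(worker_reliability.get(w, 0.5)
                       for w, vote in crowd_votes.items() if vote)
    total_weight = sum(worker_reliability.get(w, 0.5) for w in crowd_votes)
    crowd_score = weighted_yes / total_weight if total_weight else 0.5
    return algo_weight * algo_confidence + (1 - algo_weight) * crowd_score


# Example: two reliable workers confirm the link, one unreliable worker rejects it.
score = combine_linking_evidence(
    algo_confidence=0.6,
    crowd_votes={"w1": True, "w2": True, "w3": False},
    worker_reliability={"w1": 0.9, "w2": 0.8, "w3": 0.3},
)
print(round(score, 3))
```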
Ontology-Based Word Sense Disambiguation for Scientific Literature – eXascale Infolab
This document presents an approach to ontology-based word sense disambiguation for scientific literature. It leverages the structure of community-based ontologies to improve sense identification. The approach represents concepts as context vectors derived from their relations in documents and ontologies, and evaluates graph-based techniques using the minimum distance between concepts, the shortest path between concepts, and neighboring concepts in the ontology. Combining these graph-based models with context vectors achieves the best precision for word sense disambiguation on two scientific datasets.
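The following toy sketch illustrates the general idea of combining context-vector similarity with a graph-based signal (shortest-path proximity in the ontology). The ontology, vectors, and weighting are invented and are much simpler than the paper's actual models.

```python
# Illustrative sketch: rank candidate senses of an ambiguous term by combining
# (a) cosine similarity between the term's context vector and each sense's
# context vector, and (b) graph proximity of the sense to concepts already
# identified in the document. The ontology, vectors, and weights are toy values.
import math
from collections import deque

ontology = {  # toy concept graph (adjacency lists)
    "tree_data_structure": ["graph", "algorithm"],
    "tree_plant": ["forest", "botany"],
    "graph": ["algorithm"], "algorithm": [], "forest": [], "botany": [],
}

def shortest_path(graph, src, dst):
    """BFS shortest-path length; a large value if unreachable."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nb in graph.get(node, []):
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return 10  # unreachable: treat as far away

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def score_sense(sense, context_vec, sense_vecs, doc_concepts, alpha=0.6):
    vec_sim = cosine(context_vec, sense_vecs[sense])
    # Convert graph distance to a similarity in (0, 1].
    graph_sim = sum(1.0 / (1 + shortest_path(ontology, sense, c))
                    for c in doc_concepts) / max(len(doc_concepts), 1)
    return alpha * vec_sim + (1 - alpha) * graph_sim

sense_vecs = {"tree_data_structure": [0.9, 0.1], "tree_plant": [0.2, 0.8]}
doc_concepts = ["algorithm", "graph"]          # concepts already disambiguated
context_vec = [0.8, 0.3]                       # context of the mention "tree"
best = max(sense_vecs, key=lambda s: score_sense(s, context_vec, sense_vecs, doc_concepts))
print(best)  # expected: tree_data_structure
```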
This document presents SANAPHOR, an ontology-based coreference resolution system that improves upon existing approaches by leveraging semantic information. It first links entities in document clusters to semantic types and ontologies. It then splits or merges clusters based on these semantic relationships. The system was evaluated on the CoNLL-2012 dataset, where it improved coreference resolution performance over the baseline Stanford system, particularly for noun clusters. By utilizing semantic knowledge, SANAPHOR demonstrates the benefits of enhancing syntactic coreference resolution with an additional semantic layer.
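As a rough illustration of the splitting step only, the sketch below divides a coreference cluster whose mentions resolve to incompatible semantic types. The type assignments and the heuristic for untyped mentions are invented; SANAPHOR's actual logic over linked types and ontologies is considerably richer.

```python
# Illustrative sketch of a type-based split: if mentions inside one coreference
# cluster resolve to incompatible semantic types, split the cluster by type.
# The type assignments below are hand-written stand-ins for an entity linker.
from collections import defaultdict

def split_cluster_by_type(cluster, mention_type):
    """cluster: list of mention strings; mention_type: mention -> type or None."""
    groups = defaultdict(list)
    for mention in cluster:
        groups[mention_type.get(mention)].append(mention)
    typed = [g for t, g in groups.items() if t is not None]
    untyped = groups.get(None, [])
    if len(typed) <= 1:
        return [cluster]                  # types agree (or are unknown): keep as-is
    # Attach untyped mentions to the largest typed group (simple heuristic).
    typed.sort(key=len, reverse=True)
    typed[0].extend(untyped)
    return typed

mention_type = {"Paris": "City", "the city": "City", "Paris Hilton": "Person"}
cluster = ["Paris", "the city", "Paris Hilton", "she"]
print(split_cluster_by_type(cluster, mention_type))
# -> [['Paris', 'the city', 'she'], ['Paris Hilton']]
```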
Thomas Heinis is a post-doctoral researcher in the database group at EPFL. His research focuses on scalable data management algorithms for large-scale scientific applications. Thomas is a part of the "Human Brain Project" and currently works with neuroscientists to develop the data management infrastructure necessary for scaling up brain simulations. Prior to joining EPFL, Thomas completed his Ph.D. in the Systems Group at ETH Zurich, where he pursued research in workflow execution systems as well as data provenance.
Fixing the Domain and Range of Properties in Linked Data by Context Disambiguation – eXascale Infolab
This document proposes three methods - LEXT, REXT, and LERIXT - for disambiguating the domain and range of properties in linked data by using context information. LEXT uses the type of subject resources, REXT uses the type of object resources, and LERIXT uses both. The methods were evaluated against expert judgments and achieved up to 96.5% precision for LEXT and 91.4% for REXT. LERIXT generated too many new sub-properties.
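A toy version of the underlying idea, assuming resources already carry type assertions: infer a property's domain from the most frequent subject type (LEXT-style) and its range from the most frequent object type (REXT-style). The triples, type map, and function names below are invented.

```python
# Illustrative sketch: infer a property's domain (from subject types) and range
# (from object types) as the most frequent type observed among the triples that
# use the property. Data below is invented.
from collections import Counter

triples = [
    ("Inception", "director", "Christopher_Nolan"),
    ("Memento", "director", "Christopher_Nolan"),
    ("Alien", "director", "Ridley_Scott"),
]
rdf_type = {
    "Inception": "Film", "Memento": "Film", "Alien": "Film",
    "Christopher_Nolan": "Person", "Ridley_Scott": "Person",
}

def infer_domain_and_range(prop, triples, rdf_type):
    subj_types = Counter(rdf_type[s] for s, p, o in triples if p == prop and s in rdf_type)
    obj_types = Counter(rdf_type[o] for s, p, o in triples if p == prop and o in rdf_type)
    domain = subj_types.most_common(1)[0][0] if subj_types else None
    rng = obj_types.most_common(1)[0][0] if obj_types else None
    return domain, rng

print(infer_domain_and_range("director", triples, rdf_type))  # ('Film', 'Person')
```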
Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction – eXascale Infolab
1) The document presents HINGE, a new method for embedding hyper-relational knowledge graphs that aims to better capture information from facts containing multiple relations and entities.
2) HINGE uses a CNN to learn representations from base triplets and their associated key-value pairs to characterize the plausibility of facts (a much simplified, CNN-free scoring sketch follows this list).
3) An evaluation on link prediction tasks shows HINGE outperforms baselines and demonstrates that the triplet structure encodes essential information, while other representations discard important information.
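The sketch below is a drastically simplified stand-in for hyper-relational fact scoring, with random toy embeddings and a TransE-style distance instead of HINGE's learned CNN; it only shows how a base triplet and its key-value qualifiers might be scored together.

```python
# Drastically simplified stand-in for hyper-relational fact scoring (no CNN):
# score the base triplet (h, r, t), score each qualifier (key, value) against it,
# and combine the two. Embeddings are random toy vectors; HINGE learns them.
import numpy as np

gen = np.random.default_rng(0)
DIM = 8
emb = {}  # shared embedding table for entities, relations, and qualifier keys

def vec(name):
    if name not in emb:
        emb[name] = gen.normal(size=DIM)
    return emb[name]

def triplet_score(h, r, t):
    # TransE-style plausibility: smaller distance means more plausible.
    return -np.linalg.norm(vec(h) + vec(r) - vec(t))

def fact_score(h, r, t, qualifiers):
    """qualifiers: list of (key, value) pairs attached to the base triplet."""
    base = triplet_score(h, r, t)
    if not qualifiers:
        return base
    # Each qualifier is scored against the base triplet's relation and averaged in.
    qual = np.mean([triplet_score(k, r, v) for k, v in qualifiers])
    return 0.7 * base + 0.3 * qual

score = fact_score("marie_curie", "educated_at", "university_of_paris",
                   [("academic_degree", "phd"), ("end_year", "1903")])
print(round(float(score), 3))
```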
Representation Learning on Graphs with Complex Structures
Invited talk, Deep Learning for Graphs and Structured Data Embedding Workshop
WWW2019, San Francisco, May 13, 2019
A Force-Directed Approach for Offline GPS Trajectory Map Matching – eXascale Infolab
SIGSPATIAL 2018 paper
A Force-Directed Approach for Offline GPS Trajectory Map Matching
Efstratios Rappos (University of Applied Sciences of Western Switzerland (HES-SO)),
Stephan Robert (University of Applied Sciences of Western Switzerland (HES-SO)),
Philippe Cudré-Mauroux (University of Fribourg)
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms with Concept Drift – eXascale Infolab
This document proposes HistoSketch, a method for sketching streaming histograms that preserves similarity and adapts to concept drift. It works by:
1) Generating weighted samples from histograms such that the probability two sketches match equals histogram similarity.
2) Incrementally updating sketches using a weight decay factor to forget older data and adapt to drift over time (steps 1 and 2 are sketched in code after this list).
3) Evaluating HistoSketch on classification tasks involving synthetic and real-world streaming data, finding it approximates histogram similarity well using small, fixed-size sketches while adapting rapidly to drift.
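A condensed sketch of the two ingredients named above, using Ioffe-style consistent weighted sampling for the sketch slots and a simple exponential decay of histogram counts; the parameter choices and HistoSketch's actual incremental-update logic are not reproduced here.

```python
# Simplified sketch of (1) a consistent weighted sample per sketch slot, so that
# the probability two sketches agree on a slot approximates the histograms'
# generalized Jaccard similarity, and (2) exponential decay of histogram counts
# to gradually forget old data. Values and parameters are illustrative only.
import math
import random

def cws_slot(histogram, slot):
    """Return the sampled element for one sketch slot (Ioffe-style CWS)."""
    best, best_a = None, float("inf")
    for elem, weight in histogram.items():
        if weight <= 0:
            continue
        rng = random.Random(f"{elem}:{slot}")   # consistent per (element, slot)
        r = rng.gammavariate(2, 1)
        c = rng.gammavariate(2, 1)
        beta = rng.random()
        t = math.floor(math.log(weight) / r + beta)
        y = math.exp(r * (t - beta))
        a = c / (y * math.exp(r))
        if a < best_a:
            best, best_a = elem, a
    return best

def sketch(histogram, size=64):
    return [cws_slot(histogram, j) for j in range(size)]

def decay(histogram, factor=0.9):
    """Exponentially down-weight existing counts before adding new observations."""
    return {elem: weight * factor for elem, weight in histogram.items()}

h1 = {"a": 5.0, "b": 1.0, "c": 2.0}
h2 = {"a": 5.0, "b": 1.0, "c": 1.0}
s1, s2 = sketch(h1), sketch(h2)
similarity = sum(x == y for x, y in zip(s1, s2)) / len(s1)
print(round(similarity, 2))                     # approximates generalized Jaccard
h1 = decay(h1)                                  # forget a little ...
h1["d"] = h1.get("d", 0.0) + 1.0                # ... then absorb new observations
```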
This document presents SwissLink, a high-precision context-free entity linking system. It extracts unambiguous surface forms (labels) from knowledge bases like DBpedia and Wikipedia to link entity mentions without context. It catalogs the surface forms, removes ambiguous ones using ratio and percentile methods, and performs fast string matching to link mentions. Evaluation on 30 Wikipedia articles shows the percentile-ratio method achieves over 95% precision and 45% recall, balancing precision and recall.
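A minimal sketch of the catalog-and-match idea, assuming (label, entity, count) statistics are available: keep only surface forms dominated by a single entity and link mentions by exact matching. The thresholds, data, and function names are invented, and only the ratio filter is shown.

```python
# Illustrative sketch: build a surface-form catalog, keep only labels that point
# to one entity almost all of the time (ratio filter), and link mentions by
# exact string matching. SwissLink also applies a percentile-based filter.
from collections import defaultdict

label_counts = [            # (surface form, entity, how often the pair occurs)
    ("Barack Obama", "dbpedia:Barack_Obama", 1200),
    ("Obama", "dbpedia:Barack_Obama", 900),
    ("Obama", "dbpedia:Obama,_Fukui", 60),
    ("Zurich", "dbpedia:Zurich", 800),
]

def build_catalog(label_counts, min_ratio=0.95):
    by_label = defaultdict(dict)
    for label, entity, count in label_counts:
        by_label[label][entity] = by_label[label].get(entity, 0) + count
    catalog = {}
    for label, entities in by_label.items():
        total = sum(entities.values())
        entity, count = max(entities.items(), key=lambda kv: kv[1])
        if count / total >= min_ratio:          # unambiguous enough: keep it
            catalog[label] = entity
    return catalog

def link(text, catalog):
    return [(label, entity) for label, entity in catalog.items() if label in text]

catalog = build_catalog(label_counts)
print(link("Barack Obama visited Zurich.", catalog))
# "Obama" alone is dropped as ambiguous (900/960 < 0.95); the others are kept.
```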
The document proposes a novel crowdsourcing system architecture and scheduling algorithm to address job starvation in multi-tenant crowd-powered systems. The architecture introduces HIT-Bundles to group heterogeneous tasks and control task serving. The Worker Conscious Fair Scheduling algorithm balances fairness and priority while minimizing worker context switching between tasks. Experiments on Amazon Mechanical Turk show the approach increases throughput over baseline schedulers and adapts to varying workforce levels and job priorities.
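The sketch below illustrates the scheduling trade-off in simplified form: each candidate job is scored on priority, on how under-served it is (fairness), and on whether the worker just worked on it (to limit context switching). The weights and job structure are invented and do not reproduce the paper's Worker Conscious Fair Scheduling algorithm.

```python
# Illustrative scheduling sketch: pick the next job for a worker by balancing
# priority, fairness (under-served jobs score higher), and stickiness (keep the
# worker on the job they last worked on). Weights and data are invented.

def pick_next_job(jobs, worker_last_job, served,
                  w_priority=1.0, w_fair=1.0, w_stick=0.5):
    """jobs: dict job_id -> {"priority": float, "remaining": int}.
    served: dict job_id -> tasks already served for that job."""
    def score(job_id):
        job = jobs[job_id]
        if job["remaining"] == 0:
            return float("-inf")
        fairness = 1.0 / (1 + served.get(job_id, 0))
        stickiness = 1.0 if job_id == worker_last_job else 0.0
        return w_priority * job["priority"] + w_fair * fairness + w_stick * stickiness
    return max(jobs, key=score)

jobs = {
    "jobA": {"priority": 0.5, "remaining": 10},
    "jobB": {"priority": 0.9, "remaining": 3},
}
served = {"jobA": 2, "jobB": 8}
print(pick_next_job(jobs, worker_last_job="jobA", served=served))
# -> jobA: fairness and stickiness outweigh jobB's higher priority here.
```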
Efficient, Scalable, and Provenance-Aware Management of Linked Data – eXascale Infolab
The proliferation of heterogeneous Linked Data on the Web requires data management systems to constantly improve their scalability and efficiency. Despite recent advances in distributed Linked Data management, efficiently processing large amounts of Linked Data in a scalable way is still very challenging. In spite of their seemingly simple data models, Linked Data actually encode rich and complex graphs mixing both instance and schema level data. At the same time, users are increasingly interested in investigating or visualizing large collections of online data by performing complex analytic queries. The heterogeneity of Linked Data on the Web also poses new challenges to database systems. The capacity to store, track, and query provenance data is becoming a pivotal feature of Linked Data Management Systems. In this thesis, we tackle issues revolving around processing queries on big, unstructured, and heterogeneous Linked Data graphs.
This document summarizes a presentation given at SSSW 2015 on making sense of semantic data. It discusses challenges in understanding semantic web data, including a "language gap" between semantic web languages like SPARQL and natural language. It presents an approach to bridging this gap through automatically verbalizing SPARQL queries in English. Evaluation results show this helps non-experts understand queries better and faster than the SPARQL format. It also discusses the "semantic gap" caused by mismatches between a question's semantics and a knowledge graph, and presents an approach using templates to generate SPARQL queries from natural language questions.
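A toy sketch of the template idea for the question-answering direction: a question pattern is matched and its captured entity fills a SPARQL template. The pattern, prefixes, and predicate below are invented for illustration.

```python
# Toy sketch: map a natural-language question to SPARQL via a pattern/template
# pair. Real template-based systems use many patterns and entity linking; the
# single pattern, prefixes, and predicate here are illustrative only.
import re

TEMPLATES = [
    # (question pattern, SPARQL template)
    (re.compile(r"who is the spouse of (?P<entity>.+)\?", re.I),
     "SELECT ?spouse WHERE {{ dbr:{entity} dbo:spouse ?spouse . }}"),
]

def question_to_sparql(question):
    for pattern, template in TEMPLATES:
        match = pattern.match(question)
        if match:
            entity = match.group("entity").strip().replace(" ", "_")
            return template.format(entity=entity)
    return None

print(question_to_sparql("Who is the spouse of Barack Obama?"))
# SELECT ?spouse WHERE { dbr:Barack_Obama dbo:spouse ?spouse . }
```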
LDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked Data – eXascale Infolab
Uduvudu exploits the semantic and structured nature of Linked Data to generate the best possible representation for a human reader, based on a catalog of available Matchers and Templates. Matchers and Templates are designed so that they can be built through an intuitive editor interface.
Executing Provenance-Enabled Queries over Web Data – eXascale Infolab
The proliferation of heterogeneous Linked Data on the Web poses new challenges to database systems. In particular, because of this heterogeneity, the capacity to store, track, and query provenance data is becoming a pivotal feature of modern triple stores. In this paper, we tackle the problem of efficiently executing provenance-enabled queries over RDF data. We propose, implement and empirically evaluate five different query execution strategies for RDF queries that incorporate knowledge of provenance. The evaluation is conducted on Web Data obtained from two different Web crawls (The Billion Triple Challenge, and the Web Data Commons). Our evaluation shows that using an adaptive query materialization execution strategy performs best in our context. Interestingly, we find that because provenance is prevalent within Web Data and is highly selective, it can be used to improve query processing performance. This is a counterintuitive result as provenance is often associated with additional overhead.
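As a rough illustration of one such strategy, the sketch below first materializes the triples whose provenance passes the query's provenance clause and then evaluates the query over that smaller subset; the quad representation, the filter, and the example query are invented, not the paper's actual execution strategies.

```python
# Illustrative sketch: because provenance is highly selective, materialize only
# the triples whose provenance satisfies the query's provenance clause, then
# answer the RDF query over that (much smaller) subset. Data is invented.

quads = [  # (subject, predicate, object, provenance/source)
    ("ex:alice", "ex:worksAt", "ex:acme", "http://trusted.example.org/crawl1"),
    ("ex:alice", "ex:worksAt", "ex:globex", "http://spam.example.net/page7"),
    ("ex:acme", "ex:locatedIn", "ex:fribourg", "http://trusted.example.org/crawl1"),
]

def materialize(quads, provenance_filter):
    return [(s, p, o) for s, p, o, src in quads if provenance_filter(src)]

def employer_city(triples, person):
    employers = {o for s, p, o in triples if s == person and p == "ex:worksAt"}
    return {o for s, p, o in triples if s in employers and p == "ex:locatedIn"}

trusted = materialize(quads, lambda src: src.startswith("http://trusted.example.org/"))
print(employer_city(trusted, "ex:alice"))   # {'ex:fribourg'}
```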
Micro-task crowdsourcing is rapidly gaining popularity among research communities and businesses as a means to leverage Human Computation in their daily operations. Unlike any other service, a crowdsourcing platform is in fact a marketplace subject to human factors that affect its performance, both in terms of speed and quality. Indeed, such factors shape the dynamics of the crowdsourcing market. For example, a known behavior of such markets is that increasing the reward of a set of tasks would lead to faster results. However, it is still unclear how different dimensions interact with each other: reward, task type, market competition, requester reputation, etc.
In this paper, we adopt a data-driven approach to (A) perform a long-term analysis of a popular micro-task crowdsourcing platform and understand the evolution of its main actors (workers, requesters, and platform). (B) We leverage the main findings of our five year log analysis to propose features used in a predictive model aiming at determining the expected performance of any batch at a specific point in time. We show that the number of tasks left in a batch and how recent the batch is are two key features of the prediction. (C) Finally, we conduct an analysis of the demand (new tasks posted by the requesters) and supply (number of tasks completed by the workforce) and show how they affect task prices on the marketplace.
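A small sketch of how such features might be derived from a batch's state at prediction time; the field names and the exact feature set are invented, and any off-the-shelf regressor could consume the resulting dictionary.

```python
# Illustrative sketch of the feature idea: from a batch's state at time t,
# derive the two features highlighted in the analysis (tasks left and batch
# age) plus some context features, as input to a predictive model.
from datetime import datetime

def batch_features(batch, now):
    age_hours = (now - batch["posted_at"]).total_seconds() / 3600.0
    return {
        "tasks_left": batch["total_tasks"] - batch["completed_tasks"],
        "batch_age_hours": age_hours,
        "reward_per_task": batch["reward_usd"],
        "fraction_done": batch["completed_tasks"] / batch["total_tasks"],
    }

batch = {
    "posted_at": datetime(2024, 1, 1, 8, 0),
    "total_tasks": 500,
    "completed_tasks": 410,
    "reward_usd": 0.05,
}
print(batch_features(batch, now=datetime(2024, 1, 1, 20, 0)))
# {'tasks_left': 90, 'batch_age_hours': 12.0, 'reward_per_task': 0.05, 'fraction_done': 0.82}
```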
CIKM14: Fixing grammatical errors by preposition ranking – eXascale Infolab
The detection and correction of grammatical errors still represent very hard problems for modern error-correction systems. As an example, the top-performing systems at the preposition correction challenge CoNLL-2013 only achieved an F1 score of 17%.
In this paper, we propose and extensively evaluate a series of approaches for correcting prepositions, analyzing a large body of high-quality textual content to capture language usage. Leveraging n-gram statistics, association measures, and machine learning techniques, our system is able to learn which words or phrases govern the usage of a specific preposition. Our approach makes heavy use of n-gram statistics generated from very large textual corpora. In particular, one of our key features is the use of n-gram association measures (e.g., Pointwise Mutual Information) between words and prepositions to generate better aggregated preposition rankings for the individual n-grams.
We evaluate the effectiveness of our approach using cross-validation with different feature combinations and on two test collections created from a set of English language exams and StackExchange forums. We also compare against state-of-the-art supervised methods. Experimental results on the CoNLL-2013 test collection show that our approach to preposition correction achieves an F1 score of ~30%, a 13% absolute improvement over the best-performing approach at that challenge.
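To illustrate the association-measure component, the sketch below ranks candidate prepositions for a context word by pointwise mutual information computed from toy n-gram counts; the real system aggregates such scores over many n-grams and feeds them into a learned ranker.

```python
# Illustrative sketch: rank candidate prepositions for a context word by
# pointwise mutual information (PMI) estimated from n-gram counts. The counts
# below are invented toy values.
import math

# count(word, preposition) from a (toy) n-gram corpus, e.g. "interested in"
pair_counts = {
    ("interested", "in"): 900, ("interested", "on"): 15, ("interested", "at"): 5,
    ("depends", "on"): 700, ("depends", "in"): 20,
}
word_counts = {"interested": 1000, "depends": 800}
prep_counts = {"in": 5000, "on": 4000, "at": 3000}
TOTAL = 100_000   # total number of (word, preposition) bigrams in the corpus

def pmi(word, prep):
    joint = pair_counts.get((word, prep), 0)
    if joint == 0:
        return float("-inf")
    p_joint = joint / TOTAL
    p_word = word_counts[word] / TOTAL
    p_prep = prep_counts[prep] / TOTAL
    return math.log2(p_joint / (p_word * p_prep))

def rank_prepositions(word, candidates=("in", "on", "at")):
    return sorted(candidates, key=lambda p: pmi(word, p), reverse=True)

print(rank_prepositions("interested"))   # 'in' ranks first
```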
OLTPBenchmark is a multi-threaded load generator. The framework is designed to be able to produce variable rate, variable mixture load against any JDBC-enabled relational database. The framework also provides data collection features, e.g., per-transaction-type latency and throughput logs.
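The framework itself is Java-based and driven by configuration files; the Python sketch below is only a conceptual illustration of variable-rate, variable-mixture load generation with per-transaction-type latency collection, run here against an in-memory SQLite database rather than a JDBC target.

```python
# Conceptual sketch only (OLTPBenchmark itself is a Java/JDBC framework):
# issue transactions at a target rate with a weighted mixture of transaction
# types, and record per-transaction-type latencies.
import random
import sqlite3
import time
from collections import defaultdict

def run_load(conn, rate_per_sec, mixture, duration_sec):
    """mixture: dict name -> (weight, function taking a cursor)."""
    names = list(mixture)
    weights = [mixture[n][0] for n in names]
    latencies = defaultdict(list)
    interval = 1.0 / rate_per_sec
    end = time.time() + duration_sec
    while time.time() < end:
        name = random.choices(names, weights=weights)[0]
        start = time.time()
        mixture[name][1](conn.cursor())
        conn.commit()
        latencies[name].append(time.time() - start)
        time.sleep(max(0.0, interval - (time.time() - start)))
    return latencies

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 200)")

mixture = {
    "read": (0.8, lambda cur: cur.execute("SELECT balance FROM accounts WHERE id = 1")),
    "write": (0.2, lambda cur: cur.execute("UPDATE accounts SET balance = balance + 1 WHERE id = 2")),
}
stats = run_load(conn, rate_per_sec=50, mixture=mixture, duration_sec=1)
for name, lat in stats.items():
    print(name, len(lat), "txns, avg", round(sum(lat) / len(lat) * 1000, 2), "ms")
```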
Together with the framework we provide the following OLTP/Web benchmarks:
TPC-C
Wikipedia
Synthetic Resource Stresser
Twitter
Epinions.com
TATP
AuctionMark
SEATS
YCSB
JPAB (Hibernate)
CH-benCHmark
Voter (Japanese "American Idol")
SIBench (Snapshot Isolation)
SmallBank
LinkBench
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series) – eXascale Infolab
Internet Infrastructures for Big Data
Talk given at Verisign's Distinguished Speaker Series, 2014
Prof. Philippe Cudre-Mauroux
eXascale Infolab
http://exascale.info/
This document discusses a project called MEM0R1ES that aims to automatically organize a person's digital information from various devices and online services to generate useful digital memories. The project develops techniques for entity search, typing, clustering, and elicitation to extract, integrate, and expose personal information from heterogeneous graphs. It has produced several open-source software components and published results in top conferences. The document outlines current research directions and concludes that the project addresses important societal issues while stimulating collaboration between institutions.
Crowdsourcing is useful for curating information about tail entities, which are less popular entities like local restaurants, niche sports, or emerging music bands. Targeted crowdsourcing platforms like Pick-A-Crowd aim to match tasks to workers who can provide higher quality answers by considering a worker's social profile and task context. Transactive search uses the knowledge of crowds to reconstruct memories and answer questions by targeting the right people to search sources like Twitter photos or event attendees.
ScienceWISE: A Web-based Interactive Semantic Platform for Scientific Collaboration
1. ScienceWISE: A Web-based Interactive Semantic Platform for Scientific Collaboration
Karl Aberer, Alexey Boyarsky, Philippe Cudré-Mauroux, Gianluca Demartini, and Oleg Ruchayskiy
2. ScienceWISE
• An on-line platform for scientific publications
• A semantic layer on top of ArXiv.org