There is a pressing need to tackle the usability challenges in querying massive, ultra-heterogeneous entity graphs, which use thousands of node and edge types to record millions to billions of entities (persons, products, organizations) and their relationships. Widely known instances of such graphs include Freebase, DBpedia, and YAGO. Applications in a variety of domains are tapping into such graphs for richer semantics and better intelligence. Both data workers and application developers are often overwhelmed by the daunting task of understanding and querying these data, due to their sheer size and complexity. To retrieve data from graph databases, the norm is to use structured query languages such as SQL, SPARQL, and the like. However, writing structured queries requires extensive familiarity with both the query language and the data model. The database community has long recognized the importance of graphical query interfaces to the usability of data management systems. Yet, relatively little has been done. Existing visual query builders allow users to build queries by drawing query graphs, but they do not offer suggestions about which nodes and edges to include. At every step of query formulation, a user can be inundated with hundreds of options or more.
Towards improving the usability of graph query systems, Orion is a visual query interface that iteratively assists users in query graph construction by making suggestions using machine learning methods. In its active mode, Orion suggests top-k edges to be added to a query graph, without being triggered by any user action. In its passive mode, the user adds a new edge manually, and Orion suggests a ranked list of labels for the edge. Orion's edge ranking algorithm, Random Correlation Paths (RCP), uses a query log to rank candidate edges by how likely they are to match the user's query intent. Extensive user studies using Freebase demonstrated that Orion users achieved a 70% success rate in constructing complex query graphs, a significant improvement over the 58% success rate of users of a baseline system that resembles existing visual query builders. Furthermore, using active mode only, the RCP algorithm was compared with several methods adapting other machine learning algorithms, such as random forests and the naive Bayes classifier, as well as recommendation systems based on singular value decomposition. On average, RCP required only 40 suggestions to correctly reach a target query graph, while the other methods required 2-4 times as many suggestions.
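The RCP algorithm itself is described in the paper; the sketch below is only a rough illustration of log-based edge ranking, scoring each candidate edge label by how often it co-occurs in the query log with labels already present in the partial query graph. The toy log and all identifiers are hypothetical.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence(query_log):
    """Count how often each pair of edge labels appears together in a logged query."""
    counts = Counter()
    for query in query_log:  # each query is modeled as a set of edge labels
        for pair in combinations(sorted(query), 2):
            counts[pair] += 1
    return counts

def rank_candidate_edges(partial_query, candidates, counts, k=5):
    """Rank candidate labels by total co-occurrence with the partial query graph."""
    def score(label):
        return sum(counts[tuple(sorted((label, present)))] for present in partial_query)
    return sorted(candidates, key=score, reverse=True)[:k]

log = [{"directed_by", "acted_in"}, {"directed_by", "produced"}, {"acted_in", "spouse_of"}]
counts = build_cooccurrence(log)
print(rank_candidate_edges({"directed_by"}, ["acted_in", "produced", "spouse_of"], counts))
```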
SAFE: Policy Aware SPARQL Query Federation Over RDF Data Cubes - Muhammad Saleem
This document describes SAFE (Policy Aware SPARQL Query Federation Over RDF Data Cubes), a system for securely querying distributed RDF data cubes. SAFE uses source selection, access policy filtering, and query rewriting to enable policy-aware querying over clinical data from multiple sources while preserving privacy. It selects relevant data sources for a query based on triple patterns and an index, filters sources based on access policies for the user, and rewrites the query to retrieve and integrate results from authorized sources only. Evaluation shows SAFE can efficiently perform source selection and query execution over large real-world datasets compared to existing federated query systems.
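SAFE's actual index structure, policy model, and query rewriting are more involved than any snippet; the sketch below only illustrates the general shape of its first two steps (source selection followed by policy filtering), with hypothetical data structures.

```python
from collections import namedtuple

TriplePattern = namedtuple("TriplePattern", ["subject", "predicate", "object"])

def select_authorized_sources(triple_patterns, predicate_index, policies, user):
    """For each triple pattern, keep sources that hold the predicate (index lookup)
    and whose access policy grants this user permission."""
    selection = {}
    for tp in triple_patterns:
        capable = predicate_index.get(tp.predicate, set())
        selection[tp] = {s for s in capable if user in policies.get(s, set())}
    return selection

index = {"ex:hasObservation": {"hospitalA", "hospitalB"}}
policies = {"hospitalA": {"alice"}, "hospitalB": {"bob"}}
tp = TriplePattern("?cube", "ex:hasObservation", "?obs")
print(select_authorized_sources([tp], index, policies, "alice"))  # hospitalA only
```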
Systematic Testing for Resource Leaks in Android Applications - Dacong (Tony) Yan
The use of mobile devices and the complexity of their software continue to grow rapidly. This growth presents significant challenges for software correctness and performance. In addition to traditional defects, a key concern is defects related to the limited resources available on these devices. Resource leaks in an application, due to improper management of resources, can lead to slowdowns, crashes, and a negative user experience. Despite a large body of existing work on leak detection, testing for resource leaks remains a challenging problem. We propose a novel and comprehensive approach for systematic testing for resource leaks in Android software. Similar to existing testing techniques, the approach is based on a GUI model, but is focused specifically on coverage criteria aimed at resource leak defects. These criteria are based on neutral cycles: sequences of GUI events that should have a "neutral" effect and should not lead to increases in resource usage.
We have defined several test coverage criteria based on different categories of neutral cycles in the GUI model. This approach is informed by knowledge of typical causes of resource leaks in Android software. We have also developed LeakDroid, a tool that generates test cases to trigger repeated execution of neutral cycles. When the test cases are executed, resource usage is monitored for suspicious behaviors. The approach has been evaluated on several Android applications. The evaluation demonstrates that the proposed test generation effectively uncovers a variety of resource leaks.
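LeakDroid's generated test cases are Android-specific; the following sketch captures only the neutral-cycle oracle in the abstract: run a supposedly neutral event sequence many times and flag steady resource growth. The two callables are hypothetical hooks into the app under test.

```python
def check_neutral_cycle(run_cycle, measure_usage, repetitions=50, tolerance_per_rep=1024):
    """Execute a neutral GUI event cycle repeatedly (e.g., rotate the screen,
    or open and close an activity) and flag the cycle if resource usage keeps
    growing instead of returning to the baseline."""
    baseline = measure_usage()
    for _ in range(repetitions):
        run_cycle()
    growth = measure_usage() - baseline
    # Steady growth across many repetitions is suspicious; a one-off spike is not.
    return growth > tolerance_per_rep * repetitions
```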
The amount of structured data published on the Web is constantly growing. A significant part of this data is published in accordance with the Linked Data principles. The explicit graph structure enables machines and humans to retrieve descriptions of entities and discover information about relations to other entities. In many cases, descriptions of single entities include thousands of statements, and for human users it becomes difficult to comprehend the data unless a selection of the most relevant facts is provided.
In this paper we introduce LinkSUM, a lightweight link-based approach for the relevance-oriented summarization of knowledge graph entities. LinkSUM optimizes the combination of the PageRank algorithm with an adaptation of the Backlink method, together with new approaches for predicate selection. Both quantitative and qualitative evaluations have been conducted to study the performance of the method in comparison to an existing entity summarization approach. The results show a significant improvement over the state of the art and lead us to conclude that prioritizing the selection of related resources leads to better summaries.
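The exact scoring function is defined in the paper; one plausible reading of the combination step, sketched below with made-up numbers, is a weighted mix of a candidate resource's global PageRank with a Backlink indicator (does the candidate link back to the target entity?).

```python
def linksum_scores(candidates, pagerank, backlinks, alpha=0.85):
    """Weighted mix of global importance (PageRank) and relatedness (Backlink)."""
    return {c: alpha * pagerank.get(c, 0.0) + (1 - alpha) * (1.0 if c in backlinks else 0.0)
            for c in candidates}

pagerank = {"dbr:Pulp_Fiction": 0.031, "dbr:Quentin_Tarantino": 0.054}
backlinks = {"dbr:Pulp_Fiction"}  # resources that link back to the target entity
print(linksum_scores(pagerank.keys(), pagerank, backlinks))
```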
Linked Data knowledge sources such as DBpedia, Freebase, and Wikidata currently offer large amounts of factual data. As the amount of information that can be grasped by users is limited, data summaries are needed. If a summary relates to a specific entity we refer to it as entity summarization. Unfortunately, in many settings, the summaries of entities are tightly bound to user interfaces. This practice poses problems for efficient and objective comparison and evaluation.
In this paper we focus on the question of how to make summaries exchangeable between multiple interfaces and multiple summarization services in order to facilitate evaluation and testing. We introduce SUMMA, an API definition that decouples the generation and presentation of summaries. It enables multiple consumers to retrieve summaries from multiple providers in a unified and lightweight way.
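SUMMA is an API definition rather than an implementation, so the client below is purely hypothetical: it shows the kind of decoupling the paper argues for, where any consumer can request a top-k summary for an entity from any conforming provider. The endpoint shape and parameter names are illustrative, not the normative spec.

```python
import json
import urllib.request

def fetch_summary(provider_endpoint, entity_uri, topk=5):
    """Request a top-k summary for an entity from a (hypothetical) SUMMA-style provider."""
    payload = json.dumps({"entity": entity_uri, "topK": topk}).encode("utf-8")
    req = urllib.request.Request(provider_endpoint, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # a ranked list of statements about the entity

# Any consumer can swap providers without changing its presentation code:
# summary = fetch_summary("https://example.org/summa", "http://dbpedia.org/resource/Berlin")
```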
SPARQL Query Verbalization for Explaining Semantic Search Engine Queries - Basil Ell
This presentation was given at the 11th Extended Semantic Web Conference (ESWC '14), Anissaras/Heronissou, Crete, Greece, and is related to the publication of the same title.
In this paper we introduce Spartiqulation, a system that translates SPARQL queries into English text. Our aim is to allow casual end users of semantic applications with limited to no expertise in the SPARQL query language to interact with these applications in a more intuitive way. The verbalization approach exploits domain-independent template-based natural language generation techniques, as well as linguistic cues in labels and URIs.
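Spartiqulation's full pipeline is template-based with several stages; the toy function below only illustrates one of its linguistic cues, deriving a readable phrase from the local name of a predicate URI. The example predicate and template are hypothetical.

```python
import re

def label_from_uri(uri):
    """Derive a readable label from a URI's local name, e.g. 'dbo:birthPlace' -> 'birth place'."""
    local = uri.split(":")[-1].split("/")[-1]
    spaced = re.sub(r"(?<!^)(?=[A-Z])", " ", local).replace("_", " ")
    return spaced.lower()

def verbalize_triple(subject_phrase, predicate_uri, object_phrase):
    """Fill a simple template with cues extracted from the predicate."""
    return f"{subject_phrase} whose {label_from_uri(predicate_uri)} is {object_phrase}"

print(verbalize_triple("persons", "dbo:birthPlace", "Berlin"))
# -> persons whose birth place is Berlin
```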
For more than a decade, theoretical research on ontology evolution has been published in the literature, and frameworks for managing ontology changes have been developed. However, there are fewer studies that analyze widely used ontologies developed in a collaborative manner to understand community-driven ontology evolution in practice.
We performed an empirical analysis of how four well-known ontologies (DBpedia, Schema.org, PROV-O, and FOAF) have evolved over their lifetimes, together with an analysis of the data quality issues caused by some of the ontology changes.
The use of R statistical package in controlled infrastructure. The case of Cl... - Adrian Olszewski
This document provides an overview of using the R statistical package in clinical research applications. It discusses R's history and increasing use in evidence-based medicine and clinical trials. It also addresses common myths and challenges regarding using R in regulated clinical research and drug development settings, such as concerns about validation and compliance. The presenter outlines R's capabilities for statistical analysis tasks and its ability to interface with other packages like SAS. He discusses steps for preparing R for use in industry, including validation of results and creating a controlled environment.
This document discusses duplicate and near duplicate detection at scale. It begins with an outline of the topics to be covered, including a search system assessment, text extraction assessment, duplicates and near duplicates case study, and exploration of near duplicates using minhash. It then discusses experiments with detecting duplicates and near duplicates in a web index of over 12.5 million documents. The results found over 2.8 million documents had the same text digest, and different techniques like text profiles and minhash were able to further group documents. However, minhash was found to be prohibitively slow at this scale. The document explores reasons for this and concludes more work is needed to efficiently detect near duplicates at large scale.
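The experiments themselves cannot be reproduced from the summary; the sketch below just shows the minhash technique being explored: compare fixed-length signatures instead of full shingle sets. It is a toy illustration, and the document's own finding is that a naive implementation like this is far too slow at the scale of 12.5 million documents.

```python
import hashlib

def shingles(text, n=5):
    """Character n-grams of the document text."""
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def minhash_signature(text, num_hashes=64):
    """One minimum per seeded hash function over the shingle set."""
    grams = shingles(text)
    return [min(int(hashlib.md5(f"{seed}:{g}".encode()).hexdigest(), 16) for g in grams)
            for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing signature positions approximates shingle-set Jaccard."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature("the quick brown fox jumps over the lazy dog")
b = minhash_signature("the quick brown fox jumped over the lazy dog")
print(estimated_jaccard(a, b))
```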
I summarize requirements for an "Open Analytics Environment" (aka "the Cauldron"), and some work being performed at the University of Chicago and Argonne National Laboratory towards its realization.
Users are tapping into massive, heterogeneous entity graphs for many applications. It is challenging to select entity graphs for a particular need, given abundant datasets from many sources and the oftentimes scarce information for them. We propose methods to produce preview tables for compact presentation of important entity types and relationships in entity graphs. The preview tables assist users in attaining a quick and rough preview of the data. They can be shown in a limited display space for a user to browse and explore, before she decides to spend time and resources to fetch and investigate the complete dataset. We formulate several optimization problems that look for previews with the highest scores according to intuitive goodness measures, under various constraints on preview size and distance between preview tables. The optimization problem under distance constraint is NP-hard. We design a dynamic-programming algorithm and an Apriori-style algorithm for finding optimal previews. Results from experiments, comparison with related work and user studies demonstrated the scoring measures’ accuracy and the discovery algorithms’ efficiency.
Databases are the heart of most PHP projects but roughly TWO PERCENT of PHP programmers have had any real training in Structured Query Language, SQL. Then they wonder why their queries perform poorly, why they get N+1 problems, and suddenly the database becomes the choke point of the project. This presentation will cover the basics of relational algebra (no algebra, math or calculus skills needed!!!!), how to think in sets with Venn Diagrams, and how to let the database do the heavy lifting for you. So if you want to write high-performing database queries and be admired as a database deity by your co-workers, then you need to be in this session!
Web services provide programmatic interfaces to online services and tools, allowing for machine-to-machine communication across the web. They have become widely used in the life sciences, with major providers like EMBL-EBI, DDBJ, and NCBI offering hundreds of services. However, ensuring the sustainability, usability, and reliability of web services remains an ongoing challenge. Catalogs aim to help users discover and understand available services, but require community involvement to maintain accurate and up-to-date information.
This document presents a method for detecting hotspots (services responsible for suboptimal performance) in a service-oriented architecture. The method reconstructs service call graphs from logged metrics to identify the top-k services whose optimization would most reduce latency or cost. It proposes quantifying each service's impact and using a greedy algorithm to select services in descending order of impact. Evaluation on a large-scale production dataset found the approach more effective than a baseline and produced consistent results over time, helping engineering teams optimize performance.
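The paper's impact quantification is its core contribution and is not reproduced here; the sketch below shows only the greedy selection skeleton described in the summary, assuming a precomputed per-service impact estimate with hypothetical names.

```python
def top_k_hotspots(impact_by_service, k=3):
    """Greedily pick services in descending order of estimated impact
    (e.g., latency or cost attributable to each service in the call graph)."""
    remaining = dict(impact_by_service)
    chosen = []
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=remaining.get)
        chosen.append(best)
        del remaining[best]
    return chosen

print(top_k_hotspots({"auth": 120.0, "search": 480.0, "billing": 45.0}, k=2))
# -> ['search', 'auth']
```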
The data streaming processing paradigm and its use in modern fog architectures - Vincenzo Gulisano
Invited lecture at the University of Trieste.
The lecture covers (briefly) the data streaming processing paradigm, research challenges related to distributed, parallel and deterministic streaming analysis, and the research of the DCS (Distributed Computing and Systems) group at Chalmers University of Technology.
Scientific Software: Sustainability, Skills & Sociology - Neil Chue Hong
This document discusses the importance of software sustainability. It notes that software is everywhere, long-lived, and hard to define, which makes sustainability challenging. It emphasizes that software sustainability requires cultivating skills in developers and researchers, providing proper incentives, and recognizing that people are a key part of maintaining software over long periods of time.
We present VIIQ (pronounced as wick), an interactive and iterative visual query formulation interface that helps users construct query graphs specifying their exact query intent. Heterogeneous graphs are increasingly used to represent complex relationships in schema-less data, which are usually queried using query graphs. Existing graph query systems offer little help to users in easily choosing the exact labels of the edges and vertices in the query graph. VIIQ helps users easily specify their exact query intent by providing a visual interface that lets them graphically add various query graph components, backed by an edge suggestion mechanism that suggests edges relevant to the user’s query intent. In this demo we present: 1) a detailed description of the various features and user-friendly graphical interface of VIIQ, 2) a brief description of the edge suggestion algorithm, and 3) a demonstration scenario that we intend to show the audience.
The document summarizes Kenneth Emeka Odoh's presentation on recommender systems and his solution to the WSDM Challenge competition. It includes discussions of the top solutions which used techniques like light gradient boosted machines, neural networks, and ensemble modeling. It also describes Kenneth's solution using bidirectional LSTMs with techniques like batch normalization and dropout to avoid overfitting on the time series song listening data. Overall, the presentation covered many state-of-the-art recommender system techniques for sequential and time series prediction tasks.
This document is a resume for Wang Lichang. It summarizes his objective as a software engineer, his education including a Master's degree from Northwest A&F University in China and current studies at UC San Diego. It also outlines his relevant work experience developing web and desktop applications, including for genomic data sharing and analysis. His technical skills include languages like C++, R and Python, and areas like data encryption, compression and high performance computing. He has received several honors and awards for his academic performance.
An introduction to R - ssuser3c3f88
R is a language and environment for statistical computing and graphics. It provides functions for data manipulation, calculation, and graphical displays. Key features of R include its ability to produce publication-quality plots, perform statistical tests, fit models to data, and develop statistical software. R has an extensive library of additional user-contributed packages that extend its capabilities. The document provides information on downloading and using R, reading data into R, customizing plots, and interactive plotting functions.
Detecting Hacks: Anomaly Detection on Networking Data - James Sirota
See https://medium.com/@jamessirota for a series of blog entries that go with this deck...
Defense in Depth for Big Data
Network Anomaly Detection Overview
Volume Anomaly Detection
Feature Anomaly Detection
Model Architecture
Deployment on OpenSOC Platform
Questions
Introduction to Big Data Analytics: Batch, Real-Time, and the Best of Both Wo... - WSO2
In this webinar, Srinath Perera, director of research at WSO2, will discuss:
Big data landscape: concepts, use cases, and technologies
Real-time analytics with WSO2 CEP
Batch analytics with WSO2 BAM
Combining batch and real-time analytics
Introducing WSO2 Machine Learner
Netflix Edge Engineering Open House Presentations - June 9, 2016 - Daniel Jacobson
Netflix's Edge Engineering team is responsible for handling all device traffic to support the user experience, including sign-up, discovery, and the triggering of the playback experience. Developing and maintaining this set of massive-scale services is no small task, and its success is the difference between millions of happy streamers and millions of missed opportunities.
This video captures the presentations delivered at the first ever Edge Engineering Open House at Netflix. This video covers the primary aspects of our charter, including the evolution of our API and Playback services as well as building a robust developer experience for the internal consumers of our APIs.
Ontology-based data access: why it is so cool! - Josef Hardi
A brief introduction to ontology-based data access (OBDA) and its core implementation. I also presented a recent simple benchmark between -ontop- and Semantika, the two most readily available OBDA frameworks, in terms of query performance (with details in the appendix section). The slides were presented at the Friday Research Meeting at the Stanford Center for Biomedical Informatics Research (BMIR).
License: Creative Commons by Attribution 3.0
Database Basics with PHP -- Connect JS Conference October 17th, 2015 - Dave Stokes
This presentation covers the basics of using a relational database for PHP developers. Included are using Venn Diagrams, examining queries, and letting the database do the heavy lifting.
This document discusses the challenges and opportunities biology faces with increasing data generation. It outlines four key points:
1) Research approaches for analyzing infinite genomic data streams, such as digital normalization which compresses data while retaining information.
2) The need for usable software and decentralized infrastructure to perform real-time, streaming data analysis.
3) The importance of open science and reproducibility given most researchers cannot replicate their own computational analyses.
4) The lack of data analysis training in biology and efforts at UC Davis to address this through workshops and community building.
5th in the AskTOM Office Hours series on graph database technologies. https://devgym.oracle.com/pls/apex/dg/office_hours/3084
PGQL: A Query Language for Graphs
Learn how to query graphs using PGQL, an expressive and intuitive graph query language that's a lot like SQL. With PGQL, it's easy to get going writing graph analysis queries to the database in a very short time. Albert and Oskar show what you can do with PGQL, and how to write and execute PGQL code.
The IDIR Lab at the University of Texas at Arlington has been on the frontier of computational journalism research in the past several years. This talk will discuss two systems developed in this thrust. (1) ClaimBuster, a fact-checking system, uses machine learning and natural language processing techniques to help journalists find political claims to fact-check. We will use it to find factual statements in the 2016 United States presidential debates that are worth checking, with the goal of expanding its use to other types of discourse, articles, social media, as well as non-political claims. Still under development, a near real-time test during the recent August 6th Republican debate of the 2016 election suggests a statistically significant agreement between ClaimBuster and professional fact-checkers. This talk will discuss the ideas behind ClaimBuster, its data collection process, the experience with the Republican debate, and our quest for the "Holy Grail"---a completely-automated, live fact-checking machine that replaces human fact-checkers. (2) The talk will also briefly discuss FactWatcher, a system that helps journalists identify data-backed, attention-seizing facts which serve as leads to news stories. Given an append-only database, upon the arrival of a new tuple, FactWatcher monitors if the tuple triggers any new facts. Its algorithms efficiently search for facts without exhaustively testing all possible ones. Furthermore, FactWatcher provides multiple features in striving for an end-to-end system, including fact ranking, fact-to-statement translation and keyword-based fact search. In 2014 the system won an Excellent Demonstration Award at VLDB, one of the two most prestigious conferences in the database area.
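ClaimBuster's real model is trained on a human-labeled corpus of debate sentences; the pipeline below is a minimal stand-in of the same shape (text features plus a supervised classifier scoring check-worthiness), using scikit-learn and invented training examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = [
    "We have created seven million jobs since 2010.",  # check-worthy factual claim
    "I think the American people deserve better.",     # opinion
    "Thank you all for being here tonight.",           # not a factual claim
    "Crime rates fell by 20 percent in three years.",  # check-worthy factual claim
]
labels = [1, 0, 0, 1]  # 1 = check-worthy

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(sentences, labels)

# Score a new statement by its probability of being check-worthy.
print(model.predict_proba(["Unemployment is at a fifty-year low."])[:, 1])
```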
This document provides an overview of Chengkai Li's computational journalism research at UT-Arlington, including:
- The history of the research starting in 2009 with fact-finding problems in sports journalism.
- Key projects including ClaimBuster for data-driven fact-checking and FactWatcher for exceptional fact finding from real-world events.
- Challenges in graph data usability and systems developed like GQBE, Orion, and TableView to make graph data easier to understand and explore.
- Current students and collaborators on projects involving automated fact-checking, monitoring claims on social media, and graph data management.
A database application differs from regular applications in that some of its inputs may be database queries. The program will execute the queries on a database and may use any result values in its subsequent program logic. This means that a user-supplied query may determine the values that the application will use in subsequent branching conditions. At the same time, a new database application is often required to work well on a body of existing data stored in some large database. For systematic testing of database applications, recent techniques replace the existing database with carefully crafted mock databases. Mock databases return values that will trigger as many execution paths in the application as possible and thereby maximize overall code coverage of the database application.
In this paper we offer an alternative approach to database application testing. Our goal is to support software engineers in focusing testing on the existing body of data the application is required to work well on. For that, we propose to side-step mock database generation and instead generate queries for the existing database. Our key insight is that we can use the information collected during previous program executions to systematically generate new queries that will maximize the coverage of the application under test, while guaranteeing that the generated test cases focus on the existing data.
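The paper's generation strategy is more principled than any snippet can show; the sketch below conveys the feedback loop it implies: execute candidate queries against the existing database, trace which branches of the application they exercise, and keep mutants that cover something new. The tracing and mutation hooks are hypothetical.

```python
def generate_queries(seed_queries, run_and_trace, mutate, budget=100):
    """Coverage-guided query generation over the existing database:
    keep a generated query only if it exercises new application branches."""
    corpus = list(seed_queries)
    covered = set()
    for query in corpus:
        covered |= run_and_trace(query)      # set of branch ids exercised
    for _ in range(budget):
        candidate = mutate(corpus)           # e.g., tweak a WHERE predicate
        new_branches = run_and_trace(candidate) - covered
        if new_branches:
            corpus.append(candidate)
            covered |= new_branches
    return corpus
```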
The document describes a system called Facetedpedia that automatically generates faceted interfaces for sets of Wikipedia articles. It discusses challenges in determining what facets to use and how to rank them. Facets are groupings of attribute articles related to target articles. It defines facets and introduces methods to calculate the cost of single facets and rank multiple facets based on coverage and minimizing user effort in navigation. The goal is to select good query-dependent facets to provide structure for exploring related Wikipedia articles.
Wikipedia is the largest user-generated knowledge base. We propose a structured query mechanism, entity-relationship query, for searching entities in Wikipedia corpus by their properties and inter-relationships. An entity-relationship query consists of arbitrary number of predicates on desired entities. The semantics of each predicate is specified with keywords. Entity-relationship query searches entities directly over text rather than pre-extracted structured data stores. This characteristic brings two benefits: (1) Query semantics can be intuitively expressed by keywords; (2) It avoids information loss that happens during extraction. We present a ranking framework for general entity-relationship queries and a position-based Bounded Cumulative Model for accurate ranking of query answers. Experiments on INEX benchmark queries and our own crafted queries show the effectiveness and accuracy of our ranking method.
This paper studies the problem of prominent streak discovery in sequence data. Given a sequence of values, a prominent streak is a long consecutive subsequence consisting of only large (small) values. For finding prominent streaks, we make the observation that prominent streaks are skyline points in two dimensions: streak interval length and minimum value in the interval. Our solution thus hinges upon the idea to separate the two steps in prominent streak discovery: candidate streak generation and skyline operation over candidate streaks. For candidate generation, we propose the concept of local prominent streak (LPS). We prove that prominent streaks are a subset of LPSs and the number of LPSs is less than the length of a data sequence, in comparison with the quadratic number of candidates produced by a brute-force baseline method. We develop efficient algorithms based on the concept of LPS. The non-linear LPS-based method (NLPS) considers a superset of LPSs as candidates, and the linear LPS-based method (LLPS) further guarantees to consider only LPSs. The results of experiments using multiple real datasets verified the effectiveness of the proposed methods and showed orders of magnitude performance improvement against the baseline method.
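As a small illustration of the skyline view (not of the LPS optimization itself), the sketch below enumerates every interval brute-force and keeps the (length, minimum value) skyline; NLPS and LLPS reach the same answer while generating far fewer candidates.

```python
def prominent_streaks(values):
    """Brute-force baseline: each interval yields a candidate (length, min value);
    prominent streaks are the candidates dominated by no other candidate."""
    candidates = []
    for i in range(len(values)):
        lo = values[i]
        for j in range(i, len(values)):
            lo = min(lo, values[j])
            candidates.append((j - i + 1, lo))
    return sorted({c for c in candidates
                   if not any(d != c and d[0] >= c[0] and d[1] >= c[1]
                              for d in candidates)})

print(prominent_streaks([3, 1, 7, 7, 2, 5]))
# -> [(2, 7), (4, 2), (6, 1)]
```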
The document proposes algorithms to efficiently find "skyline groups" from a database. A skyline group is a subset of k tuples from the database that is not dominated by any other k-tuple combination based on aggregate functions like sum, min, or max. The document presents two algorithms:
1) Order Specific Method (OSM), which leverages the property that if a k-tuple group is skyline, one of its (k-1)-tuple subsets must also be skyline, in order to prune the search space.
2) Weak Candidate Method (WCM), which leverages a weaker property for pruning when finding skyline groups under min and max aggregates.
Both algorithms significantly reduce the search space compared to a brute-force baseline.
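A sketch of the shared primitive behind both methods, assuming each group is summarized by an aggregate vector (here per-attribute SUM over its k tuples): one group dominates another if it is at least as good on every attribute and strictly better on at least one.

```python
def aggregate(group):
    """Summarize a k-tuple group as one vector by summing each attribute."""
    return tuple(sum(t[i] for t in group) for i in range(len(group[0])))

def dominates(g, h):
    """g dominates h if g >= h everywhere and g > h somewhere (bigger is better)."""
    return all(a >= b for a, b in zip(g, h)) and any(a > b for a, b in zip(g, h))

g1 = aggregate([(5, 3), (2, 4)])  # -> (7, 7)
g2 = aggregate([(1, 2), (3, 1)])  # -> (4, 3)
print(dominates(g1, g2))          # True: g2's group cannot be a skyline group
```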
This document discusses finding groups of k objects from a larger set that have strong collective scores across multiple merits or criteria. It provides examples of applications like crowdsourcing, question answering, and product reviews. It then outlines the framework of an approach called CrewScout that finds skyline groups - groups that are not dominated by other groups on any criteria - to identify optimal collections of objects.
We study the novel problem of finding new, prominent situational facts, which are emerging statements about objects that stand out within certain contexts. Many such facts are newsworthy—e.g., an athlete’s outstanding performance in a game, or a viral video’s impressive popularity. Effective and efficient identification of these facts assists journalists in reporting, one of the main goals of computational journalism. Technically, we consider an ever-growing table of objects with dimension and measure attributes. A situational fact is a “contextual” skyline tuple that stands out against historical tuples in a context, specified by a conjunctive constraint involving dimension attributes, when a set of measure attributes are compared. New tuples are constantly added to the table, reflecting events happening in the real world. Our goal is to discover constraint-measure pairs that qualify a new tuple as a contextual skyline tuple, and discover them quickly before the event becomes yesterday’s news. A brute-force approach requires exhaustive comparison with every tuple, under every constraint, and in every measure subspace. We design algorithms in response to these challenges using three corresponding ideas—tuple reduction, constraint pruning, and sharing computation across measure subspaces. We also adopt a simple prominence measure to rank the discovered facts when they are numerous. Experiments over two real datasets validate the effectiveness and efficiency of our techniques.
We witness an unprecedented proliferation of knowledge graphs that record millions of heterogeneous entities and their diverse relationships. While knowledge graphs are structure-flexible and content-rich, it is difficult to query them. The challenge lies in the gap between their overwhelming complexity and the limited database knowledge of non-professional users. If writing structured queries over “simple” tables is difficult, it gets even harder to query complex knowledge graphs. As an initial step toward improving the usability of knowledge graphs, we propose to query such data by example entity tuples, without requiring users to write complex graph queries. Our system, GQBE (Graph Query By Example), is a proof of concept to show the possibility of this querying paradigm working in practice. The proposed framework automatically derives a hidden query graph based on input query tuples and finds approximate matching answer graphs to obtain a ranked list of top-k answer tuples. It also makes provisions for users to give feedback on the presented top-k answer tuples. The feedback is used to refine the query graph to better capture the user intent. We conducted initial experiments on the real-world Freebase dataset, and observed appealing accuracy and efficiency. Our proposal of querying by example tuples provides a complementary approach to the existing keyword-based and query-graph-based methods, facilitating user-friendly graph querying. To the best of our knowledge, GQBE is among the first few emerging systems to query knowledge graphs by example entity tuples.
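GQBE's query-graph derivation works over weighted neighborhoods and is more sophisticated than this; the sketch below only illustrates its first intuition: short labeled paths connecting the example entities approximate the hidden query graph. The tiny graph is hypothetical.

```python
from collections import deque

def connecting_paths(graph, source, target, max_hops=2):
    """Find labeled paths of bounded length between two example entities."""
    paths, queue = [], deque([(source, [])])
    while queue:
        node, path = queue.popleft()
        if node == target and path:
            paths.append(path)
            continue
        if len(path) < max_hops:
            for label, neighbor in graph.get(node, []):
                queue.append((neighbor, path + [(node, label, neighbor)]))
    return paths

g = {"JerryYang": [("founded", "Yahoo")],
     "Yahoo": [("headquartersIn", "Sunnyvale")]}
print(connecting_paths(g, "JerryYang", "Yahoo"))
# -> [[('JerryYang', 'founded', 'Yahoo')]]
```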
Towards computational journalism, we present FactWatcher, a system that helps journalists identify data-backed, attention-seizing facts which serve as leads to news stories. FactWatcher discovers three types of facts, including situational facts, one-of-the-few facts, and prominent streaks, through a unified suite of data model, algorithm framework, and fact ranking measure. Given an append-only database, upon the arrival of a new tuple, FactWatcher monitors if the tuple triggers any new facts. Its algorithms efficiently search for facts without exhaustively testing all possible ones. Furthermore, FactWatcher provides multiple features in striving for an end-to-end system, including fact ranking, fact-to-statement translation and keyword-based fact search.
CrewScout is an expert-team finding system based on the concept of skyline teams and efficient algorithms for finding such teams. Given a set of experts, CrewScout finds all k-expert skyline teams, which are not dominated by any other k-expert teams. The dominance between teams is governed by comparing their aggregated expertise vectors. The need for finding expert teams prevails in applications such as question answering, crowdsourcing, panel selection, and project team formation. The new contributions of this paper include an end-to-end system with an interactive user interface that assists users in choosing teams and a demonstration of its application domains.
This is the first study of crowdsourcing Pareto-optimal object finding over partial orders and by pairwise comparisons, which has applications in public opinion collection, group decision making, and information exploration. Departing from prior studies on crowdsourcing skyline and ranking queries, it considers the case where objects do not have explicit attributes and preference relations on objects are strict partial orders. The partial orders are derived by aggregating crowdsourcers' responses to pairwise comparison questions. The goal is to find all Pareto-optimal objects by the fewest possible questions. It employs an iterative question-selection framework. Guided by the principle of eagerly identifying non-Pareto optimal objects, the framework only chooses candidate questions which must satisfy three conditions. This design is both sufficient and efficient, as it is proven to find a short terminal question sequence. The framework is further steered by two ideas: macro-ordering and micro-ordering. By different micro-ordering heuristics, the framework is instantiated into several algorithms with varying power in pruning questions. Experiment results using both a real crowdsourcing marketplace and simulations exhibited not only orders of magnitude reductions in questions when compared with a brute-force approach, but also close-to-optimal performance from the most efficient instantiation.
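The candidate-question conditions and micro-ordering heuristics do not fit in a snippet; the sketch below shows only the basic bookkeeping the framework rests on: given the pairwise preferences aggregated from crowd answers so far, any object that some other object is preferred over is pruned as non-Pareto-optimal.

```python
def pareto_optimal(objects, preferred_pairs):
    """preferred_pairs contains (a, b) meaning the crowd prefers a over b.
    An object stays Pareto-optimal only while nothing is preferred over it."""
    dominated = {b for _, b in preferred_pairs}
    return [o for o in objects if o not in dominated]

answers = {("cityA", "cityB")}  # aggregated crowd responses so far
print(pareto_optimal(["cityA", "cityB", "cityC"], answers))
# -> ['cityA', 'cityC']; cityB is pruned without asking further questions about it
```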
In this research, we deployed an automated fact-checking system (ClaimBuster) on the 2016 presidential primary debates and assessed its performance compared to professional news organizations. In real-time, ClaimBuster scored the statements made by candidates on check-worthiness, the likelihood that the statement contained an important factual claim. Its discrimination compared to professional journalists was high. Statements chosen for fact checking by CNN and PolitiFact had been scored much higher by ClaimBuster than those not selected. The topics of statements chosen for checking also mirrored the topics of those scored highly by ClaimBuster. Differences between the political parties and individual candidates on the use of factual claims are also presented.
ClaimBuster is the first end-to-end fact-checking system that can detect check-worthy factual claims, fact-check the claims, and provide summaries of its fact-checking. It was developed by researchers from the University of Mississippi and University of Texas at Arlington to help combat the spread of fake news and misinformation. The system uses natural language processing techniques to classify sentences from documents and debates as opinions, unimportant facts, or check-worthy factual claims. It then fact-checks the check-worthy claims and provides a summary of its fact-checking results.
This paper introduces how ClaimBuster, a fact-checking platform, uses natural language processing and supervised learning to detect important factual claims in political discourses. The claim spotting model is built using a human-labeled dataset of check-worthy factual claims from the U.S. general election debate transcripts. The paper explains the architecture and the components of the system and the evaluation of the model. It presents a case study of how ClaimBuster live covers the 2016 U.S. presidential election debates and monitors social media and Australian Hansard for factual claims. It also describes the current status and the long-term goals of ClaimBuster as we keep developing and expanding it.
This paper introduces TableView, a tool that helps generate preview tables of massive heterogeneous entity graphs. Given the large number of datasets from many different sources, it is challenging to select entity graphs for a particular need. TableView produces preview tables to offer a compact presentation of important entity types and relationships in an entity graph. Users can quickly browse the preview in a limited space instead of spending precious time or resources to fetch and explore the complete dataset that turns out to be inappropriate for their needs. In this paper we present: 1) a detailed description of TableView’s graphical user interface and its various features, 2) a brief description of TableView’s preview scoring measures and algorithms, and 3) the demonstration scenarios we intend to present to the audience.
We present Maverick, a general, extensible framework that discovers exceptional facts about entities in knowledge graphs. To the best of our knowledge, there was no previous study of the problem. We model an exceptional fact about an entity of interest as a context-subspace pair, in which a subspace is a set of attributes and a context is defined by a graph query pattern of which the entity is a match. The entity is exceptional among the entities in the context, with regard to the subspace. The search spaces of both patterns and subspaces are exponentially large. Maverick conducts beam search on the patterns which uses a match-based pattern construction method to evade the evaluation of invalid patterns. It applies two heuristics to select promising patterns to form the beam in each iteration. Maverick traverses and prunes the subspaces organized as a set enumeration tree by exploiting the upper bound properties of exceptionality scoring functions. Results of experiments and user studies using real-world datasets demonstrated substantial performance improvement of the proposed framework over the baselines as well as its effectiveness in discovering exceptional facts.
In this paper, we show that by using a relatively simple neural network architecture and including edge (i.e., nonsensical) cases in a dataset, we can more reliably identify factual claims than predecessor SVM models. Doping the dataset with these nonsensical examples results in a more robust model overall that is resistant to being tricked into classifying sentences into a certain category based on easily met criteria. Furthermore, we show that the use of multiple word embeddings makes little difference to the overall accuracy of the model, but particular embeddings perform differently on text that contains digits (i.e., 0-9), which can be leveraged by using multiple models to come to a conclusion on the score for a particular piece of text. Our results also show that, for our particular dataset, trying to differentiate sentences into more than two categories might hurt the overall accuracy of the models, or at least not provide any substantial benefits compared to the binary classification scenario.
We study the problem of continuous object dissemination—given a large number of users and continuously arriving new objects, deliver an object to all users who prefer the object. Many real world applications analyze users’ preferences for effective object dissemination. For continuously arriving objects, timely finding users who prefer a new object is challenging. In this paper, we consider an append-only table of objects with multiple attributes and users’ preferences on individual attributes are modeled as strict partial orders. An object is preferred by a user if it belongs to the Pareto frontier with respect to the user’s partial orders. Users’ preferences can be similar. Exploiting shared computation across similar preferences of different users, we design algorithms to find target users of a new object. In order to find users of similar preferences, we study the novel problem of clustering users’ preferences that are represented as partial orders. We also present an approximate solution of the problem of finding target users which is more efficient than the exact one while ensuring sufficient accuracy. Furthermore, we extend the algorithms to operate under the semantics of sliding window. We present the results from comprehensive experiments for evaluating the efficiency and effectiveness of the proposed techniques.
Tackling Usability Challenges in Querying Massive, Ultra-heterogeneous Graphs
1. Tackling Usability Challenges in Querying Massive, Ultra-heterogeneous Graphs
Chengkai Li
Associate Professor, Department of Computer Science and Engineering
Director, Innovative Database and Information Systems Research (IDIR) Laboratory
University of Texas at Arlington
North Carolina State University, Feb. 9th, 2016