My name is Garrett Wolf and I’m a student from Arizona State University. Today I’m going to talk about our research on “Query Processing over Incomplete Autonomous Databases”. This work was performed in conjunction with Hemal Khatri (who is currently with MSN Live Search), Bhaumik Chokshi, Jianchun Fan (who is currently with Amazon), Yi Chen, and Subbarao Kambhampati.
Over the last decade, we’ve seen an increase in the number of databases which are accessible via the web. These databases are made available through web-form based interfaces which allow users to enter a query. This query is then sent to the autonomous database, and the results are returned and presented to the user in HTML format. Access to these databases allows users to view vast amounts of information they were unable to view before. However, because of the large number of online databases, users are often troubled with deciding which database to access. As a result, mediator systems are being developed to provide users with a single point of access to multiple databases, thus allowing the user to query several sources without having to visit each site individually.
One issue that mediators face when dealing with these databases is incomplete data. Incompleteness may arise for a number of reasons:

Inaccurate Extraction/Recognition – Often web databases are populated via automated extraction techniques. Such techniques are inherently imperfect and thus give rise to incompleteness in the databases they populate (e.g. an automated extractor extracts the first name / last name pair as “Billy” / “Bob Thornton” respectively when in fact the extraction should have been “Billy Bob” / “Thornton”). Similarly, incompleteness can result from handwritten forms which are fed into a computer and scanned/converted to electronic text via Optical Character Recognition (OCR) software. Even the best OCR software will have problems converting a doctor’s handwriting into meaningful text.

Incomplete Entry – Many times, the incompleteness is a direct result of incomplete entry on the part of the user. Consider an online classifieds website where a user may go to sell their car. When filling out the form, a user might intentionally or unintentionally leave the “Make” attribute blank, assuming it is obvious since they entered the “Model” as “Accord”.

Heterogeneous Schemas – In a mediator scenario, a user sends their query to the mediator system, which in turn issues the query to multiple sources. Many times these sources differ in the schemas they support locally. When a source fails to support all the attributes in the global mediator schema, the source essentially contains missing values over all such attributes.

User-defined Schemas – Recently there has been a shift towards systems which support user-defined schemas (e.g. Google Base gives users significant freedom to define and list their own attributes). Oftentimes this freedom can lead to “redundant attributes” (e.g. some users decide to use an attribute called “Make” whereas others may choose an attribute called “Manufacturer”).
Moreover, this freedom can also lead to the proliferation of null values (e.g. a tuple which gives a value for “Make” is unlikely to give a value for “Manufacturer” and vice versa). In order to find out just how much incompleteness is out there, we took a sample from 3 online databases, namely AutoTrader.com, CarsDirect.com, and Google Base. In each case, we counted up the percentage of tuples which we found to be incomplete (by incomplete we mean that the tuple contains at least one missing value). As you can see, these real-life databases contain a significant percentage of incomplete tuples. If we look at Google Base, we notice that 100% of the tuples were found to be incomplete. This is due to the large number of attributes found in the global schema. Because users define their own attributes, there are over 200 distinct attributes used to describe the vehicles found in our sample. Because none of the tuples provided values for all 200+ attributes, each tuple is essentially incomplete w.r.t. the attributes it lacks.
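As a side note, the incompleteness measure used here is easy to state precisely. Below is a minimal Python sketch of it, where tuples are dictionaries and None marks a missing value; the sample data is made up for illustration, not taken from our actual crawl:

```python
def fraction_incomplete(tuples, attributes):
    """Fraction of tuples missing at least one of the given attributes."""
    incomplete = sum(
        1 for t in tuples
        if any(t.get(a) is None for a in attributes)
    )
    return incomplete / len(tuples) if tuples else 0.0

# Illustrative sample: the second and third tuples are incomplete.
sample = [
    {"Make": "Honda", "Model": "Accord", "Body": "Sedan"},
    {"Make": None,    "Model": "Accord", "Body": "Sedan"},
    {"Make": "Honda", "Model": "Civic",  "Body": None},
]
print(fraction_incomplete(sample, ["Make", "Model", "Body"]))  # 2 of 3 tuples
```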
EMPHASIZE: Our system differs from existing ones when a tuple has a null value *on the query-constrained attributes*. So what is the problem we really want to solve? The problem is that current mediator systems only return “certain answers”, namely those which exactly satisfy all the user query constraints. As a result, such systems have high precision but suffer from low recall, as there are tuples which do not exactly satisfy the user query constraints but which the user would still find relevant. Consider a user looking for a used car. The user is interested in finding a “Honda Accord” with a “Sedan” body style, and they don’t want to spend more than “$12,000”. Current mediator systems would return certain answers such as these in response to the user’s query. This approach works fine if we assume that all the tuples in the database are complete; however, we’ve previously shown that this is often not the case. Let’s assume these tuples each contained a missing value on one of their attributes. Are these tuples no longer relevant to the query? Consider the first tuple, whose “Body Style” is missing. Wouldn’t the user still be interested in receiving this tuple? Sure, the body style attribute is missing, but perhaps the user is able to clearly distinguish the car’s body style by simply looking at the picture. How about the second tuple, whose “Make” is missing? Should this tuple be returned in response to the query or not? Maybe the user is decently knowledgeable when it comes to cars and knows that “Accords” are made by “Honda”. Given this, isn’t it likely that the user would still find the tuple relevant despite its missing “Make”? Such a tuple would not be returned by systems only supporting certain answers. Finally, let’s take a look at the third tuple. This tuple is missing a value for “Price”, but does that automatically mean the user wouldn’t find it relevant?
Let’s assume that the system has seen enough tuples that it knows, with some probability, that an “Accord” built in “1999” is going to cost less than “$12,000”. If the system made this prediction and provided an explanation to the user, isn’t it likely that the user would still be interested in the tuple? So the problem we face is that current systems which only support certain answers would not return any of these 3 tuples, despite their being relevant to the user. The question we want to answer is “How can we support query processing over incomplete, autonomous databases in order to retrieve uncertain results in a ranked fashion?”.
Before presenting our approach, I’m going to quickly go over some possible approaches. Assume a user issues a query to retrieve all cars with “Body Style = Convertible”. What are the possible approaches we could take? The first approach we might take is called “CertainOnly”. This is the approach we discussed previously, namely the one taken by traditional databases. Here only exact matches are returned. It’s obvious that this approach is no good in that it suffers from low recall. The second approach we might take is called “AllReturned”. This approach tries to improve on the previous one by increasing recall. To do so, it treats missing or NULL values as a wildcard in that they match any concrete value. Hence, in addition to returning all the tuples having “Body Style = Convertible”, it would also return all tuples whose body style is missing. Again we see that this approach is no good, in that many of the tuples with missing body styles are unlikely to be convertibles. Moreover, the approach may be infeasible, as many sources do not allow us to directly query for tuples with null values. The third approach we might take is called “AllRanked”. This approach is similar to the previous one, but it additionally tries to improve precision by first predicting the values of missing attributes and then ranking the incomplete tuples in order of their likelihood of matching the query constraints. Unfortunately, this approach is no good either, as it is costly: it must still retrieve all tuples with missing body styles (whether they have a high probability of being convertibles or not) and then additionally rank these tuples. Not to mention that such an approach might still be infeasible if the source does not support direct querying of null values. Given the drawbacks of each of these approaches, we developed another approach which we call QPIAD.
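To make the three baselines concrete, here is a rough Python sketch; the `predict` argument stands in for a missing-value classifier and is a hypothetical placeholder, not part of any of these systems:

```python
def certain_only(tuples, attr, value):
    """CertainOnly: return only tuples that exactly satisfy the constraint."""
    return [t for t in tuples if t.get(attr) == value]

def all_returned(tuples, attr, value):
    """AllReturned: treat NULL as a wildcard matching any concrete value,
    so tuples missing the constrained attribute are also returned."""
    return [t for t in tuples if t.get(attr) == value or t.get(attr) is None]

def all_ranked(tuples, attr, value, predict):
    """AllRanked: retrieve everything AllReturned would, then rank the
    incomplete tuples by the predicted probability of matching the query."""
    certain = [(1.0, t) for t in tuples if t.get(attr) == value]
    uncertain = sorted(
        ((predict(t), t) for t in tuples if t.get(attr) is None),
        key=lambda pair: -pair[0],
    )
    return certain + uncertain
```

Note that all three still require pulling every null-valued tuple from the source, which is the cost (and feasibility) problem discussed above.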
Now I’m going to briefly run through QPIAD’s approach to retrieving relevant uncertain results. The approach uses query rewriting and ranking to retrieve and present these uncertain results to the user. Let’s assume that this is the autonomous database and we are given a query “Body Style=Convertible” and asked to retrieve all the relevant tuples. Again, the set of relevant tuples is likely to contain both certain and uncertain results. The first step in the QPIAD approach is to issue the original query to the database and retrieve the certain results, which we refer to as the “base result set” or simply the “base set”. We then use the base set tuples along with a mined attribute dependency to generate a set of rewritten queries. So, for example, let’s say we have an “Approximate Functional Dependency” which we mined from our sample data. We have an AFD for each attribute, and in this scenario we want to use an AFD which specifies Body Style on the right-hand side. Here we have an AFD “Model ~> Body Style” which basically says that given a car’s model, we can determine its body style with some degree of confidence. Because the original query was on body style, we generate rewritten queries using the model attribute, as we know that it can be used to determine a car’s body style approximately. Using the base set, we find all distinct values for model and create a rewritten query for each. The intuition behind this approach is that if we know that these models have tuples where the body style is convertible, then it’s likely that if we issue queries for these models, we will get additional tuples whose body style is also convertible, or, more importantly, tuples whose body style is missing but is in fact convertible. Because these rewritten queries are not equally likely to bring back convertible cars, we first order them before issuing them to the data source.
Later I’ll discuss exactly how we order the queries, but for now let’s just assume they have been ordered. Once the rewritten queries are ordered, they are issued to the data source to retrieve the relevant uncertain results. A nice feature of the QPIAD approach is that all the tuples retrieved by a rewritten query are of the same rank, and hence we do not have to sort the entire set of results; we merely sort the set of rewritten queries.
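The rewriting step just walked through can be sketched roughly as follows; the function name and the base set data are illustrative, not the actual implementation:

```python
def generate_rewritten_queries(base_set, determining_attr):
    """From the base set (certain answers to the original query), create one
    rewritten query per distinct value of the AFD's determining attribute.
    E.g. for Model ~> Body Style, rewrite on the distinct Models seen."""
    distinct_values = []
    for t in base_set:
        v = t.get(determining_attr)
        if v is not None and v not in distinct_values:
            distinct_values.append(v)
    # Each rewritten query constrains only the determining attribute.
    return [{determining_attr: v} for v in distinct_values]

# Base set for the query Body Style=Convertible (made-up tuples).
base_set = [
    {"Model": "Z4",      "Body": "Convt"},
    {"Model": "Boxster", "Body": "Convt"},
    {"Model": "Z4",      "Body": "Convt"},
]
print(generate_rewritten_queries(base_set, "Model"))
# [{'Model': 'Z4'}, {'Model': 'Boxster'}]
```

Issuing these rewritten queries retrieves tuples whose Model suggests a convertible body style, including tuples where Body Style itself is null.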
NOT SURE HOW MUCH OF THIS I’LL ACTUALLY GO OVER. Here we have a diagram of the QPIAD architecture. In the next few slides, I will cover each of the pieces above and then we will come back to this diagram, and hopefully it will make a little more sense by then.
The QPIAD system relies on several statistics to support its ranking and rewriting mechanisms. The first statistic is a form of attribute correlation called an Approximate Functional Dependency (AFD). An AFD is similar to a Functional Dependency (FD) except that it only holds approximately, over a majority of the tuples. AFDs have one or more attributes on the left-hand side of an arrow and a single attribute on the right-hand side. These attributes make up the determining and determined sets respectively. Each AFD has an associated confidence score based on the fraction of tuples for which the AFD holds. To obtain these AFDs, we first gather a sample from the autonomous database. We then run an algorithm called TANE which produces AFDs and AKeys, where AKeys are simply approximate keys. Finally, we prune the set of AFDs which contain AKeys in their determining sets, as these correlations are not likely to be useful for rewriting and/or ranking. The second statistic is in the form of value distributions learned using Naïve Bayes Classifiers. The predictions output by these classifiers are taken as our estimated precision and used in the ranking of rewritten queries. When constructing the classifiers, we use an AFD’s determining-set attributes for feature selection. This helps keep our rewritten queries from becoming too restrictive. The third statistic is used to estimate the selectivity of a rewritten query. This statistic is the product of three things, namely 1) the selectivity of the rewritten query when it is issued on the sample database, 2) the ratio of the original database size to the size of the sample, and 3) the percentage of incomplete tuples we encounter while constructing the sample. The product of these three makes up our estimated selectivity measure, which is also used in the ranking of rewritten queries.
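As a rough illustration of two of these statistics, here is how an AFD confidence score and the estimated selectivity could be computed; the confidence computation is a simple majority-based proxy for what TANE-style mining produces, and all names here are ours:

```python
from collections import Counter, defaultdict

def afd_confidence(tuples, determining, determined):
    """Fraction of tuples consistent with the most common determined value
    for their determining value -- a simple proxy for the confidence score
    of an AFD such as Model ~> Body Style."""
    groups = defaultdict(Counter)
    for t in tuples:
        groups[t[determining]][t[determined]] += 1
    kept = sum(c.most_common(1)[0][1] for c in groups.values())
    return kept / len(tuples)

def estimated_selectivity(sample_selectivity, db_size, sample_size,
                          frac_incomplete):
    """Estimated selectivity of a rewritten query: its selectivity on the
    sample database, times the ratio of database size to sample size,
    times the fraction of incomplete tuples seen while sampling."""
    return sample_selectivity * (db_size / sample_size) * frac_incomplete
```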
On the previous slide, after generating the set of rewritten queries, we turned around and issued them to the autonomous source. However, in real-life scenarios, it’s likely that the source may impose resource limitations restricting the number of rewritten queries we are allowed to issue. Even in cases where such limitations are not explicitly defined, sending too many queries to a source in a short period of time could result in the source blocking the mediator’s IP address, thereby crippling the system. As a result of these limitations, rather than sending the entire set of rewritten queries, a good idea would be to select the top-K queries and issue those. An important point to consider when selecting these top-K queries is the balance between precision and recall (where precision is actually estimated precision based on missing value distributions, and recall is actually estimated recall based on the estimated precision and estimated selectivity of a query). For example, assume a source imposes a 5-query limitation. If we choose the top 5 queries based on their precision alone, we run the risk of retrieving a very low number of tuples, despite the tuples being highly precise. In the worst case, we might end up issuing all 5 queries only to have every one of them return an empty result set. On the other hand, if we issue the queries with the highest recall, the user is not likely to find the results very relevant to their query. In order to achieve a balance between the two (precision and recall), we decided to use an F-Measure based selection function with a configurable alpha parameter. In the IR literature, F-Measure is defined as the weighted harmonic mean of the precision and recall measures. We chose a generalized form of the F-Measure function which allows us to set a parameter, alpha, thereby giving more or less weight to precision and/or recall depending on the type of resource limitations we face or the preferences of the users themselves.
As you can see, by simply setting alpha=0, we can reduce the F-Measure function to take only precision into account. By setting alpha to .5 we give precision twice the weight of recall, and by setting alpha to 2, we give recall twice the weight of precision. Using this approach, we are able to select the top-K rewritten queries we wish to send to the source. One point to note is that although F-Measure is used for selecting the top-K queries, it does not determine the order in which they are sent to the source; the reason is that we still would like to retrieve the most precise tuples first.
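The selection function itself is a one-liner. This sketch assumes the generalized form F = (1 + alpha) * P * R / (alpha * P + R), which is consistent with the alpha settings just described (at alpha = 0 it reduces algebraically to P):

```python
def f_measure(precision, recall, alpha):
    """Generalized F-Measure: F = (1 + alpha) * P * R / (alpha * P + R).
    alpha = 0   -> F reduces to precision alone,
    alpha = 0.5 -> precision carries twice the weight of recall,
    alpha = 2   -> recall carries twice the weight of precision."""
    denom = alpha * precision + recall
    return (1 + alpha) * precision * recall / denom if denom > 0 else 0.0
```

Selecting the top-K rewritten queries is then just sorting the candidates by this score and taking the first K, while the issuing order still follows estimated precision.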
In traditional databases, we only show the users the certain tuples, and hence the user has no reason to doubt the results presented to them. However, when we begin to consider uncertain tuples, the user may be hesitant to fully trust the answers they are shown. Therefore, in order to gain the user’s trust, QPIAD must provide explanations to the user outlining the reasoning behind each answer it provides. So let’s say a user asks the following query. QPIAD will first show the user the certain results. It’s obvious that no explanation is needed here, as these are exact answers to the query. However, once QPIAD shows uncertain results, the user’s trust begins to fade. As a result, QPIAD provides explanations like these to help the user understand the uncertain results. These explanations are derived from the AFDs and missing value probabilities. For incomplete tuples, the values of the determining-set attributes are used to justify the prediction, along with the probability that the missing value is in fact the value the user was looking for. So in the end, QPIAD provides the user with certain answers, relevant uncertain answers, and explanations for these uncertain answers.
One of the features of the QPIAD system is that it allows us to leverage correlations between data sources. Let’s say we have two sources and a user who issues a query “Body=Coupe”. As we can see, the first source, namely Cars.com, supports the body style attribute; however, the second source, Yahoo Autos, does not. Therefore, when issuing this query, Yahoo Autos would be excluded from our source selection. A solution to this problem involves a mediator which simply takes the union of these two local schemas and produces a global schema which is used to answer user queries. Now the user would send their query to the mediator. The mediator would in turn issue the original query to Cars.com and retrieve the certain results. Next, the mediator would use the base set and an AFD to generate a set of rewritten queries which it could then turn around and issue to Yahoo Autos. Because these queries are rewritten, they now place a constraint on Model, an attribute which is supported by the source. Finally, the mediator retrieves these uncertain results from Yahoo Autos and shows them to the user. The benefit here is that rather than completely disregarding Yahoo Autos in our source selection, we can include its results as uncertain answers. Correlated sources can be used in two ways: when one source does not support all the attributes found in the global schema, and when we want to include a source but do not yet have a data sample or statistics from the source.
In addition to correlated sources, the QPIAD system also supports aggregate and join queries. Aggregate Queries: Assume we are given a query which asks us to find out how many cars have “Body=Convt”. The database we are asked the query over is incomplete, and hence the tuples with missing values for Body have an associated probability distribution which we can use to predict the likelihood of various missing values. So one way of computing this aggregate would be to count up all the certain tuples, then count up all the uncertain tuples and include a portion of their count proportional to the probability that each uncertain tuple is in fact what we are looking for. So for this query we could compute the aggregate as 1 for t1, .9 for t2, 1 for t3, and .4 for t4. This gives us an aggregate count of 3.3. Another method of computing this aggregate would be to only include uncertain tuples whose most likely value is the one we are looking for, and in this case include the tuple’s entire count rather than a relative portion as we did before. So for this query we could compute the aggregate as 1 for t1, 1 for t2 (because its most likely value is convertible), and 1 for t3. Notice that here we did not include t4, as its most likely value is not convertible. Here we arrive at an aggregate count of 3. Our experience has shown that the first method tends to over-predict the aggregate value, and therefore we adopted the second approach. In addition to aggregate queries, our system supports joins, but I will refer you to the paper for further details.
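The two ways of computing the aggregate can be sketched as follows; the probability distributions attached to t2 and t4 below reproduce the illustrative numbers from the slide and are not real data:

```python
def expected_count(tuples, attr, value):
    """Method 1: a certain match counts 1; an incomplete tuple contributes
    the probability that its missing value equals the queried value."""
    total = 0.0
    for t in tuples:
        if t.get(attr) == value:
            total += 1.0
        elif t.get(attr) is None:
            total += t["dist"].get(value, 0.0)
    return total

def most_likely_count(tuples, attr, value):
    """Method 2 (the one adopted): an incomplete tuple counts fully if and
    only if its single most likely value equals the queried value."""
    count = 0
    for t in tuples:
        if t.get(attr) == value:
            count += 1
        elif t.get(attr) is None:
            most_likely = max(t["dist"], key=t["dist"].get)
            if most_likely == value:
                count += 1
    return count

cars = [
    {"Body": "Convt"},                                     # t1: certain match
    {"Body": None, "dist": {"Convt": 0.9, "Coupe": 0.1}},  # t2: likely Convt
    {"Body": "Convt"},                                     # t3: certain match
    {"Body": None, "dist": {"Convt": 0.4, "Sedan": 0.6}},  # t4: likely Sedan
]
print(expected_count(cars, "Body", "Convt"))     # ~3.3 (tends to over-predict)
print(most_likely_count(cars, "Body", "Convt"))  # 3
```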
Here we have a screenshot of our current implementation of the QPIAD system. This is the Query Builder page where users can construct their queries using the AJAX-enabled interface. After building their query and issuing it to the system, the users are presented with the Results Navigator page where they can see the certain results, uncertain results, explanations, and even the queries that were used to retrieve each of the tuples.
Our evaluation was performed using three datasets. The first is made up of 55,000 tuples which we scraped from Cars.com. The second contains 200,000 tuples from the Office of Defect Investigation. This dataset contains consumer complaints related to problems consumers may have had with their vehicles, and specifically which parts of the vehicle were involved. The third is made up of 45,000 tuples taken from the United States Census dataset, which we obtained from the UCI repository. In our evaluations, we used a 10% sample for most of the experiments, but we also examined samples ranging from 3-15% of the size of the autonomous database. Throughout our experiments, we started with tuples having no missing values and then artificially introduced null values such that 10% of the tuples were made incomplete. The reason for the artificial introduction of null values was so we would have ground truth against which to compare during the evaluation. We aimed at evaluating three main areas. The first is the performance of our ranking and rewriting methods in terms of quality and efficiency respectively. The second is the robustness of our learning methods with respect to classifier accuracy, variations in sample size, etc. The third is the effectiveness of our extensions such as correlated sources, aggregates, and joins.
First I’ll present the evaluation of our ranking and rewriting techniques. Here we compared QPIAD and the AllReturned approach in terms of the quality of the answers they retrieve. As we can see, both approaches achieve the same levels of recall, but QPIAD maintains a much higher precision while doing so. The reason for this is that QPIAD attempts to retrieve only the relevant tuples, whereas AllReturned retrieves all tuples having missing values on the constrained attributes, many of which are not likely to be relevant to the user’s query.
Next we compared QPIAD and the AllRanked approach in terms of efficiency. By efficiency we are referring to the number of tuples each approach must retrieve to achieve some level of recall. As we can see, QPIAD is able to achieve the same levels of recall as AllRanked but does so while retrieving far fewer tuples. You will notice that the curve for AllRanked is vertical around the 700-tuple mark. The reason for this is that AllRanked must retrieve all tuples with missing values on constrained attributes. Therefore, even to achieve the lowest levels of recall, AllRanked must retrieve the entire set of uncertain tuples relative to the constrained attributes. QPIAD, however, only retrieves a small subset of the uncertain tuples, namely those which are likely to be relevant to the user’s query. Another point to note in terms of efficiency is that QPIAD only ranks the rewritten queries rather than the tuples themselves, because the tuples have the same rank as the query that retrieved them. AllRanked, on the other hand, must rank the full set of uncertain tuples, which as we can see is much larger than the number of rewritten queries generated by QPIAD.
To evaluate the effectiveness of our approach in terms of aggregate queries, we generated a very large set of queries with various combinations of attributes and values. These queries were then issued to the database using two approaches. The first approach simply considers the certain answers and computes the aggregate over them. The second approach uses QPIAD’s missing value prediction and rewriting techniques to include uncertain tuples in the final aggregate result. The aggregates computed by each of these approaches were then compared with the ground truth database, and the accuracies were computed. We can see from this plot that QPIAD’s missing value prediction increases the fraction of queries that achieve higher levels of accuracy. For example, here we see that approximately 20% more queries were able to achieve 100% accuracy when the QPIAD approach is used. For joins, we wanted to evaluate the effect of alpha on precision and recall. Here we used three values for alpha, namely 0, .5, and 2, which correspond to precision only, precision weighted twice recall, and recall weighted twice precision respectively. This plot shows that by adjusting alpha up from zero, we can obtain high levels of recall without sacrificing much precision.
Because our approach relies heavily on the underlying learning methods, we felt an evaluation of them was necessary. In terms of our AFD-enhanced NBC classifiers, we can see that using AFDs for feature selection helps to improve our prediction accuracy in many of the cases. In addition, using the AFDs for feature selection is helpful for maintaining consistency with our rewriting approach. Moreover, we compared our AFD-enhanced classifiers with BayesNet classifiers and found the accuracy to be comparable while being much less costly to learn. Given that these classifiers are built using a sample of the autonomous data source, we wanted to evaluate QPIAD’s robustness w.r.t. the size of the sample we obtain. Here we can see that when varying the sample from 3-15%, the effects on precision are hardly noticeable. We found this to be the case in our experiments on the Census database as well.
Now that we’ve covered our approach to retrieving relevant uncertain results from incomplete autonomous databases, I’m quickly going to cover the related work. Obviously our work fits into the category of Querying Incomplete Databases. Here there are two common approaches, namely the Possible World approaches and the Probabilistic approaches; our work falls into the latter category. Naturally, our work is then similar to Probabilistic Databases, which usually store probabilities associated with each tuple or attribute. However, in our work, the mediator does not have the capability to modify the underlying autonomous databases. Our approach utilizes query rewriting to retrieve the relevant uncertain answers and thus is similar to the work on Query Reformulation / Relaxation. However, the focus of our work is on retrieving tuples with missing values on constrained attributes. Finally, because we are retrieving tuples with missing values on constrained attributes which we later predict, our work is similar to the work on Learning Missing Values. The key point regarding our work is that we require schema-level dependencies between attributes as well as distribution information over missing values.
On the web there exist many autonomous databases which users access via form-based interfaces. When querying these autonomous databases, users often do not clearly define what it is they are looking for. Users may specify queries which are overly general (e.g. a user may ask the query Q:(Model=Civic) when what they really want is a “Civic” with low mileage). Similarly, users may specify queries which are overly specific (e.g. a user may ask the query Q:(Model=Civic) when what they really want is a reliable Japanese car, in which case an “Accord” or “Corolla” may suit their needs). Therefore, in addition to returning the tuples which exactly satisfy the user’s query constraints, we would also like to return tuples with values which are similar to the original query constraints. In addition to posing imprecise queries, another concern is that the data provided by the autonomous databases may be incomplete due to the methods used to populate them. Many autonomous databases are populated by lay web users entering data through forms (e.g. a user trying to sell their car may enter the “Model” as “Civic” and omit the “Make”, assuming it is obvious). Similarly, many autonomous databases are populated using automated extraction techniques (e.g. often these extraction techniques are not able to extract all the desired information, especially when dealing with free-form text as on Craigslist.com). Therefore, in addition to returning the tuples which exactly satisfy the user’s query constraints, we would also like to return tuples which have null/missing values on the constrained attributes but are highly likely to be relevant to the user. Thus, we would like to rewrite the user’s original query in order to retrieve such similar and incomplete tuples.
However, rather than randomly sending the rewritten queries to the autonomous database, we issue them intelligently, such that the tuples they return are likely to be highly relevant to the user (in addition to keeping the network/processing costs manageable). A general solution to this problem is a model we call “Expected Relevance Ranking (ERR)”, which ranks results in order of their expected relevance to the user. Here the ERR model can be defined in terms of Relevance and Density functions. *** CHALLENGE 1 *** This model brings forth our first challenge, namely, how do we automatically and non-intrusively assess the relevance and density functions? Once the ranking model has been established, we must go back and consider how the query rewriting should work in the first place. *** CHALLENGE 2 *** How can we rewrite the user’s original query to bring back both similar and incomplete tuples? After we figure out how to rewrite the user’s query and rank the queries/tuples in order of their expected relevance to the user, we come across a final challenge. *** CHALLENGE 3 *** Given that we are showing the user tuples which do not exactly satisfy the constraints of their query, how can we explain the results in order to gain the user’s trust? NOTE (as you are about to leave the slide): handling autonomous databases naturally brings together challenges that cross the traditional IR and DB boundaries.