Efficient computation of range aggregates


Published on

Dear Students
Ingenious techno Solution offers an expertise guidance on you Final Year IEEE & Non- IEEE Projects on the following domain
For further details contact us:
044-42046028 or 8428302179.

Ingenious Techno Solution
#241/85, 4th floor
Rangarajapuram main road,
Kodambakkam (Power House)

Published in: Education, Technology
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Efficient computation of range aggregates

  1. 1. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 1 Efficient Computation of Range Aggregates against Uncertain Location Based Queries Ying Zhang1 Xuemin Lin1,2 Yufei Tao3 Wenjie Zhang1 Haixun Wang4 1 University of New South Wales, {yingz,lxue, zhangw}@cse.unsw.edu.au 2 NICTA 3 Chinese University of Hong Kong, taoyf@cse.cuhk.edu.hk 4 Microsoft Research Asia, haixunw@microsoft.com Abstract—In many applications, including location based services, queries may not be precise. In this paper, we study the problem of efficiently computing range aggregates in a multidimensional space when the query location is uncertain. Specifically, for a query point Q whose location is uncertain and a set S of points in a multi-dimensional space, we want to calculate the aggregate (e.g., count, average and sum) over the subset S of S such that for each p ∈ S , Q has at least probability θ within the distance γ to p. We propose novel, efficient techniques to solve the problem following the filtering-and-verification paradigm. In particular, two novel filtering techniques are proposed to effectively and efficiently remove data points from verification. Our comprehensive experiments based on both real and synthetic data demonstrate the efficiency and scalability of our techniques. Index Terms—Uncertainty, Index, Range aggregate query ✦ 1 I NTRODUCTION p5 will be destroyed. Similarly, objects p2 , p3 and p6 will be destroyed if the actual falling point is q2 . In this appli- t.c om om Query imprecision or uncertainty may be often caused cation, the risk of civilian casualties may be measured by po t.c by the nature of many applications, including location the total number n of civilian objects which are within γ gs po based services. The existing techniques for processing distance away from a possible blast point with at least lo s .b og location based spatial queries regarding certain query θ probability. Note that the probabilistic threshold is set ts .bl points and data points are not applicable or inefficient ec ts by the commander based on the levels of trade-off that oj c when uncertain queries are involved. In this paper, we pr oje she wants to make between the risk of civilian damages re r investigate the problem of efficiently computing distance and the effectiveness of military attacks; for instance, it is lo rep based range aggregates over certain data points and unlikely to cause civilian casualties if n = 0 with a small xp lo ee xp uncertain query points as described in the abstract. In θ. Moreover, different weight values may be assigned .ie ee general, an uncertain query Q is a multi-dimensional to these target points and hence the aggregate can be w e w .i point that might appear at any location x following conducted based on the sum of the values. w w :// w a probabilistic density function pdf (x) within a region tp //w Q.region. There is a number of applications where a ht ttp: p1 p3 query point may be uncertain. Below are two sample h γ q1 applications. γ a q2 Motivating Application 1. A blast warhead carried by p5 p2 p6 a missile may destroy things by blast pressure waves in Q its lethal area where the lethal area is typically a circular p4 p7 area centered at the point of explosion (blast point) with Q : s h a d o w e d re g i o n t o i n d i c a t e t h e p o s s i b l e l o c a t i o n s o f t h e q u e ry radius γ [24] and γ depends on the explosive used. q1, q 2 : to i n d i c a te tw o p o s s i b l e l o c a ti o n s o f Q While firing such a missile, even the most advanced γ : q u e ry d i s t a n c e laser-guided missile cannot exactly hit the aiming point with 100% guarantee. The actual falling point (blast Fig. 1. Missile Example point) of a missile blast warhead regarding a target point usually follows some probability density functions Motivating Application 2. Similarly, we can also esti- (P DF s); different P DF s have been studied in [24] where mate the effectiveness of a police vehicle patrol route bivariate normal distribution is the simplest and the most using range aggregate against uncertain location based common one [24]. In military applications, firing such query Q. For example, Q in Fig. 1 now corresponds a missile may not only destroy military targets but may to the possible locations of a police patrol vehicle in a also damage civilian objects. Therefore, it is important to patrol route. A spot (e.g., restaurant, hotel, residential avoid the civilian casualties by estimating the likelihood property), represented by a point in {p1 , p2 , . . . , p7 } in of damaging civilian objects once the aiming point of a Fig. 1, is likely under reliable police patrol coverage [11] blast missile is determined. As depicted in Fig. 1, points if it has at least θ probability within γ distance to a {pi } for 1 ≤ i ≤ 7 represent some civilian objects (e.g., moving patrol vehicle, where γ and θ are set by domain residential buildings, public facilities ). If q1 in Fig. 1 is experts. The number of spots under reliable police patrol the actual falling point of the missile, then objects p1 and coverage is often deployed to evaluate the effectivenessDigital Object Indentifier 10.1109/TKDE.2011.46 1041-4347/11/$26.00 © 2011 IEEE
  2. 2. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 2 of the police patrol route. tion to be stored. Both of them can be applied to Motivated by the above applications, in the paper we continuous case and discrete case. study the problem of aggregate computation against the • Extensive experiments are conducted to demon- data points which have at least probability θ to be within strate the efficiency of our techniques. distance γ regarding an uncertain location based query. • While we focus on the problem of range counting for uncertain location based queries in the paper, our Challenges. A naive way to solve this problem is that techniques can be immediately extended to other for each data point p ∈ S, we calculate the probability, range aggregates. namely falling probability, of Q within γ distance to p, The remainder of the paper is organized as follows. select p against a given probability threshold, and then Section 2 formally defines the problem and presents conduct the aggregate. This involves the calculation of preliminaries. In Section 3, following the filtering-and- an integral regarding each p and Q.pdf for each p ∈ S; verification framework, we propose three filtering tech- unless Q.pdf has a very simple distribution (e.g., uniform niques. Section 4 evaluates the proposed techniques with distributions), such a calculation may often be very ex- extensive experiments. Then some possible extensions of pensive and the naive method may be computationally our techniques are discussed in Section 5. This is fol- prohibitive when a large number of data points is in- lowed by related work in Section 6. Section 7 concludes volved. In the paper we target the problem of efficiently the paper. computing range aggregates against an uncertain Q for arbitrary Q.pdf and Q.region. Note that when Q.pdf is a uniform distribution within a circular region Q.region, 2 BACKGROUND I NFORMATION a circular “window” can be immediately obtained ac- We first formally define the problem in Section 2.1, then cording to γ and Q.region so that the computation of Section 2.2 presents the PCR technique [26] which is em- range aggregates can be conducted via the window ployed in the filtering technique proposed in Section 3.3. aggregates [27] over S. t.c om Notation Definition om po t.c Q uncertain location based query Contributions. Our techniques are developed based on gs po S a set of points the standard filtering-and-verification paradigm. We first lo s q instance of an uncertain query Q .b og discuss how to apply the existing probabilistically con- d dimensionality ts .bl Pq the probability of the q to appear strained regions (PCR) technique [26] to our problem. ec ts θ and γ probabilistic threshold and query distance oj c Then, we propose two novel distance based filtering pr oje Pf all (Q, p, γ) the falling probability of p regarding techniques, statistical filtering (STF) and anchor point re r Q and γ lo rep filtering (APF), respectively, to address the inherent lim- Qθ,γ (S) {p|p ∈ S ∧ Pf all (Q, p, γ) ≥ θ} xp lo p, x, y, b(S) point (a set of data points) ee xp its of the PCR technique. The basic idea of the STF e R tree entry .ie ee technique is to bound the falling probability of the points Cp,r a circle(sphere) centred at p with radius r w e w .i by applying some well known statistical inequalities δ(x, y) the distance between x and y w w δmax(min) (r1 , r2 ) :// w where only a small amount of statistic information about the maximal(minimal) distance tp //w between two rectangular regions the uncertain location based query Q is required. The ht ttp: gQ mean of Q STF technique is simple and space efficient (only d + 2 ηQ weighted average distance of Q h float numbers required where d denotes the dimension- σQ variance of Q arbitrarily small positive constant value ality), and experiments show that it is effective. For the a anchor point scenarios where a considerable “large” space is available, nap the number of anchor points we propose a view based filter which consists of a set of LPf all (p, γ) lower bound of the Pf all (p, γ) anchor points. An anchor point may reside at any location U Pf all (p, γ) upper bound of the Pf all (p, γ) nd the number of different distances and its falling probability regarding Q is pre-computed pre-computed for each anchor point for several γ values. Then many data points might be Da a set of distance values used by effectively filtered based on their distances to the anchor anchor point a points. For a given space budget, we investigate how to TABLE 1 construct the anchor points and their accessing orders. The summary of notations. To the best of our knowledge, we are the first to identify the problem of computing range aggregates against uncertain location based query. In this paper, we 2.1 Problem Definition investigate the problem regarding both continuous and In the paper, S is a set of points in a d-dimensional discrete Q. Our principle contributions can be summa- numerical space. The distance between two points x and rized as follows. y is denoted by δ(x, y). Note that techniques developed in the paper can be applied to any distance metrics [5]. • We propose two novel filtering techniques, STF and In the examples and experiments, the Euclidean distance APF, respectively. The STF technique has a decent is used. For two rectangular regions r1 and r2 , we have filtering power and only requires the storage of very δmax (r1 , r2 ) = max∀x∈r1 ,y∈r2 δ(x, y) and limited pre-computed information. APF provides the flexibility to significantly enhance the filtering 0 if r1 ∩ r2 = ∅ δmin (r1 , r2 ) = (1) power by demanding more pre-computed informa- min∀x∈r1 ,y∈r2 δ(x, y) otherwise
  3. 3. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 3 An uncertain (location based) query Q may be de- max, sum, avg, etc., over some non-locational attributes scribed by a continuous or a discrete distribution as fol- (e.g., weight value of the object in missile example). lows. Example 1. In Fig. 2, S = {p1 , p2 , p3 } and Q = {q1 , q2 , q3 } Definition 1 (Continuous Distribution). An uncertain where Pq1 = 0.4, Pq2 = 0.3 and Pq3 = 0.3. According query Q is described by a probabilistic density function Q.pdf . to Definition 3, for the given γ, we have Pf all (p1 , γ) = Let Q.region represent the region where Q might appear, then 0.4, Pf all (p2 , γ) = 1, and Pf all (p3 , γ) = 0.6. Therefore, x∈Q.region Q.pdf (x)dx = 1; Qθ,γ (S) = {p2 , p3 } if θ is set to 0.5, and hence |Qθ,γ (S)| = 2. Definition 2 (Discrete Distribution). An uncertain query Q consists of a set of instances (points) {q1 , q2 , . . . , qn } in a d- 2.2 Probabilistically Constrained Regions (PCR) dimensional numerical space where qi appears with probability Pqi , and q∈Q Pq = 1; In [26], Tao et al. study the problem of range query on uncertain objects, in which the query is a rectangular Note that, in Section 5 we also cover the applications window and the location of each object is uncertain. where Q can have a non-zero probability to be absent; Although the problem studied in [26] is different with that is, x∈Q.region Q.pdf (x)dx = c or q∈Q Pq = c for a the one in this paper, in Section 3.3 we show how c < 1. to modify the techniques developed in [26] to support For a point p, we use Pf all (Q, p, γ) to represent the uncertain location based query. probability of Q within γ distance to p, called falling In the following part, we briefly introduce the Prob- probability of p regarding Q and γ. It is formally defined abilistically Constrained Region (PCR) technique devel- below. oped in [26]. Same as the uncertain location based query, For continuous cases, an uncertain object U is modeled by a probability density function U.pdf (x) and an uncertain region U.region. Pf all (Q, p, γ) = Q.pdf (x)dx (2) x∈Q.region ∧ δ(x,p)≤γ The probability that the uncertain object U falls in the rectangular window query rq , denoted by Pf all (U, rq ), t.c om om For discrete cases, is defined as x∈U.region∩rq U.pdf (x)dx. In [26], the prob- po t.c gs po Pf all (Q, p, γ) = Pq (3) abilistically constrained region of the uncertain object lo s .b og q∈Q ∧ δ(q,p)≤γ U regarding probability θ (0 ≤ θ ≤ 0.5), denoted by ts .bl U.pcr(θ), is employed in the filtering technique. Partic- ec ts In the paper hereafter, Pf all (Q, p, γ) is abbreviated to oj c ularly, U.pcr(θ) is a rectangular region constructed as pr oje Pf all (p, γ), and Q.region and Q.pdf are abbreviated to Q follows. re r lo rep and pdf respectively, whenever there is no ambiguity. It For each dimension i, the projection of U.pcr(θ) xp lo is immediate that Pf all (p, γ) is a monotonically increas- [U.pcri− (θ), U.pcri+ (θ)] ee xp is denoted by where .ie ee ing function with respect to distance γ. x∈U.region&xi ≤U.pcri− (θ) U.pdf (x)dx = θ and w e w .i x∈U.region&xi ≥U.pcri+ (θ) U.pdf (x)dx = θ. Note that w w γ :// w γ xi represents the coordinate value of the point x tp //w p1 q1 on i-th dimension. Then U.pcr(θ) corresponds to ht ttp: Q q2 p3 a rectangular region [U.pcr− (θ), U.pcr+ (θ)] where h p2 q3 U.pcr− (θ) (U.pcr+ (θ)) is the lower (upper) corner and the coordinate value of U.pcr− (θ) (U.pcr+ (θ)) on i-th γ dimension is U.pcri− (θ) (U.pcri+ (θ)). Fig. 3(a) illustrates the U.pcr(0.2) of the uncertain object U in 2 dimensional space. Therefore, the probability mass of U on the left Fig. 2. Example of Pf all (Q, p, γ) (right) side of l1− (l1+ ) is 0.2 and the probability mass of Problem Statement. U below (above) the l2− (l2+ ) is 0.2 as well. Following In many applications, users are only interested in the is a motivating example of how to derive the lower and points with falling probabilities exceeding a given prob- upper bounds of the falling probability based on PCR. abilistic threshold regarding Q and γ. In this paper we Example 2. According to the definition of PCR, in Fig. 3(b) investigate the problem of probabilistic threshold based the probabilistic mass of U in the shaded area is 0.2, i.e., uncertain location range aggregate query on points data; x∈U.region&x1 ≥U.pcr1+ (θ) U.pdf (x)dx = 0.2. Then, it is im- it is formally defined below. mediate that Pf all (U, rq1 ) < 0.2 because rq1 does not intersect Definition 3. [Uncertain Range Aggregate Query] Given U.pcr(0.2). Similarly, we have Pf all (U, rq2 ) ≥ 0.2 because the a set S of points, an uncertain query Q, a query distance shaded area is enclosed by rq2 . γ and a probabilistic threshold θ, we want to compute an The following theorem [26] formally introduces how aggregate function (e.g., count, avg, and sum) against points to prune or validate an uncertain object U based on p ∈ Qθ,γ (S), where Qθ,γ (S) denotes a subset of points U.pcr(θ) or U.pcr(1 − θ). Note that we say an uncertain {p} ⊆ S such that Pf all (p, γ) ≥ θ. object is pruned (validated) if we can claim Pf all (U, rq ) < θ In this paper, our techniques will be presented based (Pf all (U, rq ) ≥ θ) based on the P CR. on the aggregate count. Nevertheless, they can be imme- Theorem 1. Given an uncertain object U , a range query rq diately extended to cover other aggregates, such as min, (rq is a rectangular window) and a probabilistic threshold θ.
  4. 4. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 4 shows that techniques proposed in this section can be U region immediately applied to the discrete case. rq 2 l 2+ 3.1 A framework for filtering-and-verification Algo- rithm l 2- In this subsection, following the filtering-and-verification paradigm we present a general framework to support l 1+ uncertain range aggregate query based on the filtering l 1- rq1 technique. To facilitate the aggregate query computation, U .pcr(0.2) U .pcr(0.2) l 1+ we assume a set S of points is organized by an aggregate (a) PCR (b) PCR based filtering R-Tree [22], denoted by RS . Note that an entry e of RS might be a data entry or an intermediate entry Fig. 3. A 2d probabilistically constrained region (PCR (0.2)) where a data entry corresponds to a point in S and an 1) For θ > 0.5, U can be pruned if rq does not fully contain intermediate entry groups a set of data entries or child U.pcr(1 − θ); intermediate entries. Assume a filter, denoted by F , is 2) For θ ≤ 0.5, the pruning condition is that rq does not available to prune or validate a data entry (i.e., a point) intersect U.pcr(θ); or an intermediate entry (i.e., a set of points). 3) For θ > 0.5, the validating criterion is that rq com- Algorithm 1 illustrates the framework of the filtering- pletely contains the part of Umbb on the right (left) of and-verification Algorithm. Note that details of the fil- plane U.pcri− (1−θ) (U.pcri+ (1−θ)) for some i ∈ [1, d], tering techniques will be introduced in the following where Umbb is the minimal bounding box of uncertain subsections. The algorithm consists of two phases. In the region U.region; filtering phase (Line 3-16), for each entry e of RS to be t.c om 4) For θ ≤ 0.5 the validating criterion is that rq completely processed, we do not need to further process e if it is om po t.c contains the part of Umbb on the left (right) of plane pruned or validated by the filter F . We say an entry e is gs po U.pcri− (θ) (U.pcri+ (θ)) for some i ∈ [1, d]; pruned (validated) if the filter can claim Pf all (p, γ) < θ lo s .b og (Pf all (p, γ) ≥ θ) for any point p within embb . The counter ts .bl 3 Filtering-and-Verification A LGORITHM cn is increased by |e| (Line 6) if e is validated where ec ts oj c |e| denotes the aggregate value of e (i.e., the number pr oje According to the definition of falling probability (i.e., of data points in e). Otherwise, the point p associated re r Pf all (p, γ)) in Equation 2, the computation involves in- lo rep with e is a candidate point if e corresponds to a data xp lo tegral calculation, which may be expensive in terms of entry (Line 10), and all child entries of e are put into the ee xp CPU cost. Based on Definition 3, we only need to know .ie ee queue for further processing if e is an intermediate entry whether or not the falling probability of a particular point w e (Line 12). The filtering phase terminates when the queue w .i w w regarding Q and γ exceeds the probabilistic threshold :// w is empty. In the verification phase (Line 17-21), candidate tp //w for the uncertain aggregate range query. This motivates points are verified by the integral calculations according ht ttp: us to follow the filtering-and-verification paradigm for the to Equation 2. h uncertain aggregate query computation. Particularly, in the filtering phase, effective and efficient filtering tech- Cost Analysis. The total time cost of Algorithm 1 is as niques will be applied to prune or validate the points. We follows. say a point p is pruned (validated) regarding the uncertain Cost = Nf × Cf + Nio × Cio + Nca × Cvf (4) query Q, distance γ and probabilistic threshold θ if we can claim that Pf all (p, γ) < θ ( Pf all (p, γ) ≥ θ ) based on Particularly, Nf represents the number of entries being the filtering techniques without explicitly computing the tested by the filter on Line 5 and Cf is the time cost Pf all (p, γ). The points that cannot be pruned or validated for each test. Nio denotes the number of nodes (pages) will be verified in the verification phase in which their accessed (Line 13) and Cio corresponds to the delay of falling probabilities are calculated. Therefore, it is desirable each node (page) access of RS . Nca represents the size to develop effective and efficient filtering techniques to of candidate set C and Cvf is the computation cost for prune or validate points such that the number of points each verification (Line 15) in which numerical integral being verified can be significantly reduced. computation is required. With a reasonable filtering time In this section, we first present a general framework cost (i.e., Cvf ), the dominant cost of Algorithm 1 is for the filtering-and-verification Algorithm based on fil- determined by Nio and Nca because Cio and Cvf might tering techniques in Section 3.1. Then a set of filtering be expensive. Therefore, in the paper we aim to develop techniques are proposed. Particularly, Section 3.2 pro- effective and efficient filtering techniques to reduce Nca poses the statistical filtering technique. Then we investi- and Nio . gate how to apply the PCR based filtering technique in Filtering. Suppose there is no filter F in Algorithm 1, Section 3.3. Section 3.4 presents the anchor point based all points in S will be verified. Regarding the example filtering technique. in Fig. 4, 5 points p1 , p2 , p3 , p4 and p5 will be veri- For presentation simplicity, we consider the continuous fied. A straitforward filtering technique is based on the case of the uncertain query in this section. Section 3.5 minimal and maximal distances between the minimal
  5. 5. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 5 Algorithm 1 Filtering-and-Verification(RS , Q, F , γ, θ) in Algorithm 1 is significantly increased. In the following subsections, we present three filtering techniques, named Input: RS : an aggregate R tree on data set S, STF, PCR and APF respectively, which can significantly Q : uncertain query, F : Filter, γ : query distance, enhance the filtering capability of the filter. θ : probabilistic threshold. Output: |Qθ,γ (S)| 3.2 Statistical Filter Description: 1: Queue := ∅; cn := 0; C := ∅; In this subsection, we propose a statistical filtering tech- 2: Insert root of RS into Queue; nique, namely STF. After introducing the motivation 3: while Queue = ∅ do of the technique, we present some important statistic 4: e ← dequeue from the Queue; information of the uncertain query and then show how 5: if e is validated by the filter F then to derive the lower and upper bounds of the falling 6: cn := cn + |e|; probability of a point regarding an uncertain query Q, 7: else distance γ and probabilistic threshold θ. 8: if e is not pruned by the filter F then Motivation. As shown in Fig. 5, given an uncertain 9: if e is data entry then query Q1 and γ we cannot prune point p based on the 10: C := C ∪ p where p is the data point e MMD technique, regardless of the value of θ, although represented; intuitively the falling probability of p regarding Q1 is 11: else likely to be small. Similarly, we cannot validate p for Q2 . 12: put all child entries of e into Queue; This motivates us to develop a new filtering technique 13: end if which is as simple as MMD, but can exploit θ to en- 14: end if hance the filtering capability. In the following part, we 15: end if show that lower and upper bounds of Pf all (p, γ) can be 16: end while derived based on some statistics of the uncertain query. t.c om om 17: for each point p ∈ C do Then a point may be immediately pruned (validated) po t.c 18: if Pf all (Q, p, γ) ≥ θ then based on the upper(lower) bound of Pf all (p, γ), denoted gs po 19: cn := cn + 1; by U Pf all (p, γ) (LPf all (p, γ)). lo s .b og 20: end if Example 4. In Fig. 5 suppose θ = 0.5 and we have ts .bl U Pf all (Q1 , p, γ) = 0.4 (LPf all (Q2 , p, γ) = 0.6) based on ec ts 21: end for oj c pr oje 22: Return cn the statistical bounds, then p can be safely pruned (vali- re r dated) without explicitly computing its falling probability lo rep bounding boxes(MBBs) of an entry and the uncertain regarding Q1 (Q2 ). Regarding the running example in Fig.4, xp lo query. Clearly, for any θ we can safely prune an entry if ee xp suppose θ = 0.2 and we have U Pf all (p2 , γ) = 0.15 while δmin (Qmbb , embb ) > γ or validate it if δmax (Qmbb , embb ) ≤ .ie ee U Pf all (pi , γ) ≥ 0.2 for 3 ≤ i ≤ 5, then p2 is pruned. There- w e γ. We refer this as maximal/minimal distance based filter- w .i fore, three points (p3 , p4 and p5 ) are verified in Algorithm 1 w w ing technique, namely MMD. MMD technique is time ef- :// w when MMD and statistical filtering techniques are applied. tp //w ficient as it takes only O(d) time to compute the minimal ht ttp: and maximal distances between Qmbb and embb . Recall Q1 γ h that Qmbb is the minimal bounding box of Q.region. Q2 g Q1 p g Q2 p1 p5 p4 Fig. 5. Motivation Example p2 Statistics of the uncertain query p3 To apply the statistical filtering technique, the follow- ing statistics of the uncertain query Q are pre-computed. Q : u n c er t ai n l oc at i on b as ed r an g e q u er y Definition 4 (mean (gQ )). gQ = x∈Q x × Q.pdf (x)dx. Fig. 4. Running Example Definition 5 (weighted average distance (ηQ )). ηQ equals Example 3. As shown in Fig. 4, suppose the MMD filtering x∈Q δ(x, gQ ) × Q.pdf (x)dx technique is applied in Algorithm 1, then p1 is pruned and Definition 6 (variance (σQ )). σQ equals x∈Q δ(x, gQ )2 × the other 4 points p2 , p3 , p4 and p5 will be verified. Q.pdf (x)dx Although the MMD technique is very time efficient, its Derive lower and upper bounds of Pf all (p, γ). filtering capacity is limited because it does not make use For a point p ∈ S, the following theorem shows how of the distribution information of the uncertain query Q to derive the lower and upper bounds of Pf all (p, γ) and the probabilistic threshold θ. This motivates us to de- based on above statistics of Q. Then, without explicitly velop more effective filtering techniques based on some computing Pf all (p, γ), we may prune or validate the point pre-computations on the uncertain query Q such that the p based on U Pf all (p, γ) and LPf all (p, γ) derived based on number of entries (i.e., points ) being pruned or validated the statistics of Q.
  6. 6. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 6 Theorem 2. Given an uncertain query Q and a distance γ, σ = V ar(Y ), according to Lemma 1 when γ > μ we and suppose the mean gQ , weighted average distance ηQ and have variance σQ of Q are available. Then for a point p, we have 1 1 P r(Y ≤ γ) ≤ P (Y ≥ γ ) ≤ (6) 1) If γ > μ1 , Pf all (p, γ) ≥ 1 − (γ−μ1 )2 , where μ1 = (γ −μ )2 1+ σ2 1+ 2 σ1 2 2 δ(gQ , p) + ηQ and σ1 = σQ − ηQ + 4ηQ × δ(gQ , p). Because values of μ, σ 2 , μ and σ 2 may change re- 1 2) If γ < δ(gQ , p) − ηQ − , Pf all (p, γ) ≤ (γ −μ2 )2 , garding different point p ∈ S, we cannot pre-compute 1+ 2 σ2 them. Nevertheless, in the following part we show that where μ2 2 =Δ+ 2 = σQ − ηQ , σ2 2 + 4ηQ × Δ, Δ = ηQ their upper bounds can be derived based on the statistic γ + γ + − δ(p, gQ ) and γ > 0. The represents an information of the Q, which can be pre-computed based arbitrarily small positive constant value. on the probabilistic distribution of Q. Before the proof of Theorem 2, we first introduce the p Cantelli’s Inequality [19] described by Lemma 1 which is Q γ one-sided version of the Chebyshev Inequality. Lemma 1. Let X be an univariate random variable with the expected value μ and the finite variance σ 2 . Then for any 1 C > 0, P r(X − μ ≥ C × σ) ≤ 1+C 2 . gQ ε Following is the proof of Theorem 2. Proof: Intuition of the Proof. For a given point p ∈ S, p γ its distance to Q can be regarded as an univariate random variable Y , and we have Pf all (p, γ) = P r(Y ≤ γ). Given Fig. 6. Proof of Upper bound γ, we can derive the lower and upper bounds of P r(Y ≤ Based on the triangle inequality, for any x ∈ Q we γ) (Pf all (p, γ)) based on the statistical inequality in have δ(x, p) ≤ δ(x, gQ )+δ(p, gQ ) and δ(x, p) ≥ | δ(x, gQ )− t.c om om Lemma 1 if the expectation (E(Y )) and variance(V ar(Y )) δ(p, gQ ) | for any x ∈ Q. Then we have po t.c of the random variable Y are available. Although E(Y ) and V ar(Y ) take different values regarding different μ =gs po y × Y.pdf (y)dy = δ(x, p) × pdf (x)dx lo s .b og points, we show that the upper bounds of E(Y ) and y∈Y x∈Q ts .bl ec ts V ar(Y ) can be derived based on mean(gQ ), weighted ≤ (δ(p, gQ ) + δ(x, gQ )) × pdf (x)dx oj c pr oje average distance (ηQ ) and variance(σQ ) of the query Q. x∈Q re r ≤ δ(gQ , p) + ηQ = μ1 lo rep Then, the correctness of the theorem follows. xp lo Details of the Proof. The uncertain query Q is a ran- ee xp and dom variable which equals x ∈ Q.region with prob- .ie ee ability Q.pdf (x). For a given point p, let Y denote σ2 = E(Y 2 ) − E 2 (Y ) w e w .i w w the distance distribution between p and Q; that is, :// w Y is an univariate random variable and Y.pdf (l) = ≤ (δ(gQ , p) + δ(x, gQ ))2 pdf (x)dx tp //w x∈Q ht ttp: x∈Q.region and δ(x,p)=l Q.pdf (x)dx for any l ≥ 0. Conse- −(δ(gQ , p) − ηQ )2 h quently, we have Pf all (p, γ) = P r(Y ≤ γ) according to Equation 2. Let μ = E(Y ), σ 2 = V ar(Y ) and C = γ−μ , σ = 2 δ(gQ , p) × δ(x, gQ ) × pdf (x)dx then based on lemma 1, if γ > μ we have x∈Q 1 + δ(x, gQ )2 × pdf (x)dx P r(Y ≥ γ) = P r(Y − μ ≥ C × σ) ≤ 1+ ( γ−μ )2 σ x∈Q 2 +2 × δ(gQ , p) × ηQ − ηQ 2 2 Then it is immediate that = σQ − ηQ + 4ηQ × δ(gQ , p) = σ1 1 Together with Inequality 5, we have P r(Y ≤ γ) ≥ P r(Y ≤ γ) ≥ 1 − P r(Y ≥ γ) ≥ 1 − (γ−μ)2 (5) 1+ 1 1 σ2 1− (γ−μ)2 ≥ 1− (γ−μ1 )2 if μ1 < γ. With similar 1+ σ2 1+ 2 σ1 According to Inequation 5 we can derive the lower rationale, let Δ = δ(gQ , p ) = γ + γ + − δ(p, gQ ) we bound of Pf all (p, γ). Next, we show how to derive have μ ≥ Δ + ηQ = μ2 and σ 2 ≤ σQ − ηQ + 4ηQ × 2 upper bound of Pf all (p, γ). As illustrated in Fig. 6, 2 Δ = σ2 . Based on Inequality 6, we have P r(Y ≤ γ) ≤ let p denote a dummy point on the line pgQ with 1 1 (γ −μ )2 ≤ (γ −μ2 )2 if γ < δ(gQ , p)− ηQ − . Therefore, δ(p , p) = γ + γ + where is an arbitrarily small 1+ σ 2 1+ 2 σ2 positive constant value. Similar to the definition of the correctness of the theorem follows. Y , let Y be the distance distribution between p and The following extension is immediate based on the Q; that is, Y is an univariate random variable where similar rationale of Theorem 2. Y .pdf (l) = x∈Q.region and δ(x,p )=l Q.pdf (x)dx for any Extension 1. Suppose r is a rectangular region, we can l ≥ 0. Then, as shown in Fig. 6, for any point x ∈ Q use δmin (r, gQ ) and δmax (r, gQ ) to replace δ(p, gQ ) in with δ(x, p ) ≤ γ (shaded area), we have δ(x, p) > γ. This Theorem 2 for lower and upper probabilistic bounds implies that P (Y ≤ γ) ≤ P (Y ≥ γ ). Let μ = E(Y ) and computation respectively.
  7. 7. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 7 Based on Extension 1, we can compute the upper Q.pcr(0.2) according to Theorem 1 in Section 2.2. Conse- and lower bounds of Pf all (embb , γ) where embb is the quently, only p3 and p5 go to the verification phase when minimal bounding box of the entry e, and hence prune Q.pcr(0.2) is available. or validate e in Algorithm 1. Since gQ , ηQ and σQ are Same as [26], [28], a finite number of P CRs are pre- pre-computed, the dominant cost in filtering phase is computed for the uncertain query Q regarding different the distance computation between embb and gQ which probability values. For a given θ at query time, if the is O(d). Q.pcr(θ) is not pre-computed we can choose two pre- computed P CRs Q.pcr(θ1 ) and Q.pcr(θ2 ) where θ1 (θ2 ) is the largest (smallest) existing probability value smaller 3.3 PCR based Filter (larger) than θ. We can apply the modified PCR tech- Motivation. Although the statistical filtering technique nique as the filter in Algorithm 1, and the filtering time can significantly reduce the candidate size in Algo- regarding each entry tested is O(m+log(m)) in the worst rithm 1, the filtering capacity is inherently limited be- case , where m is the number of P CRs pre-computed by cause only a small amount of statistics are employed. the filter. This motivates us to develop more sophisticated filtering The PCR technique can significantly enhance the filter- techniques to further improve the filtering capacity; that ing capacity when a particular number of PCR s are pre- is, we aim to improve the filtering capacity with more computed. The key of the PCR filtering technique is to pre-computations (i.e., more information kept for the partition the uncertain query along each dimension. This filter). In this subsection, the PCR technique proposed may inherently limit the filtering capacity of the PCR in [26] will be modified for this purpose. based filtering technique. As shown in Fig. 7, we have to use two rectangular regions for pruning and validation R+ , p C p, purpose, and hence the Cp,γ is enlarged (shrunk) during Q region the computation. As illustrated in Fig. 7, all instances of t.c om Q in the striped area is counted for Pf all (p, γ) regarding om po t.c p R+,p , while all of them have distances larger than γ. Sim- gs po ilar observation goes to R−,p . This limitation is caused lo s .b og Q . pcr ( 0 .4 ) by the transformation, and cannot be remedied by in- ts .bl creasing the number of P CRs. Our experiments also ec ts R oj c ,p confirm that the PCR technique cannot take advantage of pr oje Fig. 7. Transform query re r the large index space. This motivates us to develop new lo rep filtering technique to find a better trade-off between the xp lo PCR based Filtering technique. The PCR technique ee xp filtering capacity and pre-computation cost (i.e., index proposed in [26] cannot be directly applied for filtering .ie ee size). w e in Algorithm 1 because the range query studied in [26] w .i w w is a rectangular window and objects are uncertain. Nev- :// w tp //w ertheless we can adapt the PCR technique as follows. 3.4 Anchor Points based Filter ht ttp: As shown in Fig. 7, let Cp,γ represent the circle (sphere) h centered at p with radius γ. Then we can regard the The anchor (pivot) point technique is widely employed uncertain query Q and Cp,γ as an uncertain object and in various applications, which aims to reduce the query the range query respectively. As suggested in [28], we computation cost based on some pre-computed anchor can use R+,p (mbb of Cp,γ ) and R−,p (inner box) as (pivot) points. In this subsection, we investigate how to shown in Fig. 7 to prune and validate the point p based apply anchor point technique to effectively and efficiently on the P CRs of Q respectively. For instance, if θ = 0.4 reduce the candidate set size. Following is a motivating the point p in Fig. 7 can be pruned according to case 2 example for the anchor point based filtering technique. of Theorem 1 because R1 does not intersect Q.pcr(0.4). = 0.2 Note that similar transformation can be applied for the p1 intermediate entries as well. p5 p4 R+ , p1 R+ , p3 p1 p2 R+ , p5 o p3 d p3 p5 R+ , p2 Q.mbb Co ,d p4 p2 R+ , p4 Fig. 9. Running example regarding the anchor point Q.mbb Q.pcr(0.2) Motivating Example. Regarding our running example, Fig. 8. Running example in Fig. 9 the shaded area, denoted by Co,d , is the circle centered at o with radius d. Suppose the probabilistic Example 5. Regarding the running example in Fig. 8, mass of Q in Co,d is 0.8, then when θ = 0.2 we can safely suppose Q.pcr(0.2) is pre-computed, then p1 , p2 and p4 prune p1 , p2 , p3 and p4 because Cpi ,γ does not intersect are pruned because R+,p1 , R+,p2 and R+,p4 do not overlap Co,d for i = 1, 2, 3 and 4.
  8. 8. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 8 In the paper, an anchor point a regarding the uncertain Ca,δ(a,p)−γ− ⊆ Ca,δ(a,p)+γ and Cp,γ ∩ Cp,δ(a,p)−γ− = ∅, query Q is a point in multidimensional space whose this implies that Pf all (p, γ) ≤ Pf all (a, δ(a, p) + γ) − falling probability against different γ values are pre- Pf all (a, δ(a, p) − γ − ). computed. We can prune or validate a point based on its distance to the anchor point. For better filtering capability, Let LPf all (p, γ) and U Pf all (p, γ) denote the lower a set of anchor points will be employed. and upper bounds derived from Lemma 2 regarding In the following part, Section 3.4.1 presents the anchor Pf all (p, γ). Then we can immediately validate a point p if point filtering technique. In Section 3.4.2, we investigate LPf all (p, γ) ≥ θ, or prune p if U Pf all (p, γ) < θ. how to construct anchor points for a given space budget, Clearly, it is infeasible to keep Pf all (a, l) for arbitrary followed by a time efficient filtering algorithm in Sec- l ≥ 0. Since Pf all (a, l) is a monotonic function with tion 3.4.3. respect to l, we keep a set Da = {li } with size nd for each anchor point such that Pf all (a, li ) = nid for 1 ≤ i ≤ nd . 3.4.1 Anchor Point filtering technique (APF) Then for any l > 0, we use U Pf all (a, l) and LPf all (a, l) For a given anchor point a regarding the uncertain query to represent the upper and lower bound of Pf all (a, l) Q, suppose Pf all (a, l) is pre-computed for arbitrary dis- respectively. Particularly, U Pf all (a, l) = Pf all (a, li ) where tance l. Lemma 2 provides lower and upper bounds li is the smallest li ∈ Da such that li ≥ l. Similarly, of Pf all (p, γ) for any point p based on the triangle LPf all (a, l) = Pf all (a, lj ) where lj is the largest lj ∈ Da inequality. This implies we can prune or validate a point such that lj ≤ l. Then we have the following theorem by based on its distance to an anchor point. rewriting Lemma 2 in a conservative way. Theorem 3. Given an uncertain query Q and an anchor point Q S2 a, for any rectangular region r and distance γ, we have: ε 1) If γ > δmax (a, r), Pf all (r, γ) ≥ LPf all (a, γ − a p γ γ δmax (a, r)). t.c om om a p 2) Pf all (r, γ) ≤ U Pf all (a, δmax (a, r) + γ) po t.c −LPf all (a, δmin (a, r) −γ − ) where is an arbitrarily gs po Q γ − δ (a , p ) S1 lo ssmall positive value. .b og ts .bl Let LPf all (r, γ) and U Pf all (r, γ) represent the lower ec ts (a) Lower Bound (b) Upper Bound oj c and upper bounds of the falling probability derived from pr oje Fig. 10. Lower and Upper Bound Theorem 3. We can safely prune (validate) an entry e if re r lo rep Lemma 2. Let a denote an anchor point regarding the U Pf all (embb , γ) < θ (LPf all (embb , γ) ≥ θ). Recall that embb xp lo represents the minimal bounding box of e. It takes O(d) ee xp uncertain query Q. For any point p ∈ S and a distance γ, we time to compute δmax (a, embb ) and δmin (a, embb ). Mean- .ie ee have w e while, the computation of LPf all (a, l) and U Pf all (a, l) for w .i 1) If γ > δ(a, p), Pf all (p, γ) ≥ Pf all (a, γ − δ(a, p)). w w any l > 0 costs O(log nd ) time because pre-computed :// w 2) Pf all (p, γ) ≤ Pf all (a, δ(a, p) + γ) − Pf all (a, δ(a, p) − tp //w distance values in Da are sorted. Therefore, the filtering γ − ) where is an arbitrarily small positive value. 1 ht ttp: time of each entry is O(d + log nd ) for each anchor point. h Proof: Suppose γ > δ(a, p), then according to the tri- angle inequality for any x ∈ Q with δ(x, a) ≤ γ − δ(a, p), 3.4.2 Heuristic with a finite number of anchor points we have δ(x, p) ≤ δ(a, p)+δ(x, a) ≤ δ(a, p)+(γ−δ(a, p)) = Let AP denote a set of anchor points for the uncertain γ. This implies that Pf all (p, γ) ≥ Pf all (a, γ − δ(a, p)) query Q. We do not need to further process an entry e in according to Equation 2. Fig. 10(a) illustrates an example Algorithm 1 if it is filtered by any anchor point a ∈ AP. of the proof in 2 dimensional space. In Fig. 10(a), we have Intuitively, the more anchor points employed by Q, the Ca,γ−δ(a,p) ⊆ Cp,γ if γ > δ(a, p). Let S denote the striped more powerful the filter will be. However, we cannot area which is the intersection of Ca,γ−δ(a,p) and Q. employ a large number of anchor points due to the space Clearly, we have Pf all (a, γ − δ(a, p)) = x∈S Q.pdf (x)dx and filtering time limitations. Therefore, it is important and δ(x, p) ≤ γ for any x ∈ S. Consequently, Pf all (p, γ) to investigate how to choose a limited number of anchor ≥ Pf all (a, γ − δ(a, p)) holds. points such that the filter can work effectively. With similar rationale, for any x ∈ Q we have δ(x, a) ≤ δ(a, p) + γ if δ(x, p) ≤ γ. This implies Anchor points construction. We first investigate how that Pf all (p, γ) ≤ Pf all (a, δ(a, p) + γ). Moreover, for to evaluate the “goodness” of an anchor point regard- any x ∈ Q with δ(x, a) ≤ δ(a, p) − γ − , we have ing the computation of LPf all (p, γ). Suppose all anchor δ(x, a) > γ. Recall that represents an arbitrarily small points have the same falling probability functions; that constant value. This implies that x does not contribute is Pf all (ai , l) = Pf all (aj , l) for any two anchor points to Pf all (p, γ) if δ(x, a) ≤ δ(a, p) − γ − . Consequently, ai and aj . Then the closest anchor point regarding p Pf all (p, γ) ≤ Pf all (a, δ(a, p) + γ) − Pf all (a, δ(a, p) − γ − ) will provide the largest LPf all (p, γ). Since there is no a holds. As shown in Fig. 10(b), we have Pf all (p, γ) ≤ priori knowledge about the distribution of the points, we Pf all (a, δ(a, p) + γ) because Cp,γ ⊆ Ca,δ(a,p)+γ . Since assume they follow the uniform distribution. Therefore, anchor points should be uniformly distributed. If falling 1. We have Pf all (a, δ(a, p) − γ − ) = 0 if δ(a, p) ≤ γ probabilistic functions of the anchor points are different,