Powerpoint Slides


  • Talk about the traditional database infrastructure being client-server: the client sends either a data request or a query to the server, and the server returns the data or the query answer. The client may cache data to answer its own queries. Optimizations such as when the client should cache and use data shipping, versus when the server should execute the query, have been studied extensively in research. Data producers were always ignored. We cannot ignore data producers anymore: they produce data faster than it is feasible to store or consume, which has also led to stream data research. Producer examples: sensors, mobile nodes, satellites, routers in a network management architecture, etc. Producers may have storage, communication, and computational power.
  • Architecture questions include where data gets stored and where computation occurs. Many possibilities for storage: (1) do not store at all -- compute entirely at the producers or at the centralized server and let the data flow through (stream model); weakness: data sometimes needs to be archived. (2) store at the servers only -- more traditional, but problematic since there is too much data. (3) keep the freshest data at the sensors while the server caches some parts of the data. Computation can happen at the server, the sensor, or the client, or collaboratively. Each possibility has pros and cons.
  • Quasar has taken a hierarchical architecture: data flows from the producers to the consumer (server). Servers could themselves be arranged at different levels of a hierarchy. Clients then maintain a cache of the overall data. A query from a client is resolved by the client if possible. If the client cannot do so, the query goes to the appropriate server, and the server might in turn need to go to the producers to get the appropriate data.
  • Quasar exploits the fact that applications have tolerance to data quality. E.g., in tracking, if our track is within 5 meters of the real track, that is fine; likewise, a value within a bounded error of the actual value is fine. Data produced in the environment has error anyway: measurement error, error due to transmission time, etc. Quasar explores the tradeoff between quality and resource constraints. There are two renditions of the problem: (1) given a maximum resource consumption, maximize application quality; (2) given the minimum quality requirement of the application, minimize resource utilization. Resources: energy, battery, communication, etc.
  • Shows that, given a producer and a consumer, the two can agree on an error bound on the value. The quality requirement is that at any instant of time the value at the server is within the error threshold of the value at the producer. Assumption: instantaneous messages (otherwise the guarantee can be obtained by waiting for the maximum messaging time). Protocol: the sensor sends an update if the threshold is violated. Do this using 2 slides: 1 slide for the problem definition and the second for the protocol.
  • Instead of the value, one can use prediction models M1, M2, ..., Mn. The collection protocol changes so that the producer sends an update when its value deviates from the value predicted by the model. Both the producer and the server have to agree on the prediction model at all times. Issue: who should do the model fitting? Server: has global knowledge but does not have the most recent knowledge; it also needs to communicate a model update to the producer. Producer: has local knowledge; it can also piggyback the model parameters on the message it sends for the model violation or refresh.
  • Problem definition: we want data to be archived at a certain accuracy E_archive. Explain what this means in terms of error. If the tolerance for the archive is larger than E_collection for a producer, then there is nothing to do -- we can simply use the collected data. If E_archive is smaller, then there are many solutions: (1) A trivial solution is to simply change the data collection error E_collection to E_archive. This, however, may result in high overhead (multiple messages). Notice that while collection is required for real-time support, the data archive is only required eventually -- there is no real-time requirement on it. Thus, we could collect data over a longer time and compress it. Notice that there is some degree of redundancy in archival.
  • Keep collecting time series data and compress it as much as possible. Once the compressed data fills the memory of the sensor, send the data over to the server.
  • A query in Quasar has a certain quality requirement. Given that data at the server is at a given quality as well, we have to translate the application (query) quality requirement into a data quality requirement. Restrict to just aggregate queries over sensor values with a spatial condition for now. So the queries are: min, max, sum, and average over spatial areas. More complex queries are also possible, but we are restricting ourselves for now. Aggregate queries where the condition is also over the sensor value were addressed by Olston's work at Stanford.
  • Explain in about 5 slides how MRA tree works and how the average value is computed all the way to the sensors. Basically, explain that while it is possible to optimize the execution time of queries, our consideration for optimization at this time is the amount of communication and hence we try to minimize the number of probes.
  • Explain the application. The objective: we need a track that is within a bounded error of the actual track. Use acoustic sensors, organized as a grid, with base stations that collect the data and transmit it to the server.
  • Acoustic sensors. The sensors follow the Quasar sensor state diagram.
  • Server does three things…. As before.
  • Alice did not initially specify that automatic transmission is important, but she had an opportunity to add that later. Even if a particular vehicle is marked as having no automatic transmission, it will still be retrieved. Similarly, if the user specified the wrong words, synonyms will be used to take care of the problem. Different users have different search criteria.

    1. 1. Quality Aware Sensor Database (QUASAR) Project ** Sharad Mehrotra Department of Information and Computer Science University of California, Irvine **Supported in part by a collaborative NSF ITR grant entitled “real-time data capture, analysis, and querying of dynamic spatio-temporal events” in collaboration with UCLA, U. Maryland, U. Chicago
    2. 2. Talk Outline <ul><li>Quasar Project </li></ul><ul><ul><li>motivation and background </li></ul></ul><ul><ul><li>data collection and archival components </li></ul></ul><ul><ul><li>query processing </li></ul></ul><ul><ul><li>tracking application using QUASAR framework </li></ul></ul><ul><ul><li>challenges and ongoing work </li></ul></ul><ul><li>Brief overview of other research projects </li></ul><ul><ul><li>MARS Project - incorporating similarity retrieval and refinement over structured and semi-structured data to aid interactive data analysis/mining </li></ul></ul><ul><ul><li>Database as a Service (DAS) Project - supporting the application service provider model for data management </li></ul></ul>
    3. 3. Emerging Computing Infrastructure… Instrumented wide-area spaces In-body, in-cell, in-vitro spaces <ul><li>Generational advances to computing infrastructure </li></ul><ul><ul><li>sensors will be everywhere </li></ul></ul><ul><li>Emerging applications with limitless possibilities </li></ul><ul><ul><li>real-time monitoring and control, analysis </li></ul></ul><ul><li>New challenges </li></ul><ul><ul><li>limited bandwidth & energy </li></ul></ul><ul><ul><li>highly dynamic systems </li></ul></ul><ul><li>System architectures are due for an overhaul </li></ul><ul><ul><li>at all levels of the system OS, middleware, databases, applications </li></ul></ul>
    4. 4. Impact to Data Management … <ul><li>Traditional data management </li></ul><ul><ul><li>client-server architecture </li></ul></ul><ul><ul><li>efficient approaches to data storage & querying </li></ul></ul><ul><ul><li>query shipping versus data shipping </li></ul></ul><ul><ul><li>data changes with explicit update </li></ul></ul><ul><li>Emerging Challenge </li></ul><ul><ul><li>data producers must be considered as “first class” entities </li></ul></ul><ul><ul><ul><li>sensors generate continuously changing highly dynamic data </li></ul></ul></ul><ul><ul><ul><li>sensors may store, process, and communicate data </li></ul></ul></ul>Data/query request Data/query result client server Data producers
    5. 5. Data Management Architecture Issues <ul><li>Where to store data? </li></ul><ul><ul><li>Do not store -- stream model </li></ul></ul><ul><ul><ul><li>not suitable if we wish to archive data for future analysis or if data is too important to lose </li></ul></ul></ul><ul><ul><li>at the producers </li></ul></ul><ul><ul><ul><li>limited storage, network, compute resources </li></ul></ul></ul><ul><ul><li>at the servers </li></ul></ul><ul><ul><ul><li>server may not be able to cope with high data production rates. May lead to data staleness and/or wasted resources </li></ul></ul></ul><ul><li>Where to compute? </li></ul><ul><ul><li>At the client, server, data producers </li></ul></ul>Data/query request Data/query result server producer cache client Data producers
    6. 6. Quasar Architecture <ul><li>Hierarchical architecture </li></ul><ul><ul><li>data flows from producers to server to clients periodically </li></ul></ul><ul><ul><li>queries flow the other way: </li></ul></ul><ul><ul><ul><li>If client cache does not suffice, then </li></ul></ul></ul><ul><ul><ul><li>query routed to appropriate server </li></ul></ul></ul><ul><ul><ul><li>If server cache does not suffice, then access current data at producer </li></ul></ul></ul><ul><ul><li>This is a logical architecture -- producers could also be clients. </li></ul></ul>server Client cache Server cache & archive producer cache data flow Query flow client producer
    7. 7. Quasar: Observations & Approach <ul><li>Applications can tolerate errors in sensor data </li></ul><ul><ul><li>applications may not require exact answers: </li></ul></ul><ul><ul><ul><li>small errors in location during tracking or error in answer to query result may be OK </li></ul></ul></ul><ul><ul><li>data cannot be precise due to measurement errors, transmission delays, etc. </li></ul></ul><ul><li>Communication is the dominant cost </li></ul><ul><ul><li>limited wireless bandwidth, source of major energy drain </li></ul></ul><ul><li>Quasar Approach </li></ul><ul><ul><li>exploit application error tolerance to reduce communication between producer and server </li></ul></ul><ul><ul><li>Two approaches </li></ul></ul><ul><ul><ul><li>Minimize resource usage given quality constraints </li></ul></ul></ul><ul><ul><ul><li>Maximize quality given resource constraints </li></ul></ul></ul>
    8. 8. Quality-based Data Collection Problem <ul><li>Let P = < p[1], p[2], …, p[n] > be a sequence of environmental measurements (time series) generated by the producer, where n = now </li></ul><ul><li>Let S = <s[1], s[2], …, s[n]> be the server side representation of the sequence </li></ul><ul><li>A within-ε quality data collection protocol guarantees that </li></ul><ul><ul><li>for all i error(p[i], s[i]) < ε </li></ul></ul><ul><li>ε is derived from application quality tolerance </li></ul>Sensor time series … p[n], p[n-1], …, p[1]
    9. 9. Simple Data Collection Protocol <ul><li>sensor logic (at time step n) </li></ul><ul><ul><li>Let p’ = last value sent to server </li></ul></ul><ul><ul><li>if error(p[n], p’) > ε </li></ul></ul><ul><ul><li>send p[n] to server </li></ul></ul><ul><li>server logic (at time step n) </li></ul><ul><ul><li>If new update p[n] received at step n </li></ul></ul><ul><ul><ul><li>s[n] = p[n] </li></ul></ul></ul><ul><ul><ul><li>Else </li></ul></ul></ul><ul><ul><ul><li>s[n] = last update sent by sensor </li></ul></ul></ul><ul><ul><li>guarantees maximum error at server less than or equal to ε </li></ul></ul>Sensor time series … p[n], p[n-1], …, p[1]
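The sensor and server logic on the slide above can be sketched in a few lines of Python. This is a minimal simulation under stated assumptions, not the project's implementation: the function name `collect`, the use of absolute difference as `error`, instantaneous delivery, and the sample readings are all illustrative choices that do not come from the slides.

```python
def collect(readings, eps):
    """Simulate the sensor/server pair; returns (server_view, messages_sent).

    Assumes error(a, b) = |a - b| and instantaneous message delivery.
    """
    server_view = []
    last_sent = None   # p' on the slide: last value sent to the server
    messages = 0
    for p in readings:
        # Sensor logic: send only when the server's copy is off by more than eps.
        if last_sent is None or abs(p - last_sent) > eps:
            last_sent = p
            messages += 1
        # Server logic: s[n] is the newest update, or the previous one otherwise.
        server_view.append(last_sent)
    return server_view, messages


# Hypothetical readings, chosen only for illustration.
readings = [20.0, 20.3, 20.9, 21.4, 21.5, 19.8, 19.9]
view, sent = collect(readings, eps=1.0)
max_error = max(abs(p - s) for p, s in zip(readings, view))
print(sent, max_error)
```

With these readings only three of the seven samples are transmitted, and the server's view never deviates from the sensor's by more than `eps`.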
    10. 10. Exploiting Prediction Models <ul><li>Producer and server agree upon a prediction model (M, θ) </li></ul><ul><li>Let s_pred[i] be the predicted value at time i based on (M, θ) </li></ul><ul><li>sensor logic (at time step n) </li></ul><ul><ul><li>if error(p[n], s_pred[n]) > ε </li></ul></ul><ul><ul><ul><li>send p[n] to server </li></ul></ul></ul><ul><li>server logic (at time step n) </li></ul><ul><li>If new update p[n] received at step n </li></ul><ul><ul><ul><li>s[n] = p[n] </li></ul></ul></ul><ul><ul><ul><li>Else </li></ul></ul></ul><ul><ul><ul><li>s[n] = s_pred[n] based on model (M, θ) </li></ul></ul></ul>
    11. 11. Challenges in Prediction <ul><li>Simple versus complex models? </li></ul><ul><ul><ul><li>Complex and more accurate models require more parameters (that will need to be transmitted). </li></ul></ul></ul><ul><ul><ul><li>Goal is to minimize communication, not necessarily to obtain the best prediction </li></ul></ul></ul><ul><li>How is a model M generated? </li></ul><ul><ul><ul><li>static -- one out of a fixed set of models </li></ul></ul></ul><ul><ul><ul><li>dynamic -- dynamically learn a model from data </li></ul></ul></ul><ul><li>When should a model M or parameters θ be changed? </li></ul><ul><ul><ul><li>immediately on model violation: </li></ul></ul></ul><ul><ul><ul><ul><li>too aggressive -- violation may be a temporary phenomenon </li></ul></ul></ul></ul><ul><ul><ul><li>never changed: </li></ul></ul></ul><ul><ul><ul><ul><li>too conservative -- data rarely follows a single model </li></ul></ul></ul></ul>
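A small sketch may help show why an agreed prediction model reduces traffic. It assumes a linear model whose single parameter is a fixed `slope`, re-anchored at the last transmitted value; that model, the function name `collect_with_model`, and the test ramp are assumptions for illustration, since the slides leave the model (M, θ) abstract.

```python
def collect_with_model(readings, eps, slope):
    """Prediction-based collection with an assumed linear model.

    Both sides predict s_pred[n] = anchor_value + slope * (n - anchor_step),
    where the anchor is the last value the sensor transmitted.
    """
    server_view, messages = [], 0
    anchor_value, anchor_step = None, 0
    for n, p in enumerate(readings):
        predicted = (None if anchor_value is None
                     else anchor_value + slope * (n - anchor_step))
        if predicted is None or abs(p - predicted) > eps:
            # Model violated: the sensor sends p[n] and both sides re-anchor.
            anchor_value, anchor_step = p, n
            messages += 1
            server_view.append(p)
        else:
            # No message: the server falls back on the shared prediction.
            server_view.append(predicted)
    return server_view, messages


ramp = [2.0 * n for n in range(10)]   # steadily rising signal
_, msgs_flat = collect_with_model(ramp, eps=1.0, slope=0.0)
_, msgs_linear = collect_with_model(ramp, eps=1.0, slope=2.0)
print(msgs_flat, msgs_linear)
```

On this ramp the constant model (slope 0) forces a message at every step, while a model that matches the trend needs only the initial message.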
    12. 12. Challenges in Prediction (cont.) <ul><li>who does the model update? </li></ul><ul><ul><ul><li>Server </li></ul></ul></ul><ul><ul><ul><ul><li>Long-haul prediction models possible, since server maintains history </li></ul></ul></ul></ul><ul><ul><ul><ul><li>might not predict recent behavior well since server does not know exact S sequence; server has only samples </li></ul></ul></ul></ul><ul><ul><ul><ul><li>extra communication to inform the producer </li></ul></ul></ul></ul><ul><ul><ul><li>Producer </li></ul></ul></ul><ul><ul><ul><ul><li>better knowledge of recent history </li></ul></ul></ul></ul><ul><ul><ul><ul><li>long haul models not feasible since producer does not have history </li></ul></ul></ul></ul><ul><ul><ul><ul><li>producers share computation load </li></ul></ul></ul></ul><ul><ul><ul><li>Both </li></ul></ul></ul><ul><ul><ul><ul><li>server looks for new models, sensor performs parameter fitting given existing models . </li></ul></ul></ul></ul>
    13. 13. Archiving Sensor Data <ul><li>Often sensor-based applications are built with only the real-time utility of time series data. </li></ul><ul><ul><li>Values at time instants <<n are discarded. </li></ul></ul><ul><li>Archiving such data consists of maintaining the entire S sequence, or an approximation thereof. </li></ul><ul><li>Importance of archiving: </li></ul><ul><ul><li>Discovering large-scale patterns </li></ul></ul><ul><ul><li>Once-only phenomena, e.g., earthquakes </li></ul></ul><ul><ul><li>Discovering “events” detected post facto by “rewinding” the time series </li></ul></ul><ul><ul><li>Future usage of data which may be not known while it is being collected </li></ul></ul>
    14. 14. Problem Formulation <ul><li>Let P = < p[1], p[2], …, p[n] > be the sensor time series </li></ul><ul><li>Let S = < s[1], s[2], …, s[n] > be the server side representation </li></ul><ul><li>A within-ε_archive quality data archival protocol guarantees that </li></ul><ul><ul><li>error(p[i], s[i]) < ε_archive </li></ul></ul><ul><li>Trivial Solution: modify collection protocol to collect data at quality guarantee of min(ε_archive, ε_collect) </li></ul><ul><ul><li>then prediction model by itself will provide a within-ε_archive quality data stream that can be archived. </li></ul></ul><ul><li>Better solutions possible since </li></ul><ul><ul><li>archived data not needed for immediate access by real-time or forecasting applications (such as monitoring, tracking) </li></ul></ul><ul><ul><li>compression can be used to reduce data transfer </li></ul></ul>
    15. 15. Data Archival Protocol <ul><li>Sensor compresses observed time series p[1:n] and sends a lossy compression to the server </li></ul><ul><li>At time n : </li></ul><ul><ul><li>p[1:n-n_lag] is at the server in compressed form s’[1:n-n_lag] within-ε_archive </li></ul></ul><ul><ul><li>s[n-n_lag+1:n] is estimated via a predictive model (M, θ) </li></ul></ul><ul><ul><ul><li>collection protocol guarantees that this remains within-ε_collect </li></ul></ul></ul><ul><ul><li>s[n+1:∞] can be predicted but its quality is not guaranteed (because it is in the future and thus the sensor has not observed these values) </li></ul></ul>… p[n], p[n-1], .. compress Sensor memory buffer Sensor updates for data collection Compressed representation for archiving processing at sensor exploited to reduce communication cost and hence battery drain
    16. 16. Piecewise Constant Approximation (PCA) <ul><li>Given a time series S_n = s[1:n], a piecewise constant approximation of it is a sequence </li></ul><ul><li>PCA(S_n) = < (c_i, e_i) > </li></ul><ul><li>that allows us to estimate s[j] as: </li></ul><ul><li>s_capt[j] = c_i if j in [e_(i-1)+1, e_i] </li></ul><ul><li> = c_1 if j<e_1 </li></ul>Time Value e_1 e_2 e_3 e_4 c_1 c_2 c_3 c_4
    17. 17. Online Compression using PCA <ul><li>Goal: Given stream of sensor values, generate a within-ε_archive PCA representation of a time series </li></ul><ul><li>Approach (PMC-midrange) </li></ul><ul><ul><li>Maintain m, M as the minimum/maximum values of observed samples since last segment </li></ul></ul><ul><ul><li>On processing p[n], update m and M if needed </li></ul></ul><ul><ul><ul><li>if M - m > 2ε_archive, output a segment ((m + M)/2, n) </li></ul></ul></ul>Time Value Example: ε_archive = 1.5 1 2 3 4 5 2 3 4 2.5 6
    18. 18. Online Compression using PCA <ul><li>PMC-MR … </li></ul><ul><ul><li>guarantees that each segment compresses the corresponding time series segment to within-ε_archive </li></ul></ul><ul><ul><li>requires O(1) storage </li></ul></ul><ul><ul><li>is instance optimal </li></ul></ul><ul><ul><ul><li>no other PCA representation with fewer segments can meet the within-ε_archive constraint </li></ul></ul></ul><ul><li>Variant of PMC-MR </li></ul><ul><ul><li>PMC-MEAN, which takes the mean of the samples seen thus far instead of mid range. </li></ul></ul>
    19. 19. Improving PMC using Prediction <ul><li>Observation : Prediction models guarantee a within-ε_collect version of the time series at server even before the compressed time series arrives from the producer. </li></ul><ul><li>Can the prediction model be exploited to reduce the overhead of compression? </li></ul><ul><ul><li>If ε_archive > ε_collect no additional effort is required for archival --> simply archive the predicted model. </li></ul></ul><ul><li>Approach : </li></ul><ul><ul><li>Define an error time series E[i] = p[i]-s_pred[i] </li></ul></ul><ul><ul><li>Compress E[1:n] to within-ε_archive instead of compressing p[1:n] </li></ul></ul><ul><ul><li>The archive contains the prediction parameters and the compressed error time series </li></ul></ul><ul><ul><li>Within-ε_archive of E[i] + (M, θ) can be used to reconstruct a within-ε_archive version of p </li></ul></ul>
    20. 20. Combining Compression and Prediction (Example) Predicted Time Series Actual Time Series Actual Time Series Compressed Time Series (7 segments) Compressed Error (2 segments) Error = Actual – Predicted
    21. 21. Estimating Time Series Values <ul><li>Historical samples (before n-n_lag) are maintained at the server within-ε_archive </li></ul><ul><li>Recent samples (between n-n_lag+1 and n) are maintained by the sensor and predicted at the server. </li></ul><ul><li>If an application requires ε_q precision, then: </li></ul><ul><ul><li>if ε_q ≥ ε_collect then it must wait for the maximum messaging time in case a parameter refresh is en route </li></ul></ul><ul><ul><li>if ε_q ≥ ε_archive but ε_q < ε_collect then it may probe the sensor or wait for a compressed segment </li></ul></ul><ul><ul><li>Otherwise only probing meets precision </li></ul></ul><ul><li>For future samples (after n) immediate probing not available as an option </li></ul>
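The PMC-midrange idea from the two slides above can be sketched as follows. This is one reading of the slide, stated as an assumption: when a new sample would stretch the running range beyond 2ε_archive, the segment is closed at the previous sample using the midrange of the samples seen before it, and the new sample starts the next segment. The helper `reconstruct` and the sample values are hypothetical.

```python
def pmc_mr(samples, eps):
    """Online piecewise-constant compression (midrange variant).

    Returns a list of (constant, end_index) pairs; end_index is 1-based and
    inclusive. Only the running min and max are kept, so state is O(1).
    """
    segments = []
    lo = hi = None
    for n, p in enumerate(samples, start=1):
        if lo is None:
            lo = hi = p
            continue
        new_lo, new_hi = min(lo, p), max(hi, p)
        if new_hi - new_lo > 2 * eps:
            # p does not fit: close the segment at sample n-1 and restart at p.
            segments.append(((lo + hi) / 2, n - 1))
            lo = hi = p
        else:
            lo, hi = new_lo, new_hi
    if lo is not None:
        segments.append(((lo + hi) / 2, len(samples)))  # flush the open segment
    return segments


def reconstruct(segments):
    """Expand (constant, end_index) pairs back into a per-sample estimate."""
    values, start = [], 1
    for c, end in segments:
        values.extend([c] * (end - start + 1))
        start = end + 1
    return values


samples = [1.0, 2.0, 3.0, 4.5, 5.0, 2.0, 2.5]   # illustrative values only
segments = pmc_mr(samples, eps=1.0)
estimate = reconstruct(segments)
print(segments)
```

Because each segment's constant is the midpoint of a range no wider than 2ε, every reconstructed sample lies within ε of the original.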
    22. 22. Experiments <ul><li>Data sets: </li></ul><ul><ul><li>Synthetic Random-Walk </li></ul></ul><ul><ul><ul><li>x[1] = 0 and x[i]=x[i-1]+s n where s n drawn uniformly from [-1,1] </li></ul></ul></ul><ul><ul><li>Oceanographic Buoy Data </li></ul></ul><ul><ul><ul><li>Environmental attributes (temperature, salinity, wind-speed, etc.) sampled at 10min intervals from a buoy in the Pacific Ocean (Tropical Atmosphere Ocean Project, Pacific Marine Environment Laboratory) </li></ul></ul></ul><ul><ul><li>GPS data collected using IPAQs </li></ul></ul><ul><li>Experiments to test: </li></ul><ul><ul><li>Compression Performance of PMC </li></ul></ul><ul><ul><li>Benefits of Model Selection </li></ul></ul><ul><ul><li>Query Accuracy over Compressed Data </li></ul></ul><ul><ul><li>Benefits of Prediction/Compression Combination </li></ul></ul>
    23. 23. Compression Performance K/n ratio: number of segments/number of samples
    24. 24. Query Performance Over Compressed Data “ How many sensors have values > v ?” (Mean selectivity = 50)
    25. 25. Impact of Model Selection K/n ratio: number of segments/number of samples. ε_pred is the localization tolerance in meters <ul><li>Objects moved at approximately constant speed (+ measurement noise) </li></ul><ul><li>Three models used: </li></ul><ul><ul><li>loc[n] = c </li></ul></ul><ul><ul><li>loc[n] = c+vt </li></ul></ul><ul><ul><li>loc[n] = c+vt+0.5at^2 </li></ul></ul><ul><li>Parameters v, a were estimated at sensor over moving-window of 5 samples </li></ul>
    26. 26. Combining Prediction with Compression K/n ratio: number of segments/number of samples
    27. 27. QUASAR Client Time Series GPS Mobility Data from Mobile Clients (iPAQs) Latitude Time Series: 1800 samples Compressed Time Series (PMC-MR, ICDE 2003) Accuracy of ~ 100 m 130 segments
    28. 28. Query Processing in Quasar <ul><li>Problem Definition </li></ul><ul><ul><li>Given </li></ul></ul><ul><ul><ul><li>sensor time series with quality-guarantees captured at the server </li></ul></ul></ul><ul><ul><ul><li>A query with a specified quality-tolerance </li></ul></ul></ul><ul><ul><li>Return </li></ul></ul><ul><ul><ul><li>query results incurring least cost </li></ul></ul></ul><ul><li>Techniques depend upon </li></ul><ul><ul><li>nature of queries </li></ul></ul><ul><ul><li>Cost measures </li></ul></ul><ul><ul><ul><li>resource consumption -- energy, communication, I/O </li></ul></ul></ul><ul><ul><ul><li>query response time </li></ul></ul></ul>
    29. 29. Aggregate Queries S min Q = 2 max Q = 7 count Q = 3 sum Q = 2+7+6 = 15 avg Q = 15/3 = 5 9 6 3 8 2 7 Q
    30. 30. Processing Aggregate Queries (minimize producer probe) <ul><li>MIN Query </li></ul><ul><ul><li>c = min_j (s_j.high) </li></ul></ul><ul><ul><li>b = c - ε_query </li></ul></ul><ul><ul><li>Probe all sensors where s_j.low < b </li></ul></ul><ul><ul><ul><li>only s1 and s3 will be probed </li></ul></ul></ul><ul><li>Sum Query </li></ul><ul><ul><li>select a minimal subset S’ ⊆ S such that </li></ul></ul><ul><ul><li>Σ_{s_j in S’} (ε_j^pred) >= Σ_{s_j in S} (ε_j^pred) - ε_query </li></ul></ul><ul><ul><li>If ε_query = 15, only s1 will be probed </li></ul></ul>Let S = <s_1, s_2, …, s_n> be the set of sensors that meet the query criteria, with s_j.high = s_j^pred[t] + ε_j^pred and s_j.low = s_j^pred[t] - ε_j^pred. a b c s1 s2 sn s3 s1 s2 s5 s3 s4 10 5 2 5 3
    31. 31. Minimizing Cost at Server <ul><li>Error tolerance of queries can be exploited to reduce processing at server. </li></ul><ul><li>Key Idea </li></ul><ul><ul><li>Use a multi-resolution index structure (MRA-tree) for processing aggregate queries at server. </li></ul></ul><ul><ul><li>An MRA-Tree is a modified multi-dimensional index tree (R-Tree, quadtree, Hybrid tree, etc.) </li></ul></ul><ul><ul><li>A non-leaf node contains (for each of its subtrees) four aggregates { MIN , MAX , COUNT , SUM } </li></ul></ul><ul><ul><li>A leaf node contains the actual data points (sensor models) </li></ul></ul>
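The two probe-selection rules on the slide above can be sketched as below. Everything beyond the rules themselves is an assumption made for illustration: the function names, the dictionary layout, uniform probe cost, the MIN-query intervals, and the per-sensor tolerances (chosen so that a query tolerance of 15 leads to probing only s1, as the slide states). For SUM, probing the sensors with the widest tolerance first is one way to obtain a smallest subset when every probe costs the same.

```python
def min_query_probes(intervals, eps_query):
    """intervals: {sensor: (low, high)}. Sensors to probe for a MIN query."""
    c = min(high for _, high in intervals.values())   # upper bound on the minimum
    b = c - eps_query                                  # lowest acceptable lower bound
    return sorted(name for name, (low, _) in intervals.items() if low < b)


def sum_query_probes(tolerances, eps_query):
    """tolerances: {sensor: eps_pred}. Sensors to probe for a SUM query.

    Probes the widest intervals first until the uncertainty left in the
    unprobed sensors is at most eps_query.
    """
    remaining = sum(tolerances.values())
    probes = []
    for name, eps in sorted(tolerances.items(), key=lambda kv: kv[1], reverse=True):
        if remaining <= eps_query:
            break
        probes.append(name)
        remaining -= eps
    return probes


# Hypothetical cached intervals and tolerances.
intervals = {"s1": (1, 5), "s2": (4, 8), "s3": (2, 6), "s4": (6, 9)}
tolerances = {"s1": 10, "s2": 5, "s3": 5, "s4": 3, "s5": 2}
min_probes = min_query_probes(intervals, eps_query=2)
sum_probes = sum_query_probes(tolerances, eps_query=15)
print(min_probes, sum_probes)
```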
    32. 32. MRA Tree Data Structure Spatial View A B C D E F G A B C D E F G Tree Structure View S 1 S 2 S 3 S 4 S 5 S 6 S 7 S 8 S 7 S 2 S 3 S 4 S 5 S 6 S 1 S 8
    33. 33. Non-Leaf Node Disk Page Pointers (each costs 1 I/O) Leaf Node Probe “Pointers” (each costs 2 messages) MRA-Tree Node Structure min max count sum 2 3 9 4 4 2 9 5 1 4 4 2 6 1 6 6 M 1  1 M 2  2 M 3  3
    34. 34. <ul><li>Two sets of nodes: </li></ul><ul><ul><li>NP (partial contribution to the query) </li></ul></ul><ul><ul><li>NC (complete contribution) </li></ul></ul>Node Classification Q N disjoint contains Q N Q N is contained Q N partially overlaps
    35. 35. Aggregate Queries using MRA Tree <ul><li>Initialize NP with the root </li></ul><ul><li>At each iteration: Remove one node N from NP and for each N child of its children </li></ul><ul><ul><li>discard, if N child disjoint with Q </li></ul></ul><ul><ul><li>insert into NP if Q is contained or partially overlaps with N child </li></ul></ul><ul><ul><li>“ insert” into NC if Q contains N child (we only need to maintain agg NC ) </li></ul></ul><ul><ul><li>compute the best estimate based on contents of NP and NC </li></ul></ul>Q N
    36. 36. MIN (and MAX) 3 9 4 5 Interval: min_NC = min { 4, 5 } = 4; min_NP = min { 3, 9 } = 3; L = min { min_NC, min_NP } = 3; H = min_NC = 4; hence, I = [3, 4]. Estimate: lower bound E(min_Q) = L = 3. Traversal: choose N ∈ NP such that min_N = min_NP
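The interval computation on the MIN slide can be written out directly. The function name `min_interval` and the handling of empty node sets (treated as contributing infinity) are assumptions; the numbers are the ones shown on the slide.

```python
def min_interval(nc_mins, np_mins):
    """Bound a MIN query from the stored minima of the two node sets.

    nc_mins: MIN aggregates of nodes fully contained in the query (NC).
    np_mins: MIN aggregates of nodes that only partially overlap it (NP).
    """
    min_nc = min(nc_mins) if nc_mins else float("inf")
    min_np = min(np_mins) if np_mins else float("inf")
    low = min(min_nc, min_np)   # a partially overlapping node might hold the minimum
    high = min_nc               # a fully contained node certainly contributes its minimum
    return low, high


low, high = min_interval(nc_mins=[4, 5], np_mins=[3, 9])
print(low, high)
```

Exploring the NP node whose minimum equals `min_NP` is what can tighten the lower bound, which matches the traversal rule on the slide.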
    37. 37. MRA Tree Traversal A B C D E F G <ul><li>Progressive answer refinement until NP is exhausted </li></ul><ul><li>Greedy priority-based local decision for next node to be explored based on: </li></ul><ul><ul><li>Cost (1 I/O or 2 messages) </li></ul></ul><ul><ul><li>Benefit (Expected Reduction in answer uncertainty) </li></ul></ul>S 7 S 2 S 3 S 4 S 5 S 6 S 1 S 8
    38. 38. Adaptive Tracking of mobile objects in sensor networks <ul><li>Tracking Architecture </li></ul><ul><ul><li>A network of wireless acoustic sensors arranged as a grid transmitting via a base station to server </li></ul></ul><ul><ul><li>A track of the mobile object generated at the base station or server </li></ul></ul><ul><li>Objective </li></ul><ul><ul><li>Track a mobile object at the server such that the track deviates from the real trajectory within a user defined error threshold ε_track with minimum communication overhead. </li></ul></ul>Track visualization Base station 1 Base station 2 Base station 3 Server Show me the approximate track of the object with precision ε Wireless Sensor Grid object Wireless link
    39. 39. Sensor Model <ul><li>Wireless sensors : battery operated, energy constrained </li></ul><ul><li>Operate on the received acoustic waveforms </li></ul><ul><li>Signal attenuation of target object given by : I_i(t) = P / (4π r^2) </li></ul><ul><li>P : source object power </li></ul><ul><li>r = distance of object from sensor </li></ul><ul><li>I_i(t) = intensity reading at time t at i-th sensor </li></ul><ul><li>I_th : Intensity threshold at i-th sensor </li></ul>
    40. 40. Sensor States <ul><li>S0 : Monitor ( processor on, sensor on, radio off ) </li></ul><ul><ul><li>shift to S1 if intensity above threshold </li></ul></ul><ul><li>S1 : Active state ( processor on, sensor on, radio on) </li></ul><ul><ul><li>send intensity readings to base station. </li></ul></ul><ul><ul><li>On receiving message from BS containing error tolerance shift to S2 </li></ul></ul><ul><li>S2 : Quasi-active (processor on, sensor on, radio intermittent) </li></ul><ul><ul><li>send intensity reading to BS if error from previous reading exceeds error threshold </li></ul></ul><ul><li>Quasar Collection approach used in Quasi-active state </li></ul>S0 (Initial state) S2 S1 Receive BS message I_i < I_th I_i > I_th I_i < I_th
    41. 41. Server side protocol <ul><li>Server maintains : </li></ul><ul><ul><li>list of sensors in the active/ quasi-active state </li></ul></ul><ul><ul><li>history of their intensity readings over a period of time </li></ul></ul><ul><li>Server Side Protocol </li></ul><ul><ul><li>convert track quality to a relative intensity error at sensors </li></ul></ul><ul><ul><li>Send relative intensity error to sensor when sensor state = S1 (active state), which moves the sensor to the quasi-active state S2 </li></ul></ul><ul><ul><li>Triangulate using n sensor readings at discrete time intervals. </li></ul></ul>
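The three sensor states can be sketched as a small transition function. The transitions back to S0 when the intensity drops below the threshold are read off the diagram labels on the slide and should be treated as an assumption, as should the names `State` and `next_state`.

```python
from enum import Enum


class State(Enum):
    MONITOR = "S0"        # processor on, sensor on, radio off
    ACTIVE = "S1"         # radio on, every reading sent to the base station
    QUASI_ACTIVE = "S2"   # radio intermittent, readings sent only on violation


def next_state(state, intensity, threshold, got_bs_message=False):
    """Return the state after one reading (assumed transition rules)."""
    if state is State.MONITOR:
        return State.ACTIVE if intensity > threshold else State.MONITOR
    if intensity < threshold:
        return State.MONITOR          # object out of range again
    if state is State.ACTIVE and got_bs_message:
        return State.QUASI_ACTIVE     # base station supplied an error tolerance
    return state


s = State.MONITOR
s = next_state(s, intensity=0.9, threshold=0.5)                       # object detected
s = next_state(s, intensity=0.9, threshold=0.5, got_bs_message=True)  # tolerance received
print(s)
```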
    42. 42. Basic Triangulation Algorithm (using 3 sensor readings) P: source object power, I_i = intensity reading at i-th sensor. (x-x1)^2 + (y-y1)^2 = P/(4π I_1); (x-x2)^2 + (y-y2)^2 = P/(4π I_2); (x-x3)^2 + (y-y3)^2 = P/(4π I_3). Solving we get (x, y) = f(x1, x2, x3, y1, y2, y3, P, I_1, I_2, I_3) (x, y) <ul><li>More complex approaches to amalgamate more than three sensor readings possible </li></ul><ul><li>Based on numerical methods -- do not provide a closed form equation between sensor reading and tracking location ! </li></ul><ul><li>Server can use simple triangulation to convert track quality to sensor intensity quality tolerances and a more complex approach to track. </li></ul>(x1, y1) (x2, y2) (x3, y3)
    43. 43. Adaptive Tracking : Mapping track quality to sensor reading [Figure: intensity readings I_1, I_2, I_3 versus time between t_i and t_(i+1); track in X (m) versus Y (m) with tolerance ε_track] <ul><ul><li>Claim 1 (power constant) </li></ul></ul><ul><ul><ul><li>Let I_i be the intensity value of sensor </li></ul></ul></ul><ul><ul><ul><li>If then, track quality is guaranteed to be within ε_track </li></ul></ul></ul><ul><ul><ul><li>where and C is a constant derived from the known locations of the sensors and the power of the object. </li></ul></ul></ul><ul><ul><li>Claim 2 (power varies between [P_min, P_max]) </li></ul></ul><ul><ul><li>If then </li></ul></ul><ul><ul><li>track quality is guaranteed to be within ε_track where C’ = C/P^2 and is a constant . </li></ul></ul><ul><ul><li>The above constraint is a conservative estimate. Better bounds possible </li></ul></ul>
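One way to solve the three circle equations on the triangulation slide is to subtract the first from the other two, which leaves a 2x2 linear system in x and y. The sketch below does that with Cramer's rule. It assumes the attenuation model I = P/(4π r^2) from the sensor-model slide, noise-free readings, and non-collinear sensors; the function names and the sample geometry are hypothetical.

```python
import math


def radius_sq(power, intensity):
    """Squared object distance implied by I = P / (4*pi*r^2)."""
    return power / (4 * math.pi * intensity)


def triangulate(sensors, intensities, power):
    """Locate the object from three sensor positions and intensity readings."""
    (x1, y1), (x2, y2), (x3, y3) = sensors
    r1, r2, r3 = (radius_sq(power, i) for i in intensities)
    # Subtracting circle 1 from circles 2 and 3 gives a*x + b*y = c.
    a1, b1 = 2 * (x1 - x2), 2 * (y1 - y2)
    c1 = r2 - r1 + x1**2 - x2**2 + y1**2 - y2**2
    a2, b2 = 2 * (x1 - x3), 2 * (y1 - y3)
    c2 = r3 - r1 + x1**2 - x3**2 + y1**2 - y3**2
    det = a1 * b2 - a2 * b1
    if det == 0:
        raise ValueError("sensors are collinear; position is not determined")
    return (c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det


# Synthetic check: generate readings for a known position, then recover it.
sensors = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
true_pos, power = (3.0, 4.0), 100.0
intensities = [
    power / (4 * math.pi * ((true_pos[0] - sx) ** 2 + (true_pos[1] - sy) ** 2))
    for sx, sy in sensors
]
x, y = triangulate(sensors, intensities, power)
print(round(x, 6), round(y, 6))
```

With noisy readings the three circles generally do not meet in a single point, which is where the numerical methods mentioned on the slide come in.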
    44. 44. Adaptive Tracking: prediction to improve performance <ul><li>Communication overhead further reduced by exploiting the predictability of the object being tracked </li></ul><ul><li>Static Prediction : sensor & server agree on a set of prediction models </li></ul><ul><ul><li>only 2 models used: stationary & constant velocity </li></ul></ul><ul><li>Who Predicts: sensor based mobility prediction protocol </li></ul><ul><ul><li>Every sensor by default follows a stationary model </li></ul></ul><ul><ul><li>Based on its history readings may change to constant velocity model (number of readings limited by sensor memory size) </li></ul></ul><ul><ul><li>informs server of model switch </li></ul></ul>
    45. 45. Actual Track versus Track under Adaptive Tracking (error tolerance 20 m) <ul><li>A restricted random motion: the object starts at (0, d) and moves from one node to another randomly chosen node until it walks out of the grid. </li></ul>
    46. 46. Energy Savings due to Adaptive Tracking <ul><li>Total energy consumption over all sensor nodes for the random mobility model with varying track error tolerance. </li></ul><ul><li>Significant energy savings using the adaptive precision protocol over non-adaptive tracking (constant line in graph) </li></ul><ul><li>For a random model, prediction does not work well! </li></ul>
    47. 47. Energy consumption with Distance from BS <ul><li>Total energy consumption over all sensor nodes for the random mobility model with varying base station distance from the sensor grid. </li></ul><ul><li>As the base station moves away, energy consumption is expected to increase, since transmission cost varies as $d^n$ ($n = 2$) </li></ul><ul><li>The adaptive precision algorithm gives better results with increasing base station distance </li></ul>
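A small worked example of the distance argument, as a sketch only. It assumes that energy is simply the number of messages times d^n with n = 2, and that the message counts do not depend on distance; the counts (100 without adaptation, 20 with it) are invented for illustration and are not results from the slides.

```python
def transmission_energy(messages, distance, n=2, unit_cost=1.0):
    """Energy to send `messages` updates over `distance`, with cost ~ d**n."""
    return messages * unit_cost * distance ** n

NON_ADAPTIVE_MSGS = 100  # hypothetical: every reading is transmitted
ADAPTIVE_MSGS = 20       # hypothetical: only out-of-tolerance readings are sent

def saving(distance):
    """Absolute energy saved by the adaptive scheme at a given distance."""
    return (transmission_energy(NON_ADAPTIVE_MSGS, distance)
            - transmission_energy(ADAPTIVE_MSGS, distance))

for d in (10, 20, 40):
    print(d, saving(d))
```

Under these assumptions each suppressed message saves d^2 units, so doubling the base station distance quadruples the absolute saving.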
    48. 48. Challenges & Ongoing Work <ul><li>Ongoing Work: </li></ul><ul><ul><li>Supporting a larger class of SQL queries </li></ul></ul><ul><ul><li>Supporting continuous monitoring queries </li></ul></ul><ul><ul><li>Larger class of sensors (e.g., video sensors) </li></ul></ul><ul><ul><li>Better approaches to model fitting/switching in prediction </li></ul></ul><ul><li>In the future: </li></ul><ul><ul><li>distributed Quasar architecture </li></ul></ul><ul><ul><li>optimizing quality given resource constraints </li></ul></ul><ul><ul><li>supporting applications with real-time constraints </li></ul></ul><ul><ul><li>dealing with failures </li></ul></ul>
    49. 49. The DAS Project ** Goals: Support Database as a Service on the Internet Collaboration: IBM (Dr. Bala Iyer) UCI (Gene Tsudik) ** Supported in part by NSF ITR grant entitled “Privacy in Database as a Service” and by the IBM Corporation
    50. 50. Software as a Service <ul><li>Get … </li></ul><ul><ul><li>what you need </li></ul></ul><ul><ul><li>when you need </li></ul></ul><ul><li>Pay … </li></ul><ul><ul><li>what you use </li></ul></ul><ul><li>Don’t worry … </li></ul><ul><ul><li>how to deploy, implement, maintain, upgrade </li></ul></ul>
    51. 51. Software As a Service: Why? <ul><li>Advantages </li></ul><ul><ul><li>reduced cost to client </li></ul></ul><ul><ul><ul><li>pay for what you use and not for hardware, software infrastructure or personnel to deploy, maintain, upgrade… </li></ul></ul></ul><ul><ul><li>reduced overall cost </li></ul></ul><ul><ul><ul><li>cost amortization across users </li></ul></ul></ul><ul><ul><li>Better service </li></ul></ul><ul><ul><ul><li>leveraging experts across organizations </li></ul></ul></ul><ul><li>Driving Forces </li></ul><ul><ul><li>Faster, cheaper, more accessible networks </li></ul></ul><ul><ul><li>Virtualization in server and storage technologies </li></ul></ul><ul><ul><li>Established e-business infrastructures </li></ul></ul><ul><li>Already in Market </li></ul><ul><ul><li>ERP and CRM (many examples) </li></ul></ul><ul><ul><li>More horizontal storage services, disaster recovery services, e-mail services, rent-a-spreadsheet services etc. </li></ul></ul><ul><ul><li>Sun ONE, Oracle Online Services, Microsoft .NET My Services etc </li></ul></ul>Better Service for Cheaper
    52. 52. Database As a Service <ul><li>Why? </li></ul><ul><ul><li>Most organizations need DBMSs </li></ul></ul><ul><ul><li>DBMSs extremely complex to deploy, setup, maintain </li></ul></ul><ul><ul><li>require skilled DBAs with high cost </li></ul></ul>
    53. 53. What do we want to do? <ul><li>Database as a Service (DAS) Model </li></ul><ul><ul><li>DB management transferred to service provider for </li></ul></ul><ul><ul><ul><li>backup, administration, restoration, space management, upgrades etc. </li></ul></ul></ul><ul><ul><li>use the database “as a service” provided by an ASP </li></ul></ul><ul><ul><ul><li>use SW, HW, human resources of ASP, instead of your own </li></ul></ul></ul>[Figure: User connected to the server of an Application Service Provider (ASP)] BUT…
    54. 54. Challenges <ul><li>Economic/business model? </li></ul><ul><ul><li>How to charge for service, what kind of service guarantees can be offered, costing of guarantees, liability of service provider. </li></ul></ul><ul><li>Powerful interfaces to support complete application development environment </li></ul><ul><ul><li>User Interface for SQL, support for embedded SQL programming, support for user defined interfaces, etc. </li></ul></ul><ul><li>Scalability in the web environment </li></ul><ul><ul><li>overheads due to network latency (data proxies?) </li></ul></ul><ul><li>Privacy and Security </li></ul><ul><ul><li>Protecting data at service providers from intruders and attacks. </li></ul></ul><ul><ul><li>Protecting clients from misuse of data by service providers </li></ul></ul><ul><ul><li>Ensuring result integrity </li></ul></ul>
    55. 55. Data privacy from service provider <ul><li>The problem is we do not trust “the service provider” for sensitive information! </li></ul><ul><ul><li>Fact 1: Theft of intellectual property due to database vulnerabilities costs American businesses $103 billion annually </li></ul></ul><ul><ul><li>Fact 2: 45% of those attacks are conducted by insiders! (CSI/FBI Computer Crime and Security Survey, 2001) </li></ul></ul><ul><ul><li>encrypt the data and store it </li></ul></ul><ul><ul><li>but still be able to run queries over the encrypted data </li></ul></ul><ul><ul><li>do most of the work at the server </li></ul></ul>[Figure: the User's data is stored as an Encrypted User Database on the server of the Application Service Provider, at an untrusted server site]
    56. 56. System Architecture [Figure: architecture diagram. Client Site: User, Query Translator, Metadata, Result Filter, Temporary Results. Server Site: Service Provider, Encrypted User Database. Flows: Original Query, Server Side Query, Encrypted Results, Client Side Query, Actual Results]
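One way to read the flow named on this slide (Original Query, Server Side Query, Encrypted Results, Temporary Results, Client Side Query, Actual Results) is sketched below. The slide does not say how the server can evaluate anything over encrypted tuples, so this sketch assumes a coarse bucket id stored next to each encrypted tuple, with the bucket width kept as client-side metadata. The XOR routine is a toy stand-in and is not real encryption; all names here are hypothetical.

```python
import json
from itertools import cycle

KEY = b"demo-key"      # toy key; XOR is a stand-in, NOT real cryptography
BUCKET_WIDTH = 1000    # client-side metadata (assumed bucketization of price)

def toy_encrypt(row):
    data = json.dumps(row).encode("utf-8")
    return bytes(b ^ k for b, k in zip(data, cycle(KEY)))

def toy_decrypt(blob):
    data = bytes(b ^ k for b, k in zip(blob, cycle(KEY)))
    return json.loads(data.decode("utf-8"))

def bucket_of(price):
    return price // BUCKET_WIDTH

# Server site: holds only opaque tuples plus a coarse bucket id.
rows = [{"id": 1, "price": 3500}, {"id": 2, "price": 3975},
        {"id": 3, "price": 4500}, {"id": 4, "price": 5000},
        {"id": 5, "price": 6500}]
server_table = [(bucket_of(r["price"]), toy_encrypt(r)) for r in rows]

def server_side_query(bucket_ids):
    """Runs at the untrusted server; never sees plaintext prices."""
    return [blob for bucket, blob in server_table if bucket in bucket_ids]

def run_query(low, high):
    """Client site: translate, fetch a superset, decrypt, then filter."""
    bucket_ids = set(range(bucket_of(low), bucket_of(high) + 1))  # query translator
    encrypted = server_side_query(bucket_ids)                     # encrypted results
    temporary = [toy_decrypt(blob) for blob in encrypted]         # temporary results
    actual = [r for r in temporary if low <= r["price"] <= high]  # client-side query
    return encrypted, actual

encrypted_results, actual_results = run_query(3600, 4200)
print(len(encrypted_results), actual_results)
```

The server returns a superset (three tuples here) and the client discards the false positives. Wider buckets reveal less about the values to the server but push more filtering work to the client.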
    57. 57. NetDB2 Service <ul><li>Developed in collaboration with IBM </li></ul><ul><li>Deployed on the Internet about 2 years ago </li></ul><ul><ul><li>Used by 15 universities and more than 2500 students to help teach database classes </li></ul></ul><ul><li>Currently offered through the IBM Scholars Program </li></ul>
    58. 58. MARS Project ** <ul><li>Goals: integration of similarity retrieval and query refinement over structured and semi-structured databases to help interactive data analysis/mining </li></ul>**Supported in part by NSF CAREER award, NSF grant entitled “learning digital behavior” and a KDD grant entitled “Mining events and entities over large spatio-temporal data sets”
    59. 59. Similarity Search in Databases (SR)
    Alice and Bob both ask for: Honda sedan, inexpensive, after 1994, around LA
    Used Car Catalog columns: Year, Model, Mileage, Transmission, Location, Color, Price, ...
    MARS-QL: select * from user_car_catalog where model ~= Honda Accord, year >= 1994, price <= 4K, location ~= LA
    Similarity is subjective: results reflect personal interpretation of 'around', 'inexpensive', and relative importance
    Exact search semantics (unranked):
    Model | Year | Mileage | Location | Price | Transmission
    Honda Accord | 94 | 90K | LA | 3975 | M
    Honda Accord | 95 | 150K | LA | 3500 | A
    Similarity search (Bob: location more important):
    Score | Model | Year | Mileage | Location | Price | Transmission
    1 | Honda Accord | 94 | 90K | LA | 3975 | M
    1 | Honda Accord | 95 | 150K | LA | 3500 | A
    .8 | Honda Prelude | 95 | 50K | LA | 6000 | A
    .7 | Honda Accord | 98 | 30K | LA | 6500 | A
    .5 | Honda Accord | 94 | 60K | Irvine | 5000 | A
    Similarity search (Alice: price more important):
    Score | Model | Year | Mileage | Location | Price | Transmission
    1 | Honda Accord | 94 | 90K | LA | 3975 | M
    1 | Honda Accord | 95 | 150K | LA | 3500 | A
    .8 | Toyota Camry | 94 | 100K | Malibu | 3500 | A
    .7 | Honda Accord | 94 | 60K | Irvine | 5000 | A
    .6 | Honda Accord | 94 | 70K | San Diego | 4500 | A
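The idea that two users get different rankings from the same similarity query can be sketched with a weighted sum of per-predicate similarity scores. This is an illustrative sketch, not the MARS scoring method: the similarity functions, the distances from LA, and the weights are all assumptions, and the scores it produces are not the ones shown on the slide.

```python
# Hypothetical distances from LA in miles, used only for this example.
DISTANCE_FROM_LA = {"LA": 0, "Irvine": 40, "San Diego": 120}

def price_sim(price, target=4000):
    """1.0 at or below the target price, decaying linearly above it."""
    if price <= target:
        return 1.0
    return max(0.0, 1.0 - (price - target) / target)

def location_sim(location):
    """1.0 in LA, decaying linearly to 0 at 150 miles."""
    return max(0.0, 1.0 - DISTANCE_FROM_LA[location] / 150)

def rank(cars, weights):
    """Score each car as a weighted sum of predicate similarities."""
    scored = []
    for car in cars:
        score = (weights["price"] * price_sim(car["price"])
                 + weights["location"] * location_sim(car["location"]))
        scored.append((round(score, 3), car))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

cars = [
    {"model": "Honda Accord", "year": 1994, "location": "LA", "price": 3975},
    {"model": "Honda Accord", "year": 1995, "location": "LA", "price": 3500},
    {"model": "Honda Accord", "year": 1998, "location": "LA", "price": 6500},
    {"model": "Honda Accord", "year": 1994, "location": "San Diego", "price": 4500},
    {"model": "Honda Accord", "year": 1994, "location": "Irvine", "price": 5000},
]

alice = rank(cars, {"price": 0.7, "location": 0.3})  # price matters more
bob = rank(cars, {"price": 0.3, "location": 0.7})    # location matters more
for score, car in alice:
    print("Alice", score, car["location"], car["price"])
for score, car in bob:
    print("Bob", score, car["location"], car["price"])
```

With these weights Alice prefers the cheaper San Diego car to the expensive LA one, while Bob ranks them the other way round.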
    60. 60. Query Refinement (QR)
    Results (the user gives feedback on individual tuples; mileage is also important):
    Score | Model | Year | Mileage | Location | Price | Transmission
    1 | Honda Accord | 94 | 90K | LA | 3975 | M
    1 | Honda Accord | 95 | 150K | LA | 3500 | A
    .8 | Toyota Camry | 94 | 100K | Malibu | 3500 | A
    .7 | Honda Accord | 94 | 60K | Irvine | 5000 | A
    .6 | Honda Accord | 94 | 70K | San Diego | 4500 | A
    Refined query: select * from user_car_catalog where model ~= Honda Accord, year >= 1994, price <= 4K, location ~= LA, mileage ~= 60K
    Refined results:
    Score | Model | Year | Mileage | Location | Price | Transmission
    .9 | Honda Accord | 94 | 90K | LA | 3975 | M
    .8 | Honda Accord | 94 | 60K | Irvine | 5000 | A
    .6 | Honda Accord | 94 | 70K | San Diego | 4500 | A
    .6 | Toyota Camry | 94 | 100K | Malibu | 3500 | A
    .5 | Honda Accord | 93 | 80K | San Diego | 4500 | A
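A rough sketch of how feedback could bring a new predicate such as mileage into the query. This is a simple re-weighting heuristic chosen for illustration, not the refinement method used in MARS: each predicate's weight is shifted by how much better it scores on tuples the user liked than on tuples the user rejected, then the weights are normalized. The feedback sets, similarity functions, and starting weights are all hypothetical.

```python
def price_sim(price, target=4000):
    """1.0 at or below the target price, decaying linearly above it."""
    if price <= target:
        return 1.0
    return max(0.0, 1.0 - (price - target) / target)

def mileage_sim(mileage_k, target=60):
    """1.0 at the target mileage (in thousands), decaying with distance."""
    return max(0.0, 1.0 - abs(mileage_k - target) / 100)

PREDICATES = {
    "price": lambda car: price_sim(car["price"]),
    "mileage": lambda car: mileage_sim(car["mileage_k"]),
}

def mean(values):
    return sum(values) / len(values)

def refine_weights(weights, relevant, non_relevant):
    """Shift each weight by the predicate's score gap between the two sets."""
    updated = {}
    for name, sim in PREDICATES.items():
        gap = mean([sim(c) for c in relevant]) - mean([sim(c) for c in non_relevant])
        updated[name] = max(0.0, weights[name] + gap)
    total = sum(updated.values())
    if total == 0:
        return dict(weights)  # no usable signal: keep the old weights
    return {name: w / total for name, w in updated.items()}

# Hypothetical feedback: the user likes low-mileage cars and rejects
# high-mileage ones, even though the rejected cars are cheaper.
relevant = [{"price": 5000, "mileage_k": 60}, {"price": 4500, "mileage_k": 70}]
non_relevant = [{"price": 3500, "mileage_k": 150}, {"price": 3500, "mileage_k": 100}]

initial = {"price": 1.0, "mileage": 0.0}  # mileage is not in the first query
refined = refine_weights(initial, relevant, non_relevant)
print(refined)
```

After one round of feedback the mileage predicate receives a non-zero weight, which corresponds to the refined query gaining a mileage condition.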
    61. 61. Why are SR and QR important? <ul><li>Most queries are similarity searches </li></ul><ul><ul><li>Especially in exploratory data analysis tasks (e.g., catalog search) </li></ul></ul><ul><ul><li>Users have only a partial idea of their information need </li></ul></ul><ul><li>Existing search technologies (text retrieval, SQL) do not provide appropriate support for SR and (almost) no support for QR. </li></ul><ul><ul><li>Users must artificially convert similarity queries to keyword searches or exact-match queries </li></ul></ul><ul><ul><li>Good mappings are difficult or not feasible </li></ul></ul><ul><ul><ul><li>Lack of good knowledge of the underlying data or its structure </li></ul></ul></ul><ul><ul><ul><li>Exact match may be meaningless for certain data types (e.g., images, text, multimedia) </li></ul></ul></ul>
    62. 62. Similarity Access and Interactive Mining Architecture [Figure: architecture diagram. Components: Search Client, Query Session Manager, Query Log Manager/Miner, Similarity Query Processor, Refinement Manager, Feedback-based Refinement Method, History-based Refinement Method, ORDBMS, Similarity Operators, Types, Database, Feedback Table, Answer Table, Scores Table, Query Log. Edge labels: Query/Feedback, Ranked Results, Initial Query, Feedback, Query Results, Schemes, Ranking Rules. Legend: dashed line = logging, solid line = process]
    63. 63. MARS Challenges... <ul><li>Learning queries from </li></ul><ul><ul><li>user interactions </li></ul></ul><ul><ul><li>user profiles </li></ul></ul><ul><ul><li>past history of other users </li></ul></ul><ul><li>Efficient implementation of </li></ul><ul><ul><li>similarity queries </li></ul></ul><ul><ul><li>refined queries </li></ul></ul><ul><li>Role of similarity queries in </li></ul><ul><ul><li>OLAP </li></ul></ul><ul><ul><li>interactive data mining </li></ul></ul>
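    64. 64. <ul><li>Query-Session Manager </li></ul><ul><ul><li>parses the query </li></ul></ul><ul><ul><li>checks query validity </li></ul></ul><ul><ul><li>generates schema for support tables </li></ul></ul><ul><ul><li>maintains sessions registry </li></ul></ul><ul><li>Similarity Query Processor </li></ul><ul><ul><li>executes query on ORDBMS </li></ul></ul><ul><ul><li>ranks results (e.g., can exclude already seen tuples) </li></ul></ul><ul><ul><li>logs query (query or Top-k) </li></ul></ul><ul><li>Refinement Manager </li></ul><ul><ul><li>maintains a registry of query refinement policies (content/collaborative) </li></ul></ul><ul><ul><li>generates the scores table </li></ul></ul><ul><ul><li>identifies and invokes intra-predicate refiners </li></ul></ul><ul><li>Query Log Manager/Miner </li></ul><ul><ul><li>maintains query log </li></ul></ul><ul><ul><ul><li>Initial-Final pair </li></ul></ul></ul><ul><ul><ul><li>Top-K results </li></ul></ul></ul><ul><ul><ul><li>Complete trajectory </li></ul></ul></ul><ul><ul><li>Query-query similarity (can have multiple policies) </li></ul></ul><ul><ul><li>Query clustering </li></ul></ul>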
    65. 65. Text Search Technologies (Altavista, Verity, Vality, Infoseek) <ul><li>Strengths </li></ul><ul><ul><li>support ranked retrieval </li></ul></ul><ul><ul><li>can handle missing data, synonyms, data entry errors </li></ul></ul><ul><li>Approach </li></ul><ul><ul><li>convert enterprise structured data into a searchable text index </li></ul></ul><ul><li>Limitations </li></ul><ul><ul><li>cannot capture semantics of relationships in data </li></ul></ul><ul><ul><li>cannot capture semantics of non-text fields (e.g., multimedia) </li></ul></ul><ul><ul><li>limited support for refinement or preferences in current systems </li></ul></ul><ul><ul><li>cannot express similarity queries over structured or semi-structured data (e.g., price, location) </li></ul></ul>Example queries: “Al Pacino acted in a movie directed by Francis Ford Coppola” (over Movies, Actors, Directors); “Honda Accord near LA approx. $4000”
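    66. 66. SQL-based Search Technologies (Oracle, Informix, DB2, Mercado) <ul><li>Approach </li></ul><ul><ul><li>translate similarity query into exact SQL query </li></ul></ul><ul><li>Strengths </li></ul><ul><ul><li>support structured as well as semi-structured data </li></ul></ul><ul><ul><li>support for arbitrary data types </li></ul></ul><ul><ul><li>scalable attribute-based lookup </li></ul></ul><ul><li>Limitations </li></ul><ul><ul><li>translation is difficult or not possible </li></ul></ul><ul><ul><li>difficult to guess the right ranges, which causes near misses </li></ul></ul><ul><ul><li>not feasible for non-numeric fields </li></ul></ul><ul><ul><li>cannot rank answers based on relevance </li></ul></ul><ul><ul><li>does not account for user preference or query refinement </li></ul></ul>Example: the similarity query “1994 Honda Accord near LA approx. $4000” becomes: select * from user_car_catalog where model = Honda Accord and 1993 ≤ year ≤ 1995 and dist(90210) ≤ 50 and price < 5000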
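The near-miss limitation can be reproduced with a small in-memory database. This sketch uses Python's built-in `sqlite3` module and a handful of rows loosely based on the used car example. The table layout is an assumption, and because SQLite has no `dist()` function, the distance condition is approximated by an explicit list of locations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table user_car_catalog "
    "(model text, year integer, mileage integer, location text, price integer)"
)
conn.executemany(
    "insert into user_car_catalog values (?, ?, ?, ?, ?)",
    [
        ("Honda Accord", 1994, 90000, "LA", 3975),
        ("Honda Accord", 1995, 150000, "LA", 3500),
        ("Honda Accord", 1994, 60000, "Irvine", 5000),  # low mileage, just over budget
        ("Honda Accord", 1998, 30000, "LA", 6500),
    ],
)

# Exact translation of "1994 Honda Accord near LA approx. $4000".
# The location list stands in for dist(90210) <= 50 (an assumption).
exact = conn.execute(
    "select model, year, location, price from user_car_catalog "
    "where model = ? and year between 1993 and 1995 "
    "and location in ('LA', 'Irvine') and price < 5000 "
    "order by price",
    ("Honda Accord",),
).fetchall()

for row in exact:
    print(row)
conn.close()
```

The Irvine car priced at exactly 5000 is excluded by `price < 5000` even though a buyer might consider it a good match, and the two rows that are returned carry no relevance score.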