Autonomics and Data Management

  1. 1. Autonomics and Data Management Norman Paton University of Manchester
  2. 2. Hypothesis <ul><ul><li>If database management systems are to be effective in an increasing range of challenging environments, such as grids, then automation will have to follow them into these new settings. </li></ul></ul>
  3. 3. Outline <ul><li>Existing examples of automation. </li></ul><ul><li>Limitations in current practice. </li></ul><ul><li>Opportunities presented by ubiquitous automation. </li></ul>
  4. 4. Outline <ul><li>Existing examples of automation: </li></ul><ul><ul><li>Database administration. </li></ul></ul><ul><ul><li>Query processing. </li></ul></ul><ul><ul><li>Data integration. </li></ul></ul><ul><li>Limitations in current practice. </li></ul><ul><li>Opportunities presented by ubiquitous automation. </li></ul>
  5. 5. Example: Database Administration <ul><li>Database administration involves setting values for a lot of controls: </li></ul><ul><ul><li>Where to put indexes. </li></ul></ul><ul><ul><li>What views to materialise. </li></ul></ul><ul><ul><li>How to allocate memory. </li></ul></ul><ul><ul><li>Maximum number of concurrent transactions. </li></ul></ul><ul><ul><li>Which disks to place data on. </li></ul></ul><ul><ul><li>Which statistics to maintain. </li></ul></ul><ul><ul><li>How often to refresh statistics. </li></ul></ul><ul><ul><li>Which transaction isolation level to use. </li></ul></ul><ul><li>Autonomic database administration may set any of these automatically. </li></ul>
  6. 6. Multiprogramming Level <ul><li>The multiprogramming level (MPL) is the maximum number of transactions that may run concurrently. </li></ul><ul><li>Problem: excessive lock conflicts may lead to thrashing, either through deadlocks or significant amounts of blocking. </li></ul><ul><li>Setting the MPL: </li></ul><ul><ul><li>If too high, then risk of thrashing. </li></ul></ul><ul><ul><li>If too low, then too many jobs waiting in queue. </li></ul></ul><ul><li>The risk of thrashing at a given MPL depends on the update intensity of the transactions. </li></ul><ul><li>G. Weikum, A. Mönkeberg, C. Hasse, P. Zabback: Self-tuning Database Technology and Information Services: from Wishful Thinking to Viable Engineering. VLDB 2002: 20-3. </li></ul>
  7. 7. Automating the Setting of MPL – 1 <ul><li>Observation: </li></ul><ul><ul><li>Want to set the MPL as high as possible, but not too high! </li></ul></ul><ul><ul><li>Identify a property that indicates that there is a high risk of conflicts. </li></ul></ul><ul><ul><li>Conflict ratio: </li></ul></ul><ul><ul><ul><li>(# locks held by all transactions / # locks held by non-blocked transactions) </li></ul></ul></ul><ul><ul><ul><li>Experimental and analytical studies indicated that a level of 1.3 or more means there is a high risk of thrashing. </li></ul></ul></ul>
  8. 8. Automating the Setting of MPL – 2 <ul><li>Monitoring: </li></ul><ul><ul><li>Number of active transactions. </li></ul></ul><ul><ul><li>Number of blocked transactions. </li></ul></ul><ul><li>Assessment: </li></ul><ul><ul><li>Conflict ratio exceeds 1.3. </li></ul></ul><ul><li>Response: </li></ul><ul><ul><li>Transaction admission policy: </li></ul></ul><ul><ul><ul><li>Block admission of new transactions from queue. </li></ul></ul></ul><ul><ul><li>Transaction cancellation policy: </li></ul></ul><ul><ul><ul><li>Cancel one or more blocking transactions. </li></ul></ul></ul>
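The monitor, assess and respond steps for the MPL can be sketched in a few lines of Python. This is a minimal illustration and not the implementation from the cited paper: the `Txn` record, the function names and the handling of the edge cases (no locks held, all lock holders blocked) are assumptions made for the example. Only the conflict-ratio formula and the 1.3 threshold come from the slides.

```python
from dataclasses import dataclass

CRITICAL_CONFLICT_RATIO = 1.3  # threshold quoted on the slides


@dataclass
class Txn:
    """Hypothetical transaction record: lock count and blocked flag."""
    locks_held: int
    blocked: bool


def conflict_ratio(txns):
    """Locks held by all transactions / locks held by non-blocked ones."""
    total = sum(t.locks_held for t in txns)
    running = sum(t.locks_held for t in txns if not t.blocked)
    if total == 0:
        return 1.0  # assumption: no locks held means no conflicts
    if running == 0:
        return float("inf")  # assumption: every lock holder is blocked
    return total / running


def admit_new_transaction(txns):
    """Admission policy: admit from the queue only below the threshold."""
    return conflict_ratio(txns) < CRITICAL_CONFLICT_RATIO


if __name__ == "__main__":
    light = [Txn(10, False), Txn(2, True)]
    heavy = [Txn(10, False), Txn(5, True)]
    print(conflict_ratio(light), admit_new_transaction(light))
    print(conflict_ratio(heavy), admit_new_transaction(heavy))
```

A cancellation policy would use the same ratio but pick a blocking transaction to abort; that choice is left out here because the slides do not say how the victim is selected.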
  9. 9. Example: Query Evaluation <ul><li>Query optimization involves making lots of decisions: </li></ul><ul><ul><li>Which operators to use. </li></ul></ul><ul><ul><li>What order to evaluate the operators in. </li></ul></ul><ul><ul><li>What parallelism level to use. </li></ul></ul><ul><ul><li>How to allocate work to parallel nodes. </li></ul></ul><ul><li>Adaptive query processing may revise any of the decisions made by a query optimizer during query evaluation. </li></ul>
  10. 10. Adaptation for Load Balancing <ul><li>In partitioned parallelism, a task is divided into subtasks that are run in parallel on different nodes. </li></ul><ul><li>For a join, $A \bowtie B$ is represented as the union of the results of plan fragments $F_i = A_i \bowtie B_i$, for $i = 1, \dots, P$, where $P$ is the level of parallelism. </li></ul><ul><li>The time taken to evaluate the join is $\max_{i = 1, \dots, P} \mathrm{evaluation\_time}(F_i)$. </li></ul><ul><li>As a result, any delay in completing a fragment $F_i$ delays the completion of the operator, so it is crucial to match fragment size to node capabilities. </li></ul><ul><li>Many join algorithms have state; as such, changing the size of a fragment allocated to a machine involves replicating or relocating operator state. </li></ul>
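A small Python sketch shows why fragment size should match node capability. It assumes, purely for illustration, that a fragment's evaluation time is its tuple count divided by a per-node processing rate; the function names and the numbers are hypothetical.

```python
def join_completion_time(fragment_sizes, node_speeds):
    """Time of a partitioned join = time of its slowest fragment."""
    return max(size / speed for size, speed in zip(fragment_sizes, node_speeds))


def proportional_allocation(total_tuples, node_speeds):
    """Split the work in proportion to node speed (tuples per second)."""
    total_speed = sum(node_speeds)
    return [total_tuples * speed / total_speed for speed in node_speeds]


if __name__ == "__main__":
    speeds = [100, 100, 50]          # third node is half as fast
    even = [300, 300, 300]           # equal-sized fragments
    matched = proportional_allocation(900, speeds)
    print(join_completion_time(even, speeds))     # slow node dominates
    print(join_completion_time(matched, speeds))  # all nodes finish together
```

With equal fragments the slow node sets the completion time; with the proportional split every fragment finishes at the same time, which is the balance the adaptive strategies on the following slides try to restore at runtime.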
  11. 11. Load Balancing: Flux <ul><li>When load imbalance is detected: </li></ul><ul><ul><li>Halt query execution. </li></ul></ul><ul><ul><li>Compute new distribution policy (dp). </li></ul></ul><ul><ul><li>Update hash tables by transferring data between nodes. </li></ul></ul><ul><ul><li>Update dp in parent exchange nodes. </li></ul></ul><ul><ul><li>Resume query execution. </li></ul></ul><ul><li>M. Shah, J.M. Hellerstein, S. Chandrasekaran, M.J. Franklin, Flux: An Adaptive Partitioning Operator for Continuous Query Systems. 25-36, ICDE 2003. </li></ul>[Figure labels: Scan(A); Join(A_1, B_1); Join(A_2, B_2); Hash table A_1; Hash table A_2; dp.]
  12. 12. Load Balance: Dynamic Hashing <ul><li>Hash table build time: </li></ul><ul><ul><li>Partition every hash table bucket over three randomly chosen nodes. </li></ul></ul><ul><ul><li>Store every tuple in the 2 most lightly loaded of the 3 nodes. </li></ul></ul><ul><li>Hash table probe time: </li></ul><ul><ul><li>Probe the 2 hash tables on the 2 most lightly loaded nodes storing the bucket, the primary being that with the lightest load, and the secondary that with the second lightest load. </li></ul></ul><ul><ul><li>Match tuples on the primary node unless a matching tuple is available only on the secondary node. </li></ul></ul><ul><li>N.W. Paton, V. Raman, G. Swart, I. Narang, Autonomic Query Parallelization using Non-dedicated Computers: An Evaluation of Adaptivity Options, Proc. ICAC, 221-230, 2006. </li></ul>[Figure labels: Scan(A); Scan(B); Join(A_1, B_1) with hash table A_1; Join(A_2, B_2) with hash table A_2; Join(A_3, B_3) with hash table A_3.]
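The build-time and probe-time choices can be sketched as follows. This is a hedged illustration only: the function names, the `load` dictionary (node name to current load) and the seeded random choice are assumptions for the example, not details taken from the cited paper.

```python
import random


def candidate_nodes(bucket, all_nodes, seed=0):
    """Pick three nodes for a bucket; seeded so the choice is repeatable."""
    rng = random.Random(f"{seed}:{bucket}")
    return rng.sample(all_nodes, 3)


def place_tuple(candidates, load):
    """Build time: store the tuple on the 2 most lightly loaded candidates."""
    return sorted(candidates, key=lambda node: load[node])[:2]


def choose_probe_nodes(holders, load):
    """Probe time: return (primary, secondary), lightest load first."""
    primary, secondary = sorted(holders, key=lambda node: load[node])[:2]
    return primary, secondary


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3", "n4"]
    load = {"n1": 5, "n2": 1, "n3": 3, "n4": 0}
    candidates = candidate_nodes(bucket=7, all_nodes=nodes)
    print(candidates, place_tuple(candidates, load))
    print(choose_probe_nodes(["n1", "n3"], load))
```

Because every tuple lives on two nodes, the probe side can always route around one overloaded node without moving any state, at the cost of building each hash table entry twice.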
  13. 13. Load Balance: Redundant Work <ul><li>Adapt retrospectively: when a plan fragment $A_i \bowtie B_i$ is late completing, start evaluating a redundant version of the fragment. </li></ul><ul><li>Typically, for parallelism level $P$, assuming a perfect hash function and no skew, each node will join $|A|/P$ tuples to $|B|/P$ tuples. </li></ul><ul><li>However, this leads to tight dependencies between successive joins. </li></ul>[Figure labels: Scan(A); Scan(B); Join(A_1, B_1); Join(A_2, B_2); Join(C_1, Join(A,B)_1); Join(C_2, Join(A,B)_2).]
  14. 14. Load Balance: Redundant Work <ul><li>An alternative distribution strategy allocates vertical slices of the plan to a single node. </li></ul><ul><li>For this to work, the workload allocated to each node must be larger than that allocated in the case of exchange. </li></ul><ul><li>For example, if there are 2 nodes, then for $C \bowtie (A \bowtie B)$, one option is to join $|A|$ tuples to $|B|/2$ tuples on each node, with the result joined with all of $C$. </li></ul><ul><li>Whenever a fragment is late completing, a redundant version is started. </li></ul>V. Raman, W. Han, Inderpal Narang: Parallel Querying with Non-Dedicated Computers. 61-72, VLDB 2005. [Figure labels: Scan(A); Scan(B); Join(A, B_1); Join(A, B_2); Join(C, Join(A, B_1)); Join(C, Join(A, B_2)).]
  15. 15. Example: Data Integration <ul><li>Data integration involves assembling information about the relationships between sources: </li></ul><ul><ul><li>What sources there are. </li></ul></ul><ul><ul><li>The services provided by each source. </li></ul></ul><ul><ul><li>The concepts represented in each source. </li></ul></ul><ul><ul><li>How the data is represented. </li></ul></ul><ul><ul><li>What relationships there are between extents. </li></ul></ul><ul><ul><li>What mappings exist between source data types. </li></ul></ul><ul><li>Autonomic data integration involves inferring some of the above information. </li></ul>
  16. 16. Inferring Web Service Annotations <ul><li>Web service annotations are useful for: </li></ul><ul><ul><li>discovering services. </li></ul></ul><ul><ul><li>composing workflows. </li></ul></ul><ul><ul><li>characterising and identifying mismatches. </li></ul></ul><ul><li>However, service annotation is expensive, because it requires: </li></ul><ul><ul><li>knowledge of the ontology used for annotation. </li></ul></ul><ul><ul><li>knowledge of the web services to be annotated. </li></ul></ul><ul><li>(Semi)automatic annotation can be carried out using: </li></ul><ul><ul><li>schema matching and text classification techniques. </li></ul></ul><ul><ul><li>workflow specifications. </li></ul></ul><ul><ul><li>K. Belhajjame, S.M. Embury, N.W. Paton, R. Stevens and C.A. Goble, Automatic Annotation of Web Services Based on Workflow Definitions, Proc. 5th Intl. Semantic Web Conference, Springer, 116-129, 2006. </li></ul></ul>
  17. 17. Inferring Web Service Annotations <ul><li>Use workflows to infer information about the semantics of linked parameters: </li></ul>
  18. 18. Summary on Examples of Automation <ul><li>Data management and integration are complex, with many possibilities to benefit from automation. </li></ul><ul><li>Automation has been applied in many different settings, with many worthwhile results. </li></ul><ul><li>The diversity in approaches to and technologies associated with automation is great. </li></ul>
  19. 19. Outline <ul><li>Existing examples of automation. </li></ul><ul><li>Limitations in current practice. </li></ul><ul><li>Opportunities presented by ubiquitous automation. </li></ul>
  20. 20. Outline <ul><li>Existing examples of automation. </li></ul><ul><li>Limitations in current practice: </li></ul><ul><ul><li>Predictability. </li></ul></ul><ul><ul><li>Methodology. </li></ul></ul><ul><ul><li>Composability. </li></ul></ul><ul><ul><li>Semantics. </li></ul></ul><ul><li>Opportunities presented by ubiquitous automation. </li></ul>
  21. 21. Limitations: Predictability <ul><li>Adaptive systems change system behaviour in response to runtime feedback. Risks include: </li></ul><ul><ul><li>Reacting too quickly in response to temporary effects. </li></ul></ul><ul><ul><li>Reacting too slowly to be effective. </li></ul></ul><ul><ul><li>Reacting in a way that makes things worse. </li></ul></ul><ul><li>It can be difficult for developers of adaptive systems to predict how effective their proposals might be. </li></ul><ul><li>It sometimes takes several attempts to refine an adaptive strategy. </li></ul>
  22. 22. Adaptive Load Balancing: Comparison <ul><li>Several existing strategies were compared, across a range of environmental conditions. </li></ul><ul><li>Conditions could be identified in which all of the proposals were worse than not adapting. </li></ul><ul><li>Published evaluations of the existing proposals gave no indication of problematic cases. </li></ul><ul><li>Several of the developers did not know under which circumstances their approaches performed poorly. </li></ul><ul><li>N.W. Paton, V. Raman, G. Swart, I. Narang, Autonomic Query Parallelization using Non-dedicated Computers: An Evaluation of Adaptivity Options, Proc. ICAC , 221-230, 2006. </li></ul>
  23. 23. Adaptive Load Balancing: Experiment <ul><li>Query: </li></ul><ul><ul><li>P ⋈ PS (P has 200,000 tuples, PS has 800,000 tuples). </li></ul></ul><ul><ul><li>Simulation of parallel run on three nodes. </li></ul></ul><ul><li>Types of imbalance: </li></ul><ul><ul><li>Constant: A consistent external load exists on one of the nodes throughout the experiment. The level of the external load represents the number of external tasks that are seeking to make full-time use of the machine. </li></ul></ul><ul><ul><li>Periodic: The load on one of the machines comes and goes during the experiment. The duration of the load indicates how long each load spike lasts, and the repeat duration represents the gap between load spikes. </li></ul></ul>
  24. 24. Results: Constant Imbalance
  25. 25. Periodic Imbalance (1s)
  26. 26. Designing Adaptive Strategies <ul><li>Overheads: pessimistic strategies carry out additional work on the assumption that things will go wrong (e.g. replicating data). </li></ul><ul><li>Adaptation costs: optimistic strategies evaluate queries as normal, but may pay a high price to carry out specific adaptations when required. </li></ul>[Figure: strategies Adapt-1 to Adapt-5 positioned along Overheads and Adaptation Cost axes.]
  27. 27. Limitations: Methodology <ul><li>Adaptive data management proposals are generally described as specific algorithms or techniques: </li></ul><ul><ul><li>It is often not clear what methodology has been followed in their development. </li></ul></ul><ul><ul><li>It is not necessarily clear if there are well established techniques that could have been used to direct their design. </li></ul></ul><ul><li>Approaches that have been applied in the design of adaptive systems include: </li></ul><ul><ul><li>Systematic functional decomposition. </li></ul></ul><ul><ul><li>Control theory. </li></ul></ul>
  28. 28. Autonomic Computing Architecture <ul><li>Autonomic systems typically involve a control loop, with monitoring information driving planning and decision making. </li></ul><ul><li>IBM’s Autonomic Computing Toolkit provides components that implement a functional decomposition known as MAPE (Monitor, Analyze, Plan and Execute). </li></ul><ul><li>The toolkit provides implementations for several of the components (in particular Monitor and Analyze ). </li></ul>J.O. Kephart, D.M. Chess, The Vision of Autonomic Computing, IEEE Computer, 36(1), 41-50, 2003.
  29. 29. Data Management and MAPE <ul><li>Sensors: what monitoring information should a database platform expose to enable effective decision making? </li></ul><ul><li>Effectors: what hooks should a database platform expose to enable effective runtime modification? </li></ul><ul><li>It is not straightforward: </li></ul><ul><ul><li>to retrofit sensing and effecting functionality. </li></ul></ul><ul><ul><li>to predict what may be required. </li></ul></ul><ul><li>The Monitor, Analyze, Plan and Execute components may each be implemented in different ways. </li></ul><ul><li>Generic monitoring components have been proposed for tracking query progress and for adaptation: </li></ul><ul><ul><li>A. Gounaris, N. Paton, A. Fernandes, R. Sakellariou, Self-Monitoring Query Execution for Adaptive Query Processing, Data and Knowledge Eng., 51(3), 325-348, 2004. </li></ul></ul><ul><ul><li>L. Luo, J. Naughton, C. Ellmann, M. Watzke, Towards a progress indicator for database queries, SIGMOD, 791-802, 2004. </li></ul></ul>
  30. 30. Monitoring Query Progress <ul><li>Progress monitoring predicts properties of an operator incrementally from monitored data. </li></ul><ul><li>Raw monitoring data may count the number of tuples returned by an operator, the average tuple size, etc. </li></ul><ul><li>From such information, operator selectivity, result size and runtime can be estimated. </li></ul><ul><li>Unnest, with selectivity $\sigma$: </li></ul><ul><ul><li>$\sigma = n_{out} / n_{in}$ </li></ul></ul><ul><ul><li>$\mathrm{cardinality} = \mathrm{cardinality}_{operand} \cdot \sigma$ </li></ul></ul><ul><ul><li>$\mathrm{size} = \mathrm{cardinality}_{operand} \cdot \sigma \cdot \mathrm{avg}(\mathrm{size}_{result\_tuple})$ </li></ul></ul><ul><ul><li>$\mathrm{time} = \mathrm{cardinality}_{operand} \cdot \sigma \cdot \mathrm{tuple\_build\_cost}$ </li></ul></ul>
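The unnest estimates translate directly into code. The sketch below assumes that the monitor supplies tuple counts seen so far (`n_in`, `n_out`), the expected operand cardinality, the average result tuple size and a per-tuple build cost; the function name, the argument names and the zero-input guard are illustrative choices rather than part of the cited monitoring frameworks.

```python
def estimate_unnest(n_in, n_out, operand_cardinality,
                    avg_result_tuple_size, tuple_build_cost):
    """Estimate selectivity, result cardinality, size and time for unnest."""
    if n_in == 0:
        raise ValueError("no input tuples observed yet")
    selectivity = n_out / n_in                      # observed so far
    cardinality = operand_cardinality * selectivity  # projected result tuples
    return {
        "selectivity": selectivity,
        "cardinality": cardinality,
        "size": cardinality * avg_result_tuple_size,
        "time": cardinality * tuple_build_cost,
    }


if __name__ == "__main__":
    # 100 input tuples seen so far produced 250 output tuples.
    print(estimate_unnest(n_in=100, n_out=250, operand_cardinality=1000,
                          avg_result_tuple_size=40, tuple_build_cost=0.002))
```

Re-running the estimate as more tuples are observed refines the prediction incrementally, which is what lets an adaptive query processor notice that an optimizer estimate was wrong while the query is still running.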
  31. 31. Building Adaptive Databases <ul><li>Most adaptive database extensions involve hard coding changes to the existing code base. </li></ul><ul><ul><li>Complex core infrastructure subject to intrusive changes. </li></ul></ul><ul><ul><li>Steep learning curve for developers of adaptive extensions. </li></ul></ul><ul><ul><li>Incremental changes result in reduced reuse. </li></ul></ul><ul><li>With respect to MAPE: </li></ul><ul><ul><li>Growing experience with generic monitoring. </li></ul></ul><ul><ul><li>Considerable diversity in Analyze , Plan and Execute . </li></ul></ul><ul><ul><li>Control theory provides some insights into decision making. </li></ul></ul>
  32. 32. Control Theory <ul><li>Provides a systematic framework for computing a change to an input given a measured output . </li></ul><ul><li>Designs seek to exhibit SASO properties: </li></ul><ul><ul><li>S table: bounded input gives bounded output. </li></ul></ul><ul><ul><li>A ccurate: measured output converges on desired value. </li></ul></ul><ul><ul><li>S hort Settling: converges to stable value quickly. </li></ul></ul><ul><ul><li>No O vershoot: achieves objectives in a steady manner. </li></ul></ul><ul><li>Either find a control engineer, learn the book, or apply a well established model. </li></ul><ul><ul><li>J.L. Hellerstein, Y. Diao, S. Parakh, D.M. Tilbury, Feedback Control of Computing Systems, Wiley, 2004. </li></ul></ul>
  33. 33. Control Theory: PID Controllers Source: http://en.wikipedia.org/wiki/PID_control
  34. 34. PID Controllers Example <ul><li>Task: evaluating queries from a queue over a server. </li></ul><ul><li>Objective: keep all query evaluation in memory to avoid use of multi-pass algorithms. </li></ul><ul><li>Goal for controller: keep the amount of free memory at 512 MB to ensure that this condition is met. </li></ul><ul><li>Control parameter: multiprogramming level. </li></ul>
  35. 35. Proportional Controller <ul><li>Terminology: </li></ul><ul><ul><li>$m$: output signal. </li></ul></ul><ul><ul><li>$K_p$: proportional gain. </li></ul></ul><ul><ul><li>$e$: error. </li></ul></ul><ul><li>Definition: $m = K_p e$. </li></ul><ul><li>Query processing example: </li></ul><ul><ul><li>$m$: change in multiprogramming level. </li></ul></ul><ul><ul><li>$e$: (amount of free memory - 512 MB). </li></ul></ul><ul><ul><li>$K_p$: 1/(job size in MB), assumed to be 0.01 for 100 MB jobs. </li></ul></ul>
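  36. 36. Proportional Controller: Example <table><tr><th>e: Error</th><th>m: Multiprogramming Level Change</th></tr><tr><td>1024</td><td>10.24</td></tr><tr><td>512</td><td>5.12</td></tr><tr><td>256</td><td>2.56</td></tr><tr><td>0</td><td>0</td></tr><tr><td>-256</td><td>-2.56</td></tr><tr><td>-512</td><td>-5.12</td></tr><tr><td>-1024</td><td>-10.24</td></tr></table>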
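The proportional controller from the example is a single multiplication. The following sketch uses the values from the slides (a 512 MB free-memory target and a gain of 0.01 for 100 MB jobs); the function and constant names are illustrative.

```python
TARGET_FREE_MB = 512  # desired amount of free memory
KP = 0.01             # proportional gain: 1 / job size, for 100 MB jobs


def mpl_change(free_memory_mb, kp=KP, target_mb=TARGET_FREE_MB):
    """Proportional control: m = Kp * e, with e = free memory - target."""
    error = free_memory_mb - target_mb
    return kp * error


if __name__ == "__main__":
    # Reproduce the example values for errors from -1024 to 1024.
    for error in (1024, 512, 256, 0, -256, -512, -1024):
        change = mpl_change(TARGET_FREE_MB + error)
        print(f"e = {error:6d}  m = {change:6.2f}")
```

A positive result means there is spare memory and more queries can be admitted; a negative result means the multiprogramming level should drop. A real controller would also have to round the change to a whole number of jobs, which the slides do not discuss.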
  37. 37. Integral and Derivative Controllers <ul><li>Integral Controller: </li></ul><ul><ul><li>Controller output depends on level and duration of error. </li></ul></ul><ul><ul><li>$K_i$: integral gain. </li></ul></ul><ul><ul><li>$T_i$: integral time. </li></ul></ul><ul><ul><li>Definition: [equation not preserved] </li></ul></ul><ul><li>Derivative Controller: </li></ul><ul><ul><li>Controller output depends on rate of change of error. </li></ul></ul><ul><ul><li>$K_d$: derivative gain. </li></ul></ul><ul><ul><li>$T_d$: derivative time. </li></ul></ul><ul><ul><li>Definition: [equation not preserved] </li></ul></ul>
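To show how the integral and derivative terms combine with the proportional term, here is a minimal discrete-time PID sketch. It uses the common textbook form (gain times error, gain times accumulated error, gain times rate of change of error) and is an assumption for illustration: the slide's own definitions, which also involve integral and derivative times, are not available in this transcript, so this may differ from them in parameterisation.

```python
class PIDController:
    """Textbook discrete PID: m = Kp*e + Ki*sum(e*dt) + Kd*(de/dt)."""

    def __init__(self, kp, ki, kd):
        self.kp = kp
        self.ki = ki
        self.kd = kd
        self.integral = 0.0      # accumulated error over time
        self.prev_error = None   # no derivative on the first sample

    def update(self, error, dt=1.0):
        """Return the control output for the latest error sample."""
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return (self.kp * error
                + self.ki * self.integral
                + self.kd * derivative)


if __name__ == "__main__":
    pid = PIDController(kp=1.0, ki=0.5, kd=2.0)
    print(pid.update(4.0))  # first sample: no derivative term
    print(pid.update(2.0))  # error falling: derivative term is negative
```

The integral term keeps pushing while any error persists, which removes steady-state error; the derivative term damps the response when the error is already shrinking, which helps with the "no overshoot" property listed under SASO.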
  38. 38. Control Theory for Data Management <ul><li>There are currently rather few examples of control theory being used in data management. Recent example in grid query processing: </li></ul><ul><ul><li>Anastasios Gounaris, Christos Yfoulis, Rizos Sakellariou and Marios Dikaiakos, Self-optimizing Block Transfer in Web Service Grids, WIDM, 2007. </li></ul></ul><ul><li>Modelling the relationship between measured values and controlled inputs can be challenging. </li></ul><ul><li>Many adaptive data management techniques change more than an input parameter. For example: </li></ul><ul><ul><li>A query may be reoptimized by an adaptive query processor. </li></ul></ul>
  39. 39. Limitations: Composability <ul><li>Many proposals for autonomic data management focus on specific adaptations: </li></ul><ul><ul><li>Selecting views for materialization. </li></ul></ul><ul><ul><li>Selecting data for replication. </li></ul></ul><ul><ul><li>Selecting fields for indexing. </li></ul></ul><ul><ul><li>Allocation of memory to functions. </li></ul></ul><ul><li>… however, such decisions are often inter-related, and modelling the inter-relationships between such strategies is challenging. </li></ul>
  40. 40. Query Processing Inter-Dependency <ul><li>Load imbalance results from inappropriate allocation of work to resources in partitioned parallelism. </li></ul><ul><li>Bottlenecks result from inappropriate allocation of work to resources in pipelined parallelism. </li></ul><ul><li>There is no benefit from resolving load imbalance if the bottleneck is elsewhere in the plan. </li></ul><ul><li>Resolving load imbalance may change the location of the bottleneck. </li></ul>[Figure: a plan of join operators over A, B and C with a coordinator, annotated "Change Allocation" and "Remove Bottleneck".]
  41. 41. Limitations: Semantics <ul><li>Property guarantees: </li></ul><ul><ul><li>Autonomic systems change behaviour mid-task. </li></ul></ul><ul><ul><li>Non-trivial adaptations may leave uncertainty as to whether an adaptation is meaning-preserving. </li></ul></ul><ul><ul><li>Few adaptations have had their meaning-preserving properties proved: </li></ul></ul><ul><ul><ul><li>K. Eurviriyanukul, A. Fernandes, N. Paton, A Foundation for the Replacement of Pipelined Physical Join Operators in Adaptive Query Processing, EDBT Workshops, 589-600, 2006. </li></ul></ul></ul>
  42. 42. Limitations: Semantics <ul><li>Performance guarantees: </li></ul><ul><ul><li>Autonomic behaviour may take certain risks with performance. </li></ul></ul><ul><ul><li>Some proposals may redo work, leading to the need for thresholds to remove the risk of continuous reoptimization: </li></ul></ul><ul><ul><ul><li>V. Markl, V. Raman, D. Simmen, G. Lohman, H. Pirahesh: Robust Query Processing through Progressive Optimization. SIGMOD Conference 2004: 659-67. </li></ul></ul></ul><ul><ul><li>Some algorithms provide bounded worst case performance: </li></ul></ul><ul><ul><ul><li>Daniel M. Yellin: Competitive algorithms for the dynamic selection of component implementations. IBM Systems Journal 42(1): 85-97 (2003). </li></ul></ul></ul>
  43. 43. Summary on Limitations of Automation <ul><li>Automation is currently partial in scope and often ad hoc in development. </li></ul><ul><li>Automation is a second class citizen in data management; there is interest in the benefits it can bring but not so much in automation per se . </li></ul><ul><li>As a result, automation in data management can be seen as immature, with considerable scope for improving the predictability, composability and clarity of proposals through enhanced methodologies. </li></ul>
  44. 44. Outline <ul><li>Existing examples of automation. </li></ul><ul><li>Limitations in current practice. </li></ul><ul><li>Opportunities presented by ubiquitous automation. </li></ul>
  45. 45. Outline <ul><li>Existing examples of automation. </li></ul><ul><li>Limitations in current practice. </li></ul><ul><li>Opportunities presented by ubiquitous automation: </li></ul><ul><ul><li>Increasing manageability of database technologies. </li></ul></ul><ul><ul><li>Extending the reach of database technologies. </li></ul></ul>
  46. 46. Increasing Manageability - 1 <ul><li>Database products: </li></ul><ul><ul><li>Commercial database systems typically have a high total cost of ownership, resulting in large measure from high administrative costs. </li></ul></ul><ul><ul><li>Vendors are seeking to improve competitiveness by automating or supporting management of their intrinsically complex products. </li></ul></ul><ul><li>Data management components: </li></ul><ul><ul><li>It has been suggested that current database products are too complex, and that more data should be managed by lighter weight components. </li></ul></ul><ul><ul><li>As yet, there is little evidence that light-weight data management components are being designed with automation in mind, but this is perhaps a practical proposition. </li></ul></ul>
  47. 47. Increasing Manageability - 2 <ul><li>There are increasing needs to manage personal data, and data management within workgroups or laboratories is often hindered by the complexity of current data management platforms. </li></ul><ul><li>Personal and workgroup data management often has evolving requirements, but rarely needs the full range of capabilities of current database products. </li></ul><ul><li>Proposals in this space: </li></ul><ul><ul><li>Data services: I. Subasu, P. Ziegler, K. Dittrich: Towards Service-Based Database Management Systems. BTW Workshops 2007: 296-30. </li></ul></ul><ul><ul><li>Data components: S. Chaudhuri, G. Weikum: Rethinking Database System Architecture: Towards a Self-Tuning RISC-Style Database System. VLDB 2000: 1-1. </li></ul></ul>
  48. 48. Increasing Reach - 1 <ul><li>Most automation in data management has asked the question: </li></ul><ul><ul><li>Which current requirements can be met better by increasing the range of tasks that are carried out automatically? </li></ul></ul><ul><li>An alternative view gives rise to a different question: </li></ul><ul><ul><li>If we assume that there is to be no manual administration, what sorts of data management system can be developed? </li></ul></ul>
  49. 49. Increasing Reach - 2 <ul><li>The vision of dataspaces is to support database style access over diverse sources with minimal manual integration. </li></ul><ul><ul><li>A. Halevy, M. Franklin, D. Maier: Principles of dataspace systems. PODS 2006: 1-9. </li></ul></ul><ul><li>Preliminary proposals match schemas automatically but partially, thus giving approximate answers that can be ranked. </li></ul><ul><ul><li>J-P. Dittrich, M. Salles: iDM: A Unified and Versatile Data Model for Personal Dataspace Management. VLDB 2006: 367-378. </li></ul></ul><ul><ul><li>S. Abiteboul, N. Polyzotis: The Data Ring: Community Content Sharing. CIDR 2007: 154-16. </li></ul></ul><ul><li>The challenge is to enable querying over structured data in a personal file store, within an organisation or at internet scale, with no manual integration. </li></ul>
  50. 50. Conclusions <ul><li>Automation is already in lots of places: </li></ul><ul><ul><li>Database administration. </li></ul></ul><ul><ul><li>Query evaluation. </li></ul></ul><ul><ul><li>Data integration. </li></ul></ul><ul><li>Automation in data management is not mature: </li></ul><ul><ul><li>Predictability. </li></ul></ul><ul><ul><li>Methodology. </li></ul></ul><ul><ul><li>Composability. </li></ul></ul><ul><ul><li>Semantics. </li></ul></ul><ul><li>If automation becomes a more central focus: </li></ul><ul><ul><li>Understanding of automation per se should improve. </li></ul></ul><ul><ul><li>The nature of data management systems will change. </li></ul></ul>
