ON-LINE ANALYTICAL PROCESSING IN DISTRIBUTED DATA WAREHOUSES

Jens Albrecht, Wolfgang Lehner
University of Erlangen-Nuremberg, Germany
{jalbrecht, lehner}@informatik.uni-erlangen.de

Abstract

The concepts of "Data Warehousing" and "On-line Analytical Processing" have seen growing interest in the research and commercial product communities. Today, the trend moves away from complex centralized data warehouses to distributed data marts integrated in a common conceptual schema. However, as the first part of this paper demonstrates, there are many problems and few solutions for large distributed decision support systems in worldwide operating corporations. After showing the benefits and problems of the distributed approach, this paper outlines possibilities for achieving performance in distributed on-line analytical processing. Finally, the architectural framework of the prototypical distributed OLAP system CUBESTAR is outlined.

1 Introduction

Today's global economy has placed a premium on information: in a dynamic market environment with many competitors it is crucial for an enterprise to have on-line information about its general business figures as well as detailed information on specific topics, in order to be able to make the right decisions at the right time. That is why almost all big companies today are trying to build data warehouses. In contrast to former management information systems based mostly on operational data, data warehouses contain integrated and non-volatile data and therefore provide a consistent basis for organizational decision making. In addition to classical reporting, the main application of data warehouses is On-line Analytical Processing (OLAP, [CoCS93]), i.e. the interactive exploration of the data. The data warehouse promise is getting accurate business information fast.

But the ultimate goal of data warehousing and OLAP goes even further; the vision is to "put a crystal ball on every desktop". Today that ideal is far from being reality. First, building a single centralized data warehouse serving many different user groups takes a long time, and setup and maintenance are very expensive. All this contributes to a relatively inflexible architecture. Second, local access behavior is considered in the data warehouse design neither on the conceptual nor on the internal layer, but only on the external layer.

Therefore, many companies today decide to start with smaller, flexible data marts dedicated to specific business areas. To regain the ability to perform cross-functional analysis, there are two possibilities. The first is to create, again, a centralized data warehouse for only the cross-functional summary data. The other is to integrate the data marts into a common conceptual schema and thereby create a distributed data warehouse.

We will concentrate on the second alternative, which allows more flexible querying. To the user, a distributed data warehouse should behave exactly like a centralized data warehouse. For transparent and efficient OLAP on a distributed data warehouse several problems need to be solved. The intention of this article is not to present final solutions but to show the potential of the distributed approach and the challenges for researchers and software vendors alike.

In the following section we characterize different data warehouse architectures and their inherent drawbacks. After showing the benefits of a distributed approach to data warehousing, we show the problems resulting for distributed OLAP systems. Performance, the main problem, is the issue of section 3. Finally, in section 4 we give an overview of the prototypical distributed OLAP system CUBESTAR.

2 Data Warehouse Architectures

According to [Inmo92], a data warehouse is a "subject-oriented, integrated, time-varying, non-volatile collection of corporate data". Since that definition is very generic, we now give an overview of three types of data warehouse architectures, partly based on [WeCa96].

2.1 Enterprise Data Warehouses

What most people think of when they use the term data warehouse is the enterprise or corporate data warehouse, where all business data is stored in a single database with a single corporate data model (figure 1). Enterprise data warehouses are created to provide knowledge workers with consistent and integrated data from all major business areas. They offer the possibility to correlate information across the enterprise.
But because of the huge subject area (the corporation) and the vast number of different operational sources that need to be integrated, enterprise data warehouses are very difficult and time-consuming to implement. Too many different data sources and too many different user requirements are likely reasons for the failure of the whole data warehouse project. Since the data warehousing setup strategy from the analyst's point of view is, as Inmon puts it, "give me what I want and I will tell you what I really want" [Inmo92], an incremental approach has proven to be the right choice.

[Fig. 1: Enterprise Data Warehouse]

2.2 Data Marts

The term data mart stands for the idea of a highly focused version of a data warehouse (figure 2). One must distinguish data marts which are local extracts from an enterprise data warehouse from data marts which serve as the data warehouse for a small business group with well-defined needs. In the latter case, which is the definition we will use in this paper, the limitation in size and scope of the data warehouse, e.g. to only sales or marketing, dramatically reduces setup costs. Data marts can be deployed in weeks or months. That is why most of the current data warehousing projects include data marts.

[Fig. 2: Data Marts]

2.3 Distributed Enterprise Data Warehouses

Although the data mart idea is very attractive and potentially offers the opportunity to build a corporate-wide data warehouse from the bottom up, the benefits of data marts can easily be outweighed if there is no corporate-wide data warehouse strategy. If data mart design is not done carefully, independent islands of information are created, and cross-functional analysis, as in the enterprise data warehouse, is not possible.

Therefore, the pragmatic approach to combine the virtues of both is to integrate the departmental data marts into a common schema. The result is a distributed enterprise data warehouse (figure 3), also called a federated data warehouse [WeCa96] or enterprise data mart [Info97].

[Fig. 3: Distributed Enterprise Data Warehouse]

The benefits of this divide-and-conquer approach are clear. In the distributed enterprise data warehouse the individual data marts can be created and managed flexibly. The flexibility for the individual data marts is traded for additional complexity in the management of the global schema and query processing across several data marts. Table 2.1 compares some characteristics to the centralized data warehouse.

Characteristics          Centralized DW   Distributed DW   Distributed DW
                                          (Local Area)     (Wide Area)
Conceptual Flexibility   low              low              moderate
Data Distribution        none             moderate         high
Query Processing         easy             moderate         complex
Communication Costs      low              moderate         high
Security Requirements    none             low              low
Local Maintenance        hard             easy             easy
Global Maintenance       n/a              easy             moderate

Tab. 2.1: Characteristics of Centralized and Distributed Data Warehouses

There are basically two scenarios for a distributed enterprise data warehouse. The local area scenario (e.g. the data marts of only the Asian division) characterizes what first comes to mind. The extension of this approach to a wide area scenario (the global setting depicted in figure 3) is possible, but the high degree of data distribution results in much more complexity for query processing, since communication costs become a major factor. The more global the distributed data warehouse is, the more important it is to increase local access by replication. Conversely, the more local the scenario is, the more likely techniques for load balancing in a shared-nothing server cluster can be applied.
2.4 Discussion

The current trend in the creation of data warehouses follows the bottom-up approach, where many data marts are created with a later integration into a distributed enterprise data warehouse in mind. The Ovum report [WeCa96] predicts that centralized data warehouses will become increasingly unpopular due to their inherent drawbacks and thus be replaced by a set of interoperating data marts.

The commercial product community seems to confirm the prediction. A first step towards easy integration was recently taken by the Informatica Corp. with the release of its tools for enterprise data mart support. These tools are proposed to easily allow the creation and maintenance of distributed data marts and a seamless integration into a common schema. Moreover, query tools can directly access the metadata through a metadata exchange API. OLAP tool vendors like Microstrategy Inc. have already announced support for this feature. For querying, "it means that all data marts are working off the same sets of data definitions and business rules allowing users to query aggregate data that may be distributed across multiple disparate data marts" [Ecke97]. Companies like Cognos [Cogn97] and Hewlett Packard [HP97] also already offer rudimentary support for distributed data warehouses.

But to our knowledge, no query tool today is able to transparently and efficiently generate and execute queries in a distributed setting. An intelligent middleware layer as the basis for traffic routing, aggregate caching and load balancing is needed. This is the problem we focus on in the remaining sections.

3 Achieving Performance in Distributed OLAP Systems

In order to deserve the attribute on-line, queries should never run much longer than a minute. In this section we outline some specific methods and necessary prerequisites for achieving performance in large distributed OLAP environments. Note that some of these also apply in the non-distributed case but become essential in the distributed case.

3.1 The Selection and Use of Aggregates

Researchers and vendors agree that the main approach to speed up OLAP queries is the use of aggregates or materialized views [LeRT96]. In the last two years many articles on the selection of views to materialize in a data warehouse were published. A good overview of techniques for the selection of aggregates can be found in [ThSe97].

Dynamic Management of Aggregates. None of the recent articles considers the dynamics of an on-line environment with dozens of completely different queries to execute every minute. If user behavior is considered at all, only a fixed number of queries is taken as the basis for the aggregation algorithms ([GHRU97], [Gupt97], [BaPT97]). On the commercial product side there are OLAP tools like Microstrategy's DSS Tools or Informix' Metacube which are able to give the DBA hints for the creation of aggregates based on monitored user behavior. However, if the DBA decides to create aggregates, these are static, and a change in the user behavior is not reflected.

Today, aggregates are more or less seen as another multidimensional access method, similar to an index, that must be managed by the DBA. But only a self-maintaining, dynamic aggregate management, working like a (distributed) system buffer or cache for aggregates, offers the needed flexibility and the potential for fast query response times and low human maintenance effort. The selection and the usage of aggregates should be completely transparent to the user. The DBA should only be able to set up parameters for the creation of aggregates, but not their content.

Partial Aggregates. The algorithms for the selection of aggregates that explicitly support the multidimensional model compute preaggregates based on the full scope of the data cube. The query result in figure 4, if materialized, would be such a complete aggregate. Relationally speaking, only different attribute combinations in the group-by clause are considered; the where-clause in the materialized view definition is always empty. The benefit of this approach is that query processing is relatively simple. Therefore, this is the only kind of aggregate commercial products support today.

However, materialized views created in that manner may contain much data that is never needed, so storage space is wasted. Even more important, in a distributed setting with inherent data fragmentation the handling of partial aggregates (the candidates in figure 4) becomes a necessary prerequisite. A data mart server has only parts of the whole data cube stored locally. In fact, that is the fundamental idea of having data marts in a distributed enterprise data warehouse. Thus, at least at the lower aggregation levels it does not even make sense to create aggregates with full, enterprise-wide scope.

The big drawback is clearly that query optimization in the presence of partial aggregates becomes more complex. A patchworking algorithm finding the least expensive aggregate partitions must be invoked (figure 4).

[Fig. 4: Patchworking algorithm — a query result is assembled from the complete data cube, selected candidate aggregates, and other possible candidates spread over sites A, B, and C]

Data Allocation and Load Balancing. Directly related to the question of selecting aggregates in a distributed environment is the allocation problem. The system should not only manage the dynamic creation of partial aggregates, but also decide where to place them on a tactical basis. Moreover, the traditional method of increasing local access performance in distributed environments by the use of replicas can and should also be applied in distributed OLAP.

Hence, redundancy in the distributed data warehouse can be distinguished into vertical redundancy, involving the creation of summary data with a coarser granularity than the detailed raw data, and horizontal redundancy, introduced by the replication of aggregates (figure 5).

[Fig. 5: Horizontal and Vertical Redundancy — vertical redundancy derives summary data from raw data by aggregation operations; horizontal redundancy replicates summary data]

In order to allow the system to adapt to changing user requirements, there should be no fixed locations for any kind of redundant data fragment. Aggregate data should be close to the sites where it is needed in order to minimize communication costs. In order to increase parallel query execution, data fragments should be distributed over several data mart servers, especially if communication cost is low.

Redundant fragments must be created and dropped dynamically according to the changing user behavior. Thus, a flexible, adaptive migration and replication policy is essential for data-based load balancing at the data mart servers. Involved in the process of choosing and allocating aggregates are not only access characteristics but also communication cost, maintenance cost for redundant data, and additional storage cost.

Differences to traditional distributed DBMS. Although the fragmentation and allocation problems are well known from traditional distributed database systems, there are several specifics for distributed data warehouses. The fragmentation of the raw data is already fixed by the definition of the data marts; usually, there is not much of a choice here. The problems as well as the performance potential of distributed data warehouses arise from the mentioned techniques of dynamic aggregate management and dynamic data allocation. These techniques were not previously applicable because redundancy in transaction-oriented systems always requires expensive synchronization mechanisms.

The query optimizer of a distributed OLAP system first needs to be aware of the aggregations which would be helpful to speed up query processing. If this information is known, it must invoke a patchworking algorithm finding the least expensive aggregate partitions (figure 4). The choice of how to assemble the query result may vary from query to query, because the underlying redundant fragments may be dropped in favor of new ones by the resource management, and the system and network load may change.

An important point is that the determination of candidates as well as the patchworking itself, which includes the comparison of selection predicates, is only feasible in the context of a multidimensional data model. In contrast to the simple relational data model, selections (slices) can be defined by classification nodes (see section 4.2). Predicates defined in that manner carry much more semantics than a simple relational selection predicate. If one allowed arbitrary predicates for the definition of fragments, patchworking would not be possible.

3.2 Quality of Service Specification

Another point greatly distinguishes the distributed OLAP scenario from traditional distributed databases. In the analytical context, general figures are of interest, not individual entities. It is often the case that approximate numbers are sufficient if they can be delivered faster. Thus, it is feasible to trade correctness for performance. Some factors characterizing correctness are accuracy, actuality and consistency.

Accuracy. One approach to relaxing the accuracy of the reports is offering summary data that is either aggregated at a higher level than requested or covers only parts of the requested information. For example, if a query requesting sales figures for each city in Germany would take hours, an answer containing sales figures for each county, or even only for South Germany, might be enough for the moment. An algorithm dealing with these issues is given in [Dyre96].

Actuality. Absolute actuality in a worldwide distributed enterprise data warehouse is an expensive requirement. If each data mart is updated once a day, for example, it means for the distributed data warehouse that it is updated every few hours.
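As a concrete illustration of the patchworking idea of section 3.1, the selection of candidate aggregates can be treated as a weighted set-cover problem over the queried cube cells. The following Python sketch is our own illustration, not CUBESTAR code: the greedy heuristic, the cell encoding, and all names (`Candidate`, `patchwork`, the site labels) are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    """A partial aggregate: the cube cells it can answer and its access cost."""
    name: str
    cells: frozenset
    cost: float

def patchwork(query_cells, candidates):
    """Greedily cover the queried cells with the cheapest candidate patches.

    Returns the chosen candidates, or None if the query cannot be
    assembled from the available partial aggregates.
    """
    uncovered = set(query_cells)
    chosen = []
    while uncovered:
        # Pick the candidate with the best cost per newly covered cell.
        best = min(
            (c for c in candidates if c.cells & uncovered),
            key=lambda c: c.cost / len(c.cells & uncovered),
            default=None,
        )
        if best is None:
            return None          # gap in the data cube: no patch fits
        uncovered -= best.cells
        chosen.append(best)
    return chosen

# Sales for Germany, Q1: three sites hold overlapping partial aggregates.
query = {("DE", m) for m in ("Jan", "Feb", "Mar")}
cands = [
    Candidate("siteA", frozenset({("DE", "Jan"), ("DE", "Feb")}), cost=2.0),
    Candidate("siteB", frozenset({("DE", "Mar")}), cost=1.0),
    Candidate("siteC", frozenset(query), cost=9.0),  # full but expensive
]
plan = patchwork(query, cands)
```

A real optimizer would compare classification-node predicates instead of enumerating cells and would weigh aggregation and transfer costs per site, but the shape of the search is the same.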
Therefore, the maintenance cost for summary data can increase dramatically. But getting the most actual numbers may not be necessary for the analyst. Hence, a certain degree of staleness can be acceptable, allowing redundant data to be updated or dropped gradually [Lenz96].

Consistency. The consistency of the reports generated in a session should be another adjustable factor. In general, it is desirable that a sequence of drill-down and roll-up operations relies on the same base figures. However, this requirement might not be crucial if the analyst just wants to get an overview of some figures. Hence, this condition as well might be relaxed according to the analyst's needs. Note that generating consistent reports can be very expensive, because all reports in a session must have the same actuality.

Especially in a large distributed environment where communication costs cannot be neglected, these factors are critical for reasonable overall performance. Slightly inaccurate, slightly stale or slightly inconsistent data may be sufficient for the analyst if he just gets the figures in a few seconds rather than minutes.

4 The Distributed OLAP System CUBESTAR

This section focuses on the architecture and the principles for query processing and resource management of the prototypical distributed OLAP system CUBESTAR. In order to deal with the complexity of the issues, we borrowed many ideas from the Mariposa system ([SDK+94]) and applied them to our special application domain.

4.1 The Architecture

The general architecture of CUBESTAR is a three-tier architecture, as shown in figure 6. At the client layer are the end user tools with reporting facilities. Queries are issued in the Cube-Query-Language (CQL, [BaLe97], figure 7), which specifically supports the multidimensional data model.

The heart of the architecture is the middleware layer, which hides the details of data distribution. Its main task is to translate a user query and generate an optimized distributed query execution plan. For distributed query processing and aggregate management we apply a microeconomic paradigm: a site tries to maximize the revenues earned for taking part in query processing. In order to compete, a site has to maintain a set of redundant data according to the current market needs. The middleware layer also implements a global classification information service providing a globally consistent view of the multidimensional data model.

The third layer (server layer) consists of a federation of data mart servers. Each data mart server controls the locally available set of data partitions and the parallel query execution at its machine.

[Fig. 6: Architecture of CUBESTAR — clients issue queries to a middleware layer (parser with syntax and semantic checks, query restructurer, plan generator, broker, partition selector, partition directory cache, global classification information, result delivery) sitting on top of a federation of data mart servers, each with a bidder, a resource manager (RM) and aggregation engines (AE)]

4.2 The Data Model

CUBESTAR consistently applies the multidimensional data model at all levels of query processing and aggregate management. Only issues of physical data access are managed by the underlying relational database engines. Since the main operation in OLAP is aggregation, classification hierarchies can be used to define groups for the application of aggregation operations (figure 7). For example, the typical OLAP query of figure 7, specified in CQL, asks for the sales figures at the granularity of product families subsumed by video equipment (= camcorders and recorders).

Multidimensional Objects. In analogy to the 'cubette' approach of [Dyre96], a query may be seen as a multidimensional subcube, called a Multidimensional Object (MO), basically described by three components [Lehn98]:

• A specification of the original cell and the applied aggregation operator (SUM(Sales))
• A multidimensional scope specification consisting of a tuple of classification nodes (<P.Group = 'Video', G.Country = 'Germany', T.Year = '1997'>)
• A data granularity specification consisting of a tuple of categories (<P.Family, G.Top, T.Top>)
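The containment and derivability tests on MOs reduce to simple walks over the classification hierarchy. The sketch below is illustrative and restricted to a single dimension (the Product hierarchy of figure 7) for brevity; the encodings `PARENT` and `LEVEL` and the class `MO` are assumptions for the example, not the CUBESTAR implementation, which carries a tuple of nodes and categories, one per dimension.

```python
# Classification hierarchy for the Product dimension of figure 7:
# each classification node maps to its parent node.
PARENT = {
    "Camcorder": "Video", "Recorder": "Video",
    "Video": "Consumer Electronics", "Audio": "Consumer Electronics",
    "Consumer Electronics": "TOP",
}

# Category levels from finest (article#) to coarsest (TOP).
LEVEL = {"article#": 0, "Family": 1, "Group": 2, "Area": 3, "TOP": 4}

def subsumes(node, other):
    """True if `node` is `other` itself or one of its classification ancestors."""
    while other is not None:
        if other == node:
            return True
        other = PARENT.get(other)
    return False

class MO:
    """A multidimensional object: operator, scope node, granularity category."""
    def __init__(self, op, scope, granularity):
        self.op, self.scope, self.granularity = op, scope, granularity

    def derivable_from(self, other):
        """This MO can be computed from `other` if the operator matches,
        other's scope subsumes ours, and other's data is at least as fine."""
        return (self.op == other.op
                and subsumes(other.scope, self.scope)
                and LEVEL[other.granularity] <= LEVEL[self.granularity])

query = MO("SUM", "Video", "Group")                    # sales per group under Video
stored = MO("SUM", "Consumer Electronics", "Family")   # finer, wider aggregate
```

Because scope and granularity are drawn from the classification hierarchy rather than arbitrary predicates, these tests are decidable by construction, which is exactly what makes patchworking feasible.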
[Fig. 7: A Sample Classification Hierarchy and a Sample Query in CQL — the Product hierarchy runs TOP > Area (Consumer Electronics) > Group (Audio, Video) > Family (Camcorder, Recorder) > article# (TR-75, TS-78, A200, V-201, ClassicI)]

    SELECT SUM(SALES)
    FROM   Product P, Geography G, Time T
    WHERE  P.Group = 'Video',
           G.Country = 'Germany',
           T.Year = '1997'
    UPTO   P.Family

Within the distributed multidimensional OLAP environment, multidimensional objects are the basic units of distributed storage management and query processing. Therefore, different types of MOs are distinguished with regard to their generation in the distributed environment.

The raw or base data partitions at each data mart are called Base Multidimensional Objects (BMOs). Thus, BMOs represent data of the finest granularity. BMOs are not allowed to migrate but are assigned a unique home site, where they are uploaded by the data mart loading tools. Consequently, they are by definition up-to-date.

From the BMOs any number of MOs can be derived. In a relational sense these derived MOs are materialized views. However, because of the definition of MOs, the question whether one MO is contained in, intersected by, or derivable from another MO is easily decidable. According to the requirements of the system, MOs are allowed to migrate or be replicated freely.

4.3 Microeconomics

An elegant solution to cope with the complexity of dynamic aggregate management and load balancing is to put all issues related to shared resources into a microeconomic framework ([SDK+94]). In an economy where every site tries to selfishly maximize its utility there is no need for a central coordinator; the decision process is inherently decentralized. Moreover, the competition for orders and profits allows the system to dynamically adapt to market demands, i.e. to query behavior and system load alike.

In contrast to Mariposa, where demand and supply regulate the prices, for the sake of simplicity a uniform billing scheme with fixed rates for the aggregation effort is applied to each executed query. The value of a query result is computed by a weighted combination of the following factors:

• Actual size of the raw data MOs required for the computation (number of tuples). Taking the size of the raw data is reasonable, since requests for aggregates of huge fact tables should be more expensive than those of smaller ones. Moreover, this decision favors the storage of higher aggregates, since they need little space and yield high profits.
• Actual size of the query result (number of tuples). This term simply values the additional overhead for storing large aggregates.
• Aggregation time. This is the time spent from issuing the query to delivering the result, i.e. the local query processing time. A higher aggregation time results in a lower price, thus favoring fast data mart servers.
• Transfer time. The computation may require the shipping of intermediate MOs from one data mart server to another.

4.4 Query Processing

A query in CUBESTAR is basically a multidimensional object specified by a CQL query like the one in figure 7. Affiliated with each query is a user-defined quality of service specification consisting of a limit on the query processing time and an indicator for the requested accuracy of the data. Thus, the user has the choice to trade actuality for speed.

The query is issued to a broker which tries to perform the query in the best way possible on behalf of the user. Since all users are treated equally and prices are regulated, a query is not assigned a budget. The broker's objective is simply to get the best service available for the user's request, i.e. to minimize query execution time under the quality of service constraints. After query processing, the broker pays the price according to the billing scheme to the participating data mart servers. Note that, in order to support user priorities, it would make sense to limit the user's budget, resulting in a second objective, price, for query optimization.

Query processing basically proceeds in the three steps query preparation, query optimization, and query execution. In the query preparation phase, the query string is handed over to the parser, where it is checked for syntactic and semantic correctness. The result of this step is a tree-like representation of the query with the queried MO at the top, intermediate results at the inner nodes, and raw data at the leaves.
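As an illustration, the uniform billing scheme of section 4.3 might be written as a linear combination of the four factors. The weights, the linear form, and the clamping to zero below are assumptions made for this sketch, not the actual CUBESTAR rates.

```python
# Hedged sketch of the uniform billing scheme: the price paid for a query
# result as a weighted combination of the four factors of section 4.3.
# The weights and the linear form are illustrative assumptions.

def query_price(raw_tuples, result_tuples, aggregation_time, transfer_time,
                w_raw=0.01, w_result=0.05, w_agg=2.0, w_transfer=1.0):
    """Value of a query result under fixed, regulated rates.

    Large raw inputs and large results raise the price (they bind more
    resources); slow aggregation and long transfers lower it, so fast,
    well-placed data mart servers earn more per query.
    """
    price = (w_raw * raw_tuples
             + w_result * result_tuples
             - w_agg * aggregation_time
             - w_transfer * transfer_time)
    return max(price, 0.0)   # a bid never pays a negative amount

# A high aggregate derived from a big fact table, computed quickly from
# local data, is priced higher than the same result computed slowly with
# intermediate MOs shipped in from a remote site.
fast_local = query_price(1_000_000, 200, aggregation_time=1.5, transfer_time=0.0)
slow_remote = query_price(1_000_000, 200, aggregation_time=6.0, transfer_time=4.0)
```

Note how the scheme rewards exactly the behavior the system wants: sites that keep popular higher aggregates close to where they are queried.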
For the generation of a distributed query execution plan, the current distribution of the aggregates must be taken into account. Note that due to the dynamic aggregate management the existence and location of aggregates and replicates change frequently. Therefore, the broker may issue a broadcast request to the data servers in order to obtain the necessary information. This kind of meta data is cached in the partition directory cache in order to avoid too many broadcasts. Using this method, it is possible that invalid meta data are used. In this case, there are two possibilities: either force a data mart server to recompute dropped aggregates, or submit a new broadcast.

At the data mart server side, the bidder component receiving the request checks for those locally available MOs which can contribute to the query. Each bid from a data mart server includes the following entries:

• the identifier of the data mart server,
• a set of tuples { (Mi, Ti) }, each of which contains the description of an offered MO in the repository and the time the site expects to need to aggregate that MO to the granularity of the query M,
• an indicator for the available processing power the site is willing to reserve for the query, depending on the system load,
• the free space the site is willing to reserve for intermediate results, possibly from other data mart servers.

The query optimizer's task is now to select, from the set of multidimensional objects offered by the bidding sites, those which require the least aggregation and transportation effort, using a patchworking algorithm. To estimate communication costs, average network throughputs between the sites are used. The result of this step is a distributed query execution plan, which is basically a tree with physically existing MOs at the leaves, the target MO at the top, and intermediate results at the inner nodes. Attached to each node is an operation, an estimated processing time for the operation, and the location for the execution of the operation, denoting the bidders that won the competition.

In the query execution step, the bidders are informed whether their offer was accepted or not. The notification includes the complete query execution plan, allowing the sites to understand the decision of the broker and draw conclusions for their future resource management. Once all MOs necessary for processing have arrived at a site, the local query execution is initiated by an aggregation engine. If subqueries can be executed locally in parallel, the aggregation engine may spawn new aggregation engines running in parallel threads. After the aggregation engine has completed its task, in the last step the resulting MO is sent to the target location.

4.5 Storage Management

So far, the existence of many redundant multidimensional objects was assumed. This subsection focuses on when, how, and where these MOs are generated. In order to deal with the complexity of the issues of section 3.1, this problem is handled by the application of microeconomics.

The resource manager at a data server aims at maximal usage of its local resources in terms of revenues. Therefore, a site always tries to expand its market share by cleverly creating and dropping aggregates and replicas. In order to estimate current demands, the resource manager maintains a query history list containing those requests the site failed to win, either because it did not have any suitable MOs or because other bidders were preferred. The list entries include for each query the actually selected MOs with their locations and revenues. Based on this information the resource manager can decide to obtain new MOs in order to underbid its competitors in the future. If the new MO is to be acquired from a remote location, the resource manager acts as a client itself and therefore also has to pay for the MO from its own account. In this case, it is making an investment.

The other way to use the storage space is to offer it for query processing. If MOs from different locations must be coalesced, one site will receive the other parts. Note that it is very lucrative to receive other pieces of MOs fitting to locally stored ones, because next time a larger MO can be offered. In fact, using the intermediate results of query processing is the most convenient way to get new MOs in the local case as well. Each of these MOs is checked by the resource manager for usefulness. Basically, storing MOs that were already needed works like a cache, with a microeconomic heuristic for replacement.

5 Summary and Future Work

In the first part of this paper we made clear the drawbacks of the centralized data warehouse and the non-integrated data mart approach. In order to circumvent these drawbacks and combine the benefits, integrated distributed enterprise data warehouses were presented as a possible solution. The on-line exploration of distributed data warehouses, or distributed OLAP, is related to many complex issues like distributed query optimization, meta data management and security policy handling, each of which demands innovative solutions. We characterized some possibilities for achieving query performance by dynamic and partial aggregation, data-based load balancing and the specification of the quality of service. Finally, we discussed the architectural framework of the prototypical distributed OLAP system CUBESTAR.
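The microeconomic storage management of section 4.5 — retaining the MOs that earn the most revenue per unit of storage and evicting the least profitable ones — can be recapped as a small cache sketch. Class names, the revenue-per-tuple heuristic, and the example figures are illustrative assumptions, not the CUBESTAR implementation.

```python
# Hedged sketch of the resource manager's cache-like MO store: keep the
# multidimensional objects with the best revenue per tuple of storage,
# evicting the least profitable ones when space runs out.

from dataclasses import dataclass

@dataclass
class StoredMO:
    name: str
    size: int             # storage cost in tuples
    revenue: float = 0.0  # payments earned answering queries from this MO

    def yield_per_tuple(self):
        return self.revenue / self.size

class MOStore:
    def __init__(self, capacity):
        self.capacity = capacity
        self.mos = []

    def record_sale(self, name, price):
        """Credit a query payment to the MO that served it."""
        for mo in self.mos:
            if mo.name == name:
                mo.revenue += price

    def offer(self, mo):
        """Admit a new MO, evicting less profitable ones if needed.

        Returns True if the MO was stored."""
        while sum(m.size for m in self.mos) + mo.size > self.capacity:
            victim = min(self.mos, key=StoredMO.yield_per_tuple, default=None)
            if victim is None or victim.yield_per_tuple() >= mo.yield_per_tuple():
                return False           # the newcomer is not worth the space
            self.mos.remove(victim)
        self.mos.append(mo)
        return True

store = MOStore(capacity=100)
store.offer(StoredMO("sales/Video/Family", size=60, revenue=30.0))
store.offer(StoredMO("sales/Audio/Family", size=40, revenue=4.0))
# A popular higher aggregate displaces the barely-used Audio MO.
stored_ok = store.offer(StoredMO("sales/All/Group", size=30, revenue=24.0))
```

In a full system the expected revenue of a candidate MO would come from the query history list rather than being given, but the replacement decision has this shape.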
We consider sophisticated aggregation management a major research issue. This covers algorithms for using distributed partial aggregations and for deciding which aggregate combinations yield the highest benefit and are best materialized. A reasonable solution offering the needed flexibility is the application of a microeconomic framework as in the Mariposa system ([SDK+94]). We admit that this last statement still needs to be proven by the implementation.

References

BaLe97 Bauer, A.; Lehner, W.: The Cube-Query-Language for Multidimensional Statistical and Scientific Database Systems, in: 5th International Conference on Database Systems For Advanced Applications (DASFAA'97, Melbourne, Australia, April 1-4, 1997), pp. 263-272

BaPT97 Baralis, E.; Paraboschi, S.; Teniente, E.: Materialized View Selection in a Multidimensional Database, in: 23rd International Conference on Very Large Data Bases (VLDB'97, Athens, Greece, 1997)

CoCS93 Codd, E.F.; Codd, S.B.; Salley, C.T.: Providing OLAP (On-line Analytical Processing) to User Analysts: An IT Mandate, White Paper, Arbor Software Corporation, 1993

Cogn97 Cognos Corporation: Distributed OLAP, White Paper, http://www.cognos.com, 1997

Cube97 The CUBESTAR Project: http://www6.informatik.uni-erlangen.de/research/cubestar.html

Dyre96 Dyreson, C.: Information Retrieval from an Incomplete Data Cube, in: 22nd International Conference on Very Large Data Bases (VLDB'96, Mumbai, India, 1996)

Ecke97 Eckerson, W.: Building and Managing Data Marts, Patricia Seybold Group Report for Informatica, http://www.informatica.com, 1997

GHRU97 Gupta, H.; Harinarayan, V.; Rajaraman, A.; Ullman, J.D.: Index Selection for OLAP, in: 13th International Conference on Data Engineering (ICDE'97, Birmingham, UK, April 7-11, 1997)

Gupt97 Gupta, H.: Selection of Views to Materialize in a Data Warehouse, in: 6th International Conference on Database Theory (ICDT'97, Delphi, Greece, Jan 8-10, 1997), pp. 98-112

HaRU96 Harinarayan, V.; Rajaraman, A.; Ullman, J.D.: Implementing Data Cubes Efficiently, in: 25th International Conference on Management of Data (SIGMOD'96, Montreal, Quebec, Canada, June 4-6, 1996), pp. 205-216

HP97 N.N.: HP Intelligent Warehouse, White Paper, Hewlett Packard, http://www.hp.com, 1997

Inmo92 Inmon, W.H.: Building the Data Warehouse, John Wiley, 1992

Info97 N.N.: Enterprise-Scalable Data Marts: A New Strategy for Building and Deploying Fast, Scalable Data Warehousing Systems, White Paper, Informatica, http://www.informatica.com, 1997

Lehn98 Lehner, W.: Modeling Large Scale OLAP Scenarios, in: 6th International Conference on Extending Database Technology (EDBT'98, Valencia, Spain, March 23-27, 1998)

Lenz96 Lenz, R.: Adaptive Distributed Data Management with Weak Consistent Replicated Data, in: Proceedings of the 11th Annual Symposium on Applied Computing (SAC'96), Philadelphia, 1996

LeRT96 Lehner, W.; Ruf, T.; Teschke, M.: Improving Query Response Time in Scientific Databases Using Data Aggregation, in: 7th International Conference and Workshop on Database and Expert Systems Applications (DEXA'96, Zurich, Switzerland, Sept. 9-10, 1996)

SDK+94 Stonebraker, M.; Devine, R.; Kornacker, M.; Litwin, W.; Pfeffer, A.; Sah, A.; Staelin, C.: An Economic Paradigm for Query Processing and Data Migration in Mariposa, in: Proceedings of the 3rd International Conference on Parallel and Distributed Information Systems (PDIS'94, Austin, TX, Sept. 28-30, 1994), pp. 58-67

ThSe97 Theodoratos, D.; Sellis, T.: Data Warehouse Configuration, in: 23rd International Conference on Very Large Data Bases (VLDB'97, Athens, Greece, 1997)

WeCa96 Wells, D.; Carnelley, P.: Ovum Evaluates: The Data Warehouse, Ovum Ltd., London, 1996
