
Analysis of existing databases at the logical level: the DBA companion project

Fabien De Marchi¹, Stéphane Lopes², Jean-Marc Petit¹, Farouk Toumani¹

¹ Laboratoire LIMOS, CNRS UMR 6158, Université Blaise Pascal - Clermont-Ferrand II, 24, avenue des Landais, 63 177 Aubière cedex, France
² Laboratoire PRISM, CNRS FRE 2510, 45, avenue des Etats-Unis, 78035 Versailles Cedex, France

January 15, 2003

Abstract

Whereas physical database tuning has received a lot of attention over the last decade, logical database tuning seems to be under-studied. We have developed a project called DBA Companion devoted to the understanding of logical database constraints from which logical database tuning can be achieved.

In this setting, two main data mining issues need to be addressed: the first one is the design of efficient algorithms for functional dependencies and inclusion dependencies inference and the second one is about the interestingness of the discovered knowledge. In this paper, we point out some relationships between database analysis and data mining and sketch the underlying themes of our approach. Some database applications that could benefit from our project are also described, including logical database tuning.

1 Introduction and motivations

Today's Relational DataBase Management Systems (RDBMS) require DataBase Administrators (DBA) to tune more and more parameters for an optimal use of their databases. Due to the difficulty of such a task, and since a large number of companies cannot justify a full-time DBA presence, simplifying the administration of RDBMS is becoming a new challenge for the database community: the idea is to have databases adjusting themselves to the characteristics of their applications [2]. For example, in the context of the AutoAdmin project [22], physical database tuning is investigated to improve the performance of the database, e.g. index definitions or automatic statistics gathering from SQL workloads.

In the same spirit, existing logical database constraints should be fully understood. For example, providing the DBA with the functional dependencies (FDs) and inclusion dependencies (INDs) holding in her database is particularly critical, not only for improving data consistency but also for ensuring application performance.

Understanding existing databases is a necessary step when the database has to evolve, for instance to better match users' requirements, or when hardware/software evolution has to be performed. Many database applications, such as database reverse engineering, data interoperability or semantic query optimization, to mention a few, assume the availability of data semantics, mainly conveyed by FDs and INDs over existing databases. However, there is no guarantee at all that this kind of knowledge is known a priori.

We have developed a project called DBA Companion devoted to the understanding of logical database constraints from which logical database tuning can be achieved [16, 15, 4]. A prototype has been developed [14]: its objective is to be able to connect to any database (independently of the underlying DBMS) in order to give some insights to the DBA/analyst, such as:

• the FDs and INDs satisfied in her database,
• small examples of her database, thanks to Informative Armstrong Databases. The same benefits as when design by example was introduced [23, 17] are also expected in this slightly different context (database maintenance vs database design).

Paper organization. Section 2 introduces the main underlying problems to achieve logical database tuning, such as functional and inclusion dependency discovery. Some applications which could benefit from these dependencies are sketched in Section 3. We conclude in Section 4.

2 Data mining issues

Since data semantics is mainly conveyed by FDs and INDs in relational databases, it gives rise to two data mining problems. In this setting, two main issues need to be addressed: the former is about the design of efficient algorithms for FD and IND inference and the latter is about the interestingness of the discovered knowledge.

Example 1. Let us consider a database describing orders of products of a company. A running example is given in Table 1 and will be reused throughout the paper.

Table 1: A running example

Product
pid  name  price
1    Chai  $18.00
2    Tofu  $18.00
3    Paté  $12.30
4    Paté  $50.00

Order
oid    date        shipVia           pid  qty
10509  11-10-1998  Federal Shipping  1    2
10509  11-10-1998  Federal Shipping  2    1
10510  1-5-2000    United Package    3    1
10511  12-12-1998  Speedy Express    2    2

2.1 FD inference

The problem of discovering FDs is stated as follows: given a relation r, find a small cover of the FDs that hold in r. Discovering functional dependencies satisfied by a database has been addressed by various approaches, among which we quote [18, 11, 15, 20].

In [15], a general framework has been proposed based upon the concept of agree set [1, 18, 7]. Given a relation, an agree set is an attribute set on which at least one pair of tuples of this relation has the same values. Informally, such a set conveys information about the FDs not satisfied by that pair of tuples. More precisely, it is a closed set with respect to FD closure. From agree sets, maximal sets (also called meet-irreducible sets) can be easily derived and then many problems related to FD inference can be solved without accessing the database anymore. Examples of related problems are key discovery, approximate FD inference, normal form tests and Armstrong relation generation from the FDs that hold in the relation being analyzed [9].

The most important point of this framework is that a relation is queried only once, i.e. when agree sets are computed.

A data mining approach and a database approach using SQL have been proposed to compute agree sets from a relation [15]. Both approaches performed equally well, but the latter is fully integrated in the DBMS, which is by far more realistic in our context (no pre-treatment is necessary).

Example 2. A cover of the FDs satisfied in Product and Order is:
{ Product: {pid} → {name,price},
  Product: {name,price} → {pid},
  Order: {oid} → {date,shipVia},
  Order: {date} → {oid},
  Order: {shipVia} → {oid},
  Order: {pid,qty} → {oid},
  Order: {oid,pid} → {qty},
  Order: {oid,qty} → {pid} }
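The agree-set idea can be illustrated with a minimal Python sketch on the running example. This is not the algorithm of [15]: it enumerates all pairs of tuples naively, and the names `PRODUCT`, `ORDER`, `agree_sets` and `fd_holds` are hypothetical, introduced only for this illustration. It relies on the observation that an FD X → Y holds exactly when every agree set containing X also contains Y, so that once the agree sets are computed FDs can be tested without touching the relation again.

```python
from itertools import combinations

# Table 1 transcribed as lists of dicts (prices kept as strings).
PRODUCT = [
    {"pid": 1, "name": "Chai", "price": "18.00"},
    {"pid": 2, "name": "Tofu", "price": "18.00"},
    {"pid": 3, "name": "Paté", "price": "12.30"},
    {"pid": 4, "name": "Paté", "price": "50.00"},
]
ORDER = [
    {"oid": 10509, "date": "11-10-1998", "shipVia": "Federal Shipping", "pid": 1, "qty": 2},
    {"oid": 10509, "date": "11-10-1998", "shipVia": "Federal Shipping", "pid": 2, "qty": 1},
    {"oid": 10510, "date": "1-5-2000", "shipVia": "United Package", "pid": 3, "qty": 1},
    {"oid": 10511, "date": "12-12-1998", "shipVia": "Speedy Express", "pid": 2, "qty": 2},
]


def agree_sets(rows):
    """For every pair of tuples, collect the set of attributes on which they agree."""
    attrs = list(rows[0])
    return {
        frozenset(a for a in attrs if t[a] == u[a])
        for t, u in combinations(rows, 2)
    }


def fd_holds(agree, lhs, rhs):
    """lhs -> rhs holds iff every agree set containing lhs also contains rhs."""
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    return all(rhs <= s for s in agree if lhs <= s)


# Each relation is scanned once, here; the tests below only use the agree sets.
product_agree = agree_sets(PRODUCT)
order_agree = agree_sets(ORDER)

print(fd_holds(product_agree, {"pid"}, {"name", "price"}))  # True
print(fd_holds(product_agree, {"name"}, {"price"}))         # False: two Paté rows differ in price
print(fd_holds(order_agree, {"oid"}, {"date", "shipVia"}))  # True
print(fd_holds(order_agree, {"pid"}, {"qty"}))              # False: pid 2 appears with qty 1 and 2
```

The pairwise enumeration is quadratic in the number of tuples, which is acceptable for a toy relation only; it is used here because it follows the definition of an agree set directly.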
2.2 IND inference

The problem of inclusion dependency inference in relational databases can be formulated as follows: given a database d, find a small cover of the INDs that hold in d. As far as we know, the discovery of inclusion dependencies has raised little interest in the Database and KDD research communities [12, 16].

In our project, a levelwise algorithm called Mind has been devised to discover a cover of the INDs satisfied by a database [4]. It extends the well-known AprioriGen function for association rule discovery to generate new candidate INDs of size i+1 from satisfied INDs of size i. It is worth noting that in [4] a new proposition for unary IND inference has been made, from which IND discovery becomes more realistic in practical cases.

Example 3. The INDs satisfied in the database given in Table 1 are:
{ Order[pid] ⊆ Product[pid],
  Order[qty] ⊆ Product[pid],
  Order[qty] ⊆ Order[pid] }

2.3 Interestingness of the discovered knowledge

Once the FDs and INDs satisfied by a given database are discovered, they do not all carry the same weight, i.e. only a subset of them is worth taking into account when, for instance, a logical database tuning process has to be carried out. Eliciting such subsets can be guided by the DBA's expertise or by SQL workloads.

We propose two approaches to deal with this important task in a KDD process: the former is based on the intuition that attributes involved in the so-called logical navigation convey more information than other attributes, and the latter is based on the capability of Armstrong relations to represent in a condensed form all the discovered knowledge through small examples.

2.3.1 Heuristics based on logical navigation

The logical navigation can be defined as a set of (duplicated) attributes on which join statements are performed in application programs. The basic claim is that duplicated attributes carry more semantics than other ones. These attributes can be identified from join statements of SQL workloads representative of database activity. Such SQL workloads are easily gathered in recent RDBMS.

Interestingness of INDs. We can define the interesting INDs as being those INDs which concern duplicated attributes over relation schemas. In [16], we exploit the logical navigation as a guess to automatically find only interesting INDs, it being understood that some of them can be missed. Thus, it can be seen as a pre-treatment of a data mining algorithm to decrease the number of candidate attributes.

Example 4. Continuing our example, it seems quite reasonable to assume that in a representative workload of the database activity, a join between Order[pid] and Product[pid] does exist. Therefore, from Example 3 only one IND is considered as interesting: Order[pid] ⊆ Product[pid]. It is worth noting that other INDs, such as Order[qty] ⊆ Product[pid], do not reflect integrity constraints but rather accidental dependencies.

Interestingness of FDs. Basically, the idea which has been developed so far for INDs can be reused for FDs: the idea is to discard FDs for which at least one attribute of the left-hand side is not a duplicated attribute, i.e. does not belong to the logical navigation of the database (a similar idea was proposed in an informal way in [21]).

Example 5. In addition to the previous example, assume Order.oid is also a duplicated attribute (with another schema not listed here). Applying the previous heuristic, only three dependencies remain from Example 2:
{ Product: {pid} → {name,price},
  Order: {oid} → {date,shipVia},
  Order: {oid,pid} → {qty} }

Nevertheless, we cannot argue that the logical navigation always gives enough information to select relevant FDs: this is all the more true in case of 'full' denormalization. Indeed, no join could be possible anymore, which could be the case for Order.oid in our example.

To cope with such an intrinsic limitation, we have defined a variant of Armstrong databases, so-called Informative Armstrong Databases, as an effective alternative representation of the extracted knowledge.

2.3.2 Armstrong relations/databases

Armstrong relations, introduced in [8], are closely related to FDs since they exactly satisfy a set of FDs. Given a relation as input, Armstrong relations can be computed with respect to the FDs satisfied in that relation. In [5], we have introduced the notion of informative Armstrong relation, which is both 1) an Armstrong relation with respect to the FDs satisfied in the input relation and 2) a subset of that relation. Experiments show its interest for sampling existing relations (up to 1000 times smaller). Moreover, its tuples come from the real world and thus convey more semantics than other existing proposals.

Inclusion dependencies can also be safely integrated into this framework: they lead to the definition of informative Armstrong databases [6].

3 Applications

We present below three examples of database applications which can benefit from a clear understanding of data semantics in existing databases.

3.1 Data integration

Many organizations are using databases implemented over a decade ago and are faced with difficulties in modernizing or replacing these databases to match the evolution of both the technology and the requirements.

In such applications, it is essential to recover data semantics in order to develop new databases or to change underlying data models. As an example, consider the problem of data/schema integration for which the Clio system is developed [19]. It is devoted to the transformation and the integration of heterogeneous data (relational data and XML data). As far as relational databases are concerned, the authors mention the necessity of discovering keys and referential constraints in existing databases.

3.2 Semantic query optimization

Semantic query optimization takes advantage of semantic information stored in databases to improve the efficiency of query evaluation [10, 3].

As an example, consider a semantic query optimization technique called query folding. It consists in computing query answers using a given set of resources (materialized views, cached results of previous queries or queries answerable by other queries). In [10], the author proposes to use INDs to find foldings of queries that would otherwise be overlooked.

3.3 Logical database tuning

We expect the greatest impact of our work for this application. We sketch the usefulness of the discovered dependencies to assist a DBA in tuning an existing database at the logical level [14].

The key tasks of such an activity should be:

• logical constraints definition: for instance, definition of keys, foreign keys and triggers,
• database restructuring: for instance, normalizing a relation schema with the help of small samples (informative Armstrong relations) of the relation.

Logical DB tuning from FDs. A key is a special case of an FD. Keys turn out to be one of the most important integrity constraints in practice. Therefore, a DBA has the opportunity to enforce keys over a relation schema of an existing database.

Example 6. For instance, attribute pid in Product is a key and should be enforced (if not already done).

Logical DB tuning from INDs. A foreign key is a special case of IND, since it represents the left-hand side of an IND whose right-hand side is a key. Within our framework, foreign keys can be derived from both the INDs and the keys of the relation schema of the right-hand side.

As for keys, a DBA has the opportunity to enforce foreign keys over an existing database.

Example 7. For instance, attribute pid in Order is a foreign key and should be enforced.
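The derivation of foreign keys from INDs and keys can be sketched in Python on the running example. This is a naive illustration, not the Mind algorithm of [4]: it tests every pair of columns for set inclusion, keeps the INDs whose right-hand side is a key, and then applies the logical-navigation filter. The names `DB`, `JOIN_ATTRIBUTES`, `unary_inds`, `is_key` and `foreign_keys` are hypothetical, and `JOIN_ATTRIBUTES` simply encodes the join assumed in Example 4 rather than anything mined from a real workload.

```python
# Table 1 transcribed as a dict: relation name -> list of rows.
DB = {
    "Product": [
        {"pid": 1, "name": "Chai", "price": "18.00"},
        {"pid": 2, "name": "Tofu", "price": "18.00"},
        {"pid": 3, "name": "Paté", "price": "12.30"},
        {"pid": 4, "name": "Paté", "price": "50.00"},
    ],
    "Order": [
        {"oid": 10509, "date": "11-10-1998", "shipVia": "Federal Shipping", "pid": 1, "qty": 2},
        {"oid": 10509, "date": "11-10-1998", "shipVia": "Federal Shipping", "pid": 2, "qty": 1},
        {"oid": 10510, "date": "1-5-2000", "shipVia": "United Package", "pid": 3, "qty": 1},
        {"oid": 10511, "date": "12-12-1998", "shipVia": "Speedy Express", "pid": 2, "qty": 2},
    ],
}

# Assumption taken from Example 4: the workload joins Order.pid with Product.pid.
JOIN_ATTRIBUTES = {("Order", "pid"), ("Product", "pid")}


def unary_inds(db):
    """Return all pairs ((R, A), (S, B)) such that R[A] is included in S[B]."""
    cols = {
        (rel, attr): {row[attr] for row in rows}
        for rel, rows in db.items()
        for attr in rows[0]
    }
    return {(l, r) for l in cols for r in cols if l != r and cols[l] <= cols[r]}


def is_key(rows, attrs):
    """attrs is a key of rows if no two rows share the same values on attrs."""
    return len({tuple(row[a] for a in attrs) for row in rows}) == len(rows)


def foreign_keys(db, navigation):
    """Keep INDs over navigation attributes whose right-hand side is a key."""
    return {
        (l, r)
        for l, r in unary_inds(db)
        if l in navigation and r in navigation and is_key(db[r[0]], [r[1]])
    }


for lhs, rhs in sorted(unary_inds(DB)):
    print(f"{lhs[0]}[{lhs[1]}] ⊆ {rhs[0]}[{rhs[1]}]")
print(foreign_keys(DB, JOIN_ATTRIBUTES))
```

On Table 1 the first loop prints the three INDs of Example 3, and the navigation filter leaves only Order[pid] ⊆ Product[pid], the foreign key of Example 7. Without the filter, the accidental IND Order[qty] ⊆ Product[pid] would also pass the key test, which is why the heuristic matters.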
Database restructuring. From interesting FDs and INDs, classical algorithms (e.g. the synthesis algorithm) could be applied to normalize existing relation schemas (see [13] for non-interaction conditions between FDs and INDs).

When a restructuring has to be performed on an operational database, one needs to migrate the data, which is an easy task once the target schema is defined.

Example 8. From F = { Product: {pid} → {name,price}, Order: {oid} → {date,shipVia}, Order: {oid,pid} → {qty} } and I = { Order[pid] ⊆ Product[pid] }, we can normalize relation schemas and then migrate the data. We would get the database given in Table 2 (we omit the constraints).

Table 2: A restructured database

Product
pid  name  price
1    Chai  $18.00
2    Tofu  $18.00
3    Paté  $12.30
4    Paté  $50.00

OrderDetail
pid  oid    qty
1    10509  2
2    10509  1
2    10511  2
3    10510  1

Order
oid    date        shipVia
10509  11-10-1998  Federal Shipping
10510  1-5-2000    United Package
10511  12-12-1998  Speedy Express

There is no easy solution for taking into account application programs that access the database schema through existing relations and views. One approach is to define a set of views over the new database schema to simulate the old database schema. Nevertheless, updating relational views is a difficult problem which is weakly supported by major RDBMS.

4 Conclusion

Most of the time, applications of data mining concern decision-making for humans. In this paper, we show that systems could also take advantage of data mining. More precisely, we have shown that logical database tuning, e.g. key and foreign key enforcement, is a new application of data mining. Indeed, it can be thought of as a decision that an expert (here the database system) has to make to improve her business (here data coherence enforcement) from knowledge hidden in her data.

Since data semantics of existing databases is mainly conveyed by FDs and INDs, two data mining problems need to be dealt with: FD inference and IND inference. We have contributed to each of these inference problems and we have presented in this paper how relevant and useful this knowledge can be for database applications.

References

[1] C. Beeri, M. Dowd, R. Fagin, and R. Statman. On the structure of Armstrong relations for functional dependencies. JACM, 31(1):30-46, 1984.
[2] P.A. Bernstein et al. The Asilomar report on database research. ACM Sigmod Record, 27(4):74-80, 1998.
[3] Q. Cheng, J. Gryz, F. Koo, T.Y.C. Leung, L. Liu, X. Qian, and B. Schiefer. Implementation of two semantic query optimization techniques in DB2 universal database. In Proc. of the 25th VLDB, pages 687-698, Edinburgh, Scotland, UK, 1999. Morgan Kaufmann.
[4] F. De Marchi, S. Lopes, and J.-M. Petit. Efficient algorithms for mining inclusion dependencies. In Proc. of the 7th EDBT, volume 2287 of LNCS, pages 464-476, Prague, Czech Republic, 2002. Springer.
[5] F. De Marchi, S. Lopes, and J-M. Petit. Samples for understanding data-semantics in relations. In International Symposium on Methodologies for Intelligent Systems (ISMIS'02), volume 2366 of LNAI, pages 565-573, Lyon, France, 2002. Springer-Verlag.
[6] F. De Marchi and J-M. Petit. Construction de petites bases de données d'Armstrong informatives. Revue d'intelligence artificielle RSTI-RIA (from EGC'03), 17:31-42, Jan. 2003.
[7] J. Demetrovics and V.D. Thi. Some remarks on generating Armstrong and inferring functional dependencies relation. Acta Cybernetica, 12(2):167-180, 1995.
[8] Ronald Fagin. Armstrong databases. Technical Report 5, IBM Research Laboratory, 1982.
[9] G. Gottlob and L. Libkin. Investigations on Armstrong relations, dependency inference, and excluded functional dependencies. Acta Cybernetica, 9(4):385-402, 1990.
[10] J. Gryz. Query Folding with Inclusion Dependencies. In Proc. of the 14th IEEE ICDE, pages 126-133, Orlando, Florida, Feb. 1998. IEEE CS.
[11] Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. Tane: An efficient algorithm for discovering functional and approximate dependencies. The Computer Journal, 42(3):100-111, 1999.
[12] M. Kantola, H. Mannila, K-J. Räihä, and H. Siirtola. Discovering functional and inclusion dependencies in relational databases. Int. Journal of Intelligent Systems, 7:591-607, 1992.
[13] M. Levene and G. Loizou. Guaranteeing no interaction between functional dependencies and tree-like inclusion dependencies. TCS, 254(1-2):683-690, 2001.
[14] S. Lopes, F. De Marchi, and J-M. Petit. DBA companion : un outil pour l'analyse de bases de données (demo session). In BDA'2002 (French database conference), pages 523-528, 2002.
[15] S. Lopes, J-M. Petit, and L. Lakhal. Functional and approximate dependencies mining: Databases and FCA point of view. Special issue of JETAI, 14(2/3):93-114, 2002.
[16] S. Lopes, J-M. Petit, and F. Toumani. Discovering interesting inclusion dependencies: Application to logical database tuning. IS, 17(1):1-19, 2002.
[17] H. Mannila and K-J. Räihä. Design by example: An application of Armstrong relations. JCSS, 33(2):126-141, 1986.
[18] H. Mannila and K-J. Räihä. Algorithms for inferring functional dependencies from relations. DKE, 12(1):83-99, 1994.
[19] R.J. Miller, M.A. Hernández, L.M. Haas, L-L Yan, C.T.H. Ho, R. Fagin, and Lucian Popa. The Clio project: Managing heterogeneity. ACM Sigmod Record, 30(1):78-83, Mar. 2001.
[20] N. Novelli and R. Cicchetti. Functional and embedded dependency inference: a data mining point of view. IS, 26(7):477-506, 2001.
[21] J-M. Petit, F. Toumani, J-F. Boulicaut, and J. Kouloumdjian. Towards the reverse engineering of denormalized relational databases. In Proc. of the 12th IEEE ICDE, New Orleans, Louisiana, pages 218-227, 1996.
[22] AutoAdmin project. Microsoft Research, http://www.research.microsoft.com/dmx/autoadmin.
[23] A. M. Silva and M. A. Melkanoff. A method for helping discover the dependencies of a relation. In Hervé Gallaire, Jean-Marie Nicolas, and Jack Minker, editors, Advances in Data Base Theory, pages 115-133, Toulouse, France, 1979.