Semi-automatic Discovery of Mappings Between Heterogeneous Data Warehouse Dimensions

Abstract: Data Warehousing is the main Business Intelligence instrument for the analysis of large amounts of data. It permits the extraction of relevant information for decision-making processes inside organizations. Given the great diffusion of Data Warehouses, there is an increasing need to integrate information coming from independent Data Warehouses or from independently developed data marts in the same Data Warehouse. In this paper, we provide a method for the semi-automatic discovery of common topological properties of dimensions that can be used to automatically map elements of different dimensions in heterogeneous Data Warehouses. The method uses techniques from the Data Integration research area and combines topological properties of dimensions in a multidimensional model.



ACEEE Int. J. on Information Technology, Vol. 01, No. 03, Dec 2011

Sonia Bergamaschi, Marius Octavian Olaru, Serena Sorrentino and Maurizio Vincini
Università degli Studi di Modena e Reggio Emilia, Modena, Italy
Email: {sonia.bergamaschi, mariusoctavian.olaru, serena.sorrentino, maurizio.vincini}@unimore.it

Index Terms: Data Warehouse, P2P OLAP, dimension integration

I. INTRODUCTION

In the past two decades, Data Warehousing became the industrial standard for analyzing large amounts of operational data, enabling companies to use previously hidden information to take strategic decisions. Nowadays, several tools and methodologies have been developed for designing a Data Warehouse at the conceptual, logical and physical level [10, 12]. In recent years, though, companies have been seeking new and innovative ways of using Data Warehouse information. For example, a new trend in today's Data Warehousing is the combination of information residing in different and heterogeneous Data Warehouses. This scenario is becoming more and more frequent as the dynamic economic context sees many company acquisitions and mergers. This means that at the end of the acquisition/federation process, the two independent Data Warehouses of the two companies have to be integrated in order to allow the extraction of unified information. This can be a time- and effort-consuming process that increases the risk of errors if executed manually. An automated or semi-automated process can increase the efficiency of such a process. This has been demonstrated in the data integration area, where designers make use of semi-automated tools (like [2, 4]) as support for the mapping discovery process between two independent and heterogeneous data sources.

The Data Warehouse integration process consists in combining information coming from different Data Warehouses. The problem is different from traditional data integration, as the information to integrate is multidimensional. Until now, few approaches have been proposed for the formalization and the solution of this problem (see the Related Work section), but none of them has been widely adopted.

In this paper, we propose the use of topological properties of Data Warehouses in order to semi-automatically discover mappings among dimensions. We advocate the use of these topological properties alongside semantic techniques, typical of data integration, to allow the automation of the Data Warehouse integration process. We provide a full methodology, based primarily on graph theory and on the class affinity concept (i.e., a mathematical measure of similarity between two classes), to automatically generate mappings between dimensions.

This paper is organized as follows: Section 2 presents an overview of related work, Section 3 provides a full description of the method that we propose, while Section 4 draws the conclusions of our preliminary research.

II. RELATED WORK

In this section, we present some approaches to the formalization and the solution of the Data Warehouse integration problem.

The work described in [12] proposes a simple definition of conformed dimensions as either identical or strict mathematical subsets of the most granular and detailed dimensions. Conformed dimensions should have consistent dimension keys, consistent column names, consistent attribute definitions and consistent attribute values, a strong similarity relation between dimensions that is very difficult to achieve with completely independent Data Warehouses. For data marts of the same Data Warehouse, the authors provide a methodology for the design and the maintenance of dimensions, the so-called Data Warehouse Bus Architecture, which is an incremental methodology for building the enterprise warehouse. By defining a standard bus interface for the Data Warehouse environment, separate data marts can be implemented by different groups at different times, and the separate data marts can be plugged together and usefully coexist if they adhere to the standard.
In [8] and [16] there is a first attempt to formalize the dimension matching problem. First of all, the authors provide the Dimension Algebra (DA), which can be used for the manipulation of existing dimensions. The DA contains three basic operations that can be executed on existing dimensions: selection, projection and aggregation. The authors then provide a formal definition of matching among dimensions and three properties that a matching can have: coherence, soundness and consistency. A mapping between dimensions that is coherent, sound and consistent is said to be a perfect matching. Of course, such mappings are almost impossible to find in real-life cases, but the properties can still be used to define Dimension Compatibility: according to [16], two dimensions are compatible if there are two DA expressions over them whose results admit a perfect matching. The definition is purely theoretical, and the paper does not provide a method for finding or computing the DA expressions and the matching between the dimensions. We try to fill this gap by providing a method for the automatic discovery of mappings between two given Data Warehouse dimensions.

In [11] the authors formalize the OLAP query reformulation problem in Peer-to-Peer Data Warehousing. A Peer-to-Peer Data Warehouse is a network (formally called a Business Intelligence Network, or BIN) in which the local peer has the possibility of executing queries over the local Data Warehouse and of forwarding them to the network. The queries are then rewritten against the remote compatible multidimensional models using a set of mapping predicates between the local attributes and the remote attributes. The mapping predicates are introduced to express relations between attributes. They are used to indicate whether two concepts in two different peers share the same semantic meaning, whether one is a roll-up or a drill-down of the other, or whether the terms are related. At the moment, the mappings have to be manually defined, which is a time and resource consuming effort, depending mostly on the size of the network and the complexity of the local Data Warehouses. As there is no central representation of a Data Warehouse model, mappings have to be created between every two nodes that want to exchange information. As a consequence, among a set of n nodes that want to exchange information with each other, a separate mapping set has to be manually generated for every pair of nodes in order to allow the nodes to have the same view of the available information. This quadratic growth of the number of mapping sets means that for large numbers of peers the manual approach is almost impossible, as the benefits of the BIN could not justify the high cost of the initial development phase. Our method can be used to overcome this problem, as it is able to semi-automatically generate the mapping sets. We will show how it is possible to obtain a complete set of coherent mappings using only schema knowledge and minor information about the instances. At the end of the process, the designer only has to validate the proposed mappings.

In [1] there is an attempt to automate the mapping process between multidimensional structures. The authors present a method, based on early works in data integration [5, 13], that allows designers to automatically discover schema matchings between two heterogeneous Data Warehouses. The class similarity (or affinity, as described in [6] and [9]) concept is used to find similar elements (facts, dimensions, aggregation levels and dimensional attributes), and similarity functions for multidimensional structures are proposed based on that concept. In our method, we also make use of semantic relations, but our approach is different: the authors in [1] rely on the semantic relations to discover the mappings among elements, whereas we use them only as a validation technique.

III. MAPPING DISCOVERY

As said earlier, this paper proposes a technique that enables the semi-automated discovery of mappings between dimensions in heterogeneous Data Warehouses. According to [15], this is an instance-level technique because it considers instance-specific information, mainly the cardinality and the cardinality ratio between the various levels of aggregation inside the same dimension. The main idea behind the method is that dimensions in a Data Warehouse usually maintain some topological properties. In this paper, we use cardinality-based properties.

To explain the idea, let us consider the sample dimensions in two different Data Warehouses presented in Fig. 1 and Fig. 2 (for simplicity, we used schema examples presented in [10]). As we can clearly see by analyzing the names of the attributes, the two considered dimensions are time dimensions. Let us now suppose that the dimension in the first example (we will call it S1 from now on) comprises every date from January 1st 2007 to December 31st 2009 (three complete years), and that the second (S2) comprises all the weeks between January 1st 2010 and December 31st 2010 (one complete year). If a tool analyzed the intersection between the two dimensions, the intersection would be null, as the two time dimensions cover different time periods.

Figure 1. The SALE dimension schema (S1)
Figure 2. The INVENTORY dimension schema (S2)

However, the two dimensions share some topological properties. Let us consider the cardinality of the finest level of dimension S1. As the date dimension covers a period of three years, in the Data Warehouse we will have 1,096 different dates. This information is not important in itself; rather, we may be interested in the fact that to each member of the closest aggregation level (month) corresponds an average of 30 different elements in the date level. Moving to the next aggregation level, a quarter is composed of three different months, a season is also composed of three different months (season and quarter are different, as between them there is a many-to-many relation), and a year is composed of four different quarters. These topological properties are not verified only between immediate aggregation levels. For example, in the same way, we may discover that, on average, a distinct element of the level year is an aggregation of twelve different elements of the month level.
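To make these cardinality ratios concrete, the short Python sketch below (not part of the original paper) enumerates the dates of the three complete years 2007-2009 used for S1 and computes the average cardinality ratios between the date level and its aggregation levels; the key functions used to define the levels are our own illustrative choice.

```python
from datetime import date, timedelta

# Enumerate every date of the three complete years 2007-2009 (the S1 example).
start, end = date(2007, 1, 1), date(2009, 12, 31)
days = [start + timedelta(n) for n in range((end - start).days + 1)]

def cardinality(key):
    """Number of distinct members of the aggregation level defined by a key function."""
    return len({key(d) for d in days})

n_dates    = len(days)                                            # 1096 distinct dates
n_months   = cardinality(lambda d: (d.year, d.month))             # 36
n_quarters = cardinality(lambda d: (d.year, (d.month - 1) // 3))  # 12
n_years    = cardinality(lambda d: d.year)                        # 3

print(round(n_dates / n_months, 1))   # ~30.4 : date-to-month ratio (roughly 30:1)
print(n_months // n_quarters)         # 3     : month-to-quarter ratio (3:1)
print(n_months // n_years)            # 12    : month-to-year ratio (12:1)
```

Running the same idea on S2 (the weeks of 2010) yields the 3:1 and 12:1 ratios mentioned in the text, even though the two dimensions share no instances, which is exactly the property the method exploits.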
The dimension S2, although it contains different data, presents the same topological properties: a 3:1 ratio between the cardinality of the levels month and season, and a 12:1 correspondence between month and year. The dimension in Fig. 3 can thus be seen as a common sub-dimension of the two initial dimensions. This sub-dimension can be used to semi-automatically map levels of S1 to levels of S2 and vice versa.

Figure 3. A simple common sub-dimension

This type of property works well not only on time dimensions, but also on dimensions describing a concept of the real world with a fixed topology. For example, two companies operating in the same country are likely to have a geographical dimension with the same topology. The two dimensions of the independent Data Warehouses will contain attributes with the same relations among them: quarters organized in cities, grouped into regions, and so on. Inside a single company, if two different groups develop two independent data marts, they are likely to use common dimensions describing a structure that reflects the organization of the company: sales and distribution, supply chain, commercial division, and so on. These dimensions must have the same structure throughout the Data Warehouse, as both of them need to conform to the actual organization of the company.

The remainder of this section contains a detailed description of the method we propose. It can be summarized in four steps:

1. First, the dimensions are considered as directed labeled graphs and are assigned a connectivity matrix.
2. Using the connectivity matrices, a common subgraph of the dimensions is computed.
3. Using the common subgraph, pairs of equivalent elements are identified.
4. Using a set of rules and the pairs of equivalent nodes identified in Step 3, a complete set of mappings is generated among the elements of the two schemas. In the end, these mappings are pruned using a semantic approach.

A. Mapping predicates

This paragraph contains a brief summary of the mapping predicates introduced in [11]. We decided to use this particular formalism for two main reasons: first of all, the predicates are sufficient for expressing the relations among attributes that we need in our method; secondly, we believe that the concepts introduced in this paper are better suited to environments where a large number of mapping sets are needed (like BINs). The use of a formalism already introduced in such an environment could facilitate the work of developers.

Definition 1. An md-schema is a triple <A, H, M> where:
• A = {a1, ..., ap} is a finite set of attributes, each defined on a categorical domain;
• H = {h1, ..., hn} is a finite set of hierarchies, each characterized by (1) a subset Attr(hi) of A (such that the sets Attr(hi), for i = 1, ..., n, define a partition of A) and (2) a roll-up, tree-structured partial order on Attr(hi);
• M = {m1, ..., ml} is a finite set of measures, each defined on a numerical domain and aggregable through one distributive operator.

Thus, hierarchies can be seen as sets of attributes with a partial order relation that imposes a structure on the set. We will use this partial order relation in the method for mapping discovery. Given two md-schemas, a mapping between two attribute sets can be specified using five mapping predicates, namely: same, equi-level, roll-up, drill-down and related.
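As a concrete illustration of Definition 1 and of the mapping predicates, the following minimal Python sketch models the md-schema triple and a mapping between attributes. It is our own rendering rather than code from the paper or from the BIN framework of [11], and every class and field name is an assumption.

```python
from dataclasses import dataclass
from enum import Enum

@dataclass(frozen=True)
class Hierarchy:
    """A hierarchy: its attributes plus the roll-up partial order as (finer, coarser) pairs."""
    attributes: frozenset[str]
    rollup: frozenset[tuple[str, str]]

@dataclass(frozen=True)
class MdSchema:
    """The md-schema triple <A, H, M> of Definition 1 (measures kept as plain names)."""
    attributes: frozenset[str]
    hierarchies: tuple[Hierarchy, ...]
    measures: frozenset[str]

class Predicate(Enum):
    """The five mapping predicates of [11]."""
    SAME = "same"
    EQUI_LEVEL = "equi-level"
    ROLL_UP = "roll-up"
    DRILL_DOWN = "drill-down"
    RELATED = "related"

@dataclass(frozen=True)
class Mapping:
    """A mapping predicate asserted between a local and a remote attribute (or measure)."""
    predicate: Predicate
    local: str
    remote: str

# Example: a time hierarchy similar to S1 and an equi-level mapping toward S2.
time_h = Hierarchy(frozenset({"date", "month", "quarter", "year"}),
                   frozenset({("date", "month"), ("month", "quarter"), ("quarter", "year")}))
m = Mapping(Predicate.EQUI_LEVEL, "month", "month")
```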
In particular, the mapping predicates are defined as follows:

• same predicate: used to indicate that two measures in two md-schemas have the same semantic value;
• equi-level predicate: used to state that two attributes in two different md-schemas have the same granularity and meaning;
• roll-up predicate: used to indicate that an attribute (or set of attributes) of one md-schema aggregates an attribute (or set of attributes) of the second md-schema;
• drill-down predicate: used to indicate that an attribute (or set of attributes) of one md-schema disaggregates an attribute (or set of attributes) of the second md-schema;
• related predicate: used to indicate that between two attributes there is a many-to-many relation.

The mapping predicates concern both measures and dimensions. In this paper we focus on mappings between dimensional attributes, thus we only use the last four predicates.

B. Cardinality matrix

The key idea of our approach is to see dimensions as directed labeled graphs, where the label between two adjacent nodes is the cardinality ratio between the two aggregation levels. Starting from these graphs, we can find a common subgraph where the cardinality ratio holds between every two adjacent nodes in each of the two initial graphs. A dimension can thus be seen as a labeled directed graph G = (V, E, l), where:

• V is the set of vertices of the graph, corresponding to the attributes of the model. In particular, V is the support of a hierarchy h, i.e., V = Attr(h);
• E ⊆ V × V is the set of edges, which contains a pair (vi, vj) only if vj is an immediate superior aggregation level of vi. This means that the initial graph only contains edges between an attribute and its immediate superior aggregation attributes;
• l is a labeling function, with l(vi, vj) the cardinality ratio between the two levels. For example, if vi and vj are physically represented by two relations Ri and Rj, then l(vi, vj) = |Ri| / |Rj|.

Two nodes vi and vj are connected by a directed labeled edge if between levels vi and vj (formally, between the attributes that compose the two levels) there is a many-to-one relation, which means that vj is an aggregation of vi. The label of the edge is the cardinality ratio between level vi and level vj.

Figure 4. First dimension

Let us now consider the sample hierarchy in Fig. 4. We can associate to it a connectivity matrix C that describes the graph. An element cij (with i ≠ j) of the matrix is greater than 0 if and only if vj is an immediate upper aggregation level of vi, and the value of cij is the cardinality ratio between the two levels. In computing the cardinality matrix, we have assigned a sequence number to every node, maintaining the following rule: if there is the possibility to aggregate from level vi to level vj, then the number associated to level vi is lower than the number associated to level vj. In our case, we chose a numbering that respects this rule (for instance, month is node 4 and quarter is node 6). Every line of the matrix corresponds to a node of the graph, identified by its sequence number. For example, the fourth line represents the outgoing edges of the fourth node (month). An element cij of the matrix is greater than 0 if (vi, vj) is an edge of the graph, which means that there is an edge from the level with sequence number i to the level with sequence number j. For example, the matrix of the graph represented in Fig. 4 will contain an element c4,6 > 0, as there is a directed edge from node 4 (month) to node 6 (quarter). The connectivity matrix is also extended with the diagonal elements cii = 1.

Figure 5. Connectivity matrices of the first dimension: (a) connectivity matrix; (b) final connectivity matrix

The connectivity matrix is shown in Fig. 5(a). The matrix can be further extended in order to include the cardinality ratio between every two levels vi and vj such that vj can be reached from vi through one or more aggregation steps. This is not possible for every pair of levels of the dimension: for example, there is no cardinality ratio between the nodes season and year, as neither of them is an aggregation of the other (between the two attributes there is a many-to-many relation). We want to expand the matrix because, in finding a common subgraph, there may be the case where a local attribute (corresponding to an aggregation level of the dimension) has no correspondence in any remote attribute, and thus no mapping can be generated for it.
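The construction of this immediate-aggregation connectivity matrix can be sketched in Python as follows; the function, the level ordering and the illustrative time hierarchy are our own assumptions, while the conventions (cii = 1, cij equal to the cardinality ratio of immediate aggregations) follow the description above.

```python
import numpy as np

def connectivity_matrix(levels, edges, cardinality):
    """Build the immediate-aggregation connectivity matrix of a dimension.

    levels      -- level names ordered so that finer levels receive lower sequence numbers
    edges       -- set of (finer, coarser) pairs for immediate aggregations only
    cardinality -- dict: level name -> number of distinct members of that level
    """
    n = len(levels)
    idx = {name: i for i, name in enumerate(levels)}
    c = np.zeros((n, n))
    np.fill_diagonal(c, 1.0)                     # c_ii = 1, as in the paper
    for finer, coarser in edges:
        i, j = idx[finer], idx[coarser]
        c[i, j] = cardinality[finer] / cardinality[coarser]   # label = cardinality ratio
    return c

# Hypothetical time hierarchy (names and counts are illustrative, not the paper's figures).
levels = ["date", "month", "quarter", "year"]
edges = {("date", "month"), ("month", "quarter"), ("quarter", "year")}
card = {"date": 1096, "month": 36, "quarter": 12, "year": 3}
C = connectivity_matrix(levels, edges, card)
```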
However, another attribute at a higher aggregation level may have a correspondence in another md-schema, so we need to compute the cardinality ratio between all the other nodes, skipping only the node that we may not want to consider. For example, in finding a common subgraph between the dimensions represented in Fig. 1 and Fig. 2, we simply had to ignore the attribute quarter in the first dimension, as it has no semantic equivalent in the second, but we still needed to consider the cardinality ratio between the level month and all its superior aggregation levels.

Algorithm 1 extends matrix C in order to include the cardinality ratio between every two nodes that satisfy the partial order relation. The main idea of the algorithm is that, if there exists a cardinality ratio x between levels i and k and a cardinality ratio y between levels k and j, then there exists a cardinality ratio x · y between levels i and j. Using the connectivity matrix C, we can aggregate from level i to level j if there exists a k such that cik > 0 and ckj > 0; furthermore, the value cik · ckj gives us the cardinality ratio between levels i and j. The algorithm is incremental, as it continuously builds possible paths of increasing size.

Algorithm 1: cardinality ratio computation
  repeat
    for i = 1 to n do
      for j = 1 to n do
        for k = 1 to n do
          if cik > 0 and ckj > 0 and cij = 0 then
            cij := cik · ckj
          end if
        end for
      end for
    end for
  until the matrix has not changed

The final connectivity matrix, obtained after applying the algorithm, is shown in Fig. 5(b). It describes an induced graph, obtained from the initial directed labeled graph by adding an edge for every two nodes vi and vj between which there is the possibility to aggregate (maintaining the aggregation direction). Formally, we have added a directed edge (vi, vj) whenever vj can be reached from vi in the initial graph¹; the label of the newly added edge is the product of the labels along the path from vi to vj.

¹ For simplicity, we used dimensions that have neither cycles nor multiple paths.
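One possible reading of Algorithm 1 as runnable Python is sketched below: the product-propagation step is repeated until the matrix no longer changes, under the paper's simplifying assumption of hierarchies without cycles or multiple paths. The function name and the copy-then-update strategy are ours.

```python
import numpy as np

def extend_connectivity(c):
    """Transitive extension of a connectivity matrix (our reading of Algorithm 1):
    whenever level i aggregates into k and k aggregates into j, the cardinality
    ratio between i and j is taken as the product c[i, k] * c[k, j]."""
    c = c.copy()
    n = c.shape[0]
    changed = True
    while changed:                      # 'repeat ... until the matrix has not changed'
        changed = False
        for i in range(n):
            for j in range(n):
                if i == j or c[i, j] > 0:
                    continue            # already known (or diagonal): skip
                for k in range(n):
                    if c[i, k] > 0 and c[k, j] > 0:
                        c[i, j] = c[i, k] * c[k, j]
                        changed = True
                        break
    return c

C_final = extend_connectivity(C)   # C is the matrix built in the previous sketch
```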
For end for example, for a subgraph is common if every until matrix has not changed element differs no more than from the mean value of every two corresponding elements.1 2 For simplicity, we used dimensions that have neither cycles nor For the sake of simplicity, we did not consider the case where moremultiple paths. than one maximum common subgraph exists.© 2011 ACEEEDOI: 01.IJIT.01.03.15 42
Algorithm 2: common sub-matrix computation
  C := empty matrix
  for every square sub-matrix A of the first matrix do
    for every square sub-matrix B of the second matrix with rank(B) = rank(A) do
      if every pair of corresponding elements aij, bij deviates from its mean (aij + bij) / 2 by no more than the tolerance determined by ε then
        if rank(A) > rank(C) then
          C := new matrix of rank rank(A)
          for every i, j do cij := (aij + bij) / 2 end for
        end if
      end if
    end for
  end for
  return C

We obtain a matrix (Fig. 8) that describes a maximum common subgraph of the two initial graphs. The matrix can be obtained from the first matrix by eliminating the first, second, third and sixth nodes (i.e., the first, second, third and sixth rows and columns of the matrix), or from the second matrix by eliminating the first node (i.e., the first row and column). As the resulting matrix is a 3×3 square matrix, the graph that it describes has 3 nodes. Let us call those nodes n1, n2 and n3 and assign them to the first, second and third row, in this order. If the graph contained multiple paths, the next step would be the cancellation of redundant edges: we would have to cancel every edge between two nodes for which there exists a path composed of two or more edges. The resulting common subgraph is depicted in Fig. 9.

Figure 8. Resulting matrix
Figure 9. Resulting subgraph

The next step is the actual mapping generation. We propose an equi-level mapping between nodes of the two initial dimensions that correspond to the same node in the common subgraph; for the other nodes, we can propose roll-up, drill-down or related mappings. As the first row of the resulting matrix is obtained from the fourth row of the first final connectivity matrix, we can state that node n1 corresponds to the node represented by the fourth row of that matrix, which is the node month. Similarly, we can say that node n1 also corresponds to the second node of the second graph, month. In the same way, it is possible to discover the correspondences between the nodes of the initial graphs and the nodes of the common subgraph. We then discover mappings by exploiting the following rules (a sketch of one possible implementation follows the list):

1. Rule 1: if two distinct nodes a and b of the two different dimensions have the same correspondent node in the common subgraph, then add the mapping equi-level(a, b).
2. Rule 2: if a and b are nodes of one graph, there is a path from a to b, and there is a node c in the other graph with a mapping rule between a and c, then add the corresponding mapping between b and c, together with its opposite.
3. Rule 3: if a and b are nodes of one graph, there is a path from a to b, and there is a node c in the other graph with a mapping rule between b and c, then add the corresponding mapping between a and c, together with its opposite.
4. Rule 4: if there are two nodes a and b in one graph such that there is a path from a to b, two nodes c and d in the other graph with a path from c to d, and there is a mapping rule between the two pairs, then add the corresponding mappings between the remaining nodes.
5. Rule 5: for every pair of nodes a and b of the two graphs for which no mapping rule has been found, add the mapping related(a, b).
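The sketch below shows one possible implementation of this rule-based generation. Because the exact predicates produced by Rules 2-4 are only partially legible in this copy of the paper, the roll-up/drill-down propagation, the data structures (reachability sets and correspondence maps) and the omission of Rule 4 and of the symmetric propagation on the second graph are all our own simplifications.

```python
def generate_mappings(reach1, reach2, corr1, corr2):
    """Rule-based mapping generation (our interpretation of Rules 1-5).

    reach1, reach2 -- dicts mapping every node of graph 1 / graph 2 to the set of
                      strictly coarser nodes reachable through aggregation paths
    corr1, corr2   -- dicts mapping a node to its correspondent node in the common
                      subgraph (only for nodes that have such a correspondent)
    """
    mappings = set()

    # Rule 1: nodes matched to the same node of the common subgraph are equi-level.
    equi = [(a, b) for a, ca in corr1.items()
                   for b, cb in corr2.items() if ca == cb]
    mappings |= {("equi-level", a, b) for a, b in equi}

    # Rules 2-3 (assumed propagation): nodes above an equi-level node roll up its
    # counterpart, nodes below it drill the counterpart down.
    for a, b in equi:
        for x in reach1[a]:                       # x aggregates a in graph 1
            mappings.add(("roll-up", x, b))
        for x, above in reach1.items():
            if a in above:                        # a aggregates x in graph 1
                mappings.add(("drill-down", x, b))

    # Rule 5: every still-unmapped pair of nodes becomes a related mapping.
    mapped = {(a, b) for _, a, b in mappings}
    mappings |= {("related", a, b)
                 for a in reach1 for b in reach2 if (a, b) not in mapped}
    return mappings

# Tiny hypothetical example: month is the shared equi-level node; year sits above it
# in graph 1, date below it, and week/month form the second graph.
reach1 = {"date": {"month", "year"}, "month": {"year"}, "year": set()}
reach2 = {"week": {"month"}, "month": set()}
print(generate_mappings(reach1, reach2, {"month": "n1"}, {"month": "n1"}))
```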
Fig. 10 contains a graphical representation of mapping rules 2, 3 and 4. Using these simple rules, we obtain the complete mapping list for the two sample dimensions, divided by the rule used to discover each mapping³.

Figure 10. Graphical representation of Rules 2, 3 and 4: (a) Rule 2; (b) Rule 3; (c) Rule 4

³ For simplicity, for Rules 2, 3 and 4 we just added the first mapping, as the second is the opposite of the first.

D. Semantic mapping validation

To validate our mapping rules, we decided to weight and prune them using a semantic approach based on Lexical Annotation. Lexical Annotation is the process of explicitly assigning one or more meanings to a term w.r.t. a thesaurus. To perform lexical annotation, we exploited the CWSD (Combined Word Sense Disambiguation) algorithm implemented in the MOMIS Data Integration System [7, 3], which associates to each element label (i.e., the name of the element) one or more meanings w.r.t. the WordNet lexical thesaurus [14].
Starting from the lexical annotations, we can discover semantic relations among elements of the different Data Warehouses by navigating the wide semantic network of WordNet. In particular, the WordNet network includes⁴:

• Synonym relations: defined between two labels annotated with the same WordNet meaning (synset, in the WordNet terminology; e.g., client is a synonym of customer);
• Hypernym relations: defined between two labels where the meaning of the first is more general than the meaning of the second (e.g., time period is a hypernym of year). The opposite of hypernym is the hyponym relation;
• Meronym relations: defined between two labels where the meaning of the first is part of/member of the meaning of the second (e.g., month is a meronym of year). The opposite of meronym is the holonym relation.

⁴ WordNet includes other semantic and lexical relations, such as antonym, cause, etc., which are not relevant for our approach.

We added the coordinate terms relation, which can be directly derived from WordNet: two terms are coordinated if they are connected by a hyponym or meronym relation to the same WordNet synset. Thus, for each identified mapping, we first annotated each label by using CWSD and then discovered the shortest path of semantic relations connecting the two elements in the WordNet network. Our goal was to validate the mappings by computing for each of them a weight on the basis of the identified WordNet path. We computed the weight by assigning to every edge (i.e., WordNet relation) of the path a coefficient, using the assignment rules in Table I; the final weight is given by the product of the single coefficients (thus, long paths have lower weights than short or direct paths). These coefficients were defined by considering the type of WordNet relation and the type of mapping to be validated:

• an equi-level mapping is semantically similar to a synonym relation (coefficient equal to 1);
• a roll-up/drill-down mapping is semantically similar to a holonym/meronym relation (coefficient equal to 1);
• a related mapping is semantically similar to a coordinate terms relation (coefficient equal to 1).

TABLE I. COEFFICIENT ASSIGNMENT

                       equi-level   roll-up   drill-down   related
  same synset             1.0         0.7        0.7         0.7
  hypernym                0.9         0.7        1.0         0.8
  hyponym                 0.9         1.0        0.7         0.8
  coordinated terms       0.7         0.7        0.7         1.0
  holonym                 0.7         1.0        0.3         0.8
  meronym                 0.7         0.3        1.0         0.8

For the other mapping/relation combinations we associated coefficients (lower than 1) on the basis of their relevance. For example, to the combination drill-down/holonym we associated a low coefficient (0.3), as the two semantically represent opposite concepts. For every mapping in the set, the corresponding weight has been computed; starting from these computed weights, we can prune the discovered mappings.
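A minimal sketch of this weighting and pruning step, assuming the Table I coefficients, a multiplicative combination along the shortest WordNet path, and the 0.5 threshold used later in the example, is given below; the relation names used as dictionary keys and the helper functions are our own.

```python
# Coefficients of Table I: WordNet relation -> {mapping type -> coefficient}.
COEFF = {
    "synonym":    {"equi-level": 1.0, "roll-up": 0.7, "drill-down": 0.7, "related": 0.7},
    "hypernym":   {"equi-level": 0.9, "roll-up": 0.7, "drill-down": 1.0, "related": 0.8},
    "hyponym":    {"equi-level": 0.9, "roll-up": 1.0, "drill-down": 0.7, "related": 0.8},
    "coordinate": {"equi-level": 0.7, "roll-up": 0.7, "drill-down": 0.7, "related": 1.0},
    "holonym":    {"equi-level": 0.7, "roll-up": 1.0, "drill-down": 0.3, "related": 0.8},
    "meronym":    {"equi-level": 0.7, "roll-up": 0.3, "drill-down": 1.0, "related": 0.8},
}

def mapping_weight(mapping_type, wordnet_path):
    """Weight of a mapping: product of the Table I coefficients of every WordNet
    relation found on the shortest path between the two annotated labels."""
    weight = 1.0
    for relation in wordnet_path:
        weight *= COEFF[relation][mapping_type]
    return weight

def prune(weighted_mappings, threshold=0.5):
    """Keep only the mappings whose semantic weight reaches the threshold."""
    return [(m, t) for m, t, path in weighted_mappings
            if mapping_weight(t, path) >= threshold]

print(mapping_weight("roll-up", ["holonym"]))                                     # 1.0
print(mapping_weight("related", ["hyponym", "hyponym", "hypernym", "hypernym"]))  # 0.4096
```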
Let us consider the example shown in Fig. 1 and Fig. 2: we needed to validate two mappings, one between month and week and one between month and holiday. CWSD annotates the labels month, week and holiday with their WordNet meanings. From the WordNet semantic network, we discovered a direct path between month and week, whose meanings are connected by a holonym relation; thus, by following the coefficients in Table I, we associated to the first mapping a weight equal to 1. On the contrary, between the terms month and holiday we discovered a path of length 4 composed of hyponym and hypernym relations (see Fig. 11); thus, we associated to the second mapping a weight equal to 0.41. Finally, we validated the mappings by applying a threshold on their semantic weights: by selecting a threshold of 0.5, we maintained the mapping between month and week, while we discarded the mapping between month and holiday.

Figure 11. Semantic validation of the mappings

IV. CONCLUSIONS & FUTURE WORK

In this paper, we argued that topological properties of dimensions in a Data Warehouse can be used to find semantic mappings between two or more different Data Warehouses. We showed how these properties can be combined with semantic techniques to efficiently generate a mapping set between elements of the dimensions of two independent Data Warehouses. However, some drawbacks exist. First of all, our method depends on the instance of the Data Warehouse: if too little, or only partial, information is present in the Data Warehouse, the cardinality ratios among levels could vary, rendering the mapping generation step ineffective. A second problem is that the mapping predicates have no exact equivalent among the WordNet semantic relations, so it is impossible to assign an exact weight coefficient to a specific type of mapping (for example, we associated 0.3 to the combination drill-down/holonym in order to penalize relations that are semantically opposite). These weights depend on the context of the Data Warehouse. A fine tuning of these coefficients could increase the precision of the method, but in any case a human validation is still required. This is also an issue in data integration, where developers and analysts rely on semi-automatic tools to discover semantic correspondences, but the process cannot be entirely automatic. As every approach proposed so far for Data Warehouse integration has flaws, we believe that a combination of approaches (like the topological/semantic approach proposed in this paper) could improve the accuracy of the mapping discovery process. In the future, we plan to further investigate the relation between the mapping predicates and the semantic relations between terms, and to study how they affect the efficiency of our method.

REFERENCES

[1] M. Banek, B. Vrdoljak, A. M. Tjoa, and Z. Skocir, "Integration of Heterogeneous Data Warehouse Schemas". IJDWM, 4(4):1-21, 2008.
[2] D. Beneventano, S. Bergamaschi, G. Gelati, F. Guerra, and M. Vincini, "MIKS: An Agent Framework Supporting Information Access and Integration". In M. Klusch, S. Bergamaschi, P. Edwards, and P. Petta, editors, AgentLink, volume 2586 of Lecture Notes in Computer Science, pages 22-49. Springer, 2003.
[3] S. Bergamaschi, P. Bouquet, D. Giacomuzzi, F. Guerra, L. Po, and M. Vincini, "An Incremental Method for the Lexical Annotation of Domain Ontologies". International Journal on Semantic Web and Information Systems (IJSWIS), 3(3):57-80, 2007.
[4] S. Bergamaschi, S. Castano, S. D. C. di Vimercati, S. Montanari, and M. Vincini, "A Semantic Approach to Information Integration: The MOMIS Project". In Sesto Convegno della Associazione Italiana per l'Intelligenza Artificiale (AI*IA 98), Padova, Italy, September 1998.
[5] S. Bergamaschi, S. Castano, and M. Vincini, "Semantic Integration of Semistructured and Structured Data Sources". SIGMOD Record, 28(1):54-59, 1999.
[6] S. Bergamaschi, S. Castano, M. Vincini, and D. Beneventano, "Retrieving and Integrating Data from Multiple Sources: the MOMIS Approach". Data Knowl. Eng., 36(3):215-249, 2001.
[7] S. Bergamaschi, L. Po, and S. Sorrentino, "Automatic Annotation in Data Integration Systems". In OTM Workshops (1), pages 27-28, 2007.
[8] L. Cabibbo and R. Torlone, "On the Integration of Autonomous Data Marts". In SSDBM, page 223 ff. IEEE Computer Society, 2004.
[9] S. Castano, V. D. Antonellis, and S. D. C. di Vimercati, "Global Viewing of Heterogeneous Data Sources". IEEE Trans. Knowl. Data Eng., 13(2):277-297, 2001.
[10] M. Golfarelli, D. Maio, and S. Rizzi, "The Dimensional Fact Model: A Conceptual Model for Data Warehouses". Int. J. Cooperative Inf. Syst., 7(2-3):215-247, 1998.
[11] M. Golfarelli, F. Mandreoli, W. Penzo, S. Rizzi, and E. Turricchia, "Towards OLAP Query Reformulation in Peer-to-Peer Data Warehousing". In I.-Y. Song and C. Ordonez, editors, DOLAP, pages 37-44. ACM, 2010.
[12] R. Kimball and M. Ross, "The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling". John Wiley & Sons, Inc., New York, NY, USA, 2nd edition, 2002.
[13] J. Madhavan, P. A. Bernstein, and E. Rahm, "Generic Schema Matching with Cupid". In P. M. G. Apers et al., editors, VLDB, pages 49-58. Morgan Kaufmann, 2001.
[14] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. Miller, "WordNet: An On-Line Lexical Database". International Journal of Lexicography, 3:235-244, 1990.
[15] E. Rahm and P. A. Bernstein, "A Survey of Approaches to Automatic Schema Matching". VLDB J., 10(4):334-350, 2001.
[16] R. Torlone, "Two Approaches to the Integration of Heterogeneous Data Warehouses". Distributed and Parallel Databases, 23(1):69-97, 2008.

© 2011 ACEEE. DOI: 01.IJIT.01.03.15
