
Metadata Version Management for DW 2.0 Environment

Ding Pan
Center for Business Intelligence Research, Management School, Jinan University, Guangzhou 510632, China
tpanding@jnu.edu.cn
doi: 10.4156/jcit.vol5.issue3.8

Abstract

As a new paradigm for data warehousing demanded by today's decision support community, DW 2.0 recognizes the life cycle of data within the warehouse, which makes the metadata evolution mechanism one of the important research issues. The requirements of multi-version management for the four data sectors in a DW 2.0 environment are described. Then a novel metadata versioning meta-model is proposed that is capable of storing and managing schema versions, comparing and interpreting the results of version queries, and tracing version evolution. For the implementation, schema evolution with versions is discussed at an abstract level using model management operators, and a verification engine that resolves evolution inconsistencies is presented. A prototype has verified the feasibility and validity of the approach.

Keywords: Metadata versioning, model management, DW 2.0, structural integrity

1. Introduction

Metadata is our knowledge of data that we have interpreted as information in a particular decision-making situation, used in a personal or organizational context. Metadata embraces the fundamental information that is used to understand business processes, provide information access, obtain data understanding, outline the rules applying to actions and data, and ultimately make business decisions. As a readily accessible resource, the Web is a huge data warehouse containing volatile information that is gathered and extracted into something valuable for use in an organizational context. Metadata management has become an essential part of the successful implementation of data warehousing, Web warehousing and Web mining [1].
A data warehouse (DW) is a construct that supplies integrated, granular, and historical data to the corporation. Data warehousing has been around for about two decades. Many forces have shaped the evolution of information architecture toward the DW 2.0 environment, which Inmon defined as the next-generation data warehouse architecture [2].

The architectural features of DW 2.0 represent an advance in technology and architecture beyond first-generation data warehouses. One of the most important features is that the life cycle of data within the data warehouse is recognized. Hence, a DW 2.0 data warehouse includes four life-cycle "sectors" of data: the interactive sector, the integrated sector, the near line sector and the archival sector. The DW 2.0 approach also recognizes that metadata management is a major and critically important part of the data warehouse infrastructure. Because metadata provides a context for interpreting the meaning of information, metadata management must cover the structure of the data over a broad spectrum of time. As one of the cornerstones of the architecture, it integrates local metadata from the four sectors from an enterprise perspective in order to support the evolution of their structures and data, i.e. metadata version management.

Support for managing the evolution of data and schemas emerged as an important feature in the database community many years ago. The evolution problem has been intensively investigated in various approaches and prototypes. Salzberg et al. presented research and technological work concentrating on data versioning [3]. In contrast to mediated and federated database systems, a DW stores not only elementary data but also data aggregated at many levels. Three kinds of solutions have been proposed, focusing respectively on: (1) schema and data evolution [4], (2) temporal extensions [5], and (3) versioning extensions [6].
Journal of Convergence Information Technology Volume 5, Number 3, May 2010

Metadata management has been an important component of information processing environments for many years. Moreover, metadata is a fundamental element in heterogeneous systems [7]. Wrembel et al. discussed detecting changes in external data sources and metadata management in a multi-version data warehouse [8]. Hauch et al. describe how MetaMatrix captures and manages metadata through the use of the OMG's MOF architecture and multiple domain-specific modeling languages, and how this semantic and syntactic metadata is then used for accessing and integrating data [9]. The need for metadata describing the life cycle of a DW system (planning, design, development, deployment, and usage), as well as the need for data exchange between heterogeneous tools and systems, resulted in two industrial metadata standards, i.e. the Open Information Model (OIM) and the Common Warehouse Metamodel (CWM) [10]. But the standards include neither models for the propagation and management of schema evolution nor methods for detecting evolution consistency in a DW system. Consequently, they do not provide support for temporal or multi-version queries. Commercial metadata management systems on the market, such as ASG Rochade Repository, IBM Metadata Server, Informatica SuperGlue, DAG MetaCenter, CA AllFusion Repository, Adaptive Metadata Manager and Unicorn System, do not offer advanced mechanisms for managing DW evolution or versioning capabilities. To our knowledge, little work has been done on version management for DW systems, especially for the DW 2.0 environment.

Bernstein and Melnik applied the model management method to metadata management, which offers a higher-level programming interface than current techniques, and showed how to apply the operators to classical metadata management problems [11].

In this paper we explore a metadata version management mechanism that traces the evolution of process flows and schemas with respect to the life cycle of data.
In the implementation, we discuss metadata evolution with versions using the model management operators, and a verification engine that resolves the evolution inconsistencies that might have been introduced during the metadata integration and maintenance process.

The rest of this paper is organized as follows: Section 2 overviews the basic concepts of the multi-version problem and describes the requirements of version management for DW 2.0; Section 3 presents a meta-model; Section 4 discusses the implementation of metadata evolution and consistency detection; finally, Section 5 concludes the paper.

2. The Problem and Requirement

A DW has various components. We may distinguish the following main components [2]:
(1) Data Acquisition, i.e. extract/transform/load (ETL), is the set of processes that capture, integrate, transform, cleanse, reengineer and load source data into the data warehouse and operational data store.
(2) The Data Warehouse is the central point of data integration and is the source of data for the data marts, delivering a common view of enterprise data.
(3) The Data Mart is customized and/or summarized data derived from the data warehouse and tailored to support the specific analytical requirements of a business unit or function.
(4) Meta Data Management is the process for managing the information needed to promote data legibility, use and administration.

In recognition of the life cycle of data within the data warehouse, the DW 2.0 data warehouse includes four major data life-cycle sectors, as shown in Figure 1. The first sector is the Interactive Sector, the place where online update of data occurs and where data enters from ETL processes. As data settles, it is integrated and then passed into the Integrated Sector. The probability of data access usually falls off with age. From the Integrated Sector the data can then move on to one of two sectors, one of which is the Near Line Sector.
The Near Line Sector can be considered an extension of the Integrated Sector. It is used when there is an extraordinarily large amount of data and the probability of access differs significantly across the data. Data from either the Near Line Sector or the Integrated Sector can then enter the Archival Sector. The Archival Sector holds data that has been collected electronically and may see some usage in the future [2].

At any given time, multiple versions of data coexist; e.g. there are two datasets in the integrated sector, where D2.1 represents a dataset whose schema is different from that of D2.2. Corresponding to these datasets, a multi-version metadata environment is generated.
Hence, the metadata management of DW 2.0 must have an advanced versioning mechanism for the sequence of schema and process versions. For a data warehouse based on a relational database, a schema version mainly involves table structures and constraints, whereas for a data mart based on a multidimensional data model, a schema version needs to include the fact tables, level tables, dimensions and hierarchies valid within a given time period.

            Interactive sector   Integrated sector   Near line sector    Archival sector
Data        D1                   D2.1 D2.2           D3.1 D3.2 D3.3      D4.1 D4.2 D4.3
Metadata    M1                   M2.1 M2.2           M3.1 M3.2 M3.3      M4.1 M4.2 M4.3

Figure 1. The metadata in DW 2.0 environment

A basic requirement of the schema version is shown in Figure 2. A schema version describes the structure of a DW within a given time period. A data cube is represented by a fact version. Versions of the same data cube belong to a multi-version fact, denoted by MV_Fact. The level versions {LVi1, LVi2, ..., LVim} form a hierarchy version HVi, which belongs to a multi-version hierarchy, denoted by MV_Hierarchy. Each level can have many versions that belong to a multi-version level, denoted by MV_Level. Each dimension can have many versions that belong to a multi-version dimension, denoted by MV_Dimension.

A schema version has many fact versions. For a given fact version FVi, the function FV→{HV} maps FVi to the set of hierarchy versions {HVi1, ..., HVik}. For a given hierarchy version HVi, the function HV→{LV} maps HVi to the set of level versions {LVi1, ..., LVik}. A hierarchy version belongs to a dimension version: for a given hierarchy version HVi, the function HV→DV maps the hierarchy version to the indicated dimension version. A given dimension version may include several hierarchy versions, and every hierarchy version can consist of various level versions.
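The assignment functions FV→{HV}, HV→{LV} and HV→DV can be read as plain lookups. The following is a minimal Python sketch under that reading; the class name, field names and the identifiers FV1, HV1 and so on are hypothetical and are not taken from the prototype.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaVersion:
    """One schema version: the structure of a DW within a validity period.

    All field names are illustrative assumptions, not the paper's design.
    """
    fv_to_hvs: dict[str, set[str]] = field(default_factory=dict)   # FV -> {HV}
    hv_to_lvs: dict[str, list[str]] = field(default_factory=dict)  # HV -> {LV}
    hv_to_dv: dict[str, str] = field(default_factory=dict)         # HV -> DV

    def dimensions_of_fact(self, fv: str) -> set[str]:
        """Dimension versions reached from a fact version via its hierarchy versions."""
        return {self.hv_to_dv[hv] for hv in self.fv_to_hvs.get(fv, set())}

    def levels_of_fact(self, fv: str) -> set[str]:
        """All level versions used by the hierarchy versions of a fact version."""
        return {
            lv
            for hv in self.fv_to_hvs.get(fv, set())
            for lv in self.hv_to_lvs.get(hv, [])
        }


# One fact version with two hierarchy versions of the same dimension version.
sv = SchemaVersion(
    fv_to_hvs={"FV1": {"HV1", "HV2"}},
    hv_to_lvs={"HV1": ["LV1", "LV2"], "HV2": ["LV1", "LV3"]},
    hv_to_dv={"HV1": "DV1", "HV2": "DV1"},
)
print(sv.dimensions_of_fact("FV1"))
print(sv.levels_of_fact("FV1"))
```

Keeping the three functions as separate maps mirrors the text: a fact version reaches its dimension versions only through hierarchy versions.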
[Figure 2: diagram of the multi-version containers MV_Fact (FV1 ... FVn), MV_Attribute (AV1 ... AVn), MV_Constraint (CV1 ... CVn), MV_Dimension (DV1 ... DVn), MV_Hierarchy (HV1 ... HVn) and MV_Level (LV1 ... LVn), connected by assignment functions such as FV→{HV}, HV→DV and HV→{LV}]

Figure 2. Basic requirement of the schema version

3. The meta-model for DW 2.0
An enterprise metadata repository must have an underlying model of the structure of the information stored in it, namely a meta-model. We provide a core meta-level ER model that describes the meta-objects of multidimensional modeling and is used to represent information about storing and operating on the sequence of metadata versions. In the meta-model, a metadata schema version consists of facts and dimensions; a dimension involves levels and hierarchies that depend upon the analytical context.

Figure 3 shows the main notions of the meta-model for versioning management. For simplicity, our model is described as a relational database schema. Essentially, the model consists of 8 basic elements (Versions, Facts, Dimensions, Levels, Hierarchies, Attributes, Constraints and Transactions) and some Associations.

Figure 3. The meta-model for versioning

The Versions entity stores data about all existing versions, i.e. a unique version identifier, name, begin and end validity times, status (whether a version is committed or under development), and parent-child (derivation) dependencies between versions.

Metadata about Fact versions is stored in the Fact_Vers entity. It contains a unique Fact identifier, name, the identifier of the multi-version Fact a given Fact belongs to, the Fact implementation name, the identifier of the Version a given Fact belongs to, and the identifier of the transaction that created the Fact.

Metadata about Dimension versions is stored in Dim_Vers. It contains a Dimension version identifier, name, the identifier of the multi-version Dimension a given Dimension belongs to, the identifier of the Version a given Dimension belongs to, and the identifier of the transaction that created the Dimension.

Metadata describing Hierarchy versions and their associations to Dimension versions is stored in Hier_Vers.
Versions of hierarchies are composed of level versions, whose descriptions are stored in Lev_Vers. This entity stores a level identifier, name, the identifier of the multi-version level a given level belongs to, the level implementation name, the identifier of the Version a given level belongs to, and the identifier of the transaction that created the level. The associations between hierarchy versions and their level versions are stored in Hier_Eles.

Fact versions are associated with Dimension versions via hierarchy versions. The associations are stored in FT_HV. Every instance in this entity contains the identifier of a Fact version, the identifier of a hierarchy version, the identifier of the Version in which the association is valid, and the identifier of the transaction that created the association.

Every Fact version and Level version includes a set of attributes, which are stored in the Attributes entity. A single attribute can be shared by multiple Fact or Level versions.
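Because the meta-model is described as a relational schema, part of it can be sketched with SQLite. Only the Versions and Fact_Vers entities are shown here. The column names, types and status values are assumptions derived from the prose descriptions above, not the actual schema of the prototype.

```python
import sqlite3

# Hypothetical DDL: column names and types are guesses based on the text.
DDL = """
CREATE TABLE Versions (
    ver_id      INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    begin_valid TEXT NOT NULL,
    end_valid   TEXT,
    status      TEXT NOT NULL CHECK (status IN ('committed', 'development')),
    parent_id   INTEGER REFERENCES Versions(ver_id)  -- derivation dependency
);
CREATE TABLE Fact_Vers (
    fact_id    INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    mv_fact_id INTEGER NOT NULL,                     -- multi-version Fact
    impl_name  TEXT NOT NULL,                        -- implementation name
    ver_id     INTEGER NOT NULL REFERENCES Versions(ver_id),
    trans_id   INTEGER NOT NULL                      -- creating transaction
);
"""


def open_repository() -> sqlite3.Connection:
    """Create an in-memory repository containing the two sketched entities."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    return conn


def facts_in_version(conn: sqlite3.Connection, ver_id: int) -> list[str]:
    """Implementation names of the fact versions that belong to one version."""
    rows = conn.execute(
        "SELECT impl_name FROM Fact_Vers WHERE ver_id = ? ORDER BY fact_id",
        (ver_id,),
    )
    return [row[0] for row in rows]


conn = open_repository()
conn.execute("INSERT INTO Versions VALUES (1, 'V1', '2009-01-01', '2009-12-31', 'committed', NULL)")
conn.execute("INSERT INTO Versions VALUES (2, 'V2', '2010-01-01', NULL, 'development', 1)")
conn.execute("INSERT INTO Fact_Vers VALUES (10, 'Sales', 100, 'SALES_V1', 1, 500)")
conn.execute("INSERT INTO Fact_Vers VALUES (11, 'Sales', 100, 'SALES_V2', 2, 501)")
print(facts_in_version(conn, 2))
```

Both rows of Fact_Vers share the same multi-version Fact identifier, which is how two versions of one data cube would be tied together in this sketch.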
Constraints that may be defined for the attributes of Fact and Level versions are stored in the Int_Cons and Att_Cons entities. Int_Cons stores the name of a constraint, its type and definition. Attributes of fact and level entities that have constraints defined are registered in Att_Cons.

4. Implementation

4.1. Schema evolution

Model management is a new approach to metadata management that offers a higher-level programming interface. A model is a formal description of a computer artifact, such as a web site layout, interface definition, DB schema, XML schema, or control flow diagram. Model management treats these models and the mappings between models as bulk objects and offers operators such as Match, Merge, and Diff.

We discuss the problem of schema evolution with versions at the level of model abstractions, using model operators, without involving the physical storage. In the actual implementation, the model operators are translated into functions, according to the model management method [12].

For the DW 2.0 environment, the life cycle of data forms a data evolution chain and, correspondingly, a chain of metadata integration and evolution. Schema evolution is usually performed after schema integration. The schema evolution problem arises when a change to a database schema breaks views that are defined on it. Formally, for schema evolution with versions, we are given base schemas S0 and S1, a view schema V1 over S0 and S1, and two mappings map0 and map1 that map objects of S0 and S1, respectively, to objects of V1, as shown in Figure 4. Then, given a new version S2 of S1, the problem is to define a new version V2 of V1 that is consistent with S0 and S2, together with two mappings to V2: map0' from S0 and map2 from S2.
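To make the operators concrete, here is a toy Python sketch in which a model is a set of element names and a mapping is a set of (source, target) pairs. These representations, the name-equality matching and the helper names are simplifying assumptions for illustration only; real model management operators work on much richer structures. The example finds the part of map1 that survives a change from S1 to S2, the new objects of S2, and the view elements left without a source.

```python
def match(m1, m2):
    """Naive stand-in for Match: pair elements that have equal names."""
    return frozenset((e, e) for e in m1 & m2)


def invert(mapping):
    """Swap the direction of a mapping."""
    return frozenset((b, a) for a, b in mapping)


def compose(first, second):
    """Follow `first`, then `second`: (a, c) if (a, b) in first and (b, c) in second."""
    return frozenset((a, c) for a, b in first for b2, c in second if b == b2)


def diff(model, mapping):
    """Elements of `model` that do not occur as a source in `mapping`."""
    mapped = {a for a, _ in mapping}
    return frozenset(model - mapped)


# Hypothetical schemas: S2 drops "fax" from S1 and adds "email".
S1 = frozenset({"id", "name", "fax"})
S2 = frozenset({"id", "name", "email"})
V1 = frozenset({"v_id", "v_name", "v_fax"})
map1 = frozenset({("id", "v_id"), ("name", "v_name"), ("fax", "v_fax")})

map3 = match(S1, S2)                      # what is unchanged from S1 to S2
map4 = compose(invert(map3), map1)        # part of map1 unaffected, now from S2 to V1
new_objects = diff(S2, invert(map3))      # objects of S2 with no counterpart in S1
orphans = diff(V1, invert(map4))          # view elements that lost their source

print(sorted(map4))
print(sorted(new_objects), sorted(orphans))
```

In this toy setting the orphaned view element would be deleted and the new object merged into the view, which is the intuition behind evolving V1 into V2.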
We can solve this problem using model management operators as follows (Figure 5):

[Figures 4 and 5: diagrams of the schemas S0 to S3, the views V1 to V5 and the mappings between them]

Figure 4. Schema evolution problem
Figure 5. Result of schema evolution

(1) map3 = Match(S1, S2) // map3 identifies what is unchanged in S2 relative to S1
(2) map4 = map1 ▪ map3 // right composition; map4 is the part of map1 unaffected by the change from S1 to S2
(3) <V3, map5> = DeepCopy(V1, map4) // V3 is a copy of V1, along with a copy map5 of map4
(4) <V4, map6> = Diff(V3, map5) // identify orphans
(5) For m in Enumerate(map6) do Delete(domain(m)) from V3 // enumerate the orphans and delete them
(6) <S3, map7> = Diff(S2, map3) // S3 consists of the new objects of S2
(7) map8 = Match(S2, S3) // map8 identifies the corresponding objects of S2 and S3
(8) map9 = map5 ▪ map8 // right composition
(9) <V5, map10, map11> = Merge(V3, S3, map9) // merge V3 and S3 according to map9
(10) map12 = map10 ▪ map5
(11) map13 = map11 ▪ map8
(12) map15 = Match(map12, map13) // map15 identifies the corresponding objects of map12 and map13
(13) map16 = Merge(map12, map13, map15) // merge map12 and map13 according to map15
(14) map17 = Match(S0, V5) // create map17 between S0 and V5
(15) <V2, map0', map18> = Merge(S0, V5, map17)
(16) map2 = map18 ▪ map16

Here, steps (1)-(5) evolve the view schema V1 over S0 and S1 (not including the new objects of S2), steps (6)-(7) determine the new objects of S2, steps (8)-(9) merge the two preceding results, steps (10)-(13) merge the temporary mappings, and steps (14)-(16) generate V2 and the two mappings map0' and map2, which completes the evolution.

4.2 Prototype implementation

Our prototype system is based on the J2EE (Java 2 Enterprise Edition) architecture, a standard scheme for implementing and deploying enterprise applications. To experiment with and evaluate the feasibility of the design, we constructed the main components in a prototype. The prototype was run on a server with two quad-core Xeon CPUs and 8 GB of main memory, with Windows 2003 Server, Oracle 10g, Tomcat 5.0 and Java 2 SDK 1.5. We chose to concentrate on several issues of our design, namely meta-model management, versioning management, and consistency detection.

The metadata repository stores the layered, multi-version meta-object information. The prototype uses an object-oriented knowledge representation model based on the OMG MOF architecture [10]. The MOF standard defines an abstract meta-meta-model and mappings to a generic OO language. The MOF meta-model consists of four layers, from M0 to M3. The relationship between objects on two adjacent layers is the type-instance relationship, i.e. definitions on a layer are instances of definitions on the next higher layer. Structural integrity ensures the conformance of objects on one layer to type definitions on the adjacent higher meta-level. Structural integrity results from the strict enforcement of the type-instance relationship across meta-layers. Without structural integrity, repository applications might create or modify objects on Mn-1 that are inconsistent with their meta-classes on Mn.
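The type-instance rule can be illustrated with a small check that every object on layer Mn-1 names a meta-class defined on layer Mn and carries only attributes that the meta-class declares. The dictionary layout and the function name below are hypothetical; this is a sketch of the idea, not the MOF-based repository of the prototype.

```python
def check_conformance(meta_classes, objects):
    """Return a list of violations of `objects` against `meta_classes`.

    meta_classes: {class name: set of declared attribute names}      (layer Mn)
    objects:      list of {"type": class name, "attrs": {name: val}} (layer Mn-1)
    """
    problems = []
    for index, obj in enumerate(objects):
        declared = meta_classes.get(obj["type"])
        if declared is None:
            # The object refers to a type that does not exist one layer up.
            problems.append(f"object {index}: unknown meta-class {obj['type']!r}")
            continue
        undeclared = sorted(set(obj["attrs"]) - declared)
        if undeclared:
            problems.append(f"object {index}: undeclared attributes {undeclared}")
    return problems


# Hypothetical M2 definitions and M1 objects.
m2 = {"Fact": {"name", "impl_name"}, "Level": {"name"}}
m1 = [
    {"type": "Fact", "attrs": {"name": "Sales", "impl_name": "SALES_V1"}},
    {"type": "Level", "attrs": {"name": "Month", "owner": "x"}},
    {"type": "Cube", "attrs": {}},
]

for problem in check_conformance(m2, m1):
    print(problem)
```

The check reports violations instead of raising on the first one, so a whole version can be inspected in a single pass.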
The prototype provides a verification engine to resolve any structural inconsistencies that might have been introduced during the metadata integration and maintenance process. Maintaining consistency is important not only during the integration process, where a particular meta-object might "use" other meta-objects, but even more so during the deployment phase, where the meta-objects have to be valid and consistent so that applications can use them without errors.

The verification engine ensures that the object structure of M1 (metadata) conforms to the type definitions of M2 (meta-model). According to the meta-model in Figure 3, the structural integrity algorithm, which takes a version identifier as its input parameter, is shown below. It needs an effort of O(nmijk), where, for the given version, n, m, i, j and k denote the total numbers of facts, hierarchies, levels, attributes and constraints, respectively.

CheckStructuralIntegrity(ver, namespace)
  If not ValidMetaObject(ver) then Output("invalid:" ver); Return End If
  If not ValidMetaObjectTransaction(ver) then Output("invalid transaction:" tr) End If
  If not ValidMetaObjectDimension(ver) then Output("invalid dimension:" dim) End If
  For all fact in refFactInVersion(ver) do
    If not ValidMetaObject(fact) then Output("invalid fact:" fact) End If
    For all lev in refLev(fact) do
      If not ValidMetaObject(lev) then Output("invalid ref_level:" lev) End If
    End For
    For all hier in refHier(fact) do
      If not ValidMetaObject(hier) then Output("invalid hierarchy:" hier) End If
      // dim: the dimension version that hier belongs to
      If not ValidMetaObject(dim) then Output("invalid dimension:" dim) End If
      For all lev in refHierLev(hier) do
        If not ValidMetaObject(lev) then Output("invalid Hier_lev:" lev) End If
        For all attr in refAttr(lev) do
          If not ValidMetaObject(attr) then Output("invalid Lev_Attr:" attr) End If
          For all con in refCon(attr) do
            If not ValidMetaObject(con) then Output("invalid Cons:" con) End If
          End For
        End For
      End For
    End For
    For all attr in refAttr(fact) do
      If not ValidMetaObject(attr) then Output("invalid fact_Attr:" attr) End If
      For all con in refCon(attr) do
        If not ValidMetaObject(con) then Output("invalid Cons:" con) End If
      End For
    End For
  End For

5. Conclusion

The schema evolution problem has been intensively investigated in various approaches for many years. Since DW 2.0 recognized the life cycle of data within the data warehouse, tracing and handling changes in metadata has become a new challenge. Structural changes applied inappropriately to a schema version may lead to wrong analytical results. Currently, most commercially available metadata management systems do not offer mechanisms for managing multi-version states. In this paper, we have discussed a metadata versioning mechanism and its implementation. The approach is based on a multi-version metadata repository, where a schema version represents the structure of a DW within a certain time period. Our meta-model is capable of managing versions of schemas in the DW 2.0 environment and supports presenting, comparing, and interpreting the results of queries that address several versions. For the implementation, we discussed the problem of schema evolution with versions based on model management, as well as structural integrity detection. We have started to experiment with such systems, and the first results are encouraging.

6. Acknowledgment

This work is supported by the National Natural Science Foundation of China under grants No. 70771044 and 70872020.

7. References

[1] A. Scime, Web Mining: Applications and Techniques, Idea Group, Hershey, 2004.
[2] W.H. Inmon, D. Strauss, G. Neushloss, DW 2.0: The Architecture for the Next Generation of Data Warehousing, Morgan Kaufmann, San Francisco, 2008.
[3] B. Salzberg, L. Jiang, D. Lomet, et al., "A framework for access methods for versioned data", In Proc. of Int. Conference on Extending Database Technology (EDBT), pp. 730-747, 2004.
[4] E. Rahm, P.A. Bernstein, "An on-line bibliography on schema evolution", SIGMOD Record, vol. 35, no. 4, pp. 30-31, 2006.
[5] C. Letz, E.T. Henn, G. Vossen, "Consistency in data warehouse dimensions", In Proc. of Int'l Database Engineering and Applications Symposium (IDEAS), pp. 224-232, 2002.
[6] M. Golfarelli, J. Lechtenborger, S. Rizzi, G. Vossen, "Schema versioning in data warehouses", In Proc. of ER Workshops, LNCS 3289, pp. 415-428, 2004.
[7] A.Y. Halevy, N. Ashish, D. Bitton, et al., "Enterprise information integration: successes, challenges and controversies", In Proc. of the SIGMOD 2005, pp. 778-787, 2005.
[8] R. Wrembel, B. Bebel, "Metadata management in a multiversion data warehouse", In Proc. of CoopIS/DOA/ODBASE 2005, Springer, pp. 1347-1364, 2005.
[9] R. Hauch, A. Miller, R. Cardwell, "Information intelligence: metadata for information discovery, access, and integration", In Proc. of the 2005 ACM SIGMOD Int'l Conf. on Management of Data, Baltimore, pp. 793-798, 2005.
[10] Object Management Group, Common Warehouse Metamodel Specification, v1.1. Retrieved Aug. 10, 2005 from http://www.omg.org/cgi-bin/doc?formal/03-03-02
[11] P.A. Bernstein, S. Melnik, "Model management 2.0: manipulating richer mappings", In Proc. of the 2007 ACM SIGMOD Int'l Conf. on Management of Data, ACM Press, pp. 1-12, 2007.
[12] D. Pan, J. Shen, "Research on software architecture for realtime metadata management", J. of Xi'an Jiaotong University, vol. 39, no. 6, pp. 566-569, 2005.