BOB THE WAREHOUSE BUILDER MEETS DON THE DATA MINER: FROM CREATIVE BOOKKEEPING TO FORTUNE TELLING

Lucas Jellema, AMIS Services BV

www.odtug.com ODTUG 2003

Introduction

In this paper I will try to report on my recent entry into the world of Data Warehouses and Business Intelligence, my introduction to the Oracle 9i Technology Stack in this area and the concrete application of the tools and functionality offered. I illustrate the questions and challenges using the ODTUG organization as an example; however, I have assumed many things that I do not know to be true, or that I even know to be untrue. I hope you will forgive me these slight liberties with the truth, in light of a useful and tangible business case. However, any resemblance to real people, companies or events is purely coincidental.

By the way: this paper cannot even scratch the surface when it comes to discussing all the ins and outs of Data Warehouse and BI concepts and technology. I hope it can serve as an introduction that at least puts some things in perspective – something that would have helped me a lot one year ago.

Business Intelligence requirements

Design of a solution for a BI challenge can be driven from two ends:
• what is the data we have and what useful things can we do with it
• what are the business questions that we need answered in order to make better decisions, use our resources in a more effective way, increase our customer base, etc.

The second angle is more likely to provide true business value, as it is the business that drives the requirements rather than the mere availability of a lot of data. However, the first situation is not unlikely to occur, if only because “the business” does not know what can be done with today’s technology (and yesterday’s data). In both cases, at some point we have to decide:
1. what are the questions we will (initially) attempt to get answered
2. what is the information needed for answering the questions
3. where can we get hold of the information we need
4. how should the answer be presented (channel and format, frequency, audience)

Business Intelligence can be said to exist at various levels, each with a different objective, scope, audience and technical approach.

OPERATIONAL REPORTING

Straight reporting of facts and events, supporting the day-to-day processes. This type of Business Intelligence is typically derived directly from the operational systems, does not involve historical data and does little in terms of cross-system collection of data, aggregation, enrichment or transformation. The frequency of this level of BI is high – daily or even more often – and the results are applied rapidly. This paper does not go into this type of Business Intelligence, its importance notwithstanding.

TACTICAL/STRATEGIC – ON LINE ANALYTICAL PROCESSING (OLAP)

This type of Business Intelligence supports mid to longer term management processes. Operational data is typically aggregated and grouped by specific dimensions. The data is often collected, possibly cleansed and enriched, and merged together from several sources into a new data store that specifically supports the BI gathering; this data store is called the Data Warehouse. Data is typically filtered, as the scope of OLAP does not involve all aspects of the operational processes. The Data Warehouse in general contains several years of historical data, probably unlike the operational systems.

The BI sought after with OLAP includes financial reporting, monitoring of performance indicators (KPIs, Boardroom Dashboard, Balanced Scorecard), analysis of under- and over-performers (employees, countries/branches/departments/divisions, customers, products), trend analysis, forecasting – trying to translate knowledge of the past into predictions for the future, planning & budgeting – allocation of resources, and what-if scenarios to analyze the consequences of possible management choices. OLAP usually offers drill-down and roll-up capability that allows you to zoom in to a fairly detailed level when desired and then zoom out to a more aggregate level to get an overall picture.

DATA MINING

Data Mining is a new ball-game altogether. Data Mining is a technique and technology for sifting through large amounts of data about target events (sale, hire, response, fraud, sickness) and the characteristics related to those events, trying to find patterns that might exist. If those patterns exist, the assumption is that events can be predicted to a certain extent once a number of characteristics are known. Data Mining ideally will tell us that a pattern exists, which characteristics are of relevance in making a prediction – i.e. influence the probability of the event – and what the reliability of the predictions is, at least when applied to historical data.

With OLAP you ask the question yourself and typically know what you are looking for and where you can find it; the logic is linear and very straightforward, consisting largely of adding up, dividing and counting. In comparison, Data Mining can be considered somewhat non-intuitive. Because of the large volumes of data involved and the fairly complex statistical computations, the results of Data Mining are often not expected or easily explained. With Data Mining, you may not know what you are looking for – there could be a hidden pattern and, if so, that could help us.

Data Mining is primarily used to understand and predict on a case level. It derives predictions for individual cases from patterns among large numbers of cases, although no real aggregation is involved. Data Mining also requires data about non-events: to understand why an event took place you also need to know about the conditions under which the event did not occur.
To target potential customers more efficiently, you need to know which people (more specifically: which sets of characteristics) became customers in the past, but you also need to know who did not come your way. Since the likelihood of events can depend on many things, general and specific, known to you or not, successful data mining will typically require a substantial number of details about your customers, many of which you may not know. External sources can be used to supply many such details: demographics, (social) geographical data, weather information, data from the stock exchange, market researchers, the UN, etc.

Furthermore, information about the conditions when the event took place may not be enough: incidents and changes in conditions in the period prior to the event (or even expectations before the event) may just as well have had an influence. For example, the increase in salary as compared to last year, the number of children recently born in the customer’s family or the number of sunny days in the month prior to the car sale may have an influence on the chance that a convertible is acquired.

Oracle Technology – portfolio and strategy

Oracle has been very active over the last two years or so in the area of Data Warehousing and Business Intelligence. The portfolio has been revamped and many new initiatives and tools have been launched. Keywords in all of this seem to be: Integration, Web and Java enablement, and aligning with or even initiating standards (for example OMG’s CWM, JSR-73 and PMML for Data Mining, JOLAP – the JCP Java API for OLAP). Note that all OLAP functionality as well as the results of OLAP operations can be accessed from PL/SQL and SQL statements. Several tools have been discontinued and folded or transformed into other products – most prominently Oracle Darwin for Data Mining, Oracle Express for OLAP and Oracle Sales Analyzer & Oracle Financials Analyzer for BI on Oracle Applications.
The key tools that Oracle offers and that you should be aware of are:

Oracle 9i Warehouse Builder

Oracle 9i Warehouse Builder 9.0.4 (OWB) – part of Oracle 9iDS. OWB has a large Design Time component, a Client/Server tool with a Repository for Meta- or Design-Data. OWB is used for design and generation of a Data Warehouse and the ETL processes for populating the Data Warehouse from the indicated data sources, typically through a number of transformations. OWB supports the latest Oracle 9i functionality – for example External Tables for data load from files, SQL Merge and Upsert for loading data in non-empty tables, OLAP Catalog for logical Data Warehouse design and CWM2 to create Analytical Workspaces.

OWB has a process flow editor that integrates with Oracle Workflow. It can be used to schedule the workflow for (periodic) data loading into the Data Warehouse. OWB Design Time offers a number of Web based reports that publish the contents of the OWB Design Repository, much like the Repository Object Browser does for the Oracle Designer Repository. OWB offers a Java API through which the DWH and ETL Design can be manipulated in a programmatic fashion.
OWB also has a run-time component: a library of transformation functions and a framework for monitoring and auditing the ETL processes. OWB generates DDL and ETL scripts from the DWH and ETL design in its repository. This way of working, combined with the Graphical Design Editors, supports Rapid Development and Prototyping. It also greatly facilitates maintenance; for example, when a new Measure is added to a Fact (or Cube) and therefore an additional value is populated during ETL, you simply change the Design through the GUI and re-generate. When a new release of the Oracle RDBMS is available, OWB will generate new code based on the same Design data, now utilizing the latest functions and options in the RDBMS.

Oracle 9i Database

The Oracle 9i Database, especially with Release 2, provides a single integrated server for all Data Warehouse and OLAP operations, many of which previously had to be performed in separate environments such as the Oracle Express multi-dimensional database. The main Oracle 9i R2 components and features for DWH and OLAP – sometimes introduced in previous releases – are listed below. Note that the OLAP Engine and Data Mining are part of the Oracle 9i Enterprise Edition.

[Figure 1 Oracle 9i stack for Data Warehouse and Business Intelligence. Labels recoverable from the figure: BI Beans, Discoverer, Reports, Oracle 9i Release 2, OLAP API, SPLExecutor, SQL Engine, Oracle Enterprise Manager, OLAP DML, DBMS_AW, OLAP_TABLE, Query rewrite, OLAP Catalog, OLAP Engine, CubeViewer, SummaryAdvisor, OLAP Worksheet, AW Manager, Wizards for Cubes & Dimensions, Materialized Views, Analytic (multi-dimensional) Workspaces, Relational Tables, Data Mining, DM4J.]

• SQL Analytical Functions – advanced OLAP-style queries can be performed in normal SQL statements against normal relational data sources; examples are: ranking, moving window aggregates, first/last aggregates, multiple aggregation levels with Cube, Rollup and Grouping Sets, histograms.
• External Tables – for reading data from external files as if they were tables, allowing transformations without the need for staging tables
• Change Data Capture (CDC) and Oracle Streams – for advanced, cross database, near real-time Data Warehouse updates
• Table Functions – for pipelined, parallel execution of PL/SQL statements, producing multi-record output from a PL/SQL function that can for example be used to select from, e.g. select * from TABLE(function(parameters))
• Materialized Views and Query Rewrite – for storing pre-calculated aggregation results, typically along the dimensions of the cube. The Query Rewrite functionality transparently uses the MV data when the underlying table is queried for a pre-calculated value; that means that applications can benefit from Materialized Views without explicitly accessing or even knowing about them
• OLAP Catalog and OLAP (Java) API – The OLAP Catalog is the collection of OLAP Meta Data that define Dimensions and Hierarchies/Levels, Cubes and Measures. Note that the actual data referred to in the OLAP objects is stored in relational structures: Dimensions and Cubes map to Tables, Measures to Columns. The OLAP Catalog is used by the OLAP API and BI Beans to learn about the logical OLAP structures in your database. It is not related in any way to the Analytical Workspaces and the OLAP DML, except for the CWM2_OLAP_AW_CREATE package that creates Analytical Workspace objects based on definitions in the (relational) OLAP Catalog (the Analytical Workspace Manager provides a GUI to this same functionality).
  The OLAP API is a Java API for developing OLAP applications. The OLAP API uses the OLAP Catalog to know about OLAP objects. The OLAP API does not itself use the multi-dimensional Analytical Workspaces or OLAP DML; it only uses SQL. However, it does provide a way to execute OLAP DML against an Analytical Workspace. BI Beans are built on top of the OLAP API.
• Analytical Workspace, OLAP DML & OLAP Engine – An analytical workspace (AW) is a structure in a database schema that holds data in a multi-dimensional format. In such an AW, advanced OLAP calculations and operations can be performed by issuing OLAP DML commands that are executed by the OLAP engine.
  Some important OLAP operations are only possible in an AW – not through regular SQL statements; for example forecasting, regression analysis, time-series manipulation, allocation and what-if scenarios. OLAP DML commands can be issued to the OLAP engine in three ways: in the OLAP Worksheet tool, through the PL/SQL package DBMS_AW and via the OLAP API. The results of OLAP operations can be retrieved through the same channels.
  You can expose complex OLAP operations in a relational manner by using the OLAP_TABLE table function. This is an operator to which you pass a series of OLAP DML statements and that returns a number of records. You could create a relational View that returns the results of some OLAP operation:

    create view Complex_Forecast as
    select … from TABLE(OLAP_TABLE('aw_name duration session', 'result TYPE', 'OLAP DML commands'))
    where …

  The view is used like any other view: select * from Complex_Forecast. Note that the OLAP Engine and Analytical Workspaces integrated in Oracle 9i replace the Oracle Express tools.
• Data Mining – Data Mining capability (build and apply model) is available within the database and is therefore very easily integrated with applications and applied to operational data. Data Mining analyzes historical data of events, trying to discover patterns that would help us understand and even predict events in the future.
  Data Mining can produce Association Rules that describe the probability of several events occurring together, such as two products being in the same Shopping Basket. Web sites like Amazon can make recommendations to users based on these Association Rules: if you are interested in this item, we have found that it is likely you are also interested in this other item.
  Data Mining can also build Classification Models. Such a model can predict the probability of events for agents such as customers or voters, for example estimate the probability of a customer being a Very Likely Buyer or a Non-Buyer.
  The model tries to find out whether, in the historical data of large numbers of customers, there is a pattern through which such predictions can be made – provided the significant attributes of the agent are known.
  Clustering is used to group data into clusters of related items, the idea being that items within a cluster are pretty much alike and items in different clusters are quite distinct. It can be used for market segmentation, and it can also help focus the discovery of Association Rules or Classification Models on selected, coherent sets of items.
  Oracle Data Mining also provides a technique called Attribute Importance. This technique reports on the importance or relevance of attributes in determining the probability of an event or, in general, the classification of an item. Only the most relevant attributes need to be kept in the Data Mine. This helps to speed up the actual Data Mining, eases the process of populating and managing the data mine and may even reduce the costs we would otherwise incur for acquiring specific attributes from external parties.
  Data Mining in Oracle 9i is accessible through a Java API; it will act on relational data (in a Table or a View) and write its findings to relational tables as well. Note that rapid development of Data Mining application components can be done with the DM4J component (Data Mining for Java), which can be downloaded from OTN and installed as an add-in for Oracle 9i JDeveloper. Oracle 9i Data Mining replaces the stand-alone tool Oracle Darwin.
• Oracle Enterprise Manager – OEM provides extensive support for Data Warehousing and OLAP. The OLAP Catalog objects (Dimension, Cube) are listed in the Object Tree; wizards are available for creation and maintenance. OEM will generate the code to deploy OLAP objects to an Analytical (multi-dimensional) Workspace – though it currently uses the CWM Version 1 model (CWMLite) instead of the current, more advanced CWM 2 model. The Summary Advisor (or alternatively the DBMS_OLAP package) gives suggestions for the definitions of Materialized Views; it recommends pre-calculated aggregates based on the OLAP Meta Data that specify measures, dimensions and levels. The CubeViewer shows the Cube and allows the user to drill down and roll up along all dimensions of the Cube, showing the actual data from the Cube in Cross-Tab format. The Analytic Workspace Manager is a new tool, to be downloaded from OTN, that provides a GUI for managing an Analytic Workspace (creating and editing dimensions and variables, possibly based on the relational OLAP Catalog). The OLAP Worksheet is a graphical tool in which OLAP DML commands can be executed and results can be viewed.

Tools for Building applications for OLAP and Business Intelligence Reporting

• Oracle Reports – The SQL-based reporting tool in Oracle 9iDS for fixed and potentially parameterized queries in advanced layouts, typically run in batch mode, producing output formats such as ASCII, HTML, PDF, CSS, XML. Oracle Reports does not have intrinsic knowledge of Data Warehouse and OLAP structures, such as Cubes, Dimensions and Hierarchies.
• Oracle Discoverer – Oracle’s long running stand-alone, interactive, End User tool for OLAP and BI Reporting, part of Oracle 9iDS and Oracle 9iAS. Available on the Web (Plus and Viewer) and as Client/Server tool (Desktop), integrated with Portal and Oracle Reports. An End-User Layer (EUL) is configured through the Discoverer Administrator (or imported from Oracle Warehouse Builder). The EUL contains definitions of Business Areas, Folders (equivalent to Tables/Views), Items (equivalent to Columns), Hierarchies, etc. Users can perform ad hoc queries within the Business Areas made accessible to them. Results can be published in several layouts (graph, table, crosstab) and many formats (Excel, HTML, comma separated values file, XML for Oracle Reports). Note that Oracle Discoverer does not use the OLAP Catalog, nor does it access the OLAP API, OLAP DML or the Analytical Workspaces; Discoverer at present is a purely relational, SQL-based tool. Of course, through Table Functions and the OLAP_TABLE function it is very well possible to expose Analytical Workspaces in a relational manner, allowing Discoverer access.
• BI Beans – The BI Beans are a set of reusable Java components, built on top of the OLAP API, that allow rapid development of BI Applications. The BI Beans are integrated with Oracle 9i JDeveloper and offer WYSIWYG wizards that allow creation of sophisticated Reports (Graphs, Cross Tabular or Table Format) using complex selection rules, calculations and layout definitions, showing live data at design time. The wizard generates a Java Class that in turn invokes the BI Beans and OLAP API. This Java Class can be customized and incorporated in regular Java applications and be deployed to the Web (as JSP) or as a Java Client application. A good illustration of the capabilities of BI Beans is seen in the Car Sales Demo on OTN. BI Beans allow OLAP functionality to be customized in both appearance and behavior, and to be built into your own custom applications.
  With BI Beans you typically guide the user in his BI queries to a greater extent than with Oracle Discoverer. Discoverer requires more knowledgeable users than typical BI Beans based applications. BI Beans seem to be key to Oracle’s BI Reporting technology: Discoverer is partly built using BI Beans, as is the new Enterprise Planning & Budgeting product. Oracle 9i Forms allows BI Beans to be included in a form through the Pluggable Java Component architecture; see the example on OTN.
• DataMining Beans (DM4J) – see Oracle 9i Data Mining
• Oracle 9i Personalization – Based on the Data Mining engine, in particular the Association Rules, this tool helps you, for example, build personal recommendation functionality into your web-site.
• Oracle ClickStream – A tool for analyzing traffic on your web-site, to help you improve the web-site as well as learn about your customers. Which pages are often visited, which are never seen, which pages are visited by the same visitors? Apparently, Oracle is not promoting this tool very strongly; it seems to be moved out of Oracle 9iAS and into Oracle Warehouse Builder. Its exact future is somewhat hazy, and it does not look entirely promising.
• Oracle Enterprise Planning & Budgeting – A new tool on the horizon for BI with Oracle Applications (a Hosted Beta release in May 2003, production release scheduled for end of 2003). This tool will replace Oracle Sales Analyzer and Oracle Financials Analyzer (both built on the Oracle Express tools). It will be packaged with Oracle Applications and can also use OLAP definitions for non Oracle Apps systems. It is positioned for advanced analysis and planning for key processes supported by the Oracle E-Business Suite.

You may wonder whether there is no role at all for Oracle Forms and Oracle Designer. Well, there can be. Oracle Designer can be used to design the database for the data warehouse. That does not include the OLAP meta data – that is designed using Warehouse Builder – but it could include all Table Definitions. Oracle Warehouse Builder can import Table Definitions from Oracle Designer. Once again, we face the multi-meta-repository situation here. In truth, it does not seem too useful to use Oracle Designer in parallel with Oracle Warehouse Builder. Designer is the far better Design tool at present, but OWB is where the investments go. Furthermore, the actual table design is but a small part of the overall process of getting your OLAP application and Data Warehouse up and running. Definitions for Sources of data to be loaded into the Data Warehouse can be read from the Data Dictionary or from the Designer Repository.

Oracle Forms 9i is happily oblivious to all the OLAP fuss. However, through the Pluggable Java Component (PJC) Architecture, you may incorporate BI functionality, for example developed using the BI Beans, into your Forms applications. Your Forms can interact with these BI PJCs. See OTN for a demonstration of this concept.

ODTUG Business Questions

The ODTUG organization has several Critical Success Factors or Business Objectives.
One is the total number of members and their happiness with value for money. With regard to its conferences it will try to grow the number of attendees and improve the overall satisfaction of the attendees – assuming this will also have a beneficial effect on next year’s visitor numbers as well as on the number of non-member attendees who sign up to become members too.

One way of increasing the number of conference visitors and their satisfaction is by offering the right program. The program mix is made up of presentation topics, presenters, complexity levels, and the allocation of time slots and rooms. Of course budget is a limiting factor, as is the total conference duration of four days. How should we plan for next year’s conference: what is the right venue in terms of traveling distance, what mix of topics, levels and speakers should we have, which speakers are the favorites – and could be used to keep a good crowd until the last day of the conference – and how should we allocate the slots to the technology tracks?

Which technologies are on a roll and which are past their prime? Are there trends in interest for certain topics? Can we do a forecast for the interest in year(s) to come? What geographical regions should be our prime targets for marketing, given the potential for expansion? Which companies should we target and which industries are best left alone? Which technologies should be considered for addition to our current range of topics, given their potential among ODTUG members and (potential) conference attendees?

How can we cut our marketing costs by focusing, within the group of current members, past members, conference attendees and other individuals that we have information about, on those persons who are most likely to respond positively to our marketing message?
Would it be possible to offer personalized agenda suggestions to conference attendees, based on their past evaluations, current self-professed interests and the results of a test panel who scrutinized this year’s agenda and drew up their personalized agendas?

Some more direct questions are also asked: what were the top ten best rated presentations, what were the bottom ten presentations, who were the top ten best attended presenters, and what were the average and total audience and the ‘market share’ for each of the technology tracks?

Business Question details

We need to investigate whether we currently have (access to) the data that is required to analyze the questions asked. If we do not, either the question should be rephrased or dropped, or we have to acquire the data in some way or another. For each question that will be supported, several attributes must be set to specify:
• how the analysis will take place & how the results will be presented (graph/table, web/paper/client_server/file, predefined or through flexible user interaction)
• who will do the BI analysis
• to what extent the BI application should be integrated into new or existing applications
• how often the BI analysis will be performed (how often will the question be asked and how often should new source data be retrieved) and what the performance requirements are for answering the question/doing the analysis
• what the workflow is that the BI Application will be part of.

Technical Roadmap for Setting up Business Intelligence process

In short, the following steps describe the technical approach towards building a BI application. The assumption is that the Oracle 9i technology stack for Data Warehousing and Business Intelligence will be used. Note that this technical roadmap should be initiated from and flanked by a functional, process oriented roadmap. Also note that this is an iterative process: it is likely for the scope to broaden (more types of events become of interest) or deepen (more details about the selected events are required) as business questions are (re)(de)fined.

IDENTIFY AND REGISTER DATA SOURCES

Find out which relevant data is available where, and register these data sources in a Source Module in Oracle Warehouse Builder. Find out how to get hold of the data – how much data is it, what format is it in, what do all fields and columns mean in Data Warehouse terms, how often does it change, what are the time-windows for unloading, etc.

DESIGN AND GENERATE DATA WAREHOUSE (LOGICAL AND PHYSICAL)

Knowing the business questions, or at least the business areas that will be analyzed, and the availability of data sources, it should be possible to design the logical structure of the Data Warehouse in OWB. This is done using OLAP terms such as Dimension, Hierarchy and Level, Cube, Fact and Measure.

In short, Facts are the events or findings that form the basis of business questions. Examples of Facts are: Sales, Memberships, Test Results, Votes. For each Fact, one or more Measures are recorded.
Measures are numerical values that help specify and especially quantify the Fact. Measures will often be aggregated. Examples of Measures are Price, Time, Cost, Discount, Level of Satisfaction. Measures can be the result of a calculation or aggregation themselves, for example when the Data Warehouse will not provide the same level of detail or granularity as the Operational Data Store. An example of such Measures would be an aggregation of Cost, Price and the number of items sold for a specific Product during one business day at a Store – instead of recording each individual sale of the product.

Dimensions provide context for Facts, a bit like a lookup table. Dimensions also define the axes along which drilling down and rolling up takes place. Dimensions are found in the GROUP BY expression of SQL statements. Typical examples of Dimensions are Time, Geographic Location, Product, Customer, etc. Dimensions are built using one or more Hierarchies, for example Time built from a Hierarchy involving Day-Month-Quarter-Year. Each Level in a Dimension Hierarchy represents an aggregation level or drill-down/roll-up station.

Cubes are structures that relate Facts to Dimensions. Note that each Fact may be defined through one or more Dimensions, each of which is mandatory. Dimensions can provide context for many different Facts.

OWB will automatically create the design for the relational tables that will physically implement the OLAP objects. Subsequently, you should configure these tables, specifying physical properties such as Storage Details, Partitioning and Indexes. Once the Design is done for at least a coherent subsection of the Data Warehouse, OWB can validate the design and generate and deploy the DDL scripts to create the objects involved. This includes entries in the OLAP Catalog, Relational Objects and – if so desired – Analytical Workspaces.
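To make the relation between Facts, Measures, Dimensions and Levels a little more tangible, here is a minimal sketch in plain Python. It is not OWB output and not Oracle SQL; the fact rows, the track names and the function names `time_levels` and `roll_up` are all hypothetical and only illustrate how a Measure is aggregated along one Level of a Day-Month-Quarter-Year Time Hierarchy, grouped by a second Dimension.

```python
from collections import defaultdict
from datetime import date

# Hypothetical fact rows: one row per (day, track) with a single measure.
FACTS = [
    {"day": date(2003, 6, 22), "track": "Designer", "attendees": 120},
    {"day": date(2003, 6, 23), "track": "Designer", "attendees": 95},
    {"day": date(2003, 6, 23), "track": "Java", "attendees": 140},
    {"day": date(2002, 6, 10), "track": "Designer", "attendees": 150},
    {"day": date(2002, 6, 11), "track": "Java", "attendees": 60},
]


def time_levels(day):
    """Return the value of every level of the Time hierarchy for one day."""
    quarter = (day.month - 1) // 3 + 1
    return {
        "day": day.isoformat(),
        "month": f"{day.year}-{day.month:02d}",
        "quarter": f"{day.year}-Q{quarter}",
        "year": str(day.year),
    }


def roll_up(facts, level, measure):
    """Sum a measure per (time level value, track); the analogue of GROUP BY."""
    totals = defaultdict(int)
    for row in facts:
        key = (time_levels(row["day"])[level], row["track"])
        totals[key] += row[measure]
    return dict(totals)


by_year = roll_up(FACTS, "year", "attendees")   # rolled up to the Year level
by_day = roll_up(FACTS, "day", "attendees")     # drilled down to the Day level
```

Rolling up and drilling down are then just calls with a different `level` argument; in a real Data Warehouse the same effect would come from grouping on a different column of the Dimension table.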
DESIGN AND GENERATE ETL (MAPPING) AND WORKFLOW

With both Data Sources identified and registered in the OWB Source Module and the Data Warehouse Design recorded as Dimensions and Facts in an OWB Target Module, it is time to design the ETL process. This is done by specifying Mappings in the graphical Mapping Editor between Source Tables, Files or Queues and Target Dimensions or Facts, Tables, Queues or Files. As part of the Mapping, operators for joining, filtering, enriching or transforming the data can be specified using built-in or user defined transformation functions; the cleansed source data will be molded into the target Data Warehouse.

When the Design is done, it can be validated, generated and deployed. Generation typically yields PL/SQL packages that use SQL set operations as well as PL/SQL Bulk operations for fast processing of the data. The code is optimized for either the Oracle 8i or 9i Database. In addition, generation can also result in SQL*Loader scripts for use with file based sources that are not exposed as External Tables. When the ETL process is deployed, a PL/SQL package is created that can be invoked to perform an Extract, Transform and Load process as specified in the Design.
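The shape of such a Mapping (filter, transform, then load into a target that may already contain rows) can be sketched as follows. This is an illustration in plain Python under stated assumptions, not the PL/SQL that OWB generates: the function `run_mapping`, the member rows and the column names are hypothetical, and the insert-or-update step only mimics the idea behind a SQL Merge/Upsert.

```python
def run_mapping(source_rows, target, key, row_filter, transform):
    """Filter and transform source rows, then merge them into `target`.

    `target` is a dict keyed by the business key; an existing key is updated,
    a new key is inserted. Returns simple audit counts.
    """
    audit = {"inserted": 0, "updated": 0, "rejected": 0}
    for row in source_rows:
        if not row_filter(row):
            audit["rejected"] += 1      # row fails the filter operator
            continue
        out = transform(row)            # cleansing / enrichment step
        k = out[key]
        if k in target:
            target[k].update(out)       # "when matched then update"
            audit["updated"] += 1
        else:
            target[k] = out             # "when not matched then insert"
            audit["inserted"] += 1
    return audit


# Hypothetical member dimension that already holds one row.
member_dim = {1: {"member_id": 1, "name": "Ann", "country": "NL"}}

source = [
    {"member_id": 1, "name": " ann ", "country": "us"},
    {"member_id": 2, "name": "bob", "country": "nl"},
    {"member_id": None, "name": "x", "country": "de"},
]

audit = run_mapping(
    source,
    member_dim,
    key="member_id",
    row_filter=lambda r: r["member_id"] is not None,
    transform=lambda r: {
        "member_id": r["member_id"],
        "name": r["name"].strip().title(),
        "country": r["country"].upper(),
    },
)
```

Processing row by row keeps the sketch short; as noted above, the generated code would rather rely on set-based SQL and bulk operations for speed.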
Populating and refreshing the Data Warehouse constitutes a job that consists of several Mappings, at least one per Dimension and one per Fact. These Mappings should be executed in the right order – for example Dimensions before the Facts that refer to them, because of the underlying Foreign Keys. The job is scheduled on a periodic basis and will typically be performed in batch. Using the Process Flow editor in OWB, you define the process flow for the ETL process: which Mappings are executed and in what order, which conditions apply – e.g. do not start Mapping 4 before Mappings 2 and 3 are successfully completed – and what the dependencies on other process flows are. The process flow thus created can be deployed to Oracle 9i Workflow.

SCHEDULE, EXECUTE AND MONITOR ETL JOBS

Extract, Transform and Load (ETL) jobs consist of PL/SQL procedures and optionally SQL*Loader scripts. Scheduling and Executing ETL jobs is most easily done through a Process Flow defined in OWB and deployed to the Oracle Workflow engine. Within the Workflow Manager, the process can be scheduled and its execution monitored. Alternatively, you can create and schedule jobs in Oracle Enterprise Manager; however, OEM allows for less dependency logic between process steps and across jobs.

The Oracle Warehouse Builder Runtime Audit Viewer can be used to inspect the logging produced by the ETL jobs. This logging will show exactly how many records have been processed by each Mapping – how many inserted, updated and deleted in the Data Warehouse – and which errors have occurred. Errors in the ETL process can be tracked down to individual records and Mappings. The OWB Runtime Viewer also gives some information about performance.

ADMINISTRATE THE DATA WAREHOUSE

Obviously, the Data Warehouse is a database application with quite high administration demands.
For quick, scalable and reliable OLAP and reporting, it is important that current table and index statistics are available, efficient indexes are in place, Materialized Views are refreshed when needed, capacity is planned for, expired data is removed, security is enforced and performance is tuned. Oracle Enterprise Manager is the main tool for these DBA activities.
DEVELOP BI REPORTING APPLICATIONS
Once the Data Warehouse is in place and populated, it is time to start OLAP and Reporting. As discussed before, we have a choice of tools: Discoverer, Reports, BI Beans and non-Oracle offerings such as Business Objects for those who do not like Discoverer. Oracle Warehouse Builder can provide an import file for Oracle Discoverer that largely specifies the composition of the End User Layer for the Data Warehouse. OWB has also populated the OLAP Catalog, which fuels BI Beans.
Data Mining
Data Mining is a parallel but largely separate route. For the Business Questions that you decide to try to answer using Data Mining techniques, you do not know beforehand which data may prove relevant. You typically do not apply the same levels of aggregation you may use for assembling the Data Warehouse supporting OLAP applications. The Data Mine is a special variant of the Data Warehouse, with highly denormalized tables with large numbers of columns (several hundred is not uncommon in the initial stages of discovering patterns) and binned values. Binning means that every numerical value is translated to a bucket or category; the Data Mining algorithms have no way of knowing that, when it comes to percentages, 1.0 and 1.5 are very close and 1.0 and 99.0 are very far apart. Binning might result in 1.0 and 1.5 being in the same bin (“very low”) and 99.0 being in a different bin (“very high”). Now the model may find that the target event is probable for cases in the “very low” bin.
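The binning idea can be illustrated with a minimal sketch. The bin edges and the labels other than “very low” and “very high” are assumptions chosen for illustration only; they are not taken from Oracle Data Mining or from OWB, which have their own binning facilities.

```python
from bisect import bisect_right

# Hypothetical bin edges for a percentage column (0-100); assumed values.
BIN_EDGES = [5.0, 25.0, 75.0, 95.0]
# One more label than there are edges; the middle labels are assumptions.
BIN_LABELS = ["very low", "low", "medium", "high", "very high"]


def bin_percentage(value: float) -> str:
    """Translate a percentage into the label of the bin it falls in."""
    if not 0.0 <= value <= 100.0:
        raise ValueError(f"not a percentage: {value}")
    # bisect_right counts how many edges are <= value, which is the bin index.
    return BIN_LABELS[bisect_right(BIN_EDGES, value)]
```

With these assumed edges, 1.0 and 1.5 both land in “very low” while 99.0 lands in “very high”, so a mining algorithm that only sees categories treats the first two as the same case.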
However, to develop the Data Mine, you go through roughly the same steps as for the Data Warehouse for OLAP: identify data sources, design and implement the Data Mine, design and implement ETL processes and develop the Data Mining application. The first three steps can very well be carried out using Oracle Warehouse Builder, in much the same way as discussed before.
Identify and Register ODTUG Data Sources
The ODTUG organization has several sources of operational data at its disposal, both internal and external. First of all there is the Conference Database that holds the data for this year’s conference (see the ERD). It records all attendees, presentations, evaluations etc. Second is the ODTUG Members Administration. It contains details about all current and past members. The link between these two systems is the Member Id. During every conference, there is a conference evaluation in which the attendees are asked questions such as: what is the likelihood of you returning for next year’s conference, how did you learn about ODTUG and this year’s conference, would you recommend visiting the conference to your colleagues, what is your overall rating of the conference. The evaluation results are registered in a separate system that links through an Attendee Identifier to the Conference Database.
[Entity-relationship diagram; the layout and relationship lines were lost in extraction. Entities and attributes that can still be read from the diagram text:
• ORACLE PRODUCT and SUITE: NAME, VERSION, RELEASE DATE, INTRODUCTION DATE, OBSOLETION DATE (the split of attributes between the two entities is not recoverable)
• TECHNOLOGY TRACK: NAME, COLOR
• CONFERENCE
• DAY: DAY DATE, SEQ
• TIME SLOT: SEQ, START TIME, END TIME
• ROOM: NAME, CAPACITY
• PRESENTATION: COMPLEXITY LEVEL, PAPER SUBMITTED YN, VENDOR PRESENTATION YN, DEMO IN PRESENTATION YN
• ATTENDEE: FIRST NAME, LAST NAME, CITY, STATE, COUNTRY, FUNCTION, ODTUG MEMBER ID
• ATTENDANCE and EVALUATION: MOTIVATION FOR COMING, MOTIVATION FOR LEAVING EARLY, PERCENTAGE ATTENDED, SIGNED_YN, TIME SLEPT (the split of attributes between the two entities is not recoverable)
• COMPANY: NAME, CITY, STATE, COUNTRY, SECTOR, SIZE
• EVALUATION QUESTION: QUESTION, QUESTION TYPE
• EVALUATION ANSWER: REPLY, SCORE]
Figure 2.
Data model for the (fictitious) ODTUG Conference Database application
From external sources there is data available, some for free and some at a price, such as: Oracle annual product sales figures and shifts compared to last year; regional Oracle User Group membership numbers; Oracle product announcements, releases and obsoletions; geographical data (size of cities, states and countries, distance to the conference venue from each city); Dun & Bradstreet company data (size, industry, number of locations, number of Oracle licenses); Stock Exchange and Statistics Agencies (economic indicators per state/country/continent); the weather; and news events per region such as political assassinations, wars, general strikes, monetary crises, elections etc.
Oracle Warehouse Builder
To start the process using OWB, create a new project. Create an Oracle Database Source Module to contain the OCD (ODTUG Conference Database) Source Objects. Choose to import the Table Definitions from the Data Dictionary or from a Designer Repository. Select and import the tables that hold the operational data of the ODTUG Conference Database application. Create a Flat File Module for file-based Data Sources, for example the ODTUG Member Administration export data with records for each ODTUG member, past and present, and the Geographical information linking Cities and zip codes to States, Countries and Regions. Use the Import Meta Data Wizard to define Source Files by sampling an example of the source file. For source files you need to indicate their record structure and layout (single record type vs. multiple record types, delimited or fixed length) and identify each field. After you have created Source File definitions, you can create External Tables based on these file definitions; External Tables allow you to access the data in the file through normal SQL statements, instead of having to use SQL*Loader and staging tables to bring data into the database.
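What it means to describe a source file by its record structure can be shown with a small sketch. It reads the same hypothetical member record once from a delimited layout and once from a fixed-length layout. The field names, offsets and sample values are assumptions made up for this illustration; this is not how OWB or External Tables are implemented, only the idea of a file definition.

```python
import csv
import io

# Hypothetical field list for a member export file (assumed names).
FIELDS = ["member_id", "last_name", "city", "state"]

# Hypothetical fixed-length layout: (field, start offset, end offset).
FIXED_LAYOUT = [
    ("member_id", 0, 6),
    ("last_name", 6, 26),
    ("city", 26, 46),
    ("state", 46, 48),
]


def read_delimited(text: str, delimiter: str = ",") -> list:
    """Parse delimited records into dictionaries keyed by field name."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS,
                            delimiter=delimiter)
    return [dict(row) for row in reader]


def read_fixed(text: str) -> list:
    """Parse fixed-length records by slicing each line at the layout offsets."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        records.append({name: line[start:end].strip()
                        for name, start, end in FIXED_LAYOUT})
    return records


# The same made-up record in both layouts.
DELIMITED_SAMPLE = "000123,Jansen,Utrecht,UT\n"
FIXED_SAMPLE = f"{'000123':<6}{'Jansen':<20}{'Utrecht':<20}{'UT':<2}\n"
```

Both readers yield the same dictionary for the sample record, which is the point of a file definition: once the layout is declared, downstream steps no longer care how the file was physically organized.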
Design and Generate Data Warehouse (logical and physical)
Given the Business Questions asked and the data sources that are available, we can design the Data Warehouse. The Cubes of interest center on Conference Attendance and Presentation Attendance. For Conference Attendance, the following Dimensions are of interest, each with its levels for aggregation, roll-up and drill-down:
• Geographic Location: City > State > Country > Region
• Time: Year > Era (Client/Server, Three Tier, Internet Computing,…)
• Company: Company > Industry, Company > Type of Oracle relationship (customer, partner, investor, reseller,..), Company > Size Category, Company > City > State > Country > Region
• Person: Person > Role within Company, Person > Primary Product Interest > Product Suite, Person > Previous Attendance Category
Figure 3. OWB Dimension Editor - three hierarchies within the Company Dimension
The Measures connected with the Conference Attendance Cube include: duration of attendance (# days), overall conference evaluation, travel time to conference venue, birth date of attendee, duration of ODTUG membership (0 or more), #years Oracle experience.
Figure 4. Cube Editor linking Cube Conference_Attendance to Dimensions
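To make the roll-up along a Dimension hierarchy concrete, here is a minimal sketch that aggregates one Measure (days attended) at each level of the Geographic Location hierarchy City > State > Country > Region. The cities, their parents and the fact rows are invented sample data; in the real Warehouse this aggregation would be done by the database, not by application code.

```python
from collections import defaultdict

# Levels of the Geographic Location hierarchy, lowest first.
LEVELS = ("city", "state", "country", "region")

# Hypothetical dimension rows: city -> (state, country, region).
GEOGRAPHY = {
    "Miami": ("Florida", "USA", "North America"),
    "Orlando": ("Florida", "USA", "North America"),
    "Utrecht": ("Utrecht", "Netherlands", "Europe"),
}

# Hypothetical fact rows: (city of attendee, days attended).
FACTS = [("Miami", 4), ("Orlando", 3), ("Utrecht", 5), ("Miami", 2)]


def roll_up(facts, level):
    """Sum the days-attended measure at the requested hierarchy level."""
    index = LEVELS.index(level)
    totals = defaultdict(int)
    for city, days in facts:
        path = (city,) + GEOGRAPHY[city]  # full path from city up to region
        totals[path[index]] += days
    return dict(totals)
```

Calling `roll_up(FACTS, "city")` gives the most detailed totals, and moving to `"state"`, `"country"` or `"region"` collapses them step by step; drill-down is the same walk in the opposite direction.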
For the Presentation Attendance Cube, the following Dimensions are proposed:
• Speaker: Speaker (Person) > Company, Speaker (Person) > State > Country > Region
• Time: Slot > Day > Year
Single-level Dimensions: Technology Track, Level of Complexity.
Measures include: number of attendees, average overall evaluation for presentation, average evaluation for presentation skills, rank within slot with regard to size of audience (best visited presentation … worst visited presentation), number of previous presentations by the presenter at ODTUG conferences.
Oracle Warehouse Builder
Create a Target Module of type Database Module in OWB. Use the Dimension and Cube wizards to design the Warehouse objects. Note that OWB will create associated Table Definitions with Primary, Unique and Foreign Keys and Indexes to provide the relational implementation structure for the Warehouse objects. Use the Table Details and Configure option to specify the physical characteristics of the Tables, including Partition details and Indexes; the Configure tool has a Generate option to generate bitmap indexes for Foreign Keys, a common approach in data warehouses. Now you can validate and generate the design. Generating after successful validation will produce DDL scripts to create the designed objects for real in the database. OWB also has a tool called Deployment Manager. With this tool, objects can be deployed and re-deployed (updated) to the database. Note that deployment of OLAP Objects can include creation of Analytical Workspaces using the CWM Version 2 packages (apparently, this functionality is only supported with the Oracle 9i patch).
Design and Generate ETL (Mapping) and Workflow
Now that we have specified the target Warehouse as well as the source objects, it is time to design the ETL process: how source data is mapped to the target objects.
A mapping typically contains operators to join, split, filter, transform and otherwise manipulate the data that gets loaded into the Warehouse. Note that OWB will generate code that supports both loading into (inserting) and updating a Warehouse. The ETL code produces logging (details, warnings and errors) that is reported in the OWB Runtime Viewer. In OWB, create a Mapping object using the Graphical Mapping Editor. In this editor, you can select all the data sources that the mapping retrieves data from as well as all the data targets into which the data is loaded.
Figure 5. Screenshot of the OWB Mapping Editor, in which the ETL process from one or more source objects to one or more target objects is illustrated. Note: this mapping is not complete, nor is it for real.
Mapping Designs can be validated, generated and deployed. Generation typically renders PL/SQL packages that make use of the OWB Runtime Framework with its built-in transformation operator library. Alternatively, generation can produce SQL*Loader scripts.
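As an illustration of what a Mapping with a join, a filter and a transformation boils down to, the sketch below loads a small dimension table with one set-based INSERT ... SELECT statement. It uses Python's built-in sqlite3 module instead of Oracle so that it is self-contained; all table and column names are hypothetical, and the code is not what OWB generates, only the general shape of a set-based load.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create hypothetical source and target tables with a few sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE src_company  (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE src_attendee (id INTEGER PRIMARY KEY, first_name TEXT,
                                   last_name TEXT, company_id INTEGER);
        CREATE TABLE dim_attendee (attendee_id INTEGER PRIMARY KEY,
                                   full_name TEXT, company_name TEXT);
        INSERT INTO src_company VALUES (1, 'Acme'), (2, 'Globex');
        INSERT INTO src_attendee VALUES
            (10, 'Ann', 'Lee', 1),
            (11, 'Bob', NULL,  1),   -- incomplete record, filtered out below
            (12, 'Cy',  'Orr', 2);
    """)
    return conn


def run_mapping(conn: sqlite3.Connection) -> int:
    """Load dim_attendee in one set-based statement; return rows inserted."""
    cursor = conn.execute("""
        INSERT INTO dim_attendee (attendee_id, full_name, company_name)
        SELECT a.id,
               UPPER(a.last_name) || ', ' || a.first_name,  -- transform
               c.name
        FROM   src_attendee a
        JOIN   src_company  c ON c.id = a.company_id        -- join
        WHERE  a.last_name IS NOT NULL                      -- filter
    """)
    conn.commit()
    return cursor.rowcount
```

Running `run_mapping(build_demo_db())` processes all qualifying source rows in a single statement rather than row by row, which is the reason set-based code tends to be preferred for bulk loads.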
A typical ETL job consists of multiple mappings, for example one for each Dimension and one for each Cube. It is important that the entire job is executed, with its components in the correct order. With the OWB Process Flow editor (new in OWB 9.0.4), you can design the workflow for the ETL processes. The workflow is built up from mappings, decision points, alerts, email sending, report running, etc. In the Process Flow you specify the sequence of the activities and, to some extent, the conditions for starting an activity.
Figure 6. Screenshot from the OWB Process Flow editor. Organize individual Mapping activities into their logical order and deploy the process flow to the Oracle 9i Workflow Engine.
Schedule, Execute and Monitor ETL Jobs
The Process Flow can be registered with the Oracle Workflow Engine, included with the Oracle 9i Database. Subsequently, the regular workflow application can be used to schedule, execute and monitor ETL jobs. Mapping processes write Audit data to the OWB Runtime Framework. The OWB Runtime Audit Browser, a web-based tool reading the Runtime Framework, can be used to report on the Audit History of Workflow instances or individual Mappings. The Audit trail includes the number of records processed, inserted, updated, deleted, discarded and merged. It also reports on errors and warnings during the ETL process.
Figure 7. Example of a report in the Runtime Audit Browser
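The two ideas in this step, running Mappings in dependency order and keeping an audit record per Mapping, can be sketched together. The mapping names and the dependency graph are hypothetical, and the audit tuple is a simplification invented for this example; it does not reflect the actual OWB Runtime Framework or the Oracle Workflow engine. The sketch assumes Python 3.9 or later for the standard graphlib module.

```python
from graphlib import TopologicalSorter

# Hypothetical process flow: mapping name -> mappings that must finish first.
DEPENDENCIES = {
    "load_dim_geography": set(),
    "load_dim_company": {"load_dim_geography"},
    "load_dim_person": {"load_dim_company"},
    "load_cube_conference_attendance": {
        "load_dim_geography", "load_dim_company", "load_dim_person",
    },
}


def run_flow(dependencies, run_mapping):
    """Run mappings in dependency order; return audit rows (name, status, rows)."""
    audit = []
    blocked = set()  # mappings that failed or were skipped
    for name in TopologicalSorter(dependencies).static_order():
        if dependencies.get(name, set()) & blocked:
            # A prerequisite did not complete, so do not start this mapping.
            blocked.add(name)
            audit.append((name, "SKIPPED", 0))
            continue
        try:
            rows = run_mapping(name)  # caller-supplied; returns rows processed
            audit.append((name, "OK", rows))
        except Exception as exc:
            blocked.add(name)
            audit.append((name, f"ERROR: {exc}", 0))
    return audit
```

Passing a function that raises for `load_dim_company` shows the effect of the ordering rule: the Person Dimension and the Cube are skipped instead of being loaded against incomplete Dimensions, and the audit list records why.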
Note: since ETL processes are implemented using PL/SQL Packages, it is quite simple to invoke a Mapping yourself; you do not necessarily rely on the Process Flow Editor and Workflow Engine to get any ETL done.
Develop BI Reporting applications
Having the Data Warehouse designed, and in particular the OLAP Catalog populated, we are in good shape to develop BI Applications with BI Beans. Furthermore, Oracle Discoverer can be configured (EUL set up) with an export from Oracle Warehouse Builder. This allows for rapid deployment against Data Warehouses. Your applications may make use of OLAP DML or of more indirect means of accessing these technologies. Note that BI applications do not link directly into Warehouse Builder.
Data Mining
The first question we would like to subject to Data Mining: can we somehow predict the probability that an individual we have information about is interested in attending our conference, so that we can focus our marketing efforts on the most receptive audience? We do not know which of the attributes we have on each potential attendee help determine their decision to attend or not. Using historic data for previous conferences, describing people who decided to attend or not attend, we will have the Classification Algorithm try to build a model that will help us predict the chance of someone attending next year’s conference once we have some data on that person.
Figure 8. Screenshot from the DM4J Wizard in Oracle 9i JDeveloper: select which attributes from the mining table to include for building the model.
We can easily use the Data Mining for Java component, integrated with JDeveloper 9.0.3. We provide data in denormalized format to the Data Mining engine. This data is loaded from various Data Sources using ETL processes, similar to the ones used for the Data Warehouse. These processes can be designed and generated using Oracle Warehouse Builder.
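A common way to judge a classification model of this kind on historical data is its lift: how much larger the share of actual attendees is among the highest-scored contacts than in the population as a whole. The sketch below is a generic calculation and does not use the Oracle Data Mining API; the function name and the input shape, a list of (score, attended) pairs, are assumptions for illustration.

```python
def lift_at(scored, fraction):
    """Lift in the top `fraction` of contacts, ranked by model score.

    scored   -- list of (score, attended) pairs from historical data
    fraction -- share of the contact list to select, in the range (0, 1]
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    total_attendees = sum(1 for _, attended in ranked if attended)
    if total_attendees == 0:
        raise ValueError("no attendees in the historical data")
    captured = sum(1 for _, attended in ranked[:top_n] if attended)
    share_of_attendees = captured / total_attendees  # e.g. 0.75
    share_of_contacts = top_n / len(ranked)          # e.g. 0.30
    return share_of_attendees / share_of_contacts
```

For example, if the top 30% of ten scored contacts contains three of the four people who actually attended, the lift is 0.75 / 0.30 = 2.5; a lift of 1.0 would mean the model selects no better than picking contacts at random.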
Once the model is built, it can be tested to find out how good it is by applying it to historical data where it is known who did or did not attend. When the model proves itself to be reliable, we can apply it to the current records of the potential target customers. This should help us select the top 30% of contacts that, according to the model, should bring in 90% (good lift!) of the conference attendees.
To be able to make personal agenda recommendations, we have asked the members of a testing panel to study the conference agenda in great detail and write up a personal agenda: “which sessions will you visit?” we asked them. Furthermore, many personal details about the panel members are recorded. Now we try to Cluster the panel members and Build Association Rules for each Cluster. Once the model is created, we are ready to offer agenda recommendations to conference attendees who trust us with some personal information.
Resources
Some very useful resources when you are venturing into BI territory:
• Oracle by Example (for 9i Release 2, Chapters 8 and 10): excellent online tutorial and demonstration, with many concrete examples and step-by-step explanations and instructions.
• BI Beans Tutorial: online step-by-step tutorial for developing a BI application (Java Client or Web/JSP) with BI Beans and JDeveloper 9.0.3. Also BI Beans samples.
• Product documentation: Oracle Warehouse Builder 9.0.4 User Reference, Data Warehousing Guide, OLAP API, OLAP User Guide, OLAP DML, Data Mining, Discoverer, Personalization and ClickStream (9iAS Release 2).
• A book released in February 2003 with an interesting overview of Oracle 9i tools and technology for Data Warehouses and BI: Oracle 9iR2 Data Warehousing, by Lilian Hobbs, Susan Hillson and Shilpa Lawande (Digital Press, 2003, ISBN 1-55558-287-7).
AUTHOR’S BACKGROUND
Lucas Jellema works for AMIS Services BV, an independent software provider based in The Netherlands, specializing in Oracle technology with a full service portfolio: technical and business consultancy, project management and application development, database and application hosting, administration and maintenance. AMIS was founded in 1991 and is an Oracle partner. Lucas’ role at AMIS is technical consultant and expertise manager in the area of application development; he supervises four knowledge centers: Server Development & Programming Languages; Web & Java/J2EE; Oracle Designer, SCM & Forms; and Data Warehouse, Business Intelligence & Portal. Lucas has attended ODTUG conferences since 1997 and is a regular author and presenter. He used to be an expert on Oracle Designer, Oracle SCM and API programming – and claims he still is in a way.
His contributions to the Repository Object Browser or Oracle Designer Web Assistant and the Oracle JHeadstart initiative were his stepping stones into the domain of Web and Java/J2EE. He spent the larger part of the past year working on Oracle’s new Data Warehouse and Business Intelligence technology. Lucas worked for Oracle Corporation as a member of the global iDevelopment Center of Excellence until 2002. He can be contacted at jellema@amis.nl.