Data warehousing


notes on data warehousing

Published in: Business, Technology

  1. 1. INSTITUTE OF MANAGEMENT STUDIES, DAVV, INDORE
     E-BUSINESS FUNDAMENTALS
     Assignment on: “Data warehouse and data mining”
     Submitted to: Prof. Kritika Jain
     Submitted by: Shoma Roy, MBA (F.T) II Semester, SEC-B
     Date of Submission: 11/4/11
  2. 2. DATA WAREHOUSING

     A data warehouse (DW) is a database used for reporting. The data is offloaded from the operational systems and may pass through an operational data store for additional operations before it is used in the DW for reporting.

     A data warehouse maintains its functions in three layers: staging, integration, and access. Staging is used to store raw data for use by developers (analysis and support). The integration layer is used to integrate data and to provide a level of abstraction from users. The access layer is for getting data out for users.

     This definition of the data warehouse focuses on data storage. Data from the main sources is cleaned, transformed, catalogued and made available for use by managers and other business professionals for data mining, online analytical processing, market research and decision support (Marakas & O'Brien 2009). However, the means to retrieve and analyze data, to extract, transform and load data, and to manage the data dictionary are also considered essential components of a data warehousing system. Many references to data warehousing use this broader context. Thus, an expanded definition of data warehousing includes business intelligence tools, tools to extract, transform and load data into the repository, and tools to manage and retrieve metadata.

     Architecture

     Operational database layer: The source data for the data warehouse. An organization's Enterprise Resource Planning systems fall into this layer.

     Data access layer: The interface between the operational and informational access layers. Tools to extract, transform and load data into the warehouse fall into this layer.

     Metadata layer: The data dictionary. This is usually more detailed than an operational system data dictionary. There are dictionaries for the entire warehouse and sometimes dictionaries for the data that can be accessed by a particular reporting and analysis tool.

     Informational access layer: The data accessed for reporting and analysis, together with the tools for reporting and analyzing data. This is also called the data mart. Business intelligence tools fall into this layer. The Inmon-Kimball differences about design methodology, discussed later in this article, have to do with this layer.
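The three functional layers described above (staging, integration, access) can be sketched as a small pipeline. This is only an illustrative sketch: the `Order` record, the field names, and the sample rows are assumptions made for the example, not part of any particular warehouse product.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Order:
    """Integrated (cleaned and typed) form of one order. Hypothetical schema."""
    order_id: int
    customer: str
    amount: float
    order_date: date


def stage(raw_rows):
    """Staging layer: keep the raw extract exactly as it was received."""
    return list(raw_rows)


def integrate(staged_rows):
    """Integration layer: clean the staged rows and give them proper types."""
    return [
        Order(
            order_id=int(row["id"]),
            customer=row["customer"].strip().title(),
            amount=float(row["amount"]),
            order_date=date.fromisoformat(row["date"]),
        )
        for row in staged_rows
    ]


def revenue_by_customer(orders):
    """Access layer: a reporting view built on the integrated data."""
    totals = {}
    for order in orders:
        totals[order.customer] = totals.get(order.customer, 0.0) + order.amount
    return totals


# Made-up extract from an operational system: everything arrives as text.
raw_extract = [
    {"id": "1", "customer": " alice ", "amount": "120.50", "date": "2011-03-01"},
    {"id": "2", "customer": "BOB", "amount": "80", "date": "2011-03-02"},
    {"id": "3", "customer": "Alice", "amount": "30.25", "date": "2011-03-05"},
]

report = revenue_by_customer(integrate(stage(raw_extract)))
print(report)
```

Keeping the raw rows untouched in `stage` means the cleaning rules in `integrate` can be changed and re-run later without going back to the operational system.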
  3. 3. Conforming information

     Another important decision in designing a data warehouse is which data to conform and how to conform them. For example, one operational system feeding data into the data warehouse may use "M" and "F" to denote the sex of an employee while another operational system may use "Male" and "Female". Though this is a simple example, much of the work in implementing a data warehouse is devoted to making data with similar meaning consistent when they are stored in the data warehouse. Typically, extract, transform, load tools are used in this work.

     Master data management has the aim of conforming data that could be considered "dimensions".

     Normalized versus dimensional approach for storage of data

     There are two leading approaches to storing data in a data warehouse: the dimensional approach and the normalized approach. Supporters of the dimensional approach, referred to as “Kimballites”, follow Ralph Kimball's view that the data warehouse should be modeled using a dimensional model (star schema). Supporters of the normalized approach, also called the 3NF model, are referred to as “Inmonites” and follow Bill Inmon's view that the data warehouse should be modeled using an E-R model (normalized model).

     In a dimensional approach, transaction data are partitioned into either "facts", which are generally numeric transaction data, or "dimensions", which are the reference information that gives context to the facts. For example, a sales transaction can be broken up into facts such as the number of products ordered and the price paid for the products, and into dimensions such as order date, customer name, product number, order ship-to and bill-to locations, and salesperson responsible for receiving the order.

     A key advantage of a dimensional approach is that the data warehouse is easier for the user to understand and to use. Also, the retrieval of data from the data warehouse tends to be very fast. Dimensional structures are easy for business users to understand because the structure is divided into measurements (facts) and context (dimensions). Facts are related to the organization's business processes and operational systems, whereas the dimensions surrounding them contain context about the measurement.

     The main disadvantages of the dimensional approach are:
     1. In order to maintain the integrity of facts and dimensions, loading the data warehouse with data from different operational systems is complicated, and
     2. It is difficult to modify the data warehouse structure if the organization adopting the dimensional approach changes the way in which it does business.

     In the normalized approach, the data in the data warehouse are stored following, to a degree, database normalization rules. Tables are grouped together by subject areas that reflect general data categories (e.g., data on customers, products, finance, etc.). The normalized structure divides data into entities, which creates several tables in a relational database. When applied in large enterprises the result is dozens of tables that are linked together by a web of joins. Furthermore, each of the created entities is converted into a separate physical table when the database is implemented. The main advantage of this approach is that it is straightforward to add information into the database.
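To make the dimensional split and the conforming step on this slide concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table names (`dim_salesperson`, `dim_product`, `fact_sales`), the `SEX_CODES` mapping, and the sample rows are all assumptions invented for the illustration; real warehouses use far richer schemas and dedicated ETL tools.

```python
import sqlite3

# Hypothetical mapping that conforms the codes used by two source systems.
SEX_CODES = {"M": "Male", "F": "Female", "Male": "Male", "Female": "Female"}

# Made-up source rows: (salesperson, sex as coded by the source system,
# product number, quantity ordered, price paid).
source_rows = [
    ("Asha", "F", "P-100", 2, 40.0),
    ("Ravi", "Male", "P-100", 1, 20.0),
    ("Asha", "Female", "P-200", 3, 90.0),
]

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_salesperson (
        salesperson_key INTEGER PRIMARY KEY,
        name TEXT UNIQUE,
        sex TEXT
    );
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        product_number TEXT UNIQUE
    );
    CREATE TABLE fact_sales (
        salesperson_key INTEGER REFERENCES dim_salesperson,
        product_key INTEGER REFERENCES dim_product,
        quantity INTEGER,
        price_paid REAL
    );
    """
)

for name, sex, product, quantity, price in source_rows:
    # Dimensions hold the context; the sex code is conformed on the way in.
    conn.execute(
        "INSERT OR IGNORE INTO dim_salesperson (name, sex) VALUES (?, ?)",
        (name, SEX_CODES[sex]),
    )
    conn.execute(
        "INSERT OR IGNORE INTO dim_product (product_number) VALUES (?)",
        (product,),
    )
    # Facts hold the numeric measurements plus keys into the dimensions.
    conn.execute(
        """
        INSERT INTO fact_sales
        SELECT s.salesperson_key, p.product_key, ?, ?
        FROM dim_salesperson AS s, dim_product AS p
        WHERE s.name = ? AND p.product_number = ?
        """,
        (quantity, price, name, product),
    )

# A typical dimensional query: aggregate facts, grouped by a dimension attribute.
rows = conn.execute(
    """
    SELECT s.sex, SUM(f.quantity), SUM(f.price_paid)
    FROM fact_sales AS f
    JOIN dim_salesperson AS s USING (salesperson_key)
    GROUP BY s.sex
    ORDER BY s.sex
    """
).fetchall()
print(rows)
```

Because both "F" and "Female" are stored as "Female", the report groups Asha's two sales together even though they came from differently coded sources.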
  4. 4. A disadvantage of the normalized approach is that, because of the number of tables involved, it can be difficult for users both to:
     1. join data from different sources into meaningful information, and then
     2. access the information without a precise understanding of the sources of data and of the data structure of the data warehouse.

     It should be noted that both normalized and dimensional models can be represented in entity-relationship diagrams, as both contain joined relational tables. The difference between the two models is the degree of normalization.

     These approaches are not mutually exclusive, and there are other approaches. Dimensional approaches can involve normalizing data to a degree.

     Top-down versus bottom-up design methodologies

     Bottom-up design

     Ralph Kimball, a well-known author on data warehousing, is a proponent of an approach to data warehouse design which he describes as bottom-up.

     In the bottom-up approach, data marts are first created to provide reporting and analytical capabilities for specific business processes. It is important to note, though, that in the Kimball methodology the bottom-up process is the result of an initial business-oriented top-down analysis of the relevant business processes to be modelled.

     Data marts contain, primarily, dimensions and facts. Facts can contain atomic data and, if necessary, summarized data. A single data mart often models a specific business area such as "Sales" or "Production". These data marts can eventually be integrated to create a comprehensive data warehouse. The integration of data marts is managed through the implementation of what Kimball calls "a data warehouse bus architecture". The data warehouse bus architecture is primarily an implementation of "the bus", a collection of conformed dimensions, which are dimensions that are shared (in a specific way) between facts in two or more data marts.

     The integration of the data marts in the data warehouse is centered on the conformed dimensions (residing in "the bus") that define the possible integration "points" between data marts. The actual integration of two or more data marts is then done by a process known as "drill across". A drill-across works by grouping (summarizing) the data along the keys of the (shared) conformed dimensions of each fact participating in the drill across, followed by a join on the keys of these grouped (summarized) facts.

     Maintaining tight management over the data warehouse bus architecture is fundamental to maintaining the integrity of the data warehouse. The most important management task is making sure dimensions among data marts are consistent. In Kimball's words, this means that the dimensions "conform".
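The drill-across described on this slide (summarize each fact along the shared conformed dimension, then join the summaries on that key) can be sketched in a few lines. The two data marts, their column names, and the figures below are assumptions made up for the example.

```python
from collections import defaultdict

# Two hypothetical data marts. "product" is the conformed dimension key they
# share; "region" and "plant" are dimensions private to each mart.
sales_facts = [
    {"product": "P-100", "region": "North", "units_sold": 5},
    {"product": "P-100", "region": "South", "units_sold": 3},
    {"product": "P-200", "region": "North", "units_sold": 4},
]
production_facts = [
    {"product": "P-100", "plant": "A", "units_made": 6},
    {"product": "P-100", "plant": "B", "units_made": 4},
    {"product": "P-200", "plant": "A", "units_made": 4},
]


def summarize(facts, key, measure):
    """Step 1: group one fact table along the conformed dimension key."""
    totals = defaultdict(int)
    for row in facts:
        totals[row[key]] += row[measure]
    return dict(totals)


def drill_across(left, right):
    """Step 2: join the two summaries on the keys they have in common."""
    return {key: (left[key], right[key]) for key in sorted(left.keys() & right.keys())}


sold = summarize(sales_facts, "product", "units_sold")
made = summarize(production_facts, "product", "units_made")
combined = drill_across(sold, made)
print(combined)
```

Summarizing before joining matters: joining the raw fact rows directly would pair every sales row with every production row for the same product and inflate the totals.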
  5. 5. Top-down design

     Bill Inmon, one of the first authors on the subject of data warehousing, has defined a data warehouse as a centralized repository for the entire enterprise. Inmon is one of the leading proponents of the top-down approach to data warehouse design, in which the data warehouse is designed using a normalized enterprise data model. "Atomic" data, that is, data at the lowest level of detail, are stored in the data warehouse. Dimensional data marts containing data needed for specific business processes or specific departments are created from the data warehouse. In the Inmon vision the data warehouse is at the center of the "Corporate Information Factory" (CIF), which provides a logical framework for delivering business intelligence (BI) and business management capabilities.

     Inmon states that the data warehouse is:

     Subject-oriented: The data in the data warehouse are organized so that all the data elements relating to the same real-world event or object are linked together.

     Non-volatile: Data in the data warehouse are never over-written or deleted. Once committed, the data are static, read-only, and retained for future reporting.

     Integrated: The data warehouse contains data from most or all of an organization's operational systems, and these data are made consistent.

     Time-variant: Changes to the data in the data warehouse are tracked and recorded so that reports can show changes over time.

     The top-down design methodology generates highly consistent dimensional views of data across data marts, since all data marts are loaded from the centralized repository. Top-down design has also proven to be robust against business changes. Generating new dimensional data marts against the data stored in the data warehouse is a relatively simple task. The main disadvantage of the top-down methodology is that it represents a very large project with a very broad scope. The up-front cost of implementing a data warehouse using the top-down methodology is significant, and the time from the start of the project to the point at which end users experience initial benefits can be substantial. In addition, the top-down methodology can be inflexible and unresponsive to changing departmental needs during the implementation phases.

     Hybrid design

     Data warehouse (DW) solutions often resemble a hub-and-spoke architecture. Legacy systems feeding the DW/BI solution often include customer relationship management (CRM) and enterprise resource planning (ERP) solutions, generating large amounts of data. To consolidate these various data models and facilitate the extract, transform, load (ETL) process, DW solutions often make use of an operational data store (ODS). The information from the ODS is then parsed into the actual DW. To reduce data redundancy, larger systems will often store the data in a normalized way. Data marts for specific reports can then be built on top of the DW solution.
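Inmon's non-volatile property from this slide (committed data are never over-written, only added to) can be illustrated with an append-only history. The sketch below is a simplified assumption of how such a store might look; the class name `CustomerHistory`, its methods, and the sample customer are hypothetical.

```python
from datetime import date


class CustomerHistory:
    """Append-only store: new versions are added, old ones are never changed."""

    def __init__(self):
        # Each entry is (effective_date, customer_id, city).
        self._versions = []

    def record(self, effective, customer_id, city):
        """Commit a new version instead of updating the previous one."""
        self._versions.append((effective, customer_id, city))

    def as_of(self, customer_id, when):
        """Return the city that was current for the customer on a given date."""
        current = None
        for effective, cid, city in sorted(self._versions):
            if cid == customer_id and effective <= when:
                current = city
        return current


history = CustomerHistory()
history.record(date(2010, 1, 1), "C1", "Indore")
history.record(date(2011, 2, 1), "C1", "Bhopal")

print(history.as_of("C1", date(2010, 6, 1)))
print(history.as_of("C1", date(2011, 3, 1)))
```

Because the earlier row is kept, a report run for mid-2010 still sees the customer's old city, which an operational system that updates records in place could no longer provide.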
  6. 6. It is important to note that the DW database in a hybrid solution is kept in third normal form to eliminate data redundancy. A normalized relational database, however, is not efficient for business intelligence reports where dimensional modelling is prevalent. Small data marts can shop for data from the consolidated warehouse and use the filtered, specific data for the fact tables and dimensions required. The DW effectively provides a single source of information from which the data marts can read, creating a highly flexible solution from a BI point of view. The hybrid architecture allows a DW to be replaced with a master data management solution where operational, not static, information could reside.

     Evolution in organization use

     These terms refer to the level of sophistication of a data warehouse:

     Offline operational data warehouse: Data warehouses in this initial stage are developed by simply copying the data from an operational system to another server, where the processing load of reporting against the copied data does not impact the operational system's performance.

     Offline data warehouse: Data warehouses at this stage are updated from data in the operational systems on a regular basis, and the data warehouse data are stored in a data structure designed to facilitate reporting.

     Real-time data warehouse: Data warehouses at this stage are updated every time an operational system performs a transaction (e.g. an order, a delivery or a booking).

     Integrated data warehouse: These data warehouses assemble data from different areas of the business, so users can look up the information they need across other systems.

     Benefits

     Some of the benefits that a data warehouse provides are as follows:
     • A data warehouse provides a common data model for all data of interest regardless of the data's source. This makes it easier to report and analyze information than it would be if multiple data models were used to retrieve information such as sales invoices, order receipts, general ledger charges, etc.
     • Prior to loading data into the data warehouse, inconsistencies are identified and resolved. This greatly simplifies reporting and analysis.
     • Information in the data warehouse is under the control of data warehouse users so that, even if the source system data are purged over time, the information in the warehouse can be stored safely for extended periods of time.
     • Because they are separate from operational systems, data warehouses provide retrieval of data without slowing down operational systems.
  7. 7. • Data warehouses can work in conjunction with and, hence, enhance the value of operational business applications, notably customer relationship management (CRM) systems.
     • Data warehouses facilitate decision support system applications such as trend reports (e.g., the items with the most sales in a particular area within the last two years), exception reports, and reports that show actual performance versus goals.
     • Data warehouses can record historical information for data source tables that are not set up to save an update history.

     Sample applications

     Some of the applications data warehousing can be used for are:
     • Decision support
     • Trend analysis
     • Financial forecasting
     • Churn prediction for telecom subscribers, credit card users, etc.
     • Insurance fraud analysis
     • Call record analysis
     • Logistics and inventory management
     • Agriculture

     DATA MINING

     Data mining (also known as Knowledge Discovery in Databases, or KDD), a relatively young and interdisciplinary field of computer science, is the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management.

     With recent technical advances in processing power, storage capacity, and inter-connectivity of computer technology, data mining is seen as an increasingly important tool by modern business to transform unprecedented quantities of digital data into business intelligence, giving an informational advantage. It is currently used in a wide range of profiling practices, such as marketing, surveillance, fraud detection, and scientific discovery. The growing consensus that data mining can bring real value has led to an explosion in demand for novel data mining technologies.[4]

     The related terms data dredging, data fishing and data snooping refer to the use of data mining techniques to sample portions of the larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These techniques can, however, be used in the creation of new hypotheses to test against the larger data populations.
  8. 8. Process

     Pre-processing

     Before data mining algorithms can be used, a target data set must be assembled. As data mining can only uncover patterns already present in the data, the target data set must be large enough to contain these patterns while remaining concise enough to be mined in an acceptable timeframe. A common source for data is a data mart or data warehouse. Pre-processing is essential for analysing multivariate data sets before data mining.

     The target set is then cleaned. Data cleaning removes the observations with noise and missing data.

     Data mining

     Data mining commonly involves four classes of tasks:
     • Clustering: the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.
     • Classification: the task of generalizing known structure to apply to new data. For example, an email program might attempt to classify an email as legitimate or spam. Common algorithms include decision tree learning, nearest neighbor, naive Bayesian classification, neural networks and support vector machines.
     • Regression: attempts to find a function which models the data with the least error.
     • Association rule learning: searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.

     Results validation

     The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by the data mining algorithms are necessarily valid. It is common for the data mining algorithms to find patterns in the training set which are not present in the general data set; this is called overfitting. To overcome this, the evaluation uses a test set of data on which the data mining algorithm was not trained. The learnt patterns are applied to this test set and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish spam from legitimate emails would be trained on a training set of sample emails. Once trained, the learnt patterns would be applied to the test set of emails on which it had not been trained. The accuracy of these patterns can then be measured from how many emails they correctly classify. A number of statistical methods, such as ROC curves, may be used to evaluate the algorithm.

     If the learnt patterns do not meet the desired standards, then it is necessary to reevaluate and change the pre-processing and data mining. If the learnt patterns do meet the desired standards, then the final step is to interpret the learnt patterns and turn them into knowledge.
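The validation step above can be sketched with the spam example from this slide. Everything here is a toy assumption: the eight messages are invented, and the two "models" are deliberately crude (a memorizer that overfits by design, and a simple word-count rule that is not one of the algorithms named above). The point is only to show why accuracy is measured on a test set the model never saw.

```python
from collections import Counter

# Made-up training and test sets of (message, label) pairs.
training_set = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting at noon", "legitimate"),
    ("project status report", "legitimate"),
]
test_set = [
    ("win a cheap offer", "spam"),
    ("status of the meeting", "legitimate"),
    ("free money", "spam"),
    ("lunch at noon", "legitimate"),
]


def accuracy(predict, examples):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(1 for text, label in examples if predict(text) == label)
    return correct / len(examples)


def train_memorizer(examples):
    """Overfit on purpose: remember each training message verbatim."""
    table = dict(examples)
    return lambda text: table.get(text, "legitimate")


def train_word_counter(examples):
    """Learn which words appear more often in spam than in legitimate mail."""
    spam_words, legit_words = Counter(), Counter()
    for text, label in examples:
        target = spam_words if label == "spam" else legit_words
        target.update(text.split())

    def predict(text):
        score = sum(spam_words[w] - legit_words[w] for w in text.split())
        return "spam" if score > 0 else "legitimate"

    return predict


memorizer = train_memorizer(training_set)
word_counter = train_word_counter(training_set)

for name, model in (("memorizer", memorizer), ("word counter", word_counter)):
    print(name, accuracy(model, training_set), accuracy(model, test_set))
```

Both models look perfect on the training set; only the held-out test set reveals that the memorizer has learnt nothing that generalizes.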
  9. 9. Notable uses

     Games

     Since the early 1960s, with the availability of oracles for certain combinatorial games, also called tablebases (e.g. for 3x3-chess) with any beginning configuration, small-board dots-and-boxes, small-board-hex, and certain endgames in chess, dots-and-boxes, and hex, a new area for data mining has been opened up. This is the extraction of human-usable strategies from these oracles. Current pattern recognition approaches do not seem to have the high level of abstraction required to be applied successfully. Instead, extensive experimentation with the tablebases, combined with an intensive study of tablebase answers to well-designed problems and with knowledge of prior art, i.e. pre-tablebase knowledge, is used to yield insightful patterns. Berlekamp in dots-and-boxes etc. and John Nunn in chess endgames are notable examples of researchers doing this work, though they were not and are not involved in tablebase generation.

     Business

     Data mining in customer relationship management applications can contribute significantly to the bottom line. Rather than randomly contacting a prospect or customer through a call center or sending mail, a company can concentrate its efforts on prospects that are predicted to have a high likelihood of responding to an offer. More sophisticated methods may be used to optimise resources across campaigns so that one may predict which channel and which offer an individual is most likely to respond to, across all potential offers. Additionally, sophisticated applications could be used to automate the mailing. Once the results from data mining (potential prospect/customer and channel/offer) are determined, this "sophisticated application" can automatically send either an e-mail or regular mail. Finally, in cases where many people will take an action without an offer, uplift modeling can be used to determine which people will have the greatest increase in responding if given an offer. Data clustering can also be used to automatically discover the segments or groups within a customer data set.

     Businesses employing data mining may see a return on investment, but they also recognise that the number of predictive models can quickly become very large. Rather than one model to predict how many customers will churn, a business could build a separate model for each region and customer type. Then, instead of sending an offer to all people that are likely to churn, it may want to send offers only to a subset of those customers. Finally, it may also want to determine which customers are going to be profitable over a window of time and send the offers only to those that are likely to be profitable. In order to maintain this quantity of models, businesses need to manage model versions and move to automated data mining.

     Data mining can also be helpful to human-resources departments in identifying the characteristics of their most successful employees. Information obtained, such as universities attended by highly successful employees, can help HR focus recruiting efforts accordingly. Additionally, Strategic Enterprise Management applications help a company translate corporate-level goals, such as profit and margin share targets, into operational decisions, such as production plans and workforce levels.

     Another example of data mining, often called market basket analysis, relates to its use in retail sales. If a clothing store records the purchases of customers, a data-mining system could identify those customers who favour silk shirts over cotton ones. Although some relationships may be difficult to explain, taking advantage of them is easier. The example deals with association rules within transaction-based data. Not all data are transaction-based, and logical or inexact rules may also be present within a database. In a manufacturing application, an inexact rule may state that 73% of
  10. 10. products which have a specific defect or problem will develop a secondary problem within the next six months.

     Market basket analysis has also been used to identify the purchase patterns of the Alpha consumer. Alpha consumers are people who play a key role in connecting with the concept behind a product, then adopting that product, and finally validating it for the rest of society. Analyzing the data collected on this type of user has allowed companies to predict future buying trends and forecast supply demands.

     Data mining is a highly effective tool in the catalog marketing industry. Catalogers have a rich history of customer transactions on millions of customers dating back several years. Data mining tools can identify patterns among customers and help identify the customers most likely to respond to upcoming mailing campaigns.

     Related to an integrated-circuit production line, an example of data mining is described in the paper "Mining IC Test Data to Optimize VLSI Testing". The paper describes the application of data mining and decision analysis to the problem of die-level functional test. The experiments it reports demonstrate that a system mining historical die-test data can create a probabilistic model of patterns of die failure, which is then used to decide in real time which die to test next and when to stop testing. Based on experiments with historical test data, this system has been shown to have the potential to improve profits on mature IC products.

     Science and engineering

     In recent years, data mining has been widely used in areas of science and engineering such as bioinformatics, genetics, medicine, education and electrical power engineering.

     In the study of human genetics, an important goal is to understand the mapping relationship between the inter-individual variation in human DNA sequences and variability in disease susceptibility. In lay terms, it is to find out how the changes in an individual's DNA sequence affect the risk of developing common diseases such as cancer. This is very important in helping to improve the diagnosis, prevention and treatment of the diseases. The data mining technique that is used to perform this task is known as multifactor dimensionality reduction.

     In the area of electrical power engineering, data mining techniques have been widely used for condition monitoring of high-voltage electrical equipment. The purpose of condition monitoring is to obtain valuable information on the health status of the equipment's insulation. Data clustering techniques such as the self-organizing map (SOM) have been applied to the vibration monitoring and analysis of transformer on-load tap-changers (OLTCs). Using vibration monitoring, it can be observed that each tap change operation generates a signal that contains information about the condition of the tap changer contacts and the drive mechanisms. Different tap positions generate different signals. However, there was considerable variability amongst normal-condition signals for exactly the same tap position. SOM has been applied to detect abnormal conditions and to estimate the nature of the abnormalities.

     Data mining techniques have also been applied to dissolved gas analysis (DGA) on power transformers. DGA, as a diagnostic for power transformers, has been available for many years. Data mining techniques such as SOM have been applied to analyse data and to determine trends which are not obvious to the standard DGA ratio techniques such as the Duval Triangle.

     A further area of application for data mining in science and engineering is educational research, where data mining has been used to study the factors leading students to choose to engage in behaviors which reduce their learning and to understand the factors influencing university student retention. A similar example of the social application of data mining is its use in expertise finding
  11. 11. systems, whereby descriptors of human expertise are extracted, normalised and classified so as to facilitate the finding of experts, particularly in scientific and technical fields. In this way, data mining can facilitate institutional memory.

     Other examples of applying data mining techniques are the mining of biomedical data facilitated by domain ontologies, mining clinical trial data, and traffic analysis using SOM.

     In adverse drug reaction surveillance, the Uppsala Monitoring Centre has, since 1998, used data mining methods to routinely screen for reporting patterns indicative of emerging drug safety issues in the WHO global database of 4.6 million suspected adverse drug reaction incidents. Recently, similar methodology has been developed to mine large collections of electronic health records for temporal patterns associating drug prescriptions with medical diagnoses.

     Spatial data mining

     Spatial data mining is the application of data mining techniques to spatial data. Spatial data mining follows the same functions as data mining in general, with the end objective of finding patterns in geography. So far, data mining and Geographic Information Systems (GIS) have existed as two separate technologies, each with its own methods, traditions and approaches to visualization and data analysis. In particular, most contemporary GIS have only very basic spatial analysis functionality. The immense explosion in geographically referenced data occasioned by developments in IT, digital mapping, remote sensing, and the global diffusion of GIS emphasises the importance of developing data-driven inductive approaches to geographical analysis and modeling.

     Data mining, which is the partially automated search for hidden patterns in large databases, offers great potential benefits for applied GIS-based decision-making. Recently, the task of integrating these two technologies has become critical, especially as various public and private sector organisations possessing huge databases with thematic and geographically referenced data begin to realise the huge potential of the information hidden there. Among those organisations are:
     • offices requiring analysis or dissemination of geo-referenced statistical data
     • public health services searching for explanations of disease clusters
     • environmental agencies assessing the impact of changing land-use patterns on climate change
     • geo-marketing companies doing customer segmentation based on spatial location.

     Pattern mining

     "Pattern mining" is a data mining technique that involves finding existing patterns in data. In this context, patterns often means association rules. The original motivation for searching for association rules came from the desire to analyze supermarket transaction data, that is, to examine customer behaviour in terms of the purchased products. For example, an association rule "beer ⇒ potato chips (80%)" states that four out of five customers that bought beer also bought potato chips.

     In the context of pattern mining as a tool to identify terrorist activity, the National Research Council provides the following definition: "Pattern-based data mining looks for patterns (including anomalous data patterns) that might be associated with terrorist activity; these patterns might be regarded as small signals in a large ocean of noise."

     Pattern mining includes new areas such as Music Information Retrieval (MIR), where patterns seen in both the temporal and non-temporal domains are imported to classical knowledge discovery search techniques.
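The "beer ⇒ potato chips (80%)" rule from the pattern mining section can be checked by counting baskets. The sketch below uses ten invented transactions; the 80% figure is the rule's confidence (the share of beer baskets that also contain potato chips), and support is the share of all baskets that contain a given item set.

```python
# Ten made-up shopping baskets.
transactions = [
    {"beer", "potato chips"},
    {"beer", "potato chips", "bread"},
    {"beer", "potato chips", "milk"},
    {"beer", "potato chips", "salsa"},
    {"beer", "bread"},
    {"milk", "bread"},
    {"potato chips", "salsa"},
    {"milk"},
    {"bread", "salsa"},
    {"milk", "bread", "salsa"},
]


def count(itemset, baskets):
    """Number of baskets that contain every item in the item set."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket)


def support(itemset, baskets):
    """Share of all baskets that contain the item set."""
    return count(itemset, baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Share of baskets with the antecedent that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, baskets) / count(antecedent, baskets)


rule_support = support({"beer", "potato chips"}, transactions)
rule_confidence = confidence({"beer"}, {"potato chips"}, transactions)
print(rule_support, rule_confidence)
```

Here five baskets contain beer and four of those also contain potato chips, so the rule holds for four out of five beer buyers.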
  12. 12. Subject-based data mining

     "Subject-based data mining" is a data mining technique involving the search for associations between individuals in data. In the context of combatting terrorism, the National Research Council provides the following definition: "Subject-based data mining uses an initiating individual or other datum that is considered, based on other information, to be of high interest, and the goal is to determine what other persons or financial transactions or movements, etc., are related to that initiating datum." [35]

     Privacy concerns and ethics

     Some people believe that data mining itself is ethically neutral. It is important to note that the term data mining has no ethical implications. The term is often associated with the mining of information in relation to people's behavior. However, data mining is a statistical technique that is applied to a set of information, or a data set. Associating these data sets with people is an extreme narrowing of the types of data that are available in today's technological society. Examples could range from a set of crash test data for passenger vehicles to the performance of a group of stocks. These types of data sets make up a great proportion of the information available to be acted on by data mining techniques, and rarely have ethical concerns associated with them. However, the ways in which data mining can be used can raise questions regarding privacy, legality, and ethics. In particular, data mining government or commercial data sets for national security or law enforcement purposes, such as in the Total Information Awareness Program or in ADVISE, has raised privacy concerns.

     Data mining requires data preparation, which can uncover information or patterns that may compromise confidentiality and privacy obligations. A common way for this to occur is through data aggregation. Data aggregation is when the data are accrued, possibly from various sources, and put together so that they can be analyzed. This is not data mining per se, but a result of the preparation of data before and for the purposes of the analysis. The threat to an individual's privacy comes into play when the data, once compiled, enable the data miner, or anyone who has access to the newly compiled data set, to identify specific individuals, especially when the data were originally anonymous.

     It is recommended that an individual be made aware of the following before data are collected:
     • the purpose of the data collection and any data mining projects,
     • how the data will be used,
     • who will be able to mine the data and use them,
     • the security surrounding access to the data, and
     • how collected data can be updated.

     In the United States, privacy concerns have been somewhat addressed by Congress via the passage of regulatory controls such as the Health Insurance Portability and Accountability Act (HIPAA). HIPAA requires individuals to give "informed consent" regarding any information that they provide and its intended future uses by the facility receiving that information. According to an article in Biotech Business Week, "In practice, HIPAA may not offer any greater protection than the longstanding regulations in the research arena, says the AAHC. More importantly, the rule's goal of protection through informed consent is undermined by the complexity of consent forms that are required of patients and participants, which approach a level of incomprehensibility to average individuals." This underscores the necessity for data anonymity in data aggregation practices.

     One may additionally modify the data so that they are anonymous, so that individuals may not be readily identified. However, even de-identified data sets can contain enough information to identify individuals, as occurred when journalists were able to find several individuals based on a set of search histories that were inadvertently released by AOL.
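The point that de-identified data can still single out individuals can be illustrated with a rough uniqueness check. This borrows the idea behind k-anonymity, a technique not mentioned in the text above, and the records, field names, and choice of quasi-identifiers are all invented for the example.

```python
from collections import Counter

# Assumed quasi-identifiers: fields that are not names, but that can be
# combined with other data sources to narrow down who a record belongs to.
QUASI_IDENTIFIERS = ("postcode", "birth_year", "sex")

# Made-up "de-identified" records: names removed, other fields kept.
deidentified_records = [
    {"postcode": "452001", "birth_year": 1985, "sex": "F", "diagnosis": "asthma"},
    {"postcode": "452001", "birth_year": 1985, "sex": "F", "diagnosis": "diabetes"},
    {"postcode": "452001", "birth_year": 1990, "sex": "M", "diagnosis": "flu"},
    {"postcode": "452010", "birth_year": 1972, "sex": "M", "diagnosis": "asthma"},
]


def group_sizes(records, keys=QUASI_IDENTIFIERS):
    """Count how many records share each combination of quasi-identifiers."""
    return Counter(tuple(record[key] for key in keys) for record in records)


def unique_records(records, keys=QUASI_IDENTIFIERS):
    """Records whose quasi-identifier combination appears exactly once."""
    sizes = group_sizes(records, keys)
    return [r for r in records if sizes[tuple(r[key] for key in keys)] == 1]


at_risk = unique_records(deidentified_records)
smallest_group = min(group_sizes(deidentified_records).values())
print(len(at_risk), smallest_group)
```

Any record in `at_risk` is the only one with its combination of postcode, birth year and sex, so anyone who knows those three facts about a person from another source could link that person to the record.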