Understanding Data Mining



White paper

The effectiveness of our buying decisions is governed by how well we understand our problems, our assets and our objectives. Data mining, or “DM,” can help us make sense of our available information. Data mining refers to a variety of applications, tools and processes that almost always involve knowledge creation and/or extraction. But today’s technology landscape is immense and overcrowded with a mix of products, vendors and challenges. To truly understand your data and data mining itself, the best place to start is with business intelligence.

Business intelligence (BI)

As shown in Figure 1, business intelligence is an umbrella term that encapsulates many existing disciplines, including:

  • Data Warehousing (DW)
  • Query and Reporting (Q&R)
  • Online Analytical Processing (OLAP/MOLAP/ROLAP)
  • Statistical Analysis
  • Data Mining (DM)

Figure 1. Business Intelligence is the end result of the data mining processes

Data is a critical asset of any organization, and BI is the process of drawing the full value from this asset. The kinds of value that can be drawn from data include information and reporting, explanation and insight, and forecasting and prediction. Business intelligence provides organizations with a complete temporal spectrum, from historical to future perspective, as illustrated in Figure 2. Each of the disciplines plays a key role in providing these perspectives.

Figure 2. Various types of value available from business intelligence over time
Historical and current analytics

Query and reporting include both static and ad hoc options that provide details on historical and current values and quantities. These reports do not, however, provide projections or predictive capabilities.

OLAP is more advanced than Q&R because it provides for rapid drill-down and analysis of dimensional data, as well as trending and forecasting. One drawback to OLAP is that, typically, analysis is done on aggregated sets of data in the form of a “cube” (a summarized view of a dataset) and not on the data itself.

Predictive analytics

Predictive analytics fall into two main categories: statistical and data mining. Statistical approaches emerged from statistical theory, whereas data mining emerged from computational theory. Statistical analytics and data mining are both rooted in mathematics and share many of the same techniques. However, statistical approaches work to uncover the relationships and correlations among the variables in the predictive models, while data mining, often described as “machine learning,” focuses on the predictive capability of the model without regard to explanation. Data mining techniques do not require the user to know the relationships or patterns in the data, nor do they necessarily lead to simple explanations of the relationships.

Today’s data mining tools and applications incorporate many of the algorithms traditionally associated with a statistical approach. The tools leave the choice of the appropriate technique to the analyst.

Data Mining Process

The data mining process involves more than just the execution of a data mining algorithm to produce a predictive model. In fact, most of the time and effort in data mining is typically spent defining exactly what needs to be predicted, collecting the data, preparing the data, interpreting the results and deploying the model into an end user environment.
Because of the investment and skill sets necessary to produce working and trustworthy models, a detailed examination of business needs is required.

The high level process

The high level process consists of the following six steps:

  1. Define what is to be predicted
  2. Decide on the appropriate modeling type
  3. Prepare data sources
  4. Build the model
  5. Interpret
  6. Deploy

Each of these steps is described in more detail below.

Step 1. Define what is to be predicted

With this first step, you create a clear definition of the prediction and its associated requirements: the data that is needed, the data that is available, the reason(s) the prediction is needed, and the way the prediction will be used.
Step 2. Decide on the appropriate modeling type

For this step, you choose a modeling type, which can be one of four: Classification, Clustering/Segmentation, Regression or Forecasting/Trending.

Step 3. Prepare data sources

Data preparation, also known as extract, transform and load (ETL), is perhaps the most time-consuming of all the steps. Throughout the data mining exercise, changes are made to the data to keep the whole process moving. Later, when analyzing the output, it will be necessary to know exactly what data went into the DM process, so it is advisable to archive the data at various stages to ensure that results can be attributed to their inputs.

Preparing the data sources also involves pulling data from resident systems (it may be necessary to integrate data housed in multiple systems), transforming the data to a format appropriate for the data mining platform, and identifying the data variables necessary for the data mining effort.

Preparing the data also involves data cleanup, in order to avoid “garbage in, garbage out.” Poor data quality typically leads to invalid predictive results, often without the end user being aware of it. Data quality is a requirement, not a nice-to-have. Cleanup typically involves standardization and/or normalization of the data. Data hygiene routines are often provided with the data mining application.

Step 4. Build the model

Two of the industry-accepted data mining processes are SEMMA (the SAS Institute approach) and CRISP-DM (the Cross Industry Standard Process for Data Mining). In either case, there is a methodical approach to ensure robust model development and deployment. SEMMA stands for Sample, Explore, Modify, Model and Assess.

Sample. Most analytics techniques are built on the sampling premise: representative and statistically valid sampling is required to produce valid results. Some improvements in DM applications and platforms now permit the manipulation of entire datasets with little to no performance impact, which can make the sampling step unnecessary.
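The standardization and sampling ideas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the routine of any particular DM platform: the `records` data and the function names `standardize` and `split_sample` are hypothetical, standardization is taken to mean rescaling to zero mean and unit standard deviation, and sampling is taken to mean a simple random split into a modeling set and a held-out test set.

```python
import random
import statistics


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / stdev for v in values]


def split_sample(rows, test_fraction=0.25, seed=42):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical customer records: (customer_id, annual_spend)
spend_values = [120.0, 80.0, 100.0, 140.0, 60.0, 100.0, 90.0, 110.0]
records = list(enumerate(spend_values))

scaled_spend = standardize(spend_values)
train, test = split_sample(records)
```

A fixed seed keeps the split repeatable, which supports the earlier point about knowing exactly what data went into the process. A simple random split may not be representative when one outcome is rare; a stratified split is a common alternative in that case.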
After sampling, the SEMMA stages continue with Explore, Modify, Model and Assess.

Explore. Perform an exploratory data analysis, looking at:

  • Central tendencies
  • Population characteristics
  • Dispersion and distribution
  • Frequencies
  • Outliers and anomalies

Modify. This stage requires more ETL. Now that outliers, data gaps, skewed variables and data reduction opportunities have been identified, they need to be modified and/or recoded in preparation for the DM platform.

Model. There are a number of modeling techniques. Some of the more popular are neural networks, decision trees, linear and logistic regression, discriminant analysis and rule-based methods.

Assess. This is the process of validating the accuracy and recall of the model. Typically, this is performed against a test dataset that was not included in the model creation.

Step 5. Interpret

In this step, a subject matter expert (SME) interprets the predictions of the model and translates the results into a form appropriate for deployment to an end user. Often this becomes an iterative process, with the SME providing valuable insight at this stage that can be folded back into the data. Step 5 is often revisited multiple times, but the outcome is typically a superior and trusted prediction.

Step 6. Deploy

This is the process of making the model available to the end user. Without the deployment of a model, the results are only available to the model developer. The model, in combination with a scoring engine, produces the prediction for a given dataset. This process depends on the chosen DW, application infrastructure and DM platform. Most advanced DM platforms can produce their models in the form of C++, Java or PMML for deployment into a production DW, middle-tier infrastructure, or a combination of these. (PMML stands for Predictive Model Markup Language. It is an XML-based specification that contains a Data Dictionary, Mining Schema, Transformation Dictionary and Model Information. The biggest advantage of PMML deployment is that the data does not need to be removed from the DW to score.)

Successful Data Mining

Businesses and organizations must come to grips with certain undeniable requirements for success in data mining. Possessing sophisticated data mining tools and elaborate data warehousing systems does not alone ensure successful data mining. There must be a focus on other key factors, including the data itself, data experts and skillful analysts.

Data quality

As referenced numerous times already, data is at the heart of data mining. Businesses and organizations that attempt to apply DM techniques to data without first understanding their data run the risk of being seriously misguided.

When analytics illustrate trends and correlations or provide predictions based on models, businesses and organizations are better equipped to take action. At some point, leaders have to trust the direction and/or advice provided by data modeling. Invariably someone asks about the data used to arrive at the decision, and at this point, it is too late to discover that there are issues with the data (for example: it wasn’t current, a quarter’s worth of stats was left out, outliers weren’t eliminated, the database query used to pull the data didn’t join all the tables as expected, and so on). Data quality and understanding have to remain a major focus during data mining.

Subject matter expert

There are silos and pockets of staff who know critical details about their data, recall changes in systems and understand how data has evolved in their business or organization. And none of this is written down.
Typically, there is enough existing data for several subject matter experts to possess a vast amount of knowledge and insight about the nature of the data, its frequency of update or change, its gaping holes and more. The SMEs need to be involved early in data mining discussions to ensure assumptions do not derail efforts down the road.

Subject matter experts also play a critical role in interpreting model results. The SME provides the context and business perspective that is required to ensure user acceptance and a desirable application of the modeled results.

Skilled analysts

Data mining exercises are highly dependent upon the skill and experience of the analyst. The encapsulation of data mining complexities in tools has spread the deployment of predictive analytics to users who may not be experts. Experts have sufficient knowledge and experience to know when to apply the ladder of powers (a sequence of power transformations used to re-express skewed variables) and how to avoid the pitfalls of collinearity and overtraining. A good data mining platform and tool alone are not enough to ensure that a valid predictive model is constructed.
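One routine check a skilled analyst might run for overtraining is to compare model quality on the data used to build the model against a held-out test dataset. The sketch below is a minimal illustration of that idea, not a prescribed procedure: the labels, the predictions and the function names `accuracy` and `recall` are hypothetical, and the model that produced the predictions is assumed to exist elsewhere.

```python
def accuracy(actual, predicted):
    """Fraction of cases where the prediction matches the actual label."""
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual)


def recall(actual, predicted, positive=1):
    """Fraction of actual positive cases that the model identified."""
    positives = [(a, p) for a, p in zip(actual, predicted) if a == positive]
    if not positives:
        return 0.0
    return sum(1 for _, p in positives if p == positive) / len(positives)


# Hypothetical labels and model outputs (1 = event occurred, 0 = it did not).
train_actual = [1, 0, 1, 1, 0, 0, 1, 0]
train_predicted = [1, 0, 1, 1, 0, 0, 1, 0]  # perfect on the training data
test_actual = [1, 0, 1, 0, 1, 0, 0, 1]
test_predicted = [1, 1, 0, 0, 0, 0, 1, 1]   # much weaker on unseen data

train_acc = accuracy(train_actual, train_predicted)
test_acc = accuracy(test_actual, test_predicted)
test_rec = recall(test_actual, test_predicted)

# A large gap between training and test accuracy is a common warning sign
# that the model has memorized the training data rather than generalized.
gap = train_acc - test_acc
```

Here the model scores perfectly on the training data but only 0.5 on accuracy and recall for the test data, which is the kind of gap that should prompt a closer look before deployment.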
Data Mining Vendors and Tools

There are two main categories of data mining platforms: stand-alone platforms, and integrated (in-database) platforms such as those offered by data warehouse vendors.

Stand-alone vendors

Stand-alone data mining platforms are those that are not bundled with a data warehouse; in other words, they are data warehouse agnostic. Model creation utilizes either extracted data or data that is queried at scoring time from the DW. Recent technical advances have the data warehouse and stand-alone vendors working together to provide in-database scoring solutions, utilizing models developed from stand-alone applications.

The leading stand-alone data mining platform vendors are SAS Institute and SPSS. All other vendors are quite small in comparison.

Data warehouse vendors

Integrated DM platforms are bundled with a data warehouse. The data is mined in place in the data warehouse, utilizing the data warehouse hardware for the data mining computations. All leading data warehouse providers offer a data mining solution: Oracle, IBM, Microsoft and Teradata.

Market research firm IDC defines the market for data mining and statistical tools as Advanced Analytics Software. Table 1 illustrates just how much stand-alone advanced analytics software vendors dominate the market.

Vendor             2005 Market Share (%)
SAS Institute      28.3
SPSS               12.6
Visual Numerics     3.2
Teradata            1.7
Oracle              1.7
IBM                 1.4
Insightful          1.3
Microsoft           1.1

Table 1. DM solution vendors and their market share

Relative to DM, a recent trend of the large DW vendors has been to acquire and integrate existing stand-alone DM platform vendors into their product offerings. In some cases, the DM platform, or some form of it, is then presented as part of the base DW offering, such as Microsoft SQL Server 2005 Analysis Services (SSAS).
The advantages of a DM platform from a DW provider include:

  • a single integrated solution with the DW
  • data can be mined in place
  • the DW vendor often offers a complete suite of BI solutions (Microsoft is number three behind Business Objects and Cognos)¹

The disadvantages of an integrated solution include:

  • single vendor reliance
  • DW hardware resources are typically utilized for computationally expensive mining activities
  • current product offerings are not as extensive as those of market-leading stand-alone vendors

¹ IDC, Worldwide Business Intelligence Tools 2005 Vendor Shares, Doc #202603, July 2006 (used with permission)
Qbase background

Qbase thrives on creating the most inventive approach to any database or analytics challenge. Using the most advanced technology, Qbase delivers data management and analytics solutions to help make better decisions faster. The company has a suite of tools and processes that provide standardization, extraction, cleansing, import, export, enhancement, integration, segmentation and a host of analytical models that transform your data into usable information. Our innovative technology works smarter and faster to drive better results at lower cost.

Clients include healthcare providers, hospital associations, colleges and universities, nonprofit organizations, the US Air Force and large defense contractors.

For more information, please visit www.qbase.us

Copyright Qbase © 2010. Printed on recycled paper, 01/10.
2619 Commons Boulevard, Dayton, Ohio 45431
TEL: 888 458 0345
www.qbase.us