Business Intelligence in SQL Server 2008 and Data Mining.
Ing. Eduardo Castro Martinez, PhD
Microsoft SQL Server MVP
http://ecastrom.blogspot.com
http://comunidadwindows.org
1. Data Mining in SQL Server 2008 Ing Eduardo Castro GrupoAsesor en Informática ecastro@grupoasesor.net
2. Eduardo Castro ecastro@grupoasesor.net
  - MCITP Server Administrator
  - MCTS Windows Server 2008 Active Directory
  - MCTS Windows Server 2008 Network Infrastructure
  - MCTS Windows Server 2008 Applications Infrastructure
  - MCITP Enterprise Support
  - MCTS Windows Vista
  - MCITP Database Developer
  - MCITP Database Administrator
  - MCTS SQL Server
  - MCITP Exchange Server 2007
  - MCTS Office PerformancePoint Server
  - MCTS Team Foundation Server
  - MCPD Enterprise Application Developer
  - MCTS .Net Framework 2.0: Distributed Applications
  - MCT 2008
  - International Association of Software Architects Chapter Leader IEEE Communications Society Board of Directors European Datawarehouse Research
23. Understanding Data Mining Structure Improvements
  - Data Partitioning for Training and Testing
  - Mining Model Column Aliases
  - Data Mining Filters
  - Drill Through to Mining Structure Data
  - Cross-Validation of a Mining Model
24. Data Partitioning for Training and Testing
  - Specify as percentage or maximum number of cases
  - Smaller value is used if both parameters specified
  - Data is divided randomly between training and testing
  - HoldoutSeed property enables consistent partitions across structures
25. Data Partitioning with DMX
  - Create a structure with partitioning with the HOLDOUT keyword
  - Query the structure to review partitions
26. Mining Model Column Aliases
  - Assign a column alias to reuse a column in a structure
  - Column content can be clarified
  - Column can be more easily referenced in DMX
  - Continuous and discretized versions of the same column can be used in separate models in the same structure
27. Data Mining Filters
  - Specify a condition to apply to mining structure columns
  - Filter creates subsets of training and testing data for a model
  - Multiple conditions can be linked with AND/OR operators
  - Conditions for continuous values use >, >=, <, <= operators
  - Conditions for discrete values use =, !=, or IS NULL operators
  - Conditions on nested tables can use EXISTS keyword and subquery
29. Drill Through to Mining Structure Data
  - Add columns to the mining structure, but not to models
  - Eliminates unnecessary data from model and improves processing time
  - Supports drill through from mining model viewer or DMX for visibility into results
30. Cross-Validation of a Mining Model
  Purpose
  - Validate the accuracy of a single model
  - Compare models within the same mining structure
  Process
  - Split mining structure into partitions of equal size
  - Iteratively build models on all partitions excluding one partition such that all partitions are excluded once
  - Measure accuracy of each model using the excluded partition
  - Analyze results
31. Cross-Validation Parameters
  Fold Count
  - Number of partitions to use
  - Minimum 2, Maximum 256
  - Maximum 10 for session mining structure
  Max Cases
  - Total number of cases to include in cross-validation
  - Cases divided across folds
  - Value of 0 specifies all cases
  Target Attribute
  - Predictable column
  Target State
  - Target value for target attribute
  - Value of null specifies all states are to be tested
  Target Threshold
  - Value between 0 and 1 for prediction probability above which a predicted state is considered correct
  - Value of null specifies most probable prediction is considered correct
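The parameter limits on this slide can be sketched as a small check. This is only an illustration in Python, not the Analysis Services API: the function names `validate_parameters` and `cases_per_fold` are hypothetical, the threshold bounds are treated as inclusive by assumption, and the even split of cases across folds is an assumption about how "cases divided across folds" works.

```python
def validate_parameters(fold_count, max_cases, target_threshold=None,
                        session_structure=False):
    """Check the limits listed on the slide (hypothetical helper)."""
    # Fold Count: minimum 2, maximum 256, or 10 for a session mining structure.
    upper = 10 if session_structure else 256
    if not 2 <= fold_count <= upper:
        raise ValueError(f"fold_count must be between 2 and {upper}")
    # Max Cases: 0 means "use all cases"; negative values make no sense.
    if max_cases < 0:
        raise ValueError("max_cases must be 0 (all cases) or positive")
    # Target Threshold: None (null) or a probability; inclusive bounds assumed.
    if target_threshold is not None and not 0 <= target_threshold <= 1:
        raise ValueError("target_threshold must be between 0 and 1")


def cases_per_fold(total_cases, fold_count, max_cases=0):
    """Divide the cases used for cross-validation across the folds."""
    used = total_cases if max_cases == 0 else min(total_cases, max_cases)
    base, extra = divmod(used, fold_count)
    # Assumption: any remainder is spread one case at a time over the first folds.
    return [base + (1 if i < extra else 0) for i in range(fold_count)]
```

For example, `cases_per_fold(1000, 4, max_cases=100)` uses only 100 of the 1,000 cases and places 25 in each fold.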
35. Using the New Time Series Algorithm
  - Better Time Series Support
  - Time Series Algorithm Parameters
36. Better Time Series Support
  ARTxp algorithm
  - Still included in Microsoft Time Series algorithm
  - Best for prediction of next likely value in a series
  ARIMA algorithm
  - Added to Microsoft Time Series algorithm
  - Best for long-term predictions
  The new Microsoft Time Series algorithm
  - Trains one model using ARTxp and second model using ARIMA
  - Blends the results to return best prediction
38. Resources
  - Model Filter Syntax and Examples: technet.microsoft.com/en-us/library/bb895186(SQL.100).aspx
  - Cross-Validation: msdn2.microsoft.com/en-us/library/bb895174(SQL.100).aspx
  - SQL Server Data Mining: www.sqlserverdatamining.com
  - Jamie MacLennan’s blog: blogs.msdn.com/jamiemac/default.aspx
Data Mining Office Add-ins were introduced with SQL Server 2005, and a new version is available for SQL Server 2008 to take advantage of the improvements made to Analysis Services data mining. In this module, we’ll review how to use the Data Mining Add-ins, and then examine the changes made to mining structures as well as the new Time Series algorithm.
Data Mining Add-ins for Office allow you to perform a variety of data mining tasks. You can prepare data by applying data cleansing, and you can partition the data into training and test sets. Some of the add-in tools are focused on exploring your data, while other tools are built specifically for prediction purposes. The add-ins also include functionality for testing and validating each model. Point out that the add-ins are also useful as a client viewer for data mining models developed on the server.
This slide shows the data preparation tasks: Explore Data (to find anomalies), Clean Data (to handle outliers or erroneous data), and Partition Data (to separate the data into training and test sets). In the background is a view used to consolidate information from several tables. Transformations have been applied to enforce business rules. This logical table is then used as the source for data mining activities, whether using the add-ins or using BI Development Studio.
This slide identifies the table analysis tools that are exploration-based data mining tools and identifies the data mining algorithm associated with the tool.
This slide identifies the data modeling tools that are exploration-based data mining tools and identifies the data mining algorithm associated with the tool.
Model viewers are available not only for mining models created by using the add-in, but also for mining models created on the server.
This slide shows the predictive tools and shows the related algorithm.
Prediction tools are also available in the Data Modeling ribbon of Excel. Here you see the algorithm associated with these predictive tools.
The Data Mining add-in also includes model testing and validation tools, such as an accuracy chart, a classification matrix, and a profit chart. Cross-validation is also new to Analysis Services data mining and will be discussed in more detail later in this module.
In this section, we’ll review the improvements for mining structures in SSAS 2008. Specifically, we’ll look at setting up data partitions for training and testing data, how to use aliases with mining model columns, how to apply filters to data associated with a mining model, how to drill through to details when studying data mining results, and how to use the cross-validation report to assess the accuracy of a model or to compare multiple models to find the best model.
To create training and testing sets using random data for SSAS 2005, best practice was to use the Random Sample transformation in SSIS 2005. However, the package design was particularly cumbersome for structures with nested tables. In SSAS 2008, the process to generate random data sets for training and testing is built in.
You can specify parameters for partitioning data into training and testing sets:
  - In the Data Mining Wizard
  - In the Properties pane of the mining structure
Analysis Services uses a random sampling algorithm to assign data to either the training or the testing data set.
If you provide both a percentage and a maximum number of rows, the smaller number prevails. For example, you can specify 30% of the entire data set, not to exceed 1,000 rows, so that the testing set stops growing once the data source passes that size.
When using the same data source view for multiple mining structures, you might want to keep the same partitioning strategy for each mining structure. Set the HoldoutSeed property to the same value in each structure to yield comparable results in the training and testing data sets.
You can also define partitioning using DMX, AMO, or XML DDL.
Point out that partitioning is not available for a model using the Time Series algorithm.
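The "smaller number prevails" rule and the effect of a shared seed can be sketched in a few lines of Python. This is an illustration of the idea only, not how Analysis Services implements its sampling; the function names `holdout_size` and `partition` and their parameters are hypothetical stand-ins for the holdout percentage, the holdout maximum, and the HoldoutSeed property.

```python
import random


def holdout_size(total_cases, max_percent=None, max_cases=None):
    """Number of cases reserved for testing; the smaller limit wins."""
    limits = []
    if max_percent:
        limits.append(total_cases * max_percent // 100)
    if max_cases:
        limits.append(max_cases)
    return min(limits) if limits else 0


def partition(case_ids, max_percent=None, max_cases=None, seed=None):
    """Randomly split case ids into (training, testing) lists."""
    rng = random.Random(seed)  # same seed -> same shuffle -> same partitions
    shuffled = list(case_ids)
    rng.shuffle(shuffled)
    n_test = holdout_size(len(shuffled), max_percent, max_cases)
    return shuffled[n_test:], shuffled[:n_test]
```

With 30% and a cap of 1,000, a source of 2,000 cases holds out 600 (the percentage is smaller), while a source of 10,000 cases holds out 1,000 (the cap is smaller). Calling `partition` twice with the same seed on the same cases returns identical training and testing sets.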
For those who prefer to use DMX to create mining structures instead of the user interface, DMX now supports partitioning when the mining structure is created. Point out that HOLDOUT cannot be used with ALTER MINING STRUCTURE.
The process to train the model, using INSERT INTO MINING STRUCTURE, is unchanged. The query executes and the data is randomly sampled. A holdout store is created for each partition of the mining structure. In SSAS 2008, you can now query the structure to view the contents of the training and testing data sets.
In SSAS 2005, you could change the name of a mining model column in Business Intelligence Development Studio, but not in DMX. One reason you might want to alias a column is when you want to use the same column with different algorithms, but one algorithm supports continuous columns and the other does not. You can add a column to the mining structure more than once and set the Content property to a different value for each version of the column. Ignore the column in the model where the content type is unsupported, and include it as an input column in models supporting that content type. By using an alias, you can use the same NATURAL PREDICTION JOIN for the models in the same mining structure because input columns are bound by name to the model column.
Instead of creating separate data source views for your mining structure, you can create separate filtered models. Each model contains the same training and testing data, which allows you to compare model results.
Why create filtered models?
  - Achieve better overall accuracy by eliminating strong patterns of one attribute value (e.g. North America versus Pacific).
  - Compare patterns in isolated subsets of data.
You can create filters:
  - In the Model Filter dialog box
  - In the Properties pane of the mining model
In the case of discretized values, the bucket containing the specified value is selected. Example: Age = 23 returns the bucket containing ages 20-25.
An example of a filter expression for a case table and a nested table: Gender = 'M' AND EXISTS (SELECT * FROM Products WHERE Model = 'Water Bottle')
Point out that NOT EXISTS is also valid.
Mention the URL on the Resources slide for more information about filter syntax.
You must process the mining structure to see the filter applied to the model.
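To make the filter semantics concrete, here is a rough Python analogue of the example expression and of the bucket rule for discretized columns. It is a sketch under assumptions, not DMX: the sample cases, the `AGE_BUCKETS` boundaries, and the helper names are invented for illustration, with each case carrying its nested Products table as a list of rows.

```python
# Hypothetical discretization of an Age column into inclusive (low, high) buckets.
AGE_BUCKETS = [(15, 19), (20, 25), (26, 30)]


def bucket_of(value, buckets):
    """Return the bucket containing value; a filter on a discretized
    column selects this whole bucket rather than the single value."""
    for low, high in buckets:
        if low <= value <= high:
            return (low, high)
    return None


def matches_filter(case):
    """Analogue of: Gender = 'M' AND EXISTS
    (SELECT * FROM Products WHERE Model = 'Water Bottle')."""
    return case["Gender"] == "M" and any(
        row["Model"] == "Water Bottle" for row in case["Products"]
    )


# Invented sample cases: one case row plus its nested Products rows.
cases = [
    {"Id": 1, "Gender": "M", "Products": [{"Model": "Water Bottle"}]},
    {"Id": 2, "Gender": "F", "Products": [{"Model": "Water Bottle"}]},
    {"Id": 3, "Gender": "M", "Products": [{"Model": "Helmet"}]},
]
filtered_ids = [c["Id"] for c in cases if matches_filter(c)]
```

Only the case that satisfies both the case-table condition and the nested-table condition survives the filter. Replacing `any` with `not any` gives the NOT EXISTS form.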
Mention that using drillthrough in a filtered model returns all cases matching the filter, whether used for training or testing.
As in SSAS 2005, the following algorithms do not support drill through:
  - Naïve Bayes
  - Neural Network
  - Logistic Regression
The Time Series algorithm supports drill through in a DMX query only; drill through is not supported in Business Intelligence Development Studio.
Using parameters you specify, cross-validation automatically creates partitions of the data set of approximately equal size. For each partition, a mining model is created for the entire data set with one of the partitions removed, and then tested for accuracy using the partition that was excluded. If accuracy varies only slightly from one partition to the next, then the model generalizes well. If there is too much variation, then the model is not useful.
Point out that cross-validation cannot be used with models built using the Time Series or Sequence Clustering algorithms.
You can use the Cross Validation Report in the Mining Accuracy Chart of Business Intelligence Development Studio, or use Analysis Services stored procedures to create an ad hoc cross-validation report in SQL Server Management Studio.
More folds result in longer processing time.
This slide and the next outline the types of tests and their respective measures that are found on the cross-validation report. Different models will use different test types for this report. Point out the report can be generated in Business Intelligence Development Studio, which will be shown in the demonstration, or by calling an Analysis Services stored procedure.
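The cross-validation loop described here can be sketched generically in Python. This is a simplified illustration, not the Analysis Services procedure: cases are assigned to folds by striding rather than by random sampling, and the "model" is a trivial majority-class predictor chosen only so the example is self-contained. All function names are hypothetical.

```python
from collections import Counter


def make_folds(cases, fold_count):
    """Split cases into fold_count partitions of approximately equal size."""
    if fold_count < 2:
        raise ValueError("cross-validation needs at least 2 folds")
    return [cases[i::fold_count] for i in range(fold_count)]


def cross_validate(cases, fold_count, train, accuracy):
    """Train on all folds but one, test on the excluded fold, repeat."""
    folds = make_folds(cases, fold_count)
    scores = []
    for i, held_out in enumerate(folds):
        training = [c for j, fold in enumerate(folds) if j != i for c in fold]
        model = train(training)
        scores.append(accuracy(model, held_out))
    return scores


def train_majority(training):
    """Toy model: always predict the most common label seen in training."""
    return Counter(label for _, label in training).most_common(1)[0][0]


def accuracy_majority(model, held_out):
    """Fraction of held-out cases whose label equals the toy prediction."""
    return sum(1 for _, label in held_out if label == model) / len(held_out)


# Invented data: (case id, label) pairs.
cases = [(i, "no" if i % 3 == 0 else "yes") for i in range(12)]
scores = cross_validate(cases, 4, train_majority, accuracy_majority)
spread = max(scores) - min(scores)  # small spread suggests the model generalizes
```

Each fold is excluded exactly once, so `scores` has one accuracy value per fold. Comparing the spread of those values is the "analyze results" step: a narrow spread suggests the model generalizes, a wide one that it does not.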
Data mining in SSAS 2008 was also improved by modifying the Time Series algorithm. In this section, we’ll review how the algorithm was improved and we’ll review the algorithm parameters for the Time Series algorithm.
In SSAS 2005, the ARTxp Time Series prediction algorithm (autoregressive tree model for multiple prior unknown states), built by Microsoft Research, was introduced. The purpose of this algorithm was to tackle a difficult business problem: how to accurately predict the next step in a series. It was less reliable for predicting 10 steps or further out.
ARIMA (autoregressive integrated moving average) is a very common time series algorithm that is well understood by seasoned data miners. It provides good predictions when projecting beyond the next 10 steps. In SSAS 2008, the Microsoft Time Series algorithm blends the results of the two algorithms to leverage their short-term and long-term capabilities.
In Standard Edition, you can configure your model to use one or the other algorithm, or both (which is the default). In Enterprise Edition, you can apply custom weighting to get the best prediction over a variable time span.
The FORECAST_METHOD default value is MIXED. You can change this to ARIMA or ARTXP to use a single algorithm exclusively.
The PREDICTION_SMOOTHING parameter affects the weighting of the ARTxp and ARIMA algorithms when MIXED mode is used. A value closer to 0 weights in favor of ARTxp, while a value closer to 1 weights in favor of ARIMA. For example, a value of 0.8 weights the blend toward ARIMA, whereas a value of 0.2 weights it toward ARTxp.
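One way to picture the effect of the smoothing value is a plain weighted average of the two forecasts. The Python sketch below is a deliberate simplification for intuition only: it assumes a linear blend with a fixed weight at every step, which is not claimed to be the formula Analysis Services actually uses, and the function name `blend` and its default value are hypothetical.

```python
def blend(artxp_forecast, arima_forecast, prediction_smoothing=0.5):
    """Linear blend of two forecasts (simplifying assumption).

    0 returns the ARTxp forecast unchanged, 1 returns the ARIMA forecast
    unchanged, and values in between mix the two.
    """
    if not 0 <= prediction_smoothing <= 1:
        raise ValueError("prediction_smoothing must be between 0 and 1")
    w = prediction_smoothing
    return [(1 - w) * a + w * b
            for a, b in zip(artxp_forecast, arima_forecast)]
```

Under this simplified model, a smoothing value of 0.8 moves each blended step four fifths of the way from the ARTxp value toward the ARIMA value.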