Data Mining



DATA MINING WITH SQL Server Analysis services and Neural Network

Published in: Technology

Data Mining

  1. Name: KRIENGSAK CHANINCHOMPOONUT. Date: December 10, 2010
  2. As technology spreads into virtually every area of data mining research, good decision making has become a key to successful organizational strategy. Data mining gives you access to the information you need to make intelligent decisions about difficult business problems. It identifies rules and patterns in data, so that you can determine why things happen and predict what will happen in the future. The top-down technique can be used when the data follows a function that can be calculated with an equation. In real-world scenarios, however, complex data does not always give an accurate outcome this way, because many cases cannot be solved with a mathematical formula that tries to map unknown factors into an algorithm. The bottom-up technique offers another solution, and the results from the two directions (top-down and bottom-up) can be used to cross-validate each other.
  3. Top-down technique and bottom-up technique. With the top-down technique, the next numbers of the example dataset are likely to be 0, 4, 7 and so on, because we are able to map the known factors into an equation. The dataset at the bottom is different: its unknown factors have to be learned from the bottom up, because the data shows no linear proportion that an equation can solve. Instead, the points spread over the graph with no clear direction. If we still use an equation on this dataset, we will hardly ever detect a pattern or relationship. That is why the bottom-up technique is the more efficient way here: it learns the data and recognizes similar patterns when they appear again in the dataset.
  4. To answer various types of business questions, data mining helps you find patterns and relationships in data that are not apparent to the human eye. It analyzes datasets with mathematical algorithms such as decision trees, segmentation, clustering, association and time series through Microsoft SQL Server technologies, then confirms the discovered patterns so that predictions can be based on patterns in the historical data. The valuable information found can be used in applications such as financial applications, marketing and sales forecasting, CRM and ERP. The main topic of this project is using the database as the foundation for an appropriate model and algorithms, based on patterns recognized or detected in the historical data.
  5. The following development tools are used in this project.
     Application
       • Microsoft SQL Database Server (MSSQL)
       • Microsoft SQL Server Analysis Services (SSAS)
       • Microsoft SQL Server Integration Services (SSIS)
       • Microsoft Visual Studio C#
       • Microsoft Decision Tree Algorithm
       • Microsoft Naïve-Bayes Algorithm
       • Neural Network Algorithm
     Hardware
       • Server running the SQL Database engine and Analysis Services
       • PC for gathering the data sources daily and supplying them to MSSQL
       • Server running SSIS for daily updates of the SSAS server
       • PC for C# coding, database, SSAS and data mining design
  6. There are 5 phases to implement in this project.
       • Phase I: Identify the business problems
       • Phase II: Data source collection
       • Phase III: Database transformation
       • Phase IV: Data mining model building
       • Phase V: Model assessment
  7. [Architecture diagram: the data source is converted by SSIS and supplied to the MSSQL database server; the SSAS database server and the Neural Network query data from the database and produce the data mining results.]
  8. To identify the business need, the experiment in this project involves a financial application that asks the following questions, to help the financial department manage a currency swap. Which factors most affect the US Dollar and Thai Baht exchange rate? What is the next day's exchange rate likely to be? The following definition is used throughout this presentation. Fundamental: the inquiry from the financial department described above.
  9. To answer the questions from the first phase, the appropriate data needs to be collected. Ideas can come from people with relevant experience, who help narrow the huge volume of raw data down to meaningful data instead of gathering everything. However, data mining techniques tend to require more historical data than standard models and, in the case of neural networks, can be difficult to interpret.
  10. Contents and data sources
       • Economic statistical indicators: Bank of Thailand
       • Daily Thai stock index: The Stock Exchange of Thailand
       • Daily Thai bank interest rate: Bank of Thailand
       • Daily exchange rates: Bank of Thailand
       • Daily gold trading price: Bloomberg, Thai Gold Trader
       • Daily crude oil prices: Bloomberg
       • Daily world stock index: Bloomberg
  11. Database tables. Once all the expected data sources were obtained, the data transformation began. I wrote C# scripts that grab the data from the raw sources and feed it into the MSSQL database server, which is updated automatically every day. The Fundamental database has 32 tables, and only the appropriate tables are included in this project. A view named usdVSVariables is created over the selected tables.
  12. SQL code
      SELECT DISTINCT TOP (100) PERCENT
          dbo.ExchangeRates.DateKey,
          dbo.GoldMarket.DollarPerOunce,
          dbo.Energy.Value AS CrudeOil,
          dbo.ExchangeRates.BuyingSightBill,
          StockValue.SETValue,
          StockValue.DJValue,
          InterestMRR.MRR,
          DepositRate.OneYearMax
      FROM dbo.ExchangeRates
      INNER JOIN dbo.Energy
          ON dbo.ExchangeRates.DateKey = dbo.Energy.DateKey
      INNER JOIN dbo.GoldMarket
          ON dbo.Energy.DateKey = dbo.GoldMarket.DateKey
      INNER JOIN (
          SELECT T.DateKey, T.Value AS SETValue, D.Value AS DJValue
          FROM dbo.StockMarket AS T
          INNER JOIN dbo.StockMarket AS D ON T.DateKey = D.DateKey
          WHERE (T.Symbol = 'SET') AND (D.Symbol = 'DowJones')
      ) AS StockValue
          ON dbo.GoldMarket.DateKey = StockValue.DateKey
      INNER JOIN (
          SELECT DateKey, BankName, MRR
          FROM dbo.LoanInterestRate
          WHERE (BankName = 'Bangkok Bank')
      ) AS InterestMRR
          ON StockValue.DateKey = InterestMRR.DateKey
      INNER JOIN (
          SELECT DateKey, BankName, OneYearMax
          FROM dbo.DepositInterestRate
          WHERE (BankName = 'Bangkok Bank')
      ) AS DepositRate
          ON InterestMRR.DateKey = DepositRate.DateKey
      WHERE (dbo.ExchangeRates.DateKey > 19991231)
          AND (dbo.ExchangeRates.Currency = 'USD')
          AND (dbo.GoldMarket.DollarPerOunce > 0)
  15. SSAS sample (Internet connection required)
  16. At this point, I divide the demonstration into two sections. Fundamental: predict the USD-Thai Baht exchange rate. Customers: identify prospective customers with potential. Let us start with the Fundamental data mining implementation. The standard approach is to model all associated attributes as input variables, predict Thai Baht per dollar as the result, and analyze which factors are most influential. Mining structure: the data source is the SSAS server; the data is split 70:30 for training and testing; the data type is discretized; the key is DateKey.
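The 70:30 split of the mining structure can be sketched as below. This is a minimal illustration that assumes a random holdout; the function name `split_cases`, the seed and the sample DateKey values are hypothetical and not taken from the project.

```python
import random

def split_cases(cases, train_fraction=0.7, seed=42):
    """Shuffle the cases and split them into training and testing lists."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical DateKey values standing in for rows of the view.
date_keys = [20100101 + i for i in range(10)]
train, test = split_cases(date_keys)
print(len(train), len(test))
```

Every case ends up in exactly one of the two lists, so the testing set is never seen during training.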
  17. To show which variables are most important for predicting Thai Baht per dollar, I use a hybrid approach that exploits the advantages of each algorithm: a Decision Tree and Naïve Bayes classify which variables to use as input to the Neural Network algorithm. The decision tree can detect rules like “if A then B”. It does not handle continuous values well, as in “if A then 2.5”; instead it tries to split the node as “if A is > 20 then B”. That is why the Neural Network takes over to produce numeric outcomes, which can be compared against the Decision Tree results. In this way, the forecast of Thai Baht per dollar should be more accurate, because it is based on the associated variables that best predict the approximate next-day value. [Diagram: Variables 1 to 6 are input to the Decision Tree and Naïve Bayes, which classify Variable 2 and Variable 6 as input to the Neural Network.]
  18. The associated variables can be found by survey, by external data research, or by talking to people with relevant experience. The advantage of forecasting from several factors instead of only one is that they cross-validate each other, which gives a higher-quality and more precise interpretation of the data.
      Variable          Description                       Usage
      SETValue          Thai stock index (SET)            Input
      DJValue           Dow Jones index                   Input
      CrudeOil          Crude oil, dollar per barrel      Input
      DollarPerOunce    Gold price, dollar per ounce      Input
      BuyingSightBill   Thai Baht per USD currency rate   Output (predicted)
      DateKey           Date dimension                    Key column
  19. To get the whole picture of how each attribute relates to the predicted value, we typically need to retrieve the full history of those attributes from the database. This shows the main pattern in the big cycle and determines the ceiling and floor of the data range. Later we can narrow the data range to look for a pattern in a small cycle within the big cycle. Data ranges: 10 years and 1 year.
  20. [Chart: 10 Years Gold Price Dollar Per Ounce (DollarPerOunce) and Baht Per USD Currency Relationship Graph, from Jan-01-2000 to Dec-31-2010]
  21. [Chart: 10 Years Crude Oil USD/Barrel (CrudeOil) and Baht Per USD Currency Relationship Graph, from Jan-01-2000 to Dec-31-2010]
  22. [Chart: 10 Years Thai SET Index (SETValue) and Baht Per USD Currency Relationship Graph, from Jan-01-2000 to Dec-31-2010]
  23. [Chart: 10 Years Dow Jones Index (DJValue) and Baht Per USD Currency Relationship Graph, from Jan-01-2000 to Dec-31-2010]
  24. A decision tree can help identify which factors to consider and how each factor has historically been associated with different outcomes. Concept: a decision tree is a classification algorithm that makes predictions based on the relationships between input columns in a dataset by creating a series of splits, or nodes, in the tree. The algorithm adds a node to the model every time an input column is found to be significantly correlated with the predictable column. To capture the big cycle of the data range, in this scenario the algorithm discretizes Baht per USD into 2 buckets: Bucket 1 is < 38.32 and Bucket 2 is >= 38.32. After processing, the decision tree helps determine which variable most affects whether the value falls under or above 38.32.
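The bucketing and the idea of ranking input columns by how well a split separates the two buckets can be sketched as follows. This is only an illustration: it scores splits with information gain, which is an assumption and not necessarily the scoring method SSAS applies, and the four rows are made-up values. Only the 38.32 boundary and the split points 543.445 and 10532 come from the slides.

```python
import math
from collections import Counter

THRESHOLD = 38.32  # bucket boundary reported on the slide

def bucket(baht_per_usd):
    """Map a Baht/USD rate to one of the two discretized buckets."""
    return "< 38.32" if baht_per_usd < THRESHOLD else ">= 38.32"

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, split_value):
    """Entropy reduction from splitting the rows at attribute < split_value."""
    parent = [r["bucket"] for r in rows]
    left = [r["bucket"] for r in rows if r[attribute] < split_value]
    right = [r["bucket"] for r in rows if r[attribute] >= split_value]
    weighted = sum(len(part) / len(parent) * entropy(part)
                   for part in (left, right) if part)
    return entropy(parent) - weighted

# Hypothetical rows, for illustration only.
rows = [
    {"DollarPerOunce": 300, "DJValue": 10000, "rate": 43.0},
    {"DollarPerOunce": 400, "DJValue": 11000, "rate": 41.0},
    {"DollarPerOunce": 900, "DJValue": 10000, "rate": 34.0},
    {"DollarPerOunce": 1200, "DJValue": 11000, "rate": 31.0},
]
for r in rows:
    r["bucket"] = bucket(r["rate"])

gold_gain = information_gain(rows, "DollarPerOunce", 543.445)
dj_gain = information_gain(rows, "DJValue", 10532)
```

In this toy data the gold split separates the buckets perfectly while the Dow Jones split does not separate them at all, which is the kind of difference that decides which attribute a tree splits on first.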
  25. Dependency Network. Displays the relationships between the attributes, from the least to the most important factor contributing to the predictable attribute. The center node of the chart represents the predictable attribute and the surrounding nodes represent the input attributes. Number 1 is the most important factor and 4 is the least. In the diagram, SET Value is the least influential factor, so it is the first to disappear when the link strength is adjusted, followed by Crude Oil, DJ Value and Dollar Per Ounce in that order. As a result, the decision tree automatically creates its tree nodes in order from most to least important.
  26. Tree Nodes. Typically, a decision tree is a classification model that holds all cases at the root node, then splits into child nodes on the most influential attribute, which here is Value – vEnergy. Each child node then splits on the second most important factor, and splitting continues until no more cases can be split. These final, least important nodes are called leaf nodes, as in the diagram. In the histograms, pink represents values < 38.32 and green represents values >= 38.32. Each node splits into 3 DollarPerOunce nodes, labeled with a data range and colored to indicate the category.
  27. Histogram. Each node may contain a single pure factor or several factors, and it reports statistics, supporting cases and probability, represented by a histogram. The histogram indicates the percentage of cases in the node for each outcome. For example, travelling from the root node through the node DollarPerOunce < 543.445, the histogram is dominated by the green stripe, with 906 cases and a probability of 92.65%, which implies that this node determines a Baht/USD value greater than 38.32. Although DJValue was split into values greater than and less than 10532, both nodes also support Baht/USD > 38.32. The only difference is that the cases are grouped into two categories, and a case can fall into either node. The DJValue and Baht/USD relationship chart makes this clearer.
  28. [Chart: 10 Years Dow Jones Index and Baht Per USD Currency Relationship Graph, from Jan-01-2000 to Dec-31-2010. The chart marks Baht/USD at 38.32 and Dow Jones at 10532, separating the DJValue >= 10532 zone from the DJValue < 10532 zone.]
  29. After the decision tree is processed, nodes with a low histogram do not influence the predicted value; only the nodes with the purest color are included in the interpretation. As a result, gold price is the most influential factor for the direction of Baht/USD. If the gold price goes up, Baht/USD is likely to go down; if the gold price goes down, Baht/USD goes up. The dependency network confirms that gold price is the most important attribute in the tree algorithm, which can also be seen at the next level under the gold price node 543.44 to 862.84. It splits into 3 Thai SET index nodes. Although they all have high histograms, they appear to be meaningless, because the process repeats recursively for each child and the children cover the whole SET range, so they can fall in any SET zone. [Chart: DollarPerOunce against Baht/USD, marking 38.32 and Gold > 543.44.]
  30. However, under Baht/USD 38.32 with a gold price of 543.44 to 862.84, 3 SET nodes support this scenario. The same observation applies to the node with a gold price below 543.44, as explained by the figures on pages 27-28. For instance, if the gold price drops below 543.44, then for any range of Dow Jones, Baht/USD is likely to go up. [Chart: SETValue against Baht/USD, marking 38.32 and SET Zones 1, 2 and 3.]
  31. A decision tree can classify a dataset into segments and point out which variable has the most impact on the predicted value. Its disadvantage is that it splits on a single variable at the root and at each node, and because the data is split recursively from the root node to the leaf nodes, there is usually very little data left at a leaf to make a decision. For instance, recall from the previous figure that under the gold price node 543.44 to 862.84 there are 3 SET value nodes, but those nodes cannot specify exactly which SET data range applies; instead they cover every zone, because the 3 nodes make their decisions recursively based on their parent node. Naïve Bayes is different: each attribute is evaluated independently and directly against the predicted value, not recursively through other nodes. A classifier can then be built at the leaf nodes. For instance: are small companies with annual profits of more than $500K a bad credit risk? Are large companies with negative annual profits still a good credit risk? Naïve Bayes does not consider combinations of attributes as a decision tree does. So if the decision tree segments the data that forms the essential part of the big picture, each segment represented by a leaf can be described with Naïve Bayes. The choice depends on the business problem: if we only want the big picture of the data, the decision tree provides enough information; if we need to explore other attributes that do not depend on the big picture, we need Naïve Bayes. In this case the gold price node is the big picture, since every path from the root to a leaf passes through it. The leaf nodes contain little data, but that data may still be important, and it can be examined by running Naïve Bayes at the leaf.
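The independence assumption described above can be sketched with a small counting-based Naïve Bayes classifier over discretized attributes. This is a generic textbook version with Laplace smoothing, not the Microsoft Naïve-Bayes implementation; the attribute states ("low", "high") and the six training rows are hypothetical.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, target):
    """Count class frequencies and per-attribute state frequencies."""
    class_counts = Counter(r[target] for r in rows)
    state_counts = defaultdict(Counter)  # (attribute, class) -> Counter of states
    for r in rows:
        for attr, state in r.items():
            if attr != target:
                state_counts[(attr, r[target])][state] += 1
    return class_counts, state_counts

def predict(model, case, n_states=2):
    """Return normalized class probabilities for one case."""
    class_counts, state_counts = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n in class_counts.items():
        score = n / total  # prior probability of the class
        for attr, state in case.items():
            # Each attribute contributes independently of the others.
            # Laplace smoothing keeps unseen states from zeroing the product.
            score *= (state_counts[(attr, cls)][state] + 1) / (n + n_states)
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: s / norm for cls, s in scores.items()}

# Hypothetical discretized cases, for illustration only.
rows = [
    {"Gold": "low", "Oil": "low", "Rate": ">= 38.32"},
    {"Gold": "low", "Oil": "low", "Rate": ">= 38.32"},
    {"Gold": "low", "Oil": "high", "Rate": ">= 38.32"},
    {"Gold": "high", "Oil": "high", "Rate": "< 38.32"},
    {"Gold": "high", "Oil": "high", "Rate": "< 38.32"},
    {"Gold": "high", "Oil": "low", "Rate": "< 38.32"},
]
model = train_naive_bayes(rows, target="Rate")
probs = predict(model, {"Gold": "low", "Oil": "low"})
```

Unlike a tree, no attribute is evaluated inside the subset selected by another attribute; each one is counted directly against the predicted bucket.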
  32. Dependency Network. After Naïve Bayes is executed, the Dependency Network gives an order of attribute importance that differs from the Decision Tree: Crude Oil is the second most important attribute instead of Dow Jones. That is because Crude Oil, like the other attributes, is classified independently and directly against Baht/USD. Gold price is still the most important attribute. The attribute profiles on the next page show each attribute's states by data range, represented by color. Baht/USD is split into two cases, >= 38.32 and < 38.32, and the >= 38.32 case appears more reliable than the < 38.32 case because it has fewer segments. The input attributes therefore have a meaningful relationship to Baht/USD.
  33. Attribute Profiles. The figure on the left shows each attribute against Baht/USD. A pure color indicates the highest probability. Gold price gives high confidence: the blue state, with values below 543.44, supports Baht/USD >= 38.32 with 96% probability. For the < 38.32 case, the same attribute and data range has only 0.83% probability, with a 50:41 proportion for values greater than 543.44 instead. Analyzing the result: gold price and crude oil clearly move in the opposite direction to Baht/USD, so when gold and crude oil prices drop, Baht/USD goes up. Dow Jones and SET, by contrast, do not have a clearly linear relationship with Baht/USD (figures on pages 28 and 30), so they can fall either under or above the 38.32 zone. For instance, Dow Jones below and above 1053.85 falls in the >= 38.32 case with 68:32 probability, and in the < 38.32 case with 34:66 probability. Dow Jones and SET are therefore not confident indicators of Baht/USD direction in the Naïve Bayes algorithm, which is why they have low importance in the dependency network.
  34. In this phase, I use tools to determine the accuracy of the models that were created, and examine the models to determine the meaning of the discovered patterns and how to apply them to the business. For example, a model may determine that Baht/USD drops if the gold price or crude oil price goes up. A dataset with a linear relationship is more meaningful than random data. Although 10 years of gold price and crude oil history can be the most appropriate input attributes for data mining, the same attribute may not contain any useful pattern over a different data range. For example, 1 year of crude oil history might be a non-linear dataset, while SET might contain useful patterns instead. [Chart: CrudeOil against Baht/USD, marking 38.32.]
  35. So the choice depends on the business need. If the focus is only on the main scope, algorithms with discretized content over a large historical dataset are the best fit. On the other hand, a small historical dataset with numeric content may be the best solution for an application that focuses on real numeric calculation, such as daily stock forecasting, because a large dataset takes a long time to produce a result. Even on a high-performance computer, a Neural Network might take a whole month to learn and search for just a small pattern with multiple input attributes. A good approach for a general result is therefore to build several models using different algorithms and then compare their accuracy. [Charts: One year Baht/USD - Crude Oil Historical; One year Baht/USD - SET Historical.]
  36. Classification Matrix. The accuracy of an algorithm depends on the nature of the data, the data range and the choice of an appropriate algorithm. You may need to repeat the data cleaning and transformation in order to derive more meaningful variables, then determine the big picture of the dataset with the created algorithms. If the relationships among attributes are complicated, a neural network may perform better. It is very important to work with business analysts who have the proper domain knowledge to validate the discoveries before deploying the patterns found by data mining to production. In this experiment, the big-picture pattern found by the Decision Tree and Naïve Bayes algorithms, with gold price and crude oil as input attributes, would need to be validated before moving to the next step. To complete this project, however, I will assume that these attributes are the most important ones for determining the big-picture direction of Baht/USD. In the next step, a Neural Network learns and searches the dataset derived from the previous algorithms' output, attempting to form the found patterns into a linear relationship.
  37. Recall from the beginning of this presentation that an unknown dataset pattern can be solved with the bottom-up technique. A Neural Network is a good approach for complicated data as long as the input attributes are the right ones. CONCEPT. A neural network (NN) is an algorithm based on the operation of biological neural networks; in other words, it is an emulation of the human brain. It is designed to work like a human brain by learning problems and later solving other, similar problems. In the human brain, action potentials are the electric signals that neurons use to convey information, and they travel through the net across what is called the synapse. Because these signals are identical, the brain determines what type of information is being received from the path the signal took. The brain analyzes the patterns of signals being sent, and from that information it can interpret the type of information being received. To emulate that behavior, the artificial neural network has several components: the node plays the role of the neuron, and the weights are the links between nodes, so they correspond to the synapses in the biological net. The input signal is modified by the weights and summed to obtain the total input value for a specific node (diagram on the next page). There are three kinds of layer in a NN: the input layer, which holds one node for each input variable; the hidden layer, of which there can be several; and the output layer, which holds the result set. An activation function is applied to the total input to obtain the value of a particular node.
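The weighted sum for a single node can be sketched as below. The step activation with a fixed threshold is an assumption chosen for simplicity, and the function name `node_output` and the input and weight values are hypothetical.

```python
def node_output(inputs, weights, threshold):
    """Return 1 if the weighted input sum exceeds the threshold, else 0."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0

# The same inputs with two different weight sets.
fires = node_output([1.0, 0.5], [0.6, -0.4], threshold=0.3)
silent = node_output([1.0, 0.5], [0.2, -0.4], threshold=0.3)

# The threshold can equally be written as a bias: a constant input of 1.0
# whose weight is the negative threshold, compared against zero.
fires_with_bias = node_output([1.0, 0.5, 1.0], [0.6, -0.4, -0.3], threshold=0.0)
print(fires, silent, fires_with_bias)
```

A positive weight pushes the node toward firing and a negative weight inhibits it, so changing only the weights changes the node's decision for the same inputs.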
  38. Neuron scheme and node scheme. The diagram illustrates a neuron, which receives information from other neurons as input through synapses, while the connections between neurons form branches, or a network. Once the input is larger than a certain threshold, the neuron fires according to the information received. The node scheme is similar: the perceptron takes a weighted sum of its inputs and sends the output to other nodes if the sum is greater than an adjustable threshold value. The inputs x1, x2, x3, ..., xm and connection weights w1, w2, w3, ..., wm are typically real values. If feature xi tends to cause the perceptron to fire, the weight wi will be positive; if feature xi inhibits the perceptron, the weight wi will be negative. The perceptron consists of the weights, the summation processor and an adjustable threshold processor, or bias input. A bias input may carry more weight than the regular inputs, in which case it affects the firing of the activation function.
  39. Several algorithms are used in neural networks. Backpropagation is one of the most popular and is the one used in this project. The backpropagation algorithm propagates backwards the error obtained in the output layer, by comparing the calculated values of the nodes to the real or desired values. The propagation distributes the error and modifies the weights, or links, between the previous and present nodes. Going backwards, the values of the nodes in the hidden layer can be modified, and so can the weights between the input and hidden layers, but not the values of the nodes in the input layer, because they are the values of the variables we are using. Once the algorithm reaches the input layer, it goes forward again with the modified weights and recalculates the results in the output layer. This process is repeated until a minimum error is reached. One node scheme (perceptron): there are two input attributes (GOLD with weight w1 and SET with weight w2) and one bias (weight w3) in the first layer. They pass their weighted values forward to the perceptron, which sums the inputs and sends the result to the output layer. The output is fired through the activation function, a bipolar sigmoid function f. The whole process runs 20 nodes in the first layer to produce one output layer, and the following steps describe how it works.
  40. Learning process
       • Split the data into 2 sets: 85% for training and 15% for validating.
       • Randomly take 20 values each of gold price and SET from the training set.
       • Generate the weights for the links between the nodes.
       • Compare the outputs to the actual data (validating set) to measure accuracy.
       • Calculate the learning errors.
       • Adjust for the output errors to improve the results.
       • Supply a new batch from the training set and repeat the process until a minimum learning error is reached.
      Implementation
       • Gold price data range: 1062 - 1413
       • SET data range: 684 - 1047
       • 1 year data range: Jan-01-2010 to Dec-31-2010
       • 24 hours total learning process time
       • Query statement from SQL Server
      The learning process keeps trying to recognize the pattern against the actual value and to solve the problem with an equation (video; Internet connection required).
  41. Performance. Because the learning process takes so long, this experiment was limited to 24 hours, which gave a total error of 33.43 and an average error of 0.14. It would take only a few minutes to generate a result if the data range were a month or 10 days, but the performance would go down as a result. [Charts: One year Baht/USD - Gold Price Result; One year Baht/USD - SET Result.] This validation gave a predicted Baht/USD of 33.01, an error of 0.16 compared to the actual value of 33.17.
  42. Even though the 2009 values of gold price 1091.50 and SET 681.91 were not included in the data range for learning, the NN still recognizes the similar pattern that occurred in 2010 and tries to generate a similar output. The pattern does not rely on gold price alone; SET also helps the NN classify it. For instance, 2009 and 2010 had the same gold price of 1091.40 but different SET values of 686.41 and 784.38, so the Baht/USD result varies depending on the SET input too. [Chart: predicted result vs actual.] The learning error history shows that the closer the error gets to zero, the more accurate the NN result. As the NN algorithm goes back and forth to find the weights that allow it to predict the output variable, the weights vary from the initial randomly generated values to the final ones that meet the error target. The learning history shows a total error of 33.43, an average error of 0.14 per pair of predicted and actual values, a minimum of 0.0002 and a maximum of 0.58.
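The error figures reported above (total, average, minimum, maximum) can be computed as in the sketch below. It assumes that "error" means the absolute difference between each predicted and actual value, which the slides do not state explicitly. The first pair (33.01 predicted, 33.17 actual) is the validation example from the slides; the other two pairs are hypothetical.

```python
def error_summary(predicted, actual):
    """Summarize absolute errors between paired predicted and actual values."""
    errors = [abs(p - a) for p, a in zip(predicted, actual)]
    return {
        "total": sum(errors),
        "average": sum(errors) / len(errors),
        "min": min(errors),
        "max": max(errors),
    }

predicted = [33.01, 32.50, 31.90]
actual = [33.17, 32.45, 32.10]
summary = error_summary(predicted, actual)
```

With this definition, the total grows with the number of validated pairs, while the average stays comparable across data ranges of different lengths.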
  43. Neural Network learning implementation video (Internet connection required)
  44. Summary. To answer the financial department's inquiry about predicting the Thai Baht against USD exchange rate, a Neural Network is the final step of this experiment. It takes the input attributes classified by Decision Trees and Naïve Bayes, analyzed through the SQL database and SSAS, to predict the Baht/USD movement as numeric data. The experiment also covers data pattern recognition with several algorithms and approaches, i.e. classification, segmentation, approximation and backpropagation.
      References
       3. Neural Network on C#, by Andrew Kirillov
       4. Delivering Business Intelligence, by Brian Larson
       5. Neural Network, from Wikipedia
       6. Back Propagation, from Wikipedia
       7. Decision Tree, from Wikipedia
       8. Naïve Bayes, from Wikipedia