Data mining  2012 generalwithmethods
Data Mining Overview


    • Michael Gilman, Ph.D. Copyright Data Mining Technologies Inc. 2012. 631-692-4400 ext. 100
    • What is Data Mining? An information extraction activity whose goal is the discovery of hidden facts contained in databases. It finds patterns and subtle relationships in data, inferring rules and generalizations that allow the prediction of future results. To be a true knowledge discovery method, a data mining tool should unearth information automatically. pres071911a 2
    • Overview: The purpose of this presentation is to introduce a new and powerful methodology, and its associated software, that overcomes many of the limitations of the other data mining methods in use today.
    • Background: The manual extraction of patterns from data has occurred for centuries. Early methods of identifying patterns in data included statistical methods such as Bayes' theorem (1700s) and regression analysis (1800s). The proliferation, ubiquity and increasing power of computer technology have increased data collection, storage and manipulation. As data sets have grown in size and complexity, direct hands-on data analysis has increasingly been augmented with indirect, automatic data processing. This has been aided by other discoveries in computer science, such as neural networks, clustering, genetic algorithms (1950s), decision trees (1960s) and support vector machines (1990s). Data mining is the process of applying these methods to data with the intention of uncovering hidden patterns. (Wikipedia)
    • How Does Data Mining Work? Data mining involves building predictive models that show how to run some enterprise more effectively. Building a predictive model takes several steps. Before we outline these steps, here is a real-world problem.
    • Question: How can we keep healthcare quality high and keep costs down?
    • Input Data: File containing clinical data and costs
    • Steps in Data Mining Define the problem goals Identify data sources pres071911a 8
    • Then build the model [diagram labels: Data, Mine Data, Model, If-Then Rules]
    • Results Model containing rules showing what is best of breed treatment for each case and whyIf diagnosis = Congestive HF and Age =60- 70 and previous. bypass = yes and . . . Then BOB Treatment = aortic stent pres071911a 10
    • Steps in a Data Mining Project
        1. Define the business or scientific problem. Example: Which of my current customers are likely to become inactive in the next 6 months?
        2. Gather a historical data file. Prepare a file of customers (present and past) that includes predictive descriptors such as start date, date of first sale, date of last sale, sales by month, how acquired, etc. Include the current status (active or inactive) for each customer.
    • Steps in a Data Mining Project (continued)
        3. Cleanse the Data. Data cleansing reduces noisy and missing data and removes erroneous data.
        4. Add Derived Attributes. Create additional variables from the original data if necessary (example: compute customer account duration from start date and current date).
        5. Create Test and Holdout Files. Randomly separate the original file into two parts, called the test and holdout files. Build the predictive model with modeling software.
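Steps 4 and 5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not part of the presented software: the record fields (`customer_id`, `start_date`, `status`), the function names, and the 30% holdout fraction are all hypothetical choices.

```python
import random
from datetime import date


def add_account_duration(record, today):
    """Step 4: return a copy of the record with a derived duration attribute."""
    derived = dict(record)
    derived["account_duration_days"] = (today - record["start_date"]).days
    return derived


def split_build_holdout(records, holdout_fraction=0.3, seed=0):
    """Step 5: randomly separate records into a model-building file and a holdout file."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Hypothetical customer file with ten records.
customers = [
    {"customer_id": i,
     "start_date": date(2010, 1, 1 + i),
     "status": "active" if i % 2 else "inactive"}
    for i in range(10)
]

today = date(2012, 1, 1)
prepared = [add_account_duration(c, today) for c in customers]
build_file, holdout_file = split_build_holdout(prepared)

print(prepared[0]["account_duration_days"])   # 730
print(len(build_file), len(holdout_file))     # 7 3
```

The model is then built only from `build_file`; `holdout_file` is set aside untouched for the validation step.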
    • Steps in a Data Mining Project (continued)
        6. Validate the Data. Validation uses data that was not used when building the model. This is the holdout set defined previously. The learned patterns are applied to the holdout set and the resulting output is evaluated for accuracy. For example, a data mining algorithm trying to distinguish spam from legitimate emails would be trained on a training set of sample emails. Once trained, the learned patterns would be applied to a separate set of emails on which the algorithm had not been trained. The accuracy of these patterns can then be measured from how many emails they correctly classify.
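The accuracy measurement described in step 6 amounts to counting correct classifications on the holdout set. The sketch below assumes a model is any function that maps an input to a label; `keyword_model` is a hypothetical stand-in for a trained spam classifier, and the four emails are made-up examples.

```python
def accuracy(model, holdout):
    """Fraction of holdout examples for which the model predicts the true label."""
    if not holdout:
        raise ValueError("holdout set is empty")
    correct = sum(1 for features, label in holdout if model(features) == label)
    return correct / len(holdout)


def keyword_model(email_text):
    """Hypothetical stand-in for a trained classifier: flags any email mentioning 'free'."""
    return "spam" if "free" in email_text.lower() else "legitimate"


# Emails the model never saw while it was being built.
holdout = [
    ("Win a FREE prize now", "spam"),
    ("Meeting moved to 3pm", "legitimate"),
    ("Free lunch on Friday for the team", "legitimate"),  # will be misclassified
    ("Claim your free gift card", "spam"),
]

print(accuracy(keyword_model, holdout))   # 0.75
```

Measuring on emails that were used to build the model would overstate the accuracy, which is why the holdout set is kept separate.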
    • Steps in a Data Mining Project (continued): At this point a model has been created and it can now be used.
        7. Use the Model to Make Predictions. The final step of knowledge discovery from data is to use the model produced by the data mining algorithms. As new data come in, the model is applied to them to make predictions.
    • Comparison of Methods: Nuggets offers benefits that the other methods don't offer. Here are a few:
        • Handles missing data
        • Handles very large numbers of predictor attributes
        • Fast model development
        • Able to model small data patterns missed by other methods
        • Handles a wide variety of data types
        • Doesn't require highly trained specialists
    • Principal Data Mining Techniques: Industry Standard Methods
        • Statistics
        • Neural Nets
        • Decision Trees
      Following is a comparison of Nuggets with these principal competitors.
    • Nuggets: Nuggets is a proprietary technology that uses proprietary search algorithms to intelligently prospect data for valid hypotheses. In the act of searching, the algorithms "learn" about the training data as they proceed. The result is a very fast and efficient discovery strategy that does not preclude any potential rule or generalization from being found. This document outlines its advantages over its competitors in providing useful and profitable information from the vast store of data that are being accumulated at an ever-increasing rate.
    • Statistics Methods Pros/Cons: Statistics Pros. Statistical analysis is sometimes a good "first step" in understanding data. These methods deal well with numerical data where important mathematical facts, such as the underlying probability distributions of the data, are known. However, in today's world these mathematical facts are rarely known. These methods are not as good with nominal data values such as "good", "better", "best" in the case of a preference attribute, or "Europe", "North America", "Asia" or "South America" in the case of a location attribute.
    • Method Pros/Cons: Statistics (continued). Some of the statistical methods commonly used are regression analysis, correlation, CHAID analysis, hypothesis testing, and discriminant analysis. Statistical analysis is sometimes a good "first step" in understanding data. These methods deal well with numerical data where the underlying probability distributions of the data are known. This is not often the case in real-world problems.
    • Statistics Methods Pros/Cons (cont.): Nuggets Advantages Over Statistics. Statistical methods require statistical expertise, or a project person well versed in statistics who is heavily involved. Such methods require difficult-to-verify statistical assumptions. They suffer from the "black box aversion syndrome". This means that non-technical decision makers, those who will either accept or reject the results of the study, are often unwilling to make important decisions based on a technology that gives them answers but does not explain how it got the answers.
    • Statistics Method Pros/Cons: Nuggets Advantages Over Statistics. Telling a non-statistician CEO that she or he must make a crucial business decision because of a favorable R statistic, or for some other arcane statistical reason, is not usually well received. With Nuggets® you can be told exactly how the conclusion was arrived at. Another problem is that statistical methods are valid only if certain assumptions about the data are met. Some of these assumptions are: linear relationships between pairs of variables, absence of multicollinearity, normal probability distributions and independence of samples. If you do not validate these assumptions, because of time limitations or because you are not familiar with them, your analysis may be faulty and your results may not be valid. Even if you know about them, you may not have the time or information to verify the assumptions.
    • Method Pros/Cons - Neural Nets: Neural Networks. This is a popular technology, particularly in the financial community. This method was originally developed in the 1940s to model biological nervous systems in an attempt to mimic human thought processes.
    • Method Pros/Cons - Neural Nets: Pros. The end result of a Neural Net project is a mathematical model of the process. It deals primarily with numerical attributes such as age, income, height, etc., but not as well with nominal data such as state, brand preference, vehicle make, etc.
    • Method Pros/Cons - Neural Nets: Nuggets Advantages. There is still much controversy regarding the efficacy of Neural Nets. One major objection to the method is that the development of a Neural Net model is partly an art and partly a science, in that the results often depend on the individual who built the model. That is, the model form (called the network topology), and hence the results, may differ from one researcher to another for the same data.
    • Method Pros/Cons - Neural Nets: There is also the problem of "overfitting" with Neural Nets, which results in good prediction of the data used to build the model but bad results with new data. Neural Nets often use a sigmoid function in their computations. This is a mathematical function resembling the shape of the letter "S". Questions exist as to whether there is any theoretical justification for this somewhat arbitrary choice, which makes the approach somewhat ad hoc. Another issue is that the modeling results produced by a Neural Net method are not intuitive. The method is called a "black box" to indicate the lack of intuitive understanding of its results. Neural Nets are still in use but are becoming less popular due to these issues.
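For reference, the sigmoid most commonly meant here is the logistic function, sigmoid(x) = 1 / (1 + e^(-x)), which rises smoothly from 0 to 1 and passes through 0.5 at x = 0. The slide does not say which sigmoid is intended, so treating it as the logistic function is an assumption. A minimal sketch:

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function 1 / (1 + e^(-x)), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)          # x < 0, so z is in (0, 1) and cannot overflow
    return z / (1.0 + z)


# Sample the curve to see the "S" shape: flat near 0, steep in the middle, flat near 1.
for x in (-6, -2, 0, 2, 6):
    print(f"{x:>3}  {sigmoid(x):.4f}")
```

The two branches compute the same value; splitting on the sign of `x` just keeps `math.exp` from being called with a large positive argument.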
    • Method Pros/Cons - Decision Trees: Decision Trees (CART, CHAID, etc.). Decision tree methods are techniques for partitioning a training file into a tree representation. The starting node is called the root node. Depending upon the results of a test, this node is partitioned into two or more subsets. Each node is then further partitioned until a tree is built. This tree can be mapped into a set of rules. These rules, in the form of a data tree, are used to generate forecasts.
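The mapping from a tree to a set of rules can be sketched as follows: every path from the root node to a leaf becomes one if-then rule. The nested-dictionary tree layout, the attribute names, and the function `tree_to_rules` are hypothetical choices made for illustration; they are not the format used by CART, CHAID, or any particular tool.

```python
def tree_to_rules(node, path=()):
    """Yield (conditions, prediction) pairs, one for each root-to-leaf path."""
    if "prediction" in node:                      # leaf node
        yield list(path), node["prediction"]
        return
    for value, child in node["branches"].items():
        condition = (node["attribute"], value)
        yield from tree_to_rules(child, path + (condition,))


# Hypothetical tree for the customer-inactivity example.
tree = {
    "attribute": "how_acquired",
    "branches": {
        "referral": {"prediction": "active"},
        "advertisement": {
            "attribute": "months_since_last_sale",
            "branches": {
                "<=6": {"prediction": "active"},
                ">6": {"prediction": "inactive"},
            },
        },
    },
}

rules = list(tree_to_rules(tree))
for conditions, prediction in rules:
    tests = " and ".join(f"{attr} = {value}" for attr, value in conditions)
    print(f"If {tests} Then status = {prediction}")
```

Each printed line corresponds to one leaf, so a tree with three leaves yields exactly three rules.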
    • Method Pros/Cons - Decision Trees: Nuggets Advantages. By far the most important negative for decision trees is that they are forced to make decisions along the way based on limited information, which implicitly leaves the vast majority of potential patterns in the training file out of consideration. This approach may leave valuable patterns undiscovered, since decisions made early in the process will preclude some good rules from being discovered later. This is called "greedy optimization" and it lessens the accuracy of the resulting model. Furthermore, decision trees do not handle the large numbers of predictor attributes found in most of today's data sets.
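A standard toy case shows how a greedy, one-attribute-at-a-time choice can miss a pattern. In the made-up data below the label depends on two attributes jointly (an XOR pattern), so neither attribute alone improves on guessing, while the pair predicts perfectly. This only illustrates the greedy limitation described above; it says nothing about how Nuggets itself searches, and `split_accuracy` and the data are hypothetical.

```python
from collections import Counter, defaultdict


def split_accuracy(rows, attributes):
    """Accuracy of predicting the majority label within each group formed by `attributes`."""
    groups = defaultdict(Counter)
    for features, label in rows:
        key = tuple(features[a] for a in attributes)
        groups[key][label] += 1
    correct = sum(max(counts.values()) for counts in groups.values())
    return correct / len(rows)


# Label is "yes" exactly when a and b differ.
rows = [
    ({"a": 0, "b": 0}, "no"),
    ({"a": 0, "b": 1}, "yes"),
    ({"a": 1, "b": 0}, "yes"),
    ({"a": 1, "b": 1}, "no"),
]

print(split_accuracy(rows, []))           # 0.5  no split at all
print(split_accuracy(rows, ["a"]))        # 0.5  splitting on a alone gains nothing
print(split_accuracy(rows, ["b"]))        # 0.5  splitting on b alone gains nothing
print(split_accuracy(rows, ["a", "b"]))   # 1.0  the joint pattern is exact
```

A tree builder that only splits when a single attribute shows an immediate gain would stop at the root here and never find the rule "If a differs from b Then yes".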
    • Method Pros/Cons - Decision Trees: Nuggets Advantages. Nuggets does not make these greedy decisions. Instead it "implicitly" searches all possible patterns and thus is able to find patterns that are useful but that wouldn't be found with decision trees.
    • Summary of Comparison With Other Methods: Nuggets Advantages. Nuggets offers many advantages over other methods in common use. A few were presented here. Nuggets' advantages vary from method to method, and most stem from the restrictive assumptions that these older methods require, which limit their effectiveness. Nuggets is designed to circumvent these disadvantages and offer a superior methodology that can work with the challenges of the large number of complex databases that exist in today's world.