Transcript

  • 1. Alternative methods of accessing digital information: Data mining
    I. Introduction
       • What is it?
    II. How does it work?
       • The virtuous circle of data mining
       • Techniques of data mining
    III. Data mining applications
       • What is it good for?
       • DM and CRM
  • 2. I. Introduction: What is it?
    • Data mining is a process of knowledge discovery in databases
    • It involves the extraction of interesting information, patterns, or rules from data in large databases
    • These patterns are non-trivial, implicit, previously unknown, and potentially useful
    • It is a search for valuable information in large volumes of data
    • It uses statistical techniques to explore and analyze large quantities of data in order to discover meaningful patterns and rules
  • 3. Why has data mining become so popular?
    • Large amounts of data are being produced as more functions become automated
    • Many algorithms require large data sets for training and learning
    • Data are being warehoused
      • They are extracted from various systems (accounting, billing, ordering, etc.) and stored in a central location
      • They are stored in a common format, with consistent definitions for fields and keys
    • Computer power is increasing and costs are decreasing
  • 4. And:
    • Strong competitive pressures
      • Information-intensive activities (business, science) are competing for market share, funding, etc.
    • Realization of the increasing value of information (especially as a source of revenue)
      • There is value in what can be discovered in data
      • For business, there is value in customization
    • Commercial data mining software is now available
      • There are off-the-shelf products
  • 5. Data mining can be directed
    • The goal is to use the available data to build a model that describes a variable of interest in relation to the data set
    • Given what we know about people in Bloomington, which types of people are likely to subscribe to DSL?
    Data mining can also be undirected
    • There is no variable of interest; the goal is to search through the available data to look for patterns and relationships
    • What can we learn about students at IU who default on their student loans?
  • 6. Data mining provides an organization with “memory” and “intelligence”
    • Noticing: uses on-line transaction processing (OLTP) systems
    • Remembering: capturing as much of the transaction process as possible (phone records, communications, CRM exchanges)
    • Learning: the records must be organized into “data warehouses,” and data mining is used to analyze these data
    • Intelligence involves patterns, rules, and predictions
  • 7. Data mining typically involves six activities
    1. Classification
    • Examining the features of a data instance and assigning it to a predefined class
    • Records in a database are updated by filling in a field with a “class code”
    • The process uses a “training set” to sort unclassified data into discrete classes
    • Examples: assigning keywords to articles as they arrive; sorting credit card applicants according to risk levels; assigning people to demographic categories
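As a toy illustration of classification, the sketch below assigns a new credit applicant to a risk class using a one-nearest-neighbour rule over a small preclassified training set. All field names, records, and class labels are invented for illustration; they are not drawn from any real data set or from the slides.

```python
# Hypothetical sketch: classify credit applicants into risk classes
# using a 1-nearest-neighbour rule over a small training set.
# Features and records below are invented for illustration.

def classify(record, training_set):
    """Assign `record` the class code of its closest training example."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    best = min(training_set, key=lambda ex: distance(record, ex[0]))
    return best[1]  # the "class code" of the nearest example

# Training set: (income_in_thousands, debt_ratio) -> risk class
training = [
    ((80, 0.1), "low"),
    ((45, 0.4), "medium"),
    ((20, 0.8), "high"),
]

print(classify((75, 0.15), training))  # prints "low"
```

A real system would use a larger training set and a more robust algorithm, but the shape is the same: preclassified examples in, a class code out.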
  • 8. 2. Estimation
    • This process deals with continuously valued outcomes
    • Using new data to predict whether a given data instance is above or below a threshold
    • This requires a model to determine the threshold level, and it can be used to make predictions
    • Examples: use customer data to determine churn rates; estimate how long a person is likely to remain a customer; assess the probability that people will respond to an offer of a home equity loan
    • The model's output runs between 0 and 1, with a 0.83 threshold
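A minimal sketch of estimation with a threshold, using the home-equity-loan example: a logistic-style model scores each customer between 0 and 1, and scores at or above the 0.83 threshold are treated as likely responders. The weights, bias, and customer fields below are invented, not fitted to real data.

```python
import math

# Hypothetical estimation model: scores a customer in (0, 1) and
# compares the score against the 0.83 threshold from the slide.
# Weights and field names are invented for illustration.

WEIGHTS = {"balance": 0.00005, "years_as_customer": 0.3}
BIAS = -2.0
THRESHOLD = 0.83

def score(customer):
    """Continuously valued output in (0, 1)."""
    z = BIAS + sum(WEIGHTS[k] * customer[k] for k in WEIGHTS)
    return 1 / (1 + math.exp(-z))

def will_respond(customer):
    """Above or below the threshold?"""
    return score(customer) >= THRESHOLD

c = {"balance": 50000, "years_as_customer": 8}
print(round(score(c), 2), will_respond(c))
```

The continuous score is what distinguishes estimation from plain classification; the threshold turns it into a yes/no prediction.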
  • 9. 3. Prediction
    • Similar to estimation, but with the expectation that there will be some check in the future
    • Uses a training set with historical data and a known predicted variable
    • Examples: predicting the size of the balance that is likely to be transferred when a person accepts a credit card offer; determining which customers will leave in a given time period; predicting which customers will add a new service, such as caller ID, in a given area
  • 10. 4. Affinity grouping or association rules
    • The goal is to explore an available data set to determine which data instances should be grouped together
    • This involves discovering relationships among data
    • Which items should be placed near each other in a supermarket? Which products can be grouped for cross-selling?
    5. Clustering
    • The task is to sort undifferentiated data into like groups
    • This process does not begin with predefined classes
    • What do the book and music purchases tell us about our customers?
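The supermarket question above can be sketched as a simple co-occurrence count: tally how often each pair of products appears in the same basket, then look at the most frequent pairs. The baskets below are invented sample data; real association-rule mining would also compute support and confidence.

```python
from collections import Counter
from itertools import combinations

# Sketch of affinity grouping: count how often pairs of products
# co-occur in the same basket. Baskets are invented sample data.

baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
]

pair_counts = Counter()
for basket in baskets:
    # sorted() gives each pair a canonical order so counts aggregate
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The most frequent pair suggests items to shelve or cross-sell together
print(pair_counts.most_common(1))  # prints [(('bread', 'butter'), 3)]
```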
  • 11. 6. Description and visualization
    • Developing a preliminary understanding of the data; this is a first step in developing an explanation
    • What can we tell about the people who shop in a food co-op?
    • Visualization is the graphic representation of the data
    Directed data mining: classification, estimation, prediction
    Undirected data mining: affinity grouping, clustering, description
  • 12. [Figure: Classes of data mining activity] Information Discovery, Inc. (2001). A Characterization of Data Mining Technologies and Processes. http://www.datamining.com/dm-tech.htm
  • 13. Types of data mining: Hypothesis testing
    • A top-down approach designed to test careful guesses
    • Process:
      • Hypotheses are formulated to be falsified (this is done in scientific and business applications)
      • Specific kinds of data are proposed to test the hypothesis, and a data requirements document is created
      • The data are gathered and prepared
      • Profile the data, especially if they are derived from heterogeneous sources
  • 14. Process continued: Data preparation (cont.)
    • The transformation is very important and will vary with the type of software being used
    • A computer model is built based on the data
    • The model is evaluated to reject or fail to reject the hypothesis; this is done by applying it to the data set
    • The end result is an analysis that statistically tests the hypothesis
    • The results are stated with the appropriate margin of error
  • 15. Issues in data preparation
    • Summarization: developing the appropriate level of detail
      • Sometimes the original data should not be summarized at all
      • The fine-grained data may be irrelevant to the question
      • There may be too few examples at the finest level of detail
    • Incompatible computer architectures
      • Data transport software can translate among different languages and formats (COBOL, C, C++, ASCII encoding, single- and double-precision floating point numbers, etc.)
  • 16. Inconsistent data encoding
    • Different sources represent the same data in different ways
    • If not caught, these inconsistencies can introduce error into later analysis
    Textual data
    • Mostly not useful; what is useful should be encoded in another form
    • This is best done by hand (e.g., keying UK, Wales, and Scotland to country code “44”)
    Missing values
    • Most software is not good at handling these
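The encoding and missing-value issues above can be sketched as a small cleaning step: normalize the inconsistent country labels to one code (using the UK/Wales/Scotland-to-"44" example from the slide) and convert empty strings to an explicit missing marker. The record layout is invented for illustration.

```python
# Sketch of data preparation: normalize inconsistent encodings of the
# same country to a single code, and flag missing values explicitly.
# The record fields are hypothetical.

COUNTRY_CODE = {"UK": "44", "Wales": "44", "Scotland": "44", "France": "33"}

def clean(record):
    out = dict(record)
    # Inconsistent encodings of the same country collapse to one code;
    # unknown or absent countries become None (explicitly missing)
    out["country_code"] = COUNTRY_CODE.get(record.get("country"))
    # Replace empty strings with None so later steps can treat
    # "missing" uniformly instead of as a valid value
    for key, value in out.items():
        if value == "":
            out[key] = None
    return out

rows = [{"name": "A", "country": "Wales"}, {"name": "B", "country": ""}]
print([clean(r) for r in rows])
```

In practice this hand-built mapping is exactly the "best done by hand" step the slide describes: someone has to decide which source values mean the same thing.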
  • 17. Knowledge discovery: takes two major forms
    Directed
    • The goal is to explain the value of some field (income, genetic information) or a specific relationship
    • Analysis seeks to estimate, classify, and predict the target field
    • This is an explanatory function: finding patterns in data to explain the past and predict the future
    • What type of person is likely to default on a loan? If these genetic markers are found, what future predispositions are indicated?
  • 18. Process
    • Identify available data sources (it's best to have preclassified data)
    • Prepare the data for analysis
      • Similar issues are involved
      • Also involves adding fields to the data to clarify what we take for granted but that software cannot; based on our experience with the data, this improves the chances of finding patterns
    • Training set: build the initial model
    • Test set: adjust to improve generality
    • Evaluation set: test the model
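The three-way split named above (training, test, evaluation) can be sketched in a few lines: shuffle the records, then carve off the three sets. The 60/20/20 proportions are an assumption for illustration, not from the slides.

```python
import random

# Sketch of the three-way split: a training set to build the model,
# a test set to tune it, and an evaluation (holdout) set to estimate
# the error rate on genuinely unseen data. The 60/20/20 proportions
# are an illustrative choice.

def three_way_split(records, seed=0, train=0.6, test=0.2):
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_test = int(len(shuffled) * test)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_test],
            shuffled[n_train + n_test:])

train_set, test_set, eval_set = three_way_split(list(range(100)))
print(len(train_set), len(test_set), len(eval_set))  # prints 60 20 20
```

Keeping the evaluation set untouched until the very end is what makes its error rate a fair estimate of performance on new data.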
  • 19. Process (cont.)
    • Building and training the model
      • Toss in as many variables as seem relevant and let the algorithm sort them out
      • The goal is to develop an explanation of the dependent (target) variable based on the independent (input) variables
      • The test set is used to minimize the problem of overfitting
    • Evaluating the model
      • The error rate on the evaluation set is a good indicator of the error rate with new data
  • 20. Undirected knowledge discovery
    • There is no target field to serve as the focus of analysis
    • The goal is to search for meaningful patterns; the question might be: what goes together?
    • Process
      • Similar to directed knowledge discovery
      • Identify potential targets for directed knowledge discovery analysis
      • At the end of the process, one result is often new variables
      • Generate new hypotheses to test
  • 21. Alternative methods of accessing digital information: Data mining
    I. Introduction
       • What is it?
    II. How does it work?
       • The virtuous circle of data mining
       • Techniques of data mining
    III. Data mining applications
       • What is it good for?
       • DM and CRM
  • 22. II. How does it work? The virtuous cycle of data mining
    • Transform data into useful information with DM
    • Act on the information
    • Measure the results to reuse the data
    • Identify problems where DM can provide value
  • 23. In business applications, data mining does not seek to replicate previous efforts
    • The goal is to discover new markets, not saturate old ones
    • In science, replication of results is more important
    Data mining is a creative activity
    • Many patterns will be found, but the art is in focusing on the meaningful ones
    Data mining results can change over time
    • Models can become less useful over time as data and markets change
  • 24. Characteristics of DM systems
    • The focus is on the analysis of current and historical data to predict future action
    • The analytic work depends on the flow of data (which is not regular)
    • Typically the emphasis is on working with large data sets
    • The purpose is to support decision making in business and hypothesis testing in science
    • Response times are slower due to the computing cycles involved in analysis
  • 25. Another way to think about it
    • Aggregate the data: prepare it in a common format
    • Find patterns in the data: there is a range of techniques that can be used
    • Respond to the patterns (what do they mean?): data → information
    • Act on the patterns: information → action
    • Action generates value
  • 26. [Figure: Building a DM model (flowchart)]
    • Main path: identify data requirements → obtain data → validate, explore, and clean data → transpose data → add derived variables → create model set → choose modeling technique → train model → check model performance → choose best model
    • Feedback loops: obtain more data if improvements are possible; revisit earlier steps if values don't look correct or if data are not available; rebuild if a new derived variable, a new segmentation, or a new technique improves performance
  • 27. Data mining depends on three main elements
    • DM techniques: algorithmic approaches to problem solving that are statistically based
    • Data: DM data should be clean, simple, and in a table with well-defined columns
    • Data modeling: a process of developing predictive models for directed DM; the method for building these models is based on principles of experimental design
  • 28. Techniques of DM: Automatic cluster detection
    • Used for undirected DM to find groupings in data
    • Algorithms sort the data set top-down (divisive) or bottom-up (agglomerative)
    • k-means cluster detection is a common method
      • Start with an arbitrary number of “seeds” (the initial cluster centers)
      • Assign the records to the closest seed (“centroid”)
      • Then recalculate the mean of each cluster and move its centroid there
      • Continue until each centroid is in the center of a cluster of records and there are clear boundaries
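The k-means steps above can be sketched in plain Python for 2-D points: pick k records as seeds, assign each record to its closest centroid, move each centroid to the mean of its cluster, and repeat until the assignments stop changing. The sample points are invented.

```python
import random

# Minimal k-means sketch following the steps described above.
# Sample points below form two obvious groups for illustration.

def kmeans(points, k, seed=0, max_iters=100):
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    centroids = rng.sample(points, k)  # arbitrary initial seeds
    clusters = []
    for _ in range(max_iters):
        # Assign each record to the closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Move each centroid to the mean of its cluster
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append((sum(p[0] for p in cluster) / len(cluster),
                                      sum(p[1] for p in cluster) / len(cluster)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's seed
        if new_centroids == centroids:  # boundaries are stable: stop
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1.2, 0.8), (8, 8), (8.2, 7.9)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))  # two centroids, one near each group of points
```

As the slides note, the choice of k and the arbitrary seeds matter; a real analysis would try several values and inspect the resulting boundaries.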
  • 29. [Figure: Results of a cluster analysis, showing three initial seeds (Seed 1-3) and the recalculated centroids (New centroid 1-3)]
  • 30. This is a discovery technique, but it is difficult to interpret
    • It shows that some records are closer together given the arbitrary starting points
    • The number of initial seeds is important: it should minimize the distance between members of a cluster and maximize the distance between clusters
    • The results should be combined with other techniques to see if they have any meaning
    • Use it when you think there are patterns that you can't see, or when there are too many patterns and you want to reduce complexity
  • 31. Decision trees
    • A classification tree labels records and assigns them to classes
    • The data are split iteratively until the groupings become useful; the point of the initial split is critical
    • Each branch cuts the space into two or more pieces and is a test on a record
    • Each record is tested at each node of the tree until it reaches the “leaf,” or terminal node (categorical: yes/no)
    • Is record X greater than Y? If yes, keep moving
  • 32. [Figure: A sample decision tree that splits on family income, mortgage, savings account, and home ownership, with a response rate at each node (e.g., 45% for high income, 8% for renters, 5% for owners)] http://www.spss.com/datamine/trees.htm
  • 33. Decision trees are good when the goal is to develop a set of rules to organize data for predictive purposes
    • This works best when the tree has a manageable number of branches
    • The rule is formulated by tracing the branches back to the root
    • They are not good for discovering relationships among variables, since each split in the tree is a test of a single variable
    • They also may produce errors when the training set is too small
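A small hand-coded tree shows both points above: each node tests a single variable, and the rule for any prediction can be read off by tracing the branches from the root to the leaf. The tree below is loosely modeled on the sample decision tree in the slides; its structure and response rates are illustrative, not the exact figure.

```python
# Hypothetical decision tree: each internal node tests one variable;
# each leaf holds a predicted response rate. Rates are illustrative.

tree = {
    "test": lambda r: r["income"],  # this node tests a single variable
    "branches": {
        "high": {"leaf": "respond 45%"},
        "medium": {
            "test": lambda r: "yes" if r["owns_home"] else "no",
            "branches": {"yes": {"leaf": "respond 13%"},
                         "no": {"leaf": "respond 16%"}},
        },
        "low": {"leaf": "respond 15%"},
    },
}

def predict(node, record, path=()):
    """Follow branches to a leaf; return the leaf and the rule path."""
    if "leaf" in node:
        return node["leaf"], list(path)
    value = node["test"](record)
    return predict(node["branches"][value], record, path + (value,))

label, rule = predict(tree, {"income": "medium", "owns_home": False})
print(label, rule)  # prints: respond 16% ['medium', 'no']
```

The returned `rule` is the trace back to the root: "if income is medium and the customer does not own a home, expect a 16% response."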
  • 34. Neural networks
    • They use data models that simulate the structure of the brain to generalize and learn from data
    • They learn from a set of inputs, adjusting the parameters of the model as new knowledge arrives, to find patterns in data
    • They fit a model to a set of historical data in order to classify or make predictions
    • They can find interaction effects among variables, and you do not need to have any specific model in mind when running the analysis
    • They require extensive preparation of data and a lengthy training period
  • 35. [Figure: Example of a neural network model, with inputs (floor space, size of garage, age of house, acreage, other factors) feeding an output node (appraised value)] Berry, M.J. and Linoff, G. (1997). Data Mining Techniques. Wiley. p. 290
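As a rough sketch of the kind of model in the figure, the code below computes a one-hidden-layer forward pass for an appraised-value example with two inputs. The weights are invented (a real network would learn them during the training period the slides mention), and the inputs are assumed pre-scaled to the 0-1 range.

```python
import math

# Sketch of a one-hidden-layer neural network forward pass.
# All weights and biases are invented for illustration; training
# would normally adjust them to fit historical data.

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# inputs: floor_space, age_of_house (both assumed pre-scaled to 0-1)
HIDDEN = [([0.8, -0.4], 0.1),   # (weights, bias) for hidden unit 1
          ([0.5, 0.9], -0.2)]   # (weights, bias) for hidden unit 2
OUTPUT = ([1.2, 0.7], 0.05)     # weights and bias for the output unit

def forward(inputs):
    """Inputs -> hidden layer -> single output (appraised value)."""
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
              for ws, b in HIDDEN]
    ws, b = OUTPUT
    return sum(w * h for w, h in zip(ws, hidden)) + b

print(round(forward([0.6, 0.3]), 3))
```

Note how the output is just layers of weighted sums and transformations; as the next slide points out, those weights offer no direct explanation of why the prediction is what it is.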
  • 36. Neural networks are useful in predicting a target variable when the data are highly non-linear, with interactions
    • A disadvantage is that it is difficult to interpret the resultant model, with its layers of weights and transformations
    • The result is a set of weights distributed throughout the network, and the weights provide no insight into why the solution is valid
    • This makes the use of neural nets a “black box” process
    • They are not very useful when the relationships in the data need to be explained
  • 37. Alternative methods of accessing digital information: Data mining
    I. Introduction
       • What is it?
    II. How does it work?
       • The virtuous circle of data mining
       • Techniques of data mining
    III. Data mining applications
       • What is it good for?
       • DM and CRM
  • 38. III. What is it good for? Data mining is used for:
    • Research: pharmaceutical companies use DM to predict which chemicals are likely to produce powerful drugs
    • Process improvement: using DM to determine thresholds for manufacturing (to separate good from bad product)
    • Marketing: learning about customers to refine and target marketing campaigns and save money
  • 39. The federal government uses DM to search for criminals and terrorists
    • Analyzing FBI field agent reports
    • Looking for patterns in international funds transfers
    Customer relationship management
    • Developing sophisticated customer profiles shared across the business
    • Learning from customer behavior
