DMDW Lesson 05 + 06 + 07 - Data Mining Applied

STUDIEREN UND DURCHSTARTEN. Author I: Dip.-Inf. (FH) Johannes Hoppe Author II: M.Sc. Johannes Hofmeister Author III: Prof. Dr. Dieter Homeister Date: 01.04.2011 08.04.2011 15.04.2011


01 Applications of Data Mining

Applications of Data Mining: Database marketing; time-series prediction and detecting "trends"; detection (of whatever is detectable); probability estimation; information compression; sensitivity analysis.

Applications of Data Mining: Database Marketing (1/2). Response modeling: build a model of how specific customers respond, systematically select (existing and potential) customers, and base advertisements and promotions on the results (CRM). Visualization: a "lift chart" shows how successful the selection should be (later topic: DM validation).

Lift Chart Example: "For contacting 10% of customers, using no model we should get 10% of responders and using the given model we should get 30% of responders."
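The quoted example corresponds to a lift of 3 (30% of responders captured by contacting 10% of customers). A minimal sketch of how one point on a lift chart could be computed is shown here; the function name `lift_at`, the scores, and the response flags are hypothetical illustration data, not taken from the slides.

```python
def lift_at(scores, responded, fraction):
    """Return (share of all responders captured, lift) when only the
    top `fraction` of customers, ranked by model score, is contacted."""
    ranked = sorted(zip(scores, responded), key=lambda pair: pair[0], reverse=True)
    n_contacted = max(1, round(len(ranked) * fraction))
    captured = sum(flag for _, flag in ranked[:n_contacted])
    share = captured / sum(responded)
    # Without a model, contacting `fraction` of customers is expected
    # to reach the same fraction of responders, so lift = share / fraction.
    return share, share / fraction


# Hypothetical data: 10 customers, 4 of whom responded (1 = responded).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
responded = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

share, lift = lift_at(scores, responded, fraction=0.2)
print(f"share of responders: {share:.0%}, lift: {lift:.1f}")
```

Repeating the calculation for fractions from 10% to 100% gives the points of the full chart.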

Applications of Data Mining: Database Marketing (2/2). Cross selling: selling additional products to existing customers. Question: which customer might buy which other product? Uses historical purchase data as well as credit card information, lifestyle data, demographic data, etc. Other possible information: Did the customer request special information? How did the customer hear of the company?

Applications of Data Mining: Database Marketing (2/2, continued). Cross selling: selling additional products to existing customers. The results feed direct marketing, mailing lists, and direct advertising. Example Amazon: "Customers who bought this item also bought" and "personalized recommendations".

Applications of Data Mining: Time-series prediction. Time series: stock prices, market shares, … Extrapolation of future values. Detection of newly arising trends, such as customers moving to other products. Own experience: German print magazines.
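As an illustration of "extrapolation of future values", the following sketch fits a straight line to a series by ordinary least squares and continues it. This is only one simple technique chosen for illustration, not necessarily what a production time-series algorithm does; the function name and the sample values are hypothetical.

```python
def linear_trend_forecast(values, steps_ahead):
    """Fit y = intercept + slope * t by least squares (t = 0, 1, 2, ...)
    and return the extrapolated values for the next `steps_ahead` periods."""
    n = len(values)
    times = range(n)
    mean_t = sum(times) / n
    mean_y = sum(values) / n
    slope = (
        sum((t - mean_t) * (y - mean_y) for t, y in zip(times, values))
        / sum((t - mean_t) ** 2 for t in times)
    )
    intercept = mean_y - slope * mean_t
    return [intercept + slope * (n - 1 + k) for k in range(1, steps_ahead + 1)]


# Hypothetical market-share series (in percent) over four periods.
print(linear_trend_forecast([10, 12, 14, 16], steps_ahead=2))
```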

Applications of Data Mining: Detection. Identification of the existence or occurrence of a condition. Fraud detection: identifying patterns/criteria to detect credit card fraud. Estimating creditworthiness (cf. the German Schufa). Predicting mail orders that will not be paid.

Applications of Data Mining: Detection. Identification of the existence or occurrence of a condition. Intrusion detection (in computer networks): find patterns that indicate when an attack is made on a network, e.g. by clustering: small clusters are of high interest because they point to unusual cases. Defining classes may be useful, e.g. harmless, possibly harmful, harmful, immediately close LAN.

Applications of Data Mining: Detection. Identification of the existence or occurrence of a condition. Typical difficulties: needs knowledge; DM costs; cost of missing a fraud; cost of false positives (e.g. falsely accusing someone of fraud, company image problems).
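The trade-off between missed frauds and false positives can be made concrete by pricing both kinds of error. The sketch below is an assumption-laden illustration: the scores, labels, threshold values, and cost figures are all hypothetical.

```python
def total_cost(scores, is_fraud, threshold, cost_missed, cost_false_alarm):
    """Sum the cost of both error types when every case with a
    score >= threshold is flagged as suspected fraud."""
    cost = 0
    for score, fraud in zip(scores, is_fraud):
        flagged = score >= threshold
        if fraud and not flagged:
            cost += cost_missed        # fraud that slipped through
        elif flagged and not fraud:
            cost += cost_false_alarm   # innocent customer accused
    return cost


# Hypothetical model scores and true labels for four transactions.
scores = [0.9, 0.6, 0.4, 0.2]
is_fraud = [True, False, True, False]

for threshold in (0.5, 0.3):
    print(threshold, total_cost(scores, is_fraud, threshold,
                                cost_missed=100, cost_false_alarm=10))
```

With these made-up costs, the lower threshold is cheaper because a missed fraud is priced ten times higher than a false alarm; with a different cost ratio the preferred threshold can change.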

Applications of Data Mining: Probability Estimation. Approximate the likelihood of an event given an observation, e.g. to classify a potential customer into an A, B, or C range before doing any business.

Applications of Data Mining: Information Compression. Can be viewed as a special type of estimation problem: for a given set of data, estimate the key components that can be used to construct the data.

Applications of Data Mining: Sensitivity Analysis. Understand how changes in one variable affect others. Identify the sensitivity of one variable to another (find out whether dependencies exist).
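One simple way to check whether a dependency between two variables exists is the Pearson correlation coefficient. It is used here only as an illustrative stand-in and captures linear dependence only; the variable names and sample values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: +1 or -1 for a perfect linear dependency,
    values near 0 when no linear dependency is visible."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical observations: price level versus units sold.
price = [1, 2, 3, 4]
units_sold = [8, 6, 4, 2]
print(pearson(price, units_sold))
```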

Data Mining Algorithms: Different algorithms have different uses and can be combined. The choice of algorithm depends on what you want to do; not every algorithm is suited to every task.

Data Mining Algorithms: Algorithm groups in SSAS. Classification algorithms; regression algorithms; association algorithms; segmentation algorithms; sequence analysis algorithms; plug-in algorithms.

Data Mining Algorithms: Classification algorithms. Predict discrete attributes, based on empirical values (past experience). Algorithms in SSAS: Naive Bayes, Decision Trees, Neural Networks.

Data Mining Algorithms: Regression algorithms. Predict continuous attributes; otherwise the same as classification algorithms. Algorithms in SSAS: Linear Regression (line), Logistic Regression (curve), MS Time Series.

Data Mining Algorithms: Association algorithms. Predict likely combinations; find elements that occur in combination. Algorithm in SSAS: MS Association Algorithm (Apriori).
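Association algorithms in the Apriori family rate item combinations by support (how often the items occur together) and rules by confidence (how often the consequent appears when the antecedent does). The sketch below computes just these two measures on hypothetical baskets; it is not the SSAS implementation and omits the candidate-generation steps of Apriori.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    wanted = set(items)
    return sum(1 for basket in baskets if wanted <= set(basket)) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)


# Hypothetical purchase data.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]

print(support(baskets, {"bread", "butter"}))
print(confidence(baskets, {"bread"}, {"butter"}))
print(confidence(baskets, {"butter"}, {"bread"}))
```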

Data Mining Algorithms: Segmentation algorithms. Also called "clustering algorithms". Group data with similar properties. Algorithms in SSAS: MS Clustering algorithms (e.g. K-Means).

Data Mining Algorithms: Sequence analysis algorithms. These are clustering algorithms that take the ordering (the sequence of values) into account while clustering. They do not group by similar properties but by similar sequences. Algorithm in SSAS: MS Sequence Clustering.

Data Mining Algorithms: Plug-in algorithms. .NET wrapper for COM objects. Use ANY algorithm. Provided as an assembly (possible workshop to create one).

03 Repetition - Data Types, Content Types

Repetition - Data Types, Content Types: Applying an algorithm. Data types; content types.

Repetition - Data Types, Content Types: Data types define the structure of the values. Available data types: Text, Long, Boolean, Double, Date.

Repetition - Data Types, Content Types: Content types define the behaviour of values: Discrete, Continuous, Discretized, Key, Key Sequence, Key Time, Ordered, Cyclical.

Repetition - Data Types, Content Types: Content type Discrete. Fixed set of values. Examples: Commute Distance: 1-2, 2-5, 5-10; Region: Pacific, Northern America, Europe; Name: … … … Boolean values are always discrete. Text is most likely discrete.

Repetition - Data Types, Content Types: Content type Continuous. Unlimited set of values; infinitely many items are possible. Examples: Income, Age. The difference between Continuous and Discrete is the most important one.

Repetition - Data Types, Content Types: Content type Discretized. Continuous values converted into discrete values. Examples: Income to categories: A, B, C, …; Age to groups: 0-20, 21-30, 31-40, …
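Discretization can be sketched as a lookup of the bucket a continuous value falls into. The example below uses only the three age groups named on the slide; because the remaining groups are elided there, values outside 0-40 raise an error instead of being assigned to an invented group. The function and constant names are hypothetical.

```python
import bisect

# Upper bounds (inclusive) and labels of the age groups listed on the slide.
AGE_UPPER_BOUNDS = [20, 30, 40]
AGE_GROUPS = ["0-20", "21-30", "31-40"]


def discretize(value, lower, upper_bounds, labels):
    """Map a continuous value to the label of the bucket containing it."""
    index = bisect.bisect_left(upper_bounds, value)
    if value < lower or index >= len(labels):
        raise ValueError(f"{value} is outside the listed buckets")
    return labels[index]


for age in (18, 20, 21, 35):
    print(age, discretize(age, 0, AGE_UPPER_BOUNDS, AGE_GROUPS))
```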

Repetition - Data Types, Content Types: Content type Key. Key: uniquely identifies a row. Key Sequence (sequence clustering models): a sorted series of events. Key Time (time series models): identifies values on a time scale.

Repetition - Data Types, Content Types: Content type Ordered. Discrete values that have a sorting order. No distances visible; no relations visible. Example: "One Star" to "Five Stars".

Repetition - Data Types, Content Types: Content type Cyclical. Discrete values that have a cyclical sorting order. Examples: Weekdays: Monday, Tuesday, …, Sunday, Monday, … (1, 2, 3, …, 7, 1, …); Months: Jan, Feb, Mar, …, Dec, Jan, … (1, 2, 3, …, 12, 1, …).
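The wrap-around behaviour of cyclical values can be illustrated with modulo arithmetic. This is a small sketch with hypothetical helper names; it is not how SSAS stores cyclical attributes.

```python
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def next_value(values, current):
    """Successor in a cyclical order: the last value wraps to the first."""
    return values[(values.index(current) + 1) % len(values)]


def cyclic_distance(values, a, b):
    """Shortest number of steps between a and b, going either way round."""
    direct = abs(values.index(a) - values.index(b))
    return min(direct, len(values) - direct)


print(next_value(WEEKDAYS, "Sunday"))
print(cyclic_distance(WEEKDAYS, "Monday", "Sunday"))
```

In a merely ordered attribute, Monday and Sunday would be the two values furthest apart; in a cyclical one they are neighbours.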

04 Data Mining Algorithms - Decision Trees

Applied Data Mining - Decision Trees

Applied Data Mining - Decision Trees: In General. Also known as classification trees. Goal: sequentially partition the data. Can detect non-linear relationships. Machine learning technique: separate the data into a training set and a test set. The training set is used to create the model based on certain criteria; the test set is used to verify the model.
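Separating data into a training set and a test set can be sketched as a shuffled split. The 30% test fraction, the seed, and the function name are assumptions for illustration; the slides do not prescribe a ratio.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and cut it into (training set, test set)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed: reproducible split
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for ten customer records
training_set, test_set = train_test_split(rows)
print(len(training_set), len(test_set))
```

The model is then built from `training_set` only, and `test_set` is held back to verify it on cases it has not seen.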

Applied Data Mining - Decision Trees: Tree for the response to a mailing action
  All: 2.6% response rate (total: 10,000 persons)
    Male: 3.2% (total: 4,677)
      Income > $30,000: 3.6%
      Income < $30,000: 2.3%
    Female: 2.1% (total: 5,323)
      Age > 40: 3.8%
      Age < 40: 3.2%

Applied Data Mining - Decision Trees: Using the Trained Tree. Example: the management decides to mail only to groups with a response rate > 3.5%. According to the trained tree, these are males with an income above $30,000 and females aged over 40.
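Applying the trained tree amounts to walking from the root to a leaf and comparing the leaf's response rate with the management threshold. The sketch below hard-codes the leaf rates from the mailing example; the dictionary keys are hypothetical, and because the slide only states ">" and "<", treating a value exactly on the boundary as the lower branch is an assumption.

```python
def response_rate(person):
    """Leaf response rate (in percent) from the mailing-action tree."""
    if person["sex"] == "male":
        return 3.6 if person["income"] > 30000 else 2.3
    return 3.8 if person["age"] > 40 else 3.2


def should_mail(person, threshold=3.5):
    """Mail only to groups whose response rate exceeds the threshold."""
    return response_rate(person) > threshold


# Hypothetical persons.
print(should_mail({"sex": "male", "income": 45000, "age": 30}))
print(should_mail({"sex": "female", "income": 45000, "age": 35}))
```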

Applied Data Mining - Decision Trees: Pros: very flexible, white-box model; KISS (keep it simple, stupid!); little preparation and few resources needed. Cons: can be tuned to death; takes a long time to build; requires wisely selected training data (false training yields false results); a big tree might require disk swapping (computation might be difficult if the tree does not fit into main memory).

Project: "DMDW Mining Test"

Project: "DMDW Mining Test" (explanation of one node)

Project: "DMDW Mining Test" (shows connections, more useful if there are more predictable values)

Project: "DMDW Mining Test" (Generic Content Tree Viewer; DMX (Data Mining Extensions))

References for Decision Trees:
Olivia Parr Rud et al.: Data Mining Cookbook - Modeling Data for Marketing, Risk, and Customer Relationship Management, Wiley, 2001
David A. Grossman, Ophir Frieder: Introduction to Data Mining, Illinois Institute of Technology, 2005
Andrew W. Moore: Decision Trees, Carnegie Mellon University, http://www.autonlab.org/tutorials/dtree16.pdf
Nong Ye (ed.): The Handbook of Data Mining, Lawrence Erlbaum Associates, 2003
Sushmita Mitra, Tinku Acharya: Data Mining - Multimedia, Soft Computing and Bioinformatics, Wiley, 2003
http://en.wikipedia.org/wiki/Classification_tree

05 Data Mining Algorithms - Clustering

Data Mining Algorithms - Clustering [figure; labels: X, 1, 2]

Data Mining Algorithms - Clustering: Clustering is a segmentation algorithm. Find homogeneous groups within a set. Find similar variables for different cases. Identify new relationships that were unclear before (heuristics), e.g. "A person who rides a bike to work doesn't live far from his workplace" (this is not obvious).

[Figure: homogeneous subsets, independent variables, description of class; arrows labeled "identify" and "classify"; labels: X, 1, 2]

[Figure: the same diagram with the two steps marked: 1. Clustering (identify), 2. Classification (classify)]

Clustering: 1. Clustering. Reduces data to classes of equal types. Become friends with the data. Iterative algorithm: Clustering, Validate, Classify, Apply. Source: http://msdn.microsoft.com/en-us/library/ms174879.aspx

Data Mining Algorithms - Clustering: 2. Classification. Create a description of a group and give it a "name". Also called characterization.

Process: Start with random values. Rerunning will create different sets and different groups; a different clustering technique or algorithm will also create different groups. Rerun on the same dataset with a new seed. An expert evaluates the classes found and their plausibility. Good classes are used for predictions. Flow: 1. Clustering; evaluate and check (good?); 2. Classify; apply (predict).
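The "start with random values, then rerun with a new seed" step can be sketched with a tiny one-dimensional K-Means. This is a simplified illustration, not the MS Clustering implementation; the function name, the fixed iteration count, and the data are hypothetical.

```python
import random


def k_means_1d(points, k, seed, iterations=10):
    """Minimal 1-D K-Means: random start values, then alternate between
    assigning points to the nearest center and recomputing the centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random start values depend on the seed
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers


points = [1, 2, 3, 10, 11, 12]
for seed in (0, 1, 2):
    print(seed, sorted(k_means_1d(points, k=2, seed=seed)))
```

On such clearly separated toy data every seed ends at the same two centers (only the cluster numbering may differ). On real data the results can differ between runs, which is why the process calls for re-seeding and for an expert to check plausibility.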

Clustering: MS Clustering Algorithm. Combination of two algorithms. K-Means (hard): a data point can be in only one cluster. Expectation Maximization (soft): a data point can belong to different clusters, and a probability is calculated for each combination. Source: http://msdn.microsoft.com/en-us/library/cc280445.aspx
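The difference between hard and soft assignment can be shown for a single data point and fixed cluster centers. The soft variant below uses normalized Gaussian weights with an assumed common spread `sigma`, which resembles the assignment step of Expectation Maximization but is not a full EM implementation; all names and numbers are hypothetical.

```python
import math


def hard_assign(point, centers):
    """K-Means style: the point belongs to exactly one (the nearest) cluster."""
    return min(range(len(centers)), key=lambda i: abs(point - centers[i]))


def soft_assign(point, centers, sigma=1.0):
    """EM style: one membership probability per cluster, summing to 1."""
    weights = [math.exp(-((point - c) ** 2) / (2 * sigma ** 2)) for c in centers]
    total = sum(weights)
    return [w / total for w in weights]


centers = [0.0, 10.0]
print(hard_assign(4.0, centers))
print(soft_assign(4.0, centers, sigma=5.0))
```

A point exactly halfway between two centers still gets a single cluster under hard assignment, whereas soft assignment reports equal probabilities for both.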

Clustering: Pros: no predictable variable to choose; trains itself without much effort; easy to configure. "Cons": interpretation is everything; a good eye is needed; an expert has to check for plausibility.

Project: "DMDW Mining Test" (strongest relations only, number of matching cases for Region Europe)

Project: "DMDW Mining Test" (good to know: continuous attributes are shown by their arithmetic mean)

Project: "DMDW Mining Test" (comparing two clusters)

THANK YOU FOR YOUR ATTENTION
