Agenda
• What Data Mining IS and IS NOT
• Steps in the Data Mining Process
 – CRISP-DM
 – Explanation of Models
 – Examples of Data Mining Applications
• Questions
The Evolution of Data Analysis
• Data Collection (1960s)
 – Business question: "What was my total revenue in the last five years?"
 – Enabling technologies: computers, tapes, disks
 – Product providers: IBM, CDC
 – Characteristics: retrospective, static data delivery
• Data Access (1980s)
 – Business question: "What were unit sales in New England last March?"
 – Enabling technologies: relational databases (RDBMS), Structured Query Language (SQL), ODBC
 – Product providers: Oracle, Sybase, Informix, IBM, Microsoft
 – Characteristics: retrospective, dynamic data delivery at record level
• Data Warehousing & Decision Support (1990s)
 – Business question: "What were unit sales in New England last March? Drill down to Boston."
 – Enabling technologies: on-line analytic processing (OLAP), multidimensional databases, data warehouses
 – Product providers: SPSS, Comshare, Arbor, Cognos, Microstrategy, NCR
 – Characteristics: retrospective, dynamic data delivery at multiple levels
• Data Mining (Emerging Today)
 – Business question: "What’s likely to happen to Boston unit sales next month? Why?"
 – Enabling technologies: advanced algorithms, multiprocessor computers, massive databases
 – Product providers: SPSS/Clementine, Lockheed, IBM, SGI, SAS, NCR, Oracle, numerous startups
 – Characteristics: prospective, proactive information delivery
Results of Data Mining Include:
• Forecasting what may happen in the future
• Classifying people or things into groups by recognizing patterns
• Clustering people or things into groups based on their attributes
• Associating what events are likely to occur together
• Sequencing what events are likely to lead to later events
Data Mining Is Not
• Brute-force crunching of bulk data
• “Blind” application of algorithms
• Going to find relationships where none exist
• Presenting data in different ways
• A database-intensive task
• A difficult-to-understand technology requiring an advanced degree in computer science
Data Mining Is
• A hot buzzword for a class of techniques that find patterns in data
• A user-centric, interactive process which leverages analysis technologies and computing power
• A group of techniques that find relationships that have not previously been discovered
• Not reliant on an existing database
• A relatively easy task that requires knowledge of the business problem and subject matter expertise
Data Mining versus OLAP
• OLAP: On-line Analytical Processing
 – Provides you with a very good view of what is happening, but cannot predict what will happen in the future or explain why it is happening
Data Mining Versus Statistical Analysis
• Data Mining
 – Originally developed to act as expert systems to solve problems
 – Less interested in the mechanics of the technique
 – If it makes sense then let’s use it
 – Does not require assumptions to be made about data
 – Can find patterns in very large amounts of data
 – Requires understanding of data and business problem
• Data Analysis
 – Tests for statistical correctness of models
   • Are statistical assumptions of models correct?
     – E.g., is the R-square good?
 – Hypothesis testing
   • Is the relationship significant?
     – Use a t-test to validate significance
 – Tends to rely on sampling
 – Techniques are not optimised for large amounts of data
 – Requires strong statistical skills
Examples of What People are Doing with Data Mining:
• Fraud/Non-Compliance Anomaly detection
 – Isolate the factors that lead to fraud, waste and abuse
 – Target auditing and investigative efforts more effectively
• Credit/Risk Scoring
• Intrusion detection
• Parts failure prediction
• Recruiting/Attracting customers
• Maximizing profitability (cross selling, identifying profitable customers)
• Service Delivery and Customer Retention
 – Build profiles of customers likely to use which services
• Web Mining
How Can We Do Data Mining?
By Utilizing the CRISP-DM Methodology
 – a standard process
 – existing data
 – software technologies
 – situational expertise
Why Should There Be a Standard Process?
The data mining process must be reliable and repeatable by people with little data mining background.
• Framework for recording experience
 – Allows projects to be replicated
• Aid to project planning and management
• “Comfort factor” for new adopters
 – Demonstrates maturity of Data Mining
 – Reduces dependency on “stars”
Process Standardization
CRISP-DM:
• CRoss Industry Standard Process for Data Mining
• Initiative launched Sept. 1996
• SPSS/ISL, NCR, Daimler-Benz, OHRA
• Funding from European Commission
• Over 200 members of the CRISP-DM SIG worldwide
 – DM Vendors: SPSS, NCR, IBM, SAS, SGI, Data Distilleries, Syllogic, Magnify, ...
 – System Suppliers / consultants: Cap Gemini, ICL Retail, Deloitte & Touche, ...
 – End Users: BT, ABB, Lloyds Bank, AirTouch, Experian, ...
CRISP-DM
• Non-proprietary
• Application/Industry neutral
• Tool neutral
• Focus on business issues
 – As well as technical analysis
• Framework for guidance
• Experience base
 – Templates for Analysis
Why CRISP-DM?
• The data mining process must be reliable and repeatable by people with little data mining skill
• CRISP-DM provides a uniform framework for
 – guidelines
 – experience documentation
• CRISP-DM is flexible to account for differences
 – Different business/agency problems
 – Different data
Phases and Tasks
• Business Understanding
 – Determine Business Objectives: Background; Business Objectives; Business Success Criteria
 – Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
 – Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria
 – Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques
• Data Understanding
 – Collect Initial Data: Initial Data Collection Report
 – Describe Data: Data Description Report
 – Explore Data: Data Exploration Report
 – Verify Data Quality: Data Quality Report
• Data Preparation
 – Data Set: Data Set Description
 – Select Data: Rationale for Inclusion / Exclusion
 – Clean Data: Data Cleaning Report
 – Construct Data: Derived Attributes; Generated Records
 – Integrate Data: Merged Data
 – Format Data: Reformatted Data
• Modeling
 – Select Modeling Technique: Modeling Technique; Modeling Assumptions
 – Generate Test Design: Test Design
 – Build Model: Parameter Settings; Models; Model Description
 – Assess Model: Model Assessment; Revised Parameter Settings
• Evaluation
 – Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
 – Review Process: Review of Process
 – Determine Next Steps: List of Possible Actions; Decision
• Deployment
 – Plan Deployment: Deployment Plan
 – Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
 – Produce Final Report: Final Report; Final Presentation
 – Review Project: Experience Documentation
Phases in the DM Process (1 & 2)
• Business Understanding:
 – Statement of Business Objective
 – Statement of Data Mining objective
 – Statement of Success Criteria
• Data Understanding
 – Explore the data and verify the quality
 – Find outliers
Phases in the DM Process (3)
• Data preparation:
 – Usually takes over 90% of our time
   • Collection
   • Assessment
   • Consolidation and Cleaning
     – table links, aggregation level, missing values, etc.
   • Data selection
     – active role in ignoring non-contributory data?
     – outliers?
     – Use of samples
     – visualization tools
   • Transformations: create new variables
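Two of the preparation steps named on this slide, handling missing values and creating new variables, can be sketched in a few lines of Python. The records, the field names (`income`, `debt`, `debt_ratio`) and the choice of median imputation are assumptions made for illustration; they are not taken from the slides.

```python
from statistics import median

# Hypothetical customer records; None marks a missing value.
records = [
    {"income": 52000, "debt": 8000},
    {"income": None, "debt": 3000},
    {"income": 31000, "debt": 15500},
    {"income": 40000, "debt": None},
]

def impute_median(rows, field):
    """Cleaning: replace missing values in `field` with the median of the observed ones."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = median(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

def add_debt_ratio(rows):
    """Transformation: derive a new variable from two existing ones."""
    return [{**r, "debt_ratio": r["debt"] / r["income"]} for r in rows]

cleaned = impute_median(impute_median(records, "income"), "debt")
prepared = add_debt_ratio(cleaned)
```

Median imputation is only one of several reasonable choices; dropping incomplete records or flagging them with an extra indicator variable may suit other problems better.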
Phases in the DM Process (4)
• Model building
 – Selection of the modeling techniques is based upon the data mining objective
 – Modeling is an iterative process, different for supervised and unsupervised learning
   • May model for either description or prediction
Neural Networks
• Description
 – Difficult interpretation
 – Tends to ‘overfit’ the data
 – Extensive amount of training time
 – A lot of data preparation
 – Works with all data types
Rule Induction
• Description
 – Produces decision trees:
   • income < $40K
     – job > 5 yrs then good risk
     – job < 5 yrs then bad risk
   • income > $40K
     – high debt then bad risk
     – low debt then good risk
 – Or Rule Sets:
   • Rule #1 for good risk:
     – if income > $40K
     – if low debt
   • Rule #2 for good risk:
     – if income < $40K
     – if job > 5 years
[Figure: decision tree for credit ranking (1 = default) on 323 cases (52.01% bad, 47.99% good). First split on Paid Weekly/Monthly (P-value = 0.0000, Chi-square = 179.6665, df = 1): weekly pay 86.67% bad (165 cases), monthly salary 15.82% bad (158 cases). Further splits on Age Categorical and Social Class.]
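The bulleted decision tree on this slide maps directly onto nested conditions. The sketch below is one way to write it in Python; the function and parameter names are hypothetical, and the slide does not say what happens at exactly $40K or exactly 5 years, so the boundary handling here is an assumption.

```python
def credit_risk(income, years_in_job, high_debt):
    """Classify an applicant as 'good' or 'bad' risk using the slide's tree.

    Assumption: the boundary cases (income == 40000, years_in_job == 5)
    are not defined on the slide; this sketch sends them to the
    'income > $40K' branch and to 'bad risk' respectively.
    """
    if income < 40_000:
        # Lower-income branch: tenure in the job decides.
        return "good" if years_in_job > 5 else "bad"
    # Higher-income branch: debt level decides.
    return "bad" if high_debt else "good"

# The two "good risk" rules from the rule set are the two paths
# through the tree that end in "good".
examples = [
    credit_risk(55_000, 1, high_debt=False),  # Rule #1: income > $40K and low debt
    credit_risk(30_000, 8, high_debt=True),   # Rule #2: income < $40K and job > 5 years
]
```

Writing the tree this way shows why rule induction output is considered easy to interpret: each prediction can be traced to a short chain of readable conditions.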
Rule Induction
Description
• Intuitive output
• Handles all forms of numeric data, as well as non-numeric (symbolic) data
C5 algorithm: a special case of rule induction
• Target variable must be symbolic
Apriori
Description
• Seeks association rules in dataset
• ‘Market basket’ analysis
• Sequence discovery
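To make market basket analysis concrete, here is a minimal sketch of the level-wise Apriori idea: count itemsets of size 1, keep the frequent ones, and build larger candidates only from them. The basket contents and the 0.6 support threshold are invented for illustration, and this toy version makes no attempt at the efficiency of a production implementation.

```python
from itertools import combinations

def apriori(baskets, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    baskets = [frozenset(b) for b in baskets]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item in the itemset.
        return sum(itemset <= b for b in baskets) / n

    current = {frozenset([item]) for b in baskets for item in b}
    frequent = {}
    k = 1
    while current:
        level = {c: support(c) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        k += 1
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]

# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = apriori(baskets, min_support=0.6)
beer_to_diapers = confidence(frequent, {"beer"}, {"diapers"})
```

In this toy data every basket containing beer also contains diapers, so the rule "beer then diapers" has the highest possible confidence, while the reverse rule is weaker because diapers also appear without beer.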
Kohonen Network
Description
• Unsupervised
• Seeks to describe dataset in terms of natural clusters of cases
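The clustering behaviour can be illustrated with a deliberately tiny Kohonen map: a one-dimensional row of units, each holding a single weight, trained on scalar data. Everything specific here (two units, the starting weights, the learning rate, the linearly shrinking neighbour influence, the data values) is an assumption chosen to keep the sketch short; real Kohonen networks usually use a two-dimensional grid and multi-dimensional inputs.

```python
def best_matching_unit(weights, x):
    """Index of the unit whose weight is closest to the input x."""
    return min(range(len(weights)), key=lambda j: abs(weights[j] - x))

def train_som(data, weights, epochs=10, lr=0.5, max_neighbor=0.3):
    """Train a toy 1-D Kohonen map on scalar data and return the unit weights."""
    weights = list(weights)
    for epoch in range(epochs):
        # Neighbour influence shrinks linearly to zero by the last epoch.
        neighbor = max_neighbor * (1 - epoch / (epochs - 1))
        for x in data:
            winner = best_matching_unit(weights, x)
            for j in range(len(weights)):
                distance = abs(j - winner)
                if distance == 0:
                    influence = 1.0       # the winning unit moves the most
                elif distance == 1:
                    influence = neighbor  # adjacent units are dragged along
                else:
                    influence = 0.0       # distant units are left alone
                weights[j] += lr * influence * (x - weights[j])
    return weights

# Hypothetical data with two natural clusters; no labels are supplied.
data = [1.0, 1.2, 0.8, 9.0, 9.2, 8.8]
weights = train_som(data, weights=[4.0, 6.0])
clusters = [best_matching_unit(weights, x) for x in data]
```

After training, each case is assigned to its best matching unit, so the units act as cluster labels that were discovered rather than provided.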
Phases in the DM Process (5)
• Model Evaluation
 – Evaluation of model: how well it performed on test data
 – Methods and criteria depend on model type:
   • e.g., coincidence matrix with classification models, mean error rate with regression models
 – Interpretation of model: whether it matters, and how easy or hard it is, depends on the algorithm
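Both evaluation criteria mentioned on this slide are simple to compute on held-out test data. The sketch below builds a coincidence (confusion) matrix for a classifier and a mean error for a regression model; the labels and values are made up, and reading "mean error rate" as mean absolute error is an assumption, since the slide does not define it.

```python
from collections import Counter

def coincidence_matrix(actual, predicted):
    """Count each (actual, predicted) pair; the diagonal holds the correct calls."""
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    """Share of test cases where the prediction matches the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def mean_absolute_error(actual, predicted):
    """Average size of the prediction error for a numeric target."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical test-set results for a classification model.
actual_class = ["good", "good", "bad", "bad", "good"]
predicted_class = ["good", "bad", "bad", "bad", "good"]
matrix = coincidence_matrix(actual_class, predicted_class)

# Hypothetical test-set results for a regression model.
actual_value = [10.0, 12.0, 9.0]
predicted_value = [11.0, 12.0, 7.0]
error = mean_absolute_error(actual_value, predicted_value)
```

The matrix is often more informative than accuracy alone because it shows which kind of mistake the model makes, for example good risks wrongly rejected versus bad risks wrongly accepted.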
Phases in the DM Process (6)
• Deployment
 – Determine how the results need to be utilized
 – Who needs to use them?
 – How often do they need to be used?
• Deploy Data Mining results by:
 – Scoring a database
 – Utilizing results as business rules
 – Interactive scoring on-line
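Scoring a database means applying a finished model to every stored record and writing the result back. The sketch below does this with Python's built-in `sqlite3` module and an in-memory database; the table layout, the column names and the `score` rule (borrowed from the credit-risk example on the Rule Induction slide) are assumptions standing in for a real model and a real customer table.

```python
import sqlite3

def score(income, years_in_job, high_debt):
    """Hypothetical business rule standing in for a trained model."""
    if income < 40_000:
        return "good" if years_in_job > 5 else "bad"
    return "bad" if high_debt else "good"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, income REAL, years_in_job REAL, "
    "high_debt INTEGER, risk TEXT)"
)
conn.executemany(
    "INSERT INTO customers (id, income, years_in_job, high_debt) VALUES (?, ?, ?, ?)",
    [(1, 30000, 7, 0), (2, 30000, 2, 0), (3, 55000, 3, 1)],
)

# Register the Python function so SQL can call it, then score every row.
conn.create_function("score", 3, score)
conn.execute("UPDATE customers SET risk = score(income, years_in_job, high_debt)")
scored = dict(conn.execute("SELECT id, risk FROM customers"))
```

The same `score` function could also be called once per request for interactive on-line scoring, which is the third deployment option on the slide.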
What data mining has done for...
The US Internal Revenue Service needed to improve customer service and...
Scheduled its workforce to provide faster, more accurate answers to questions.
What data mining has done for...
The US Drug Enforcement Administration needed to be more effective in its drug “busts” and...
Analyzed suspects’ cell phone usage to focus investigations.
What data mining has done for...
HSBC needed to cross-sell more effectively by identifying profiles that would be interested in higher-yielding investments and...
Reduced direct mail costs by 30% while garnering 95% of the campaign’s revenue.
Final Comments
• Data Mining can be utilized in any organization that needs to find patterns or relationships in its data.
• By using the CRISP-DM methodology, analysts can have a reasonable level of assurance that their Data Mining efforts will render useful, repeatable, and valid results.