  1. Data Mining: The Next Revolution in Institutional Research
     C. R. Thulasi Kumar
     Office of Information Management & Analysis
     University of Northern Iowa
     May 31, 2004
  2. The Evolution of Data Analysis
     Data Collection (1960s)
       Business question:     "What was my total revenue in the last five years?"
       Enabling technologies: Computers, tapes, disks
       Product providers:     IBM, CDC
       Characteristics:       Retrospective, static data delivery
     Data Access (1980s)
       Business question:     "What were unit sales in New England last March?"
       Enabling technologies: Relational databases (RDBMS), Structured Query Language (SQL), ODBC
       Product providers:     Oracle, Sybase, Informix, IBM, Microsoft
       Characteristics:       Retrospective, dynamic data delivery at record level
     Data Warehousing & Decision Support (1990s)
       Business question:     "What were unit sales in New England last March? Drill down to Boston."
       Enabling technologies: On-line analytic processing (OLAP), multidimensional databases, data warehouses
       Product providers:     SPSS, Comshare, Arbor, Cognos, Microstrategy, NCR
       Characteristics:       Retrospective, dynamic data delivery at multiple levels
     Data Mining (Emerging Today)
       Business question:     "What’s likely to happen to Boston unit sales next month? Why?"
       Enabling technologies: Advanced algorithms, multiprocessor computers, massive databases
       Product providers:     SPSS/Clementine, Lockheed, IBM, SGI, SAS, NCR, Oracle, numerous startups
       Characteristics:       Prospective, proactive information delivery
     Source: SPSS BI
  3. What is Data Mining?
     The exploration and analysis of large quantities of data in order to discover meaningful patterns and rules (Berry and Linoff).
     The process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories and by using pattern recognition technologies as well as statistical and mathematical techniques (The Gartner Group).
     The nontrivial extraction of implicit, previously unknown, and potentially useful information from data (Frawley, Piatetsky-Shapiro and Matheus).
  4. Differences between Statistics and Data Mining
     STATISTICS                   DATA MINING
     Confirmative                 Explorative
     Small data sets/File-based   Large data sets/Databases
     Small number of variables    Large number of variables
     Deductive                    Inductive
     Numeric data                 Numeric and non-numeric
     Clean data                   Data cleaning
  5. Why Data Mining?
     Too much data
       Too many records
       Too many variables
     Interesting patterns are difficult to find with traditional statistics, due to
       Complex nonlinear relationships
       Multi-variable combinations
     Source: Abbott, Data Mining: Level II
  6. Data Mining is not…
     OLAP
     Data Warehousing
     Data Visualization
     SQL
     Ad Hoc Queries
     Reporting
  7. Data Mining Algorithms
     Statistics
       Distributions, mathematics, etc.
     Machine Learning
       Computer science, heuristics and induction algorithms
     Artificial Intelligence
       Emulating human intelligence
     Neural Networks
       Biological models, psychology and engineering
  8. Data Mining is…
     Predictive Modeling
       Linear/Logistic Regression
       Neural Networks
       Decision Trees
     Clustering
       Kohonen Neural Networks Clustering
       K-Means Clustering
       Nearest Neighbor Clustering
  9. Data Mining is…(cont’d)
     Segmentation
     Decision Trees
     Neural Networks
     Predictive Modeling
     Affinity Analysis
     Association Rule
     Sequence Generators
     [Figure: decision tree for Credit ranking (1=default); each node shows category, %, n]
     All cases: Bad 52.01% (168), Good 47.99% (155), Total 100.00% (323)
     Split on Paid Weekly/Monthly (P-value=0.0000, Chi-square=179.6665, df=1)
       Weekly pay: Bad 86.67% (143), Good 13.33% (22), Total 51.08% (165)
         Split on Age Categorical (P-value=0.0000, Chi-square=30.1113, df=1)
           Young (< 25); Middle (25-35): Bad 90.51% (143), Good 9.49% (15), Total 48.92% (158)
           Old (> 35): Bad 0.00% (0), Good 100.00% (7), Total 2.17% (7)
       Monthly salary: Bad 15.82% (25), Good 84.18% (133), Total 48.92% (158)
         Split on Age Categorical (P-value=0.0000, Chi-square=58.7255, df=1)
           Young (< 25): Bad 48.98% (24), Good 51.02% (25), Total 15.17% (49)
             Split on Social Class (P-value=0.0016, Chi-square=12.0388, df=1)
               Management; Clerical: Bad 0.00% (0), Good 100.00% (8), Total 2.48% (8)
               Professional: Bad 58.54% (24), Good 41.46% (17), Total 12.69% (41)
           Middle (25-35); Old (> 35): Bad 0.92% (1), Good 99.08% (108), Total 33.75% (109)
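
The slide does not say which chi-square statistic the tree software reports for each split. As a hedged check, the sketch below recomputes the first split (Paid Weekly/Monthly) from the counts shown on the slide, using two common candidates. The function names are illustrative and not taken from any particular tool. With these counts the likelihood-ratio statistic comes out close to the slide's 179.6665, while the Pearson statistic is noticeably lower (about 162.3), which suggests, but does not prove, that the likelihood-ratio form was used.

```python
import math

def expected_counts(table):
    """Expected cell counts under independence of rows and columns."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def likelihood_ratio_chi2(table):
    """Likelihood-ratio (G) statistic: 2 * sum(O * ln(O / E))."""
    expected = expected_counts(table)
    g = 0.0
    for obs_row, exp_row in zip(table, expected):
        for observed, exp in zip(obs_row, exp_row):
            if observed > 0:  # a zero cell contributes nothing to the sum
                g += observed * math.log(observed / exp)
    return 2.0 * g

def pearson_chi2(table):
    """Pearson statistic: sum((O - E)^2 / E)."""
    expected = expected_counts(table)
    return sum(
        (observed - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for observed, exp in zip(obs_row, exp_row)
    )

# Counts read off the slide: rows are pay frequency, columns are (Bad, Good).
pay_split = [
    [143, 22],   # Weekly pay
    [25, 133],   # Monthly salary
]

g_pay = likelihood_ratio_chi2(pay_split)
p_pay = pearson_chi2(pay_split)
print(f"likelihood-ratio chi-square: {g_pay:.4f}")
print(f"Pearson chi-square:          {p_pay:.4f}")
```

Both statistics have one degree of freedom for a 2x2 table, which agrees with the df=1 shown on the slide.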
  10. Kohonen Network
      Seeks to describe dataset in terms of natural clusters of cases
      Source: SPSS BI
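
To make "natural clusters of cases" concrete, here is a minimal sketch. Note that it uses k-means, one of the other clustering methods the deck lists, rather than a Kohonen network, because k-means fits in a few lines. The GPA values are hypothetical and chosen only for illustration, and the function name `kmeans_1d` is an assumption of this example.

```python
def kmeans_1d(values, k, iterations=20):
    """Plain k-means on a single numeric variable (requires k >= 2).

    Returns (centers, labels), where labels[i] is the cluster index of values[i].
    """
    ordered = sorted(values)
    # Deterministic start: spread the initial centers evenly over the sorted data.
    centers = [ordered[(len(ordered) - 1) * i // (k - 1)] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each case joins its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:  # converged: nothing moved
            break
        centers = new_centers
    return centers, labels

# Hypothetical first-term GPAs for six students (illustration only).
gpas = [1.8, 2.0, 2.2, 3.4, 3.6, 3.8]
centers, labels = kmeans_1d(gpas, k=2)
print(centers, labels)
```

With this data the two groups separate cleanly into a lower and a higher GPA cluster. Real institutional data would have many variables and would need scaling before clustering.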
  11. Apriori
      Seeks association rules in dataset
      “Market Basket” analysis
      Sequence discovery
      Source: SPSS BI
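
The idea behind association rules can be sketched with support and confidence counts. The code below is a simplified illustration covering only itemsets of size one and two, not a full Apriori implementation, and it is not the SPSS implementation. The baskets and the helper names `frequent_itemsets` and `confidence` are hypothetical.

```python
from itertools import combinations

def frequent_itemsets(baskets, min_support):
    """Frequent 1- and 2-itemsets with their support (fraction of baskets)."""
    n = len(baskets)

    def support(itemset):
        return sum(1 for b in baskets if itemset <= b) / n

    items = sorted({item for b in baskets for item in b})
    singles = {}
    for item in items:
        sup = support(frozenset([item]))
        if sup >= min_support:
            singles[frozenset([item])] = sup

    # Apriori pruning: a pair can only be frequent if both of its items are.
    frequent_items = sorted(next(iter(s)) for s in singles)
    pairs = {}
    for a, b in combinations(frequent_items, 2):
        sup = support(frozenset([a, b]))
        if sup >= min_support:
            pairs[frozenset([a, b])] = sup
    return singles, pairs

def confidence(baskets, antecedent, consequent):
    """Share of baskets containing the antecedent that also contain the consequent."""
    both = sum(1 for b in baskets if antecedent | consequent <= b)
    ante = sum(1 for b in baskets if antecedent <= b)
    return both / ante if ante else 0.0

# Hypothetical market baskets (illustration only).
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]
singles, pairs = frequent_itemsets(baskets, min_support=0.5)
rule_conf = confidence(baskets, {"bread"}, {"milk"})
print(pairs, rule_conf)
```

Here "butter" appears in only two of five baskets, so it is pruned before any pair is counted. That pruning step is what keeps the search manageable on large datasets.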
  12. Areas of Current Application
      Credit Card/Insurance Fraud Detection
      Credit/Risk Scoring
      Direct Mail Marketing
      Parts Failure Prediction
      Recruiting/Attracting Customers
      Service Delivery and Customer Retention
      “Market Basket” Analysis
  13. Higher Education Applications
      Student academic success/Retention and graduation
      Identify high risk students
      Predict course demand
      Profile good transfer candidates
      Application success rates
      Predict potential alumni donations
  14. Software Vendors
      Clementine (SPSS)
      Intelligent Miner (IBM)
      Insightful Miner (Insightful)
      Enterprise Miner (SAS)
      Affinium Model (Unica)
      CART (Salford Systems)
      XLMiner
      GhostMiner
      S-PLUS
  15. Clementine (SPSS)
  16. Insightful Miner (Insightful)
  17. CART (Salford Systems)
  18. How much does it cost?
      Clementine (SPSS)
         Price varies
      Insightful Miner (Insightful)
         Small (a fraction of other mining tools)
      Enterprise Miner (SAS)
         Academic server license $40K-100K
      Affinium Model (Unica)
      Intelligent Miner (IBM)
      XLMiner
         Standard academic version $199 for two years
      GhostMiner
         $2.5K-30K + maintenance fee
      CART (Salford Systems)
         Very low for academic license
  19. Resources
      Web Sites
        http://www.kdnuggets.com/
        http://www.uni.edu/instrsch/dm/index.html
      Training
        http://www.the-modeling-agency.com
  20. What is Data Mining?
      • The process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories and by using pattern recognition technologies as well as statistical and mathematical techniques (The Gartner Group).
      • The nontrivial extraction of implicit, previously unknown, and potentially useful information from data (Frawley, Piatetsky-Shapiro and Matheus).
      Data Mining in Institutional Research
      • Data analysis for institutional research (IR) has evolved from simple retrospective data delivery in the 1960s to retrospective dynamic data delivery at multiple levels in the 1990s. Unlike past methodologies, data mining is prospective and proactive in data analysis and information delivery. With a blend of tools and techniques from disciplines such as statistics, computer science, mathematics, biology and engineering, data mining provides new opportunities for institutional research professionals to provide decision support data. This site provides a collection of resources from an introductory perspective for institutional research professionals interested in data mining.
      • As this area is still in its infancy, real-world examples of IR applications are difficult to find, let alone emulate. As more examples in IR become available, this site will be updated. Until that time, most of the examples refer to current data mining applications in the business and industry sectors.
      • Data mining has been used by universities in a number of areas, including but not limited to enrollment management, retention and graduation analysis, survey data analysis, and donation prediction (alumni contribution).
      Comments or Suggestions? Email Dr. Kumar, Information Management & Analysis
      Last Modified: March 25, 2004
      Copyright 2004 University of Northern Iowa Office of Information Management & Analysis
  21. Training (The Modeling Agency)
      DATA MINING: LEVEL I
      A Strategic Overview of Methods, Resources and Applications for Predictive Analytics
      by Tony Rathburn; Eric Siegel
      Registration: $1,295, 2 Days*
        Washington, DC - June 21 & 22, 2004
        San Diego, CA - September 20 & 21, 2004
        Las Vegas, NV - November 29 & 30, 2004
      *DM Levels I & II Package $1,995
      DATA MINING: LEVEL II
      A Tactical Drill-Down of the Data Mining Process, Tools and Techniques
      by Dean Abbott
      Registration: $1,295, 2 Days*
        Washington, DC - June 23 & 24, 2004
        San Diego, CA - September 22 & 23, 2004
        Las Vegas, NV - December 1 & 2, 2004
      DATA MINING: LEVEL III
      A Hands-On Application Workshop for Data Mining Practitioners
      by Dean Abbott
      Registration: $695, 1 Day*
        Washington, DC - June 25, 2004
        San Diego, CA - September 24, 2004
        Las Vegas, NV - December 3, 2004
  22. Selected Data Mining Books
  23. What percentage (%) of time in your data mining project(s) is spent on data cleaning and preparation? (187 votes total)
      Over 80%       46 votes   25%
      61 to 80%      73 votes   39%
      41 to 60%      46 votes   25%
      21 to 40%       7 votes    4%
      20% or less    15 votes    8%
      Source: http://www.kdnuggets.com/
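
As a quick arithmetic check on the poll, the sketch below recomputes the percentages from the vote counts on the slide and adds up the share of respondents who report spending more than 60% of project time on data preparation. The variable names are illustrative.

```python
# Vote counts as shown on the slide (187 votes total).
votes = {
    "Over 80%": 46,
    "61 to 80%": 73,
    "41 to 60%": 46,
    "21 to 40%": 7,
    "20% or less": 15,
}

total = sum(votes.values())
# Whole-number percentages, rounded the same way the slide appears to round.
shares = {answer: round(100 * count / total) for answer, count in votes.items()}

# Respondents spending more than 60% of their time on cleaning and preparation.
over_60_votes = votes["Over 80%"] + votes["61 to 80%"]
over_60_share = 100 * over_60_votes / total

for answer, pct in shares.items():
    print(f"{answer:<12} {votes[answer]:>3} votes  {pct:>3}%")
print(f"More than 60% of time: {over_60_votes} of {total} votes ({over_60_share:.1f}%)")
```

The recomputed percentages agree with the slide, and the two highest categories together account for a clear majority of the votes.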
  24. Thank You