A Comparison of Leading Data Mining Tools
Upcoming SlideShare
Loading in...5
×
 

A Comparison of Leading Data Mining Tools

on

  • 1,363 views

 

Statistics

Views

Total Views
1,363
Views on SlideShare
1,361
Embed Views
2

Actions

Likes
0
Downloads
31
Comments
0

1 Embed 2

http://www.slideshare.net 2

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

A Comparison of Leading Data Mining Tools A Comparison of Leading Data Mining Tools Presentation Transcript

  • KDD-98: A Comparison of Leading Data Mining Tools A Comparison of Leading Data Mining Tools John F. Elder IV & Dean W. Abbott Elder Research Fourth International Conference on Knowledge Discovery & Data Mining Friday, August 28, 1998 New York, New York
  • KDD-98: A Comparison of Leading Data Mining Tools Copyright © 1998, John F. Elder IV and Dean W. Abbott All rights reserved Manufactured in the United States of America © 1998 Elder Research T8-2 updated October 10, 1998
  • KDD-98: A Comparison of Leading Data Mining Tools Contacting Elder Research http://www.datamininglab.com Dr. John F. Elder IV Dean W. Abbott 1006 Wildmere Place 3443 Villanova Avenue Charlottesville, VA 22901 San Diego, CA 92122-2310 elder@datamininglab.com dean@datamininglab.com 804-973-7673 619-450-0313 Fax: 804-995-0064 © 1998 Elder Research T8-3 updated October 10, 1998
  • KDD-98: A Comparison of Leading Data Mining Tools Tutorial Goals • Compare and Summarize Data Mining Tools which: – Offer multiple modeling and classification algorithms – Support project stages surrounding model construction – Stand alone – Are general-purpose – Cost a lot – We could get our hands on • Include some (focused) Desktop Tools Other Reports: Two Crows, Aberdeen Group, Elder Research (forthcoming ), Data Mining Journal © 1998 Elder Research T8-4 updated October 10, 1998
  • KDD-98: A Comparison of Leading Data Mining Tools Topics • Products covered • Review of algorithms • Comparative tables of properties • Screen shots exemplifying qualities • Summary of distinctives © 1998 Elder Research T8-5 updated October 10, 1998
  • KDD-98: A Comparison of Leading Data Mining Tools Caveats • We don’t know every tool well (and are sure to have missed some!) – Level of exposure noted for each tool • Our background (biasing our perspective) – Very technical, “early adopters” – Emphasize solving real-world applications – More classification than estimation • Field of tools is quite dynamic – New versions appear regularly © 1998 Elder Research T8-6 updated October 10, 1998
  • KDD-98: A Comparison of Leading Data Mining Tools Data Mining Products PRW Model 1 © 1998 Elder Research T8-7 updated October 10, 1998
  • KDD-98: A Comparison of Leading Data Mining Tools Tools Evaluated Version Our Tested Product Company URL Experience Integral Solutions, Ltd. Integral Solutions, Ltd. Clementine http://www.isl.co.uk/clem.html 4 Moderate Darwin Thinking Machines, Corp. Thinking Machines, Corp. http://www.think.com/html/products/products.htm 3.0.1 Moderate DataCruncher DataMind DataMind http://www.datamindcorp.com 2.1.1 High Enterprise Miner SAS Institute SAS Institute http://www.sas.com/software/components/miner.html Beta Moderate GainSmarts Urban Science Urban Science http://www.urbanscience.com/main/gainpage.htm 4.0.3 Low Intelligent Miner IBMIBM http://www.software.ibm.com/data/iminer/ 2 Low MineSet Silicon Graphics, Inc. Silicon Graphics, Inc. http://www.sgi.com/Products/software/MineSet/ 2.5 Low Model 1 Group 1 1 Group http://www.unica-usa.com/model1.htm 3.1 Moderate ModelQuest AbTech Corp. AbTech Corp. http://www.abtech.com 1 Moderate PRW Unica Technologies, Inc. Unica http://www.unica-usa.com/prodinfo.htm 2.5 High CART Salford Systems http://www.salford-systems.com 3.5 Moderate NeuroShell Ward Systems Group, Inc. http://www.wardsystems.com/neuroshe.htm 3 Moderate OLPARS PAR Government Systems mailto://olpars@partech.com 8.1 High Scenario Cognos http://www.cognos.com/busintell/products/index.html 2 Moderate See5 RuleQuest Research http://www.rulequest.com/see5-info.html 1.07 Moderate S-Plus MathSoft http://www.mathsoft.com/splus/ 4 High WizWhy WizSoft http://www.wizsoft.com/why.html 1.1 Moderate © 1998 Elder Research T8-8 updated October 10, 1998
  • KDD-98: A Comparison of Leading Data Mining Tools Categories for Comparisons • Platforms Supported • Algorithms Included – Decision Trees – Neural Networks – Other • Data Input and Model Output Options • Usability Ratings • Visualization Capabilities • Modeling Automation Methods © 1998 Elder Research T8-9 updated October 10, 1998
  • KDD-98: A Comparison of Leading Data Mining Tools Unix Standalone PC Standalone Key Unix Server / Connectivity NT Server / Platforms PC Client PC Client Database (95/NT) blank no capability √– some capability Clementine √ √+ √ √ good capability Darwin √ √ DataCruncher √ √ √ √+ excellent capability Enterprise Miner √ √+ √ √ GainSmarts √ √ √ Intelligent Miner √ √ MineSet √ √ Model 1 √ √ √ ModelQuest √ √ √ PRW √ √ CART √ √+ Scenario √ √ NeuroShell √ OLPARS √ √ See5 √ √+ S-Plus √ √− WizWhy √ © 1998 Elder Research T8-10 updated October 10, 1998
  • KDD-98: A Comparison of Leading Data Mining Tools Tool Groupings Desktop High End • PC (standalone) • Multiple Platforms, Client- Server • Flat Files • Flat Files or Direct Database Access • One or Two Algorithms • Multiple Algorithm Types • Data Fits into RAM • Large Databases © 1998 Elder Research T8-11 updated October 10, 1998
  • KDD-98: A Comparison of Leading Data Mining Tools End User Perspectives Business Technical • Intuitive Interface • Algorithm Options – Clear steps in data mining – Knobs to enhance model process performance – Non-technical terminology • Model Automation – Familiar environment – Simplify model design cycle • Descriptive Reporting – Documentation of steps used – Domain terminology in generating models – Graphical representations (repeatability) © 1998 Elder Research T8-12 updated October 10, 1998
  • KDD-98: A Comparison of Leading Data Mining Tools Save Native Output Data Input & Automatic Data ODBC Database Summary Source Model Output Header Reports Format Drivers Code Clementine √ √ √ Darwin √ √ √ DataCruncher √ √ √ √ √ Enterprise Miner √− √ √ √− √ GainSmarts √ √ √ √ √ Intelligent Miner √− √ MineSet √ √ Model 1 √ √ √ √ √ ModelQuest √ √ √ √ PRW √ √ √ √ CART √ Scenario √ √ NeuroShell √ OLPARS √ See5 √− S-Plus √ √ √ √ WizWhy √ √ © 1998 Elder Research T8-13 updated October 10, 1998
  • KDD-98: A Comparison of Leading Data Mining Tools Decision Trees a>4 n y b>3.5 b>2 n y n y a> 1 2 2 3 n y 1 0 © 1998 Elder Research T8-14 updated October 10, 1998
  • KDD-98: A Comparison of Leading Data Mining Tools Polynomial Networks Z17 = 3.1 + 0.4a - .15b2 + 0.9bc - 0.62abc + 0.5c3 Layer 0 Layer 1 (Normalizers) Layer 2 z1 a N1 z16 Layer 3 z9 Double 16 k N9 Unitizers z14 z6 Single 14 f N6 z15 z21 Multi- Triple 21 U2 Y1 z17 Linear 15 z4 d N4 Triple 17 h N8 z8 z19 Double 19 U7 Y2 Double 20 z20 e N5 z5 © 1998 Elder Research T8-15 updated October 10, 1998
  • KDD-98: A Comparison of Leading Data Mining Tools “Consensus” Models Parametrically Summarize Data Points orders, terms Regression Polynomial Networks (e.g. GMDH, ASPN) Decision Trees (e.g., CART, CHAID, C5) ∑ ∑ Logistic or Sigmoidal ∑ ∑ Networks (ANNs) Hinging Hyperplanes, MARS © 1998 Elder Research T8-16 updated October 10, 1998
  • KDD-98: A Comparison of Leading Data Mining Tools “Consensus” Models (continued) orientation, bin width Histogram function Radial Basis Function family, order Wavelets © 1998 Elder Research T8-17 updated October 10, 1998
  • KDD-98: A Comparison of Leading Data Mining Tools “Contributory” Models retain data points; each potentially affects estimate at new point shape, spread Kernels k, distance metric k-Nearest Neighbor Goal, iterations Delaunay Planes Spread, index Projection Pursuit Regression © 1998 Elder Research T8-18 updated October 10, 1998
  • KDD-98: A Comparison of Leading Data Mining Tools Properties of Algorithms Algorithm Accurate Scalable Interpret- able Useable Robust Versatile Fast Hot Key Classical (LR, LDA) – C C– C – – C D C good Neural Networks C D D D – D DD C – neutral Visualization C DD C C CC D DDD C– D bad Decision Trees D C C C– C C C– C– Polynomial Networks C – D C– –D – –D – K-Nearest Neighbors D DD C– – –D D C D Kernels C DD D –D D D C D © 1998 Elder Research T8-19 updated October 10, 1998
  • KDD-98: A Comparison of Leading Data Mining Tools Generalized Linear Models Multi-layer Perceptrons Radial Basis Functions Polynomial Networks Sequential Discovery Association Rules Nearest Neighbor Linear/Statistical Rule Induction Algorithms Decision Trees Time Series Kohonen K Means Bayes Clementine √ √ √ √ √ √ √ Darwin √ √ √ Datamind √ Enterprise Miner √ √ √ √ √ √ √ √ GainSmarts √ √+ Intelligent Miner √ √− √ √− √ √ √+ √ MineSet √ √ √ √ Model 1 √+ √ √ ModelQuest √ √ √ √ √− PRW √+ √ √ √ √ √ CART √ Cognos √ NeuroShell √+ √ √− OLPARS √ √ √ √ √ √ √ See5 √ √ SPlus √ √+ √ √ √ WizWhy √ © 1998 Elder Research T8-20 updated October 10, 1998
  • KDD-98: A Comparison of Leading Data Mining Tools Multiple Activation Functions Automatic Model Selection Advanced Learning Alg. Multiple Stop Criteria Learning Rate Decay Parameter Summary Other Cost functions Multi-Layer Normalize Inputs Cross-Validation Perceptrons Network Visual Learning Rate Momentum Clementine √ √ √ √ Darwin √ √ √ √ √ √ Enterprise Miner √ √ √ √ √ √ √ √ √ √ Intelligent Miner √ √ Model 1 √ √ PRW √ √ √ √ √ √ NeuroShell √ √ √ √ √ OLPARS √ √ √ √ √ √ √ © 1998 Elder Research T8-21 updated October 10, 1998
  • KDD-98: A Comparison of Leading Data Mining Tools Classification Costs Pruning Severity Missing Data Visual Trees Decision Trees C5 or C4.5 "CART" CHAID Priors Other Clementine √ √ √ √ √− Darwin √ √ √ √ Enterprise Miner √ √− √ √+ √ √ √ √ GainSmarts √ √ √ √ √ Intelligent Miner √ √ √ MineSet √ √ √ √ √ √ Model 1 √ √ √− ModelQuest √− √ √ CART √+ √ √ √ √ Scenario √ √ S-Plus √ √ √ √ See5 √+ √ √ √ © 1998 Elder Research T8-22 updated October 10, 1998
  • KDD-98: A Comparison of Leading Data Mining Tools Complexity Cross- Input Factor Regression / Stats Linear Logistic Penalty Validation Selection Analysis Clementine Y Clementine √ Enterprise Miner √+ √+ √ √ √ √ GainSmarts √+ √+ √ Intelligent Miner √− √ √ MineSet √ Model 1 √ √ √ √+ ModelQuest Enterprise √ √ √ √ √ PRW S-Plus √ √ √ √+ √ √ √ √+ √ √ S-Plus √+ √+ √ √ √ √ Scenario √ © 1998 Elder Research T8-23 updated October 10, 1998
  • KDD-98: A Comparison of Leading Data Mining Tools Data Loading and Model Technical Usability Manipulation Model Building Understanding Support Overall Clementine √+ √+ √+ √+ √+ Darwin √ √ √+ √ √ DataCruncher √+ √+ √ √ √ Enterprise Miner √ √ √ √ √ GainSmarts √+ √ √ √ √ Intelligent Miner √ √ √ √ √ MineSet √ √+ √+ √ √+ Model 1 √+ √+ √+ √+ √+ ModelQuest Enterprise √ √+ √+ √+ √+ PRW √+ √+ √+ √+ √+ CART √− √ √ √ √ Scenario √ √+ √+ √ √+ NeuroShell √ √ √ √ √ OLPARS √− √ √ √ √ See5 √ √ √ √ √ S-Plus √ √ √+ √ √ WizWhy √ √ √+ √ √ © 1998 Elder Research T8-24 updated October 10, 1998
  • KDD-98: A Comparison of Leading Data Mining Tools Scatter/ Classification Pie Rotating Conditional Correlation Visualization Histograms Charts Line Scatter Plots Decision Plots Plots Regions Clementine √ √ √ √− √ Darwin √− √− √− DataCruncher √ √ √ √ Enterprise Miner √ √ √ √− √ √ GainSmarts √− √− Intelligent Miner √ √ √ √ MineSet √ √ √ √ √ Model 1 √ √ √ √ ModelQuest Enterprise √ √ PRW √ √ √ √ CART Scenario √ NeuroShell √ OLPARS √ √ √ √− √ √ See5 √ S-Plus √ √ √ √ √ WizWhy © 1998 Elder Research T8-25 updated October 10, 1998
  • KDD-98: A Comparison of Leading Data Mining Tools Free Text Automation Method of Automation Annotation of Steps Clementine Visual Programming, Programming Language √ Darwin Programming Language √ DataCruncher (Task manager) Enterprise Miner Visual Programming, Programming Language √ GainSmarts Macro Language, Wizards √− Intelligent Miner (Wizards) MineSet Data History, Log Model 1 Model Wizard ModelQuest Batch Agenda PRW Experiement Manager; Macros √ CART Built-in Basic Scripting Scenario NeuroShell OLPARS See5 S-Plus Scripting (S); C/C++ WizWhy © 1998 Elder Research T8-26 updated October 10, 1998
  • KDD-98: A Comparison of Leading Data Mining Tools A Recent Breakthrough: Bundling 1) Construct varied models, and 2) Combine their estimates Generate component models by varying: • Case Weights • Data Values • Guiding Parameters • Variable Subsets Combine estimates using: • Estimator Weights • Voting • Advisor Perceptrons • Partitions of Design Space © 1998 Elder Research T8-27 updated October 10, 1998
  • KDD-98: A Comparison of Leading Data Mining Tools Example Bundling Techniques • Bayes: sum estimates of possible models, weighted by priors • GMDH (Ivakhenko 68) -- multiple layers of quadratic polynomials, using two inputs each, fit by LR • Stacking (Wolpert 92) -- train a 2nd-level (LR) model using leave-1-out estimates of 1st-level (neural net) models • Bagging (Breiman 96) (bootstrap aggregating) -- bootstrap data (to build trees mostly); take majority vote or average • Bumping (Tibshirani 97) -- bootstrap, select single best • Boosting (Freund & Shapire 96) -- weight error cases by βτ = (1-e(t))/e(t), iteratively re-model; weight model t by ln(βτ) • Crumpling (Anderson & Elder 98) -- average cross-validations • Born-Again (Breiman 98) -- invent new X data... © 1998 Elder Research T8-28 updated October 10, 1998
  • KDD-98: A Comparison of Leading Data Mining Tools Distinctives Strengths Weaknesses Clementine vis ual inte rfa c e ; algorithm bre a d th s c a lability Darwin e ffic ient c lie nt-s e rve r; intuitive inte rfa c e o p tions no uns upe rvis e d ; limite d vis ualization DataCruncher e a s e o f us e s ingle a lgorithm Enterprise Miner d e p th of algorithm s ; visual inte rfa c e harde r to u s e ; ne w product is s ue s GainSmarts data trans fo rmations , built on SAS ; a lgorithm option depth no uns upe rvis e d ; limite d vis ualization Intelligent Miner algorithm bre a d th; graphical tre e /clus te r output fe w a lg o rithm options ; no automation MineSet data visualization fe w a lg o rithms; no model e xport Model 1 e a s e o f us e ; automate d m o d e l dis c o ve ry re a lly a ve rtical to o l ModelQuest bre a d th of alg o rithms s o m e non-intuitive inte rfa c e o p tions PRW e xte n s ive a lgorithms; automate d m o d e l s e le c tion limite d visualization CART d e p th of tre e o p tions difficult file I/O; limite d visualization Scenario e a s e o f us e narrow analysis path NeuroShell multiple neural ne twork archite c ture s unorthodox inte rfa c e ; only neural ne tworks OLPARS multiple s tatis tical algorithms; cla s s -bas e d vis ualization date d inte rfa c e ; difficult file I/O See5 d e p th of tre e o p tions limite d visualization; fe w data options S-Plus d e p th of algorithm s ; visualization; programable /e xte ndable limite d inductive m e tho d s ; s te e p le a rning curve WizWhy e a s e o f us e ; e a s e o f mode l unde rs tanding limite d visualization © 1998 Elder Research T8-29 updated October 10, 1998
  • KDD-98: A Comparison of Leading Data Mining Tools Closing Observations • Data Mining Tools Can: – Enhance inference process – Speed up design cycle • Data Mining Tools Can Not: – Substitute for statistical and domain expertise • Users are advised to: – Get training on tools – Be alert for product upgrades © 1998 Elder Research T8-30 updated October 10, 1998
  • KDD-98: A Comparison of Leading Data Mining Tools Forthcoming Report • Report provides detailed comparison of high-end data mining tools, including capabilities, ease of use, and practical tips. • Available for $695 from Elder Research (http://www.datamininglab.com), Q4 1998. • Purchasers receive brief free consulting session to explore report findings in more detail, if desired. Note: The analyses and reviews were performed completely independently, and were made possible by the cooperation of the vendors, for which Elder Research is very grateful. The companies, however, provided no financial support, and had no influence on its editorial content. © 1998 Elder Research T8-31 updated October 10, 1998