SlideShare a Scribd company logo
Top 10 Data Mining Mistakes
                       -- and how to avoid them

                    Salford Systems Data Mining Conference
                              New York, New York
                                 March 29, 2005


                                                    Elder Research, Inc.
                                                    635 Berkmar Circle
         John F. Elder IV, Ph.D.               Charlottesville, Virginia 22901
       elder@datamininglab.com                          434-973-7673
                                                 www.datamininglab.com


© 2005 Elder Research, Inc.
                                                                             0
© 2005 Elder Research, Inc.
                                               1
                              © Hildebrandts
You've made a mistake if you…
0. Lack Data

1. Focus on Training              6. Discount Pesky Cases

2. Rely on One Technique          7. Extrapolate

3. Ask the Wrong Question         8. Answer Every Inquiry

4. Listen (only) to the Data      9. Sample Casually

5. Accept Leaks from the Future   10. Believe the Best Model


© 2005 Elder Research, Inc.
                                                        2
0. Lack Data
•     Need labeled cases for best gains (to classify or estimate; clustering is
      much less effective). Interesting known cases may be exceedingly rare.
      Some projects probably should not proceed until enough critical data is
      gathered to make it worthwhile.
•     Ex: Fraud Detection (Government contracting): Millions of transactions,
      a handful of known fraud cases; likely that large proportion of fraud cases
      are, by default, mislabeled clean. Only modest results (initially) after
      strenuous effort.
•     Ex: Fraud Detection (Taxes; collusion): Surprisingly many known cases
      -> stronger, immediate results.
•     Ex: Credit Scoring: Company (randomly) gave credit to thousands of
      applicants who were risky by conventional scoring method, and
      monitored them for two years. Then, estimated risk using what was
      known at start. This large investment in creating relevant data paid off.

    © 2005 Elder Research, Inc.
                                                                                  3
1. Focus on Training
•    Only out-of-sample results matter. (Otherwise, use a lookup table!)
•    Cancer detection Ex: MD Anderson doctors and researchers (1993),
     using neural networks, surprised to find that longer training (week vs.
     day) led to only slightly improved training results, and much worse
     evaluation results.
•    (ML/CS often sought models with exact results on known data -> overfit.)
•    Re-sampling (bootstrap, cross-validation, jackknife, leave-one-out...) is
     an essential tool. (Traditional significance tests are a weak defense when
     structure is part of search, though stronger penalty-based metrics are
     useful.) However, note that resampling no longer tests a single model,
     but a model class, or a modeling process (Whatever is held constant
     throughout the sampling passes.)

© 2005 Elder Research, Inc.
                                                                              4
2. Rely on One Technique
•    "To a little boy with a hammer, all the world's a nail."
     For best work, need a whole toolkit.
•    At very least, compare your method to a conventional one (linear
     regression say, or linear discriminant analysis).
•    Study: In refereed Neural Network journals, over 3 year period, only
     1/6 articles did both ~1 & ~2; that is, test on unseen data and compare
     to a widely-used technique.
•    Not checking other methods leads to blaming the algorithm for the
     results. But, it’s somewhat unusual for the particular modeling
     technique to make a big difference, and when it will is hard to predict.
•    Best: use a handful of good tools. (Each adds only 5-10% effort.)


© 2005 Elder Research, Inc.
                                                                                5
Data Mining Products




                                               Model 1




© 2005 Elder Research, Inc.
                                                         6
Decision Tree                                                Nearest Neighbor




                                 Delaunay Triangles




       Kernel                                         Neural Network (or
   © 2005 Elder Research, Inc.
                                                                            7
                                                      Polynomial Network)
Relative Performance Examples: 5 Algorithms on 6 Datasets
                                                                  (John Elder, Elder Research & Stephen Lee, U. Idaho, 1997)
Error Relative to Peer Techniques (lower is better)




                                                      © 2005 Elder Research, Inc.
                                                                                                                               8
Error Relative to Peer Techniques (lower is better)
                                                          Essentially every Bundling method improves performance




                                                      © 2005 Elder Research, Inc.
                                                                                                                   9
3. Ask the Wrong Question
a) Project Goal: Aim at the right target
•    Fraud Detection (Positive example!) (Shannon Labs work on Int'l
     calls): Didn't attempt to classify fraud/nonfraud for general call, but
     characterized normal behavior for each account, then flagged outliers.
     -> A brilliant success.


b) Model Goal: Get the computer to "feel" like you do.
   [e.g., employee stock grants vs. options]
•    Most researchers are drawn into the realm of squared error by its
     convenience (mathematical beauty). But ask the computer to do what's
     most helpful for the system, not what's easiest for it. [Stock price ex.]




    © 2005 Elder Research, Inc.
                                                                                 10
4. Listen (only) to the Data
4a. Opportunistic data:
• [School funding ex.] Self-selection. Nothing inside the data protects
   analyst from significant, but wrong result.

4b. Designed experiment:
• [Tanks vs. Background with Neural networks]: Great results on out-
   of-sample portion of database. But found to depend on random pixels
                               sunny
   (Tanks photographed on ______ day, Background only on _______).cloudy
• [Tanks & Networks 2]: Tanks and Trucks on rotating platforms, to
   train to discriminate at different angles. Used radar, Fourier
   transforms, principle components, polynomial networks. But, source
   of the key signal = platform corner And, discriminated between the
                       _____________.
                                  bushes
   two classes primarily using _______.


© 2005 Elder Research, Inc.
                                                                          11
5. Accept Leaks from the Future
•    Forecasting ex.: Interest rate at Chicago Bank.
                                output was a candidate input
     N.net 95% accurate, but _________________________.
•                                                               today
     Financial ex. 2: moving average of 3 days, but centered on ______.
•    Many passes may be needed to expel anachronistic leakers - D. Pyle.
•    Look for variables which work (too) well.
     Insurance Ex: code associated with 25% of purchasers
     turned out to describe type of cancellation.
•    Date-stamp records when storing in Data Warehouse, or
     Don't overwrite old value unless archived.
•    Survivor Bias [financial ex.]


© 2005 Elder Research, Inc.
                                                                           12
6. Discount Pesky Cases
• Outliers may be killing results (ex: decimal point error on price),
  or be the whole answer (ex: Ozone hole), so examine carefully.
• The most exciting phrase in research isn't "Aha!", but "That's odd…"
• Internal inconsistencies in the data may be clues to problems with
  the process flow of information within the company; a larger
  business problem may be revealed.
• Direct Mail example: persisting in hunting down oddities found
  errors by Merge/Purge house, and was a major contributor to
  doubling sales per catalog.
• Visualization can cover a multitude of assumptions.
  © 2005 Elder Research, Inc.
                                                                        13
7. Extrapolate
•    Tend to learn too much from first few experiences.
•    Hard to "erase" factoids after an upstream error is discovered.
•    Curse of Dimensionality: low-dimensional intuition is useless in high-d.
•    Philosophical: Evolutionary Paradigm:
     Believe we can start with pond scum (pre-biotic soup of raw materials)
     + zap + time + chance + differential reinforcement -> a critter.
          (e.g., daily stock prices + MARS -> purchase actions,
          or pixel values + neural network -> image classification)
     Better paradigm is selective breeding:
     mutts + time + directed reinforcement -> greyhound
     Higher-order features of data + domain expertise essential


© 2005 Elder Research, Inc.
                                                                         14
“Of course machines can
                              think. After all, humans
                              are just machines made
                              of meat.”
                              - MIT CS professor




                              Human and computer
                              strengths are more
                              complementary than alike.

© 2005 Elder Research, Inc.
                                                  15
8. Answer Every Inquiry
• "Don't Know" is a useful model output state.
• Could estimate the uncertainty for each output (a function
  of the number and spread of samples near X). Few
  algorithms provide a conditional σ with their conditional µ.




Global Rd Optimization when Probes are Expensive (GROPE)

 © 2005 Elder Research, Inc.
                                                             16
9. Sample without Care
•   9a Down-sample Ex: MD Direct Mailing firm had too many non-responders
    (NR) for model (about 99% of >1M cases). So took all responders, and
    every 10th NR to create a more balanced database of 100K cases. Model
    predicted that everyone in Ketchikan, Wrangell, and Ward Cove Alaska
                       Sorted         zip code
    would respond. (______ data, by ________, and 100Kth case drawn before
    bottom of file (999**) reached.)
•   "Shake before baking". Also, add case number (and other random variables)
    to candidate list; use as "canaries in the mine" to signal trouble when chosen.
•   9b Up-sampling Ex: Credit Scoring - paucity of interesting (Default) cases
    led to quintupling them. Cross-validation employed with many techniques
    and modeling cycles. Results tended to improve with the complexity of the
    models. Oddly, this didn't reverse. Noticed that Default (rare) cases were
    better estimated by complex models but others were worse. (Had
     duplicated                         up-sampling
    _________ Defaults in each set by ___________ before splitting.)
    -> Split first.
•   It's hard to beat a stratified sample; that is, proportional sample from each
    group. [even with Data Squashing - W. DeMouchel]
    © 2005 Elder Research, Inc.
                                                                              17
10. Believe the Best Model
•    Interpretability not always necessary.
     Model can be useful without being "correct" or explanatory.
•    Often, particular variables used by "best" model (which barely won
     out over hundreds of others of the millions (to billions) tried, using a
     score function only approximating one's goals, and on finite data)
     have too much attention paid to them. (Un-interpretability could be a
     virtue!).
•    Usually, many very similar variables are available, and the particular
     structure of the best model can vary chaotically. [Polynomial
     Network Ex.] But, structural similarity is different from functional
     similarity. (Competing models often look different, but act the same.)
•    Best estimator is likely to be bundle of models.
     [Direct Marketing trees] [5x6 table] [Credit Scoring ex.]
© 2005 Elder Research, Inc.
                                                                            18
Lift Chart: %purchasers vs. %prospects




•   Ex: Last quintile of customers are 4 times more expensive to obtain than
    first quintile (10% vs. 40% to gain 20%)
•   Decision Tree provides relatively few decision points.
    © 2005 Elder Research, Inc.
                                                                           19
Bundling 5 Trees
           improves accuracy and smoothness




© 2005 Elder Research, Inc.
Credit Scoring Model Performance
#Defaulters Missed (fewer is better)



                                                  Bundled Trees                      SNT
                                                                           NT
                                                Stepwise Regression
                                                                          NS    ST
                                               Polynomial Network         MT
                                                                                PS    SPT
                                                                                      PNT   SMT       SPNT
                                                 Neural Network           PT
                                                                                NP   MPT    SPN       SMPT
                                                          MARS                              MNT
                                                                          MS         SMN
                                                                                            SMP       SMNT
                                                                                MN                            SMPNT
                                                                                                      SMPN
                                                                          MP                          MPNT

                                                                                     MPN




                                                                #Models Combined (averaging output rank)     21
                                       © 2005 Elder Research, Inc.
Median (and Mean) Error Reduced
          with each Stage of Combination
                       75

                M 70
                i
                s
                  65
                s
                e
                d 60

                       55

                              1    2      3     4         5
                              No. Models in combination

© 2005 Elder Research, Inc.
                                                              22
Fancier tools and harder problems –> more ways to mess up.

            How then can we succeed?
     Success <- Learning <- Experience <- Mistakes
     (so go out and make some good ones!)

   PATH to success:
   • Persistence - Attack repeatedly, from different angles.
        Automate essential steps. Externally check work.

   • Attitude - Optimistic, can-do.
   • Teamwork - Business and statistical experts must cooperate.
                  Does everyone want the project to succeed?

   • Humility - Learning from others requires vulnerability.
                  Don’t expect too much of technology.
© 2005 Elder Research, Inc.
                                                                   23
John F. Elder IV
                                        Chief Scientist, ERI
                              Dr. John Elder heads a data mining consulting team with offices in
                              Charlottesville, Virginia and Washington DC, and close affiliates in Boston,
                              New York, San Diego, and San Francisco (www.datamininglab.com).
                              Founded in 1995, Elder Research, Inc. focuses on investment and commercial
                              applications of pattern discovery and optimization, including stock selection,
                              image recognition, process optimization, cross-selling, biometrics, drug
                              efficacy, credit scoring, market timing, and fraud detection.

John obtained a BS and MEE in Electrical Engineering from Rice University, and a PhD in Systems
Engineering from the University of Virginia, where he’s an adjunct professor, teaching Optimization.
Prior to a decade leading ERI, he spent 5 years in aerospace defense consulting, 4 heading research at an
investment management firm, and 2 in Rice's Computational & Applied Mathematics department.

Dr. Elder has authored innovative data mining tools, is active on Statistics, Engineering, and Finance
conferences and boards, is a frequent keynote conference speaker, and was a Program co-chair of the
2004 Knowledge Discovery and Data Mining conference. John’s courses on data analysis techniques --
taught at dozens of universities, companies, and government labs -- are noted for their clarity and
effectiveness. Dr. Elder holds a top secret clearance, and since the Fall of 2001, has been honored to
serve on a panel appointed by Congress to guide technology for the National Security Agency.

John is a follower of Christ and the proud father of 5.


© 2005 Elder Research, Inc.

More Related Content

Viewers also liked

Drivers of Housing Demand: Preparing for the Impending Elder Boom
Drivers of Housing Demand: Preparing for the Impending Elder BoomDrivers of Housing Demand: Preparing for the Impending Elder Boom
Drivers of Housing Demand: Preparing for the Impending Elder BoomGNOCDC
 
Children under the age of three in formal care in Eastern Europe and Central ...
Children under the age of three in formal care in Eastern Europe and Central ...Children under the age of three in formal care in Eastern Europe and Central ...
Children under the age of three in formal care in Eastern Europe and Central ...
UNICEF Europe & Central Asia
 
Many Children Left Behind
Many Children Left BehindMany Children Left Behind
Many Children Left BehindMelissa Fulton
 
Angola child soldiers
Angola child soldiersAngola child soldiers
Angola child soldiers
guestba9d13
 
GAEBrochureFinal2010
GAEBrochureFinal2010GAEBrochureFinal2010
GAEBrochureFinal2010Simon Roberts
 
State of Nigerian children
State of Nigerian childrenState of Nigerian children
State of Nigerian childrenYoung Bill
 
Thia Presentation April 2 2008 B
Thia Presentation April 2 2008 BThia Presentation April 2 2008 B
Thia Presentation April 2 2008 B
Scott Nystrom, Ph.D.
 
Universal health care
Universal health care Universal health care
Universal health care
Dr.Hemant Kumar
 

Viewers also liked (8)

Drivers of Housing Demand: Preparing for the Impending Elder Boom
Drivers of Housing Demand: Preparing for the Impending Elder BoomDrivers of Housing Demand: Preparing for the Impending Elder Boom
Drivers of Housing Demand: Preparing for the Impending Elder Boom
 
Children under the age of three in formal care in Eastern Europe and Central ...
Children under the age of three in formal care in Eastern Europe and Central ...Children under the age of three in formal care in Eastern Europe and Central ...
Children under the age of three in formal care in Eastern Europe and Central ...
 
Many Children Left Behind
Many Children Left BehindMany Children Left Behind
Many Children Left Behind
 
Angola child soldiers
Angola child soldiersAngola child soldiers
Angola child soldiers
 
GAEBrochureFinal2010
GAEBrochureFinal2010GAEBrochureFinal2010
GAEBrochureFinal2010
 
State of Nigerian children
State of Nigerian childrenState of Nigerian children
State of Nigerian children
 
Thia Presentation April 2 2008 B
Thia Presentation April 2 2008 BThia Presentation April 2 2008 B
Thia Presentation April 2 2008 B
 
Universal health care
Universal health care Universal health care
Universal health care
 

Similar to Elder

Debugging machine-learning
Debugging machine-learningDebugging machine-learning
Debugging machine-learning
Michał Łopuszyński
 
Outliers and Inconsistency
Outliers and InconsistencyOutliers and Inconsistency
Outliers and Inconsistency
Neil Rubens
 
Statistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignoreStatistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignore
Turi, Inc.
 
NYC Open Data Meetup-- Thoughtworks chief data scientist talk
NYC Open Data Meetup-- Thoughtworks chief data scientist talkNYC Open Data Meetup-- Thoughtworks chief data scientist talk
NYC Open Data Meetup-- Thoughtworks chief data scientist talk
Vivian S. Zhang
 
920 plenary elder
920 plenary elder920 plenary elder
920 plenary elder
Rising Media, Inc.
 
910 plenary Elder
910 plenary Elder910 plenary Elder
910 plenary Elder
Rising Media, Inc.
 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
c.titus.brown
 
L15. Machine Learning - Black Art
L15. Machine Learning - Black ArtL15. Machine Learning - Black Art
L15. Machine Learning - Black Art
Machine Learning Valencia
 
Data Science Folk Knowledge
Data Science Folk KnowledgeData Science Folk Knowledge
Data Science Folk Knowledge
Krishna Sankar
 
840 plenary elder_using his laptop
840 plenary elder_using his laptop840 plenary elder_using his laptop
840 plenary elder_using his laptop
Rising Media, Inc.
 
Artificial Intelligence for Medicine
Artificial Intelligence for MedicineArtificial Intelligence for Medicine
Artificial Intelligence for Medicine
Tassilo Klein
 
Dm sei-tutorial-v7
Dm sei-tutorial-v7Dm sei-tutorial-v7
Dm sei-tutorial-v7CS, NcState
 
Machine Learning for Incident Detection: Getting Started
Machine Learning for Incident Detection: Getting StartedMachine Learning for Incident Detection: Getting Started
Machine Learning for Incident Detection: Getting Started
Sqrrl
 
Ml pluss ejan2013
Ml pluss ejan2013Ml pluss ejan2013
Ml pluss ejan2013
CS, NcState
 
Jsm big-data
Jsm big-dataJsm big-data
Jsm big-data
Sean Taylor
 
Machine Learning for Data Extraction
Machine Learning for Data ExtractionMachine Learning for Data Extraction
Machine Learning for Data Extraction
Dasha Herrmannova
 
Icse15 Tech-briefing Data Science
Icse15 Tech-briefing Data ScienceIcse15 Tech-briefing Data Science
Icse15 Tech-briefing Data Science
CS, NcState
 
Decision Modelling for n00bs
Decision Modelling for n00bsDecision Modelling for n00bs
Decision Modelling for n00bs
Suranga Nath Kasthurirathne
 
Biological Foundations for Deep Learning: Towards Decision Networks
 Biological Foundations for Deep Learning: Towards Decision Networks Biological Foundations for Deep Learning: Towards Decision Networks
Biological Foundations for Deep Learning: Towards Decision Networks
diannepatricia
 

Similar to Elder (20)

Debugging machine-learning
Debugging machine-learningDebugging machine-learning
Debugging machine-learning
 
Outliers and Inconsistency
Outliers and InconsistencyOutliers and Inconsistency
Outliers and Inconsistency
 
Statistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignoreStatistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignore
 
NYC Open Data Meetup-- Thoughtworks chief data scientist talk
NYC Open Data Meetup-- Thoughtworks chief data scientist talkNYC Open Data Meetup-- Thoughtworks chief data scientist talk
NYC Open Data Meetup-- Thoughtworks chief data scientist talk
 
920 plenary elder
920 plenary elder920 plenary elder
920 plenary elder
 
910 plenary Elder
910 plenary Elder910 plenary Elder
910 plenary Elder
 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
 
L15. Machine Learning - Black Art
L15. Machine Learning - Black ArtL15. Machine Learning - Black Art
L15. Machine Learning - Black Art
 
Data Science Folk Knowledge
Data Science Folk KnowledgeData Science Folk Knowledge
Data Science Folk Knowledge
 
840 plenary elder_using his laptop
840 plenary elder_using his laptop840 plenary elder_using his laptop
840 plenary elder_using his laptop
 
Artificial Intelligence for Medicine
Artificial Intelligence for MedicineArtificial Intelligence for Medicine
Artificial Intelligence for Medicine
 
Input modeling
Input modelingInput modeling
Input modeling
 
Dm sei-tutorial-v7
Dm sei-tutorial-v7Dm sei-tutorial-v7
Dm sei-tutorial-v7
 
Machine Learning for Incident Detection: Getting Started
Machine Learning for Incident Detection: Getting StartedMachine Learning for Incident Detection: Getting Started
Machine Learning for Incident Detection: Getting Started
 
Ml pluss ejan2013
Ml pluss ejan2013Ml pluss ejan2013
Ml pluss ejan2013
 
Jsm big-data
Jsm big-dataJsm big-data
Jsm big-data
 
Machine Learning for Data Extraction
Machine Learning for Data ExtractionMachine Learning for Data Extraction
Machine Learning for Data Extraction
 
Icse15 Tech-briefing Data Science
Icse15 Tech-briefing Data ScienceIcse15 Tech-briefing Data Science
Icse15 Tech-briefing Data Science
 
Decision Modelling for n00bs
Decision Modelling for n00bsDecision Modelling for n00bs
Decision Modelling for n00bs
 
Biological Foundations for Deep Learning: Towards Decision Networks
 Biological Foundations for Deep Learning: Towards Decision Networks Biological Foundations for Deep Learning: Towards Decision Networks
Biological Foundations for Deep Learning: Towards Decision Networks
 

More from George Ang

Wrapper induction construct wrappers automatically to extract information f...
Wrapper induction   construct wrappers automatically to extract information f...Wrapper induction   construct wrappers automatically to extract information f...
Wrapper induction construct wrappers automatically to extract information f...George Ang
 
Opinion mining and summarization
Opinion mining and summarizationOpinion mining and summarization
Opinion mining and summarizationGeorge Ang
 
Huffman coding
Huffman codingHuffman coding
Huffman codingGeorge Ang
 
Do not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar textDo not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar textGeorge Ang
 
大规模数据处理的那些事儿
大规模数据处理的那些事儿大规模数据处理的那些事儿
大规模数据处理的那些事儿George Ang
 
腾讯大讲堂02 休闲游戏发展的文化趋势
腾讯大讲堂02 休闲游戏发展的文化趋势腾讯大讲堂02 休闲游戏发展的文化趋势
腾讯大讲堂02 休闲游戏发展的文化趋势George Ang
 
腾讯大讲堂03 qq邮箱成长历程
腾讯大讲堂03 qq邮箱成长历程腾讯大讲堂03 qq邮箱成长历程
腾讯大讲堂03 qq邮箱成长历程George Ang
 
腾讯大讲堂04 im qq
腾讯大讲堂04 im qq腾讯大讲堂04 im qq
腾讯大讲堂04 im qqGeorge Ang
 
腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道George Ang
 
腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化George Ang
 
腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间George Ang
 
腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨George Ang
 
腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站George Ang
 
腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程George Ang
 
腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagement腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagementGeorge Ang
 
腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享George Ang
 
腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍George Ang
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍George Ang
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍George Ang
 
腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享George Ang
 

More from George Ang (20)

Wrapper induction construct wrappers automatically to extract information f...
Wrapper induction   construct wrappers automatically to extract information f...Wrapper induction   construct wrappers automatically to extract information f...
Wrapper induction construct wrappers automatically to extract information f...
 
Opinion mining and summarization
Opinion mining and summarizationOpinion mining and summarization
Opinion mining and summarization
 
Huffman coding
Huffman codingHuffman coding
Huffman coding
 
Do not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar textDo not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar text
 
大规模数据处理的那些事儿
大规模数据处理的那些事儿大规模数据处理的那些事儿
大规模数据处理的那些事儿
 
腾讯大讲堂02 休闲游戏发展的文化趋势
腾讯大讲堂02 休闲游戏发展的文化趋势腾讯大讲堂02 休闲游戏发展的文化趋势
腾讯大讲堂02 休闲游戏发展的文化趋势
 
腾讯大讲堂03 qq邮箱成长历程
腾讯大讲堂03 qq邮箱成长历程腾讯大讲堂03 qq邮箱成长历程
腾讯大讲堂03 qq邮箱成长历程
 
腾讯大讲堂04 im qq
腾讯大讲堂04 im qq腾讯大讲堂04 im qq
腾讯大讲堂04 im qq
 
腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道
 
腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化
 
腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间
 
腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨
 
腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站
 
腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程
 
腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagement腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagement
 
腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享
 
腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
 
腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享
 

Recently uploaded

Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
Rohit Gautam
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
Alex Pruden
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 

Recently uploaded (20)

Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 

Elder

  • 1. Top 10 Data Mining Mistakes -- and how to avoid them Salford Systems Data Mining Conference New York, New York March 29, 2005 Elder Research, Inc. 635 Berkmar Circle John F. Elder IV, Ph.D. Charlottesville, Virginia 22901 elder@datamininglab.com 434-973-7673 www.datamininglab.com © 2005 Elder Research, Inc. 0
  • 2. © 2005 Elder Research, Inc. 1 © Hildebrandts
  • 3. You've made a mistake if you… 0. Lack Data 1. Focus on Training 6. Discount Pesky Cases 2. Rely on One Technique 7. Extrapolate 3. Ask the Wrong Question 8. Answer Every Inquiry 4. Listen (only) to the Data 9. Sample Casually 5. Accept Leaks from the Future 10. Believe the Best Model © 2005 Elder Research, Inc. 2
  • 4. 0. Lack Data • Need labeled cases for best gains (to classify or estimate; clustering is much less effective). Interesting known cases may be exceedingly rare. Some projects probably should not proceed until enough critical data is gathered to make it worthwhile. • Ex: Fraud Detection (Government contracting): Millions of transactions, a handful of known fraud cases; likely that large proportion of fraud cases are, by default, mislabeled clean. Only modest results (initially) after strenuous effort. • Ex: Fraud Detection (Taxes; collusion): Surprisingly many known cases -> stronger, immediate results. • Ex: Credit Scoring: Company (randomly) gave credit to thousands of applicants who were risky by conventional scoring method, and monitored them for two years. Then, estimated risk using what was known at start. This large investment in creating relevant data paid off. © 2005 Elder Research, Inc. 3
  • 5. 1. Focus on Training • Only out-of-sample results matter. (Otherwise, use a lookup table!) • Cancer detection Ex: MD Anderson doctors and researchers (1993), using neural networks, surprised to find that longer training (week vs. day) led to only slightly improved training results, and much worse evaluation results. • (ML/CS often sought models with exact results on known data -> overfit.) • Re-sampling (bootstrap, cross-validation, jackknife, leave-one-out...) is an essential tool. (Traditional significance tests are a weak defense when structure is part of search, though stronger penalty-based metrics are useful.) However, note that resampling no longer tests a single model, but a model class, or a modeling process (Whatever is held constant throughout the sampling passes.) © 2005 Elder Research, Inc. 4
  • 6. 2. Rely on One Technique • "To a little boy with a hammer, all the world's a nail." For best work, need a whole toolkit. • At very least, compare your method to a conventional one (linear regression say, or linear discriminant analysis). • Study: In refereed Neural Network journals, over 3 year period, only 1/6 articles did both ~1 & ~2; that is, test on unseen data and compare to a widely-used technique. • Not checking other methods leads to blaming the algorithm for the results. But, it’s somewhat unusual for the particular modeling technique to make a big difference, and when it will is hard to predict. • Best: use a handful of good tools. (Each adds only 5-10% effort.) © 2005 Elder Research, Inc. 5
  • 7. Data Mining Products Model 1 © 2005 Elder Research, Inc. 6
  • 8. Decision Tree Nearest Neighbor Delaunay Triangles Kernel Neural Network (or © 2005 Elder Research, Inc. 7 Polynomial Network)
  • 9. Relative Performance Examples: 5 Algorithms on 6 Datasets (John Elder, Elder Research & Stephen Lee, U. Idaho, 1997) Error Relative to Peer Techniques (lower is better) © 2005 Elder Research, Inc. 8
  • 10. Error Relative to Peer Techniques (lower is better) Essentially every Bundling method improves performance © 2005 Elder Research, Inc. 9
  • 11. 3. Ask the Wrong Question a) Project Goal: Aim at the right target • Fraud Detection (Positive example!) (Shannon Labs work on Int'l calls): Didn't attempt to classify fraud/nonfraud for general call, but characterized normal behavior for each account, then flagged outliers. -> A brilliant success. b) Model Goal: Get the computer to "feel" like you do. [e.g., employee stock grants vs. options] • Most researchers are drawn into the realm of squared error by its convenience (mathematical beauty). But ask the computer to do what's most helpful for the system, not what's easiest for it. [Stock price ex.] © 2005 Elder Research, Inc. 10
  • 12. 4. Listen (only) to the Data 4a. Opportunistic data: • [School funding ex.] Self-selection. Nothing inside the data protects analyst from significant, but wrong result. 4b. Designed experiment: • [Tanks vs. Background with Neural networks]: Great results on out- of-sample portion of database. But found to depend on random pixels sunny (Tanks photographed on ______ day, Background only on _______).cloudy • [Tanks & Networks 2]: Tanks and Trucks on rotating platforms, to train to discriminate at different angles. Used radar, Fourier transforms, principle components, polynomial networks. But, source of the key signal = platform corner And, discriminated between the _____________. bushes two classes primarily using _______. © 2005 Elder Research, Inc. 11
  • 13. 5. Accept Leaks from the Future • Forecasting ex.: Interest rate at Chicago Bank. output was a candidate input N.net 95% accurate, but _________________________. • today Financial ex. 2: moving average of 3 days, but centered on ______. • Many passes may be needed to expel anachronistic leakers - D. Pyle. • Look for variables which work (too) well. Insurance Ex: code associated with 25% of purchasers turned out to describe type of cancellation. • Date-stamp records when storing in Data Warehouse, or Don't overwrite old value unless archived. • Survivor Bias [financial ex.] © 2005 Elder Research, Inc. 12
  • 14. 6. Discount Pesky Cases • Outliers may be killing results (ex: decimal point error on price), or be the whole answer (ex: Ozone hole), so examine carefully. • The most exciting phrase in research isn't "Aha!", but "That's odd…" • Internal inconsistencies in the data may be clues to problems with the process flow of information within the company; a larger business problem may be revealed. • Direct Mail example: persisting in hunting down oddities found errors by Merge/Purge house, and was a major contributor to doubling sales per catalog. • Visualization can cover a multitude of assumptions. © 2005 Elder Research, Inc. 13
  • 15. 7. Extrapolate • Tend to learn too much from first few experiences. • Hard to "erase" factoids after an upstream error is discovered. • Curse of Dimensionality: low-dimensional intuition is useless in high-d. • Philosophical: Evolutionary Paradigm: Believe we can start with pond scum (pre-biotic soup of raw materials) + zap + time + chance + differential reinforcement -> a critter. (e.g., daily stock prices + MARS -> purchase actions, or pixel values + neural network -> image classification) Better paradigm is selective breeding: mutts + time + directed reinforcement -> greyhound Higher-order features of data + domain expertise essential © 2005 Elder Research, Inc. 14
  • 16. “Of course machines can think. After all, humans are just machines made of meat.” - MIT CS professor Human and computer strengths are more complementary than alike. © 2005 Elder Research, Inc. 15
  • 17. 8. Answer Every Inquiry • "Don't Know" is a useful model output state. • Could estimate the uncertainty for each output (a function of the number and spread of samples near X). Few algorithms provide a conditional σ with their conditional µ. Global Rd Optimization when Probes are Expensive (GROPE) © 2005 Elder Research, Inc. 16
  • 18. 9. Sample without Care • 9a Down-sample Ex: MD Direct Mailing firm had too many non-responders (NR) for model (about 99% of >1M cases). So took all responders, and every 10th NR to create a more balanced database of 100K cases. Model predicted that everyone in Ketchikan, Wrangell, and Ward Cove Alaska Sorted zip code would respond. (______ data, by ________, and 100Kth case drawn before bottom of file (999**) reached.) • "Shake before baking". Also, add case number (and other random variables) to candidate list; use as "canaries in the mine" to signal trouble when chosen. • 9b Up-sampling Ex: Credit Scoring - paucity of interesting (Default) cases led to quintupling them. Cross-validation employed with many techniques and modeling cycles. Results tended to improve with the complexity of the models. Oddly, this didn't reverse. Noticed that Default (rare) cases were better estimated by complex models but others were worse. (Had duplicated up-sampling _________ Defaults in each set by ___________ before splitting.) -> Split first. • It's hard to beat a stratified sample; that is, proportional sample from each group. [even with Data Squashing - W. DeMouchel] © 2005 Elder Research, Inc. 17
  • 19. 10. Believe the Best Model • Interpretability not always necessary. Model can be useful without being "correct" or explanatory. • Often, particular variables used by "best" model (which barely won out over hundreds of others of the millions (to billions) tried, using a score function only approximating one's goals, and on finite data) have too much attention paid to them. (Un-interpretability could be a virtue!). • Usually, many very similar variables are available, and the particular structure of the best model can vary chaotically. [Polynomial Network Ex.] But, structural similarity is different from functional similarity. (Competing models often look different, but act the same.) • Best estimator is likely to be bundle of models. [Direct Marketing trees] [5x6 table] [Credit Scoring ex.] © 2005 Elder Research, Inc. 18
  • 20. Lift Chart: %purchasers vs. %prospects • Ex: Last quintile of customers are 4 times more expensive to obtain than first quintile (10% vs. 40% to gain 20%) • Decision Tree provides relatively few decision points. © 2005 Elder Research, Inc. 19
  • 21. Bundling 5 Trees improves accuracy and smoothness © 2005 Elder Research, Inc.
  • 22. Credit Scoring Model Performance #Defaulters Missed (fewer is better) Bundled Trees SNT NT Stepwise Regression NS ST Polynomial Network MT PS SPT PNT SMT SPNT Neural Network PT NP MPT SPN SMPT MARS MNT MS SMN SMP SMNT MN SMPNT SMPN MP MPNT MPN #Models Combined (averaging output rank) 21 © 2005 Elder Research, Inc.
  • 23. Median (and Mean) Error Reduced with each Stage of Combination 75 M 70 i s 65 s e d 60 55 1 2 3 4 5 No. Models in combination © 2005 Elder Research, Inc. 22
  • 24. Fancier tools and harder problems –> more ways to mess up. How then can we succeed? Success <- Learning <- Experience <- Mistakes (so go out and make some good ones!) PATH to success: • Persistence - Attack repeatedly, from different angles. Automate essential steps. Externally check work. • Attitude - Optimistic, can-do. • Teamwork - Business and statistical experts must cooperate. Does everyone want the project to succeed? • Humility - Learning from others requires vulnerability. Don’t expect too much of technology. © 2005 Elder Research, Inc. 23
  • 25. John F. Elder IV Chief Scientist, ERI Dr. John Elder heads a data mining consulting team with offices in Charlottesville, Virginia and Washington DC, and close affiliates in Boston, New York, San Diego, and San Francisco (www.datamininglab.com). Founded in 1995, Elder Research, Inc. focuses on investment and commercial applications of pattern discovery and optimization, including stock selection, image recognition, process optimization, cross-selling, biometrics, drug efficacy, credit scoring, market timing, and fraud detection. John obtained a BS and MEE in Electrical Engineering from Rice University, and a PhD in Systems Engineering from the University of Virginia, where he’s an adjunct professor, teaching Optimization. Prior to a decade leading ERI, he spent 5 years in aerospace defense consulting, 4 heading research at an investment management firm, and 2 in Rice's Computational & Applied Mathematics department. Dr. Elder has authored innovative data mining tools, is active on Statistics, Engineering, and Finance conferences and boards, is a frequent keynote conference speaker, and was a Program co-chair of the 2004 Knowledge Discovery and Data Mining conference. John’s courses on data analysis techniques -- taught at dozens of universities, companies, and government labs -- are noted for their clarity and effectiveness. Dr. Elder holds a top secret clearance, and since the Fall of 2001, has been honored to serve on a panel appointed by Congress to guide technology for the National Security Agency. John is a follower of Christ and the proud father of 5. © 2005 Elder Research, Inc.