Salford Systems
   Dan Steinberg
Mikhail Golovnya
   Data mining is the search for patterns in data
    using modern, highly automated,
    computer-intensive methods

    ◦ Data mining may be best defined as the use of a specific
      class of tools (data mining methods) in the analysis of
      data

    ◦ The term “search” is key to this definition, as is
      “automated”

   The literature often refers to finding hidden
    information in data
Data Mining                Data Mining           Cont.

• Predictive Analytics     • Statistics          • OLAP
• Machine Learning         • Computer science    • CART
• Pattern Recognition      • Insurance           • SVM
• Artificial Intelligence  • Finance             • NN
• Business Intelligence    • Marketing           • CRISP-DM
• Data Warehousing         • Robotics            • CRM
                           • Biotech             • KDD
                           • Sports Analytics    • Etc.
   Data guides the analysis; it is the “Alpha and
    Omega” of everything you do

   Analyst asks the right questions but makes
    no assumptions

   The success of data mining depends solely on
    the quality of the available data
    ◦ Famous “Garbage In – Garbage Out” principle
   (Insert visual aid)
   In a nutshell: use historical data to gain
    insights into, and/or make predictions on, new data
   Any game is the ultimate and unambiguous source of
    quality data

    ◦ This is very different from the data availability and quality
      in other areas of research

   However, there is no universal agreement on the best
    way of organizing and summarizing the results in a
    numeric form

    ◦ Large number of various game statistics available

    ◦ Common sense and game rules are at the core

    ◦ Heated debates on which stats best describe the potential
      for a future win
   (insert screenshot of Baseball-reference.com)

   Available from many sources, including the
    Internet

   Player level: summarize performance in a season,
    post season, and entire career

   Team level: wins and losses

   Game level: most detailed
   (insert Sean Lahman website screenshot)

   Widely known public database

   Gathers baseball stats all the way back to
    1871

   Will use parts of it to illustrate the potential
    of data mining
   Focusing on the 2010 regular season performance in
    both leagues

   Have access to the player stats for the entire season
    organized in a flat table

   Define overall player success simply as the
    player's team winning its division

    ◦ Thus the players on 6 of the 30 participating teams in 2010
      are labeled as successes

   Question: Which of the player stats are associated
    with the team winning the division?
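
   The success measure above can be sketched as a simple labeling step. This is a minimal illustration, not the authors' code: the player IDs, team codes, and the `label_division_winners` helper are hypothetical, and the set of winners is passed in rather than taken from any real table.

```python
# Toy batter rows; player and team codes are hypothetical placeholders.
batters = [
    {"player": "p01", "team": "AAA", "HR": 31, "SO": 140},
    {"player": "p02", "team": "BBB", "HR": 12, "SO": 85},
    {"player": "p03", "team": "AAA", "HR": 4, "SO": 40},
]


def label_division_winners(rows, winners):
    """Return copies of rows with WIN = 1 if the player's team won its division."""
    return [{**row, "WIN": int(row["team"] in winners)} for row in rows]


# In the setup described above, this set would hold the 6 division winners.
labeled = label_division_winners(batters, winners={"AAA"})
for row in labeled:
    print(row["player"], row["team"], row["WIN"])
```

   Every batter inherits the label of his team, so the target says nothing about individual merit; it only marks which side of the division race the record belongs to.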
Core Stats

• AB: At Bats
• R: Runs
• H: Hits
• 2B: Doubles
• 3B: Triples
• HR: Home Runs
• RBI: Runs Batted In
• SB: Stolen Bases
• CS: Caught Stealing
• BB: Base on Balls
• SO: Strikeouts
• SF: Sacrifice Flies
• HBP: Hit by Pitch

Derived Stats

• AVG: Batting Average = H/AB
• TB: Total Bases = 1B + 2×2B + 3×3B + 4×HR (1B = singles)
• SLG: Slugging = TB/AB
• OBP: On Base Percentage = (H+BB+HBP)/(AB+BB+SF+HBP)
• OPS: On Base Plus Slugging = OBP+SLG
• …: Many more exist
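
   The derived stats follow mechanically from the core stats. The sketch below applies the formulas listed above to one made-up batter line; the function name and the example numbers are illustrative only, and singles are assumed to be hits that are not doubles, triples, or home runs.

```python
def derived_batting_stats(s):
    """Compute AVG, TB, SLG, OBP and OPS from a dict of core batting stats."""
    # Singles are not a core stat, so derive them from hits.
    singles = s["H"] - s["2B"] - s["3B"] - s["HR"]
    tb = singles + 2 * s["2B"] + 3 * s["3B"] + 4 * s["HR"]
    ab = s["AB"]
    obp_den = ab + s["BB"] + s["SF"] + s["HBP"]
    # Guard against batters with no at bats or plate appearances.
    avg = s["H"] / ab if ab else 0.0
    slg = tb / ab if ab else 0.0
    obp = (s["H"] + s["BB"] + s["HBP"]) / obp_den if obp_den else 0.0
    return {"AVG": avg, "TB": tb, "SLG": slg, "OBP": obp, "OPS": obp + slg}


# Made-up season line, for illustration only.
example = {"AB": 500, "H": 150, "2B": 30, "3B": 5, "HR": 20,
           "BB": 60, "SF": 5, "HBP": 5}
stats = derived_batting_stats(example)
print({name: round(value, 3) for name, value in stats.items()})
```

   Because every derived stat is a fixed function of the core stats, a flexible model fed only the core stats can in principle rediscover any of these combinations on its own.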
   (insert scatter matrix)

   This is how the problem is usually attacked

   Each dot represents a single batter record for the whole
    2010 season

   1245 overall records

   16 core stats

   Winning team batters are marked in red

   No obvious insights!
   Leo Breiman, Jerome Friedman, Richard Olshen and
    Charles Stone

   Starting with CART in 1984, laid the foundation for tree-
    based modeling techniques

   Conduct a deep look into all available data

   Point out most relevant variables and features

   Automatically identify optimal transformations

   Capable of extracting complex patterns going way beyond
    the traditional “single performance at a time” approach
   (insert graph)
   6 core batter stats were identified as most predictive

   About 20% of total variation can be directly associated with
    the batter stats

   The single-variable plots show the non-linear nature of many
    of the relationships

   Fine plot irregularities should be ignored

   Striking result: HR above 30 is associated with losing the
    division

   Proceed by digging into pair-wise contribution plots
   (insert images)
   The colored area within each plot shows pairs
    that actually occur in the data

   Areas associated with contribution towards
    team win are marked in red

   Contributions towards team defeat are
    marked in blue
   (insert graphs)
   These two plots further highlight the rather
    unusual HR finding

   It is a well-known fact that batters aiming at
    a home run have a higher number of strikeouts

   However, in the 2010 regular season the
    HR-centered approach led to defeat!
   (insert graph)

   This plot represents two performance stats
    plotted against each other taken “as is” from
    the original data table

   Note the difficulty of discerning the identified
    HR × SO pattern visually because of “shadow”
    projections
   (Insert graphs)
   (Insert screen shot of Baseball-reference.com
    and standard pitching chart)

   Similar to batting stats

   Large number of derived stats exists
Core Stats

• W: Wins
• L: Losses
• H: Hits Allowed
• BFP: Batters Faced
• R: Runs Allowed
• HR: Home Runs Allowed
• WP: Wild Pitches
• IPOUTS: Outs Pitched
• SHO: Shutouts
• BB: Base on Balls
• SO: Strikeouts
• ER: Earned Runs
• HBP: Batters Hit by Pitch

Derived Stats

• ERA: Earned Run Average = 9×ER/InningsPitched
• DICE: Defense Independent Component = 3.0+(13HR+3(BB+HBP)-2SO)/IP
• FIP: Fielding Independent Pitching = 3.1+(13HR+3BB-2SO)/IP
• dERA: Defense Independent ERA (10-line algorithm)
• CERA: Component ERA (long, convoluted equation)
• …: Many more exist
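
   As with batting, the simpler derived pitching stats are direct functions of the core stats. The sketch below uses the ERA, DICE, and FIP formulas listed above and assumes innings pitched (IP) equal IPOUTS divided by 3; the function name and the example numbers are made up for illustration.

```python
def derived_pitching_stats(s):
    """Compute ERA, DICE and FIP from a dict of core pitching stats."""
    ip = s["IPOUTS"] / 3.0  # three outs per inning
    if ip == 0:
        raise ValueError("pitcher recorded no outs; rate stats are undefined")
    era = 9 * s["ER"] / ip
    dice = 3.0 + (13 * s["HR"] + 3 * (s["BB"] + s["HBP"]) - 2 * s["SO"]) / ip
    fip = 3.1 + (13 * s["HR"] + 3 * s["BB"] - 2 * s["SO"]) / ip
    return {"IP": ip, "ERA": era, "DICE": dice, "FIP": fip}


# Made-up season line, for illustration only.
example = {"IPOUTS": 600, "ER": 80, "HR": 20, "BB": 50, "HBP": 5, "SO": 180}
pitching = derived_pitching_stats(example)
print({name: round(value, 3) for name, value in pitching.items()})
```

   Note that DICE and FIP reward strikeouts and penalize walks and home runs allowed with fixed weights, whereas a data-driven model is free to find different, possibly non-linear, trade-offs.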
   (Insert charts)

   Started by feeding a complete set of available
    26 pitching stats for 2010 season
    performance

   Using top variable elimination followed by
    bottom variable elimination, reduced the
    list to only 8 important stats
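
   The slides do not spell out the elimination procedure, so the following is only a generic sketch of the idea under stated assumptions: variables are dropped one per cycle, either the most important one ("top") or the least important one ("bottom"), with importances recomputed after each drop. The `importance_fn` callback stands in for refitting a model, and the scores in `TOY_SCORES` are hypothetical numbers, not results from the analysis.

```python
def eliminate_variables(features, importance_fn, n_keep, from_top=False):
    """Drop one variable per cycle until n_keep remain.

    importance_fn(names) -> {name: score}; it stands in for refitting a model
    on the remaining variables. from_top=True drops the highest-scoring
    variable each cycle, otherwise the lowest-scoring one.
    """
    remaining = list(features)
    dropped = []
    pick = max if from_top else min
    while len(remaining) > n_keep:
        scores = importance_fn(remaining)
        victim = pick(remaining, key=lambda name: scores[name])
        remaining.remove(victim)
        dropped.append(victim)
    return remaining, dropped


# Hypothetical, fixed importance scores used only to make the sketch runnable.
TOY_SCORES = {"SO": 0.9, "BB": 0.7, "HR": 0.6, "WP": 0.3, "SHO": 0.1, "BFP": 0.05}


def toy_importance(names):
    return {name: TOY_SCORES[name] for name in names}


kept, dropped = eliminate_variables(TOY_SCORES, toy_importance, n_keep=3)
print("kept:", kept)
print("dropped:", dropped)
```

   Eliminating from the top is typically used to check whether a model leans too heavily on one dominant predictor, while eliminating from the bottom prunes variables that add little; how the two passes were combined here is not stated in the slides.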
   (insert graphs)
   (insert graphs)

   Keep the strikeouts high and the base on
    balls low to win the division!
   (insert graphs)

   Remember that these are pitchers not batters

   More wild pitches, more home runs
    allowed, more strikeouts => the division is
    won!
   (insert graph)

   Conventional plot IGNORES other dimensions
    which effectively project on top of each other

   As a result, there is a lot of confusion on the
    plot, making it difficult to see any pattern

   In contrast, the TreeNet (TN) dependence plot shows the
    contribution of the given pair AFTER the influence of
    other dimensions has been eliminated
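
   One common way to remove the influence of other dimensions is a partial-dependence calculation: force the variable of interest to a fixed value in every record, average the model's predictions, and repeat over a grid of values. The sketch below assumes the dependence plots are of this kind, which the slides do not state explicitly; `toy_model` is a hypothetical stand-in for a fitted model, and the rows are made up.

```python
def partial_dependence(model, rows, feature, grid):
    """Average model prediction with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            # Override only the feature of interest; keep the rest as observed.
            total += model({**row, feature: value})
        curve.append(total / len(rows))
    return curve


def toy_model(row):
    """Hypothetical fitted model: any callable mapping a row to a score works."""
    return 0.01 * row["SO"] - 0.02 * row["BB"]


# Made-up pitcher records, for illustration only.
rows = [{"SO": 100, "BB": 30}, {"SO": 150, "BB": 50}, {"SO": 200, "BB": 70}]
curve = partial_dependence(toy_model, rows, "SO", grid=[100, 200])
print([round(v, 3) for v in curve])
```

   The same averaging works for a pair of variables by overriding both at once over a two-dimensional grid, which yields a surface rather than a curve. A raw scatter plot, by contrast, shows every record with all of its other stats still in play.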
   (insert graphs)

   These plots represent the results of running
    conventional linear regression (LR) on the
    pitching data

   While the anomalous HR effect is present, the
    model fails to identify the fine local
    nature of the phenomenon

   LR does not provide enough “resolution”
   It appears that in the 2010 regular season a
    home-run-driven strategy did not work!

   At least, this is what the data tell us; further
    understanding will require experts in the field

   Core stats have good explanatory potential once put into
    a true multivariate modeling framework

   Conventional statistical approaches do not have enough
    “resolution” to see the real details

   Modern data mining helps identify realized patterns
    and allows a quick and efficient check of the usefulness of
    the various performance measures available to a manager or
    researcher
   NEVER FALL FOR THESE

   Absolute Powers: data mining will finally find and
    explain everything

   Gold Rush: with the right tool one can beat the stock
    market or predict the World Series winner and become
    obscenely rich

   Quest for the Holy Grail: the search for an algorithm that
    will always produce 100% accurate models

   Magic Wand: getting a complete solution from start
    to finish with a single button push
The End

Data mining for baseball new ppt

  • 1. Salford Systems Dan Steinberg Mikhail Golovnya
  • 2. Data mining is the search for patterns in data using modern, highly automated, computer-intensive methods ◦ Data mining may be best defined as the use of a specific class of tools (data mining methods) in the analysis of data ◦ The term “search” is key to this definition, as is “automated” • The literature often refers to finding hidden information in data
  • 3. Data Mining, Cont. ◦ Predictive Analytics • Machine Learning • Pattern Recognition • Artificial Intelligence • Business Intelligence • Data Warehousing ◦ Statistics • Computer science • Insurance • Finance • Marketing • Robotics • Biotech • Sports Analytics ◦ OLAP • CART • SVM • NN • CRISP-DM • CRM • KDD • Etc.
  • 4. Data guides the analysis; it is the “Alpha and Omega” of everything you do • The analyst asks the right questions but makes no assumptions • The success of data mining depends solely on the quality of the available data ◦ The famous “Garbage In – Garbage Out” principle
  • 5. (Insert visual aid) • In a nutshell: use historical data to gain insights into, and/or make predictions about, new data
  • 6. Any game is the ultimate and unambiguous source of quality data ◦ This is very different from the data availability and quality in other areas of research • However, there is no universal agreement on the best way of organizing and summarizing the results in numeric form ◦ A large number of game statistics are available ◦ Common sense and game rules are at the core ◦ Heated debates continue over which stats best describe the potential for a future win
  • 7. (insert screenshot of Baseball-reference.com) • Available from many sources, including the Internet • Player level: summarizes performance in a season, postseason, and entire career • Team level: wins and losses • Game level: most detailed
  • 8. (insert Sean Lahman website screenshot) • Widely known public database • Gathers baseball stats all the way back to 1871 • We will use parts of it to illustrate the potential of data mining
  • 9. Focusing on 2010 regular-season performance in both leagues • We have access to the player stats for the entire season, organized in a flat table • Define overall player success simply as the player's team winning its division ◦ Thus 6 of the 30 participating teams in 2010 are declared successes • Question: Which of the player stats are associated with the team winning the division?
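The success definition on this slide can be sketched in a few lines. This is a minimal illustration, assuming a flat table of batter records that carries a team identifier; the field names (`player`, `team`, `HR`), the made-up records, and the `label_success` helper are hypothetical and do not come from the slides or from any particular database schema.

```python
def label_success(records, division_winners):
    """Return copies of the records with a binary 'success' target attached.

    A record is a success (1) when the player's team won its division.
    """
    winners = set(division_winners)
    return [{**rec, "success": int(rec["team"] in winners)} for rec in records]


# Hypothetical batter records; team codes and numbers are made up.
records = [
    {"player": "batter_a", "team": "T01", "HR": 31},
    {"player": "batter_b", "team": "T02", "HR": 12},
    {"player": "batter_c", "team": "T01", "HR": 5},
]

labeled = label_success(records, division_winners=["T01"])
print([r["success"] for r in labeled])
```

Every player on a division-winning team gets the same label, so the target describes the team outcome rather than individual merit; the modeling question is then which individual stats are associated with that label.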
  • 10. ◦ Core Stats: AB (At Bats) • R (Runs) • H (Hits) • 2B (Doubles) • 3B (Triples) • HR (Home Runs) • RBI (Runs Batted In) • SB (Stolen Bases) • CS (Caught Stealing) • BB (Base on Balls) • SO (Strikeouts) • SF (Sacrifice Flies) • HBP (Hit by Pitch) ◦ Derived Stats: AVG (Batting Average) = H/AB • TB (Total Bases) = 1B + 2×2B + 3×3B + 4×HR • SLG (Slugging) = TB/AB • OBP (On Base Percentage) = (H+BB+HBP)/(AB+BB+SF+HBP) • OPS (On Base Plus Slugging) = OBP+SLG • … (many more exist)
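The derived batting stats on this slide follow directly from the core stats. The sketch below applies the slide's formulas, assuming the first term of the Total Bases formula means singles, computed as hits minus extra-base hits. The function name, the dictionary layout, and the sample season line are illustrative assumptions, not real player data.

```python
def derived_batting_stats(s):
    """Compute AVG, TB, SLG, OBP and OPS from a dict of core batting stats."""
    # Singles are hits that are not doubles, triples, or home runs.
    singles = s["H"] - s["2B"] - s["3B"] - s["HR"]
    tb = singles + 2 * s["2B"] + 3 * s["3B"] + 4 * s["HR"]

    ab = s["AB"]
    obp_den = ab + s["BB"] + s["SF"] + s["HBP"]

    # Guard against division by zero for players with no plate activity.
    avg = s["H"] / ab if ab else 0.0
    slg = tb / ab if ab else 0.0
    obp = (s["H"] + s["BB"] + s["HBP"]) / obp_den if obp_den else 0.0

    return {"AVG": avg, "TB": tb, "SLG": slg, "OBP": obp, "OPS": obp + slg}


# Made-up season line used only to exercise the formulas.
season = {"AB": 500, "H": 150, "2B": 30, "3B": 5, "HR": 20,
          "BB": 50, "SF": 5, "HBP": 5}
stats = derived_batting_stats(season)
print(stats)
```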
  • 11. (insert scatter matrix) • This is how the problem is usually attacked • Each dot represents a single batter record for the whole 2010 season • 1245 records overall • 16 core stats • Winning-team batters are marked in red • No obvious insights!
  • 12. Leo Breiman, Jerome Friedman, Richard Olshen and Charles Stone • Starting with CART in 1984, laid the foundation for tree-based modeling techniques • These techniques take a deep look at all available data • Point out the most relevant variables and features • Automatically identify optimal transformations • Are capable of extracting complex patterns, going well beyond the traditional “single performance at a time” approach
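To give a feel for what a tree-based method does at each step, here is a minimal sketch of a single CART-style split on one variable, using Gini impurity as the splitting criterion. It is an illustration of the general idea only, not the software used in the presentation; the function names and the toy data are assumptions.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 2 * p * (1 - p)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(values, labels):
    """Find the threshold on one variable that minimizes weighted Gini impurity.

    Returns (threshold, impurity); threshold is None when no split improves
    on the impurity of the unsplit data.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_threshold, best_impurity = None, gini(labels)
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold separates identical values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best_impurity:
            best_threshold, best_impurity = (lo + hi) / 2.0, impurity
    return best_threshold, best_impurity


# Made-up values of one stat and a made-up 0/1 target.
threshold, impurity = best_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1])
print(threshold, impurity)
```

A full tree repeats this search over every variable, keeps the best split, and recurses into the two resulting subsets, which is how variable selection and non-linear thresholds emerge without being specified in advance.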
  • 13. (insert graph) • 6 core batter stats were identified as most predictive • About 20% of total variation can be directly associated with the batter stats • The single-variable plots show the non-linear nature of many of the relationships • Fine plot irregularities should be ignored • Striking result: HR above 30 is associated with losing the division • Proceed by digging into pair-wise contribution plots
  • 14. (insert images) • The colored area within each plot shows pairs that actually occur in the data • Areas associated with contribution towards team win are marked in red • Contributions towards team defeat are marked in blue
  • 15. (insert graphs) • These two plots further highlight the rather unusual HR finding • It is well known that batters aiming for a home run have a higher number of strikeouts • However, in the 2010 regular season the HR-centered approach led to defeat!
  • 16. (insert graph) • This plot shows two performance stats plotted against each other, taken “as is” from the original data table • Note the difficulty of discerning the identified HR × SO pattern visually because of “shadow” projections
  • 17. (Insert graphs)
  • 18. (Insert screenshot of Baseball-reference.com and standard pitching chart) • Similar to batting stats • A large number of derived stats exist
  • 19. ◦ Core Stats: W (Wins) • L (Losses) • H (Hits Allowed) • BFP (Batters Faced) • R (Runs Allowed) • HR (Home Runs Allowed) • WP (Wild Pitches) • IPOUTS (Outs Pitched) • SHO (Shutouts) • BB (Base on Balls) • SO (Strikeouts) • ER (Earned Runs) • HBP (Batters Hit by Pitch) ◦ Derived Stats: ERA (Earned Run Average) = 9×ER/InningsPitched • DICE (Defense Independent Component) = 3.0+(13HR+3(BB+HBP)-2SO)/IP • FIP (Fielding Independent Pitching) = 3.1+(13HR+3BB-2SO)/IP • dERA (Defense Independent ERA): 10-line algorithm • CERA (Component ERA): long convoluted equation • … (many more exist)
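The simpler derived pitching stats can be computed directly from the core stats. The sketch below uses the formulas and constants exactly as they appear on the slide and assumes innings pitched equals IPOUTS divided by three. The function name, the dictionary layout, and the sample season line are illustrative assumptions; dERA and CERA are left out because the slide does not give their formulas.

```python
def derived_pitching_stats(s):
    """Compute IP, ERA, DICE and FIP from a dict of core pitching stats."""
    ip = s["IPOUTS"] / 3.0  # innings pitched: three outs per inning
    if ip == 0:
        raise ValueError("no innings pitched; rate stats are undefined")

    era = 9.0 * s["ER"] / ip
    # Constants 3.0 and 3.1 are taken from the slide as-is.
    dice = 3.0 + (13 * s["HR"] + 3 * (s["BB"] + s["HBP"]) - 2 * s["SO"]) / ip
    fip = 3.1 + (13 * s["HR"] + 3 * s["BB"] - 2 * s["SO"]) / ip

    return {"IP": ip, "ERA": era, "DICE": dice, "FIP": fip}


# Made-up season line used only to exercise the formulas.
season = {"IPOUTS": 600, "ER": 80, "HR": 20, "BB": 60, "HBP": 5, "SO": 180}
pitching = derived_pitching_stats(season)
print(pitching)
```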
  • 20. (Insert charts) • Started by feeding in the complete set of 26 available pitching stats for 2010 season performance • Using top variable elimination followed by bottom variable elimination, reduced the list to only 8 important stats
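The slide does not spell out how the elimination steps work. As a rough illustration of the "bottom" half only, the sketch below shows generic greedy backward elimination: repeatedly drop the variable whose removal costs the least, according to a caller-supplied scoring function. This is an assumption about the general idea, not the presenters' actual procedure; `bottom_elimination`, `toy_score`, and the weights are hypothetical.

```python
def bottom_elimination(features, score, keep):
    """Greedy backward elimination.

    At each step, try removing each remaining feature and keep the subset
    with the highest score, until only `keep` features remain.
    `score` is any function mapping a list of feature names to a number.
    """
    current = list(features)
    while len(current) > keep:
        candidates = [[f for f in current if f != drop] for drop in current]
        current = max(candidates, key=score)
    return current


# Hypothetical usefulness weights standing in for a real model's performance.
weights = {"SO": 3.0, "BB": 2.0, "WP": 1.0, "SHO": 0.1, "BFP": 0.0}


def toy_score(subset):
    """Toy score: sum of the made-up weights of the features in the subset."""
    return sum(weights[f] for f in subset)


selected = bottom_elimination(list(weights), toy_score, keep=3)
print(selected)
```

In practice the score would come from refitting a model on each candidate subset and measuring its performance on held-out data, which is far more expensive than this toy function.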
  • 21. (insert graphs)
  • 22. (insert graphs) • Keep the strikeouts high and the base on balls low to win the division!
  • 23. (insert graphs) • Remember that these are pitchers, not batters • More wild pitches, more home runs allowed, more strikeouts => the division is won!
  • 24. (insert graph) • The conventional plot IGNORES the other dimensions, which effectively project on top of each other • As a result, there is a lot of confusion on the plot, making it difficult to see any pattern • In contrast, the TN dependence plot shows the given pair's contribution AFTER the influence of the other dimensions has been eliminated
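The slides do not say how the dependence plot is computed. One common way to isolate a variable's contribution from the other dimensions is partial dependence: force the variable to a fixed value in every record, average the model's predictions, and repeat over a grid of values. The sketch below assumes that approach; the `partial_dependence` helper, the toy model, and the records are all hypothetical.

```python
def partial_dependence(model, rows, feature, grid):
    """Average model prediction with one feature forced to each grid value.

    `model` is any function mapping a record (dict) to a number.
    """
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the input rows stay untouched
            modified[feature] = value  # override the feature of interest
            total += model(modified)
        curve.append(total / len(rows))
    return curve


def toy_model(row):
    """Hypothetical score that rewards strikeouts and penalizes walks."""
    return 0.01 * row["SO"] - 0.02 * row["BB"]


# Made-up pitcher records.
rows = [{"SO": 100, "BB": 30}, {"SO": 150, "BB": 50}, {"SO": 200, "BB": 70}]
curve = partial_dependence(toy_model, rows, "SO", grid=[100, 200])
print(curve)
```

Because the other variables keep their observed values while one variable is varied, their influence is averaged out rather than projected onto the plot, which is the contrast with a raw scatter plot that the slide draws.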
  • 25. (insert graphs) • These plots show the results of running conventional linear regression (LR) on the pitching data • While the anomalous HR effect is present, the model fails to identify the fine local nature of the phenomenon • LR does not provide enough “resolution”
  • 26. It appears that in the 2010 regular season a home-run-driven strategy did not work! • At least, this is what the data tell us; further understanding will require experts in the field • Core stats have good explanatory potential once put into a true multivariate modeling framework • Conventional statistical approaches do not have enough “resolution” to see the real details • Modern data mining helps identify realized patterns and allows a quick and efficient check of the usefulness of the various performance measures available to a manager or researcher
  • 27. NEVER FALL FOR THESE • Absolute Powers: data mining will finally find and explain everything • Gold Rush: with the right tool one can beat the stock market or predict the World Series winner and become obscenely rich • Quest for the Holy Grail: the search for an algorithm that will always produce 100% accurate models • Magic Wand: getting a complete solution from start to finish with a single button push