Statistical Engineering and BIG DATA
"Big Data" - A Challenge for Statistical Leadership
Chicago Chapter ASA, SAY Award Luncheon
Roger W. Hoerl
Union College, Schenectady, NY
With significant input from Ron Snee
Abstract
The Wall Street Journal, New York Times, and other respected publications have recently run major features on Big Data - the massive data sets which are becoming commonplace - and on the new, "sexy" data mining methods developed to analyze them. These articles, as well as much of the professional data mining and Big Data literature, may give casual users the impression that if one has a powerful enough algorithm and a lot of data, good models and good results are guaranteed at the push of a button. Obviously, this is not the case. The leadership challenge to the statistical profession is to ensure that Big Data projects are built upon a sound foundation of good modeling, and not upon the sandy foundation of hype and unstated assumptions. Further, we need to accomplish this without giving the impression that we are "against" Big Data or newer methods. I feel that the principles of statistical engineering (see Anderson-Cook and Lu 2012) can provide a path to do just this. Three statistical engineering principles that are often overlooked or underemphasized by Big Data enthusiasts are: the importance of data quality - knowing the "pedigree" of the data; the need to view statistical studies as part of the sequential process of scientific discovery - versus the "one-shot study" so common in textbooks; and the criticality of using subject-matter knowledge when developing models. I will present examples of the severe problems that can arise in Big Data studies when these principles are not understood or are ignored. In summary, I argue that the development of Big Data analytics provides significant opportunities to the profession, but at the same time requires a more proactive role from us, if we are to provide true leadership in the Big Data phenomenon.
Outline
- Statistical Leadership (Advocacy)
- The "Big Data" Phenomenon
- What Could Possibly Go Wrong?
- Statistical Engineering, and How It Can Help
- Leading the Way - Doing Big Data the Right Way
- Summary
Statistical Leadership
- Leadership: taking people from one paradigm to another.
- Enabling people to think statistically, and apply statistical methods, requires leadership.
- Opinion: too many statisticians are satisfied being experts in the tools themselves, without worrying much about the overall impact our profession is having on society.
  - Can't see the forest for the trees.
- As a result, society too often compartmentalizes statisticians as narrow specialists, and does not view us as thought leaders; people look elsewhere for leadership.
  - Passive consultants versus proactive leaders.
- As a case in point, most professionals view the "Big Data" phenomenon as being led by computer scientists, engineers, or data scientists (whatever that means), rather than by statisticians.
- Ron Snee, Gerry Hahn, and other leaders have been noting for years that statisticians need to be more proactive, and guide society as to what needs to be done.
  - We shouldn't be satisfied being the "tools guys."

"Everything Rises and Falls on Leadership." - John Maxwell
Data Mining and Big Data
- The technologies for acquiring, storing, and processing data have been growing exponentially ("Big Data"), providing new opportunities to "mine" the data.
  - According to IBM, there are now 1.6 zettabytes (10^21 bytes) of digital data available.
  - To use 1.6 zettabytes of bandwidth, you would need to watch HD TV for 47,000 years.
- "I keep saying that the sexy job in the next 10 years will be statisticians," Hal Varian, chief economist at Google. "And I'm not kidding."
- March 2012: The White House announced a national "Big Data Initiative" that consisted of six Federal departments and agencies committing more than $200 million to Big Data research projects.
- As noted by Ron Snee, data mining has been around for decades:
  - 1950s: Stepwise regression first developed at Esso (now Exxon) by Efroymson to analyze refinery data.
  - 1960s: Graphical methods developed by Tukey, Wilk, Gnanadesikan, and others at Bell Labs to gain insight from large data sets.
  - 1970s: DuPont uses data compression algorithms in process monitoring using on-line systems.

Big Data and Data Mining Are Growing Rapidly, but Are Not New.
What's New?
- Sheer size of data - often requires compression, parallel processing, and sampling to store and analyze.
- Some traditional methods are no longer relevant, e.g., hypothesis testing.
- Insight from graphical methods must be rethought - it is difficult to find outliers in zettabytes of data.
- The sample sizes, coupled with faster computing, enable much more complex models, relative to traditional data sets of, say, 30 observations.
- Due to the above, newer techniques have become popular:
  - CART and other tree-based methods; recursive splits on the data.
  - Neural networks; non-linear models involving combinations of variables - very flexible.
  - Methods based on bootstrapping - resampling and combining models; random forests, "bagging," etc.
  - Clustering and classification methods designed for massive data sets; K-means clustering, support vector machines, etc.

Good News: We Have More Data and Powerful Analysis Methods.
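As a minimal illustration of one of the methods named above, here is a sketch of K-means clustering in pure Python. This is not from the talk; the two-dimensional data and the choice of k = 2 are invented for illustration, and real Big Data work would use an optimized library rather than this toy loop.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Basic K-means on 2-D points: assign each point to its nearest
    center, then move each center to the mean of its assigned points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Nearest current center, by squared Euclidean distance.
            j = min(range(k),
                    key=lambda i: (p[0] - centers[i][0]) ** 2
                                + (p[1] - centers[i][1]) ** 2)
            clusters[j].append(p)
        for i, c in enumerate(clusters):
            if c:  # Recompute the center as the cluster mean.
                centers[i] = (sum(p[0] for p in c) / len(c),
                              sum(p[1] for p in c) / len(c))
    return centers, clusters

# Two well-separated, invented groups of points.
data = [(0.1, 0.2), (0.0, 0.0), (0.2, 0.1),
        (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, clusters = kmeans(data, k=2)
```

The point of the sketch is the slide's caution, not the algorithm: the loop happily returns k centers for any data and any k, whether or not the clustering is meaningful.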
What Could Possibly Go Wrong?
What Could Possibly Go Wrong?
- The Duke Genomics Center published several groundbreaking articles in the 2005-2010 timeframe, conclusively identifying cancer biomarkers.
- Unfortunately, clinical trials based on this research did not pan out.
  - Women died unexpectedly.
- Two statisticians, Keith Baggerly and Kevin Coombes, dug into the research.
- New York Times, July 8, 2011: "Dr. Baggerly and Dr. Coombes found errors almost immediately. Some seemed careless - moving a row or column over by one in a giant spreadsheet - while others seemed inexplicable. The Duke team shrugged them off as 'clerical errors'... In the end, four gene signature papers were retracted. Duke shut down three trials using the results. (Lead investigator) Dr. Potti resigned from Duke... His collaborator and mentor, Dr. Nevins, no longer directs one of Duke's genomics centers. The cancer world is reeling."

Large Amounts of Data Plus Sophisticated Algorithms Do Not Guarantee Success.
What Could Possibly Go Wrong?
- Financial giant Lehman Brothers declared bankruptcy on September 15th, 2008.
  - This was the largest bankruptcy filing in US history, with Lehman Brothers holding roughly $600 billion in assets.
  - The Dow Jones Industrial Average dropped over 500 points that day; several other financial institutions followed Lehman Brothers into bankruptcy... and the rest is history.
- A few years earlier, I had visited Lehman Brothers headquarters in NY with representatives of GE Capital:
  - Lehman was selling models to predict corporate defaults.
  - Their models were quite sophisticated, and based on large amounts of historical financial data.
- Virtually all financial institutions impacted by the crisis had models.

"Historical Results Do Not Guarantee Future Performance."
What Could Possibly Go Wrong?
- On April 18th, 2011, the book "The Making of a Fly" goes on sale on Amazon.com.
- Amazon's automated algorithm places a price of $1,730,045 on the book.
- Later in the day, the Amazon price goes up to $23,698,656.
  - Plus $3.55 for shipping and handling.
- No one buys the book that day.
- Days later, the Amazon price was $106.
  - People started to buy the book.

"We Are Writing Things That No One Can Read." - Kevin Slavin (2011 TED Conference)
What Could Possibly Go Wrong?
- Our quandary:
  - All other things being equal, "Big Data" is better than "little data."
  - The newer data mining tools are powerful and work quite well in numerous cases.
- Yet, modeling disasters continue to occur; why?
- Clearly, we are missing something in the equation.

Could It Be That the Fundamentals Are Still Important?
Can Statistical Engineering Principles Help?
Some Background, and a Definition
Interesting Course Taught at Harvard
Stat 399: Problem Solving in Statistics
"...emphasizes deep, broad, and creative statistical thinking instead of technical problems that correspond to a recognizable textbook chapter."*
*Xiao-Li Meng, The American Statistician, August 2009

Do the Important Problems We Face "Correspond to a Recognizable Textbook Chapter?"
Susan Hockfield - MIT President
- Around the dawn of the 20th century, physicists discovered the basic building blocks of the universe; a "parts list," if you will. Engineers said "we can build something from this list," and produced the electronics revolution, and subsequently the computer revolution.
- More recently, biologists have discovered and mapped the basic "parts list" of life - the human genome. Engineers have said "we can build something from this list," and are producing a revolution in personalized medicine.*
*Loosely quoted from a January 2010 seminar at GE Global Research

Who Is Building Something Meaningful From the Statistical Science Parts List of Tools?
Statistical Engineering Definition
- Statistical engineering: the study of how to best utilize statistical concepts, methods, and tools, and integrate them with information technology and other relevant sciences, to generate improved results (Hoerl and Snee 2010a).
  - In other words, trying to build something meaningful from the statistical science tools list.
  - Enables us to attack the large, complex, unstructured problems "that do not correspond to a recognizable textbook chapter."
- Notes:
  - This is a different definition than that used by Eisenhart, who we believe was the first to use this term, in 1950.
  - Good statisticians have always done this, but little practical guidance has been documented in the literature.

This Definition Is Consistent with Dictionary Definitions of Engineering.
Typical Phases of Statistical Engineering Projects
1. Identify problems: find the high-impact issues inhibiting achievement of the organization's strategic goals.
2. Create structure: carefully define the problem, objectives, constraints, metrics for success, and so on.
3. Understand the context: identify important stakeholders (e.g., customers, organizations, individuals, management), research the history of the issue, identify unstated complications and cultural issues, locate relevant data sources.
4. Develop a strategy: create an overall, high-level approach to attacking the problem, based on phases 2 and 3.
5. Establish tactics: develop and implement diverse initiatives or projects that collectively will accomplish the strategy.

There Are No "Seven Easy Steps" to Statistical Engineering Projects.
Statistical Engineering - Critical Considerations for Big Data
- Data quality
  - Free of omissions, errors, missing values, etc.
  - Missing variables
  - High measurement variation
  - Biases - human, equipment, ...
- Subject matter knowledge - used in many different ways
  - Variable selection and appropriate scales (e.g., log, inverse, square, ...)
  - Selection of model form: linear, curvilinear, multiplicative
  - Interpretation of results
  - Ability to extrapolate findings
- Use of sequential approaches
  - Big problems are not solved with one analysis, or even one data set
  - Strategy must move beyond the "one-shot study" mindset

Three Macro Issues That Seem to Be Overlooked in the Big Data Literature.
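A small sketch of how subject matter knowledge can drive the choice of scale and model form: if theory suggests a multiplicative relationship y = a * x^b, a log transform of both variables turns the fit into ordinary least squares on a straight line. The data below are synthetic and noiseless, purely for illustration.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**b by least squares on the log-log scale:
    log y = log a + b * log x (a straight line in the logs)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    # Ordinary least-squares slope and intercept on the transformed scale.
    b = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
    a = math.exp(my - b * mx)
    return a, b

# Invented data generated exactly from y = 2 * x**1.5.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x ** 1.5 for x in xs]
a, b = fit_power_law(xs, ys)
```

The transform itself is a modeling decision: nothing in the data announces that a multiplicative form is appropriate - that choice comes from knowing the phenomenon.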
Understanding the "Data Pedigree"
- Trust but verify - data pedigree must be assessed when analyzing Big Data. Data quality is an issue with all sources of data.
- Careful thought must be given to the model form needed to answer the question, and whether the current data are sufficient for that purpose.
- Multiple sources of data require careful thought as to data pedigree and how to fit the databases together to produce useful results.
  - Different data sources are typically associated with political issues, different agendas, different objectives, etc.

Good Principle: Data Are Guilty Until Proven Innocent.
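One hypothetical shape such a "guilty until proven innocent" audit could take, run before any modeling: flag missing required fields, out-of-range values, and duplicate records. The field names, plausible ranges, and example records here are all invented for illustration.

```python
def audit_records(records, required, ranges):
    """Flag basic data-quality problems before any modeling:
    missing required fields, out-of-range values, duplicates."""
    issues = []
    seen = set()
    for i, rec in enumerate(records):
        for field in required:
            if rec.get(field) is None:
                issues.append((i, f"missing {field}"))
        for field, (lo, hi) in ranges.items():
            val = rec.get(field)
            if val is not None and not (lo <= val <= hi):
                issues.append((i, f"{field} out of range: {val}"))
        key = tuple(sorted(rec.items()))  # hashable fingerprint of the record
        if key in seen:
            issues.append((i, "duplicate record"))
        seen.add(key)
    return issues

# Invented records: one missing value, one impossible age, one duplicate row.
data = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 61000},
    {"id": 3, "age": 210, "income": 48000},
    {"id": 1, "age": 34, "income": 52000},
]
problems = audit_records(data, required=["id", "age", "income"],
                         ranges={"age": (0, 120)})
```

Checks like these are cheap relative to the cost of the Duke-style errors described earlier (a row or column shifted by one), which is exactly why the pedigree question comes before the modeling question.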
The Advantages of a Sequential Approach
- Much of our professional literature, and virtually all of our textbooks, assume that statistical problems are, by their nature, "one-shot studies":
  - We are handed a fixed data set, and must develop the "best" model to fit the data.
  - Articles are frequently published challenging previously published analyses, and proposing a better model for the same data.
- This is clearly the tone of many high-profile data analysis competitions, beginning with the Netflix Challenge, and continuing today with Kaggle.com.

Are Most Statistical Problems One-Shot Studies?
The Advantages of a Sequential Approach
- In 30 years working as a statistician in the private sector, I almost always needed a sequential approach, involving more than one statistical tool, to solve the important problems I faced.
- If one is in the midst of a sequential process, he or she approaches data analysis from a very different viewpoint versus one-shot studies.
  - A key goal in the process is to direct the next round of data gathering and analysis, as opposed to finding the "optimal" model.
- Sequential approaches, as proposed by Box, Hunter, and Hunter (2005), also offer the opportunity to use hindsight to our advantage.
  - "The best time to design an experiment is after examining the results."

Are Netflix and Kaggle.com Missing Something?
The Importance of Subject Matter Knowledge
- "Data have no meaning in themselves; they are meaningful only in relation to a conceptual model of the phenomenon being studied." - Box, Hunter, and Hunter
- Implied message of the data mining, machine learning, and Big Data literature: "Data have complete meaning in themselves; no theory is required."
  - For example, only subject matter theory, NOT statistics, allows us to extrapolate the results of a study, say a clinical trial, to a broader population.
- Subject matter theory guides the statistical process, including data collection, analysis, and interpretation.
- This is a "scientific method" approach to statistics, as opposed to a "test" approach to statistics.
- Such an approach gives statistics and statisticians an active role in developing new theories, as opposed to simply providing yes/no answers about existing theories (proactive leadership vs. passive consulting paradigm).
  - New subject matter insights lead naturally to new questions, and new data, directly linking this principle to the sequential approach principle.

Data and Understanding Are Not Synonyms.
Integration of Subject Matter Knowledge
[Figure: iterating between data and subject matter theory about the business process and its customers; process knowledge increases with each cycle. From Hoerl & Snee, Statistical Thinking: Improving Business Performance, 2nd Ed., Wiley, 2012]
Putting It All Together - Providing Leadership to Ensure We Do Big Data the Right Way
Statistical Engineering Approach to Big Data
- Leadership is needed to avoid the pitfalls of the "Big Data + powerful algorithms = success" fallacy; if we don't lead the way, it probably won't happen.
- The fundamentals still apply - in fact, they are even more critical.
- The phases of statistical engineering provide a framework with which to attack Big Data projects more scientifically:
1. Identify problems: find the high-impact Big Data problems - don't wait for them to come to you.
2. Create structure: carefully define the real (versus stated) problem, objectives, constraints, metrics for success, and so on.
3. Understand the context: obtain as much subject-matter knowledge as possible, research the history of the issue, locate relevant data sources, and so on.
4. Develop a strategy: create an overall, high-level approach to attacking the problem, based on phases 2 and 3; incorporate a sequential approach - applying what we learn in the initial analysis.
5. Establish tactics: develop and implement individual steps in the strategy - stay flexible, but start with a defined plan.

Big Data Constitutes One of the Best Leadership Opportunities in Our Profession's History.
Summary
- The glass is half-full: Big Data and associated tools offer a unique opportunity to solve important problems that were previously intractable.
- The fundamentals of good science, analytical modeling, and interpretation still apply.
  - Ignoring these fundamentals increases the probability that invalid conclusions are reached and inappropriate actions taken.
- Statistical engineering provides a useful approach for using Big Data to solve important problems.
  - A five-phase framework is suggested to guide the work associated with Big Data problems, which are typically large, complex, and unstructured.
- The probability of success is significantly increased when the following aspects of statistical engineering are incorporated in the approach:
  - Understanding of data pedigree
  - Utilization of sequential approaches
  - Integration of subject matter knowledge

Statistical Engineering Can Help Big Data Projects Be Successful.
References
Davenport, T. H. and J. G. Harris (2007) Competing on Analytics, Harvard Business School Press, Boston, MA.
DeVeaux, R. D. and D. J. Hand (2005) "How to Lie with Bad Data," Statistical Science, Vol. 20, No. 3, 231-238.
Hoerl, R. W. and R. D. Snee (2012) Statistical Thinking: Improving Business Performance, 2nd Ed., Wiley.
Pierrard, J. M. (1974) "Relating Automotive Emissions and Urban Air Quality," DuPont Innovation, Vol. 5, No. 2, pp. 6-9.
Pierrard, J. M., R. D. Snee and J. Zelson (1973) "A New Approach to Setting Vehicle Emission Standards," presented at the Air Pollution Control Association Annual Meeting, June 24-28, 1973.
Pierrard, J. M., R. D. Snee and J. Zelson (1974) "A New Approach to Setting Vehicle Emission Standards," Air Pollution Control Association Journal, Vol. 24, No. 9, pp. 841-848.
Snee, R. D. and R. W. Hoerl (2003) Leading Six Sigma - A Step-by-Step Guide Based on Experience with General Electric and Other Six Sigma Companies, FT Prentice Hall, New York, NY.
Snee, R. D. and R. W. Hoerl (2012) "Inquiry on Pedigree - Do You Know the Quality and Origin of Your Data?" Quality Progress, December 2012, 66-68.
Snee, R. D. and J. M. Pierrard (1977) "The Annual Average: An Alternative to the Second Highest Value as a Measure of Air Quality," Air Pollution Control Association Journal, Vol. 27, No. 2, pp. 131-133.
Articles on Statistical Engineering by Hoerl and Snee
Hoerl, R. W. and R. D. Snee (2009) "Post Financial Meltdown: What Do Services Industries Need From Us Now?" Applied Stochastic Models in Business and Industry, December 2009, pp. 509-521.
Hoerl, R. W. and R. D. Snee (2010) "Moving the Statistics Profession Forward to the Next Level," The American Statistician, February 2010, pp. 10-14.
Hoerl, R. W. and R. D. Snee (2010) "Closing the Gap: Statistical Engineering Can Bridge Statistical Thinking with Methods and Tools," Quality Progress, May 2010, pp. 52-53.
Hoerl, R. W. and R. D. Snee (2010) "Tried and True - Organizations Put Statistical Engineering to the Test and See Real Results," Quality Progress, June 2010, pp. 58-60.
Hoerl, R. W. and R. D. Snee (2010) "Statistical Thinking and Methods in Quality Improvement: A Look to the Future," Quality Engineering, 22, 3, pp. 119-139.
Hoerl, R. W. and R. D. Snee (2011) "Statistical Engineering: Is This Just Another Term for Applied Statistics?" Joint Newsletter of the ASA Section on Physical and Engineering Sciences and Quality and Productivity, March 2011, 4-6.
Snee, R. D. and R. W. Hoerl (2010) "Further Explanation: Clarifying Points About Statistical Engineering," Quality Progress, December 2010, pp. 68-72.
Snee, R. D. and R. W. Hoerl (2011) "Engineering an Advantage," Six Sigma Forum Magazine, Guest Editorial, February 2011, 6-7.
Snee, R. D. and R. W. Hoerl (2011) "Proper Blending: Finding the Right Mix of Statistical Engineering and Traditional Applied Statistics," Quality Progress, June 2011.