Data Science Isn't a Fad: Let's Keep it That Way


First presented at the February 2013 Research Triangle Analysts meeting, this presentation discusses the technical side of making data science a field that's here to last. This presentation focuses on the "science" aspect of data science and how it drives value to an organization.

Published in: Technology

  1. Data Science Isn’t a Fad: Let’s Keep It That Way. Presentation to Research Triangle Analysts, February 21, 2013
  2. Data Science: Buyer Beware. Forbes article: Data Science: Buyer Beware. “This is a management fad.” Me: I’ve been doing this for 16 years. It isn’t a fad. You keep renaming it. Result: Great conversation, and another Forbes article.
  3. Obligatory Definition. Wikipedia: Data science is a novel term that is often used interchangeably with competitive intelligence or business analytics, although it is becoming more common. Data science seeks to use all available and relevant data to effectively tell a story that can be easily understood by non-practitioners. “Sexiest job of the 21st century.” --Thomas H. Davenport and DJ Patil. “Pseudo science performed by rock-star unicorns.” --The Internet
  4. Data SCIENCE. Data: emphasizes the transformation of raw information into actionable results. Science: emphasizes the commitment to a verifiable and repeatable process. Data Science: The discipline of transforming raw information into actionable results in a manner that is verifiable and repeatable. “Information is cheap. Meaning is expensive.” --George Dyson, 2011
  5. Data Science Is.... Google’s Search Engine. Fraud Framework. Spotfire Operations Analytics in Production Analytics
  6. Once upon a time... Information was VERY expensive.
  7. Data Science and Statistics. The statistical methods you learn as an undergraduate were optimized to make efficient use of small data samples. Data is a unique resource: The more you have, the more valuable each individual piece becomes, provided you can extract meaning from the information.
  8. “Big Data” = New Problems. Dynamic environment: relationships change. Constant sampling means you will have false positives. Large numbers of variables and data points mean you have to rely on automated tools. Not all automated tools are created equal.
  9. Cue Shameless Plug.... John Sall, Co-Founder & EVP of SAS Institute, Director of JMP. “From Big Data to Big Statistics.” March 21, 6:30pm, Louie and Charlies
  10. Raw Information to Actionable Results. The results of the analysis must answer the business question(s). The results of the analysis must provide a course of action.
  11. Actionable. Click on this link. Check this person’s file. Stop/encourage this activity. Look at this pattern.
  12. Verifiable. The assumptions from the underlying methods must be stated and shown to be true. Outlier cases must be documented and handled effectively. Different analysis, error table, excluded point.
  13. Y = 3.0017 + 0.499X, Corr = 0.8199. Anscombe’s Quartet. Linear regression assumes a straight-line relationship and normally distributed errors.
  14. Y = 3.0017 + 0.499X, Corr = 0.8199. Anscombe’s Quartet. This line has the same statistics as the one before. But the relationship is not a straight line.
  15. Y = 3.0017 + 0.499X, Corr = 0.8199. Anscombe’s Quartet. An outlier is affecting the equation.
  16. Y = 3.0017 + 0.499X, Corr = 0.8199. Anscombe’s Quartet. One outlier drives the entire relationship.
  17. Repeatable. When I do this again with data that meets the stated assumptions, I should get the same answers. Small changes in the data should NOT break the algorithm. Easier said than done.
  18. Making Results Repeatable. Automated verification of assumptions. Good coding practices (no matter the language). Out of sample testing: Do the same analysis with similar data. Failure conditions: Document what should happen when bad data goes into the algorithm. Run the algorithm with bad data.
  19. This is the endpoint of the analysis. Companies that hire data scientists use the results to make decisions.
  20. Repeatable: Closing the Loop With Users. It is the data scientist’s responsibility to make sure the results are used effectively. Involve users at the beginning of the process. Use iterative feedback to make sure results are: Actionable, Verifiable, Repeatable.
  21. Why Bother? “Beware the Big Errors of Big Data” “Big Data is Falling into the Trough of Disillusionment” “If you asked me to describe the rising philosophy of the day, I would say it’s data-ism...”
  22. Really, Then, Why Bother? “...the Oakland A’s front office ...fielded a team that could compete successfully against richer competitors in Major League Baseball (MLB).”
  23. Because What We Do Matters. “Refugees United...uses mobile and web technologies to help refugees find their missing loved ones.” “Predictive analytics is saving lives and taxpayer dollars in New York City.” --Alex Howard, Michael Flowers interview
  24. That’s Enough From Me. What do you think about me? THANK YOU! All photos the property of their respective owners.
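
The Anscombe's Quartet slides (13 to 16) and the "Making Results Repeatable" slide (18) can be illustrated with a small sketch. The plots themselves did not survive in this transcript, so the data below are the standard published Anscombe values rather than anything taken from the slides. The function names `fit_line` and `max_slope_shift` are hypothetical, and the leave-one-out check is only one possible way to automate "verification of assumptions", not necessarily what the presenter had in mind. It uses only the Python standard library.

```python
from math import sqrt

# Standard published Anscombe's Quartet values (not taken from the slides).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(xs, ys):
    """Least-squares line through (xs, ys); returns (intercept, slope, correlation).

    Documented failure condition: raises ValueError when x or y is constant,
    because no meaningful line or correlation exists in that case.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("x or y is constant; a line cannot be fitted")
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope, sxy / sqrt(sxx * syy)


def max_slope_shift(xs, ys):
    """Largest change in slope when any single point is dropped.

    Returns infinity if dropping one point makes the fit impossible,
    which means a single observation is carrying the whole relationship.
    """
    _, full_slope, _ = fit_line(xs, ys)
    worst = 0.0
    for i in range(len(xs)):
        try:
            _, slope, _ = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        except ValueError:
            return float("inf")
        worst = max(worst, abs(slope - full_slope))
    return worst


for name, (xs, ys) in ANSCOMBE.items():
    intercept, slope, corr = fit_line(xs, ys)
    shift = max_slope_shift(xs, ys)
    print(f"{name}: y = {intercept:.4f} + {slope:.4f}x, corr = {corr:.4f}, "
          f"max leave-one-out slope shift = {shift:.3f}")
```

All four data sets produce nearly the same intercept, slope, and correlation (roughly 3.00, 0.50, and 0.82), which is the point of the slides: the summary statistics alone cannot tell the sets apart. The leave-one-out check is a crude stand-in for looking at the plots. For the fourth set, removing the single outlier leaves every x equal, so the refit hits the documented failure condition and the check reports infinity. A fuller version would also test for curvature in the residuals, which this sketch does not attempt.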