Data Science An Engineering Implementation Perspective


A successful data science project requires multiple functions to come together and work in tandem.

Published in: Software

  1. Data Science: An Implementation Perspective
     http://tech.lalitbhatt.net
     Director Engineering – Entytle Inc (http://www.entytle.com)
  2. Scientist and Engineer
     The joke of the town is:
     • Engineers think that equations approximate the real world.
     • Scientists think that the real world approximates equations.
     • Mathematicians are unable to make the connection.
     "Science is very comfortable living with uncertainty and not knowing. Engineers hate that." - Ranjeet Tate
     This does not mean anyone is more or less important than anyone else. Everyone has their place in the grand scheme.
  3. Science and Engineering
     • Models to represent the abstraction
     • Prototype implementation
     • Solution for production
     Science looks for truth. Engineers consider what they build to be the truth.
  4. Deterministic and Probabilistic
     • It's a mindset: Schrödinger's cat / earthquakes
     • Deterministic: a single solution for the given inputs, so it is actionable with no interpretation.
     • Probabilistic: a distribution of possible outcomes and how likely each is to occur. Still actionable, but requires interpretation.
  5. Deterministic and Probabilistic
     Example:
     • Customer A can bring $10000 revenue
     • Customer B can bring $1000 revenue
     Which one to chase if the probability for A is 0.001 and the probability for B is 0.5?
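One common way to answer the question on slide 5 is to compare expected revenue (revenue multiplied by probability). The slide itself does not name a decision rule, so ranking by expected value is an assumption here, and the names `customers` and `expected_revenue` are hypothetical. A minimal sketch:

```python
# Figures are taken from the slide; ranking by expected value is an
# assumption, since the slide does not say which rule to apply.
customers = {
    "A": {"revenue": 10_000, "probability": 0.001},
    "B": {"revenue": 1_000, "probability": 0.5},
}


def expected_revenue(revenue: float, probability: float) -> float:
    """Revenue weighted by the chance of actually winning it."""
    return revenue * probability


expected = {name: expected_revenue(**c) for name, c in customers.items()}
best = max(expected, key=expected.get)

for name, value in expected.items():
    print(f"Customer {name}: expected revenue ${value:,.2f}")
print(f"Under this rule, chase customer {best} first")
```

Under this rule customer B ranks higher despite the smaller deal size. A risk-seeking team might still weigh the large but unlikely deal differently, which is the interpretation step the previous slide mentions.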
  6. Finding what?
     "There are known knowns; there are things we know that we know. There are known unknowns; that is to say, there are things that we now know we don't know. But there are also unknown unknowns -- there are things we do not know we don't know." - Donald Rumsfeld
  7. A Process View
     • Business goals
     • Data finding and cleaning: de-duplication, normalization, nomenclature, units, outliers
     • Exploration: visualization, patterns, causality
     • Modeling: build a story
     • Testing and validation
     • Production implementation
     • Feedback loop
     Domain Knowledge
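The cleaning step on slide 7 (nomenclature, units, de-duplication, outliers) can be sketched roughly as follows. Everything here is illustrative: the site names, readings, the `TO_MM` table, and the choice of the 1.5 × IQR rule for outliers are assumptions, not part of the deck.

```python
import statistics

# Hypothetical raw readings: (site name, rainfall value, unit).
RAW = [
    ("Site A ", 10.0, "mm"),
    ("site a", 1.0, "cm"),    # same reading as above once normalized
    ("Site B", 12.0, "mm"),
    ("Site C", 9.0, "mm"),
    ("Site D", 11.0, "mm"),
    ("Site E", 8.0, "mm"),
    ("Site F", 250.0, "mm"),  # suspiciously large
    ("Site G", 10.5, "mm"),
    ("Site H", 9.5, "mm"),
]

TO_MM = {"mm": 1.0, "cm": 10.0}  # unit conversion factors (assumed units)


def clean(records):
    """Normalize names and units, then drop exact duplicates."""
    seen = set()
    cleaned = []
    for name, value, unit in records:
        row = (name.strip().lower(), value * TO_MM[unit])
        if row not in seen:
            seen.add(row)
            cleaned.append(row)
    return cleaned


def flag_outliers(values, k=1.5):
    """Return values outside the k * IQR fences (one common heuristic)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


rows = clean(RAW)
outliers = flag_outliers([value for _, value in rows])
print(f"{len(RAW)} raw rows -> {len(rows)} cleaned rows")
print("Flagged for review:", outliers)
```

Flagged values are only candidates for review. Whether 250 mm is a data error or a real event is a domain-knowledge question, not something the heuristic can decide.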
  8. Mathematical Tools
     • Classification and regression
     • Cluster analysis
     • Pattern mining
     • Outlier analysis
     The list is not exhaustive; the most important takeaway is model building.
  9. Approach to Implementation
     • Don't try to find every pattern in everything. Mine the pattern; don't force it.
     • If a solution is available to buy, consider that.
     • Never underestimate the variation you can get in real data.
     • The more code you write, the more bugs you introduce.
  10. Types of Output
     • Descriptive: Last year we had an average of 10 mm of rain per day.
     • Predictive: Next year we will have an average of 9 mm of rain per day.
     • Prescriptive: Recommendations on what can increase the rain further.
  11. Testing Models/Implementations
     • Training vs testing
     • A/B test
     • Errors
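The "training vs testing" item on slide 11 usually means holding back part of the data so the model is evaluated on rows it has not seen. The deck does not specify a split method, so the sketch below assumes a simple random hold-out; the function name, the 20% test fraction, and the seed are all hypothetical.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                  # leave the caller's data untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))  # stand-in for 100 labelled examples
train, test = train_test_split(rows)
print(f"{len(train)} training rows, {len(test)} testing rows")
```

A random split is only one option. Time-ordered data, for example, is often split by date instead so that the test set comes after the training set.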
  12. Model Accuracy
     • Type I and Type II errors
     • Accuracy = (110+80)/(110+50+40+80)
     • Recall = 110/(110+40): of the actual positives, how many were predicted positive
     • Precision = 110/(110+50): of those predicted yes, how many actually came out yes
     • Diagnostic odds ratio = (TP/FP)/(FN/TN)
     • …
     Confusion matrix:
                  Predicted Y   Predicted N
       Actual Y       110            40
       Actual N        50            80
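The formulas on slide 12 can be checked directly against its confusion matrix. The sketch below uses the slide's counts; the variable names are hypothetical, and reading Type I errors as false positives and Type II errors as false negatives follows the usual convention rather than anything stated on the slide.

```python
# Counts from the slide's confusion matrix.
tp, fn = 110, 40  # actual Y: predicted Y, predicted N
fp, tn = 50, 80   # actual N: predicted Y, predicted N

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
recall = tp / (tp + fn)      # share of actual positives that were found
precision = tp / (tp + fp)   # share of predicted positives that were right
diagnostic_odds_ratio = (tp / fp) / (fn / tn)

type_i_errors = fp   # false positives (usual convention)
type_ii_errors = fn  # false negatives (usual convention)

print(f"Accuracy:  {accuracy:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Diagnostic odds ratio: {diagnostic_odds_ratio:.2f}")
print(f"Type I errors: {type_i_errors}, Type II errors: {type_ii_errors}")
```

The slide's form of the odds ratio, (TP/FP)/(FN/TN), is algebraically the same as the more familiar (TP × TN)/(FP × FN).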
  13. CRISP-DM
     Image: https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining
  14. Communication Flow
     Product Management, Data Science, Engineering, QA
     Let's appreciate that each one is doing his/her job. Like any relationship, it takes constant work.
     • What do customers want? What is the data telling us?
     • What is the model and how should it be implemented? Preferably with a prototype implementation.
     • Input and output
     • Is the model working on all types of data and at scale? What are the edge cases?
     • Review test results on all possible real-life cases
     • Input and expected output
  15. Thanks. Questions?