Data Science
An Implementation Perspective
http://tech.lalitbhatt.net
Director Engineering – Entytle Inc
(http://www.entytle.com)
The joke of the town is:
 Engineers think that equations approximate the real
world.
 Scientists think that the real world approximates
equations.
 Mathematicians are unable to make the connection.
Science is very comfortable living with uncertainty and not
knowing. Engineers hate that. - Ranjeet Tate
This does not mean anyone is more or less important than the
others. Everyone has their place in the grand scheme.
Scientist and Engineer
 Models to represent the abstraction
 Prototype Implementation
 Solution for production
Science looks for truth.
Engineers consider what they build to be the truth.
Science and Engineering
 It’s a mindset – Schrödinger's cat / earthquakes
 Deterministic: Single solution for appropriate inputs
so that it can be actionable with no interpretation.
 Probabilistic: Distribution of possible outcomes and
how likely each one is to occur. Still actionable,
but requires interpretation.
Deterministic and Probabilistic
Example:
 Customer A can bring $10000 revenue
 Customer B can bring $1000 revenue
Which one should we chase if the probability of winning A
is 0.001 and the probability of winning B is 0.5?
Deterministic and Probabilistic
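One common way to answer the customer question is to compare expected revenue (potential revenue multiplied by the probability of winning it). The sketch below is only an illustration of that idea; the dictionary layout and the name expected_revenue are assumptions made for this example, not part of the original slides.

```python
# Expected revenue = potential revenue x probability of winning the customer.
# Figures are the ones from the example; the names are hypothetical.
customers = {
    "A": {"revenue": 10_000, "probability": 0.001},
    "B": {"revenue": 1_000, "probability": 0.5},
}


def expected_revenue(revenue, probability):
    """Return the probability-weighted revenue for one customer."""
    return revenue * probability


expected = {name: expected_revenue(**c) for name, c in customers.items()}
best = max(expected, key=expected.get)

for name, value in expected.items():
    print(f"Customer {name}: expected revenue ${value:,.2f}")
print(f"Chase customer {best} first")
```

On expected value alone, B (about $500) is the better bet than A (about $10), even though A has the larger headline number. A real decision might also weigh cost of pursuit and risk appetite, which is where interpretation comes in.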
Finding what?
There are known knowns; there are things we know
that we know. There are known unknowns; that is to
say, there are things that we now know we don't know.
But there are also unknown unknowns -- there are
things we do not know we don't know.
- Donald Rumsfeld
A Process view
 Business Goals
 Data finding and cleaning – De-duplication,
Normalization, Nomenclature, Units, Outliers
 Exploration – Visualization, Patterns, Causality
 Modeling – Build a story
 Testing and Validation
 Production Implementation
 Feedback loop
Domain Knowledge
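The cleaning step in the process above (de-duplication, nomenclature, units, outliers) can be sketched roughly as follows. All records, names, and thresholds here are made up for illustration; the interquartile-range rule is one common outlier heuristic chosen for this sketch, not something the slides prescribe.

```python
from statistics import quantiles

# Made-up raw readings: (source name, value, unit).
raw = [
    ("Pump-1", 10.0, "mm"),
    ("pump-1 ", 10.0, "mm"),  # duplicate once the name is normalized
    ("Pump-2", 1.2, "cm"),    # same quantity, different unit
    ("Pump-3", 11.0, "mm"),
    ("Pump-4", 9.0, "mm"),
    ("Pump-5", 13.0, "mm"),
    ("Pump-6", 500.0, "mm"),  # suspicious value
]

# Conversion factors to a single unit (millimetres).
TO_MM = {"mm": 1.0, "cm": 10.0}


def clean(records):
    """Normalize names and units, then drop exact duplicates (order preserved)."""
    seen, cleaned = set(), []
    for name, value, unit in records:
        key = (name.strip().lower(), round(value * TO_MM[unit], 6))
        if key not in seen:
            seen.add(key)
            cleaned.append(key)
    return cleaned


def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR below the first or above the third quartile."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]


cleaned = clean(raw)
outliers = iqr_outliers([value for _, value in cleaned])
print(cleaned)
print("Outliers to review:", outliers)
```

Flagged values are best reviewed with domain knowledge rather than deleted automatically; an extreme reading may be a sensor fault or a genuine event.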
 Classification and regression
 Cluster analysis
 Pattern mining
 Outlier analysis
The list is not exhaustive, but the most important takeaway is
model building.
Mathematical Tools
 Don’t try to find every pattern in everything. Mine
the pattern; do not force it.
 If a solution is available to buy, consider buying it.
 Never underestimate the variation that one can get
in real data.
 The more code you write, the more bugs you introduce.
Approach to implementation
 Descriptive: Last year we had an average of 10 mm of
rain per day.
 Predictive: Next year we will have an average of 9 mm
of rain per day.
 Prescriptive: Recommendations on what can
increase the rainfall further.
Types of Output
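The difference between descriptive and predictive output can be illustrated with a small sketch. The rainfall numbers below are invented so that the result lines up with the slide's 10 mm and 9 mm figures, and the straight-line trend is just one simple forecasting choice assumed for this example. Prescriptive output is not covered, since it needs a model of what actually causes the outcome.

```python
from statistics import mean

# Made-up daily rainfall readings (mm) standing in for last year's data.
last_year_daily_mm = [8.0, 12.0, 10.0, 9.0, 11.0]

# Descriptive: summarize what already happened.
last_year_avg = mean(last_year_daily_mm)

# Made-up yearly averages (mm/day) for two earlier years, plus last year.
yearly_avg = [12.0, 11.0, last_year_avg]


def linear_forecast(values):
    """Fit a least-squares line through (0, v0), (1, v1), ... and extrapolate one step."""
    n = len(values)
    xs = list(range(n))
    x_bar, y_bar = mean(xs), mean(values)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    intercept = y_bar - slope * x_bar
    return intercept + slope * n


# Predictive: estimate what will happen next.
next_year_avg = linear_forecast(yearly_avg)

print(f"Descriptive: last year averaged {last_year_avg:.1f} mm/day")
print(f"Predictive: next year is forecast at {next_year_avg:.1f} mm/day")
```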
 Training vs Testing
 A/B Test
 Errors
Testing Models/Implementations
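A minimal sketch of the training vs testing idea: hold back part of the data so the model is evaluated on rows it has never seen. The function name, the 25% test fraction, and the seed are assumptions chosen for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(20))  # placeholder records
train_rows, test_rows = train_test_split(rows)

print(len(train_rows), "training rows,", len(test_rows), "testing rows")
```

Fixing the seed keeps the split reproducible, which makes it easier for engineering and QA to compare results across runs.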
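 Type I errors (false positives, FP) and Type II errors (false negatives, FN)
 Accuracy = (110+80)/(110+50+40+80): share of all predictions that were right
 Recall = 110/(110+40): share of actual positives that were predicted positive
 Precision = 110/(110+50): share of predicted positives that were actually positive
 Diagnostic odds ratio = (TP/FP)/(FN/TN)
 …
Models accuracy
            Predicted Y   Predicted N
Actual Y    110 (TP)      40 (FN)
Actual N    50 (FP)       80 (TN)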
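The metrics on this slide can be reproduced from the four cells of the table. The sketch below uses the slide's numbers; the variable names are assumptions, and treating "Y" as the positive class (so that a Type I error is a false positive) is the usual convention rather than something the slide states.

```python
# Confusion-matrix cells taken from the table (Y is treated as the positive class).
tp, fn = 110, 40  # Actual Y: predicted Y, predicted N
fp, tn = 50, 80   # Actual N: predicted Y, predicted N

type_1_errors = fp  # predicted Y, actually N
type_2_errors = fn  # predicted N, actually Y

accuracy = (tp + tn) / (tp + fn + fp + tn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
diagnostic_odds_ratio = (tp / fp) / (fn / tn)

print(f"Type I errors: {type_1_errors}, Type II errors: {type_2_errors}")
print(f"Accuracy:  {accuracy:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Diagnostic odds ratio: {diagnostic_odds_ratio:.2f}")
```

Accuracy alone can mislead when the classes are imbalanced, which is why recall and precision are reported alongside it.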
CRISP-DM
Image: https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining
Communication Flow
Product Management, Data Science, Engineering, QA
Let’s appreciate that each one is doing his/her job.
Like any relationship, it takes constant work.
 What do customers want? What is the data telling us?
 What is the model, and how should it be implemented?
Preferably with a prototype implementation.
 Input and output.
 Is the model working on all types of data and at scale?
Also, what are the edge cases?
 Review test results on all possible real-life cases.
 Input and expected output.
Thanks
Questions

Editor's Notes

  • #3 - Heisenberg uncertainty principle.
  • #11 - Example: conference participants