Data Science: The Product Manager's Primer

  1. 1. Data Science: The Product Manager's Primer by Andrew Koller and Doron Bergman /Productschool @ProdSchool /ProductmanagementSF
  2. 2. Who we are. Andrew Koller (koller.andrew.j@gmail.com): five years' experience as an entrepreneur and product manager, including a background in statistical physics modeling and data science. Doron Bergman: PhD in theoretical solid state physics; three years' experience in large tech and startups.
  3. 3. Overview: Everything you need to know to understand the world of data science in startups. ➔ Back to the Basics: An overview of statistics and the mathematical basis. ➔ Data Science and AI: How data science differs from statistics and makes predictions in the real world. ➔ Putting it all together with an example: Provide a simple unifying message for what is to come. ➔ Data Scientist perspectives: How to understand objections your data scientists might have and how to be their hero. NOTE: Find this presentation at tiny.cc/ProductSchoolDataScience.
  4. 4. How popular is Data Science really?
  5. 5. Really Popular! (Especially in the Bay Area)
  6. 6. Data Science
  7. 7. Data Science!
  8. 8. Q: What is data science and AI? Where do I start? A: Statistics
  9. 9. Statistical Background: A quick review and overview of the statistics you’ll need to communicate with your team. ➔ Regression: Describing a relationship between two variables. ➔ Confidence: Understanding, to some measure, how valid and strong the result is.
  10. 10. Q: What is statistics and why are you talking about that? I’m here for data science.
  11. 11. A: Statistics is a way to describe and make generalizations about a population of data. It forms the basis of Data Science.
  12. 12. Regression is the description of a variety of data points with a function. The most basic form is linear regression in the form of y=Mx+b
  13. 13. Money and Wins: Money might not be able to buy happiness, but it sure seems to buy wins in Major League Baseball. Y = M * x + b becomes Wins = 0.153 wins/$Million * Payroll (in $Million) + 66.4 wins, or about $6.5 million per win. Actual calculation will be left as an exercise. Image source: New York Times, 2010
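For readers who want to try the "exercise", here is a minimal sketch that plugs the two coefficients quoted on the slide into Y = M * x + b. The function name and the $100M payroll are illustrative assumptions, not figures from the deck.

```python
# Coefficients quoted on the slide: wins per $ million of payroll, and baseline wins.
WINS_PER_MILLION = 0.153
BASELINE_WINS = 66.4


def predicted_wins(payroll_millions):
    """Apply Wins = M * Payroll + b with the slide's coefficients."""
    return WINS_PER_MILLION * payroll_millions + BASELINE_WINS


# The price of one extra win is the inverse of the slope.
millions_per_win = 1 / WINS_PER_MILLION

# The $100M payroll below is a made-up input for illustration.
print(f"Predicted wins at $100M payroll: {predicted_wins(100):.1f}")
print(f"Cost of one extra win: ${millions_per_win:.1f}M")
```

The "$6.5 million per win" on the slide is just 1 / 0.153, rounded.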
  14. 14. Q: How well does a linear relationship apply to everything? A: Pretty well*
  15. 15. US Population: From 1650 to 1850 the US population grew nonlinearly: Y = A * (b * x)^n. Sadly not linear… Or is it? Image source: http://onlinestatbook.com/2/transformations/tukey.html
  16. 16. US Population: But after taking the log of the population, the function is Y = M * x + b. Image source: http://onlinestatbook.com/2/transformations/tukey.html
  17. 17. Many datasets can be made linear after just one transformation Q: What if that isn’t good enough?
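A rough sketch of the "one transformation" idea, assuming the data grows exponentially (a constant growth factor per period). The numbers are synthetic and chosen for illustration, not census data, and `fit_line` is a hypothetical helper implementing ordinary least squares.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = M * x + b; returns (M, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


# Synthetic series: starts at 2.0 and grows by a factor of 1.5 per period.
periods = [0, 1, 2, 3, 4, 5]
population = [2.0 * 1.5 ** t for t in periods]

# Taking the log turns the curve into a straight line in the period number.
log_population = [math.log(p) for p in population]
slope, intercept = fit_line(periods, log_population)

# Undo the log to read the fit in the original units.
growth_factor = math.exp(slope)
starting_value = math.exp(intercept)
print(f"growth factor per period: {growth_factor:.3f}")
print(f"starting value: {starting_value:.3f}")
```

Because the synthetic data is exactly exponential, the line fitted to the logged values recovers the growth factor and starting value used to generate it.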
  18. 18. The next step beyond linear regression is called polynomial regression, in the form y = Nx^2 + Mx + b or higher order. Each additional term lets the curve follow the data it was fitted to more closely. More general (math nerd) form: y = ∑ M_n x^n
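A minimal sketch of polynomial regression in plain Python, assuming a least-squares fit solved through the normal equations. The data points are made up, and all function names are hypothetical. It shows the error on the fitted points shrinking as terms are added, which says nothing yet about how well the curve predicts new points.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients [M_0, ..., M_degree] for y = sum(M_n * x**n)."""
    n = degree + 1
    # Normal equations: a[i][j] = sum(x**(i+j)), rhs[i] = sum(y * x**i).
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    rhs = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            rhs[row] -= factor * rhs[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        known = sum(a[i][k] * coeffs[k] for k in range(i + 1, n))
        coeffs[i] = (rhs[i] - known) / a[i][i]
    return coeffs


def predict(coeffs, x):
    return sum(m * x ** k for k, m in enumerate(coeffs))


def sum_squared_residuals(coeffs, xs, ys):
    return sum((y - predict(coeffs, x)) ** 2 for x, y in zip(xs, ys))


# Made-up points that curve upward, for illustration only.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 2.7, 5.8, 11.1, 17.9, 27.2]

errors = {
    degree: sum_squared_residuals(fit_polynomial(xs, ys, degree), xs, ys)
    for degree in (1, 2, 3)
}
for degree, error in errors.items():
    print(f"degree {degree}: sum of squared residuals = {error:.4f}")
```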
  19. 19. Q: How accurate is my regression line? How is it measured?
  20. 20. Accuracy (or alternatively error) is measured by taking the difference between the actual value and the value predicted by the regression. This is called the residual. Residuals are often expressed by data scientists as residuals squared. The A’s have a positive residual; the Mets have a negative residual. Which team would you rather be?
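A small sketch of the residual calculation, reusing the payroll line quoted earlier in the deck. The two teams and their numbers are hypothetical stand-ins, since the deck does not give the actual payroll and win figures for the A's or the Mets.

```python
def predicted_wins(payroll_millions):
    """Regression line quoted earlier in the deck."""
    return 0.153 * payroll_millions + 66.4


def residual(actual, predicted):
    """Positive means the team did better than the line predicts."""
    return actual - predicted


# Hypothetical teams: (payroll in $ million, actual wins). Not real figures.
teams = {
    "Thrifty Team": (60, 85),
    "Big Spender": (130, 75),
}

residuals = {}
for name, (payroll, wins) in teams.items():
    r = residual(wins, predicted_wins(payroll))
    residuals[name] = r
    print(f"{name}: residual = {r:+.2f}, squared = {r ** 2:.2f}")
```

Squaring removes the sign, so over- and under-performance both count as error when residuals are added up.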
  21. 21. Great! So now we can make predictions right? Right?
  22. 22. Data Science: Understand how data science builds off of statistics to make predictions and power some of the most common uses. ➔ Numeric Predictions: Describing a relationship between two variables. ➔ Categorization: Understanding, to some measure, how valid and strong the result is.
  23. 23. Statistics primarily DESCRIBES data sets, but is not set up to PREDICT new values. Example, please???
  24. 24. Which line on the right is the “best”? And if there were another point in the set, where do you think it might be? Statistics says the black line is the best. Human intuition thinks the green line might be better. But we still don’t know.
  25. 25. Another example: The graphs to the right show increasing number of polynomial terms used to fit data on house size vs price. Adapted from http://www.astroml.org/sklearn_tutorial/practical.html
  26. 26. Q: I think I get it. What do we do about this problem?
  27. 27. Data scientists divide the data into two: Train and Test
  28. 28. The training set is used to set up the model (aka fitting parameters). The test set is used to measure how well the model predicts the value of data it has not seen.
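A minimal sketch of the train/test idea, assuming a 70/30 split and a straight-line model. The house sizes and prices are randomly generated for illustration, and the helper names are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = M * x + b; returns (M, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def mean_squared_residual(slope, intercept, points):
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points) / len(points)


# Made-up housing data: price is roughly linear in size, plus noise.
rng = random.Random(0)
sizes = [50 + 10 * i for i in range(20)]
data = [(s, 3.0 * s + 20 + rng.gauss(0, 15)) for s in sizes]

# Shuffle, then hold out 30% of the points as the test set.
rng.shuffle(data)
split = len(data) * 7 // 10
train, test = data[:split], data[split:]

# Fit on the training set only; the test set is never used for fitting.
train_x, train_y = zip(*train)
slope, intercept = fit_line(train_x, train_y)

train_error = mean_squared_residual(slope, intercept, train)
test_error = mean_squared_residual(slope, intercept, test)
print(f"train error: {train_error:.1f}, test error: {test_error:.1f}")
```

The test error is the number to watch: it estimates how the model behaves on data it was not fitted to.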
  29. 29. Comparing the difference between actual and predicted values - residuals - indicates whether your model is down for the count …
  30. 30. Or worthy of a championship
  31. 31. Great. I now understand the data science process, but it’s not yet magic. What else can it do?
  32. 32. Using a model based on parameters, a computer can group items into categories and make choices
  33. 33. Uhhh OK… Example Please?
  34. 34. Problem: We want to predict if a stranger at the gate of our castle is a Stark or a Lannister. Should we trust them? We don’t want to get killed.
  35. 35. Name | Eye Color | Hair Color | Stark
      Ned | Grey | Dark Brown | Y
      Robb | Blue | Dark Brown | Y
      Sansa | Blue | Red | Y
      Arya | Grey | Brown | Y
      Bran | Brown | Brown | Y
      Rickon | Blue | Brown | Y
      Lyanna | Blue | Brown | Y
      Benjen | Brown | Brown | Y
      Tywin | Green | Blonde | N
      Tyrion | Green/Black | Dirty Blonde | N
      Jamie | Green | Brown | N
      Cersei | Green | Blonde | N
  36. 36. Name | Eye Color | Eye Number | Hair Color | Hair Number | Stark
      Ned | Grey | 4 | Dark Brown | 5 | 1
      Robb | Blue | 2 | Dark Brown | 5 | 1
      Sansa | Blue | 2 | Red | 2 | 1
      Arya | Grey | 4 | Brown | 3 | 1
      Bran | Brown | 3 | Brown | 3 | 1
      Rickon | Blue | 2 | Brown | 3 | 1
      Lyanna | Blue | 2 | Brown | 3 | 1
      Benjen | Brown | 3 | Brown | 3 | 1
      Tywin | Green | 1 | Blonde | 1 | 0
      Tyrion | Green/Black | 1.5 | Dirty Blonde | 2 | 0
      Jamie | Green | 1 | Brown | 3 | 0
      Cersei | Green | 1 | Blonde | 1 | 0
  37. 37. Training
      Name | Eye Number | Hair Number | Stark
      Ned | 4 | 5 | 1
      Sansa | 2 | 5 | 1
      Bran | 3 | 2 | 1
      Rickon | 2 | 3 | 1
      Tyrion | 1.5 | 2 | 0
      Jamie | 1 | 3 | 0
      Testing
      Name | Eye Number | Hair Number | Stark
      Robb | 2 | 5 | 1
      Arya | 4 | 3 | 1
      Tywin | 1 | 1 | 0
      Cersei | 1 | 1 | 0
  38. 38. Training
      Name | Eye Number | Hair Number | Stark | Model Outcome | Residuals Squared
      Ned | 4 | 5 | 1 | 1.3235 | 0.10465225
      Sansa | 2 | 5 | 1 | 0.7275 | 0.07425625
      Bran | 3 | 2 | 1 | 0.7786 | 0.04901796
      Rickon | 2 | 3 | 1 | 0.5629 | 0.19105641
      Tyrion | 1.5 | 2 | 0 | 0.3316 | 0.10995856
      Jamie | 1 | 3 | 0 | 0.2649 | 0.07017201
      Stark = 0.298 (eye number) + 0.0823 (hair number) - 0.28
  39. 39. Test
      Name | Eye Number | Hair Number | Stark | Model Outcome | Residuals Squared
      Robb | 2 | 5 | 1 | 0.7275 | 0.07425625
      Arya | 4 | 3 | 1 | 1.1589 | 0.02524921
      Tywin | 1 | 1 | 0 | 0.1003 | 0.01006009
      Cersei | 1 | 1 | 0 | 0.1003 | 0.01006009
      Average Residual Squared: Training: 0.0998, Test: 0.0299
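The castle example can be checked in a few lines. The sketch below applies the formula from slide 38 to the training and test rows and should reproduce the slide's averages up to rounding. The 0.5 cutoff used to turn the score into a Stark / not-Stark decision is an assumption added here for illustration; the deck does not state a threshold.

```python
def stark_score(eye_number, hair_number):
    """Model from slide 38."""
    return 0.298 * eye_number + 0.0823 * hair_number - 0.28


# Rows copied from the slides: (name, eye number, hair number, is Stark).
training = [
    ("Ned", 4, 5, 1), ("Sansa", 2, 5, 1), ("Bran", 3, 2, 1),
    ("Rickon", 2, 3, 1), ("Tyrion", 1.5, 2, 0), ("Jamie", 1, 3, 0),
]
testing = [
    ("Robb", 2, 5, 1), ("Arya", 4, 3, 1), ("Tywin", 1, 1, 0), ("Cersei", 1, 1, 0),
]


def average_residual_squared(rows):
    squared = [(label - stark_score(eye, hair)) ** 2 for _, eye, hair, label in rows]
    return sum(squared) / len(squared)


def is_stark(eye_number, hair_number, threshold=0.5):
    """Assumed decision rule: scores at or above the threshold count as Stark."""
    return stark_score(eye_number, hair_number) >= threshold


print(f"training average: {average_residual_squared(training):.4f}")
print(f"test average: {average_residual_squared(testing):.4f}")
for name, eye, hair, label in testing:
    verdict = "Stark" if is_stark(eye, hair) else "not Stark"
    print(f"{name}: score {stark_score(eye, hair):.4f} -> {verdict}")
```

With that assumed cutoff, every row in both tables lands on the correct side of the gate.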
  40. 40. Working with data scientists: How to have better two-way conversations with your data science team and handle objections. ➔ Data cleanliness: Describing a relationship between two variables. ➔ Model fit: Understanding, to some measure, how valid and strong the result is.
  41. 41. “The data just isn’t clean enough to work with.” But it’s in a database. Isn’t that good enough?
  42. 42. Real world data is never as clean as one would hope. There is always the danger of missing fields, mistyped entries, previous wrong answers etc.
  43. 43. Solutions: ● Remove bad data columns and check the effect on the user ● Make assumptions about missing data ● Spend time tracking down better data ● Find alternative sources for data. Always check the effect that a change in the data has on the user experience.
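Two of these options can be sketched in a few lines: dropping incomplete records, and "making assumptions about missing data" by filling gaps with the average of the known values. The records and field names below are made up for illustration, and mean-filling is only one of many possible assumptions.

```python
# Hypothetical listings; None marks a missing field.
records = [
    {"size": 120.0, "price": 380.0},
    {"size": None, "price": 410.0},
    {"size": 95.0, "price": None},
    {"size": 150.0, "price": 470.0},
]


def drop_incomplete(rows):
    """Keep only the rows where every field is present."""
    return [row for row in rows if all(value is not None for value in row.values())]


def fill_with_mean(rows, field):
    """Replace missing values in one field with the mean of the known values."""
    known = [row[field] for row in rows if row[field] is not None]
    mean = sum(known) / len(known)
    return [{**row, field: mean if row[field] is None else row[field]} for row in rows]


complete_only = drop_incomplete(records)
sizes_filled = fill_with_mean(records, "size")

print(f"rows kept after dropping: {len(complete_only)} of {len(records)}")
print(f"filled size for row 2: {sizes_filled[1]['size']:.1f}")
```

Dropping loses half of this tiny data set, while filling keeps every row but bakes in a guess, which is why the slide asks you to check the effect on the user either way.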
  44. 44. “We can’t ship. The learning curve is broken.”
  45. 45. Going back to our housing problem will help us identify what might be going on.
  46. 46. A learning curve shows how the model performs on the training and test data sets as the size of the training set increases. Learning curves are used to understand the basic characteristics of the model fit.
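A rough sketch of how the points on a learning curve could be produced, assuming a straight-line model and mean squared residual as the performance measure. The housing data is randomly generated for illustration and the helper names are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = M * x + b; returns (M, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def mean_squared_residual(slope, intercept, points):
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points) / len(points)


rng = random.Random(1)


def make_point():
    """Made-up house: price is roughly linear in size, plus noise."""
    size = rng.uniform(50, 250)
    return size, 3.0 * size + 20 + rng.gauss(0, 30)


train_pool = [make_point() for _ in range(40)]
test_set = [make_point() for _ in range(20)]

# Refit on progressively larger slices of the training pool.
learning_curve = []
for n in (5, 10, 20, 40):
    subset = train_pool[:n]
    xs, ys = zip(*subset)
    slope, intercept = fit_line(xs, ys)
    learning_curve.append(
        (
            n,
            mean_squared_residual(slope, intercept, subset),
            mean_squared_residual(slope, intercept, test_set),
        )
    )

for n, train_error, test_error in learning_curve:
    print(f"train size {n:2d}: train error {train_error:8.1f}, test error {test_error:8.1f}")
```

Plotting the two error columns against the training size gives the learning curve; the gap between them and the level they settle at are what the data scientists are reading.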
  47. 47. A learning curve can show bias, meaning the test and training sets both give similar answers but the answers are incorrect. Data scientists can often fix this problem with more work.
  48. 48. The other issue is called variance, meaning the test and training set give different answers. This type of issue is the hardest for your data science team to deal with.
  49. 49. Some solutions to variance or underfit might be: ● Increasing the amount of data ● Figuring out the exact impact on users ● Increasing complexity of the model ○ Higher order model ○ More variables
  50. 50. “This is cool. Where can I learn more?”
  51. 51. Bibliography: For more information there are many great resources online that give overviews as well as in-depth info made for people who want to get into data science. ➔ Andrew Ng’s course Machine Learning on Coursera ➔ Coursera ◆ Machine Learning Foundations: A Case Study Approach ◆ A Crash Course in Data Science ➔ Kaggle: Hosts competitions and datasets. Also has a tutorial to walk through a machine learning example ➔ scikit-learn documentation: Documentation for the most popular machine learning library
  52. 52. Upcoming Courses: San Francisco. Weeknights: September 6th. Weekends: September 10th. Apply at www.productschool.com
  53. 53. www.productschool.com Upcoming Workshops Rsvp On Eventbrite August 3: From Building Products to Managing Them August 10: Coding For Non Coders August 17: Product Owners: How to Get Your Development Team to Love You August 24: PM Life at an Early Stage Startup
