Building Data Products

At the Federal Big Data Forum, Josh Wills explains what a data scientist is and shares how to build data products successfully.

  1. Building Data Products. Josh Wills, Senior Director of Data Science
  2. About Me
  3. What Do Data Scientists Do?
  4. What I Think I Do
  5. What Other People Think I Do
  6. What I Actually Do
  7. Data Science and Data Products
  8. Thinking About Data Products
  9. The Best Way To Find Insights
  10. Build A Team
  11. Measure Everything
  12. Solve the Right Problem
  13. Building Data Products with Hadoop
  14. Hadoop as a Platform for Data Products
  15. ETL, Data Science, and Machine Learning
  16. Changing the Unit of Analysis
  17. Machine Learning and You
  18. The Five Questions: 1. When should I use it? 2. What does the input look like? 3. What does the output look like? 4. How many parameters do I have to tune? 5. Why will it fail?
  19. 1. Collaborative Filtering
  20. Collaborative Filtering (cont.): 1. To see things that are hidden. 2. <user_id>,<item_id>,<weight> 3. <item1>,<item2>,<score> 4. The distance metric and the weight calculations. 5. If the input data is too sparse.
  21. Collaborative Filtering on Hadoop
  22. 2. K-Means Clustering
  23. K-Means Clustering (cont.): 1. To find anomalous events. 2. Vectors of normally distributed values. 3. Cluster centroids. 4. The choice(s) of K. 5. The points aren’t even remotely normally distributed.
  24. K-Means on Hadoop
  25. 3. Random Forests
  26. Random Forests (cont.): 1. To classify and predict. 2. A dependent variable and many independent variables. 3. Lots and lots of little trees. 4. The number of variables to consider at each level. 5. Too many independent variables.
  27. Random Forests on Hadoop: • R’s randomForest and rhadoop tools • Map: partition the input data among the reducers • Reduce: fit the random forests to each partition • Re-combine the resulting trees in the client
  28. The Art of Model Design
  29. Caution: Mind the Gap
  30. The Joy of Experiments
  31. Introduction to Data Science: Building Recommender Systems http://university.cloudera.com/
  32. Thank you! Josh Wills, Director of Data Science, Cloudera @josh_wills
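
Slide 20 describes collaborative filtering as taking <user_id>,<item_id>,<weight> triples in and producing <item1>,<item2>,<score> triples out, with the distance metric as the main thing to tune. The following is a minimal in-memory sketch of that input and output shape. It assumes cosine similarity as the metric, which the slides do not specify, and the function name `item_similarities` and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_similarities(ratings):
    """Turn (user_id, item_id, weight) triples into (item1, item2, score) triples.

    The score is the cosine similarity between the two items' weight vectors.
    Cosine is an assumed choice of metric, not one named in the slides.
    """
    by_user = defaultdict(dict)  # user_id -> {item_id: weight}
    for user, item, weight in ratings:
        by_user[user][item] = weight  # a repeated (user, item) pair keeps the last weight

    norms = defaultdict(float)  # item_id -> sum of squared weights
    dots = defaultdict(float)   # (item1, item2) -> dot product over shared users
    for items in by_user.values():
        for item, weight in items.items():
            norms[item] += weight * weight
        for a, b in combinations(sorted(items), 2):
            dots[(a, b)] += items[a] * items[b]

    # Only pairs that share at least one user get a score at all.
    return [
        (a, b, dot / sqrt(norms[a] * norms[b]))
        for (a, b), dot in sorted(dots.items())
        if norms[a] > 0 and norms[b] > 0
    ]


if __name__ == "__main__":
    sample = [
        ("u1", "A", 1.0), ("u1", "B", 1.0),
        ("u2", "A", 1.0), ("u2", "B", 1.0),
        ("u3", "A", 1.0), ("u3", "C", 1.0),
    ]
    for item1, item2, score in item_similarities(sample):
        print(f"{item1},{item2},{score:.3f}")
```

In the sample data no user has both B and C, so that pair never appears in the output. This is a small-scale view of the failure mode on the slide: when the input is too sparse, most item pairs have nothing to compare.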

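Slide 27 outlines a partition, fit, and recombine pattern for random forests: the map step splits the input among reducers, each reducer fits a forest on its partition, and the client merges the resulting trees. The sketch below mimics that flow in a single Python process. It is not the R randomForest or rhadoop API the slide refers to; plain functions stand in for the Hadoop map and reduce steps, bootstrap-sampled one-split trees (decision stumps) stand in for full trees, and every function name is hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, n_candidates, rng):
    """Fit a one-split tree on a bootstrap sample of rows [(features, label), ...].

    Only n_candidates randomly chosen features are examined, echoing the
    'number of variables to consider' parameter from slide 26.
    """
    sample = [rng.choice(rows) for _ in rows]
    overall = Counter(y for _, y in sample).most_common(1)[0][0]
    best = None
    for f in rng.sample(range(len(rows[0][0])), n_candidates):
        for threshold in sorted({x[f] for x, _ in sample}):
            left = Counter(y for x, y in sample if x[f] <= threshold)
            right = Counter(y for x, y in sample if x[f] > threshold)
            correct = max(left.values(), default=0) + max(right.values(), default=0)
            if best is None or correct > best[0]:
                left_label = left.most_common(1)[0][0] if left else overall
                right_label = right.most_common(1)[0][0] if right else overall
                best = (correct, f, threshold, left_label, right_label)
    return best[1:]  # (feature index, threshold, left label, right label)


def map_partition(rows, n_partitions):
    """Map step stand-in: deal rows round-robin to n_partitions 'reducers'."""
    parts = [[] for _ in range(n_partitions)]
    for i, row in enumerate(rows):
        parts[i % n_partitions].append(row)
    return parts


def reduce_fit(rows, n_trees, n_candidates, seed):
    """Reduce step stand-in: fit a small forest on one partition."""
    rng = random.Random(seed)
    return [fit_stump(rows, n_candidates, rng) for _ in range(n_trees)]


def train_forest(rows, n_partitions, n_trees, n_candidates):
    """Client step stand-in: concatenate the trees from every partition."""
    parts = map_partition(rows, n_partitions)
    return [
        tree
        for k, part in enumerate(parts)
        for tree in reduce_fit(part, n_trees, n_candidates, seed=k)
    ]


def predict(forest, x):
    """Classify x by majority vote over all trees in the combined forest."""
    votes = Counter(left if x[f] <= t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy data: the label is 1 when the first feature is at least 20.
    data = [((float(i), 2.0 * i), 1 if i >= 20 else 0) for i in range(40)]
    forest = train_forest(data, n_partitions=4, n_trees=5, n_candidates=1)
    print(len(forest), predict(forest, (3.0, 6.0)), predict(forest, (35.0, 70.0)))
```

Because each partition is fitted independently, the trees only ever see a fraction of the data. That is the trade-off this pattern makes in exchange for fitting in parallel.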