Building Data Products

In this talk at the Federal Big Data Forum, Josh Wills explains what a data scientist is and shares how to succeed at building data products.

Transcript of "Building Data Products"

  1. Building Data Products
     Josh Wills, Senior Director of Data Science
  2. About Me
  3. What Do Data Scientists Do?
  4. What I Think I Do
  5. What Other People Think I Do
  6. What I Actually Do
  7. Data Science and Data Products
  8. Thinking About Data Products
  9. The Best Way To Find Insights
  10. Build A Team
  11. Measure Everything
  12. Solve the Right Problem
  13. Building Data Products with Hadoop
  14. Hadoop as a Platform for Data Products
  15. ETL, Data Science, and Machine Learning
  16. Changing the Unit of Analysis
  17. Machine Learning and You
  18. The Five Questions
      1. When should I use it?
      2. What does the input look like?
      3. What does the output look like?
      4. How many parameters do I have to tune?
      5. Why will it fail?
  19. 1. Collaborative Filtering
  20. Collaborative Filtering (cont.)
      1. To see things that are hidden.
      2. <user_id>,<item_id>,<weight>
      3. <item1>,<item2>,<score>
      4. The distance metric and the weight calculations.
      5. If the input data is too sparse.
  21. Collaborative Filtering on Hadoop
  22. 2. K-Means Clustering
  23. K-Means Clustering (cont.)
      1. To find anomalous events.
      2. Vectors of normally distributed values.
      3. Cluster centroids.
      4. The choice(s) of K.
      5. The points aren’t even remotely normally distributed.
  24. K-Means on Hadoop
  25. 3. Random Forests
  26. Random Forests (cont.)
      1. To classify and predict.
      2. A dependent variable and many independent variables.
      3. Lots and lots of little trees.
      4. The number of variables to consider at each level.
      5. Too many independent variables.
  27. Random Forests on Hadoop
      • R’s randomForest and rhadoop tools
      • Map: partition the input data among the reducers
      • Reduce: fit the random forests to each partition
      • Re-combine the resulting trees in the client
  28. The Art of Model Design
  29. Caution: Mind the Gap
  30. The Joy of Experiments
  31. Introduction to Data Science: Building Recommender Systems
      http://university.cloudera.com/
  32. Thank you!
      Josh Wills, Director of Data Science, Cloudera
      @josh_wills
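
Slide 20 gives the input to collaborative filtering as <user_id>,<item_id>,<weight> lines and the output as <item1>,<item2>,<score> lines. A minimal sketch of that transformation follows. It assumes cosine similarity as the distance metric, since the slides do not say which metric is used. The function name item_similarities and the sample ratings are hypothetical and do not come from the talk.

```python
import math
from collections import defaultdict
from itertools import combinations


def item_similarities(lines):
    """Turn '<user_id>,<item_id>,<weight>' lines into (item1, item2, score) triples.

    Assumption: the score is the cosine similarity between the two items'
    user-weight vectors. The slides leave the metric as a tunable choice.
    """
    # Build one sparse vector per item: item_id -> {user_id: weight}
    vectors = defaultdict(dict)
    for line in lines:
        user_id, item_id, weight = line.strip().split(",")
        vectors[item_id][user_id] = float(weight)

    results = []
    for item1, item2 in combinations(sorted(vectors), 2):
        v1, v2 = vectors[item1], vectors[item2]
        shared_users = set(v1) & set(v2)
        if not shared_users:
            continue  # no user touched both items, so there is nothing to score
        norm1 = math.sqrt(sum(w * w for w in v1.values()))
        norm2 = math.sqrt(sum(w * w for w in v2.values()))
        if norm1 == 0 or norm2 == 0:
            continue  # all-zero weights carry no signal
        dot = sum(v1[u] * v2[u] for u in shared_users)
        results.append((item1, item2, dot / (norm1 * norm2)))
    return results


# Hypothetical sample input in the slide's <user_id>,<item_id>,<weight> format
ratings = [
    "alice,book,5",
    "alice,film,4",
    "bob,book,3",
    "bob,film,3",
    "carol,game,2",
]

for item1, item2, score in item_similarities(ratings):
    print(f"{item1},{item2},{score:.3f}")
```

Items with no users in common produce no output row. That is the sparse-input failure mode the fifth answer on slide 20 warns about.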