Building Data Products
 

At the Federal Big Data Forum, Josh Wills explains what a data scientist is and shares how to build data products successfully.

    Building Data Products: Presentation Transcript

    • Building Data Products
      Josh Wills, Senior Director of Data Science
    • About Me
    • What Do Data Scientists Do?
    • What I Think I Do
    • What Other People Think I Do
    • What I Actually Do
    • Data Science and Data Products
    • Thinking About Data Products
    • The Best Way To Find Insights
    • Build A Team
    • Measure Everything
    • Solve the Right Problem
    • Building Data Products with Hadoop
    • Hadoop as a Platform for Data Products
    • ETL, Data Science, and Machine Learning
    • Changing the Unit of Analysis
    • Machine Learning and You
    • The Five Questions
      1. When should I use it?
      2. What does the input look like?
      3. What does the output look like?
      4. How many parameters do I have to tune?
      5. Why will it fail?
    • 1. Collaborative Filtering
    • Collaborative Filtering (cont.)
      1. To see things that are hidden.
      2. <user_id>,<item_id>,<weight>
      3. <item1>,<item2>,<score>
      4. The distance metric and the weight calculations.
      5. If the input data is too sparse.
    • Collaborative Filtering on Hadoop
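
The input and output shapes named on the collaborative filtering slides can be illustrated with a small single-machine sketch. It assumes cosine similarity as the distance metric and uses the raw weights as given; the slides do not say which metric or weighting the talk used, and the function name `item_similarities` and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

def item_similarities(ratings):
    """ratings: iterable of (user_id, item_id, weight) triples.
    Returns (item1, item2, score) triples, scored by cosine similarity."""
    # Group weights by user so co-rated items can be paired up.
    by_user = defaultdict(dict)
    for user_id, item_id, weight in ratings:
        by_user[user_id][item_id] = weight

    dot = defaultdict(float)      # (item1, item2) -> sum of weight products
    norm_sq = defaultdict(float)  # item -> sum of squared weights
    for items in by_user.values():
        for item, w in items.items():
            norm_sq[item] += w * w
        for a, b in combinations(sorted(items), 2):
            dot[(a, b)] += items[a] * items[b]

    result = []
    for (a, b), d in sorted(dot.items()):
        denom = sqrt(norm_sq[a] * norm_sq[b])
        if denom > 0:  # skip items whose weights are all zero
            result.append((a, b, d / denom))
    return result

# Hypothetical sample data: three users, three items.
ratings = [
    ("u1", "A", 1.0), ("u1", "B", 1.0),
    ("u2", "A", 1.0), ("u2", "B", 1.0), ("u2", "C", 1.0),
    ("u3", "C", 1.0),
]
pairs = item_similarities(ratings)
for item1, item2, score in pairs:
    print(f"{item1},{item2},{score:.2f}")
```

Only item pairs that share at least one user get a score, which is one way to see the failure mode on the slide: with very sparse input, most pairs never co-occur and the output is nearly empty.
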
    • 2. K-Means Clustering
    • K-Means Clustering (cont.)
      1. To find anomalous events.
      2. Vectors of normally distributed values.
      3. Cluster centroids.
      4. The choice(s) of K.
      5. The points aren’t even remotely normally distributed.
    • K-Means on Hadoop
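
As a rough illustration of the K-means slides, the sketch below runs Lloyd's algorithm on a handful of points and then scores each point by its distance to the nearest centroid, so that large scores suggest anomalies. It is a single-process sketch, not the Hadoop job from the talk. Seeding the centroids with the first K points is a simplifying assumption (practical implementations usually use random or k-means++ initialization), and the names `kmeans` and `anomaly_scores` and the sample points are hypothetical.

```python
from math import dist  # requires Python 3.8+

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm. Assumption: the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:  # converged
            break
        centroids = updated
    return centroids

def anomaly_scores(points, centroids):
    """Distance from each point to its nearest centroid."""
    return [min(dist(p, c) for c in centroids) for p in points]

# Hypothetical data: two tight groups plus one point between them.
points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10), (5, 4)]
centroids = kmeans(points, k=2)
scores = anomaly_scores(points, centroids)
most_anomalous = points[scores.index(max(scores))]
print(centroids, most_anomalous)
```

The output is the list of cluster centroids, matching item 3 on the slide; the only parameter tuned here is K, matching item 4.
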
    • 3. Random Forests
    • Random Forests (cont.)
      1. To classify and predict.
      2. A dependent variable and many independent variables.
      3. Lots and lots of little trees.
      4. The number of variables to consider at each level.
      5. Too many independent variables.
    • Random Forests on Hadoop
      • R’s randomForest and rhadoop tools
      • Map: partition the input data among the reducers
      • Reduce: fit the random forests to each partition
      • Re-combine the resulting trees in the client
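
The map, reduce, and re-combine steps listed above can be sketched in one Python process. This is not the R randomForest and rhadoop setup from the slide: it assumes one-split decision stumps as stand-ins for full trees, a round-robin partitioner, and a majority vote in the client, and every function name and the sample data are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows, rng):
    """Fit a one-split tree on a bootstrap sample, using one random feature."""
    sample = [rng.choice(rows) for _ in rows]
    feature = rng.randrange(len(sample[0][0]))
    majority = Counter(y for _, y in sample).most_common(1)[0][0]
    best = (len(sample) + 1, None, majority, majority)
    for threshold in sorted({x[feature] for x, _ in sample}):
        left = Counter(y for x, y in sample if x[feature] <= threshold)
        right = Counter(y for x, y in sample if x[feature] > threshold)
        if not right:
            continue  # this threshold does not split the sample
        l, r = left.most_common(1)[0][0], right.most_common(1)[0][0]
        errors = len(sample) - left[l] - right[r]
        if errors < best[0]:
            best = (errors, threshold, l, r)
    _, threshold, l, r = best
    return (feature, threshold, l, r)

def map_partition(rows, n_partitions):
    """Map step: spread the input rows across the reducers (round-robin)."""
    partitions = [[] for _ in range(n_partitions)]
    for i, row in enumerate(rows):
        partitions[i % n_partitions].append(row)
    return partitions

def reduce_fit(partition, n_trees, rng):
    """Reduce step: fit a small forest on one partition."""
    return [fit_stump(partition, rng) for _ in range(n_trees)]

def predict(forest, x):
    """Client side: majority vote over the combined trees."""
    votes = Counter()
    for feature, threshold, l, r in forest:
        goes_left = threshold is None or x[feature] <= threshold
        votes[l if goes_left else r] += 1
    return votes.most_common(1)[0][0]

# Hypothetical data: (features, label) rows with two informative features.
rows = [((i, 2 * i), int(i >= 20)) for i in range(40)]
rng = random.Random(0)
forest = []
for partition in map_partition(rows, n_partitions=4):
    forest.extend(reduce_fit(partition, n_trees=5, rng=rng))  # re-combine
print(len(forest), predict(forest, (0, 0)), predict(forest, (39, 78)))
```

Because each reducer sees only its own partition, the trees are fit independently and combining them is just list concatenation, which is what makes this pattern fit a single map and reduce pass.
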
    • The Art of Model Design
    • Caution: Mind the Gap
    • The Joy of Experiments
    • Introduction to Data Science: Building Recommender Systems
      http://university.cloudera.com/
    • Thank you! Josh Wills, Director of Data Science, Cloudera @josh_wills